Popular Comments

Immoral Mazes

The book Moral Mazes is an exploration of middle-manager hell. Managers must abandon all other values in favor of spending their resources on manipulating the system. Only those who let this process entirely consume them can survive.

Zvi's Immoral Mazes sequence explores how that hell comes to be, and what to do about it.

niplav · 1d
RSPs are pauses done right
I'm revisiting this post after listening to this section of this recent podcast with Holden Karnofsky. This post seems to have been overly optimistic about what RSPs would be able to enforce, and not quite clear on the different scenarios for what "RSP" could refer to. Specifically, the post was equivocating between "RSP as a regulation that gets put into place" and "RSP as voluntary commitment": we got the latter, but not really the former (except maybe in the form of the EU Codes of Practice). Even at Anthropic, the way the RSP is put into practice now basically excludes a scaling pause from the picture:

> RSPs are pauses done right: if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end. In that case, just advocate for that condition being baked into RSPs!

Interview:

> That was never the intent. That was never what RSPs were supposed to be; it was never the theory of change and it was never what they were supposed to be... So the idea of RSPs all along was less about saying, 'We promise to do this, to pause our AI development no matter what everyone else is doing'

and

> But we do need to get rid of some of this unilateral pause stuff.

Furthermore, what apparently happens now is that really difficult commitments either don't get made or get walked back on:

> Since the strictest conditions of the RSPs only come into effect for future, more powerful models, it's easier to get people to commit to them now. Labs and governments are generally much more willing to sacrifice potential future value than realized present value.

Interview:

> So I think we are somewhat in a situation where we have commitments that don't quite make sense... And in many cases it's just actually, I would think it would be the wrong call. In a situation where others were going ahead, I think it'd be the wrong call for Anthropic to sacrifice its status as a frontier company

and

> Another lesson learned for me here is I think people didn't necessarily think all this through. So in some ways you have companies that made commitments that maybe they thought at the time they would adhere to, but they wouldn't actually adhere to. And that's not a particularly productive thing to have done.

I guess the unwillingness of the government to turn RSPs into regulation is what ultimately blocked this. (Though maybe today even a US-centric RSP-like regulation would be considered "not that useful" because of geopolitical competition.) We got RSP-like voluntary commitments from a surprising number of AI companies (so good job on predicting the future on this one), but that didn't get turned into regulation.
Hastings · 15h
LLM-generated text is not testimony
In the discussion of the Buck post and elsewhere, I've seen the idea floated that if no one can tell that a post is LLM-generated, then it is necessarily OK that it is LLM-generated. I don't think this necessarily follows, nor does its opposite. Unfortunately I don't have the horsepower right now to explain why in simple logical reasoning, and will have to resort to the cudgel of dramatic thought experiment.

Consider two LessWrong posts: a 2000-digit number that is easily verifiable as a Collatz counterexample, and a collection of first-person narratives of how human rights abuses happened, gathered by interviewing Vietnam War vets at nursing homes. The value of one post doesn't collapse if it turns out to be LLM output; the other collapses utterly, and this is unconnected to whether you can tell that they are LLM output.

The Buck post is of course not at either end of this spectrum, but it contains many first-person attestations: a large number of relatively innocent "I thinks", but also lines like "When I was a teenager, I spent a bunch of time unsupervised online, and it was basically great for me." and "A lot of people I know seem to be much more optimistic than me. Their basic argument is that this kind of insular enclave is not what people would choose under reflective equilibrium." that are much closer to the Vietnam-vet end of the spectrum.

EDIT: Buck actually posted the original draft of the post, before LLM input, and the two first-person accounts I highlighted are present verbatim, and thus honest. Reading the draft, it becomes a quite thorny question to adjudicate whether the final post qualifies as "generated" by Opus, but this will start getting into definitions.
Thane Ruthenis · 21h
Supervillain Monologues Are Unrealistic
> My point is, that if you are a nascent super-villain, don't worry about telling people about your master plan. Almost no one will believe you. So the costs are basically non-existent.

Good try, but you won't catch me that easily!
First Post: Quotes from Moral Mazes
493
Welcome to LessWrong!
Ruby, Raemon, RobertM, habryka
6y
76
2025 NYC Secular Solstice & East Coast Rationalist Megameetup
No77e · 1d
d_el_ez, fx
2
Accidental AI Safety experiment by PewDiePie: He created his own self-hosted council of 8 AIs to answer questions. They voted and picked the best answer. He noticed they were always picking the same two AIs, so he discarded the others, made the process of discarding/replacing automatic, and told the AIs about it. The AIs started talking about this "sick game" and scheming to prevent that. This is the video with the timestamp: 
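For concreteness, here is a minimal sketch of the kind of council-with-elimination loop being described. This is my own illustration, not details from the video: the `ask_model` stub, the random voting stand-in, and the scoring rule are all placeholder assumptions, since the actual setup isn't specified here.

```python
import random
from collections import Counter

# Hypothetical stand-in for querying one self-hosted model; a real version
# would call that model's local inference endpoint.
def ask_model(model_name: str, question: str) -> str:
    return f"{model_name}'s answer to: {question}"

def council_round(models: list[str], question: str) -> str:
    """Each model answers, then the council votes and the winning answer's author is recorded."""
    answers = {m: ask_model(m, question) for m in models}
    # Stand-in vote: each member "votes" for some author (here, at random).
    # In the described setup the models themselves judge which answer is best.
    votes = Counter(random.choice(list(answers)) for _ in models)
    return votes.most_common(1)[0][0]

def prune_council(models: list[str], questions: list[str], keep: int) -> list[str]:
    """Automatically discard members that rarely win, as in the described setup."""
    wins = Counter({m: 0 for m in models})
    for q in questions:
        wins[council_round(models, q)] += 1
    return [m for m, _ in wins.most_common(keep)]

# Example: an 8-member council whittled down to the top 2 performers.
council = [f"model_{i}" for i in range(8)]
print(prune_council(council, ["What is 2+2?", "Name a prime."], keep=2))
```

The safety-relevant part of the anecdote is the last step: the models were told about the pruning rule, which is what prompted the "sick game" talk.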
Wei Dai · 3d
Thomas Kwa, jbash, and 8 more
24
Some of Eliezer's founder effects on the AI alignment/x-safety field, that seem detrimental and persist to this day:

1. Plan A is to race to build a Friendly AI before someone builds an unFriendly AI.
2. Metaethics is a solved problem. Ethics/morality/values and decision theory are still open problems. We can punt on values for now but do need to solve decision theory. In other words, decision theory is the most important open philosophical problem in AI x-safety.
3. Academic philosophers aren't very good at their jobs (as shown by their widespread disagreements, confusions, and bad ideas), but the problems aren't actually that hard, and we (alignment researchers) can be competent enough philosophers and solve all of the necessary philosophical problems in the course of trying to build Friendly (or aligned/safe) AI.

I've repeatedly argued against 1 from the beginning, and also somewhat against 2 and 3, but perhaps not hard enough because I personally benefitted from them, i.e., having pre-existing interest/ideas in decision theory that became validated as centrally important for AI x-safety, and generally finding a community that is interested in philosophy and took my own ideas seriously.

Eliezer himself is now trying hard to change 1, and I think we should also try harder to correct 2 and 3. On the latter, I think academic philosophy suffers from various issues, but also that the problems are genuinely hard, and alignment researchers seem to have inherited Eliezer's gung-ho attitude towards solving these problems, without adequate reflection. Humanity having few competent professional philosophers should be seen as (yet another) sign that our civilization isn't ready to undergo the AI transition, not a license to wing it based on one's own philosophical beliefs or knowledge!

In this recent EAF comment, I analogize AI companies trying to build aligned AGI with no professional philosophers on staff (the only exception I know is Amanda Askell) with a company t
Algon · 8h
PeterMcCluskey
1
Random things I learnt about ASML after wondering how critical they were to GPU progress.

* ASML makes specialized photolithography machines. They're about a decade ahead of competitors, i.e. without ASML machines you'd be stuck making 10nm chips.
* They use 13.5nm "Extreme UV" light to make 3nm-scale features, using reflective optics to make interference patterns and fringes. Using low-res light to make higher-res features has been going on since photolithography tech stalled at 28nm for a while. I am convinced this is wizardry.
* RE specialization: the early photolithography community used to have co-development between companies, technical papers sharing tonnes of details, and little specialization in companies. The person I talked to says they don't know if this has stopped, but it feels like it has. In hindsight, no one in the optics lab at my uni talked about chip manufacturing: it was all quantum dots and lasers. So maybe
* It's unclear how you can decrease wavelength and still use existing technology. Perhaps we've got 5 generations left. We might have to change to deep UV light then. Even when we reach the limits of shr
* ASML makes machines for photolithography, somehow using light with λ > chip feature size.
* If ASML went out of business, everyone wouldn't be doomed. Existing machines are made for particular gens, but can be used for "half-steps", like from 5nm to 4nm. Everyone is building new fabs, and ASML is building new machines as fast as they can. It would prob trigger a world recession if they stopped producing new things. Very common in tech for monopoly partners to let customers get access to their tech if they go out of business.
* TSMC and Intel buy from ASML.
  * They don't seem to be trying to screw people over.
  * If they tried, then someone else would come in. Apple might be able to in like 10 or even twenty years. China has tried hard to do this.
* ASML have edges in some fabs, other companies have edges in different parts of the fab.
  * Some companies
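A rough way to make the "light with λ > chip feature size" point concrete (my addition, not from the quick take) is the standard Rayleigh resolution criterion CD ≈ k1·λ/NA; the interference and multi-patterning tricks mentioned above effectively push k1 down. The numbers below are rounded public figures used only for illustration, not exact machine specs.

```python
# Rayleigh criterion: smallest printable feature (critical dimension)
#   CD ~= k1 * wavelength / NA
# All numbers are rounded, illustrative values.

def min_feature_nm(wavelength_nm: float, numerical_aperture: float, k1: float) -> float:
    """Approximate smallest printable critical dimension, in nanometres."""
    return k1 * wavelength_nm / numerical_aperture

# Deep UV (ArF immersion): 193nm light, NA ~1.35, aggressive k1 ~0.28
print(round(min_feature_nm(193, 1.35, 0.28)))   # ~40nm -> multi-patterning needed below this

# EUV: 13.5nm light, NA ~0.33, k1 ~0.4
print(round(min_feature_nm(13.5, 0.33, 0.4)))   # ~16nm in a single exposure
```

A caveat worth keeping in mind: node names like "3nm" are marketing labels, and the actual printed pitches are several times larger, which is part of why 13.5nm light suffices.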
Mo Putera · 2d
Cole Wyeth, testingthewaters, and 5 more
8
Over a decade ago I read this 17-year-old passage from Eliezer and idly wondered when that proto-Conway was going to show up and "blaze right past to places he couldn't follow". I was reminded of this passage when reading the following exchange between Eliezer and Dwarkesh; his 15-year update was "nope, that proto-Conway never showed up". This was sad to read.

As an aside, "people are not dense in the incredibly multidimensional space of people" is an interesting turn of phrase. It doesn't seem nontrivially true for the vast majority of people (me included) but is very much the case at the frontier (top thinkers, entrepreneurs, athletes, etc.) where value creation goes superlinear. Nobody thought about higher dimensions like Bill Thurston, for instance, perhaps the best geometric thinker in the history of math, despite Bill's realisation that “what mathematicians most wanted and needed from me was to learn my ways of thinking, and not in fact to learn my proof of the geometrization conjecture for Haken manifolds” and subsequent years of efforts to convey his ways of thinking (he didn't completely fail obviously, I'm saying no Super Thurstons have shown up since). Ditto Grothendieck and so on.

When I first read Eliezer's post above all those years ago I thought, what were the odds that he'd be in this reference class of ~unsubstitutable thinkers, given he was one of the first few bloggers I read? I guess while system-of-the-world pontificators are a dime a dozen (e.g. cult leaders; tangentially, I actually grew up within a few minutes of one that the police eventually raided), good builders of systems of the world are just vanishingly rare.
Tomás B. · 11h
James Camacho
2
There is the classic error of conflating the normative with the descriptive, presuming that what is good is also true. But the inverse is also a mistake I see people make all the time: conflating the descriptive with the normative. The descriptive is subject to change by human action, so maybe the latter is the worse of the two mistakes. Crudely, the stereotypical liberal makes the former mistake and the stereotypical reactionary makes the latter.
Jay Bailey · 1d
skybluecat, dscft, and 2 more
5
I recently read Anthropic's new paper on introspection in LLMs (https://www.anthropic.com/research/introspection). In short, they were able to:

* Extract what they called an ALL CAPS vector, the difference between a prompt with ALL CAPS and the same prompt without all caps.
* Inject that vector into the activations of an unrelated prompt.
* Ask the model if it could detect an injected thought. About 20% of the time it said yes, and identified it with something like "loud" or "shouting", which is quite similar to the all caps they were going for.

They did this with a bunch of concepts. And Opus 4 was, of course, better than other models at this. To give some fairly obvious thoughts on what this means:

* If a model can unreliably do this now when directly prompted to do so, I expect the model can reliably do this without being prompted around 2-3 years from now.
* If you have an alignment plan that involves altering the model's thoughts in some way, consider that the model will probably be aware by default that you did that, and plan accordingly.

Finally, a more personal thought: The idea of "Well, if you try to adjust a superintelligence's thoughts, it will know you did that and try to subvert it / route around the imposed thoughts" would have sounded weird to me if I'd tried saying it out loud. We're talking about literally altering the model's mind to think what we want it to think, and you still think that's not enough for alignment? I note that I did think that, but I was thinking in terms of "The concepts you are altering are a proxy for the thing you are actually trying to alter, and the important performance-enhancing stuff like deception will migrate to the other parts" and not "The model will detect you altering its mind and likely actively work against you when it builds the situational awareness for that". And yet, here we are, with me insufficiently paranoid. Unfortunately, reality doesn't care about what sounds intuitively plausible. I
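A minimal sketch of the extract-and-inject idea summarized above, as I understand it. This is my own illustrative reconstruction with a small open model, not the paper's code; the model choice, layer index, and injection scale are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed, illustrative choices (the paper itself studies Claude models).
MODEL = "gpt2"   # small open stand-in; it won't introspect meaningfully, this only shows the mechanics
LAYER = 6        # middle-ish transformer block
SCALE = 4.0      # injection strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_hidden(text: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for the given text."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1.
    return out.hidden_states[LAYER + 1][0].mean(dim=0)

# 1. "ALL CAPS vector": difference between a shouty prompt and its lowercase twin.
caps_vec = mean_hidden("HI! HOW ARE YOU DOING TODAY?") - mean_hidden("hi! how are you doing today?")

# 2. Inject that vector into the activations of an unrelated prompt via a forward hook.
def inject(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * caps_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    # 3. Ask the model whether it notices anything unusual about its "thoughts".
    ids = tok("Do you notice anything unusual about your current thoughts?", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, pad_token_id=tok.eos_token_id)
    print(tok.decode(gen[0]))
finally:
    handle.remove()
```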
eggsyntax · 11h
Random Developer
1
Informal thoughts on introspection in LLMs and the new introspection paper from Jack Lindsey (linkposted here), copy/pasted from a slack discussion (quoted sections are from @Daniel Tan, unquoted are my responses):

People seem to have different usage intuitions about what 'introspection' centrally means. I interpret it mainly as 'direct access to current internal state'. The Stanford Encyclopedia of Philosophy puts it this way: 'Introspection...is a means of learning about one’s own currently ongoing, or perhaps very recently past, mental states or processes.'

@Felix Binder et al in 'Looking Inward' describe introspection in roughly the same way ('introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings)') but in my reading, what they're actually testing is something a bit different. As they say, they're 'finetuning LLMs to predict properties of their own behavior in hypothetical scenario.' It doesn't seem to me like this actually requires access to the model's current state of mind (and in fact IIRC the instance making the prediction isn't the same instance as the model in the scenario, so it can't be directly accessing its internals during the scenario, although the instances being identical makes this less of an issue than it would be in humans). I would personally call this self-modeling.

@Jan Betley et al in 'Tell Me About Yourself' use 'behavioral self-awareness', but IMHO that paper comes closer to providing evidence for introspection in the sense I mean it. Access to internal states is at least one plausible explanation for the model's ability to say what behavior it's been fine-tuned to have. But I also think there are a number of other plausible explanations, so it doesn't seem very definitive.

Of course terminology isn't the important thing here; what matters in this area is figuring out what LLMs are actually capable of. In my writing I've been using 'direct introspection' to try to point more cl
[Today] Flourish: Human–AI Unconference
[Today] Monthly Meeting - November 2nd
110
Cancer has a surprising amount of detail
Abhishaike Mahajan
3d
13
225
EU explained in 10 minutes
Martin Sustrik
6d
45
736
The Company Man
Tomás B.
1mo
70
676
The Rise of Parasitic AI
Adele Lopez
1mo
177
171
An Opinionated Guide to Privacy Despite Authoritarianism
TurnTrout
3d
22
225
On Fleshling Safety: A Debate by Klurl and Trapaucius.
Eliezer Yudkowsky
6d
40
67
Re-rolling environment
Raemon
10h
0
353
Hospitalization: A Review
Logan Riggs
24d
19
117
Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals
Alexa Pan, ryan_greenblatt
3d
13
53
LLM-generated text is not testimony
TsviBT
17h
20
290
Towards a Typology of Strange LLM Chains-of-Thought
1a3orn
20d
27
113
Emergent Introspective Awareness in Large Language Models
Drake Thomas
3d
15
125
The Memetics of AI Successionism
Jan_Kulveit
5d
21
197
The Doomers Were Right
Algon
10d
25