Whoops, yes, thanks, edited.
Curated. While I don't agree with every single positive claim advanced in the post (in particular, I'm less confident that chain-of-thought monitoring will survive to be a useful technique in the regime of transformative AI), this is an excellent distillation of the reasons for skepticism re: interpretability as a cure-all for identifying deceptive AIs. I also happen to think that those reasons generalize to many other agendas.
Separately, it's virtuous to publicly admit to changing one's mind, especially when the incentives are stacked the way ...
Hey Shannon, please read our policy on LLM writing before making future posts consisting almost entirely of LLM-written content.
Curated. To the extent that we want to raise the sanity waterline, or otherwise improve society's ability to converge on true beliefs, it's important to understand the weaknesses of existing infrastructure. Being unable to reliably translate a prediction market's pricing directly into implied odds of an outcome seems like a pretty substantial weakness. (Note that I'm not sure how much I believe the linked tweet; nonetheless I observe that the odds sure did seem mispriced and the provided explanation seems sufficient to cause some mispricings sometimes.) "Acceptance is the first step," and all that.
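For concreteness, the "direct translation" I have in mind is just the usual price-to-probability arithmetic - a minimal sketch, assuming a simple binary contract where the YES and NO prices can sum to slightly more than 1 (the overround); real markets add fees, spreads, and longshot biases on top of this, which is exactly where the naive conversion breaks down:

```python
# Minimal sketch, assuming a simple binary contract: convert prices to an
# implied probability (normalizing away the overround, i.e. YES + NO > 1)
# and then to odds. Real markets add fees, spreads, and longshot biases.
def implied_probability(yes_price: float, no_price: float) -> float:
    return yes_price / (yes_price + no_price)

def implied_odds(p: float) -> float:
    # Odds in favor: 0.64 -> ~1.78, i.e. roughly 1.8 : 1.
    return p / (1 - p)

p = implied_probability(yes_price=0.66, no_price=0.37)
print(round(p, 3), round(implied_odds(p), 2))  # 0.641 1.78
```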
@Dima (lain), please read our policy on LLM writing on LessWrong and hold off on submitting further posts until you've done that.
Also also, why are socialist-vibe blogposts so often relegated to "personal blogpost" while capitalist-vibe blogposts aren't? I mean, I get the automatic barrage of downvotes, but you'd think the mods would at least try to appear impartial.
Posts are categorized as frontpage / personal once or twice per day, and start out as personal by default. Your post hasn't been looked at yet. (The specific details of what object-level political takes a post has aren't an input to that decision. Whether a post is frontpaged or not is a function of its "timelessness" - i.e. whether we expect people will still find value in reading the post years later - and general interest to the LW userbase.)
Sorry, there was a temporary bug where we were returning mismatched reward indicators to the client. It's since been patched! I don't believe anybody actually rolled The Void during this period.
If you had some vague prompt like "write an essay about how the field of alignment is misguided" and then proofread the output, you've met the criteria as laid out.
No, such outputs will almost certainly fail these criteria (since they will by default be written with the typical LLM "style").
"10x engineers" are a thing, and if we assume they're high-agency people always looking to streamline and improve their workflows, we should expect them to be precisely the people who get a further 10x boost from LLMs. Have you observed any specific people suddenly becoming 10x more prolific?
In addition to the objection from Archimedes, another reason this is unlikely to be true is that 10x coders are often much more productive than other engineers because they've heavily optimized around solving for specific problems or skills that other engineers are bottlenecked by, and most of those optimizations don't readily admit of having an LLM suddenly inserted into the loop.
Not at the moment, but it is an obvious sort of thing to want.
Thanks for the heads up, we'll have this fixed shortly (just need to re-index all the wiki pages once).
Curated. This post does at least two things I find very valuable:
And so I think that this post both describes and advances the canonical "state of the argument" with respect to the Sharp Left Turn (and similar concerns). I hope that other people will also find it helpful in improving their understanding of e.g. objections to basic evolutionary analogies (and why those objections shouldn't make you very optimistic).
Yes:
My model is that Sam Altman regarded the EA world as a memetic threat, early on, and took actions to defuse that threat by paying lip service / taking openphil money / hiring prominent AI safety people for AI safety teams.
In the context of the thread, I took this to suggest that Sam Altman never had any genuine concern about x-risk from AI, or, at a minimum, that any such concern was dominated by the social maneuvering you're describing. That seems implausible to me given that he publicly expressed concern about x-risk from AI 10 months before OpenAI was publicly founded, and possibly several months before it was even conceived.
Sam Altman posted Machine intelligence, part 1[1] on February 25th, 2015. This is admittedly after the FLI conference in Puerto Rico, which is reportedly where Elon Musk was inspired to start OpenAI (though I can't find a reference substantiating his interaction with Demis as the specific trigger), but there is other reporting suggesting that OpenAI was only properly conceived later in the year, and Sam Altman wasn't at the FLI conference himself. (Also, it'd surprise me a bit if it took nearly a year, i.e. from Jan 2nd[2] to Dec 11th...
I think it's quite easy to read as condescending. Happy to hear that's not the case!
I hadn't downvoted this post, but I am not sure why OP is surprised, given that the first four paragraphs, rather than explaining what the post is about, instead celebrate tree murder and insult their (imagined) audience:
so that no references are needed but those any LW-rationalist is expected to have committed to memory by the time of their first Lighthaven cuddle puddle
I don't think much has changed since this comment. Maybe someone will make a new wiki page on the subject, though if it's not an admin I'd expect it to mostly be a collection of links to various posts/comments.
re: the table of contents, it's hidden by default but becomes visible if you hover your mouse over the left column on post pages.
I understand the motivation behind this, but there is little warning that this is how the forum works. There is no warning that trying to contribute in good faith isn't sufficient, and that you may still end up partially banned (rate-limited) if the mods decide you are more noise than signal. Instead, people invest a lot only to discover this when it's too late.
In addition to the New User Guide that gets DMed to every new user (and is also linked at the top of our About page), we:
Show this comment above the new post form to new users who haven't already had s
Apropos of nothing, I'm reminded of the "<antthinking>" tags originally observed in Sonnet 3.5's system prompt, and this section of Dario's recent essay (bolding mine):
...In 2024, the idea of using reinforcement learning (RL) to train models to generate chains of thought has become a new focus of scaling. Anthropic, DeepSeek, and many other companies (perhaps most notably OpenAI who released their o1-preview model in September) have found that this training greatly increases performance on certain select, objectively measurable tasks like math, coding c
When is the "efficient outcome-achieving hypothesis" false? More narrowly, under what conditions are people more likely to achieve a goal (or harder, better, faster, stronger) with fewer resources?
The timing of this quick take is of course motivated by recent discussion about deepseek-r1, but I've had similar thoughts in the past when observing arguments against e.g. hardware restrictions: that they'd motivate labs to switch to algorithmic work, which would speed up timelines (rather than just reducing the naive expected rate of slowdown). S...
We have automated backups, and should even those somehow find themselves compromised (which is a completely different concern from getting DDoSed), there are archive.org backups of a decent percentage of LW posts, which would be much easier to restore than paper copies.
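To illustrate the archive.org route, here's a hedged sketch (not anything we actually run) of checking for a snapshot via the public Wayback Machine availability endpoint; the post URL below is a placeholder, and a real restore pipeline would also have to parse the saved HTML:

```python
# Hedged sketch: checking whether archive.org has a snapshot of a post via the
# public Wayback Machine "availability" endpoint. The post URL below is a
# placeholder; a real restore pipeline would also have to parse the saved HTML.
import requests

def latest_snapshot(post_url: str):
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": post_url},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(latest_snapshot("https://www.lesswrong.com/posts/example-post"))  # placeholder URL
```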
I learned it elsewhere, but his LinkedIn confirms that he started at Anthropic sometime in January.
I know I'm late to the party, but I'm pretty confused by https://www.astralcodexten.com/p/its-still-easier-to-imagine-the-end (I haven't read the post it's responding to, but I can extrapolate). Surely the "we have a friendly singleton that isn't Just Following Orders from Your Local Democratically Elected Government or Your Local AGI Lab" scenario deserves some analysis...? Conditional on "not dying", that one seems like the most likely stable end state, in fact.
Lots of interesting questions in that situation! Like, money still ...
I was thinking the same thing. This post badly, badly clashes with the vibe of Less Wrong. I think you should delete it, and repost to a site in which catty takedowns are part of the vibe. Less Wrong is not the place for it.
I think this is a misread of LessWrong's "vibes" and would discourage other people from thinking of LessWrong as a place where such discussions should be avoided by default.
With the exception of the title, I think the post does a decent job at avoiding making it personal.
Well, that's unfortunate. That feature isn't super polished and isn't currently in the active development path, but I'll try to see if it's something obvious. (In the meantime, I'd recommend subscribing to fewer people, or seeing if the issue persists in Chrome. Other people on the team are subscribed to 100-200 people without obvious issues.)
FWIW, I don't think "scheming was very unlikely in the default course of events" is "decisively refuted" by our results. (Maybe depends a bit on how we operationalize scheming and "the default course of events", but for a relatively normal operationalization.)
Thank you for the nudge on operationalization; my initial wording was annoyingly sloppy, especially given that I myself have a more cognitivist slant on what I would find concerning re: "scheming". I've replaced "scheming" with "scheming behavior".
...It's somewhat sensitive to the exact objection the person came in with.
I'd like to internally allocate social credit to people who publicly updated after the recent Redwood/Anthropic result, after previously believing that scheming behavior was very unlikely in the default course of events (or a similar belief that was decisively refuted by those empirical results).
Does anyone have links to such public updates?
(Edit log: replaced "scheming" with "scheming behavior".)
FWIW, I don't think "scheming was very unlikely in the default course of events" is "decisively refuted" by our results. (Maybe depends a bit on how we operationalize scheming and "the default course of events", but for a relatively normal operationalization.)
It's somewhat sensitive to the exact objection the person came in with.
My guess is that most reasonable perspectives should update toward thinking scheming has at least a tiny chance of occurring (>2%), but I wouldn't say a view of <<2% was decisively refuted.
One reason to be pessimistic about the "goals" and/or "values" that future ASIs will have is that "we" have a very poor understanding of "goals" and "values" right now. Like, there isn't even widespread agreement that "goals" are a meaningful abstraction to use. Let's put aside the object-level question of whether this would even buy us anything in terms of safety, if it were true. The mere fact of such intractable disagreements about core philosophical questions, on which hinge substantial parts of various cases for and against doo...
I agree that in a spherical-cow world, where we know nothing about the historical arguments around corrigibility or who these particular researchers are, we wouldn't be able to make a particularly strong claim here. In practice I am quite comfortable taking Ryan at his word that a negative result would've been reported, especially given the track record of other researchers at Redwood.
...at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values d
I mean, yes, but I'm addressing a confusion that's already (mostly) conditioning on building on it.
This doesn't seem like it'd do much unless you ensured that there were training examples during RLAIF which you'd expect to cause that kind of behavior enough of the time that there'd be something to update against. (Which doesn't seem like it'd be that hard, though I think separately that approach seems kind of doomed - it's falling into a brittle whack-a-mole regime.)
Indeed, we should get everyone to make predictions about whether or not this change would be sufficient, and if not, what changes would be sufficient. My prediction would be that this change would not be sufficient but that it would help somewhat.
LessWrong doesn't have a centralized repository of site rules, but here are some posts that might be helpful:
https://www.lesswrong.com/posts/bGpRGnhparqXm5GL7/models-of-moderation
https://www.lesswrong.com/posts/kyDsgQGHoLkXz6vKL/lw-team-is-adjusting-moderation-policy
We do currently require content to be posted in English.
"It would make sense to pay that cost if necessary" makes more sense than "we should expect to pay that cost", thanks.
it sounds like you view it as a bad plan?
Basically, yes. I have a draft post outlining some of my objections to that sort of plan; hopefully it won't sit in my drafts as long as the last similar post did.
...(I could be off, but it sounds like either you expect solving AI philosophical competence to come pretty much hand in hand with solving intent alignment (because you see them as similar technical problems?), or you expect no
What do people mean when they talk about a "long reflection"? The original usages suggest flesh-humans literally sitting around and figuring out moral philosophy for hundreds, thousands, or even millions of years, before deciding to do anything that risks value lock-in, but (at least) two things about this don't make sense to me:
I tried to make a similar argument here, and I'm not sure it landed. I think the argument has since demonstrated even more predictive validity with e.g. the various attempts to build and restart nuclear power plants, directly motivated by nearby datacenter buildouts, on top of the obvious effects on chip production.
Should be fixed now.
Good catch - looks like that's from this revision, which was copied over from Arbital; some LaTeX didn't make it through. I'll see if it's trivial to fix.
The page isn't dead, Arbital pages just don't load sometimes (or take 15+ seconds).
I understand this post to be claiming (roughly speaking) that you assign >90% likelihood in some cases and ~50% in other cases that LLMs have internal subjective experiences of varying kinds. The evidence you present in each case is outputs generated by LLMs.
The referents of consciousness for which I understand you to be making claims re: internal subjective experiences are 1, 4, 6, 12, 13, and 14. I'm unsure about 5.
Do you have sources of evidence (even illegible) other than LLM outputs that updated you that much? Those seem like very...
The evidence you present in each case is outputs generated by LLMs.
The total evidence I have (and that everyone has) is more than behavioral. It includes
a) the transformer architecture, in particular the attention module,
b) the training corpus of human writing,
c) the means of execution (recursive calling upon its own outputs and history of QKV vector representations of outputs - see the sketch after this list),
d) as you say, the model's behavior, and
e) "artificial neuroscience" experiments on the model's activation patterns and weights, like mech interp research.
When I think about how...
My impression is that Yudkowsky has harmed public epistemics in his podcast appearances by saying things forcefully and with rather poor spoken communication skills for novice audiences.
I recommend reading the Youtube comments on his recorded podcasts, rather than e.g. Twitter commentary from people with a pre-existing adversarial stance to him (or AI risk questions writ large).
On one hand, I feel a bit skeptical that some dude outperformed approximately every other pollster and analyst by having a correct inside-view belief about how existing pollster were messing up, especially given that he won't share the surveys. On the other hand, this sort of result is straightforwardly predicted by Inadequate Equilibria, where an entire industry had the affordance to be arbitrarily deficient in what most people would think was their primary value-add, because they had no incentive to accuracy (skin in the game), and as soon as someo...
Norvid on Twitter made the apt point that we will need to see the actual private data before we can really judge. Not unusual for lucky people to backrationalize their luck as a sure win.
I'm pretty sure Ryan is rejecting the claim that the people hiring for the roles in question are worse-than-average at detecting illegible talent.
Depends on what you mean by "resume building", but I don't think this is true for "need to do a bunch of AI safety work for free" or similar. i.e. for technical research, many people that have gone through MATS and then been hired at or founded their own safety orgs have no prior experience doing anything that looks like AI safety research, and some don't even have much in the way of ML backgrounds. Many people switch directly out of industry careers into doing e.g. ops or software work that isn't technical research. Policy might seem a b...
(We switched back to shipping Calibri above Gill Sans Nova pending a fix for the horrible rendering on Windows, so if Ubuntu has Calibri, it'll have reverted back to the previous font.)
For the first, we have the Read History page. For the second, there are some recommendations underneath the comments section of each post, but they're not fully general. For the third - do you mean allowing authors on LessWrong to have paid subscribers?