Whoops, yes, thanks, edited.
Curated. While I don't agree with every single positive claim advanced in the post (in particular, I'm less confident that chain-of-thought monitoring will survive to be a useful technique in the regime of transformative AI), this is an excellent distillation of the reasons for skepticism re: interpretability as a cure-all for identifying deceptive AIs. I also happen to think that those reasons generalize to many other agendas.
Separately, it's virtuous to publicly admit to changing one's mind, especially when the incentives are stacked the way ...
Hey Shannon, please read our policy on LLM writing before making future posts consisting almost entirely of LLM-written content.
Curated. To the extent that we want to raise the sanity waterline, or otherwise improve society's ability to converge on true beliefs, it's important to understand the weaknesses of existing infrastructure. Being unable to reliably translate a prediction market's pricing directly into implied odds of an outcome seems like a pretty substantial weakness. (Note that I'm not sure how much I believe the linked tweet; nonetheless I observe that the odds sure did seem mispriced and the provided explanation seems sufficient to cause some mispricings sometimes.) "Acceptance is the first step," and all that.
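For concreteness, the "direct translation" I have in mind is just the usual price-to-probability arithmetic - a minimal sketch, assuming a simple binary contract where the YES and NO prices can sum to slightly more than 1 (the overround); real markets add fees, spreads, and longshot biases on top of this, which is exactly where the naive conversion breaks down:

```python
# Minimal sketch, assuming a simple binary contract: convert prices to an
# implied probability (normalizing away the overround, i.e. YES + NO > 1)
# and then to odds. Real markets add fees, spreads, and longshot biases.
def implied_probability(yes_price: float, no_price: float) -> float:
    return yes_price / (yes_price + no_price)

def implied_odds(p: float) -> float:
    # Odds in favor: 0.64 -> ~1.78, i.e. roughly 1.8 : 1.
    return p / (1 - p)

p = implied_probability(yes_price=0.66, no_price=0.37)
print(round(p, 3), round(implied_odds(p), 2))  # 0.641 1.78
```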
@Dima (lain), please read our policy on LLM writing on LessWrong and hold off on submitting further posts until you've done that.
Also also, why are socialist-vibe blogposts so often relegated to "personal blogpost" while capitalist-vibe blogposts aren't? I mean, I get the automatic barrage of downvotes, but you'd think the mods would at least try to appear impartial.
Posts are categorized as frontpage / personal once or twice per day, and start out as personal by default. Your post hasn't been looked at yet. (The specific details of what object-level political takes a post has aren't an input to that decision. Whether a post is frontpaged or not is a function of its "timelessness" - i.e. whether we expect people will still find value in reading the post years later - and general interest to the LW userbase.)
Sorry, there was a temporary bug where we were returning mismatched reward indicators to the client. It's since been patched! I don't believe anybody actually rolled The Void during this period.
If you had some vague prompt like "write an essay about how the field of alignment is misguided" and then proofread the output, you've met the criteria as laid out.
No, such outputs will almost certainly fail these criteria (since they will by default be written with the typical LLM "style").
"10x engineers" are a thing, and if we assume they're high-agency people always looking to streamline and improve their workflows, we should expect them to be precisely the people who get a further 10x boost from LLMs. Have you observed any specific people suddenly becoming 10x more prolific?
In addition to the objection from Archimedes, another reason this is unlikely to be true is that 10x coders are often much more productive than other engineers because they've heavily optimized around solving for specific problems or skills that other engineers are bottlenecked by, and most of those optimizations don't readily admit of having an LLM suddenly inserted into the loop.
Not at the moment, but it is an obvious sort of thing to want.
Thanks for the heads up, we'll have this fixed shortly (just need to re-index all the wiki pages once).
Curated. This post does at least two things I find very valuable:
And so I think that this post both describes and advances the canonical "state of the argument" with respect to the Sharp Left Turn (and similar concerns). I hope that other people will also find it helpful in improving their understanding of e.g. objections to basic evolutionary analogies (and why those objections shouldn't make you very optimistic).
Yes:
My model is that Sam Altman regarded the EA world as a memetic threat, early on, and took actions to defuse that threat by paying lip service / taking openphil money / hiring prominent AI safety people for AI safety teams.
In the context of the thread, I took this to suggest that Sam Altman never had any genuine concern about x-risk from AI, or, at a minimum, that any such concern was dominated by the social maneuvering you're describing. That seems implausible to me given that he publicly expressed concern about x-risk from AI 10 months before OpenAI was publicly founded, and possibly several months before it was even conceived.
Sam Altman posted Machine intelligence, part 1[1] on February 25th, 2015. This is admittedly after the FLI conference in Puerto Rico, which is reportedly where Elon Musk was inspired to start OpenAI (though I can't find a reference substantiating his interaction with Demis as the specific trigger), but there is other reporting suggesting that OpenAI was only properly conceived later in the year, and Sam Altman wasn't at the FLI conference himself. (Also, it'd surprise me a bit if it took nearly a year, i.e. from Jan 2nd[2] to Dec 11th...
I think it's quite easy to read as condescending. Happy to hear that's not the case!
I hadn't downvoted this post, but I am not sure why OP is surprised, given that the first four paragraphs, rather than explaining what the post is about, instead celebrate tree murder and insult their (imagined) audience:
so that no references are needed but those any LW-rationalist is expected to have committed to memory by the time of their first Lighthaven cuddle puddle
I don't think much has changed since this comment. Maybe someone will make a new wiki page on the subject, though if it's not an admin I'd expect it to mostly be a collection of links to various posts/comments.
re: the table of contents, it's hidden by default but becomes visible if you hover your mouse over the left column on post pages.
I understand the motivation behind this, but there is little warning that this is how the forum works. There is no warning that trying to contribute in good faith isn't sufficient, and that you may still end up partially banned (rate-limited) if the mods decide you are more noise than signal. Instead, people invest a lot only to discover this when it's too late.
In addition to the New User Guide that gets DMed to every new user (and is also linked at the top of our About page), we:
Show this comment above the new post form to new users who haven't already had s
Apropos of nothing, I'm reminded of the "<antthinking>" tags originally observed in Sonnet 3.5's system prompt, and this section of Dario's recent essay (bolding mine):
...In 2024, the idea of using reinforcement learning (RL) to train models to generate chains of thought has become a new focus of scaling. Anthropic, DeepSeek, and many other companies (perhaps most notably OpenAI who released their o1-preview model in September) have found that this training greatly increases performance on certain select, objectively measurable tasks like math, coding c
When is the "efficient outcome-achieving hypothesis" false? More narrowly, under what conditions are people more likely to achieve a goal (or harder, better, faster, stronger) with fewer resources?
The timing of this quick take is of course motivated by recent discussion about deepseek-r1, but I've had similar thoughts in the past when observing arguments against e.g. hardware restrictions: that they'd motivate labs to switch to algorithmic work, which would speed up timelines (rather than just reducing the naive expected rate of slowdown). S...
We have automated backups, and should even those somehow find themselves compromised (which is a completely different concern from getting DDoSed), there are archive.org backups of a decent percentage of LW posts, which would be much easier to restore than paper copies.
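To illustrate the archive.org route, here's a hedged sketch (not anything we actually run) of checking for a snapshot via the public Wayback Machine availability endpoint; the post URL below is a placeholder, and a real restore pipeline would also have to parse the saved HTML:

```python
# Hedged sketch: checking whether archive.org has a snapshot of a post via the
# public Wayback Machine "availability" endpoint. The post URL below is a
# placeholder; a real restore pipeline would also have to parse the saved HTML.
import requests

def latest_snapshot(post_url: str):
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": post_url},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(latest_snapshot("https://www.lesswrong.com/posts/example-post"))  # placeholder URL
```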
I learned it elsewhere, but his LinkedIn confirms that he started at Anthropic sometime in January.
I know I'm late to the party, but I'm pretty confused by https://www.astralcodexten.com/p/its-still-easier-to-imagine-the-end (I haven't read the post it's responding to, but I can extrapolate). Surely the "we have a friendly singleton that isn't Just Following Orders from Your Local Democratically Elected Government or Your Local AGI Lab" scenario deserves some analysis...? Conditional on "not dying", that one seems like the most likely stable end state, in fact.
Lots of interesting questions in that situation! Like, money still ...
I was thinking the same thing. This post badly, badly clashes with the vibe of Less Wrong. I think you should delete it, and repost to a site in which catty takedowns are part of the vibe. Less Wrong is not the place for it.
I think this is a misread of LessWrong's "vibes" and would discourage other people from thinking of LessWrong as a place where such discussions should be avoided by default.
With the exception of the title, I think the post does a decent job at avoiding making it personal.
Well, that's unfortunate. That feature isn't super polished and isn't currently in the active development path, but I'll try to see if it's something obvious. (In the meantime, I'd recommend subscribing to fewer people, or seeing if the issue persists in Chrome. Other people on the team are subscribed to 100-200 people without obvious issues.)
FWIW, I don't think "scheming was very unlikely in the default course of events" is "decisively refuted" by our results. (Maybe depends a bit on how we operationalize scheming and "the default course of events", but for a relatively normal operationalization.)
Thank you for the nudge on operationalization; my initial wording was annoyingly sloppy, especially given that I myself have a more cognitivist slant on what I would find concerning re: "scheming". I've replaced "scheming" with "scheming behavior".
...It's somewhat sensitive to the exact objection the person came in with.
I'd like to internally allocate social credit to people who publicly updated after the recent Redwood/Anthropic result, after previously believing that scheming behavior was very unlikely in the default course of events (or a similar belief that was decisively refuted by those empirical results).
Does anyone have links to such public updates?
(Edit log: replaced "scheming" with "scheming behavior".)
FWIW, I don't think "scheming was very unlikely in the default course of events" is "decisively refuted" by our results. (Maybe depends a bit on how we operationalize scheming and "the default course of events", but for a relatively normal operationalization.)
It's somewhat sensitive to the exact objection the person came in with.
My guess is that most reasonable perspectives should update toward thinking scheming has at least a tiny chance of occurring (>2%), but I wouldn't say a view of <<2% was decisively refuted.
One reason to be pessimistic about the "goals" and/or "values" that future ASIs will have is that "we" have a very poor understanding of "goals" and "values" right now. Like, there isn't even widespread agreement that "goals" are a meaningful abstraction to use. Let's put aside the object-level question of whether this would even buy us anything in terms of safety, if it were true. The mere fact of such intractable disagreements about core philosophical questions, on which hinge substantial parts of various cases for and against doo...
I agree that in a spherical-cow world, where we know nothing about the historical arguments around corrigibility or who these particular researchers are, we wouldn't be able to make a particularly strong claim here. In practice I am quite comfortable taking Ryan at his word that a negative result would've been reported, especially given the track record of other researchers at Redwood.
...at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values d
I mean, yes, but I'm addressing a confusion that's already (mostly) conditioning on building on it.
This doesn't seem like it'd do much unless you ensured that there were training examples during RLAIF which you'd expect to cause that kind of behavior enough of the time that there'd be something to update against. (Which doesn't seem like it'd be that hard, though I think separately that approach seems kind of doomed - it's falling into a brittle whack-a-mole regime.)
Indeed, we should get everyone to make predictions about whether or not this change would be sufficient, and if not, what changes would be sufficient. My prediction would be that this change would not be sufficient but that it would help somewhat.
LessWrong doesn't have a centralized repository of site rules, but here are some posts that might be helpful:
https://www.lesswrong.com/posts/bGpRGnhparqXm5GL7/models-of-moderation
https://www.lesswrong.com/posts/kyDsgQGHoLkXz6vKL/lw-team-is-adjusting-moderation-policy
We do currently require content to be posted in English.
"It would make sense to pay that cost if necessary" makes more sense than "we should expect to pay that cost", thanks.
it sounds like you view it as a bad plan?
Basically, yes. I have a draft post outlining some of my objections to that sort of plan; hopefully it won't sit in my drafts as long as the last similar post did.
...(I could be off, but it sounds like either you expect solving AI philosophical competence to come pretty much hand in hand with solving intent alignment (because you see them as similar technical problems?), or you expect no
What do people mean when they talk about a "long reflection"? The original usages suggest flesh-humans literally sitting around and figuring out moral philosophy for hundreds, thousands, or even millions of years, before deciding to do anything that risks value lock-in, but (at least) two things about this don't make sense to me:
I tried to make a similar argument here, and I'm not sure it landed. I think the argument has since demonstrated even more predictive validity with e.g. the various attempts to build and restart nuclear power plants, directly motivated by nearby datacenter buildouts, on top of the obvious effects on chip production.
Should be fixed now.
Good catch - looks like that's from this revision, which was copied over from Arbital; some LaTeX didn't make it through. I'll see if it's trivial to fix.
The page isn't dead, Arbital pages just don't load sometimes (or take 15+ seconds).
I understand this post to be claiming (roughly speaking) that you assign >90% likelihood in some cases and ~50% in other cases that LLMs have internal subjective experiences of varying kinds. The evidence you present in each case is outputs generated by LLMs.
The referents of consciousness for which I understand you to be making claims re: internal subjective experiences are 1, 4, 6, 12, 13, and 14. I'm unsure about 5.
Do you have sources of evidence (even illegible) other than LLM outputs that updated you that much? Those seem like very...
The evidence you present in each case is outputs generated by LLMs.
The total evidence I have (and that everyone has) is more than behavioral. It includes
a) the transformer architecture, in particular the attention module,
b) the training corpus of human writing,
c) the means of execution (recursive calling upon its own outputs and history of QKV vector representations of outputs - see the sketch after this list),
d) as you say, the model's behavior, and
e) "artificial neuroscience" experiments on the model's activation patterns and weights, like mech interp research.
When I think about how...
My impression is that Yudkowsky has harmed public epistemics in his podcast appearances by saying things forcefully and with rather poor spoken communication skills for novice audiences.
I recommend reading the Youtube comments on his recorded podcasts, rather than e.g. Twitter commentary from people with a pre-existing adversarial stance to him (or AI risk questions writ large).
On one hand, I feel a bit skeptical that some dude outperformed approximately every other pollster and analyst by having a correct inside-view belief about how existing pollster were messing up, especially given that he won't share the surveys. On the other hand, this sort of result is straightforwardly predicted by Inadequate Equilibria, where an entire industry had the affordance to be arbitrarily deficient in what most people would think was their primary value-add, because they had no incentive to accuracy (skin in the game), and as soon as someo...
Norvid on Twitter made the apt point that we will need to see the actual private data before we can really judge. Not unusual for lucky people to backrationalize their luck as a sure win.
I'm pretty sure Ryan is rejecting the claim that the people hiring for the roles in question are worse-than-average at detecting illegible talent.
Depends on what you mean by "resume building", but I don't think this is true for "need to do a bunch of AI safety work for free" or similar. i.e. for technical research, many people that have gone through MATS and then been hired at or founded their own safety orgs have no prior experience doing anything that looks like AI safety research, and some don't even have much in the way of ML backgrounds. Many people switch directly out of industry careers into doing e.g. ops or software work that isn't technical research. Policy might seem a b...
(We switched back to shipping Calibri above Gill Sans Nova pending a fix for the horrible rendering on Windows, so if Ubuntu has Calibri, it'll have reverted back to the previous font.)
For the first, we have the Read History page. For the second, there are some recommendations underneath the comments section of each post, but they're not fully general. For the third - do you mean allowing authors on LessWrong to have paid subscribers?