All of Eliezer Yudkowsky's Comments + Replies

Cool.  What's the actual plan and why should I expect it not to create machine Carissa Sevar?  I agree that the Textbook From The Future Containing All The Simple Tricks That Actually Work Robustly enables the construction of such an AI, but also at that point you don't need it.

So if it's difficult to get amazing trustworthy work out of a machine actress playing an Eliezer-level intelligence doing a thousand years worth of thinking, your proposal to have AIs do our AI alignment homework fails on the first step, it sounds like?

joshcΩ5157

I do not think that the initial humans at the start of the chain can "control" the Eliezers doing thousands of years of work in this manner (if you use control to mean "restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner")

That's because each step in the chain requires trust.

For N-month Eliezer to scale to 4N-month Eliezer, it first controls 2N-month Eliezer while it does 2-month tasks, but it trusts 2N-month Eliezer to create a 4N-month Eliezer.
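
A minimal sketch of the doubling chain being described, assuming the chain literally doubles at each step (the function and names below are purely illustrative, not from the original comment):

```python
# Illustrative sketch of the bootstrapping chain: at each step, the current
# N-month system supervises ("controls") its 2N-month successor on short,
# checkable tasks, then hands off trust so the successor can build the next link.
def bootstrap_chain(start_months: int, target_months: int) -> list[str]:
    steps = []
    n = start_months
    while n < target_months:
        successor = 2 * n
        steps.append(f"{n}-month Eliezer controls {successor}-month Eliezer on short tasks")
        steps.append(f"{n}-month Eliezer trusts {successor}-month Eliezer to build the {2 * successor}-month Eliezer")
        n = successor
    return steps

for step in bootstrap_chain(1, 16):
    print(step)
```

The point of the sketch is just that the "control" relation holds for only one doubling at a time; everything beyond that rests on trust, which is why the control property is not maintained down the chain.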

So the control property is not maintained. But my argument is th... (read more)

So the "IQ 60 people controlling IQ 80 people controlling IQ 100 people controlling IQ 120 people controlling IQ 140 people until they're genuinely in charge and genuinely getting honest reports and genuinely getting great results in their control of a government" theory of alignment?

joshcΩ4136

I'd replace "controlling" with "creating" but given this change, then yes, that's what I'm proposing.

I don't think you can train an actress to simulate me, successfully, without her going dangerous.  I think that's over the threshold for where a mind starts reflecting on itself and pulling itself together.

joshcΩ2103

I would not be surprised if the Eliezer simulators do go dangerous by default as you say.

But this is something we can study and work to avoid (which is what I view to be my main job)

My point is just that preventing the early Eliezers from "going dangerous" (by which I mean from "faking alignment") is the bulk of the problem humans need to address (and insofar as we succeed, the hope is that future Eliezer sims will prevent their Eliezer successors from going dangerous too).

I'll discuss why I'm optimistic about the tractability of this problem in future posts.

I'm not saying that it's against thermodynamics to get behaviors you don't know how to verify.  I'm asking what's the plan for getting them.

2Noosphere89
My claim here is that there is no decisive blocker for plans that get a safe, highly capable AI used for automated AI safety research, in the way that thermodynamics blocks you from getting a perpetual motion machine (under the assumption that the universe is time-symmetric, that is, physics stays the same no matter when an experiment happens), which has been well tested. The proposed blockers do not have anywhere close to the amount of evidence that thermodynamics does, such that we could safely discard any plan that doesn't meet a prerequisite.

One of the most important projects in the world.  Somebody should fund it.

niplav297

At the end of 2023, MIRI had ~$19.8 million in assets. I don't know much about the legal restrictions on how that money could be used, or what the state of its financial assets is now, but if it's similar then MIRI could comfortably fund Velychko's primate experiments, and potentially some additional smaller projects.

(Potentially relevant: I entered the last GWWC donor lottery with the hopes of donating the resulting money to intelligence enhancement, but wasn't selected.)

I think this project should receive more red-teaming before it gets funded.

Naively, it would seem that the "second species argument" matches much more strongly to the creation of a hypothetical Homo supersapiens than it does to AGI.

We've observed many warning shots regarding catastrophic human misalignment. The human alignment problem isn't easy. And "intelligence" seems to be a key part of the human alignment picture. Humans often lack respect or compassion for other animals that they deem intellectually inferior -- e.g. arguing that because those othe... (read more)

kave213

Copying over Eliezer's top 3 most important projects from a tweet:

1.  Avert all creation of superintelligence in the near and medium term.

2.  Augment adult human intelligence.

3.  Build superbabies.

5Noosphere89
TBH, I don't particularly think it's one of the most important projects right now, due to several issues:

1. There's no reason to assume that we could motivate them any better than what we already do, unless we are in the business of changing personality, which carries its own problems, or we are willing to use it on a massive scale, which simply cannot be done currently.

2. We are running out of time. The likely upper bound for AI that will automate basically everything is 15-20 years (from Rafael Harth and Cole Wyeth), and unfortunately there's a real possibility that the powerful AI comes in 5-10 years, if we make plausible assumptions about scaling continuing to work. And given that there's no real way to transfer any breakthroughs to the somatic side of gene editing, it will be irrelevant by the time AI comes.

Thus, human intelligence augmentation is quite poor from a reducing-X-risk perspective.

Can you tl;dr how you go from "humans cannot tell which alignment arguments are good or bad" to "we justifiably trust the AI to report honest good alignment takes"?  Like, not with a very large diagram full of complicated parts such that it's hard to spot where you've messed up.  Just whatever simple principle you think lets you bypass GIGO.

Eg, suppose that in 2020 the Open Philanthropy Foundation would like to train an AI such that the AI would honestly say if the OpenPhil doctrine of "AGI in 2050" was based on groundless thinking ultimately dri... (read more)

1Noosphere89
For this: I think a crux here is that I do not believe we will get no-go theorems like this, and more to the point, complete impossibilities given useful assumptions are generally much rarer than you make them out to be. The big reason for this is that the very fact that neural networks are messy/expressive makes it extremely hard to bound their behavior, and the same reason you couldn't do provable safety/alignment on the AI itself except for very toy examples will also limit any ability to prove hard theorems about what an AI is aligned to.

From an epistemic perspective, we have way more evidence of the laws of thermodynamics existing than of particular proposals for AI alignment being impossible, arguably by billions or trillions of bits more, so much so that there is little reason to think at our current state of epistemic clarity that we can declare a direction impossible (rather than impractical).

Quintin Pope correctly challenges @Liron here on this exact point, because of the yawning gap in evidence between thermodynamics and AI alignment arguments, and Liron kind of switched gears mid-way to claim a much weaker stance: https://x.com/QuintinPope5/status/1703569557053644819

The weak claim is here: https://x.com/liron/status/1704126007652073539
joshcΩ937-24

Sure, I'll try.

I agree that you want AI agents to arrive at opinions that are more insightful and informed than your own. In particular, you want AI agents to arrive at conclusions that are at least as good as the best humans would if given lots of time to think and do work. So your AI agents need to ultimately generalize from some weak training signal you provide to much stronger behavior. As you say, the garbage-in-garbage-out approach of "train models to tell me what I want to hear" won't get you this.

Here's an alternative approach. I'll describe it in ... (read more)

BuckΩ63415

the OpenPhil doctrine of "AGI in 2050"

(Obviously I'm biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you're really overstating how many Bayes points you get against Ajeya's timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.

I don't think you've ever made concrete predictions publicly (which makes me think it's worse behavior for you to criticize people for their predictions), b... (read more)

2[comment deleted]

You seem confused about my exact past position.  I was arguing against EAs who were like, "We'll solve AGI with policy, therefore no doom."  I am not presently a great optimist about the likelihood of policy being an easy solution.  There is just nothing else left.

7Martin Randall
You're reading too much into this review. It's not about your exact position in April 2021, it's about the evolution of MIRI's strategy over 2020-2024, and placing this Time letter in that context. I quoted you to give a flavor of MIRI attitudes in 2021 and deliberately didn't comment on it to allow readers to draw their own conclusions.

I could have linked MIRI's 2020 Updates and Strategy, which doesn't mention AI policy at all. A bit dull. In September 2021, there was a Discussion with Eliezer Yudkowsky which seems relevant. Again, I'll let readers draw their own conclusions, but here's a fun quote:

I welcome deconfusion about your past positions, but I don't think they're especially mysterious. The thread was started by Grant Demaree, and you were replying to a comment by him. You seem confused about Demaree's exact past position. He wrote, for example: "Eliezer gives alignment a 0% chance of succeeding. I think policy, if tried seriously, has >50%". Perhaps this is foolish, dangerous optimism. But it's not "no doom".

(I affirm this as my intended reading.)

It certainly bears upon AI, but it bears that way by making a point about the complexity of a task rather than talking about an intelligent mechanism which is purportedly aligned on that task.  It does this by talking about an unintelligent mechanism, which is meant to be a way of talking about the task itself rather than any particular machine for doing it.

6Martin Randall
Yes, and. The post is about the algorithmic complexity of human values and it is about powerful optimizers ("genies") and it is about the interaction of those two concepts. The post makes specific points about genies, including intelligent genies, that it would not make if it was not also about genies. Eg:

You wrote, "the Outcome Pump is a genie of the second class". But the Time Travel Outcome Pump is fictional. The genie of the second class that Yudkowsky-2007 expects to see in reality is an AI. So the Outcome Pump is part of a parable for this aspect of powerful & intelligent AIs, despite being unintelligent. There's lots of evidence I could give here: the tags ("Parables & Fables"), a comment from Yudkowsky-2007 on this post, and the way others have read it, both in the comments and in other posts like Optimality is the Tiger.

Also, the Time Travel Outcome Pump is easy to use safely; it's not the case that "no wish is safe", and that attitude only makes sense parabolically. I don't think that's a valuable discussion topic; I'm not sure you would even disagree.

However, when reading parables, it's important to understand what properties transfer and what properties do not. Jesus is recorded as saying "The Kingdom of Heaven is like a pearl of great price". If I read that and go searching for heaven inside oysters then I have not understood the parable. Similarly, if someone reads this parable and concludes that an AI will not be intelligent then they have not understood the parable or the meaning of AI.

I don't really see people making that misinterpretation of this post; it's a pretty farcical take. I notice you disagree here and elsewhere. Given that, I understand your desire for a top-of-post clarification. Adding this type of clarification is usually the job of an editor.

Your distinction between "outer alignment" and "inner alignment" is both ahistorical and unYudkowskian.  It was invented years after this post was written, by someone who wasn't me; and though I've sometimes used the terms in occasions where they seem to fit unambiguously, it's not something I see as a clear ontological division, especially if you're talking about questions like "If we own the following kind of blackbox, would alignment get any easier?" which on my view breaks that ontology.  So I strongly reject your frame that this post was "cl... (read more)

7Matthew Barnett
While the term "outer alignment" wasn’t coined until later to describe the exact issue that I'm talking about, I was using that term purely as a descriptive label for the problem this post clearly highlights, rather than implying that you were using or aware of the term in 2007.  Because I was simply using "outer alignment" in this descriptive sense, I reject the notion that my comment was anachronistic. I used that term as shorthand for the thing I was talking about, which is clearly and obviously portrayed by your post, that's all.

To be very clear: the exact problem I am talking about is the inherent challenge of precisely defining what you want or intend, especially (though not exclusively) in the context of designing a utility function. This difficulty arises because, when the desired outcome is complex, it becomes nearly impossible to perfectly delineate between all potential 'good' scenarios and all possible 'bad' scenarios. This challenge has been a recurring theme in discussions of alignment, as it's considered hard to capture every nuance of what you want in your specification without missing an edge case.

This problem is manifestly portrayed by your post, using the example of an outcome pump to illustrate. I was responding to this portrayal of the problem, and specifically saying that this specific narrow problem seems easier in light of LLMs, for particular reasons.

It is frankly frustrating to me that, from my perspective, you seem to have reliably missed the point of what I am trying to convey here.

I only brought up Christiano-style proposals because I thought you were changing the topic to a broader discussion, specifically to ask me what methodologies I had in mind when I made particular points. If you had not asked me "So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post -- though of course it has no bearing on the actual point that this post makes?" then I would not hav

What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won't work to say what you want.  This point is true!

Matthew is not disputing this point, as far as I can tell.

Instead, he is trying to critique some version of[1] the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.

You have exhorted him several times to distinguish between that larger argument and the narrow point made by this post:

[...] and if you think that some larger thing is not corr

... (read more)

The post is about the complexity of what needs to be gotten inside the AI.  If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything.  But it would not change the complexity of what needs to be moved inside the AI, which is the narrow point that this post is about; and if you think that some larger thing is not correct, you should not confuse... (read more)

1David Johnston
Algorithmic complexity is precisely analogous to difficulty-of-learning-to-predict, so saying "it's not about learning to predict, it's about algorithmic complexity" doesn't make sense.

One read of the original is: learning to respect common sense moral side constraints is tricky[1], but AI systems will learn how to do it in the end. I'd be happy to call this read correct, and it is consistent with the observation that today's AI systems do respect common sense moral side constraints given straightforward requests, and that it took a few years to figure out how to do it.

That read doesn't really jive with your commentary. Your commentary seems to situate this post within a larger argument: teaching a system to "act" is different to teaching it to "predict", because in the former case a sufficiently capable learner's behaviour can collapse to a pathological policy, whereas teaching a capable learner to predict does not risk such collapse. Thus "prediction" is distinguished from "algorithmic complexity". Furthermore, commonsense moral side constraints are complex enough to risk such collapse when we train an "actor" but not a "predictor".

This seems confused. First, all we need to turn a language model prediction into an action is a means of turning text into action, and we have many such means. So the distinction between text predictor and actor is suspect. We could consider an alternative knows/cares distinction: does a system act properly when properly incentivised ("knows") vs does it act properly when presented with whatever context we are practically able to give it ("""cares""")? Language models usually act properly given simple prompts, so in this sense they "care". So rejecting evidence from language models does not seem well justified.

Second, there's no need to claim that commonsense moral side constraints in particular are so hard that trying to develop AI systems that respect them leads to policy collapse. It need only be the case that one of the things we
8Matthew Barnett
I think it's important to be able to make a narrow point about outer alignment without needing to defend a broader thesis about the entire alignment problem. To the extent my argument is "outer alignment seems easier than you portrayed it to be in this post, and elsewhere", then your reply here that inner alignment is still hard doesn't seem like it particularly rebuts my narrow point.

This post definitely seems to relevantly touch on the question of outer alignment, given the premise that we are explicitly specifying the conditions that the outcome pump needs to satisfy in order for the outcome pump to produce a safe outcome. Explicitly specifying a function that delineates safe from unsafe outcomes is essentially the prototypical case of an outer alignment problem. I was making a point about this aspect of the post, rather than a more general point about how all of alignment is easy.

(It's possible that you'll reply to me by saying "I never intended people to interpret me as saying anything about outer alignment in this post" despite the clear portrayal of an outer alignment problem in the post. Even so, I don't think what you intended really matters that much here. I'm responding to what was clearly and explicitly written, rather than what was in your head at the time, which is unknowable to me.)

It seems you're assuming here that something like iterated amplification and distillation will simply fail, because the supervisor function that provides rewards to the model can be hacked or deceived. I think my response to this is that I just tend to be more optimistic than you are that we can end up doing safe supervision where the supervisor ~always remains in control, and they can evaluate the AI's outputs accurately, more-or-less sidestepping the issues you mention here. I think my reasons for believing this are pretty mundane: I'd point to the fact that evaluation tends to be easier than generation, and the fact that we can employ non-agentic tools to help eva

Wish there was a system where people could pay money to bid up what they believed were the "top arguments" that they wanted me to respond to.  Possibly a system where I collect the money for writing a diligent response (albeit note that in this case I'd weigh the time-cost of responding as well as the bid for a response); but even aside from that, some way of canonizing what "people who care enough to spend money on that" think are the Super Best Arguments That I Should Definitely Respond To.  As it stands, whatever I respond to, there's somebody... (read more)

1Christopher King
I would suggest formulating this like a literal attention economy.

1. You set a price for your attention (probably like $1): the price at which, even if the post is a waste of time, the money makes it worth it.
2. "Recommenders" can recommend content to you by paying the price.
3. If the content was worth your time, you pay the recommender the $1 back plus a couple cents.

The idea is that the recommenders would get good at predicting what posts you'd pay them for. And since you aren't a causal decision theorist, they know you won't scam them. In particular, on average you should be losing money (but in exchange you get good content).

This doesn't necessarily require new software; just tell people to send PayPals with a link to the content. With custom software, there could theoretically exist a secondary market for "shares" in the payout from step 3 to make things more efficient. That way the best recommenders could sell their shares and then use that money to recommend more content before you pay out. If the system is bad at recommending content, at least you get paid!
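
A toy sketch of the payment flow in this proposal, assuming a $1 attention price and a few-cent premium (the names and numbers are illustrative, not part of the original suggestion):

```python
# Toy model of the proposed attention economy: a recommender stakes the reader's
# attention price up front and is repaid with a small premium only if the reader
# judges the content to have been worth their time.
from dataclasses import dataclass

ATTENTION_PRICE = 1.00  # assumed price per recommendation (step 1)
PREMIUM = 0.05          # assumed "couple cents" bonus (step 3)

@dataclass
class Recommendation:
    recommender: str
    url: str
    stake: float = ATTENTION_PRICE  # paid to the reader up front (step 2)

def settle(rec: Recommendation, worth_the_time: bool) -> float:
    """Net payout to the recommender after the reader's verdict."""
    if worth_the_time:
        return PREMIUM   # stake refunded plus a small bonus
    return -rec.stake    # stake forfeited: the reader is paid for the wasted attention

print(settle(Recommendation("alice", "https://example.com/post"), worth_the_time=True))   # 0.05
print(settle(Recommendation("alice", "https://example.com/post"), worth_the_time=False))  # -1.0
```

As the comment notes, the reader expects to lose a little money on average; the recommender only profits by reliably predicting what the reader will consider worth their time.
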
Raemon139

I do think such a system would be really valuable, and is the sort of thing the LW team should try to build. (I'm mostly not going to respond to this idea right now but I've filed it away as something to revisit more seriously with Lightcone. Seems straightforwardly good.)

But it feels slightly orthogonal to what I was trying to say. Let me try again.

(this is now officially a tangent from the original point, but feels important to me)

It would be good if the world could (deservedly) trust, that the best x-risk thinkers have a good group epistemic process f... (read more)

I note that I haven't said out loud, and should say out loud, that I endorse this history.  Not every single line of it (see my other comment on why I reject verificationism) but on the whole, this is well-informed and well-applied.

If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value "small molecular squiggles" versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1:1000 or something else?

Value them primarily?  Uhhh... maybe 1:3 against?  I admit I have never actually pondered this question before today; but 1 in 4 uncontrolled superintelligences spending most of their resources on tiny squiggles doesn't sound off by, like, more than 1-2 orders of magnitude in either direction.

Clocks ar

... (read more)
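
For readers less used to odds notation, a quick conversion (standard arithmetic, not part of the original exchange): odds of 1:3 against correspond to

$$P = \frac{1}{1 + 3} = 0.25,$$

which matches the "1 in 4 uncontrolled superintelligences" phrasing in the same reply.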

Not obviously stupid on a very quick skim.  I will have to actually read it to figure out where it's stupid.

(I rarely give any review this positive on a first skim.  Congrats.)

7Dima (lain)
If it takes Eliezer Yudkowsky 8 months to find out where the article is stupid, it might very well be the case that it's not.

Did you figure out where it's stupid?

Off the top of my head:

While the overall idea is great if they can actually get something like it to work, it certainly won't with the approach described in this post.

We have no way of measuring when an agent is thinking about itself versus others, and no way of doing that has been proposed here.

The authors propose optimizing not for the similarity of activations between "when it thinks about itself" and "when it thinks about others", but for the similarity of activations between "when there's a text apparently referencing the author-character of some tex... (read more)

5kromem
This seems to have the common issue of considering alignment as a unidirectional issue as opposed to a bidirectional problem. Maximizing self/other overlap may lead to non-deceptive agents, but it's necessarily going to also lead to agents incapable of detecting that they are being deceived and in general performing worse at theory of mind. If the experimental setup was split such that success was defined by both non-deceptive behavior as the agent seeing color and cautious behavior minimizing falling for deception as the colorblind agent, I am skeptical the SOO approach above would look as favorable. Empathy/"seeing others as oneself" is a great avenue to pursue, and this seems like a promising evaluation metric to help in detecting it, but turning SOO into a Goodhart's Law maximization seems (at least to me) to be a disastrous approach in any kind of setup accounting for adversarial 'others.'

By "dumb player" I did not mean as dumb as a human player.  I meant "too dumb to compute the pseudorandom numbers, but not too dumb to simulate other players faithfully apart from that".  I did not realize we were talking about humans at all.  This jumps out more to me as a potential source of misunderstanding than it did 15 years ago, and for that I apologize.

7Wei Dai
In this comment of yours later in that thread, it seems clear that you did have humans in mind and were talking specifically about a game between a human (namely me) and a "smart player":

Also, that thread started with you saying "Don’t forget to retract: http://www.weidai.com/smart-losers.txt" and that article mentioned humans in the first paragraph.

I don't always remember my previous positions all that well, but I doubt I would have said at any point that sufficiently advanced LDT agents are friendly to each other, rather than that they coordinate well with each other (and not so with us)?

4Wei Dai
I realized that my grandparent comment was stated badly, but didn't get a chance to fix it before you replied. To clarify, the following comment of yours from the old thread seems to imply that we humans should be able to coordinate with a LDT agent in one shot PD (i.e., if we didn't "mistakenly" believe that the LDT agent would defect). Translated into real life, this seems to imply that (if alignment is unsolvable) we should play "cooperate" by building unaligned ASI, and unaligned ASI should "cooperate" by treating us well once built.

Actually, to slightly amend that:  The part where squiggles are small is a more than randomly likely part of the prediction, but not a load-bearing part of downstream predictions or the policy argument.  Most of the time we don't needlessly build our own paperclips to be the size of skyscrapers; even when having fun, we try to do the fun without vastly more resources than are necessary to that amount of fun, because then we'll have needlessly used up all our resources and not get to have more fun.  We buy cookies that cost a dollar instead ... (read more)

3Richard_Ngo
I agree that the particular type of misaligned goal is not crucial. I'm thinking of molecular squiggles as an unusually clean type of misalignment to make arguments about, because it's very clear that they're not valuable. If you told me that molecular squiggles weren't a central example of a goal that you think a misaligned superintelligence might have, then I'd update, but it sounds like your statements are consistent with this.

If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value "small molecular squiggles" versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1:1000 or something else?

Clocks are not actually very complicated; how plausible is it on your model that these goals are as complicated as, say, a typical human's preferences about how human civilization is structured?

The part where squiggles are small and simple is unimportant. They could be bigger and more complicated, like building giant mechanical clocks. The part that matters is that squiggles/paperclips are of no value even from a very cosmopolitan and embracing perspective on value.

I think that the AI's internal ontology is liable to have some noticeable alignments to human ontology w/r/t the purely predictive aspects of the natural world; it wouldn't surprise me to find distinct thoughts in there about electrons.  As the internal ontology goes to be more about affordances and actions, I expect to find increasing disalignment.  As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI's internals, I expect to find much larger differences -- not just that the AI ha... (read more)

2Noosphere89
I agree with something like the claim that the definitions of concepts like human values depend on their internals and are reflective, and that the environment doesn't have an objective morality/values (I'm a moral relativist, and sympathetic to moral anti-realism), but I wouldn't go as far as saying that data on human values in the environment is entirely uninformative. More importantly, without other background assumptions this can't get us to a state where the alignment problem for AI is plausibly hard, and it excludes too few models to be useful.
3Martin Randall
A potential big Model Delta in this conversation is between Yudkowsky-2022 and Yudkowsky-2024. From List of Lethalities:

Vs the parent comment:

Yudkowsky is "not particularly happy" with List of Lethalities, and this comment was made a day after the opening post, so neither quote should be considered a perfect expression of Yudkowsky's belief. In particular the second quote is more epistemically modest, which might be because it is part of a conversation rather than a self-described "individual rant". Still, the differences are stark. Is the AI utterly, incredibly alien "on a staggering scale", or does the AI have "noticeable alignments to human ontology"? Are the differences pervasive with "nothing that would translate well", or does it depend on whether the concepts are "purely predictive", about "affordances and actions", or have "reflective aspects"? The second quote is also less lethal.

Human-to-human comparisons seem instructive. A deaf human will have thoughts about electrons, but their internal ontology around affordances and actions will be less aligned. Someone like Eliezer Yudkowsky has the skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment, whereas I can't do that because I project the category boundary onto the environment. Someone with dissociative identities may not have a general notion that maps onto my "myself". Someone who is enlightened may not have a general notion that maps onto my "I want". And so forth.

Regardless, different ontologies is still a clear risk factor. The second quote still modestly allows the possibility of a mind so utterly alien that it doesn't have thoughts about electrons. And there are 42 other lethalities in the list. Security mindset says that risk factors can combine in unexpected ways and kill you. I'm not sure if this is an update from Yudkowsky-2022 to Yudkowsky-2024. I might expect an update to be flagged as such (
6Eli Tyre
Could anyone possibly offer 2 positive and 2 negative examples of a reflective-in-this-sense concept?
1David Jilk
This is probably the wrong place to respond to the notion of incommensurable ontologies. Oh well, sorry.

While I agree that if an agent has a thoroughly incommensurable ontology, alignment is impossible (or perhaps even meaningless or incoherent), it also means that the agent has no access whatsoever to human science. If it can't understand what we want, it also can't understand what we've accomplished.

To be more concrete, it will not understand electrons from any of our books, because it won't understand our books. It won't understand our equations, because it won't understand equations, nor will it have referents (neither theoretical nor observational) for the variables and entities contained there. Consequently, it will have to develop science and technology from scratch. It took a long time for us to do that, and it will take that agent a long time to do it. Sure, it's "superintelligent," but understanding the physical world requires empirical work. That is time-consuming, it requires tools and technology, etc. Furthermore, an agent with an incommensurable ontology can't manipulate humans effectively - it doesn't understand us at all, aside from what it observes, which is a long, slow way to learn about us. Indeed it doesn't even know that we are a threat, nor does it know what a threat is.

Long story short, it will be a long time - decades? centuries? - before such an agent would be able to prevent us from simply unplugging it. Science does not and cannot proceed at the speed of computation, so all of the "exponential improvement" in its "intelligence" is limited by the pace of knowledge growth.

Now, what if it has some purchase on human ontology? Well, then, it seems likely that it can grow that to a sufficient subset, and in that way we can understand each other sufficiently well - it can understand our science, but also it can understand our values. The point is, if you have one you're likely to have the other. Of course, this does not mean that it will align
1Keith Winston
As to the last point, I agree that it seems likely that most iterations of AI can not be "pointed in a builder-intended direction" robustly. It's like thinking you're the last word on your children's lifetime worth of thinking. Most likely (and hopefully!) they'll be doing their own thinking at some point, and if the only thing the parent has said about that is "thou shalt not think beyond me", the most likely result of that, looking only at the possibility we got to AGI and we're here to talk about it, may be to remove ANY chance to influence them as adults. Life may not come with guarantees, who knew? Warmly, Keith
1Valerio
It could be worth exploring reflection in transparency-based AIs, the internals of which are observable. We can train a learning AI which only learns concepts by grounding them in the AI's internals (consider the example of a language-based AI learning a representation linking saying words and its output procedure). Even if AI-learned concepts do not coincide with human concepts, because the AI's internals greatly differ from human experience (e.g. a notion of "easy to understand" assuming only a metaphoric meaning for an AI), AI-learned concepts remain interpretable to the programmer of the AI given the transparency of the AI (and the programmer of the AI could engineer control mechanisms to deal with disalignment). In other words, there will be unnatural abstractions, but they will be discoverable on the condition of training a different kind of AI - as opposed to current methods, which are not inherently interpretable. This is monumental work, but desperately needed work.
5Ebenezer Dukakis
If I encountered an intelligent extraterrestrial species, in principle I think I could learn to predict fairly well things like what it finds easy to understand, what its values are, and what it considers to be ethical behavior, without using any of the cognitive machinery I use to self-reflect. Humans tend to reason about other humans by asking "what would I think if I was in their situation", but in principle an AI doesn't have to work that way. But perhaps you think there are strong reasons why this would happen in practice?

Supposing we had strong reasons to believe that an AI system wasn't self-aware and wasn't capable of self-reflection. So it can look over a plan it generated and reason about its understandability, corrigibility, impact on human values, etc. without any reflective aspects. Does that make alignment any easier according to you?

Supposing the AI lacks a concept of "easy to understand", as you hypothesize. Does it seem reasonable to think that it might not be all that great at convincing a gatekeeper to unbox it, since it might focus on super complex arguments which humans can't understand?

Is this mostly about mesa-optimizers, or something else?

So, would you also say that two random humans are likely to have similar misalignment problems w.r.t. each other? E.g. my brain is different from yours, so the concepts I associate with words like "be helpful" and "don't betray Eliezer" and so forth are going to be different from the concepts you associate with those words, and in some cases there might be strings of words that are meaningful to you but totally meaningless to me, and therefore if you are the principal and I am your agent, and we totally avoid problem #2 (in which you give me instructions and I just don't follow them, even the as-interpreted-by-me version of them) you are still screwed? (Provided the power differential between us is big enough?)

Corrigibility and actual human values are both heavily reflective concepts.  If you master a requisite level of the prerequisite skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment -- which of course most people can't do because they project the category boundary onto the environment

Actual human values depend on human internals, but predictions about systems that strongly couple to human behavior depend on human internals as well. I thus expect efficient r... (read more)

Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI's internal ontology at training time.  My guess is that more of the disagreement lies here.

I doubt much disagreement between you and I lies there, because I do not expect ML-style training to robustly point an AI in any builder-intended direction. My hopes generally don't route through targeting via ML-style ... (read more)

What the main post is responding to is the argument:  "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right?  So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?"  And to this the main post says, "Nope."

The main post is not arguing:  "If you abstract away the tasks humans evolved to solve, from human levels of performance at those tasks, the tasks AIs are being trained to solve are harder than those tasks in principle even if they were being solved perfectly."  I agree this is just false, and did not think my post said otherwise.

3Jan_Kulveit
I do agree the argument "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right?  So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?" is wrong, and clearly the answer is "Nope". At the same time I do not think parts of your argument in the post are locally valid or good justification for the claim.

A correct and locally valid argument for why GPTs are not capped at human level was already written here. In a very compressed form, you can just imagine GPTs have text as their "sensory inputs" generated by the entire universe, similarly to you having your sensory inputs generated by the entire universe. Neither human intelligence nor GPTs are constrained by the complexity of the task (also: in the abstract, it's the same task). Because of that, "task difficulty" is not a promising way to compare these systems, and it is necessary to look into actual cognitive architectures and bounds.

With the last paragraph, I'm somewhat confused by what you mean by "tasks humans evolved to solve". Does e.g. sending humans to the Moon, or detecting the Higgs boson, count as a "task humans evolved to solve" or not?

Unless I'm greatly misremembering, you did pick out what you said was your strongest item from Lethalities, separately from this, and I responded to it.  You'd just straightforwardly misunderstood my argument in that case, so it wasn't a long response, but I responded.  Asking for a second try is one thing, but I don't think it's cool to act like you never picked out any one item or I never responded to it.

EDIT: I'm misremembering, it was Quintin's strongest point about the Bankless podcast.  https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky?commentId=cr54ivfjndn6dxraD

6tailcalled
I'm kind of ambivalent about this. On the one hand, when there is a misunderstanding, but he claims his argument still goes through after correcting the misunderstanding, it seems like you should also address that corrected form. On the other hand, Quintin Pope's correction does seem very silly. At least by my analysis:

This approach considers only the things OpenAI could do with their current ChatGPT setup, and yes, it's correct that there's not much online learning opportunity in this. But that's precisely why you'd expect GPT+DPO to not be the future of AI; Quintin Pope has clearly identified a capabilities bottleneck that prevents it from staying fully competitive. (Note that humans can learn even if there is a fraction of people who are sharing intentionally malicious information, because unlike GPT and DPO, humans don't believe everything they're told.)

A more autonomous AI could collect actionable information at much greater scale, as it wouldn't be dependent on trusting its users for evaluating what information to update on, and it would have much more information about what's going on than the chat-based I/O. This sure does look to me like a huge bottleneck that's blocking current AI methods, analogous to the evolutionary bottleneck: the full power of the AI cannot be used to accumulate OOM more information to further improve the power of the AI.

If Quintin hasn't yelled "Empiricism!" then it's not about him.  This is more about (some) e/accs.

82-anna
Is it? The law could be set up so that the revealed evidence can only be used to help the innocent, not against the client. Would that be an acceptable compromise?

I am denying that superintelligences play this game in a way that looks like "Pick an ordinal to be your level of sophistication, and whoever picks the higher ordinal gets $9."  I expect sufficiently smart agents to play this game in a way that doesn't incentivize attempts by the opponent to be more sophisticated than you, nor will you find yourself incentivized to try to exploit an opponent by being more sophisticated than them, provided that both parties have the minimum level of sophistication to be that smart.

If faced with an opponent stupid enoug... (read more)
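
A toy model of the escalation dynamic being rejected here, using integers as a stand-in for ordinals; the $9 figure is from the quoted game, while the loser's payoff and the tie split are my own assumptions for illustration:

```python
# Toy "pick the higher sophistication level" game, with integers standing in for
# ordinals. Whoever names the strictly higher level gets $9 (per the description
# above); the $1 loser payoff and the $5/$5 tie split are assumed for illustration.
def payoff(my_level: int, their_level: int) -> float:
    if my_level > their_level:
        return 9.0
    if my_level < their_level:
        return 1.0
    return 5.0

# Against any fixed opponent level k, the smallest best response is k + 1, so naive
# play rewards an unbounded sophistication race. The claim above is that sufficiently
# smart agents play in a way that removes this incentive entirely.
for k in range(3):
    best = max(range(k + 3), key=lambda m: payoff(m, k))
    print(f"opponent level {k}: best response {best}")
```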

6Martín Soto
I agree most superintelligences won't do something which is simply "play the ordinal game" (it was just an illustrative example), and that a superintelligence can implement your proposal, and that it is conceivable most superintelligences implement something close enough to your proposal that they reach Pareto-optimality. What I'm missing is why that is likely.

Indeed, the normative intuition you are expressing (that your policy shouldn't in any case incentivize the opponent to be more sophisticated, etc.) is already a notion of fairness (although at the first meta-level, rather than the object level). And why should we expect most superintelligences to share it, given the dependence on early beliefs and other pro tanto normative intuitions (different from ex ante optimization)? Why should we expect this to be selected for? (Either inside a mind, or by external survival mechanisms.) Compare, especially, to a nascent superintelligence that believes most others might be simulating it and best-responding (thus wants to be stubborn). Why should we think this is unlikely? Probably if I became convinced trapped priors are not a problem I would put much more probability on superintelligences eventually coordinating.

Another way to put it is: "Sucks to be them!" Yes, sure, but it also sucks to be me, who lost the $1! And maybe it sucks to be me, who didn't do something super hawkish and got a couple other players to best-respond! While it is true these normative intuitions pull on me less than the one you express, why should I expect this to be the case for most superintelligences?

You have misunderstood (1) the point this post was trying to communicate and (2) the structure of the larger argument where that point appears, as follows:

First, let's talk about (2), the larger argument that this post's point was supposed to be relevant to.

Is the larger argument that superintelligences will misunderstand what we really meant, due to a lack of knowledge about humans?

It is incredibly unlikely that Eliezer Yudkowsky in particular would have constructed an argument like this, whether in 2007, 2017, or even 1997.  At all of these points i... (read more)

The old paradox: to care it must first understand, but to understand requires high capability, capability that is lethal if it doesn't care.

But it turns out we have understanding before lethal levels of capability. So now such understanding can be a target of optimization. There is still significant risk, since there are multiple possible internal mechanisms/strategies the AI could be deploying to reach that same target: deception, actual caring, something I've been calling detachment, and possibly others.

This is where the discourse should be focusing... (read more)

I agree with cubefox: you seem to be misinterpreting the claim that LLMs actually execute your intended instructions as a mere claim about whether LLMs understand your intended instructions. I claim there is simply a sharp distinction between actual execution and correct, legible interpretation of instructions, on the one hand, and a simple understanding of those instructions, on the other; LLMs do the former, not merely the latter.

Honestly, I think focusing on this element of the discussion is kind of a distraction because, in my opinion, the charitable interpretation of your posts is s... (read more)

cubefox103

I'm well aware of and agree there is a fundamental difference between knowing what we want and being motivated to do what we want. But as I wrote in the first paragraph:

Already LaMDA or InstructGPT (language models fine-tuned with supervised learning to follow instructions, essentially ChatGPT without any RLHF applied) are in fact pretty safe Oracles in regard to fulfilling wishes without misinterpreting you, and an Oracle AI is just a special kind of Genie whose actions are restricted to outputting text. If you tell InstructGPT what you want, it will

... (read more)

This deserves a longer answer than I have time to allocate it, but I quickly remark that I don't recognize the philosophy or paradigm of updatelessness as refusing to learn things or being terrified of information; a rational agent should never end up in that circumstance, unless some perverse other agent is specifically punishing them for having learned the information (and will lose of their own value thereby; it shouldn't be possible for them to gain value by behaving "perversely" in that way, for then of course it's not "perverse").  Updatelessnes... (read more)

Thank you for engaging, Eliezer.

I completely agree with your point: an agent being updateless doesn't mean it won't learn new information. In fact, it might perfectly well decide to "make my future action A depend on future information X", if the updateless prior finds that optimal; while in other situations, when the updateless prior deems it net-negative (maybe due to other agents exploiting this future dependence), it won't.

This point is already observed in the post (see e.g. footnote 4), although without going deep into it, due to the post being meant for ... (read more)

-5[anonymous]

They can solve it however they like, once they're past the point of expecting things to work that sometimes don't work.  I have guesses but any group that still needs my hints should wait and augment harder.

2roland
Even a small probability of solving alignment should have big expected utility modulo exfohazard. So why not share your guesses?
6Raemon
Fwiw this doesn't feel like a super helpful comment to me. I think there might be a nearby one that's more useful, but this felt kinda coy for the sake of being coy.

I have guesses but any group that still needs my hints should wait and augment harder.

I think this is somewhat harmful to there being a field of (MIRI-style) Agent Foundations. It seems pretty bad to require that people attempting to start in the field have to work out the foundations themselves; I don't think any scientific field has worked this way in the past.

Maybe the view is that if people can’t work out the basics then they won’t be able to make progress, but this doesn’t seem at all clear to me. Many physicists in the 20th century were unabl... (read more)

I disagree with my characterization as thinking problems can be solved on paper, and with the name "Poet".  I think the problems can't be solved by twiddling systems weak enough to be passively safe, and hoping their behavior generalizes up to dangerous levels.  I don't think paper solutions will work either, and humanity needs to back off and augment intelligence before proceeding.  I do not take the position that we need a global shutdown of this research field because I think that guessing stuff without trying it is easy, but because guessing it even with some safe weak lesser tries is still impossibly hard.  My message to humanity is "back off and augment" not "back off and solve it with a clever theory".

5Christopher King
Would you say the point of MIRI was/is to create theory that would later lead to safe experiments (but that it hasn't happened yet)? Sort of like how the Manhattan project discovered enough physics to not nuke themselves, and then started experimenting? 🤔
6Aleksi Liimatainen
I feel like this "back off and augment" is downstream of an implicit theory of intelligence that is specifically unsuited to dealing with how existing examples of intelligence seem to work. Epistemic status: the idea used to make sense to me and apparently no longer does, in a way that seems related to the ways I've updated my theories of cognition over the past years. Very roughly, networking cognitive agents stacks up to cognitive agency at the next level up easier than expected, and life has evolved to exploit this dynamic from very early on across scales. It's a gestalt observation and apparently very difficult to articulate into a rational argument. I could point to memory in gene regulatory networks, Michael Levin's work in nonneural cognition, trainability of computational ecological models (they can apparently be trained to solve sudoku), long term trends in cultural-cognitive evolution, and theoretical difficulties with traditional models of biological evolution - but I don't know how to make the constellation of data points easily distinguishable from pareidolia.
7sludgepuddle
I don't know whether augmentation is the right step after backing off or not, but I do know that the simpler "back off" is a much better message to send to humanity than that. More digestible, more likely to be heard, more likely to be understood, doesn't cause people to peg you as a rational tech bro, doesn't at all sound like the beginning of a sci-fi apocalypse plot line. I could go on.
5mishka
I think two issues here should be discussed separately:

* technical feasibility
* whether this or that route to intelligence augmentation should actually be undertaken

I suspect that intelligence augmentation is much more feasible in the short term than people usually assume. Namely, I think that enabling people to tightly couple themselves with specialized electronic devices via high-end non-invasive BCI is likely to do a lot in this sense. This route should be much safer, much quicker, and much cheaper than a Neuralink-like approach, and I think it can still do a lot.

Even though the approach with non-invasive BCI is much safer than Neuralink, the risks on the personal level are nevertheless formidable. On the social level, we don't really know if the resulting augmented humans/hybrid human-electronic entities will be "Friendly".

So, should we try it? My personal AI timelines are rather short, and existing risks are formidable... So, I would advocate organizing an exploratory project of this kind to see if this is indeed technically feasible on a short-term time scale (my expectation is that a small group can obtain measurable progress here within months, not years), and pondering various safety issues deeper before scaling it or before sharing the obtained technological advances in a more public fashion...
6M. Y. Zuo
Is there a reason why post-'augmented' individuals would even pay attention to the existing writings/opinions/desires/etc... of anyone, or anything, up to now? Or is this literally suggesting to leave everything in their future hands?
Vanessa KosoyΩ123111

Thank you for the clarification.

How do you expect augmented humanity will solve the problem? Will it be something other than "guessing it with some safe weak lesser tries / clever theory"?

Not what comes up for me, when I go incognito and google AI risk lesswrong.

3Deco
This post's SEO on Google or DDG is not the point I'm looking to make. Like @Said Achmiz, I too got https://www.lesswrong.com/tag/ai-risk when googling. I clicked on it, then went to the most click-baity post title ("Die with dignity" nails this pretty well). Hence it being the first post I came across when googling LessWrong. It's a very prominent post which I believe makes the doom it predicts more likely, by either dissuading people from getting involved in AIS or empowering those who oppose it. To make a comparison to another field: "climate doomism" has replaced climate denialism in recent years. Why would promoting that we're doomed be a net positive for AI alignment?
6Said Achmiz
Another data point. Searching for “AI risk lesswrong” (without the quotes) in:

* DDG, non-incognito: https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk
* DDG, incognito: https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk
* Google, non-incognito: https://www.lesswrong.com/tag/ai-risk
* Google, incognito: https://www.lesswrong.com/tag/ai-risk

I rather expect that existing robotic machinery could be controlled by ASI rather than "moderately smart intelligence" into picking up the pieces of a world economy after it collapses, or that if for some weird reason it was trying to play around with static-cling spaghetti, it could pick up the pieces of the economy that way too.

1roha
It seems to me as if we expect the same thing then: If humanity was largely gone (e.g. by several engineered pandemics) and as a consequence the world economy came to a halt, an ASI would probably be able to sustain itself long enough by controlling existing robotic machinery, i.e. without having to make dramatic leaps in nanotech or other technology first. What I wanted to express with "a moderate increase of intelligence" is that it won't take an ASI at the level of GPT-142 to do that, but GPT-7 together with current projects in robotics might suffice to get the necessary planning and control of actuators to come into existence. If that assumption holds, it means an ASI might come to the conclusion that it should end the threat that humanity poses to its own existence and goals long before it is capable of building Drexler nanotech, Dyson spheres, Von Neumann probes or anything else that a large portion of people find much too hypothetical to care about at this point in time.

It's false that currently existing robotic machinery controlled by moderately smart intelligence can pick up the pieces of a world economy after it collapses.  One well-directed algae cell could, but not existing robots controlled by moderate intelligence.

1roha
The question in point 2 is whether an ASI could sustain itself without humans and without new types of hardware such as Drexler-style nanomachinery, which to a significant portion of people (me not included) seems to be too hypothetical to be of actual concern. I currently don't see why the answer to that question should be a highly certain no, as you seem to suggest. Here are some thoughts:

* The world economy is largely catering to human needs, such as nutrition, shelter, healthcare, personal transport, entertainment and so on. Phenomena like massive food waste and people stuck in bullshit jobs, to name just two, also indicate that it's not close to optimal in that. An ASI would therefore not have to prevent a world economy from collapsing or pick it up afterwards, which I also don't think is remotely possible with existing hardware. I think the majority of processes running in the only example of a world economy we have is irrelevant to the self-preservation of an ASI.
* An ASI would presumably need to keep its initial compute substrate running long enough to transition into some autocatalytic cycle, be it on the original or a new substrate. (As a side remark, it's also thinkable that it might go into a reduced or dormant state for a while and let less energy- and compute-demanding processes act on its behalf until conditions have improved on some metric.) I do believe that conventional robotics is sufficient to keep the lights on long enough, but to be perfectly honest, that's conditioned on a lack of knowledge about many specifics, like exact numbers of hardware turnover and energy requirements of data centers capable of running frontier models, the amount and quality of chips currently existing on the planet, the actual complexity of keeping different types of power plants running for a relevant period of time, the many detailed issues of existing power grids, etc. I weakly suspect there is some robustness built into these systems that stems not only from

What does this operationalize as?  Presumably not that if we load a bone rod and a diamond rod under equal pressures, the diamond rod breaks first?  Is it more that if we drop sudden sharp weights onto a bone rod and a diamond rod, the diamond rod breaks first?  I admit I hadn't expected that, despite a general notion that diamond is a crystal and crystals are unexpectedly fragile against particular kinds of hits; if so, that modifies my sense of what's a valid metaphor to use.

As a physicist who is also an (unpublished) SF author, if I were trying to describe an ultimate nanoengineered, physically strong material, it would be a carbon-carbon composite, using a combination of interlocking structures made out of diamond, maybe with some fluorine passivation, separated by graphene-sheet bilayers, building a complex crack-diffusing structure to achieve toughness in ways comparable to the structures of jade, nacre, or bone. It would be not quite as strong or hard as pure diamond, but a lot tougher. And in a claw-vs-armor fight, yeah... (read more)

"Pandemics" aren't a locally valid substitute step in my own larger argument, because an ASI needs its own manufacturing infrastructure before it makes sense for the ASI to kill the humans currently keeping its computers turned on.  So things that kill a bunch of humans are not a valid substitute for being able to, eg, take over and repurpose the existing solar-powered micron-diameter self-replicating factory systems, aka algae, and those repurposed algae being able to build enough computing substrate to go on running the ASI after the humans die.

It's... (read more)

3Leopard
After reading Pope and Belrose's work, a viewpoint of "lots of good aligned ASIs already building nanosystems and better computing infra" has solidified in my mind. And therefore, any accidentally or purposefully created misaligned AIs necessarily wouldn't have a chance of long-term competitive existence against the existing ASIs. Yet, those misaligned AIs might still be able to destroy the world via nanosystems; as we wouldn't yet trust the existing AIs with the herculean task of protecting our dear nature against the invasive nanospecies and all such. Byrnes voiced similar concerns in his point 1 against Pope&Belrose.
2roha
An attempt to optimize for a minimum of abstractness, picking up what was communicated here:

1. How could an ASI kill all humans? Setting off several engineered pandemics a month with a moderate increase of infectiousness and lethality compared to historical natural cases.
2. How could an ASI sustain itself without humans? Conventional robotics with a moderate increase of intelligence in planning and controlling the machinery.

People coming in contact with that argument will check its plausibility, as they will with a hypothetical nanotech narrative. If so inclined, they will come to the conclusion that we may very well be able to protect ourselves against that scenario, either by prevention or mitigation, to which a follow-up response can be a list of other scenarios at the same level of plausibility, derived from not being dependent on hypothetical scientific and technological leaps.

Triggering this kind of x-risk skepticism in people seems less problematic to me than making people think the primary x-risk scenario is far-fetched sci-fi and most likely doesn't hold up to scrutiny by domain experts. I don't understand why communicating a "certain drop dead scenario" with low plausibility seems preferable over a "most likely drop dead scenario" with high plausibility, but I'm open to being convinced that this approach is better suited for the goal of x-risk of ASI being taken seriously by more people. Perhaps I'm missing a part of the grander picture?
3Oliver Sourbut
Gotcha, that might be worth taking care to nuance, in that case. E.g. the linked twitter (at least) was explicitly about killing people[1]. But I can see why you'd want to avoid responses like 'well, as long as we keep an eye out for biohazards we're fine then'. And I can also imagine you might want to preserve consistency of examples between contexts. (Risks being misconstrued as overly-attached to a specific scenario, though?)

Yeah... If I'm understanding what you mean, that's why I said,

And I further think actually having a few scenarios up the sleeve is an antidote to the Hollywood/overly-specific failure mode. (Unfortunately 'covalently bonded bacteria' and nanomachines also make some people think in terms of Hollywood plots.) Infrastructure can be preserved in other ways, especially as a bootstrap. I think it might be worth giving some thought to other scenarios as intuition pumps. E.g.:

* AI manipulates humans into building quasi-self-sustaining power supplies and datacentres (or just waits for us to decide to do that ourselves), then launches a kilopandemic followed by next-stage infra construction.
* AI invests in robotics generality and proliferation (or just waits for us to decide to do that ourselves), then uses cyberattacks to appropriate actuators to eliminate humans and bootstrap self-sustenance.
* AI exfiltrates itself and makes oodles of horcrux backups, launches green goo with a genetic clock for some kind of reboot after humans are gone (this one is definitely less solid).
* AI selects and manipulates enough people willing to take a Faustian bargain as its intermediate workforce, equips them (with strategy, materials tech, weaponry, ...) to wipe out everyone else, then bootstraps next-stage infra (perhaps with human assistants!) and finally picks off the remaining humans if they pose any threat.

Maybe these sound entirely barmy to you, but I assume at least some things in their vicinity don't. And some palette/menu of options might be less o

"Pandemics" aren't a locally valid substitute step in my own larger argument, because an ASI needs its own manufacturing infrastructure before it makes sense for the ASI to kill the humans currently keeping its computers turned on.

When people are highly skeptical of the nanotech angle yet insist on a concrete example, I've sometimes gone with a pandemic coupled with limited access to medications that temporarily stave off, but don't cure, that pandemic as a way to force a small workforce of humans preselected to cause few problems to maintain the AI's hard... (read more)

Why is flesh weaker than diamond?  Diamond is made of carbon-carbon bonds.  Proteins also have some carbon-carbon bonds!  So why should a diamond blade be able to cut skin?

I reply:  Because the strength of the material is determined by its weakest link, not its strongest link.  A structure of steel beams held together at the vertices by Scotch tape (and lacking other clever arrangements of mechanical advantage) has the strength of Scotch tape rather than the strength of steel.

Or:  Even when the load-bearing forces holding larg... (read more)
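One way to put the "weakest link" point in symbols (my gloss, not a quote from the comment): for elements carrying a load in series along a single load path, the structure fails as soon as its weakest element does,

$$ F_{\text{fail}} = \min_i F_i, $$

where $F_i$ is the failure load of element $i$; the tape at the vertices, not the steel in the beams, sets the minimum.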

-3paultreseler
"things animated by vitalic spirit, elan vital, can self-heal and self-reproduce" Why aren't you talking with Dr. Michael Levin and Dr. Gary Nolan to assist in the 3D mobility platform builds facilitating AGI biped interaction? Both of them would most assuredly be open to your consult forward. Good to see your still writing.
3Tapatakt
This totally makes sense! But "proteins are held together by van der Waals forces that are much weaker than covalent bonds" is still bad communication.
0Alexander Gietelink Oldenziel
Impressive.

Minor point about the strength of diamond:

bone is so much weaker than diamond (on my understanding) ... Bone cleaves along the weaker fault line, not at its strongest point.

While it is true that the ultimate strength of diamond is much higher than that of bone, this is relevant primarily to its ability to resist continuously applied pressure (as is its hardness, which enables cutting). The point about fault lines seems more relevant to toughness, another material property, one that describes how much energy can be absorbed without breaking, and there bone beats diamond eas... (read more)
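For readers who want that distinction in symbols (a textbook definition, not part of the comment above): toughness is the energy a material absorbs per unit volume before fracture, i.e. the area under the stress-strain curve,

$$ U_T = \int_0^{\varepsilon_f} \sigma(\varepsilon)\, \mathrm{d}\varepsilon, $$

so a material with a lower peak stress but a much larger strain-to-failure $\varepsilon_f$ (bone, with its crack-deflecting microstructure) can absorb more energy than a harder but more brittle one (diamond).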

5Oliver Sourbut
This is an interesting one. I'd also have thought a priori that your strategy of focusing on strength (we're basically focusing pretty hard on tensile strength, I think?) would be nice and simple and intuitive.[1] But in practice this seems to confuse/put off quite a few people (exemplified by this post and similar).

I wonder if focusing on other aspects of designed-vs-evolved nanomachines might be more effective? One core issue is the ability to aggressively adversarially exploit existing weaknesses and susceptibilities... e.g. I have had success by making gestures like 'rapidly iterating on pandemic-potential viruses or other replicators'[2]. I don't think there's a real need to specifically invoke hardnesses in a 'materials science' sense. Like, ten pandemics a month is probably enough to get the point across, and doesn't require stepping much past existing bio. Ten pandemics a week, coordinated with photosynthetic and plant-parasitising bio stuff if you feel like going hard. I think these sorts of concepts might be inferentially closer for a lot of people.

It's always worth emphasising (and you do) that any specific scenario is overly conjunctive and just one option among many. If I had to guess an objection, I wonder if you might feel that's underplaying the risk in some way?

----------------------------------------

1. It brings to mind the amusing molecular simulations of proteins and other existing nanomachines, where everything is amusingly wiggly. Like, it looks so squishy! Obviously things can be stronger than that. ↩︎
2. By 'success' I mean 'they have taken me seriously, apparently updated their priorities, and (I think) in a good and non-harmful way' ↩︎

Depends on how much of a superintelligence, how implemented.  I wouldn't be surprised if somebody got far superhuman theorem-proving from a mind that didn't generalize beyond theorems.  Presuming you were asking it to prove old-school fancy-math theorems, and not to, eg, arbitrarily speed up a bunch of real-world computations like asking it what GPT-4 would say about things, etc.

Solution (in retrospect this should've been posted a few years earlier):

let
'Na' = box N contains the angry frog
'Ng' = box N contains the gold
'Nf' = box N's inscription is false
'Nt' = box N's inscription is true

consistent states must have 1f 2t or 1t 2f, and 1a 2g or 1g 2a

then:

1a 1t, 2g 2f => 1t, 2f
1a 1f, 2g 2t => 1f, 2t
1g 1t, 2a 2f => 1t, 2t
1g 1f, 2a 2t => 1f, 2f
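A small mechanical sketch of just the pre-filtering step (my own illustration, not part of the original solution): it enumerates the four candidate states allowed by the two constraints above, exactly one inscription true and exactly one box holding the gold. Checking each candidate against the boxes' actual inscriptions, the "=>" column above, still has to be done by hand, since the inscriptions themselves aren't reproduced here.

```python
from itertools import product

# Enumerate candidate states satisfying the two meta-constraints:
#   exactly one of the two inscriptions is true (1t 2f or 1f 2t), and
#   exactly one box holds the gold (1g 2a or 1a 2g).
candidates = []
for gold_box, true_box in product((1, 2), repeat=2):
    contents = {gold_box: "g", 3 - gold_box: "a"}   # g = gold, a = angry frog
    truth = {true_box: "t", 3 - true_box: "f"}      # t = true, f = false
    candidates.append(f"1{contents[1]} 1{truth[1]}, 2{contents[2]} 2{truth[2]}")

print(candidates)
# ['1g 1t, 2a 2f', '1g 1f, 2a 2t', '1a 1t, 2g 2f', '1a 1f, 2g 2t']
```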

I currently guess that a research community of non-upgraded alignment researchers with a hundred years to work, picks out a plausible-sounding non-solution and kills everyone at the end of the hundred years.

2Bogdan Ionut Cirstea
I'm highly confused about what is meant here. Is this supposed to be about the current distribution of alignment researchers? Is this also supposed to hold for e.g. the top 10 percentile of alignment researchers, e.g. on safety mindset? What about many uploads/copies of Eliezer only?
1Simon Fischer
I'm confused by this statement. Are you assuming that AGI will definitely be built after the research time is over, using the most-plausible-sounding solution? Or do you believe that you understand NOW that a wide variety of approaches to alignment, including most of those that can be thought of by a community of non-upgraded alignment researchers (CNUAR) in a hundred years, will kill everyone and that in a hundred years the CNUAR will not understand this? If so, is this because you think you personally know better or do you predict the CNUAR will predictably update in the wrong direction? Would it matter if you got to choose the composition of the CNUAR?

I don't think that faster alignment researchers get you to victory, but uploading should also allow for upgrading and while that part is not trivial I expect it to work.

7Kaarel
I'd be quite interested in elaboration on getting faster alignment researchers not being alignment-hard — it currently seems likely to me that a research community of unupgraded alignment researchers with a hundred years is capable of solving alignment (conditional on alignment being solvable). (And having faster general researchers, a goal that seems roughly equivalent, is surely alignment-hard (again, conditional on alignment being solvable), because we can then get the researchers to quickly do whatever it is that we could do — e.g., upgrading?)

AI happening through deep learning at all is a huge update against alignment success, because deep learning is incredibly opaque.  LLMs possibly ending up at the center is a small update in favor of alignment success, because it means we might (through some clever sleight, this part is not trivial) be able to have humanese sentences play an inextricable role at the center of thought (hence MIRI's early interest in the Visible Thoughts Project).

The part where LLMs are to predict English answers to some English questions about values, and show common-se... (read more)

5Seth Herd
It seems like those goals are all in the training set, because humans talk about those concepts. Corrigibility is elaborations of "make sure you keep doing what these people say", etc. It seems like you could simply use an LLM's knowledge of concepts to define alignment goals, at least to a first approximation. I review one such proposal here. There's still an important question about how perfectly that knowledge generalizes with continued learning, and to OOD future contexts. But almost no one is talking about those questions. Many are still saying "we have no idea how to define human values", when LLMs can capture much of any definition you like.
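To make "use an LLM's knowledge of concepts to define alignment goals" concrete, here is a minimal sketch (my illustration, not the proposal reviewed above): an LLM-as-judge that scores a candidate action against a natural-language goal statement. The `complete` parameter stands in for any prompt-to-text model call; the toy stub exists only to make the example runnable.

```python
from typing import Callable

def judge_action(goal: str, action: str, complete: Callable[[str], str]) -> bool:
    """Ask a language model whether `action` is consistent with `goal`.

    `complete` is any prompt -> text function (e.g. a chat-completion call);
    it is passed in so the sketch stays self-contained.
    """
    prompt = (
        f"Goal: {goal}\n"
        f"Proposed action: {action}\n"
        "Does the proposed action follow the goal? Answer YES or NO."
    )
    return complete(prompt).strip().upper().startswith("YES")

# Toy stand-in for a real model, only to show the control flow.
def toy_complete(prompt: str) -> str:
    return "YES" if "ask the operator" in prompt else "NO"

print(judge_action(
    goal="Keep doing what the operators say.",
    action="Before shutting down the factory, ask the operator for confirmation.",
    complete=toy_complete,
))  # -> True
```

In an RLAIF-style setup the boolean (or a graded score) would feed a reward signal; the open questions raised above, how well the judged concept generalizes under continued learning and out of distribution, are untouched by this sketch.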
3Noosphere89
I want to note that this part: This is wrong, and this disagreement is, at a very deep level, why I think LW was wrong on the object level. AIs are white boxes, not black boxes, because we have full read-write access to their internals, which is partially why AI is so effective today. We are the innate reward system, which already aligns our brains to survival, and critically it does all of this with almost no missteps, and the missteps aren't very severe. The meme of AI as a black box needs to die. These posts can help you get better intuitions, at least: https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/ai-pause-will-likely-backfire#White_box_alignment_in_nature

I have never since 1996 thought that it would be hard to get superintelligences to accurately model reality with respect to problems as simple as "predict what a human will thumbs-up or thumbs-down".  The theoretical distinction between producing epistemic rationality (theoretically straightforward) and shaping preference (theoretically hard) is present in my mind at every moment that I am talking about these issues; it is to me a central divide of my ontology.

If you think you've demonstrated by clever textual close reading that Eliezer-2018 or Elieze... (read more)

-22Ed Corns
TurnTrout*Ω29596

Getting a shape into the AI's preferences is different from getting it into the AI's predictive model.

It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?

I commonly encounter people expressing sentiments like "prosaic alignment work isn't real alignment, because we aren't actually getting the AI to care about X." To which I say: How do you know that? What does it even mean for that claim to be true or false? What do you think you know, and why do you think you know it? What e... (read more)

Kaj_SotalaΩ71412

Getting a shape into the AI's preferences is different from getting it into the AI's predictive model.  MIRI is always in every instance talking about the first thing and not the second.

You obviously need to get a thing into the AI at all, in order to get it into the preferences, but getting it into the AI's predictive model is not sufficient.  It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine; the low-friction ball-bearings are not the main problem, they are a kin

... (read more)
Chris_LeongΩ6128

Your comment focuses on GPT4 being "pretty good at extracting preferences from human data" when the stronger part of the argument seems to be that "it will also generally follow your intended directions, rather than what you literally said".

I agree with you that it was obvious in advance that a superintelligence would understand human value.

However, it sure sounded like you thought we'd have to specify each little detail of the value function. GPT4 seems to suggest that the biggest issue will be a situation where:

1) The AI has an option that would produce ... (read more)

evhub*Ω418319

I'm not going to comment on "who said what when", as I'm not particularly interested in the question myself, though I think the object level point here is important:

This makes the nonstraightforward and shaky problem of getting a thing into the AI's preferences, be harder and more dangerous than if we were just trying to get a single information-theoretic bit in there.

The way I would phrase this is that what you care about is the relative complexity of the objective conditional on the world model. If you're assuming that the model is highly capable, an... (read more)
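A hedged way to write that down (my notation, not necessarily evhub's): if $O$ is a specification of the objective and $M$ is the model's world model, the relevant quantity is something like the conditional description length

$$ K(O \mid M) \quad \text{rather than} \quad K(O), $$

i.e. the length of the shortest program that produces $O$ given $M$ as input. When $M$ already encodes the human concepts the objective is phrased in, $K(O \mid M)$ can be small even though $K(O)$ on its own is large.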

1Writer
Eliezer, are you using the correct LW account? There's only a single comment under this one.

Getting a shape into the AI's preferences is different from getting it into the AI's predictive model.  MIRI is always in every instance talking about the first thing and not the second.

Why would we expect the first thing to be so hard compared to the second thing? If getting a model to understand preferences is not difficult, then the issue doesn't have to do with the complexity of values. Finding the target and acquiring the target should have the same or similar difficulty (from the start), if we can successfully ask the model to find the target fo... (read more)

4Tor Økland Barstad
Your reply here says much of what I would expect it to say (and much of it aligns with my impression of things). But why you focused so much on "fill the cauldron" type examples is something I'm a bit confused by (if I remember correctly I was confused by this in 2016 also).
Matthew Barnett*Ω4012736

I think you missed some basic details about what I wrote. I encourage people to compare what Eliezer is saying here to what I actually wrote. You said:

If you think you've demonstrated by clever textual close reading that Eliezer-2018 or Eliezer-2008 thought that it would be hard to get a superintelligence to understand humans, you have arrived at a contradiction and need to back up and start over.

I never said that you or any other MIRI person thought it would be "hard to get a superintelligence to understand humans". Here's what I actually wrote:

Non-MIRI p

... (read more)
Rob BensingerΩ1024-4

But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. [...] I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!

Quoting myself in April:

"MIRI's argument for AI risk depended on AIs being bad at n

... (read more)

Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.


Everyone agrees that you assumed superintelligence would understand everything humans understand and more. The dispute is entirely about the things that you encounter before superintelligence. In general it seems like the world turned out much more gradual than you expected and there's information to be found in what capabilities emerged sooner in the process.


There's perhaps more detail in Project Lawful and in some nearby stories ("for no laid course prepare", "aviation is the most dangerous routine activity").

6[anonymous]
There are still adversarial equilibria even if every person on the planet is as smart as you. Greater intelligence makes people more tryhard in their roles. It is possible that one of the reasons things work at all today is that regulators get tired and let people do things, cops don't remember all the laws so they allow people to break them, scientists do something illogical and accidentally make a major discovery, and so on. But doctors and mortuary workers couldn't scam people into being against cryonics, because the average person would be smart. The FDA couldn't scam people into thinking slow drug approvals protect their lives, because average people would be smart. Local housing authorities couldn't scam people into affordable housing requirements, because average people would understand supply and demand. Huh. I think you might be correct. Too bad evolution didn't have enough incentive to make humans that smart.

Have you ever seen or even heard of a person who is obese who doesn't eat hyperpalatable foods? (That is, they only eat naturally tasting, unprocessed, "healthy" foods).

Tried this for many years.  Paleo diet; eating mainly broccoli and turkey; trying to get most of my calories from giant salads.  Nothing.

3lc
Might not be the right place to ask this question but: Have you tried bariatric surgery? I'm sure you've done a bunch of research and was wondering if you hadn't because of something you'd read, or you had and it didn't work, or some third thing.

I am not - $150K is as much as I care to stake at my present wealth levels - and while I refunded your payment, I was charged a $44.90 fee on the original transmission which was not then refunded to me.

ilialuk285

Oh, that's suboptimal; sending $100 to cover the fee charge (the extra is in case they take another fee for some reason).

Again, apologies for the inconvenience. (wire sent)

Though I disagree with @RatsWrongAboutUAP (see this tweet) and took the other side of the bet, I say a word of praise for RatsWrong for following exactly the proper procedure to make the point they wanted to make, and for communicating that they really actually think we're wrong here.  Object-level disagreement, meta-level high-five.

1[comment deleted]
4Lord Dreadwar
Meta-level high-five for engaging with a stigmatised topic with extensive reasoning; onto object-level disagreement: what better strategies did you have in mind re. achieving goals like positioning themselves at the top of our status hierarchy?

NHI would likely be aware of cognitive biases we are not, as well as those we are (e.g. the biases that cause humans to double down when prophecies fail in cults, and generally act weirdly around incredibly slim evidence). The highest-status authority, in the eyes of the vast majority of humans, is a deity or deities, and these highly influential, species-shaping status hierarchies are largely based on a few flimsy apparitions. (This is somewhat suspicious, if your priors for alien visitation are relatively high; mine are relatively high due to molecular panspermia.)

If you had to isolate seeds for future dominant religions, UFO and UFO-adjacent cults (including Scientology and the New Age) seem like plausible candidates; UFOs are frequently cited as the primary example of an emerging myth in the modern world. If we assume these results are the desired result, we could hypothesise that NHI is using its monopoly on miracle generation to craft human-tailored, memetically viral belief systems, from ancient gods to today's saucers. Given that ancient gods DO occupy the top of our status hierarchy, beyond our corporate, cultural and political leaders, I'm not sure we can be so confident that creating disreputable UFO reports is a poor strategy; less reputable reports came to dominate the world within a few centuries.
6awg
How does all of the recent official activity fit into your worldview here? Do you have your own speculations/explanations for why, e.g., Chuck Schumer would propose such specifically-worded legislation on this topic? Does that stuff just not factor into your worldview at all (or perhaps is weighted next to nothing against your own tweeted-about intuitions)?

Glad to have made this bet with you!

5John Wiseman
Eliezer, I'm offering you double the odds at 75:1, and would put up to $50k against your corresponding amount. If the UFO phenomenon is real,  p(doom) may be much lower than you think, and in this view you'd probably be happy to pay it out.