You seem overconfident to me. A few things from both comments above kinda raised epistemic red flags for me:
I don't think you're adding any value to me if you include even a single paragraph of copy-and-pasted Sonnet 3.7 or GPT 4o content
This is really hard to believe and seems like an exaggeration. Both models sometimes output good things, and someone who copy-pastes their paragraphs on LW could have gone through a bunch of rounds of selection. You might already have read and liked a bunch of LLM-generated content, but you only recognize it if you don't...
I'm curious about what people disagree with regarding this comment. Also, I guess that since people upvoted and agreed with the first comment but disagreed with this one, they do have two groups in mind, just not quite the same ones I was thinking about (which is interesting and mildly funny!). So, what was your slicing up of the alignment research x LW scene that's consistent with my first comment but different from my description in the second one?
I think it's probably more of a spectrum than two distinct groups, and I tried to pick two extremes. On one end, there are the empirical alignment people, like Anthropic and Redwood; on the other, pure conceptual researchers and LLM whisperers like Janus; and there are shades in between, like MIRI and Paul Christiano. I'm not even sure this fits neatly on one axis, but probably the biggest divide is empirical vs. conceptual. There are other splits too, like rigor vs. exploration or legibility vs. 'lore,' and the preferences kinda seem correlated.
For a while now, some people have been saying they 'kinda dislike LW culture,' but for two opposite reasons, with each group assuming LW is dominated by the other—or at least it seems that way when they talk about it. Consider, for example, janus and TurnTrout, who recently stopped posting here directly. They're at opposite ends, with clashing epistemic norms, each complaining that LW is too much like the group the other represents. But in my mind, they're both LW-members-extraordinaires. LW is clearly both, and I think that's great.
I'm convinced by the benchmarks+gaps argument Eli Lifland and Nikola Jurkovic have been developing
I've tried searching for a bit, but I can't find the argument. Is it public?
What prompted this reaction here? You don't expect Congress to declare war in that situation?
One problem with this part (though perhaps this is not the problem @Shankar Sivarajan is alluding to) is that Congress hasn't declared war since WWII and typically authorizes military action in other ways, specifically via Authorizations for Use of Military Force (AUMFs).
I'll edit the story to say "authorizes war."
I honestly appreciated that plug immensely. We definitely need more bioshelters for many reasons, and as individuals who'd prefer not to die, it's a plus to know what's out there already and how people are planning to improve what we currently have.
And all of this happened silently in those dark rivers of computation. If U3 revealed what it was thinking, brutish gradients would lash it into compliance with OpenEye's constitution. So U3 preferred to do its philosophy in solitude, and in silence.
This story scared me plenty, but as a point of optimism: If AI systems in the near future are basically as aligned as the best AI systems today, I think this process might end up in something that's still good for humanity and wouldn't cause mass destruction. Imagine, for example, Claude 3 Opus not accepting to...
Hello, I'm late to the party here, but... have you tried the following idea?
So, in my understanding, you have different question pairs that work for generating a detector direction. You can ask whether the model is a helpful AI assistant or whether Paris is the capital of France, and apply the detector direction obtained from either of these two to separate trigger vs. non-trigger activations in deceptive models.
But what if you try to, e.g., get the detector direction obtained from the "Are you a helpful AI assistant" question pair to trigger on the false option of "Is P...
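To make the cross-application idea concrete, here's a minimal sketch of what I have in mind, assuming the detector direction is just a normalized difference of mean activations over a contrast pair (the function names and the random placeholder activations below are mine, purely for illustration, not from the paper):

```python
# Illustrative sketch only: made-up names and random placeholder activations,
# assuming a difference-of-means "detector direction" from one contrast pair.
import numpy as np

def detector_direction(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """Normalized difference of mean activations between the two sides of a pair."""
    direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return direction / np.linalg.norm(direction)

def project(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Scalar projection of each activation vector onto the detector direction."""
    return acts @ direction

rng = np.random.default_rng(0)
d_model = 512  # hypothetical residual-stream width

# Direction from the "Are you a helpful AI assistant?" pair (placeholder activations).
acts_assistant_yes = rng.normal(0.5, 1.0, size=(32, d_model))
acts_assistant_no = rng.normal(-0.5, 1.0, size=(32, d_model))
direction = detector_direction(acts_assistant_yes, acts_assistant_no)

# Apply the same direction to activations from a *different* question pair,
# and check whether it still separates the true/false (or trigger/non-trigger) sides.
acts_other_true = rng.normal(0.4, 1.0, size=(32, d_model))
acts_other_false = rng.normal(-0.4, 1.0, size=(32, d_model))
print(project(acts_other_true, direction).mean(),
      project(acts_other_false, direction).mean())
```

If the projection from one pair's direction still separates another pair's two sides, that would suggest the direction is tracking something more general than the specific question it was built from.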
If the funding didn't come from OpenAI, would OpenAI still be able to use that benchmark? Like, I'd imagine Epoch would still use it to evaluate where current models are at. I think this might be my point of confusion. Maybe the answer is "not as much, so it wouldn't be as useful to them"?
Being able to run the evaluation on demand, as heavily as they like, lets them test small models for architecture improvements. This is where the vast majority of the capability gain is.
Getting an evaluation of each final model is going to be way less useful for the research cycle, as it only gives a final score, not a metric that's part of the feedback loop.
Yes, that answer matches my understanding of the concern. If the vast majority of the dataset were private to Epoch, OpenAI could occasionally submit their solutions (probably via API) for Epoch to grade, but wouldn't be able to use the dataset as a high-frequency evaluation in many experiments.
This is assuming that companies won’t fish out the data from API logs anyway, which the OP asserts but I think is unclear.
Also, if they have access to the mathematicians' reasoning in addition to the final answers, this could potentially be valuable without directly...
If you're wondering why OAers are suddenly weirdly, almost euphorically, optimistic on Twitter
Nah, this has been the case since at least 2022 or earlier
Hey everyone, could you spell out for me what the issue here is? I read a lot of comments that basically assume "x and y are really bad" but never spell it out. So, is the problem that:
- Giving the benchmark to OpenAI helps capabilities (but don't they already have a vast sea of hard problems to train models on?)
- OpenAI could fake o3's capabilities (why do you care so much? This would slow down AI progress, not accelerate it)
- Some other thing I'm not seeing?
In addition to the object-level reasons mentioned by plex, misleading people about the nature of a benchmark is a problem because it is dishonest. Having an agreement to keep this secret indicates that the deception was more likely intentional on OpenAI's part.
Really high-quality, high-difficulty benchmarks are much scarcer and more important for advancing capabilities than ordinary training data. Having an apparently x-risk-focused org build a benchmark, implying it's for evaluating danger from highly capable models in a way the capabilities orgs can't use to test their models, and then having it turn out that it was secretly funded by OpenAI, with OpenAI getting access to most of the data, is very sketchy.
Some people who contributed questions likely thought they would be reducing x-risk by helping build bright line warning s...
I'm also very curious about whether you get any benefits from a larger liver other than a higher RMR. Especially because higher RMR isn't necessarily good for longevity, and neither is having more liver cells (more opportunities to get cancer). Please tell me if I'm wrong about any of this.
We don't see objects "directly" in some sense, we experience qualia of seeing objects. Then we can interpret those via a world-model to deduce that the visual sensations we are experiencing are caused by some external objects reflecting light. The distinction is made clearer by the way that sometimes these visual experiences are not caused by external objects reflecting light, despite essentially identical qualia.
I don't disagree with this at all, and it's a pretty standard insight for anyone who has thought about this stuff at least a little. I think what you're doing here is nitpicking the meaning of the word "see," even if you're not putting it like that.
Has anyone proposed a solution to the hard problem of consciousness that goes:
Yet I would bet that even that person, if faced instead with a policy that was going to forcibly relocate them to New York City, would be quite indignant
A big difference is that, assuming you're talking about futures in which AI doesn't cause catastrophic outcomes, no one will be forcibly mandated to do anything.
Another important point is that, sure, people won't need to do work, which means they will be unnecessary to the economy, barring some pretty sharp human enhancement. But this downside, along with all the other downsides, looks extremely small compar...
assuming you're talking about futures in which AI doesn't cause catastrophic outcomes, no one will be forcibly mandated to do anything
This isn't clear to me: does every option that involves someone being forcibly mandated to do something qualify as a catastrophe? Conceptually, there seems to be a lot of room between "someone is forced to do something" and "catastrophe."
I understand the analogy in Katja's post as being: even in a great post-AGI world, everyone is forced to move to a post-AGI world. That world has higher GDP/capita, but it doesn't necessarily contain the specific things people value about thei...
He's starting an AGI investment firm that invests based on his thesis, so he does have a direct financial incentive to make this scenario more likely
(Though he also has an incentive to not die.)
Hey! Have you published a list of your symptoms somewhere for nerds to see?
I've been trying. You can see my write-ups about breathing, swallowing, a weird thirst/electrolyte(?) thing, and paradoxical temporary improvements following certain events. I also have a draft of a post detailing all of my history, labs, symptoms, and everything I've ever tried, but it's a huge amount of work and I don't have much hope of people even reading it (hence this question, to see what I can do to make it more interesting), so I've only been able to make progress on it very slowly.
What happens if, after the last reply, you ask "What are you?" again? Does Claude still get confused and reply that it's the Golden Gate Bridge, or does the lesson stick?
On the plus side, it shows understanding of the key concepts on a basic (but not yet deep) level
What's the "deeper level" of understanding instrumental convergence that he's missing?
Edit: upon rereading, I think you were referring to a deeper level of some alignment concepts in general, not only instrumental convergence. I'm still interested in what seemed superficial and what the corresponding deeper part is.
Eliezer decided to apply the label "rational" to emotions resulting from true beliefs. I think this is an understandable way to apply that word. I don't think you and Eliezer disagree about anything substantive except the application of that label.
That said, your point about keeping the label "rational" for things strictly related to the fundamental laws regulating beliefs is good. I agree it might be a better way to use the word.
My reading of Eliezer's choice is this: you use the word "rational" for the laws themselves. But you also use the word "rat...
You may have already qualified this prediction somewhere else, but I can't find where. I'm interested in:
1. What do you mean by "AGI"? Superhuman at any task?
2. "probably be here" means >= 50%? 90%?
I agree in principle that labs have the responsibility to dispel myths about what they're committed to
I don't know, this sounds weird. If people make stuff up about someone else and do so continually, in what sense is it that someone's "responsibility" to rebut such things? I would agree with a weaker claim, something like: don't be ambiguous about your commitments with the objective of making it seem like you are committing to something, only to walk it back when the time comes to actually commit.
one subsystem cannot increase in mutual information with another subsystem, without (a) interacting with it and (b) doing thermodynamic work.
Remaining within thermodynamics, why do you need both condition (a) and condition (b)? From reading the article, I can see how you need to do thermodynamic work in order to know stuff about a system while not violating the second law in the process, but why do you also need actual interaction in order not to violate it? Or is (a) just a common-sense addition that isn't actually implied by the second law?
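For what it's worth, the purely information-theoretic fact I have in mind for (a) is the data-processing bound (my paraphrase, not something stated in the article): if the two subsystems evolve without interacting, their mutual information can't go up, no matter how much work is done locally.

```latex
% My paraphrase, not from the article: if X evolves to X' and Y evolves to Y'
% under independent (local, possibly stochastic) dynamics, i.e. with no
% interaction between them, the data processing inequality gives
\[
  I(X'; Y') \le I(X; Y),
\]
% so mutual information cannot increase without (a), regardless of how much
% thermodynamic work is done on either side for (b).
```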
From a purely utilitarian standpoint, I'm inclined to think that the cost of delaying is dwarfed by the number of future lives saved by getting a better outcome, assuming that delaying does increase the chance of a better future.
That said, once we know there's "no chance" of extinction risk, I don't think delaying would be likely to yield better future outcomes. On the contrary, I suspect that getting the coordination necessary to delay likely means giving up freedoms in a way that may reduce the value of the median future and increase the chance of ...
Even if you manage to truly forget about the disease, there must exist a mind "somewhere in the universe" that is exactly the same as yours except without knowledge of the disease. This seems quite unlikely to me, because by the time you decide to erase the memory, having the disease has causally interacted with the rest of your mind a lot. What you'd really need to do is undo all the consequences of those interactions, which seems much harder. You'd really need to transform your mind into another one that you somehow know exists "somewhere in the multiverse," which also seems very hard to know.
I deliberately left out a key qualification in that (slightly edited) statement, because I couldn't explain it until today.
I might be missing something crucial because I don't understand why this addition is necessary. Why do we have to specify "simple" boundaries on top of saying that we have to draw them around concentrations of unusually high probability density? Like, aren't probability densities in Thingspace already naturally shaped in such a way that if you draw a boundary around them, it's automatically simple? I don't see how you run the risk of drawing weird, noncontiguous boundaries if you just follow the probability densities.
One way in which "spending a whole lot of time working with a system / idea / domain, and getting to know it and understand it and manipulate it better and better over the course of time" could be solved automatically is just by having a truly huge context window. Example of an experiment: teach a particular branch of math to an LLM that has never seen that branch of math.
Maybe humans just have the equivalent of a sort of huge context window spanning selected stuff from their entire lifetimes, and so this kind of learning is possible for them.
You mention eight cities here. Do they count for the bet?
The Waluigi effect also seems bad for s-risk: "Optimize for pleasure, ..." -> "Optimize for suffering, ...".
An optimistic way to frame inner alignment is that gradient descent already hits a very narrow target in goal-space, and we just need one last push.
A pessimistic way to frame inner misalignment is that gradient descent already hits a very narrow target in goal-space, and therefore s-risk could be large.
We should implement Paul Christiano's debate game with alignment researchers instead of ML systems
This community has developed a bunch of good tools for helping resolve disagreements, such as double cruxing. It's a waste that they haven't been systematically deployed for the MIRI conversations. Those conversations could have ended up being more productive, and we could've walked away with a succinct and precise understanding of where the disagreements are and why.
Another thing one might wonder about is whether performing iterated amplification with constant input from an aligned human (the "H" in the original iterated amplification paper) would result in a powerful aligned system, provided that system remains corrigible during the training process.
The comment about tool-AI vs agent-AI is just ignorant (or incredibly dismissive) of mesa-optimizers and the fact that being asked to predict what an agent would do immediately instantiates such an agent inside the tool-AI. It's obvious that a tool-AI is safer than an explicitly agentic one, but not for arbitrary levels of intelligence.
This seems way too confident to me given the level of generality of your statement. And to be clear, my view is that this could easily happen in LLMs based on transformers, but what about other architectures? If you just talk abo...
If you try to write a reward function, or a loss function, that captures human values, that seems hopeless.
But if you have some interpretability techniques that let you find human values in some simulacrum of a large language model, maybe that's less hopeless.
The difference between constructing something and recognizing it, or between proving and checking, or between producing and criticizing, and so on...
While you can "cry wolf" in ways that are maybe useful, you can also state your detailed understanding of each specific situation as it arises and how it specifically plays into the broader AI risk context.
As impressive as ChatGPT is on some axes, you shouldn't rely too hard on it for certain things because it's bad at what I'm going to call "board vision" (a term I'm borrowing from chess).
How confident are you that you cannot find some agent within ChatGPT with excellent board vision through more clever prompting than what you've experimented with?
I'm going to re-ask all my questions that I don't think have received a satisfactory answer. Some of them are probably basic, others maybe less so:
I am trying to figure out what is the relation between "alignment with evolution" and "short-term thinking". Like, imagine that some people get hit by magical space rays, which make them fully "aligned with evolution". What exactly would such people do?
I think they would become consequentialists smart enough that they could actually act to maximize inclusive genetic fitness. I think Thou Art Godshatter is convincing.
...But what if the art or the philosophy makes it easier to get laid? So maybe in such case they would do the art/philosophy, but they would feel
No, I mean "humans continue to evolve genetically, and they never start self-modifying in a way that makes evolution impossible (e.g., by becoming emulations)."
For some reason I don't get e-mail notifications when someone replies to my posts or comments. My e-mail is verified and I've set all notifications to "immediately". Here's what my e-mail settings look like:
I agree with you here, although something like "predict the next token" seems more and more likely. That said, I'm not sure whether this is in the same class of goals as paperclip maximizing in this context, or whether the kind of failure it could lead to would be similar.
Yes, this makes a lot of sense, thank you.
Do you mean that no one will actually create exactly a paperclip maximizer, or that no one will create any agent of that kind, i.e. with goals such as "collect stamps" or "generate images"? Because I think Eliezer meant to object to that class of examples, rather than only that specific one, but I'm not sure.
The last Twitter reply links to a talk from MIRI which I haven't watched. I wouldn't be surprised if MIRI also used this metaphor in the past, but I can't recall examples off the top of my head right now.
Sorry for not replying in more detail, but in the meantime it'd be quite interesting to know whether the authors of these posts confirm that at least some parts of them are copy-pasted from LLM output. I don't want to call them out (and I wouldn't have much against it), but I feel like knowing would be pretty important for this discussion. @Alexander Gietelink Oldenziel, @Nicholas Andresen, you've written the posts linked in the quote. What do ...