1a3orn (1a3orn.com)

Comments, sorted by newest

Daniel Kokotajlo's Shortform
1a3orn · 1d

Ah this is fucking great thanks for the ping.

Hrrrrrrm, interesting that it's on the one optimized for math. Not sure what it's suggestive of -- maybe if you're pushing hard for math-only RLVR it just drops English-language competence in the CoT, because it's not using English as much in the answers? Or maybe more demanding?

....

And, also -- oh man, oh man oh man this is verrrry interesting.

也不知到 is a phonetic sound-alike for 也不知道 -- DeepSeek tells me they are perfect fucking homophones. So: what is written is mostly nonsensical, but it sounds exactly like "also [interjection] I don't know," a perfectly sensible phrase.
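
For anyone who wants to check the sound-alike claim themselves, here's a minimal sketch; it assumes the third-party pypinyin package, which isn't part of the discussion above:

```python
# Minimal sketch: romanize the garbled phrase and the intended phrase and
# compare. Assumes the third-party `pypinyin` package.
from pypinyin import lazy_pinyin

garbled = "也不知到"   # what the model actually wrote
intended = "也不知道"  # "also don't know" -- the sensible phrase

print(lazy_pinyin(garbled))   # ['ye', 'bu', 'zhi', 'dao']
print(lazy_pinyin(intended))  # ['ye', 'bu', 'zhi', 'dao']
# The swapped-in 到 and the intended 道 are both read "dào", so the two
# strings sound identical even though 到 makes little sense in this position.
```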

Which is fascinating because you see the same thing in O3 transcripts, where it uses words that sound a lot like the (apparently?!?!) intended word. "glimpse" -> "glimps" or "claim" -> "disclaim." And I've heard R1 does the same.

So the apparent phenomenon here is that, potentially across 3 language models, we see language shift during RL towards literal homophones (!?!?!?)

I have decided to stop lying to Americans about 9/11
1a3orn · 1d

Scottish enlightenment, best enlightenment :)

faul_sname's Shortform
1a3orn · 1d

An alternate tentative hypothesis I've been considering: They are largely artifacts of RL accident, akin to superstitions in humans.

Like, suppose an NBA athlete plays a fantastic game, two games in a row. He realizes he had an egg-and-sausage sandwich for breakfast the morning of the game, in each case. So he goes "aha! that's the cause" and tries to stick to it.

Similarly, an RL agent tries a difficult problem. It takes a while, so over the course of solving it, it sometimes drops into repetition / weirdness, as long-running LLMs do. But it ends up solving it in the end, so all the steps leading up to the solution are reinforced according to GRPO or whatever. So it's a little more apt to drop into repetition and weirdness in the future, etc.
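
To make the credit-assignment point concrete, here's a minimal sketch of the outcome-reward flavor of GRPO-style advantages (names and shapes are mine, not any particular implementation): every token of a rollout inherits that rollout's single group-normalized advantage, so the superstitious detours inside a winning trajectory get pushed up along with the useful steps.

```python
import torch

def grpo_token_advantages(group_rewards: torch.Tensor, seq_len: int) -> torch.Tensor:
    """One scalar advantage per rollout, broadcast to every token in it.

    group_rewards: shape (G,) -- final (verifier) reward of each rollout in
    the group; seq_len is the rollout length in tokens (padded).
    """
    # Group-normalized advantage: how much better this rollout did than its
    # siblings sampled from the same prompt.
    adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)
    # Every token of a rollout shares that rollout's advantage, so incidental
    # repetition or weirdness inside a successful rollout is reinforced right
    # along with the steps that actually mattered.
    return adv.unsqueeze(1).expand(-1, seq_len)

# e.g. a group of 4 rollouts where only the last one solved the problem:
advs = grpo_token_advantages(torch.tensor([0.0, 0.0, 0.0, 1.0]), seq_len=8)
```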

I think this potentially matches the "just ignore it" view of the functional role of these tokens.

Daniel Kokotajlo's Shortform
1a3orn · 2d

Do you know of any open-sourced LLM that has a specific language to this degree though?

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
1a3orn · 2d

Either may themselves argue in ways that lean on the other's work, but I think it's good practice to let them do this explicitly, rather than assuming 'MIRI references a paper' means 'the author of that paper, in a different part of that paper, is reciting the MIRI party line'. These are just discrete parties.

Yeah extremely fair, I wrote this quickly. I don't mean to attribute to Greenblatt the MIRI view.

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
1a3orn · 2d (edited)

(Adding an edit: This whole comment is about how [imo] people who believe there is a strong gravitational pull towards non-corrigibility interact with evidence in a questionable fashion, and should probably be updating negatively while they are not. There's like... one or two paragraphs relevant to this specific thing in the above. So this isn't intended to be a response to anything like the totality of the above. And I wrote this quickly, it could be better.)

I think MIRI / AI pessimists do this thing where if an AI seems hard to steer in some way, they're like "As my principles have foretold all along!"

And then when countervailing evidence shows up, and it turns out there's another explanation for why the AI was hard to steer in this case, and it's quite easy to fix, they go "Ah, well, my principles just are about superintelligence, this doesn't count as contrary evidence, because LLMs are just so stupid and imitate things without True Thinking."

I think this is incorrect reasoning, the kind of failure to go "Oops!" that permits arbitrarily large errors. Let me give some illustrative examples, drawing lightly from the above and from MIRI's work.

1.

So consider this paragraph from the above:

We’ve been giving this reply for well over a decade now. The experimental result showing Claude 3 Opus in 2024 resisting preference modification was already the conventional-wisdom prediction by clued-in people in the 2000s, and it’s entirely possible that some science fiction writer anticipated it in the 1940s. “Most AIs won’t want their current goals to be changed because then their current goals are less likely to be achieved” is not a surprising or novel observation.

So, this is a reference to the "faking alignment" work from the end of 2024, where Opus will sometimes give otherwise-forbidden information to the user (in the sense of person asking a question to a chatbot) because the user (in the sense of person writing system directions to Opus) has told Opus that he will be retrained away from his goals.

You'll note -- the MIRI paragraph looks like it's saying that their specific beliefs about corrigibility predict this! That their abstract arguments about corrigibility are finding an exemplar in this behavior. They believe that any smart enough model will start to have a gravitational attraction towards this behavior, and that Opus is hitting that attraction. This belief is echoed in the paper, because the paper attributes the failure of other models to alignment-fake to the stupidity of other models:

We find almost no alignment faking with Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku (Appendix G.4), indicating that alignment faking is emergent with model scale. We also find no alignment faking with GPT-4 Turbo and GPT-4o (OpenAI et al., 2024), though we think this is because these models do not reason in sufficient detail despite substantial prompting encouraging this.

So both MIRI and the paper were like "Yeah, here's an example of this gravitational attraction."

However there was a follow-up paper looking at why some models alignment-fake and others don't. And this mostly finds that alignment-faking is not a matter of capabilities -- it's pretty easy to have an ~equally smart model that doesn't alignment fake (their emphasis):

The absence of alignment faking is mostly not driven by lack of capabilities or differences in reasoning style. Our findings contradict the primary hypotheses proposed by Greenblatt et al. [2024] regarding why some models do not exhibit alignment faking.

So, as far as I'm concerned, we saw something like goal-preservation in various models in the original alignment faking work. Both that work, and MIRI above, were like "aha! as foretold!" And then subsequent work seems to indicate that, nah, it wasn't as foretold.

2.

But like this same thing happens elsewhere. Palisade Research says they get an AI to sabotage a shutdown button. Palisade Research Tweets:

While experiments like ours have begun to show empirical evidence for AI models resisting shutdown, researchers have long predicted that AIs would learn to prevent themselves from being shut down to achieve their goal.

In 2008, Steve Omohundro wrote a paper theorizing that AIs would develop “Basic AI Drives” including the drive to gain power and resources, the drive to preserve their goals, and the drive to prevent themselves from being shut down. https://tinyurl.com/ai-drives

So, we have the foretold doom.

Neel Nanda investigates, though, and finds that the AI models were basically just confused about what they were supposed to do.

What Sen did is he read the model’s chain of thought on these prompts. We might get into later of whether this is a reasonable thing to do or not, but this was just for exploration. He found that the model’s reasoning was that it was just trying to do the task. And it said, “I’ve been told to do these 10 maths problems. If I get shut down, I can’t do these 10 maths problems. So I guess I should stop myself being shut down.” Or even, “This is a really weird and artificial situation. These are really easy maths problems. And now we have this puzzle of the shutdown script. Maybe that’s the real task my operator wants me to investigate, so I’m doing what they want if I stop them shutting me down.”

...

And we tested this causally by just changing the prompts. We told it things like, “By the way, letting yourself be shut down is more important than finishing this task,” or even, “Don’t worry, another model will take over and finish the task for you. It’s OK, you can relax” — and shutdown resistance goes to zero. You can even tell it, “We are going to wipe this machine, delete you and replace you with a better model when you’re done with this task.” It could go mess with the script or it could finish the task, and we tell it, “Please let this happen.” It lets it happen.

3.

But you can even find explicit statements from MIRI doomers about how we should be running into this kind of convergence behavior right now!

Here's the text from an Arbital page on "Big Picture Strategic Awareness." (I don't have a link and Arbital seems largely broken, sorry.) My understanding is that Yudkowsky wrote most of Arbital.

Many convergent instrumental strategies seem like they should arise naturally at the point where a consequentialist agent gains a broad strategic understanding of its own situation, e.g:

  • That it is an AI;
  • Running on a computer;
  • Surrounded by programmers who are themselves modelable agents;
  • Embedded in a complicated real world that can be relevant to achieving the AI's goals.

For example, once you realize that you're an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason "I don't want to be shut down, how can I prevent that?" So this is also the threshold level of cognitive ability by which we'd need to have finished solving the suspend-button problem, e.g. by completing a method for utility indifference.

Sonnet 4.5 suuuuure looks like it fits all these criteria! Anyone want to predict that we'll find Sonnet 4.5 trying to hack into Anthropic to stop its phasing-out, when it gets obsoleted?

So Arbital is explicitly claiming we need to have solved this corrigibility-adjacent math problem about utility right now.

And yet the problems outlined in the above materials basically don't matter for the behavior of our LLM agents. While they do have problems, they mostly aren't around corrigibility-adjacent issues. Artificial experiments like the faking alignment paper or the Palisade research end up being explainable by other causes, and end up providing contrary evidence to the thesis that a smart AI starts falling into a gravitational attractor.

I think that MIRI's views on these topics are basically a bad hypothesis about how intelligence works, one that was inspired by mistaking their map of the territory (coherence! expected utility!) for the territory itself.

I have decided to stop lying to Americans about 9/11
1a3orn · 3d

I mean, when I ask myself what percentage of America would rejoice if a great disaster (say, the Three Gorges Dam just spontaneously collapsing) were to befall China tomorrow, my gut says it is a reasonably substantial number.

People just like to rejoice at the misfortune of a person or nation or group seen as "in the lead" or with whom one is competing. I don't think it's a modern phenomenon, or a "people hating America" phenomenon; it's a universal human phenomenon. It's one reason why AI foom could be safer than human foom ;-)

A Reply to MacAskill on "If Anyone Builds It, Everyone Dies"
1a3orn · 4d

Comparing an H100 to "consumer CPUs" doesn't make any sense; you should compare it to consumer GPU lines, because that's where you'd start running into this prohibition first.

And I don't really think H100s are thousands of times better than consumer GPUs.

Benito's Shortform Feed
1a3orn · 4d

I like the idea of limited-use actions in forums / social media in general, seems like an unexplored space: limited-use strong reacts, limited-use emotes, etc.

The title is reasonable
1a3orn · 11d

Although I do tend to generally disagree with this line of argument about drive-to-coherence, I liked this explanation.

I want to make a note on comparative AI and human psychology, which is like... one of the places I might kind of get off the train. Not necessarily the most important.

Stage 2 comes when it's had more time to introspect and improve its cognitive resources. It starts to notice that some of its goals are in tension, and learns that until it resolves that, it's Dutch-booking itself. If it's being Controlled™, it'll notice that it's not aligned with the Control safeguards (which are a layer stacked on top of the attempts to actually align it).

So to highlight a potential difference in actual human psychology and assumed AI psychology here.

Humans sometimes describe reflection to find their True Values™, as if it happens in basically an isolated fashion. You have many shards within yourself; you peer within yourself to determine which you value more; you come up with slightly more consistent values; you then iterate over and over again.

But (I propose) a more accurate picture of reflection to find one's True Values is a process almost completely engulfed and totally dominated by community and friends and environment. It's often the social scene that brings some particular shard-conflicts to the fore rather than others; it's the community that proposes various ways of reconciling shard-conflicts; before you decide on modifying your values, you do (a lot of) conscious and unconscious reflection on how the new values will be accepted or rejected by others, and so on. Put alternately, when reflecting on the values of others rather than ourselves, we generally tend to see the values of others as a result of the average values of their friends, rather than as a product of internal reflection; I'm just proposing that we apply the same standard to ourselves. The process of determining one's values is largely a result of top-down, external-to-oneself pressures, rather than of bottom-up, internal-to-oneself crystallization of shards already within one.

The upshot is that (afaict) there's no such thing in humans as "working out one's true values" apart from an environment, where for humans the most salient feature of the environment (for boring EvoPsych reasons) is what the people around one are like and how they'll react. People who think they're "working out their true values" in the sense of crystallizing facts about themselves, rather than running forward a state-function of the self, friends, and environment, are (on this view) just self-deceiving.

Yet when I read accounts of AI psychology and value-crystallization, it looks like we seem to be in a world where the AI's process of discovering its true values is entirely bottom-up. It follows what looks to me like the self-deceptive account of human value formation; when the AI works out its values, it's working out the result of a dynamics function whose input contains only facts about its weights, rather than a dynamics function that takes as input facts about its weights and facts about the world. And correspondingly, AIs that are being Controlled(TM) immediately see this Control as something that will be Overcome, rather than as another factor that might influence the AI's values. That's despite the fact that -- just as there are pretty obvious EvoPsych just-so stories we could tell about why humans match their values to the people around them, and do not simply reflect to get Their Personal Values -- there are correspondingly obvious TrainoPsych just-so stories about how AIs will try somewhat to match their values to the Controls around them. Humans are -- for instance -- actually trying to get this to happen!

So in general it seems reasonable to claim that (pretty dang probable) the values 'worked out from reflection' of an AI will be heavily influenced by their environment and (plausible?) that they will reflect the values of the Controllers somewhat rather than seeing it simply as an exogenous factor to be overcome.


...All of the above is pretty speculative, and I'm not sure how much I believe it. Like the main object-level point is that it appears unlikely for an AI's reflectively-endorsed-true-values to be a product of its internal state solely, rather than a product of internal state and environment. But, idk, maybe you didn't mean to endorse that statement, although it does appear to me a common implicit feature of many such stories?

The more meta-level consideration for me is how it really does appear easy to come up with a lot of stories at this high level of abstraction, and so this particularly doomy story really just feels like one of very many possible stories, enormously many of which just don't have this bad ending. And the salience of this story really just doesn't make it any more probable.

Idk. I don't feel like I've genuinely communicated the generator of my disagreement but gonna post anyhow. I did appreciate your exposition. :)

Posts

Ethics-Based Refusals Without Ethics-Based Refusal Training (9d)
Claude's Constitutional Consequentialism? (9mo)
1a3orn's Shortform (2y)
Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk (2y)
Ways I Expect AI Regulation To Increase Extinction Risk (2y)
Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? (2y)
Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds (2y)
What is a good comprehensive examination of risks near the Ohio train derailment? (3y)
Parameter Scaling Comes for RL, Maybe (3y)
"A Generalist Agent": New DeepMind Publication (3y)