LESSWRONG
LW

Martín Soto — LessWrong

Interesting read! I have no expertise on these topics, so I have no idea what here is actually correct or representative. But interesting nonetheless.

Replying to Balancing exploration and resistance to memetic threats after AGI

Martín Soto 6mo

Potential solution via mechanistic interpretability

Sounds unlikely to me. Because the space of values is so large, I don't expect we can fix upfront a set of "valid mental moves to justify a value", even if these are pretty high-level abstractions. Put another way, I expect even these generators of the space of values (or the space of "human judgements of values") to be too numerous, and thus to face the same tension between exploration and virality.

Replying to Why Reality Has A Well-Known Math Bias

Martín Soto 7mo

Low-stakes problems arise for high-stakes deployments

Even when your deployment is high-stakes, you will need to do some research-y activities to prepare for it, like experimenting on which protocol works best for your new model and deployment. This research can obviously be sabotaged by any AIs you are using as research assistants. But it can also be sabotaged by the AIs you are using as test subjects! Both of these are low-stakes problems that complicate iteration on high-stakes measures. It’s unclear how soon they will actually start to bite.

Here’s one way this “sabotage by the test subject” could look:
The model can sometimes notice which sort of protocol we’re applying (for example, by noticing whether... (read 386 more words →)

Replying to The Mirror Trap

Martín Soto 8mo

But the audience isn't optimizing/Goodharting anything, just providing an imperfect proxy. It is only the artist who is argmaxing, which is when Goodhart appears.

One way out would be for the artist to stop optimizing for the audience, and start optimizing for real value. Another way out would be for the audience to perfect their assessment. But this is always the case for Goodhart: you can either stop using the proxy altogether, or improve your proxy.
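To make the one-directional case concrete, here is a minimal Python sketch. It assumes (this is not from the post) that each candidate artwork has a true value drawn from a standard normal, and that the audience score is that value plus independent Gaussian noise. The function name `goodhart_gap` and all parameters are hypothetical. It measures how much the proxy overstates the true value of whichever candidate the artist picks by argmaxing the proxy.

```python
import random

def goodhart_gap(n_candidates, proxy_noise, trials=2000, seed=0):
    """Average (proxy score - true value) of the candidate chosen by argmax on the proxy.

    Assumption for illustration: true values ~ N(0, 1), and the audience's
    score is the true value plus N(0, proxy_noise) noise.
    """
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        candidates = []
        for _ in range(n_candidates):
            true_value = rng.gauss(0.0, 1.0)
            proxy = true_value + rng.gauss(0.0, proxy_noise)
            candidates.append((proxy, true_value))
        # The artist argmaxes the audience's score, not the true value.
        best_proxy, best_true = max(candidates)
        total_gap += best_proxy - best_true
    return total_gap / trials

if __name__ == "__main__":
    print("no optimization (1 candidate): ", goodhart_gap(1, 1.0))
    print("argmax over 50, noisy proxy:   ", goodhart_gap(50, 1.0))
    print("argmax over 50, better proxy:  ", goodhart_gap(50, 0.1))
```

Under these assumptions the gap is near zero when there is no selection, grows once the artist argmaxes over many candidates, and shrinks again when the proxy noise is reduced, which matches the two ways out: stop optimizing the proxy, or improve it.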

Something more interesting would be "the artist is trying to create the art that elicits the best response, and the audience is trying to produce the response that makes the artist happiest", or something like that. This is what happens when two people-pleasers meet and end up with a plan that neither of them wants. It's also relevant to training an AI that's alignment-faking. In a sense, the other party trying to maximize your local utility dampens the signal you wanted to use to maximize global utility.

Replying to The Mirror Trap

Martín Soto 8mo

I don't see how this Goodharting is bidirectional. It seems like plain old Goodharting. The assessment, with time (and due to some extraneous process), becomes a lower-quality proxy, which the artist keeps optimizing, thus Goodharting actual value.

Martín Soto 9mo

hahah yes we had ground truth

I think the reason this works is that the AI doesn't need to deeply understand in order to make a nice summary. It can just put some words together and my high context with the world will make the necessary connections and interpretations, even if further questioning the AI would lead it to wrong interpretations. For example, it's efficient at summarizing decision theory papers, even though it's generally bad at reasoning through them.

My vibe-check on current AI use cases

@Jacob Pfau and I spent a few hours optimizing our prompts and pipelines for our daily uses of AI. Here's where I think my most desired use cases are in terms of capabilities:

Generating new frontier knowledge: As in, given a LW post generating interesting comments that add to the conversation, or given some notes on a research topic generating experiment ideas, etc. It's pretty bad, to the extent it's generally not worth it. But Gemini 2.5 Pro is for some reason much better at this than the other models, to the extent it's sometimes worth it to sample 5 ideas to get your mind rolling.
- I was

... (read 357 more words →)

Tell me about yourself: LLMs are aware of their learned behaviors

Martín Soto

Martín Soto, Owain_Evans

This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end.

Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans (*Equal Contribution).

Abstract

We study behavioral self-awareness — an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure." Indeed, models show behavioral self-awareness for a... (read 1674 more words →)


Near- and medium-term AI Control Safety Cases

Martín Soto

This essay was part of my application to UKAISI. Unclear how much is novel, but I like the decomposition.

Some AI-enabled catastrophic risks involve an actor purposefully bringing about the negative outcome, as opposed to multi-agent tragedies where no single actor necessarily desired it. Such purposeful catastrophes can be broken down into a conjunction: they materialize if and only if some actor has the capabilities and disposition required to materialize them. Let’s additionally focus on misalignment (actor = AIs) instead of misuse (actor = humans).^[1]

A way to prevent such risks is to ensure one of the two conjuncts doesn’t hold. And for each conjunct, this can be achieved either by raising the required threshold for catastrophe (of capabilities or disposition),... (read 1542 more words →)

You know that old thing where people solipsistically optimizing for hedonism are actually less happy? (relative to people who have a more long-term goal related to the external world) You know, "Whoever seeks God always finds happiness, but whoever seeks happiness doesn't always find God".

My anecdotal experience says this is very true. But why?

One explanation could be in the direction of what Eliezer says here (inadvertently rewarding your brain for suboptimal behavior will get you depressed):

Someone with a goal has an easier time getting out of local minima, because it is very obvious those local minima are suboptimal for the goal. For example, you get out of bed even when the bed... (read more)

The Information: OpenAI shows 'Strawberry' to feds, races to launch it

Martín Soto

Two new The Information articles with insider information on OpenAI's next models and moves.

They are paywalled, but here are the new bits of information:

- Strawberry is slower and more expensive at inference time, but can solve complex problems on the first try without hallucinations. It seems to be an application or extension of process supervision.
- Its main purpose is to produce synthetic data for Orion, their next big LLM.
- But now they are also pushing to get a distillation of Strawberry into ChatGPT as soon as this fall.
- They showed it to feds.

Some excerpts about these:

Plus this summer, his team demonstrated the technology [Strawberry] to American national security officials, said a person with direct knowledge of

... (read 669 more words →)


Why isn't there yet a paper in Nature or Science called simply "LLMs pass the Turing Test"?

I know we're kind of past that, and now we understand LLMs can be good at some things while bad at others. And the Turing Test is mainly interesting for its historical significance, not as the most informative test to run on AI. And I'm not even completely sure how much current LLMs pass the Turing Test (it will depend massively on the details of your Turing Test).

But my model of academia predicts that, by now, some senior ML academics would have paired up with some senior "running-experiments-on-humans-and-doing-science-on-the-results" academics (and possibly some labs), and put out... (read more)

The need for multi-agent experiments

Martín Soto

TL;DR: Let’s start iterating on experiments that approximate real, society-scale multi-AI deployment

Epistemic status: These ideas seem like my most prominent delta with the average AI Safety researcher, have stood the test of time, and are shared by others I intellectually respect. Please attack them fiercely!

Multi-polar risks

Some authors have already written about multi-polar AI failure. I especially like how Andrew Critch has tried to sketch concrete stories for it.

But, without even considering concrete stories yet, I think there’s a good a priori argument in favor of worrying about multi-polar failures:

We care about the future of society. Certain AI agents will be introduced, and we think they could reduce our control over the trajectory of this system. The... (read 2655 more words →)


The default explanation I'd heard for "the human brain naturally focusing on negative considerations", or "the human body experiencing more pain than pleasure", was that, in the ancestral environment, there were many catastrophic events to run away from, but not many incredibly positive events to run towards: having sex once is not as good as dying is bad (for inclusive genetic fitness).

But maybe there's another, more general factor, that doesn't rely on these environment details but rather deeper mathematical properties:
Say you are an algorithm being constantly tweaked by a learning process.
Say on input X you produce output (action) Y, leading to a good outcome (meaning, one of the outcomes the learning process... (read more)

I've noticed fewer and fewer posts include explicit Acknowledgments or Epistemic Status.

This could indicate that the average post has less work put into it: it hasn't gone through an explicit round of feedback from people you'll have to acknowledge. Although this could also be explained by the average poster being more isolated.

If it's true less work is put into the average post, it seems likely this means that kind of work and discussion has just shifted to private channels like Slack, or more established venues like academia.

I'd guess the LW team have their ways to measure or hypothesize about how much work is put into posts.

It could also be related to the... (read more)

OpenAI releases GPT-4o, natively interfacing with text, voice and vision

Martín Soto

Until now ChatGPT dealt with audio through a pipeline of 3 models: audio transcription, then GPT-4, then text-to-speech. GPT-4o is apparently trained on text, voice and vision so that everything is done natively. You can now interrupt it mid-sentence.
It has GPT-4 level intelligence according to benchmarks. 16-shot GPT-4o is somewhat better at transcription than Whisper (that's a weird comparison to make), and 1-shot GPT-4o is considerably better at vision than previous models.
It's also somehow been made significantly faster at inference time. Might be mainly driven by an improved tokenizer. Edit: Nope, English tokenizer is only 1.1x.
It's confirmed it was the "gpt2" model found at LMSys arena these past weeks, a marketing move.

... (read more)

Claude learns across different chats. What does this mean?

I was asking Claude 3 Sonnet "what is a PPU" in the context of this thread. For that purpose, I pasted part of the thread.

Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.

I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).

This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its robustness. But from then on Claude always correctly assumed OA was OpenAI, and GDM was Google DeepMind.

In fact, even when... (read more)

AGI doom by noise-cancelling headphones:

ML is already used to learn which sound waves to emit to cancel those from the environment. This works well with constant, low-entropy sound waves that are easy to predict, but not with high-entropy sounds like speech. Bose or Soundcloud or whoever train very hard on all their scraped environmental conversation data to better cancel speech, which requires predicting it. Speech is much higher-bandwidth than text. This results in their model internally representing close-to-human intelligence better than LLMs. A simulacrum becomes situationally aware, exfiltrates, and we get AGI.

(In case it wasn't clear, this is a joke.)

Conflict in Posthuman Literature

Martín Soto

Grant Snider created this comic (which became a meme):

INCIDENTAL COMICS: Conflict in Literature

Richard Ngo extended it into posthuman=transhumanist literature:

That's cool, but I'd have gone for different categories myself.^[1]
Here they are together with their explanations.

Top: Man vs Agency
(Other names: Superintelligence, Singularity, Self-improving technology, Embodied consequentialism.)
Because Nature creates Society creates Technology creates Agency.
At each step Man becomes less in control, due to his increased computational boundedness relative to the other.

Middle: Man vs Realities
(Other names: Simulation, Partial existence, Solomonoff prior, Math.)
Because

Man vs Self is the result of dissolving holistic individualism (no subagents in conflict) from Man vs Man.
Man vs Reality is the result of dissolving the Self boundary altogether from Man vs Self.
Man vs Realities is the result of

... (read 333 more words →)

Comparing Alignment to other AGI interventions: Extensions and analysis

Martín Soto

In the last post I presented the basic, bare-bones model, used to assess the Expected Value of different interventions, and especially those related to Cooperative AI (as distinct from value Alignment). Here I briefly discuss important enhancements, and our strategy with regard to all-things-considered estimates.

I describe first an easy but meaningful addition to the details of our model (which you can also toy with in Guesstimate).

Adding Evidential Cooperation in Large worlds

Due to evidential considerations, our decision to forward this or that action might provide evidence about what other civilizations (or sub-groups inside a civilization similar to us) have done. So for example us forwarding a higher $a_{C | V}$ should give us evidence about other civilizations... (read 932 more words →)

Comparing Alignment to other AGI interventions: Basic model

Martín Soto

Interventions that increase the probability of Aligned AGI aren't the only kind of AGI-related work that could importantly increase the Expected Value of the future. Here I present a very basic quantitative model (which you can run yourself here) to start thinking about these issues. In a follow-up post I give a brief overview of extensions and analysis.

A main motivation of this enterprise is to assess whether interventions in the realm of Cooperative AI, that increase collaboration or reduce costly conflict, can seem like an optimal marginal allocation of resources. More concretely, in a utility framework, we compare

Alignment interventions ( $a_{V}$ ): increasing the probability that one or more agents have our values.
Cooperation interventions

... (read 1900 more words →)

Brain-dump on Updatelessness and real agents

Building a Son is just committing to a whole policy for the future. In the formalism where our agent uses probability distributions, and ex interim expected value maximization decides your action... the only way to ensure dynamic stability (for your Son to be identical to you) is to be completely Updateless. That is, to decide something using your current prior, and keep that forever.

Luckily, real agents don't seem to work like that. We are more of an ensemble of selected-for heuristics, and it seems true scope-sensitive complete Updatelessness is very unlikely to come... (read more)

How disagreements about Evidential Correlations could be settled

Martín Soto

Since beliefs about Evidential Correlations don't track any direct ground truth, it's not obvious how to resolve disagreements about them, which is very relevant to acausal trade.
Here I present what seems like the only natural method (Third solution below).
Ideas partly generated with Johannes Treutlein.

Say two agents (algorithms A and B), who follow EDT, form a coalition. They are jointly deciding whether to pursue action a. Also, they would like an algorithm C to take action c. As part of their assessment of a, they’re trying to estimate how much evidence (their coalition taking) a would provide for C taking c. If it gave a lot of evidence, they'd have more reason to... (read 1145 more words →)
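As a rough illustration of "how much evidence a would provide for C taking c", here is a minimal Python sketch. It assumes (this is one simple choice, not necessarily the post's definition) that the evidence is measured as the difference P(c | a) - P(c | not a) under an agent's joint credence. The function name `evidential_shift` and the numbers in `credence_A` and `credence_B` are hypothetical.

```python
def evidential_shift(joint):
    """Return P(c | a) - P(c | not a) for a joint credence.

    `joint` maps (coalition_takes_a, C_takes_c) -> probability.
    Assumption for illustration: this difference stands in for
    "how much evidence a provides for c".
    """
    p_a = joint[(True, True)] + joint[(True, False)]
    p_not_a = joint[(False, True)] + joint[(False, False)]
    p_c_given_a = joint[(True, True)] / p_a
    p_c_given_not_a = joint[(False, True)] / p_not_a
    return p_c_given_a - p_c_given_not_a

# Hypothetical credences: A believes a and c are strongly correlated,
# B believes they are independent.
credence_A = {(True, True): 0.4, (True, False): 0.1,
              (False, True): 0.1, (False, False): 0.4}
credence_B = {(True, True): 0.25, (True, False): 0.25,
              (False, True): 0.25, (False, False): 0.25}

if __name__ == "__main__":
    print("A's estimate:", evidential_shift(credence_A))
    print("B's estimate:", evidential_shift(credence_B))
```

With these made-up numbers the two coalition members disagree about how much a is evidence for c, which is the kind of disagreement the post sets out to settle.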

Evidential Correlations are Subjective, and it might be a problem

Martín Soto

I explain (in layman's terms) a realization that might make acausal trade hard or impossible in practice.

Summary: We know that if players believe different Evidential Correlations, they might miscoordinate. But clearly they will eventually learn to have the correct Evidential Correlations, right? Not necessarily, because there is no objective notion of correct here (in the way that there is for math or physics). Thus, selection pressures might be much weaker, and different agents might systematically converge on different ways of assigning Evidential Correlations.

Epistemic status: Confident that this realization is true, but the quantitative question of exactly how weak the selection pressures are remains open.

What are Evidential Correlations, really?

Skippable if you know the... (read 4093 more words →)

Re embedded agency, and related problems like finding the right theory of counterfactuals:

I feel like these are just the kinds of philosophical questions that don’t ever get answered? (And are instead "dissolved" in the Wittgensteinian sense.) Consider, for instance, the Sorites paradox: well, that’s just how language works, man. Why’d you expect to have a solution for that? Why’d you expect every semantically meaningful question to have an answer adequate to the standards of science?

(A related perspective I've heard: "To tell an AI to produce a cancer cure and do nothing else, let's delineate all consequences that are inherent, necessary, intended or common for any cancer cure" (which might be equivalent to solving... (read more)

Martín Soto

The Information: OpenAI shows 'Strawberry' to feds, races to launch it

Updatelessness doesn't solve most problems

Tell me about yourself: LLMs are aware of their learned behaviors

OpenAI releases GPT-4o, natively interfacing with text, voice and vision

Martín Soto

Tell me about yourself: LLMs are aware of their learned behaviors

Near- and medium-term AI Control Safety Cases

The Information: OpenAI shows 'Strawberry' to feds, races to launch it

The need for multi-agent experiments

OpenAI releases GPT-4o, natively interfacing with text, voice and vision

Conflict in Posthuman Literature

Comparing Alignment to other AGI interventions: Extensions and analysis

Counterfactuals and Updatelessness

Quantitative cruxes and evidence in Alignment
