LESSWRONG
LW

All of Alex Mallen's Comments + Replies

What goals will AIs have? A list of hypotheses

I think this does a great job of reviewing the considerations regarding what goals would be incentivized by SGD by default, but I think that in order to make predictions about which goals will end up being relevant in future AIs, we have to account for the outer loop of researchers studying model generalization and changing their training processes.

For example, reward hacking seems very likely by default from RL, but it is also relatively easy to notice in many forms and AI projects will be incentivized to correct it. On the other hand, ICGs might be harder to notice and have fewer incentives for correcting.

3Daniel Kokotajlo1mo

Yeah, I agree, I think that's out of scope for this doc basically. This doc is trying to figure out what the "default" outcome is, but then we have to imagine that human alignment teams are running various tests and might notice that this is happening and then course-correct. But whether and how that happens, and what the final outcome of that process is, is something easier thought about once we have a sense of what the default outcome is. EDIT: After talking to my colleague Eli it seems this was oversimplifying. Maybe this is the methodology we should follow, but in practice the original post is kinda asking about the outer loop thing.

What goals will AIs have? A list of hypotheses

Alex Mallen1mo50

Scaling Laws for Reward Model Overoptimization is relevant empirical work for understanding how optimality and pretraining priors weigh against each other as a function of variables like reward quality, amount of optimization against reward, and policy size.

I tentatively think these results suggest that if >90% of training compute is used for outcome-based RL, then most errors in the reward would be exploited. This pushes in favor of hypotheses 4 and 5 (reward-seeking and reward-seeking by proxy), though perhaps people will notice and fix these reward-hacking issues.

The Case Against AI Control Research

Alex Mallen3mo1814

A genre of win-conditions that I think this post doesn't put enough weight on:

The AIs somewhere in-between TAI and superintelligence (AIs that are controllable and/or not egregiously misaligned) generate convincing evidence of risks from racing to superintelligence. If people actually believed these risks, they would act more cautiously, modulo "I'll do it safer than they will". People don't currently believe in the risks much and IMO this is pretty counterfactual to the amount they're currently mitigating risks.

AI Control helps a bit here by allowing us to get more e.g. model organisms work done at the relevant time (though my point pushes more strongly in favor of other directions of work).

6Buck3mo

Note also that it will probably be easier to act cautiously if you don't have to be constantly in negotiations with an escaped scheming AI that is currently working on becoming more powerful, perhaps attacking you with bioweapons, etc!

The Case Against AI Control Research

Alex Mallen3mo2418

I think it's useful and would be surprised if discussions like this post weren't causing people to improve their models and change their behaviors.

How to prevent collusion when using untrusted models to monitor each other

Alex Mallen4mo52

Really great list!

It might be that sampling from the model sometimes samples an alignment faking persona and sometimes doesn't. Then, sampling several times to get an ensemble is helpful.

This could be generalized to any kind of inconsistency that leads to the model not always choosing to lie. E.g. The model makes reasoning errors some fraction of the time, has imperfect situational awareness, or picks inconsistent collusion signals.

You can also resample during red-teaming to increase the effective length of the upfront red-teaming-of-monitor period, forcing the model to have more consistent goals, reasoning, situational awareness, and Schelling points to evade being caught.

5Nathan Helm-Burger4mo

Also, using noise injections into the activations and examining the pattern of results. Models that are scheming to sandbag will sometimes fail at sandbagging when given a noise injection, and thus perform unexpectedly well at a task they had been consistently failing. Similarly, a scheming model might slip up and give an honest answer. You wouldn't know which was the honest answer, but you could note the discrepancy and look for patterns.

AI Control: Improving Safety Despite Intentional Subversion

Alex Mallen5mo*42

The general concern here is that your human auditors might not be perfect. I think in practice when making a safety case for a deployment, it would make sense to run a red-team/blue-team game with a human/AI red-team and your real auditors to see if your auditors are reliable.

I'm not sure how the proposed steganographic attack would work though. How do you ensure that your backdoored code will hash to XYZ if you can't see the backdoor opportunity until after you've decided on XYZ?

AI Control: Improving Safety Despite Intentional Subversion

Alex Mallen5mo43

(Not an author, but Buck suggested I look at this)

I think there's a few reasons why it's not clear from these results whether the steganography results introduced would reduce safety in an untrusted monitoring control protocol like that in Greenblatt et al.:

- The untrusted model needs to reliably collude in order to defeat the blue-team because if it messes up once during the upfront evaluation period (30 problems), it loses. Having a 90% chance of communicating correctly to the monitor on each problem gives a 0.9^30 = 4% chance of passing the upfron... (read more)

4Dakara5mo

Thank you for both of your responses, they address all of my worries. Good luck on all of your future projects!

Anthropic: Three Sketches of ASL-4 Safety Case Components

Alex Mallen5mo60

I think that if they:
- Find a feature that is necessary for defection in all of the cases of defection in a diverse and realistic set of model organisms
- Locate that feature with very little optimization on the defection cases
I'd be fairly confident (>~80%) that it would catch any particular deployment-time defection attempt from an ASL-4 schemer. This would cut risk from the deployment by >5x when paired with compelling incrimination and a shutdown. This is a substantial risk reduction, and I might even buy this safety case if there weren't better o... (read more)

Anthropic: Three Sketches of ASL-4 Safety Case Components

Alex Mallen5mo*Ω692

This is really exciting to see not only as a way to understand better what people at Anthropic are thinking about how to make ASL-4 safety cases but also as a survey of how various safety agendas might play a role in practice!

Here are some thoughts about the details:

Given that the mechinterp safety case involves running a kind of control evaluation, other monitoring methods can largely be substituted into the eval. This could enable cheap testing of other monitors e.g. based on supervised probing or mechanistic anomaly detection.
I like the idea to check

... (read more)

Claude 3.5 Sonnet

Alex Mallen10mo195

Their addendum contains measurements on refusals and harmlessness, though these aren't that meaningful and weren't advertised.

Ilya Sutskever and Jan Leike resign from OpenAI [updated]

Alex Mallen1y10

You seem to be talking primarily about Ilya, but it seems especially unclear why Jan is also leaving now.

-7O O1y

Essay competition on the Automation of Wisdom and Philosophy — $25k in prizes

Alex Mallen1y65

I think this is a very important and neglected area! That its tractability is low is one of its central features, but progress on it now greatly aides in progress later by making these hard-to-meaure aims easier to measure.

ChatGPT can learn indirect control

Alex Mallen1y10

The only way ChatGPT can control anything is by writing text, so figuring out that it should write the text that should appear in the image seems pretty straightforward. It only needs to rationalize why this would work.

OpenAI's Sora is an agent

Alex Mallen1y61

I think the relevant notion of "being an agent" is whether we have reason to believe it generalizes like a consequentialist (e.g. its internal cognition considers possible actions and picks among them based on expected consequences and relies minimally on the imitative prior). This is upstream of the most important failure modes as described by Roger Grosse here.

I think Sora is still in the bottom left like LLMs, as it has only been trained to predict. Without further argument or evidence I would expect that it probably for the most part hasn't learned to ... (read more)

Dreams of AI alignment: The danger of suggestive names

Alex Mallen1y30

I'll add another one to the list: "Human-level knowledge/human simulator"

Max Nadeau helped clarify some ways in which this framing introduced biases into my and others' models of ELK and scalable oversight. Knowledge is hard to define and our labels/supervision might be tamperable in ways that are not intuitively related to human difficulty.

Different measurements of human difficulty only correlate at about 0.05 to 0.3, suggesting that human difficulty might not be a very meaningful concept for AI oversight, or that our current datasets for experimenting wi... (read more)

My guess at Conjecture's vision: triggering a narrative bifurcation

Alex Mallen1y166

Adversarial mindset. Adversarial communication is to some extent necessary to communicate clearly that Conjecture pushes back against AGI orthodoxy. However, inside the company, this can create a soldier mindset and poor epistemic. With time, adversariality can also insulate the organization from mainstream attention, eventually making it ignored.

This post has a battle-of-narratives framing and uses language to support Conjecture's narrative but doesn't actually argue why Conjecture's preferred narrative puts the world in a less risky posi... (read more)

Deep Deceptiveness

Alex Mallen2y10

(Initial reactions that probably have some holes)

If we use ELK on the AI output (i.e. extensions of Burns's Discovering Latent Knowledge paper, where you feed the AI output back into the AI and look for features that the AI notices), and somehow ensure that the AI has not learned to produce output that fools its later self + ELK probe (this second step seems hard; one way you might do this is to ensure the features being picked up by the ELK probe are actually the same ones used/causally responsible for the decision in the first place), then it seems initi... (read more)

Beliefs and Disagreements about Automating Alignment Research

Alex Mallen3y70

It's a tool for interacting with language models: https://generative.ink/posts/loom-interface-to-the-multiverse/