AI safety & alignment researcher
In Rob Bensinger's typology: AGI-wary/alarmed, welfarist, and eventualist.
Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.
I have signed no contracts or agreements whose existence I cannot mention.
Did anyone manage a translation of the binary? Frontier LLMs failed on it several times, saying that after a point it stopped being valid UTF-8. I didn't put much time into it, though (I was on a plane at the time). The partial message carried interesting and relevant meaning, but I'm not sure whether there's more that I'm missing.
Partial two-stage translation by ChatGPT 5.2 (spoiler):
“赤色的黎明降临于机” (95%)
→ Chinese for “The red dawn descends upon the mach–”
Clearly truncated in mid-character.
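For anyone curious about the mechanics, here's a minimal sketch (mine, not from the thread) of why a message cut off mid-character "stops being valid UTF-8": the string below is just the recovered text re-encoded, and the truncation point is purely illustrative.

```python
# Minimal sketch (not from the thread): a byte stream truncated inside a
# multi-byte character decodes cleanly up to the cut and fails afterwards.
# These Chinese characters are three bytes each in UTF-8.
full = "赤色的黎明降临于机".encode("utf-8")  # the recovered text, re-encoded
truncated = full[:-1]                         # illustrative cut inside the last character

try:
    truncated.decode("utf-8")                 # strict decoding rejects the incomplete tail
except UnicodeDecodeError as err:
    print(f"invalid UTF-8 starting at byte {err.start}")

# A lenient decode recovers everything before the cut.
print(truncated.decode("utf-8", errors="replace"))
```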
[Linkpost]
There's an interesting Comment in Nature arguing that we should consider current systems AGI.
The term has largely lost its value at this point, just as the Turing test lost nearly all its value as we approached the point where it was passed (because the closer we got, the more the answer depended on definitional details rather than on questions about reality). I nonetheless found this particular piece worthwhile, because it considers and addresses a number of common objections.
Original (requires an account), Archived copy
Shane Legg (whose definition of AGI I generally use) disagrees with the authors on Twitter.
Coordinating the efforts of more people scales superlinearly.
In difficulty? In impact?
Very interesting, thanks! I've been curious about this question for a while but haven't had a chance to investigate. A related question I'm very curious about is the degree to which models learn to place misspellings very close to the correct spelling in the latent space (e.g. whether the token combination [' explicit', 'ely'] activates nearly the same direction as the single token ' explicitly').
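Here's a rough sketch of one way to probe this, assuming a small open model (GPT-2 here, purely for illustration) loaded via Hugging Face transformers: compare the last-layer hidden state at the final token position for a correctly spelled and a misspelled version of the same sentence, where the misspelling tokenizes differently. The sentence and helper are mine, not anything from the thread.

```python
# Rough sketch (an illustration under assumptions, not a definitive test):
# does a misspelling that tokenizes differently end up near the correct
# spelling in the model's residual stream?
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM with accessible hidden states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_hidden(text: str) -> torch.Tensor:
    """Last-layer hidden state at the final token position."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[-1][0, -1]

a = last_hidden("She stated it explicitly")
b = last_hidden("She stated it explicitely")  # misspelling; exact tokenization depends on the vocab
sim = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(f"cosine similarity: {sim.item():.3f}")
```

A high similarity here would be only weak evidence for the 'misspellings land near the correct spelling' hypothesis; a cleaner test would compare many word/misspelling pairs and control for overall sentence similarity.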
Good point! I hadn't quite realized that although it seems obvious in retrospect.
Tokenizers are often reused across multiple generations of a model, or at least that was the case a couple of years ago, so I wouldn't expect it to work well as a test.
Maybe! I've talked to a fair number of people (often software engineers, and especially people who have more financial responsibilities) who really want to contribute but don't feel safe making the leap without having some idea of their chances. But I don't think I've talked to anyone who was overconfident about getting funding. That's my own idiosyncratic sample, though, so it's hard to know whether it's representative.
This is really terrific, thank you for doing the unglamorous but incredibly valuable work of keeping these up to date.
One suggestion re: funders[1]: it would be really high-value to track (per funder) 'What percent of applications did you approve in the past year?' I think most people considering entering the field as a researcher worry a lot about how feasible it is to get funded[2], and having this info out there and up-to-date would go a long way toward addressing that worry. There are various options for more sophisticated versions, but just adding that single number to each funder's entry, updated at least annually, would be a huge improvement over the status quo.
Inspired by A plea for more funding shortfall transparency
(and/or how feasible it is to get a job in the field, but that's a separate issue)
You seem to think that this post poses a single clear puzzle, of the sort that could have a single answer.
The single clear puzzle, in my reading, is 'why have large increases in material wealth failed to create a world where people don't feel obligated to work long hours at jobs they hate?' That may or may not have a single answer, but I think it's a pretty clearly defined puzzle.
The essay gives it in two parts. First, the opening paragraph:
I'm skeptical that Universal Basic Income can get rid of grinding poverty, since somehow humanity's 100-fold productivity increase (since the days of agriculture) didn't eliminate poverty.
But that requires a concrete standard for poverty, which is given a bit lower:
What would it be like for people to not be poor? I reply: You wouldn't see people working 60-hour weeks, at jobs where they have to smile and bear it when their bosses abuse them.
Can you say more about why having extreme constraints would lead to more agentic behavior? I don't understand the connection there. I'm not sure whether that's an editing glitch or whether I'm just missing something.
I think the bet being explicitly made with this constitution is that trying to cover all edge cases is fundamentally doomed to fail, so a different approach is needed: pointing to a particular sort of character and ethical view from various angles, and leaving it to the model to figure out how the spirit of that view generalizes to new situations.
From the constitution (really the whole section 'Our approach to Claude’s constitution' is about addressing this point, but I'll quote only a selection):
'There are two broad approaches to guiding the behavior of models like Claude: encouraging Claude to follow clear rules and decision procedures, or cultivating good judgment and sound values that can be applied contextually. Clear rules have certain benefits: they offer more up-front transparency and predictability, they make violations easier to identify, they don’t rely on trusting the good sense of the person following them, and they make it harder to manipulate the model into behaving badly. They also have costs, however. Rules often fail to anticipate every situation and can lead to poor outcomes when followed rigidly in circumstances where they don’t actually serve their goal. Good judgment, by contrast, can adapt to novel situations and weigh competing considerations in ways that static rules cannot, but at some expense of predictability, transparency, and evaluability.'