eggsyntax

AI safety & alignment researcher

Some interesting thoughts on (in)efficient markets from Byrne Hobart, worth considering in the context of Inadequate Equilibria.

When a market anomaly shows up, the worst possible question to ask is "what's the fastest way for me to exploit this?" Instead, the first thing to do is to steelman it as aggressively as possible, and try to find any way you can to rationalize that such an anomaly would exist. Do stocks rise on Mondays? Well, maybe that means savvy investors have learned through long experience that it's a good idea to take off risk before the weekend, and even if this approach loses money on average, maybe the one or two Mondays a decade where the market plummets at the open make it a winning strategy because the savvy hedgers are better-positioned to make the right trades within that set.[1] Sometimes, a perceived inefficiency is just measurement error: heavily-shorted stocks reliably underperform the market—until you account for borrow costs (and especially if you account for the fact that if you're shorting them, there's a good chance that your shorts will all rally on the same day your longs are underperforming). There's even meta-efficiency at work in otherwise ridiculous things like gambling on 0DTE options or flipping meme stocks: converting money into fun is a legitimate economic activity, though there are prudent guardrails on it just in case someone finds that getting a steady amount of fun requires burning an excessive number of dollars.

These all flex the notion of efficiency a bit, but it's important to enumerate them because they illustrate something annoying about the question of market efficiency: the more precisely you specify the definition, and the more carefully you enumerate all of the rational explanations for seemingly irrational activities, the more you're describing a model of reality so complicated that it's impossible to say whether it's 50% or 90% or 1-ε efficient.

eggsyntax

Strong upvote (both as object-level support and for setting a valuable precedent) for doing the quite difficult thing of saying "You should see me as less expert in some important areas than you currently do." 

I agree with Daniel here but would add one thing:

what we care about is which one they wear in high-stakes situations where e.g. they have tons of power and autonomy and no one is able to check what they are doing or stop them. (You can perhaps think of this one as the "innermost mask")

I think there are also valuable questions to be asked about attractors in persona space -- what personas does an LLM gravitate to across a wide range of scenarios, and what sorts of personas does it always or never adopt? I'm not aware of much existing research in this direction, but it seems valuable. If, for example, we could demonstrate certain important bounds ('This LLM will never adopt a mass-murderer persona'), there's potential alignment value there IMO.
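A minimal sketch of how one might start measuring this empirically: sample responses across many scenarios and tally which personas show up. Here `generate` and `classify_persona` are hypothetical stand-ins for whatever model API and judge you'd actually use; nothing below comes from existing work.

```python
# Hypothetical sketch: estimating which personas a model gravitates toward
# across many scenarios. `generate` and `classify_persona` are placeholders
# for whatever model API and judge model you actually use.
from collections import Counter
from typing import Callable

def persona_distribution(
    scenarios: list[str],
    generate: Callable[[str], str],          # scenario prompt -> model response
    classify_persona: Callable[[str], str],  # response -> persona label (e.g. via a judge)
    samples_per_scenario: int = 10,
) -> Counter:
    """Empirical distribution over persona labels across scenarios."""
    counts: Counter = Counter()
    for scenario in scenarios:
        for _ in range(samples_per_scenario):
            response = generate(scenario)
            counts[classify_persona(response)] += 1
    return counts

# A bound like "never adopts persona X" would then correspond to
# counts["X"] == 0 across a very wide, adversarially chosen scenario set --
# though absence under sampling is of course much weaker than a guarantee.
```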

eggsyntax

...soon the AI rose and the man died[1]. He went to Heaven. He finally got his chance to discuss this whole situation with God, at which point he exclaimed, "I had faith in you but you didn't save me, you let me die. I don't understand why!"

God replied, "I sent you non-agentic LLMs and legible chain of thought, what more did you want?"

  1. ^

and the tokens/activations are all still very local because you're still early in the forward pass

I don't understand why this would necessarily be true, since attention heads have access to values for all previous token positions. Certainly, there's been less computation at each token position in early layers, so I could imagine there being less value to retrieving information from earlier tokens. But on the other hand, I could imagine it sometimes being quite valuable in early layers just to know what tokens had come before.
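To illustrate the point, here's a toy sketch of standard causal self-attention (my own illustration, not code from the thread): even a first-layer head can place nonzero attention weight on every earlier position, so information about all prior tokens is available from the very start of the forward pass.

```python
# Toy sketch of standard decoder-only self-attention: even at the first layer,
# each position's attention can read values from *all* earlier positions;
# the causal mask only blocks future ones.
import torch

seq_len, d_head = 6, 8
q = torch.randn(seq_len, d_head)
k = torch.randn(seq_len, d_head)
v = torch.randn(seq_len, d_head)

scores = q @ k.T / d_head**0.5                              # (seq, seq)
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)                        # row i mixes positions 0..i
out = attn @ v

print(attn[-1])  # the last position has nonzero weight available on every prior token
```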

For me as an outsider, it still looks like the AI safety movement is only about "how do we prevent AI from killing us?". I know it's an oversimplification, but that's how, I believe, many who don't really know about AI perceive it.

I don't think it's that much of an oversimplification, at least for a lot of AIS folks. Certainly that's a decent summary of my central view. There are other things I care about -- eg not locking in totalitarianism -- but they're pretty secondary to 'how do we prevent AI from killing us?'. For a while there was an effort in some quarters to rebrand as AINotKillEveryoneism, which I think does a nice job of centering the core issue.

It may, as you say, be unsexy, but it's still the thing I care about; I strongly prefer to live, and I strongly prefer for everyone's children and grandchildren to get to live as well.

We create a small dataset of chat and agentic settings from publicly available benchmarks and datasets.

I believe there are some larger datasets of relatively recent real chat evaluations, eg the LMSYS dataset was most recently updated in July (I'm assuming but haven't verified that the update added more recent chats).
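For concreteness, a sketch of pulling such data via the Hugging Face `datasets` library, assuming the gated `lmsys/lmsys-chat-1m` release; whether that's the exact dataset/update referred to above is my assumption.

```python
# Sketch only: pulling real chat conversations for evaluation data.
# Assumes the Hugging Face `datasets` library and the gated LMSYS release
# "lmsys/lmsys-chat-1m" (requires accepting its terms and being logged in
# via `huggingface-cli login`).
from datasets import load_dataset

chats = load_dataset("lmsys/lmsys-chat-1m", split="train", streaming=True)
for i, record in enumerate(chats):
    print(record["conversation"][:1])  # first turn of each conversation
    if i >= 2:
        break
```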

Can you clarify what you mean by 'neural analog' / 'single neural analog'? Is that meant as another term for what the post calls 'simple correspondences'?

Even if all the safety-relevant properties have them, there's no reason to believe (at least for now) that we have the interp tools to find them in time i.e., before having systems fully capable of pulling off a deception plan.

Agreed. I'm hopeful that perhaps mech interp will continue to improve and be automated fast enough for that to work, but I'm skeptical that that'll happen. Or alternately I'm hopeful that we turn out to be in an easy-mode world where there is something like a single 'deception' direction that we can monitor, and that'll at least buy us significant time before it stops working on more sophisticated systems (plausibly due to optimization pressure / selection pressure if nothing else).
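As a rough illustration of what monitoring a single 'deception' direction might look like in that easy-mode world, here's a hedged sketch using a linear probe on residual-stream activations; the activations, labels, and threshold below are all placeholders rather than real data or a real method from the thread.

```python
# Hedged sketch of the "single deception direction" hope: fit a linear probe
# on residual-stream activations from labeled honest vs. deceptive outputs,
# then monitor new activations by projecting onto that direction.
# Activations here are random placeholders; labels and threshold are arbitrary.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512

honest_acts = rng.normal(size=(200, d_model))
deceptive_acts = rng.normal(size=(200, d_model)) + 0.5  # pretend the classes separate

X = np.vstack([honest_acts, deceptive_acts])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
deception_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

def flag_deception(activation: np.ndarray, threshold: float = 0.0) -> bool:
    """Monitor: does this activation project strongly onto the probe direction?"""
    return float(activation @ deception_direction) > threshold
```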

I'm also worried that claims such as "we can make important forward progress on particular intentional states even in the absence of such a general account." could further lead to a slippery slope that more or less embraces having the dangerous thing first without sufficient precautions

I agree that that's a real risk; it makes me think of Andreessen Horowitz and others claiming in an open letter that interpretability had basically been solved and so AI regulation wasn't necessary. On the other hand, it seems better to state our best understanding plainly, even if others will slippery-slope it, than to take the epistemic hit of shifting our language in the other direction to compensate.

i think premise 1 is big if true, but I think I doubt that it is as easy as this: see the deepmind fact-finding sequence for some counter-evidence.

I haven't read that sequence; I'll check it out, thanks. I'm thinking of work like the ROME paper from David Bau's lab, which suggests that fact storage can be identified and edited, and various papers like this one from Mor Geva+ that find evidence that the MLP layers in LLMs are largely key-value stores.
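For readers unfamiliar with that key-value framing, here's a toy sketch (my own illustration, assuming a standard two-layer MLP with a pointwise nonlinearity): the rows of the input matrix act as keys matched against the residual stream, and the columns of the output matrix are the values written back when a key fires.

```python
# Toy sketch of the "MLP as key-value memory" reading: each hidden unit's
# input weights act as a key matched against the residual stream, and its
# output weights are the value vector written back when the key fires.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 16, 64

W_in = rng.normal(size=(d_mlp, d_model))    # rows = keys
W_out = rng.normal(size=(d_model, d_mlp))   # columns = values
x = rng.normal(size=d_model)                # residual-stream input

key_match = np.maximum(W_in @ x, 0.0)       # ReLU "memory coefficients"
mlp_out = W_out @ key_match                 # weighted sum of value vectors

top_keys = np.argsort(key_match)[-3:][::-1]
print("most activated keys:", top_keys)     # candidate 'facts' this input retrieves
```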

relatedly, your second bullet point assumes that you can identify the 'fact' related to what the model is currently outputting unambiguously, and look it up in the model; does this require you to find all the fact representations in advance, or is this computed on-the-fly?

It does seem like a naive approach would require pre-identifying all facts you wanted to track. On the other hand, I can imagine an approach like analyzing the output for factual claims and then searching for those in the record of activations during the output. Not sure, seems very TBD.
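A loose sketch of that on-the-fly variant, with `extract_claims` and `embed` as purely hypothetical stand-ins for a claim extractor and an encoder into a shared comparison space (neither is a real, existing tool here):

```python
# Loose, hypothetical sketch of the on-the-fly approach described above:
# extract factual claims from the output, then search the cached per-token
# activations for positions that look like each claim.
from typing import Callable
import numpy as np

def locate_claims_in_activations(
    output_text: str,
    cached_activations: np.ndarray,               # (n_tokens, d) saved during generation
    extract_claims: Callable[[str], list[str]],   # output text -> factual claims
    embed: Callable[[str], np.ndarray],           # claim -> vector of dim d
    min_similarity: float = 0.8,
) -> dict[str, list[int]]:
    """For each claim, return token positions whose activations resemble it."""
    acts = cached_activations / np.linalg.norm(cached_activations, axis=1, keepdims=True)
    hits: dict[str, list[int]] = {}
    for claim in extract_claims(output_text):
        c = embed(claim)
        c = c / np.linalg.norm(c)
        sims = acts @ c
        hits[claim] = [int(i) for i in np.where(sims > min_similarity)[0]]
    return hits
```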

I think that detecting/preventing models from knowingly lying would be a good research direction and it's clearly related to strategic deception, but I'm not actually sure that it's a superset (consider a case when I'm bullshitting you rather than lying; I predict what you want to hear me say and I say it, and I don't know or care whether what I'm saying is true or false or whatever).

Great point! I can certainly imagine that there could be cases like that, although I can equally imagine that LLMs could be consistently tracking the truth value of claims even if that isn't a big factor determining the output.

but yeah I think this is a reasonable sort of thing to try, but I think you need to do a lot of work to convince me of premise 1, and indeed I think I doubt premise 1 is true a priori though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim to it being true of every fact!

That seems reasonable. I've mostly had the impression that 1 has generally been true in specific cases where researchers have looked for it, but it's definitely not something I've specifically gone looking for. I'll be interested to read the sequence from DeepMind.

Also The Chameleon (would have included it in the last comment but had to consult a kid first).
