Aaron_Scher

Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise. 

https://ascher8.github.io/


The successors will have sufficiently similar goals as its predecessor by default. It’s hard to know how likely this is, but note that this is basically ruled out if the AI has self-regarding preferences.

1. Why? What are "self-regarding preferences", and how do they interact with the likelihood of predecessor AIs sharing goals with later AIs?

Given that the lab failed to align the AI, it’s unclear why the AI will be able to align its successor, especially if it has the additional constraint of having to operate covertly and with scarcer feedback loops.

2. I don't think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals, so the prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence. 

Overall, very interesting and good post. 

This increase occurred between 1950 and 1964, and leveled off thereafter.

Hm, this data doesn't feel terribly strong to me. What happened from 1965-1969? Why is that data point relatively low? That seems inconsistent with the poisoning theory. My prior is that data is noisy and it's easy to see effects that don't mean much. But this is an interesting and important topic, and I'm sorry it's infeasible to access better data. 

Neat, weird. 

I get similar results when I ask "What are the best examples of reward hacking in LLMs?" (GPT-4o). When I then ask for synonyms of "Thumbs-up Exploitation", the model still does not mention sycophancy; when I push harder, it does. 

Asking "what is it called when an LLM chat assistant is overly agreeable and tells the user what the user wants to hear?" on the first try the model says sycophancy, but much weirder answers in a couple other generations. Even got a "Sy*cophancy". 

I got the model up to 3,000 tokens/s on a particularly long/easy query. 

As an FYI, there has been other work on large diffusion language models, such as this: https://www.inceptionlabs.ai/introducing-mercury

We should also consider that, well, this result just doesn't pass the sniff test given what we've seen RL models do.

FWIW, I interpret the paper to be making a pretty narrow claim about RL in particular. On the other hand, a lot of the production "RL models" we have seen may not be pure RL. For instance, if you wanted to run a test similar to this paper's on DeepSeek-V3+, you would compare DeepSeek-V3 to DeepSeek-R1-Zero (pure RL diff, according to the technical report), not to DeepSeek-R1 (trained with a hard-to-follow mix of SFT and RL). R1-Zero is a worse model than R1, sometimes by a large margin. 

Out of curiosity, have your takes here changed much lately? 

I think the o3+ saga has updated me a small-medium amount toward "companies will just deploy misaligned AIs and consumers will complain but use them anyway" (evidenced by deployment of models that blatantly lie from multiple companies) and "slightly misaligned AI systems that are very capable will likely be preferred over more aligned systems that are less capable" (evidenced by many consumers, including myself, switching over to using these more capable lying models). 

I also think companies will work a bit to reduce reward hacking and blatant lying, and they will probably succeed to some extent (at least for noticeable, everyday problems), in the next few months. That, combined with OpenAI's rollback of 4o sycophancy, will perhaps make it seem like companies are responsive to consumer pressure here. But I think the situation is overall a small-medium update against consumer pressure doing the thing you might hope here. 

Side point, noting one other dynamic: advanced models are probably not going to act misaligned in everyday use cases (the ones consumers have an incentive to care about, though again revealed preference is less clear), even if they are misaligned. That's the whole deceptive alignment thing. So I think it does seem more like the ESG case? 

I agree that the report conflates these two scales of risk. Fortunately, one nice thing about that table (Table 1 in the paper) is that readers can choose which of these risks they want to prioritize. I think more longtermist-oriented folks should probably rank Loss of Control as the worst of these, followed perhaps by Bad Lock-in, then Misuse and War. But obviously there's a lot of variance within these. 

I agree that there *might* be some cases where policymakers will have difficult trade-offs to make about these risks. I'm not sure how likely I think this is, but I agree it's a good reason we should keep this nuance insofar as we can. I guess it seems to me like we're not anywhere near the right decision makers actually making these tradeoffs, nor near them having values that particularly up-weight the long-term future. 

I therefore feel okay about lumping these together in a lot of my communication these days. But perhaps this is the wrong call, idk. 

The viability of a pause is dependent on a bunch of things, like the number of actors who could take some dangerous action, how hard it would be for them to do that, how detectable it would be, etc. These are variable factors. For example, if the world got rid of advanced AI chips completely, dangerous AI activities would then take a long time and be super detectable. We talk about this in the research agenda; there are various ways to extend "breakout time", and these methods could be important to long-term stability. 


I think your main point is probably right but was not well argued here. The argument seems to be mostly a vibe argument: "nah, they probably won't find this evidence compelling". 

You could also make an argument from past examples where there has been large action to address risks in the world, and look at the evidence there (e.g., banning of CFCs, climate change more broadly, tobacco regulation, etc.) 

You could also make an argument from existing evidence around AI misbehavior and how it's being dealt with, where (IMO) 'evidence much stronger than internals' basically doesn't seem to affect the public conversation outside the safety community (or even much here). 

 

I think it's also worth saying a thing very directly: just because non-behavioral evidence isn't likely to be widely legible and convincing does not mean it is not useful evidence for those trying to have correct beliefs. Buck's previous post and many others discuss the rough epistemic situation when it comes to detecting misalignment. Internals evidence is going to be one of the tools in the toolkit, and it will be worth keeping in mind. 

Another thing worth saying: if you think scheming is plausible, and you think it will be difficult to update against scheming from behavioral evidence (Buck's post), and you think non-behavioral evidence is not likely to be widely convincing (this post), then the situation looks really rough. 

I appreciate this post, I think it's a useful contribution to the discussion. I'm not sure how much I should be updating on it. Points of clarification:

Within the first three months of our company's existence, Claude 3.5 sonnet was released. Just by switching the portions of our service that ran on gpt-4o, our nascent internal benchmark results immediately started to get saturated.

  1. Have you upgraded these benchmarks? Is it possible that the diminishing returns you're seeing in the Sonnet 3.5-3.7 series are just normal benchmark saturation? What % scores are the models getting? i.e., somebody could make the same observation about MMLU and basically be like "we've seen only trivial improvements since GPT-4", but that's because the benchmark is not differentiating progress well after like the high 80%s (in turn I expect this is due to test error and the distribution of question difficulty; see the toy sketch after this list).
  2. Is it correct that your internal benchmark is all cybersecurity tasks? Soeren points out that companies may be focusing much less on cyber capabilities than general SWE.
  3. How much are you all trying to elicit models' capabilities, and how good do you think you are? E.g., do you spend substantial effort identifying where the models are getting tripped up and trying to fix this? Or are you just plugging each new model into the same scaffold for testing (which I want to be clear is a fine thing to do, but is useful methodology to keep in mind). I could totally imagine myself seeing relatively little performance gains if I'm not trying hard to elicit new model capabilities. This would be even worse if my scaffold+ was optimized for some other model, as now I have an unnaturally high baseline (this is a very sensible thing to do for business reasons, as you want a good scaffold early and it's a pain to update, but it's useful methodology to be aware of when making model comparisons). Especially re the o1 models, as Ryan points out in a comment. 
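To gesture at the saturation point in (1), here's a purely illustrative toy simulation (the benchmark, difficulty distribution, and numbers are all made up, and say nothing about your actual eval): once most questions are easy for both models, a real capability gap shows up as only a small score difference, much of which can be lost in test noise.

```python
import math
import random

def benchmark_score(ability, n_questions=500, seed=None):
    """Toy benchmark: question difficulties are skewed easy, and the model
    answers each question correctly with probability sigmoid(ability - difficulty).
    Once both models find most questions easy, their scores bunch up near the
    ceiling and the remaining gap is comparable to sampling noise."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        difficulty = rng.betavariate(1, 6)  # mostly easy questions, made-up distribution
        p_correct = 1 / (1 + math.exp(-8 * (ability - difficulty)))
        correct += rng.random() < p_correct
    return correct / n_questions

# Two models with a real capability gap (0.80 vs 0.95 "ability") end up with
# benchmark scores that differ by far less than their underlying abilities do.
for ability in (0.80, 0.95):
    scores = [benchmark_score(ability, seed=s) for s in range(5)]
    print(ability, [round(s, 3) for s in scores])
```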