Student and co-director of the AI Safety Initiative at Georgia Tech. Interested in technical safety/alignment research and general projects that make AI go better. More about me here.
Thanks for the kind words Zephy!
Yeah, interesting point about batch size. We simply started with 32 because we wanted to simulate real SFT workflows and 32 is a common default. Similar story with lr and the rest of the training parameters, aside from those we changed to save training costs. A quick Claude query says that smaller batch sizes often generalize better thanks to gradient noise, but this isn't super robust across tasks, since gaps can be made up with lr and epoch adjustments!
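For concreteness, this is the kind of "common default" setup I mean (a minimal sketch using HuggingFace's `TrainingArguments`; the batch size is the one we actually used, but the other values here are illustrative, not our exact config):

```python
# Illustrative SFT hyperparameters, not our exact training script.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sft-run",
    per_device_train_batch_size=32,  # the common default we started from
    learning_rate=2e-5,              # typical SFT lr (illustrative)
    num_train_epochs=3,              # illustrative
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
)
```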
Thanks Boaz and Parv for writing these. I think there are a few important details that didn't get past the information bottleneck that is natural language.
Note: Parv (author of this post) and I are close friends in real life. We work on AIS field building and research together, so my context with him may skew my interpretation of his post and this discussion.
What does being ok mean? I can infer maybe 2 definitions from the discussion.
(1) Being ok means "doing well for yourself", which includes financial security, not being in the hypothesized permanent underclass, and living a fulfilling life in general.
(2) Being ok means (1) AND not seeing catastrophic risk materialize (even if it doesn't impact you as much), which some of us assign intrinsic value to. I think this is what Parv meant by "I did not want the world with these things to end".
Boaz, I think you're referring to definition (1) when you say the below, right? We likely won't be okay under definition (2), which is why the emotions imparted by Parv's piece resonated with so many readers? (Unsure, inviting Parv to comment himself.)
"I believe that you will most likely will be OK, and in any case should spend most of your time acting under this assumption."
However, under either definition, I agree that it is productive to act under the belief "I will be okay if I try my hardest to improve the outcome of AI."
In general, selection for reward produces equally strong selection for reward’s necessary and sufficient conditions. In general, it seems like there should be a lot of those. Therefore, since selection is not only for reward but for anything which goes along with reward (e.g. reaching the goal), then selection won’t advantage reward optimizers over agents which reach goals quickly / pick up lots of trash / [do the objective].
Low-confidence disagree here. If the AI has a very good model of how to achieve goal/reward X (which LLMs generally do), then the 'reward optimizer' policy elicits the set of necessary actions (like picking up lots of trash) that leads to that reward. In this sense, I think the 'think about what actions achieve the goal and do them' behavior will achieve better rewards and therefore be more heavily selected for. I think this also fits the framing of the behavioral selection model recently proposed by Alex Mallen (https://www.lesswrong.com/posts/FeaJcWkC6fuRAMsfp/the-behavioral-selection-model-for-predicting-ai-motivations-1), similar to the 'motivation' cognitive pattern.
Why will the AI display this kind of explicit reward modelling in the first place? 1. We kind of tell the LLM what the goal is in certain RL tasks. 2. The most coherent persona/solution is one that explicitly models rewards/thinks about goals, whether that comes from assistant persona training or from writing about AI.
Therefore I think we should reconsider implication #1? If the above is correct, AI can and will optimize for goals/rewards, just not in the intrinsic sense. This can be seen as a 'cognitive groove' that gets chiseled into the AI, but it is problematic in the same ways as the reward optimization premise.
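To make the selection argument concrete, here's a toy sketch (completely made-up policies and success rates, just to illustrate the dynamic, not evidence for it): if the 'model the goal/reward and back out the necessary actions' policy completes the task more reliably, reward-based selection amplifies it, even though nothing here 'intrinsically' wants reward.

```python
# Toy illustration only: made-up policies with made-up success rates.
import random

random.seed(0)

policies = {
    "rote heuristic (picks up trash it happens to see)": 0.6,
    "explicit goal/reward modelling (plans the pickups)": 0.9,
}

def avg_reward(success_prob: float, episodes: int = 1000) -> float:
    """Reward 1 if the trash gets picked up this episode, else 0."""
    return sum(random.random() < success_prob for _ in range(episodes)) / episodes

# Selection (e.g. a policy-gradient update) pushes probability mass toward
# whichever behaviour earned more reward, regardless of whether the policy
# 'cares about' reward or just reliably produces the reward-getting actions.
rewards = {name: avg_reward(p) for name, p in policies.items()}
total = sum(rewards.values())
selection_weights = {name: r / total for name, r in rewards.items()}
print(selection_weights)
```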
The "hallucination/reliability" vs "misaligned lies" distinction probably matters here. The former should in principle go away as capability/intelligence scales while the latter probably gets worse?
I don't know of a good way to find evidence of model 'intent' for this type of incrimination, but if we explain this behavior with the training process it'd probably look something like:
Maybe the scaling law to look at here is model size vs. the % of misaligned data needed for the LLMs to learn this kind of misalignment? Or maybe inoculation prompting fixes all of this, but you'd have to craft custom data for each undesired trait...
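A sketch of the sweep I have in mind (hypothetical experiment structure; `finetune`, `eval_misalignment_rate`, the sizes, fractions, and threshold are all placeholders I'm making up, not an existing pipeline):

```python
# Hypothetical experiment grid: what fraction of misaligned SFT data does each
# model size need before the trait shows up in behavioural evals?
MODEL_SIZES = ["1B", "8B", "70B"]                 # illustrative sizes
MISALIGNED_FRACTIONS = [0.001, 0.01, 0.05, 0.10]  # share of the SFT mix
THRESHOLD = 0.5                                   # arbitrary "learned it" cutoff

def finetune(model_size: str, misaligned_fraction: float):
    """Placeholder: SFT the model on a mix with this share of misaligned data."""
    raise NotImplementedError

def eval_misalignment_rate(model) -> float:
    """Placeholder: fraction of eval prompts eliciting the misaligned trait."""
    raise NotImplementedError

def min_fraction_to_learn(model_size: str) -> float | None:
    for frac in MISALIGNED_FRACTIONS:
        model = finetune(model_size, frac)
        if eval_misalignment_rate(model) > THRESHOLD:
            return frac
    return None  # never crossed the threshold in this sweep

# The "scaling law" is then min_fraction_to_learn(size) plotted against size.
```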
Does anyone know of a convincing definition of 'intent' in LLMs (or a way to identify it)? In model organisms type work, I find it hard to 'incriminate' LLMs. Even though the output of the LLM will remain what it is regardless of 'intent', I think this distinction may be important because 'intentionally lying' and 'stochastic parroting' should scale differently with overall capabilities.
I find this hard for several reasons, but I'm highly uncertain whether these are fundamental limitations:
More on giving undergrads their first research experience. Yes, giving a first research experience is high impact, but we want to reserve these opportunities for the best people. Often, this first research experience is most fruitful when they work with a highly competent team. We are turning our focus to assembling such teams and finding fits for the most value-aligned undergrads.
We always find it hard to form pipelines because individuals are just so different! I don't even feel comfortable using 'undergrad' as a label if I'm honest...
Thanks again Esben for collaborating with us! Can confidently say that the above is super valuable advice for any AI safety hackathon organizer; it's all consistent with our experience.
In the context of a college campus hackathon, I'd especially stress preparing starter materials and making submission requirements clear early on!
Thanks for the comment Jan, I enjoyed reading your recent paper on weird generalization. Adding takes and questions in the same order as your bullet points: