I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.
Website: https://jacquesthibodeau.com
Twitter: https://twitter.com/JacquesThibs
GitHub: https://github.com/JayThibs
(Just a general thought, not agreeing/disagreeing)
One thought I had recently: it feels like some people make an effort to update their views and decision-making based on new evidence, and to pay attention to which key assumptions or viewpoints depend on that evidence. As a result, they end up reflecting on how it should impact their future decisions or behaviour.
In fact, they might even be seeking evidence as quickly as possible to update their beliefs and ensure they can make the right decisions moving forward.
Others will accept new facts but avoid taking the time to adjust the perspectives that depend on them. In those cases, it seems to me that they are almost always less likely to make optimal decisions.
If an LLM trying to do research learns that Subliminal Learning is possible, it seems likely that it will be much better at applying that new knowledge if the knowledge is integrated into its understanding as a whole rather than held as an isolated fact.
"Given everything I know about LLMs, what are the key things that would update my views on how we work? Are there previous experiments I misinterpreted due to relying on underlying assumptions I had considered to be a given? What kind of experiment can I run to confirm a coherent story?"
It seems to me that if you point an AI towards automated AI R&D, it will be more capable of it if it can internalize new information and integrate it into a more coherent view.
If all the labs intend to cause recursive self-improvement and claim they will solve alignment with some vague “eh, we’ll solve it with automated AI alignment researchers”, that is not good enough.
At the very least, they all need to provide public details of their plan with a Responsible Automation Policy.
My girlfriend (who is not at all SF-brained and typically doesn’t read LessWrong unless I send her something) really enjoyed it and thought it was great because it helped her empathize with people in AI safety / LessWrong (it makes them feel more human). She found it well-written and enjoyable to read, something she could get through without it feeling like a task.
That said, I am a little confused by folks who say that “current AI models have nothing to do with future powerful (real) AIs” yet also consistently point to “bad” behaviour from current AIs as a reason to stop.
Often, the argument made is, “we don’t even understand the previous generations of AIs, how do we even hope to align future AIs?”
I guess the way I understand it is: since we can’t even get current AIs to do exactly what we want, we should expect the same for future AIs. However, this feels partly down to the fact that current AIs are just sloppy and lack the capability, not only something about “we don’t know how to align current models perfectly to our intentions.”
The key argument against the superalignment/automated alignment agenda is that while AIs will excel in verifiable domains, such as code, they will struggle with hard-to-verify tasks.
For example, science in domains where we have little data (e.g. the alignment of superintelligence), and where techniques that work for weaker models turn out to be poor proxies that break at superintelligence (e.g. internal reasoning becomes harder to monitor, models are no longer stateless and are continually learning, their reasoning is tangibly different from the weak reasoning that currently exists, etc.).
Ultimately, you get convincing slop, and even though you might catch non-superintelligent AIs doing so-called “scheming”, it’s not that helpful because they are not capable enough to cause a catastrophe at this point.
The crux is whether AIs end up capable of +10x-ing actually useful superalignment research while you are in the valley of life: the window when you can quickly verify that outputs are not slop (no longer severely bottlenecked on human talent; after the slop era), but before all your control techniques are basically doomed.
So, you hope to prevent AIs from sabotaging AI safety research AND that the resulting safety research isn’t just a poor proxy that works well at a specific model size/shape, but then completely fails when you have self-modifying superintelligence.
Ultimately, you’d better have a backup plan for superalignment that isn’t just, “we’ll stop if we catch the AIs being deceptively aligned.” There are worlds where everything seems plausibly safe, you have a very convincing, vetted safety plan, you implement it, and you die.
Thanks for the post, Simon! I think we need more discussion that gives specific criticisms of, and demands regarding, the labs’ mainline alignment plan.
I’d like to eventually put forth my strongest arguments on superalignment as a whole and on what would need to happen to realistically convince/force the labs to stop.
Quick comments:
I've DM'd you my current draft doc on this, though it may be incomprehensible.
Have you published this doc? If so, which one is it? If not, may I see it?
Hmm, so I still hold the view that they are worthwhile even if they are not informative, particularly for the reasons you seem to have pointed to (i.e. either training up good human researchers to identify who has a knack for a specific style of research, so that we can use them to provide initial directions to the AIs automating AI safety R&D and to serve as verifiers of model outputs, or building infrastructure that ends up being used by AIs that are good enough to run tons of experiments leveraging that infra but not good enough to come up with completely new paradigms).
I shared the following as a bio for EAG Bay Area 2024. I'm sharing it here in case it reaches someone who wants to chat or collaborate.
Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.
CURRENT WORK
TOPICS TO CHAT ABOUT
POTENTIAL COLLABORATIONS
TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH