jacquesthibs — LessWrong

LESSWRONG
LW

I shared the following as a bio for EAG Bay Area 2024. I'm sharing it here in case it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda.
I'm a research lead in the AI Safety Camp for a project on stable reflectivity (testing models for metacognitive capabilities that impact future training/alignment).
Accelerating Alignment: augmenting alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.
Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
I used to focus more on model editing, rethinking interpretability, causal scrubbing, etc.

TOPICS TO CHAT ABOUT

How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scale plateau? I'd like to get feedback on some of my thoughts on this.
How can we connect the dots between different approaches? For example, connecting the dots between Influence Functions, Evaluations, Probes (detecting truthful direction), Function/Task Vectors, and Representation Engineering to see if they can work together to give us a better picture than the sum of their parts.
Debate over which agenda actually contributes to solving the core AI x-risk problems.
What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
How can we make something like the d/acc vision (by Vitalik Buterin) happen?
How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
What kinds of orgs are missing in the space?

POTENTIAL COLLABORATIONS

Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper or the Sleeper Agents paper, understanding the impacts of synthetic data on LLM training, working on ELK-like research for LLMs, experiments on influence functions (studying the base model and its SFT, RLHF, iterative training counterparts; I heard that Anthropic is releasing code for this "soon") or studying the interpolation/extrapolation distinction in LLMs.
I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
I'm slowly working on a guide to practical research productivity for alignment researchers, aimed at low-hanging fruit that can quickly improve productivity in the field. I'd like feedback from people with solid track records and from productivity coaches.

TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH

Strong math background, can understand Influence Functions enough to extend the work.
Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
Strong application development background. I have various project ideas that could speed up alignment researchers; I'd be able to execute them much faster if I had someone to help me build my ideas fast.

jacquesthibs's Shortform

jacquesthibs3d20

Another option I didn't mention is to build a company with the intention of getting acquired. This is generally bizarre, and VCs don't like it, since you'd be unlikely to deliver massive returns for them (most acquisitions are considered failures). That said, acquisition activity in the AI space is quite high. Then again, VCs may be concerned the founders will just get acquihired instead.

One way that might work is to basically have no legitimate revenue for a few years and still build something the big labs will really want at some point in the future (unclear what this is, but a non-AI-safety company like Bun can get acquired with virtually no revenue afaict, though they only raised $7 million in funding in 2022). From an AI safety perspective, it's unclear how it would play out, since your goal might be to have your tech disseminated across all companies.

jacquesthibs's Shortform

jacquesthibs4d80

LW feature request (low on the importance scale):

It would be nice to be able to refresh the TTS for a post if it has been edited. I was reading this post, and it was a bit confusing to keep track of the audio since it had been edited.

wunan's Shortform

jacquesthibs4d30

Hmm, my thought was that devs (or at least Anthropic folks) have improved their ability to estimate how much AI is helping us since the release of the first truly agentic model? My feeling is that most top-end people should be better calibrated despite the moving target. Most people in the study had spent less than 50 hours (except for one of the folks who performed well), so I don’t think we can use the study to say much about how things change over the course of months or a year of usage and training (unless we do another study, I guess).

In terms of the accurate prediction, I don’t recall exactly what made me believe this, though if you look at the first chart in the METR thread, the confidence interval of the devs’ predicted uplift is below 38%. The average dev thought they were 24% faster at the beginning of the study (so, in fact, he probably underestimated his uplift a bit).

wunan's Shortform

jacquesthibs4d110

I think there is nuance about the downlift study that would be helpful to highlight:

Many participants used Sonnet 3.7 in Cursor for the first time (chat vs agent usage is a different skillset).
Sonnet 3.7 was notoriously bad in Cursor compared to Claude Code (since it was post-trained with the CC harness). I personally spent a few hours updating the system prompt in Cursor so that it became more usable.
Many people outside of Anthropic feel like Opus 4.5 is another “Sonnet 3.5 moment.”
We’ve learned a lot more about how to code with AI since then. Anthropic obviously teaches and sets up best practices internally.
There was in fact one person in the study who did accurately predict their uplift (+38%). IIRC they were also the most experienced with coding agents! They wrote a thread on the topic here.

This is not to say that Anthropic employees really are getting that high of an uplift, but it may make the claim a bit more believable.

zroe1's Shortform

jacquesthibs5d50

I’ve looked into this as part of my goal of accelerating safety research and automating as much as we can. It was one of the primary things I imagined we would do when we pushed for the non-profit path. We eventually went for-profit because we expected there would not be enough money disbursed to do this, especially in a short-timelines world.

I am again considering going non-profit to pursue this goal, among others. I’ll send you and others a proposal on what I imagine this would look like in the grander scheme.

I’ve been in AI safety for a while now and feel like I’ve formed a fairly comprehensive view of what would accelerate safety research, reduce power concentration, what it takes to automate research more safely as capabilities increase, and more.

I’ve tried to make this work as part of a for-profit, but it is incredibly hard to tackle the hard parts of the problem in that situation. Since tackling them is my intention, I’m again considering whether a non-profit will have to do, despite the unique difficulties that come with that.

jacquesthibs's Shortform

jacquesthibs5d2212

Most AI safety plans include “automating AI safety research.” We need more clarity on what that looks like.

There are at least four things that get conflated in the term “automated research”:

1. AI uses search to output what was already discovered (e.g. finds the solution in existing paper(s)).
2. AI uses search to find pieces of a solution that come together to solve a problem (hopefully in a verifiable domain / Lean proof).
3. AI agents use existing research techniques we already know about, and apply them to a variety of new experiments. An example of AI safety research would be using insights/techniques from subliminal learning and emergent misalignment to study new dataset splits and models trained in new ways, while applying existing interpretability techniques with an auditor agent.
4. Getting AIs to create novel techniques that substantially improve the domain in question. This is like getting an AI to come up with a new paradigm, which may change how we even think about that research area.

For AI safety, the crux of many disagreements is whether one believes that:

3 & 4 are meaningfully different, in ways that make 4 substantially harder to get than 3. Some people even seem to fail to disentangle the two and end up convinced that AIs are solving research as some singular thing.
4-level capabilities are already in the superintelligence-regime, so it defeats the purpose of using it for safety if you don’t have guarantees that it is safe.
When talking about superintelligence (the kind that, e.g. can start and grow entire large-scale businesses on its own, solve long-term complex goals like eliminating cancer and deal with any change in the world that goes beyond its initial training data), AI safety research needs novel paradigm-level breakthroughs (4) to reduce risks down to acceptable levels. Meaning that you might expect 3 to be too much within-paradigm, relatively unenlightened research.
4 is unneeded for a safe transition. Some folks seem to believe that 3 (which could be described as “relatively unenlightened” research) will be enough to align every subsequent AI, even once we are past 4.
Some folks believe that scaffolding and inference compute at a not-much-higher level of capability are all you need to get 4, and that you’ll be fine from a safety perspective because the models are currently useful for research and don’t seem misaligned.
Some seem to believe that 3 may produce good research output (within that set of possible experiments), but you will basically get slop for anything in 4 (anything truly out-of-distribution). So, the AI was put through the wringer and believes it has substantially made the next model safe, but, because it is incapable of generalizing well OOD, it fails to align a 4-level model. It has good intentions, but basically only does good safety work for 3-level models and totally fails at generating sufficient safety research techniques for aligning 4. It just slops itself to a disaster.
Even if 3 is helpful, it doesn’t end up meaningfully speeding up safety research in comparison to the pace of progress with respect to superintelligent capabilities.
4 involves the AI continually updating its weights, consolidating insights and placing them neatly within its world model. 3 has some sort of disjointed world model that can’t be overcome with fancy scratchpadding and RAG (like, imagine an AI with a knowledge cutoff from 2023 and you RAG in 2026 research; it’s missing *years* of build-up in its world model). 3 is suitable for following templates and interpolating within what we know, but fails to *understand* what is OOD.

Ultimately, this seems like a highly important question to clarify, since I believe it is driving many people to be optimistic about AI safety progress, at least to the point that it allows them to keep chugging along the capabilities tech tree. Having clarity on what would convince people otherwise much sooner seems important.

zroe1's Shortform

jacquesthibs7d*92

Relevant: https://www.lesswrong.com/posts/88xgGLnLo64AgjGco/where-are-the-ai-safety-replications

I think doing replications is great, and it’s one of the areas where I think automated research will be helpful soon. I replicated the Subliminal Learning paper on the day of its release because it was fairly easy to grab the paper, docs, codebases, etc. and replicate quickly.

jacquesthibs's Shortform

jacquesthibs7d135

Short timelines, slow takeoff vs. Long timelines, fast takeoff

Because chain-of-thought in the current paradigm seems like great news for AI safety, some people seem to have the following expectations:

Short timelines: CoT reduces risks, but shorter preparation time increases the odds of catastrophe.

Long timelines: the current paradigm is not enough; therefore, CoT may stop being relevant, which may increase the odds of catastrophe. We have more time to prepare (which is good), but we may get a faster takeoff than the current paradigm suggests. A discontinuous takeoff may therefore introduce significantly more risk despite longer timelines.

So, perhaps counterintuitively for some, you could have these two groups:

1. Slow (smooth, non-discontinuous) takeoff, low p(doom), takeoff happens in the next couple of years. [People newer to AI safety seem more likely to expect this imo]

Vs.

2. Fast takeoff (discontinuous capability increase w.r.t. time), high p(doom), (actual) takeoff happens in 8-10 years. [seems more common under the MIRI / traditional AI safety researchers cluster]

I’m not saying those are the only two groups, but I think part of it speaks to how some people are feeling about the current state of progress and safety.

As a result, I think it’s pretty important to gain better clarity on whether we expect the current paradigm to scale without fundamental changes, and, if not, to understand what would come after it and how it would change the risks.

That’s not to say we shouldn’t weigh short timelines more highly due to being more immediate, but there are multiple terms to weigh here.

Alignment remains a hard, unsolved problem

jacquesthibs7d80

I agree that other forums would engage with even worse norms, but I'm personally happy to keep the bar high and hold these discussions to a high standard, regardless of what others do elsewhere. My hope is that we never stop striving for better, especially since the stakes for alignment are far higher than in most other domains, so we need a higher standard of frankness.
