jacquesthibs
I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.

Website: https://jacquesthibodeau.com

Twitter: https://twitter.com/JacquesThibs

GitHub: https://github.com/JayThibs 

LinkedIn: https://www.linkedin.com/in/jacques-thibodeau/ 

Sequences

On Becoming a Great Alignment Researcher (Efficiently)

Comments (sorted by newest)

jacquesthibs's Shortform
jacquesthibs · 2y

I shared the following as a bio for EAG Bay Area 2024. I'm sharing it here in case it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

  • Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda.
  • I'm a research lead in the AI Safety Camp for a project on stable reflectivity (testing models for metacognitive capabilities that impact future training/alignment).
  • Accelerating Alignment: augmenting alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.
  • Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
  • Used to focus more on model editing, rethinking interpretability, causal scrubbing, etc.

TOPICS TO CHAT ABOUT

  • How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scaling plateau? I'd like to get feedback on some of my thoughts on this.
  • How can we connect the dots between different approaches? For example, combining Influence Functions, Evaluations, Probes (detecting a truthful direction), Function/Task Vectors, and Representation Engineering to see whether, together, they give us a better picture than the sum of their parts (see the probe sketch after this list).
  • Debate over which agenda actually contributes to solving the core AI x-risk problems.
  • What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
  • How can we make something like the d/acc vision (by Vitalik Buterin) happen?
  • How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
  • What kinds of orgs are missing in the space?
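
As a small illustration of the probing piece mentioned above, here is a minimal sketch of fitting a linear probe for a "truthful" direction. The activations below are synthetic stand-ins with a planted direction rather than real residual-stream activations, so this only shows the shape of the technique, not a result.

```python
# Minimal linear-probe sketch: recover a planted "truthful" direction from
# synthetic activations. In practice, the activations would come from an LLM
# run on labelled true/false statements.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_examples = 256, 2000

# Hypothetical ground-truth direction baked into the synthetic data.
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)

labels = rng.integers(0, 2, size=n_examples)                   # 1 = statement is true
activations = rng.normal(size=(n_examples, d_model))           # background noise
activations += 1.5 * np.outer(2 * labels - 1, true_direction)  # truth-dependent signal

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))

# The probe's weight vector is the candidate "truthful direction"; it could then
# be compared against function/task vectors or representation-engineering
# steering vectors to see whether they point the same way.
learned_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("cosine with planted direction:", float(learned_direction @ true_direction))
```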

POTENTIAL COLLABORATIONS

  • Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper (see the toy sketch after this list) or the Sleeper Agents paper, understanding the impacts of synthetic data on LLM training, working on ELK-like research for LLMs, running experiments on influence functions (studying the base model and its SFT, RLHF, and iteratively trained counterparts; I heard that Anthropic is releasing code for this "soon"), or studying the interpolation/extrapolation distinction in LLMs.
  • I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
  • I'm slowly working on a guide for practical research productivity for alignment researchers to tackle low-hanging fruits that can quickly improve productivity in the field. I'd like feedback from people with solid track records and productivity coaches.
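
For the weak-to-strong direction mentioned in the first bullet, here is a toy sketch of the basic experimental loop on synthetic tabular data, with scikit-learn models standing in for weak/strong LLMs (so this is only the shape of the setup, not the paper's): train a weak supervisor on gold labels, pseudo-label a pool with it, train a stronger student on those pseudo-labels, and measure how much of the weak-to-strong performance gap the student recovers.

```python
# Toy weak-to-strong generalization loop on synthetic data (not the paper's setup).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=6000, n_features=40, n_informative=20, shuffle=False, random_state=0
)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=1000, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)

# "Weak supervisor": a deliberately limited model (sees only 5 features) trained on gold labels.
weak = LogisticRegression(max_iter=500).fit(X_weak[:, :5], y_weak)
weak_acc = weak.score(X_test[:, :5], y_test)

# "Strong student": a more capable model trained only on the weak model's pseudo-labels.
pseudo_labels = weak.predict(X_pool[:, :5])
strong_from_weak = GradientBoostingClassifier(random_state=0).fit(X_pool, pseudo_labels)
w2s_acc = strong_from_weak.score(X_test, y_test)

# Ceiling: the strong model trained directly on gold labels.
ceiling_acc = GradientBoostingClassifier(random_state=0).fit(X_pool, y_pool).score(X_test, y_test)

# Performance gap recovered (PGR), the paper's headline metric.
pgr = (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"weak={weak_acc:.3f}  weak-to-strong={w2s_acc:.3f}  ceiling={ceiling_acc:.3f}  PGR={pgr:.3f}")
```

An extension in the spirit of the original paper would swap the synthetic data for a real task and the sklearn models for small/large LLMs, then test auxiliary losses or data filtering that improve PGR.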

TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH

  • Strong math background, can understand Influence Functions enough to extend the work.
  • Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
  • Strong application development background. I have various project ideas that could speed up alignment researchers; I'd be able to execute them much faster if I had someone to help me build my ideas fast. 

How Stuart Buck funded the replication crisis
jacquesthibs · 8h

When you say “creating the replication crisis”, it read to me like he caused lots of people to publish papers that don’t replicate!

jacquesthibs's Shortform
jacquesthibs · 1d

How much of the alignment problem do you think will come down to getting online learning right?

Online learning (and verification) feels like a key capability unlock to me, and it seems to be one of the things that comes up in paths to misalignment.

TLDR: We want to describe a concrete and plausible story for how AI models could become schemers. We aim to base this story on what seems like a plausible continuation of the current paradigm. Future AI models will be asked to solve hard tasks. We expect that solving hard tasks requires some sort of goal-directed, self-guided, outcome-based, online learning procedure, which we call the “science loop”, where the AI makes incremental progress toward its high-level goal. We think this “science loop” encourages goal-directedness, instrumental reasoning, instrumental goals, beyond-episode goals, operational non-myopia, and indifference to stated preferences, which we jointly call “Consequentialism”. We then argue that consequentialist agents that are situationally aware are likely to become schemers (absent countermeasures) and sketch three concrete example scenarios. We are uncertain about how hard it is to stop such agents from scheming. We can both imagine worlds where preventing scheming is incredibly difficult and worlds where simple techniques are sufficient. Finally, we provide concrete research questions that would allow us to gather more empirical evidence on scheming. 

[...]

Self-guided online learning: There is an online learning component to it, i.e. the model has to condense the new knowledge it learned from iterations. For example, the model could run thousands of different trajectories in parallel. Then, it could select the trajectories that it expects to make the most progress toward its goal and fine-tune itself on them. The decisions about which data to select for fine-tuning are made by the model itself with little human correction, e.g. in some form of self-play fashion. Since the problem is hard, humans perform worse than the model at selecting different rollouts, and since there is a lot of data to sift through, humans couldn’t read it all in time anyway.

So, this makes me wonder why I see very little work on this topic within the alignment community.

I've seen multiple startups tackle this problem and fail for a multitude of reasons (including being too early and, as a result, lacking customers).

So, as a startup founder trying to find business trajectories that would actually tackle the core of alignment, I'm trying to reflect on whether there's a path that involves something to do with online learning.
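
To make the "science loop" in the quoted passage concrete, here is a bare-bones schematic of that self-guided online-learning procedure. Everything here is a hypothetical stand-in (the rollout generation, self-scoring, and fine-tuning steps are stubs, not any lab's actual training API); the point is just where the model, rather than a human, decides what it trains on.

```python
# Schematic "science loop": generate many rollouts, let the model itself select
# the ones it judges most useful, fine-tune on them, repeat. All helpers are
# hypothetical stand-ins, not a real training API.
import random

def generate_rollouts(model_state: dict, goal: str, n: int) -> list[str]:
    # Stand-in for sampling n trajectories from the current model toward `goal`.
    return [f"trajectory {i} toward {goal}" for i in range(n)]

def self_score(model_state: dict, rollout: str) -> float:
    # Stand-in for the model's *own* estimate of how much progress a rollout makes.
    # No human reviews these scores -- this is where oversight gets thin.
    return random.random() + model_state["skill"]

def finetune_on(model_state: dict, rollouts: list[str]) -> dict:
    # Stand-in for a fine-tuning update; here it just nudges a scalar "skill".
    return {**model_state, "skill": model_state["skill"] + 0.01 * len(rollouts)}

def science_loop(goal: str, iterations: int = 5, n_rollouts: int = 1000, k_keep: int = 50) -> dict:
    model_state = {"skill": 0.0}
    for step in range(iterations):
        rollouts = generate_rollouts(model_state, goal, n_rollouts)
        # The model selects its own training data; humans couldn't read it all in time.
        ranked = sorted(rollouts, key=lambda r: self_score(model_state, r), reverse=True)
        model_state = finetune_on(model_state, ranked[:k_keep])
        print(f"step {step}: kept {k_keep}/{n_rollouts} rollouts, skill={model_state['skill']:.2f}")
    return model_state

science_loop(goal="improve protein-structure prediction")
```

Most of the alignment-relevant action sits in the self-scoring and data-selection steps: that is where online learning, verification, and control techniques would have to plug in.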

The Thinking Machines Tinker API is good news for AI control and security
jacquesthibs · 8d

When it came out, my first thought was that it would be great for reducing power-concentration risks if you can easily have AIs train on your specific data. The more autonomous and capable those AIs are at online learning relative to models from the AGI labs, the less companies would need to rely on bigger generalist AI models. It’s one path I’ve considered for our startup.

1a3orn's Shortform
jacquesthibs · 12d

(Just a general thought, not agreeing/disagreeing)

One thought I had recently: it feels like some people make an effort to update their views and decision-making based on new evidence, and to pay attention to which of their key assumptions or viewpoints depend on it. As a result, they end up reflecting on how the new evidence should affect their future decisions and behaviour.

In fact, they might even be seeking evidence as quickly as possible to update their beliefs and ensure they can make the right decisions moving forward.

Others accept new facts but don't take the time to adjust the perspectives that depend on them. In those cases, it seems to me that they are almost always less likely to make good decisions.

If an LLM trying to do research learns that Subliminal Learning is possible, it seems likely that it will apply that new knowledge much better if the knowledge is integrated into its understanding of itself as a whole.

"Given everything I know about LLMs, what are the key things that would update my views on how we work? Are there previous experiments I misinterpreted due to relying on underlying assumptions I had considered to be a given? What kind of experiment can I run to confirm a coherent story?"

It seems to me that if you point an AI towards automated AI R&D, it will be more capable at it if it can internalize new information and disentangle it into a more coherent view.

jacquesthibs's Shortform
jacquesthibs · 19d

If all the labs intend to pursue recursive self-improvement and claim they’ll solve alignment with some vague “eh, we’ll solve it with automated AI alignment researchers”, that is not good enough.

At the very least, they all need to provide public details of their plan with a Responsible Automation Policy.

jacquesthibs's Shortform
jacquesthibs · 24d

Post is still up here.

More recent thoughts here.

The Company Man
jacquesthibs · 25d

My girlfriend (who is not at all SF-brained and typically doesn’t read LessWrong unless I send her something) really enjoyed it and felt it was great because it helped her empathize with people in AI safety / LessWrong (it makes them feel more human). She felt it was well written and enjoyable to read; it was something she could read without it feeling like a task.

jacquesthibs's Shortform
jacquesthibs · 1mo

That said, I am a little bit confused by folks who say, “current AI models have nothing to do with future powerful (real) AIs”, yet also consistently use “bad” behaviour from current AIs as a reason to stop.

Often, the argument made is, “we don’t even understand the previous generations of AIs, how do we even hope to align future AIs?”

I guess the way I understand it is: given that we can’t even get current AIs to do exactly what we want, we should expect the same for future AIs. However, this feels partly tied to the fact that current AIs are just sloppy and lack capability, not only to “we don’t know how to align current models perfectly to our intentions.”

jacquesthibs's Shortform
jacquesthibs · 1mo

The key argument against the superalignment/automated alignment agenda is that while AIs will excel in verifiable domains, such as code, they will struggle with hard-to-verify tasks.

For example, they will struggle with science in domains where we have little data (like the alignment of superintelligence), and techniques that work for weaker models will be poor proxies that break at superintelligence (e.g. internal reasoning is harder to monitor, models are no longer stateless and are continually learning, and their reasoning is tangibly different from the weak reasoning that currently exists).

Ultimately, you get convincing slop, and even though you might catch non-superintelligent AIs doing so-called “scheming”, it’s not that helpful because they are not capable enough to cause a catastrophe at this point.

The crux is whether AIs end up capable of 10x-ing actually useful superalignment research while you are in the “valley of life”: the window when you can quickly verify that outputs are not slop (no longer severely bottlenecked on human talent; after the slop era), but before all your control techniques are basically doomed.

So, you hope both to prevent AIs from sabotaging AI safety research AND to ensure that the resulting safety research isn’t just a poor proxy that works well at a specific model size/shape but completely fails once you have self-modifying superintelligence.

Ultimately, you’d better have a backup plan for superalignment that isn’t just, “we’ll stop if we catch the AIs being deceptively aligned.” There are worlds where everything seems plausibly safe, you have a very convincing, vetted safety plan, you implement it, and you die.

Posts

  • Automating AI Safety: What we can do today (36 karma · 3mo · 0 comments)
  • What Makes an AI Startup "Net Positive" for Safety? (82 karma · 6mo · 23 comments)
  • How much I'm paying for AI productivity software (and the future of AI use) (59 karma · 1y · 18 comments)
  • Shane Legg's necessary properties for every AGI Safety plan [Question] (58 karma · 1y · 12 comments)
  • AISC Project: Benchmarks for Stable Reflectivity (17 karma · 2y · 0 comments)
  • Research agenda: Supervising AIs improving AIs [Ω] (76 karma · 2y · 5 comments)
  • Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky (293 karma · 3y · 297 comments)
  • Practical Pitfalls of Causal Scrubbing [Ω] (87 karma · 3y · 17 comments)
  • Can independent researchers get a sponsored visa for the US or UK? [Question] (23 karma · 3y · 1 comment)
  • What's in your list of unsolved problems in AI alignment? [Question] (60 karma · 3y · 9 comments)