jacquesthibs

I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.

Website: https://jacquesthibodeau.com

Twitter: https://twitter.com/JacquesThibs

GitHub: https://github.com/JayThibs 

Sequences

On Becoming a Great Alignment Researcher (Efficiently)

Comments

I shared the following as a bio for EAG Bay Area 2024. I'm sharing it here in case it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

  • Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda.
  • I'm a research lead in the AI Safety Camp for a project on stable reflectivity (testing models for metacognitive capabilities that impact future training/alignment).
  • Accelerating Alignment: augmenting alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.
  • Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
  • Used to focus more on model editing, rethinking interpretability, causal scrubbing, etc.

TOPICS TO CHAT ABOUT

  • How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scale plateau? I'd like to get feedback on some of my thoughts on this.
  • How can we connect the dots between different approaches? For example, can Influence Functions, Evaluations, Probes (detecting a truthful direction), Function/Task Vectors, and Representation Engineering work together to give us a better picture than the sum of their parts?
  • Debate over which agenda actually contributes to solving the core AI x-risk problems.
  • What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
  • How can we make something like the d/acc vision (by Vitalik Buterin) happen?
  • How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
  • What kinds of orgs are missing in the space?

POTENTIAL COLLABORATIONS

  • Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper or the Sleeper Agents paper, understanding the impacts of synthetic data on LLM training, working on ELK-like research for LLMs, experiments on influence functions (studying the base model and its SFT, RLHF, iterative training counterparts; I heard that Anthropic is releasing code for this "soon") or studying the interpolation/extrapolation distinction in LLMs.
  • I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
  • I'm slowly working on a guide to practical research productivity for alignment researchers, covering low-hanging fruit that can quickly improve productivity in the field. I'd like feedback from people with solid track records and from productivity coaches.

TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH

  • Strong math background, can understand Influence Functions enough to extend the work.
  • Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
  • Strong application development background. I have various project ideas that could speed up alignment researchers; I'd be able to execute them much faster if I had someone to help me build my ideas fast. 

Do we expect future model architectures to be biased toward out-of-context reasoning (reasoning internally rather than in a chain-of-thought)? As in, what kinds of capabilities would lead companies to build models that reason less and less in token-space?

I mean, the first obvious thing would be that you are training the model to internalize some of the reasoning rather than having to pay for the additional tokens each time you want to do complex reasoning.

The thing is, I expect we'll eventually move away from just relying on transformers with scale. And so I'm trying to refine my understanding of which capabilities are simply bottlenecked in this paradigm and will need to be unlocked by model builders through architectural and algorithmic improvements. (Of course, based on my previous posts, I still think data is a big deal.)

Anyway, this kind of thinking eventually leads to the infohazardous area of, "okay then, what does the true AGI setup look like?" This is really annoying because it has alignment implications. If we start to move increasingly towards models that are reasoning outside of token-space, then alignment becomes harder. So, are there capability bottlenecks that eventually get resolved through something that requires out-of-context reasoning?

So far, it seems like the current paradigm will not be an issue on this front. Keep scaling transformers, and you don't really get any big changes in the model's likelihood of using out-of-context reasoning.

This is not limited to out-of-context reasoning. I'm trying to get a better understanding of the (dangerous) properties future models may develop simply as a result of needing to break through a capability bottleneck. My worry is that many people are over-indexing on the current transformer+scale paradigm (which may turn out to be insufficient for ASI) and, as a result, aren't working on the right kinds of alignment or governance projects.

---

I'm unsure how big of a deal this architecture will end up being, but the rumoured xLSTM just dropped. It seemingly outperforms other models at the same size.

Maybe it ends up just being another drop in the bucket, but I think we will see more attempts in this direction.

Claude summary:

The key points of the paper are:

  1. The authors introduce exponential gating with memory mixing in the new sLSTM variant. This allows the model to revise storage decisions and solve state tracking problems, which transformers and state space models without memory mixing cannot do.
  2. They equip the mLSTM variant with a matrix memory and covariance update rule, greatly enhancing the storage capacity compared to the scalar memory cell of vanilla LSTMs. Experiments show this matrix memory provides a major boost.
  3. The sLSTM and mLSTM are integrated into residual blocks to form xLSTM blocks, which are then stacked into deep xLSTM architectures.
  4. Extensive experiments demonstrate that xLSTMs outperform state-of-the-art transformers, state space models, and other LSTMs/RNNs on language modeling tasks, while also exhibiting strong scaling behavior to larger model sizes.

This work is important because it presents a path forward for scaling LSTMs to billions of parameters and beyond. By overcoming key limitations of vanilla LSTMs - the inability to revise storage, limited storage capacity, and lack of parallelizability - xLSTMs are positioned as a compelling alternative to transformers for large language modeling.

Instead of doing all computation step-by-step as tokens are processed, advanced models might need to store and manipulate information in a compressed latent space, and then "reason" over those latent representations in a non-sequential way.

The exponential gating with memory mixing introduced in the xLSTM paper directly addresses this need. Here's how:

  1. Exponential gating allows the model to strongly update or forget the contents of each memory cell based on the input. This is more powerful than the bounded sigmoid gating in vanilla LSTMs. It means the model can decisively revise its stored knowledge as needed, rather than being constrained to incremental changes. This flexibility is crucial for reasoning, as it allows the model to rapidly adapt its latent state based on new information.
  2. Memory mixing means that each memory cell is updated using a weighted combination of the previous values of all cells. This allows information to flow and be integrated between cells in a non-sequential way. Essentially, it relaxes the sequential constraint of traditional RNNs and allows for a more flexible, graph-like computation over the latent space.
  3. Together, these two components endow the xLSTM with a dynamic, updateable memory that can be accessed and manipulated "outside" the main token-by-token processing flow. The model can compress information into this memory, "reason" over it by mixing and gating cells, then produce outputs guided by the updated memory state.

In this way, the xLSTM takes a significant step towards the kind of "reasoning outside token-space" that I suggested would be important for highly capable models. The memory acts as a workspace for flexible computation that isn't strictly tied to the input token sequence.

Now, this doesn't mean the xLSTM is doing all the kinds of reasoning we might eventually want from an advanced AI system. But it demonstrates a powerful architecture for models to store and manipulate information in a latent space, at a more abstract level than individual tokens. As we scale up this approach, we can expect models to perform more and more "reasoning" in this compressed space rather than via explicit token-level computation.
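
To make the gating and memory-mixing story above a bit more concrete, here's a rough PyTorch sketch of a single sLSTM-style cell. This is my own simplification for illustration, not the paper's reference implementation; the class name, the use of a single full recurrent matrix (the paper restricts mixing to blocks/heads), and the normalizer clamp are all assumptions on my part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSLSTMCell(nn.Module):
    """Illustrative sLSTM-style cell: exponential input gating plus memory
    mixing through recurrent weights. A sketch, not the paper's reference code."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # Pre-activations for input gate i, forget gate f, cell input z, output gate o.
        # The recurrent projection R is what "mixes" information across memory cells:
        # every gate sees the whole previous hidden state, not just its own cell.
        self.W = nn.Linear(input_size, 4 * hidden_size)
        self.R = nn.Linear(hidden_size, 4 * hidden_size, bias=False)

    def forward(self, x_t, state):
        h_prev, c_prev, n_prev, m_prev = state
        i_pre, f_pre, z_pre, o_pre = torch.chunk(
            self.W(x_t) + self.R(h_prev), 4, dim=-1
        )

        # Exponential input gate, kept numerically stable with a log-space
        # stabilizer state m_t (so exp() never sees huge arguments).
        log_i = i_pre
        log_f = F.logsigmoid(f_pre)  # sigmoid forget gate, in log space
        m_t = torch.maximum(log_f + m_prev, log_i)
        i_t = torch.exp(log_i - m_t)
        f_t = torch.exp(log_f + m_prev - m_t)

        z_t = torch.tanh(z_pre)
        o_t = torch.sigmoid(o_pre)

        # Cell state plus a normalizer state: exponential gates can make updates
        # large, so the hidden state is normalized by n_t before output gating.
        c_t = f_t * c_prev + i_t * z_t
        n_t = f_t * n_prev + i_t
        h_t = o_t * (c_t / n_t.clamp(min=1e-6))
        return h_t, (h_t, c_t, n_t, m_t)
```

The point of the sketch is just to show where the two ingredients live: exponential gating is the `exp` on the (stabilized) input and forget gates, which lets a single step decisively overwrite a cell, and memory mixing is the `R` projection, which lets every cell's update depend on all the other cells from the previous step.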

I'm going to assume that Shane Legg has thought about it more and read more of the existing work than many of us combined. Certainly, there are smart people who haven't thought about it much, but Shane is definitely not one of them. He only had a short 5-minute talk, but I do hope to see a longer treatment of how he expects we will fully solve necessary property 3.

From a Paul Christiano talk called "How Misalignment Could Lead to Takeover" (from February 2023):

Assume we're in a world where AI systems are broadly deployed and everything has become increasingly complex, with humans knowing less and less about how things work.

A viable strategy for AI takeover is to wait until there is certainty of success. If a 'bad AI' is smart, it will realize that a takeover attempt won't succeed yet, so it won't try, and it will look like it's not a problem.

So you lose when a takeover becomes possible and some threshold fraction of your AIs behaves badly. If all the smartest AIs you produce end up having the same problem, then takeover becomes more likely.

In other words, the smartest AIs will only attempt a takeover if all the other AIs will also try to take over.

Additionally, this is more likely to happen in an unstable world, which could come about through:

  • War (or other exogenous shocks): making it harder for humans to detect and respond to problems.
  • Cascading loss of control: some AIs attempt a takeover and aren't fully successful, but they make the world increasingly crazy, which cascades into other AIs (which previously thought a takeover would be unsuccessful) now joining in.
  • Models interacting with each other: once some AIs realize a takeover could be successful, they may want to share that information with other models to see if those models want to join in.

There's this tempting feeling of training your AI against these takeover attempts ("hey, look at this bad AI who tried to take over the data center, don't do that!"), but you may just be training your model to learn that it needs to go much bigger if it wants to actually succeed at a takeover attempt.

Paul believes that if this kind of phenomenon is real, we can get compelling demonstrations of it in a lab (though it would take some imagination to bridge the gap between examples in the lab and in the wild). We'll still get demonstrations in the wild, but it's unclear whether they will be big enough to make humanity reconsider things.

Not that I know of, but I will at least consider periodically pinging him on X (if this post gets enough people’s attention).

EDIT: Shane did like my tweet (https://x.com/jacquesthibs/status/1785704284434129386?s=46), which contains a link to this post and a screenshot of your comment.

Just wanted to note that I had a similar question here.

Also, DM me if you want to collaborate on making this a real project. I've been slowly working towards something like this, but I expect to focus more on it in the coming months. I'd like to have something like a version 1.0 in 2-3 months from now. I appreciate you starting this thread, as I think it's ideal for this to be a community effort. My goal is to feed this stuff into the backend of an alignment research assistant system.

Hey Bogdan, I'd be interested in doing a project on this or at least putting together a proposal we can share to get funding.

I've been brainstorming new directions (with @Quintin Pope) this past week, and we think it would be good to use/develop some automated interpretability techniques we can then apply to a set of model interventions to see if there are techniques we can use to improve model interpretability (e.g. L1 regularization).

I saw the MAIA paper, too; I'd like to look into it some more.

Anyway, here's a related blurb I wrote:

Project: Regularization Techniques for Enhancing Interpretability and Editability

Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to test how the different regularization techniques affect the model's interpretability.

In some sense, this research is similar to the work Anthropic did with SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have hidden the superposition away in other parts of the network, making SoLU unhelpful for making the models more interpretable.

That said, we can increase our ability to interpret these models through regularization techniques. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations. Sparse models tend to be more interpretable as they rely on a smaller set of important features.
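
As a rough sketch of what the L1 part could look like in practice (assuming a HuggingFace-style causal LM; the function name, the choice to penalize hidden states, and the coefficient are all illustrative choices rather than a worked-out method):

```python
import torch


def lm_loss_with_l1_activation_penalty(model, batch, l1_coeff: float = 1e-4):
    """Standard next-token loss plus an L1 penalty on intermediate activations,
    nudging the model toward sparser (hopefully more interpretable) representations.
    Sketch only; which activations to penalize and how strongly are open questions."""
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["input_ids"],
        output_hidden_states=True,
    )
    lm_loss = outputs.loss

    # Mean absolute activation across all layers' hidden states.
    # Variants: penalize MLP activations only, or weights instead of activations.
    l1_penalty = torch.stack(
        [h.abs().mean() for h in outputs.hidden_states]
    ).sum()

    return lm_loss + l1_coeff * l1_penalty
```

We'd then compare checkpoints trained with and without the penalty using the automated interpretability methods mentioned above, alongside standard capability evals to track the performance cost.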

Whether this works or not, I'd be interested in making more progress on automated interpretability, in ways similar to what you are proposing.

I'm hoping to collaborate with some software engineers who can help me build an alignment research assistant. Some (a little bit outdated) info here: Accelerating Alignment. The goal is to augment alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.

What I have in mind also relates to this post by Abram Demski and this post by John Wentworth (with a top comment by me).

Send me a DM if you (or any good engineer) are reading this.

Hey Abram! I appreciate the post. We've talked about this at length, but this was still really useful feedback and re-summarization of the thoughts you shared with me. I've written up notes and will do my best to incorporate what you've shared into the tools I'm working on.

Since we last spoke, I've been focusing on technical alignment research, but I will dedicate a lot more time to LLMs for Alignment Research in the coming months.

For anyone reading this: If you are a great safety-minded software engineer and want to help make this vision a reality, please reach out to me. I need all the help I can get to implement this stuff much faster. I'm currently consolidating all of my notes based on what I've read, interviews with other alignment researchers, my own notes about what I'd find useful in my research, etc. I'll be happy to share those notes with people who would love to know more about what I have in mind and potentially contribute.
