We are thrilled to introduce Leap Labs, an AI startup. We’re building a universal interpretability engine.

We design robust, model-agnostic interpretability methods. Together, these methods form our end-to-end interpretability engine, which takes in a model – or, ideally, a model and its training dataset (or some representative portion thereof) – and returns human-parseable explanations of what the model ‘knows’.
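
To make that concrete, here is a minimal, hypothetical sketch of the kind of interface such an engine might expose. All names and signatures below are illustrative assumptions for this post, not Leap Labs’ actual API.

```python
# Hypothetical sketch only – illustrative names, not an actual API.
from dataclasses import dataclass, field
from typing import Any, List, Optional, Sequence


@dataclass
class Explanation:
    """One human-parseable statement about what a model has learned."""
    description: str  # e.g. "predictions for class 'wolf' rely heavily on snowy backgrounds"
    evidence: dict = field(default_factory=dict)  # prototypes, attributions, example inputs, etc.


def interpret(model: Any, dataset: Optional[Sequence] = None) -> List[Explanation]:
    """Model-agnostic entry point: takes a trained model and, ideally, a
    representative sample of its training data, and returns human-parseable
    explanations of what the model 'knows'."""
    explanations: List[Explanation] = []
    # 1. Probe the model directly (e.g. generate prototypical inputs per output class).
    # 2. If a dataset is supplied, cross-reference what the model has learned against
    #    the data to surface spurious correlations and likely failure modes.
    # 3. Aggregate the findings into Explanation objects for human review.
    return explanations
```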

Research Ethos:

  • Reproducible and generalisable approaches win. Interpretability algorithms should produce consistent outputs regardless of any random initialisation (see the sketch after this list). Future-proof methods make minimal assumptions about model architectures and data types. We’re building interpretability for next year’s models.
  • Relatedly, heuristics aren’t enough. Hyperparameters should always be theoretically motivated. It’s not enough that some method or configuration works well in practice. (Or, even worse, that it’s tweaked to get a result that looks sensible to humans.) We find out why.
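
As a small illustration of the reproducibility point above, here is a minimal, hypothetical check for whether an interpretability method’s outputs are stable across random seeds. `attribution_method` is a placeholder for any method whose output depends on random initialisation, not a reference to our implementation.

```python
# Hypothetical consistency check – `attribution_method` is a placeholder for any
# interpretability method whose output depends on a random initialisation.
import numpy as np


def seed_consistency(attribution_method, model, inputs, seeds=(0, 1, 2, 3, 4)):
    """Run the method under several seeds and return the mean pairwise
    correlation of its outputs. A reproducible method should score near 1."""
    outputs = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        outputs.append(np.asarray(attribution_method(model, inputs, rng=rng)))
    flat = np.stack(outputs).reshape(len(seeds), -1)   # (n_seeds, n_features)
    corr = np.corrcoef(flat)                           # pairwise agreement between seeds
    return corr[np.triu_indices(len(seeds), k=1)].mean()
```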

Aims:

  • We must grow interpretability and AI safety in the real world. Leap is a for-profit company incorporated in the US, and the plan is to scale quickly and to hire and upskill researchers and engineers – for AI alignment to make progress, we need more meaningful jobs for alignment researchers nearly as much as we need the researchers themselves.
  • Slow potentially dangerous broad domain systems. Public red-teaming is a means of change. Robust interpretability methods make discovering failure modes easier. We demonstrate the fragility of powerful and opaque systems, and push for caution.
  • Speed potentially transformative narrow domain systems. AI for scientific progress is an important side quest. Interpretability is the backbone of knowledge discovery with deep learning, and has huge potential to advance basic science by making legible the complex patterns that machine learning models identify in huge datasets.
  • Regulation is coming – let’s use it. We predict that governments and companies will begin to regulate and audit powerful models more explicitly, at the very least from a bias-prevention viewpoint. We want to make sure that these regulations actually make models safer, and that audits are grounded in (our) state-of-the-art interpretability work.
  • Interpretability as standard. Robust interpretability, failure mode identification and knowledge discovery should be a default part of all AI development. Ultimately, we will put a safety-focussed interpretability system in the pipeline of every leading AI lab.

We are currently seeking funding/investment. Contact us here.
 

Comments:

This is a for-profit company, and you're seeking investment as well as funding to reduce x-risk. Given that, how do you expect to monetise this in the future? (Note: I think this is well worth funding for altruistic reduce-x-risk reasons)

Relatedly, have you considered organizing the company as a Public Benefit Corporation, so that the mission and impact is legally protected alongside shareholder interests?

We're looking into it!

This isn’t set in stone, but likely we’ll monetise by selling access to the interpretability engine via an API. I imagine we’ll offer free or subsidised access to select researchers/orgs. Another route would be to open-source all of it and monetise by offering a paid, hosted version with integration support, etc.

The research ethos seems like it could easily be used to justify research that appears to be safety-oriented, but actually advances capabilities.

Have you considered how your interpretability tool can be used to increase capability?

What processes are in place to ensure that you are not making the problem worse?

Good questions. Doing any kind of technical safety research that leads to better understanding of state-of-the-art models carries with it the risk that by understanding models better, we might learn how to improve them. However, I think that the safety benefit of understanding models outweighs the risk of small capability increases, particularly since any capability increase is likely heavily skewed towards model-specific interventions (e.g. "this specific model trained on this specific dataset exhibits bias x in domain y, and could be improved by retraining with more varied data from domain y", rather than "the performance of all models of this kind could be improved with some intervention z"). I’m thinking about this a lot at the moment and would welcome further input.

This is a fantastic thing to do. If interpretability is to actually help in any way as regards AGI, it needs to be the kind of thing that is already being used and stress-tested in prod long before AGI comes around.

What kind of license are you looking at for the engine?

Thanks! Unsure as of yet – we could either keep it proprietary and provide access through an API (with some free version for select researchers), or open source it and monetise by offering a paid, hosted tier with integration support. Discussions are ongoing. 

Thanks for the post. I'll be excited to watch what happens. Feel free to keep me in the loop. Some reactions:

"We must grow interpretability and AI safety in the real world."

Strong +1 to working on more real-world-relevant approaches to interpretability. 

"Regulation is coming – let’s use it."

Strong +1 as well. Working on incorporating interpretability into regulatory frameworks seems neglected by the AI safety interpretability community in practice. This does not seem to be the focus of work on internal eval strategies, but AI safety seems unlikely to be something that has a once-and-for-all solution, so governance seems to matter a lot in the likely case of a future with highly prolific TAI. And because of the pace of governance, work now to establish concern, offices, precedent, case law, etc. seems uniquely key.

"Speed potentially transformative narrow domain systems. AI for scientific progress is an important side quest. Interpretability is the backbone of knowledge discovery with deep learning, and has huge potential to advance basic science by making legible the complex patterns that machine learning models identify in huge datasets."

I do not see the reasoning or motivation for this, and it seems possibly harmful.

First, developing basic insights is clearly not just an AI safety goal. It's an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good. They are heavy-tailed in both directions. This seems like possible safety washing. But to be fair, this is a critique I have of a ton of AI alignment work including some of my own. 

Second, I don’t know of any examples of gaining particularly useful domain knowledge from interpretability-related things in deep learning, other than maybe the predictiveness of non-robust features. Another possible example could be using deep learning to find new algorithms for things like matrix multiplication, but this isn’t really "interpretability". Do you have other examples in mind? Progress in the last 6 years on reverse-engineering nontrivial systems has seemed tenuous at best.

So I'd be interested in hearing more about whether/how you expect this one type of work to be robustly good and what is meant by "Interpretability is the backbone of knowledge discovery with deep learning."
 

Thanks for the comment! I'll respond to the last part:

"First, developing basic insights is clearly not just an AI safety goal. It's an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good."

I think this could certainly be the case if we were trying to build state of the art broad domain systems, in order to use interpretability tools with them for knowledge discovery – but we're explicitly interested in using interpretability with narrow domain systems. 

"Interpretability is the backbone of knowledge discovery with deep learning": Deep learning models are really good at learning complex patterns and correlations in huge datasets that humans aren't able to parse. If we can use interpretability to extract these patterns in a human-parsable way, in a (very Olah-ish) sense we can reframe deep learning models as lenses through which to view the world, and to make sense of data that would otherwise be opaque to us.

Here are a few examples:

https://www.mdpi.com/2072-6694/14/23/5957

https://www.deepmind.com/blog/exploring-the-beauty-of-pure-mathematics-in-novel-ways

https://www.nature.com/articles/s41598-021-90285-5

Are you concerned about AI risk from narrow systems of this kind?

Thanks. 

"Are you concerned about AI risk from narrow systems of this kind?"

No. Am I concerned about risks from methods that work for this in narrow AI? Maybe. 

This seems quite possibly useful, and I think I see what you mean. My confusion is largely from my initial assumption that the focus of this specific point directly involved existential AI safety and from the word choice of "backbone" which I would not have used. I think we're on the same page. 