This is a for-profit company, and you're seeking investment as well as funding to reduce x-risk. Given that, how do you expect to monetise this in the future? (Note: I think this is well worth funding for altruistic reduce-x-risk reasons)
Relatedly, have you considered organizing the company as a Public Benefit Corporation, so that the mission and impact are legally protected alongside shareholder interests?
This isn't set in stone, but we'll likely monetise by selling access to the interpretability engine via an API. I imagine we'll offer free or subsidised access to select researchers/orgs. Another route would be to open-source all of it and monetise by offering a paid, hosted version with integration support, etc.
The research ethos seems like it could easily be used to justify research that appears to be safety-oriented, but actually advances capabilities.
Have you considered how your interpretability tool can be used to increase capability?
What processes are in place to ensure that you are not making the problem worse?
Good questions. Doing any kind of technical safety research that leads to a better understanding of state-of-the-art models carries the risk that, by understanding models better, we might learn how to improve them. However, I think the safety benefit of understanding models outweighs the risk of small capability increases, particularly since any capability increase is likely heavily skewed towards model-specific interventions (e.g. "this specific model trained on this specific dataset exhibits bias x in domain y, and could be improved by retraining with more varied data from domain y", rather than "the performance of all models of this kind could be improved with some intervention z"). I'm thinking about this a lot at the moment and would welcome further input.
This is a fantastic thing to do. If interpretability is to actually help in any way as regards AGI, it needs to be the kind of thing that is already being used and stress-tested in prod long before AGI comes around.
What kind of license are you looking at for the engine?
Thanks! Unsure as of yet – we could either keep it proprietary and provide access through an API (with some free version for select researchers), or open source it and monetise by offering a paid, hosted tier with integration support. Discussions are ongoing.
Thanks for the post. I'll be excited to watch what happens. Feel free to keep me in the loop. Some reactions:
We must grow interpretability and AI safety in the real world.
Strong +1 to working on more real-world-relevant approaches to interpretability.
Regulation is coming – let’s use it.
Strong +1 as well. Incorporating interpretability into regulatory frameworks seems neglected in practice by the AI safety interpretability community. It does not seem to be the focus of work on internal eval strategies, but AI safety is unlikely to have a once-and-for-all solution, so governance seems to matter a lot in the likely case of a future with highly prolific TAI. And given the slow pace of governance, work now to establish concern, offices, precedent, case law, etc. seems uniquely key.
Speed potentially transformative narrow domain systems. AI for scientific progress is an important side quest. Interpretability is the backbone of knowledge discovery with deep learning, and has huge potential to advance basic science by making legible the complex patterns that machine learning models identify in huge datasets.
I do not see the reasoning or motivation for this, and it seems possibly harmful.
First, developing basic insights is clearly not just an AI safety goal. It's an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good. They are heavy-tailed in both directions. This seems like possible safety washing. But to be fair, this is a critique I have of a ton of AI alignment work, including some of my own.
Second, I don't know of any examples of gaining particularly useful domain knowledge from interpretability-related things in deep learning, other than maybe the predictiveness of non-robust features. Another possible example could be using deep learning to find new algorithms for things like matrix multiplication, but this isn't really "interpretability". Do you have other examples in mind? Progress in the last 6 years on reverse-engineering nontrivial systems has seemed tenuous at best.
So I'd be interested in hearing more about whether/how you expect this one type of work to be robustly good and what is meant by "Interpretability is the backbone of knowledge discovery with deep learning."
Thanks for the comment! I'll respond to the last part:
"First, developing basic insights is clearly not just an AI safety goal. It's an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good."
I think this could certainly be the case if we were trying to build state of the art broad domain systems, in order to use interpretability tools with them for knowledge discovery – but we're explicitly interested in using interpretability with narrow domain systems.
"Interpretability is the backbone of knowledge discovery with deep learning": Deep learning models are really good at learning complex patterns and correlations in huge datasets that humans aren't able to parse. If we can use interpretability to extract these patterns in a human-parsable way, in a (very Olah-ish) sense we can reframe deep learning models as lenses through which to view the world, and to make sense of data that would otherwise be opaque to us.
Here are a couple of examples:
https://www.mdpi.com/2072-6694/14/23/5957
https://www.deepmind.com/blog/exploring-the-beauty-of-pure-mathematics-in-novel-ways
https://www.nature.com/articles/s41598-021-90285-5
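To make the "lens" framing a bit more concrete, here is a minimal, illustrative sketch (toy data, vanilla input-gradient attribution in PyTorch; nothing Leap-Labs-specific): train a small model on a dataset, then use an attribution method to surface which input features its predictions depend on, i.e. extract the pattern it has learned in a human-parsable form. Real knowledge-discovery work, like the examples linked above, uses much richer methods, but the input/output shape is the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: 1,000 samples, 5 features; the "pattern" in the data depends
# mostly on features 0 and 3.
X = torch.randn(1000, 5)
y = (2.0 * X[:, 0] - 1.5 * X[:, 3] > 0).float().unsqueeze(1)

model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

# Fit the model to the data.
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# Attribution: mean absolute input gradient per feature. Large values mark
# the features the model's predictions depend on -- the learned pattern,
# extracted in a form a domain expert can inspect.
X_attr = X.clone().requires_grad_(True)
model(X_attr).sum().backward()
importance = X_attr.grad.abs().mean(dim=0)
for i, score in enumerate(importance.tolist()):
    print(f"feature {i}: {score:.3f}")
```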
Are you concerned about AI risk from narrow systems of this kind?
Thanks.
"Are you concerned about AI risk from narrow systems of this kind?"
No. Am I concerned about risks from methods that work for this in narrow AI? Maybe.
This seems quite possibly useful, and I think I see what you mean. My confusion largely came from my initial assumption that the focus of this specific point directly involved existential AI safety, and from the word choice of "backbone", which I would not have used. I think we're on the same page.
We are thrilled to introduce Leap Labs, an AI startup. We’re building a universal interpretability engine.
We design robust interpretability methods with a model-agnostic mindset. These methods in concert form our end-to-end interpretability engine. This engine takes in a model, or ideally a model and its training dataset (or some representative portion thereof), and returns human-parseable explanations of what the model ‘knows’.
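Purely as an illustration of that input/output shape (the names below are hypothetical placeholders, not our actual API, and the single "finding" computed is a deliberately trivial stand-in for real interpretability methods): a model, plus a representative data sample, goes in; a list of human-parsable findings comes out.

```python
from dataclasses import dataclass

import torch

@dataclass
class Finding:
    summary: str  # a plain-language statement about what the model "knows"

def interpret(model: torch.nn.Module, data: torch.Tensor) -> list[Finding]:
    """Hypothetical engine entry point: model (+ representative data) in,
    human-parsable findings out. Real methods would go far beyond this."""
    with torch.no_grad():
        preds = model(data).argmax(dim=-1)
    counts = torch.bincount(preds)
    dominant = int(counts.argmax())
    share = counts.max().item() / len(preds)
    return [Finding(f"The model predicts class {dominant} on {share:.0%} of the sample.")]

# Usage (assuming `my_model` and `sample_inputs` exist):
# for finding in interpret(my_model, sample_inputs):
#     print(finding.summary)
```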
We are currently seeking funding/investment. Contact us here.