At Apollo, we have spent some time weighing the pros and cons of the for-profit vs. non-profit approach so it might be helpful to share some thoughts.
In short, I think you need to make really sure that your business model is aligned with what increases safety. I think there are plausible cases where people start with good intentions but insufficient alignment between the business model and the safety research that would be the most impactful use of their time where these two goals diverge over time.
For example, one could start as an organization that builds a product but merely as a means to subsidize safety research. However, when they have to make tradeoffs, these organizations might choose to focus more talent on product because it is instrumentally useful or even necessary for the survival of the company. The forces that pull toward profit (e.g. VCs, status, growth) are much more tangible than the forces pulling towards safety. Thus, I could see many ways in which this goes wrong.
A second example: Imagine an organization that builds evals and starts with the intention of evaluating the state-of-the-art models because they are most likely to be risky. Soon they realize that there are only a few orgs that build the best models and there are a ton of customers that work with non-frontier systems who'd be willing to pay them a lot of money to build evals for their specific application. Thus, the pull toward doing less impactful but plausibly more profitable work is stronger than the pull in the other direction.
Lastly, one thing I'm somewhat afraid of is that it's very easy to rationalize all of these decisions in the moment. It's very easy to say that a strategic shift toward profit-seeking is instrumentally useful for the organization, growth, talent, etc. And there are cases in which this is true. However, it's easy to continue such a rationalization spree and maneuver yourself into some nasty path dependencies. Some VCs only came on for the product, some hires only want to ship stuff, etc.
In conclusion, I think it's possible to do profitable safety work but it's hard. You should be confident that your two goals are compatible when things get hard, you should have a team and culture that can resist the pulls and even produce counter pulls when you're not doing safety-relevant work and you should only work with funders who fully understand and buy into your true mission.
It seems like all of those points are of the form "you could do better alignment work if you didn't worry about profits". Which is definitely true. But only if you have some other source of funding. Since alignment work is funding-constrained, that mostly isn't true.
So, what's the alternative? Work a day job and work nights on alignment?
An important factor that should go into this calculation (not just for you or your org but for anyone) is the following: given that AI safety is currently quite severely funding-constrained (just look at the examples of projects that are not getting funded right now), I think people should assess their own scientific calibre relative to other people in technical AI safety who will seek for funding.
It's not a black-and-white choice between doing technical AI safety research, or AI governance/policy/advocacy, or not contributing to reducing the AI risk at all. The relevant 80000 hours page perpetuates this view and therefore is not serving the cause well in this regard.
For people with more engineering, product, and business dispositions I believe there are many ways to help some to reduce the AI risk, many of which I referred to in other comments on this page, and here. And we should do a better job at laying out these paths for people, a-la "Work on Climate for AI risks".
Thanks Marius, definitely agreed that business model alignment is critical here, and that culture and investors matter a bunch in determining the amount of impact an org has.
This is an interesting point. I also feel like the governance model of the org and culture of mission alignment with increasing safety is important, in addition to the exact nature of the business and business model at the time the startup is founded. Looking at your examples, perhaps by “business model” you are referring both to what brings money in but also the overall governance/decision-making model of the organization?
You may want to connect with Lionheart Ventures (https://www.lionheart.vc/), they focus on investments that reduce x-risk.
Robustness approaches
@Soroush Pour was working on a startup Harmony Intelligence in all these areas, you may be interested to talk to him.
All the ideas below are a bit early given the lack of economically valuable agents that currently exist. However, I would wager that the next iteration of LLMs (e.g. GPT-5) will unlock a world of enterprise automation, as well as be able to perform basic consumer tasks like booking a flight or scheduling a dinner with friends.
I think GPT-5 level capabilities are not even required for that. It just requires iteration and optimisation of "LLM programs" (or LM agent architectures, no bright line between these) in an automated way. There are already works in this direction, e.g., Promptbreeder, Self-Taught Optimizer, and other works.
All this suffices GPT-4 capabilities, and I think it will enter mainstream (including the industry) in 2024. Josh Albrecht (CTO of Imbue) alludes to this here (and, apparently, Imbue works on productising something like their own version of Self-Taught Optimizer).
Agent testing environments
This is similar to building testing software for LLMs, but once systems become agentic / multi-step, it’s even harder to build test cases. More importantly, one would likely need to be able to easily build agent environments and set them up and tear them down automatically in addition to managing test cases successfully.
Incidentally, Imbue is also working on this: see Avalon, and Josh has also said that they are planning to add language capabilities to this environment.
Language model agents do seem like an enormous business opportunity. They're a way to fund alignment research if you think that's a worthwile progress/alignment tradeoff. I also agree with that Christiano post arguing that progressing language model agents is net-neutral or positive, for the reasons he gives and more. But it's highly debatable.
I wish I could get someone to debate it.
I have a whole schpiel on why and how language model agents are the best shot we'll get at alignment. These arguments seem obvious, and nobody has given me counterarguments that there's a better shot. So I'll keep trying to drum up more interest until someone explains to me why I'm getting it wrong.
Great article! Just reached out. A couple ideas I want to mention are working on safer models directly (example: https://www.lesswrong.com/posts/JviYwAk5AfBR7HhEn/how-to-control-an-llm-s-behavior-why-my-p-doom-went-down-1), which for smaller models might not be cost prohibitive to make progress on. There’s also building safety-related cognitive architecture components that have commercial uses. For example, world model work (example: https://www.lesswrong.com/posts/nqFS7h8BE6ucTtpoL/let-s-buy-out-cyc-for-use-in-agi-interpretability-systems) or memory systems (example: https://www.lesswrong.com/posts/FKE6cAzQxEK4QH9fC/qnr-prospects-are-important-for-ai-alignment-research). My work is trying to do a few of these things concurrently (https://www.lesswrong.com/posts/caeXurgTwKDpSG4Nh/safety-first-agents-architectures-are-a-promising-path-to).
Responded! And thanks for sharing, will check out those posts. Really enjoyed the first one which I read the other day.
One way to advance the state of AI safety research is to build a company focused on automating work (such as a recruiter phone screen or talk therapy) and building an organization with safety at its core like Anthropic. This only works if it’s critical to do safety research to advance this organization’s capabilities. For example, automating a recruiter phone screen would likely require a high degree of explainability / interpretability (especially with respect to bias) in automating a decision, and automating talk therapy would require scalable oversight research to make sure the therapist is reaching the right conclusions.
I think Inflection is sort of like this ("talk therapy" and "creating best friend and companion" are very similar things). And Mustafa Suleyman seems to me a safety-conscious person.
Cybersecurity approaches
I think you are missing a few more important directions that you would call "security approaches", and I call "digital trust infrastructure": "decentralised identity, secure communication (see Layers 1 and 2 in Trust Over IP Stack), proof-of-humanness, proof of AI (such as, a proof that such and such artifact is created with such and such agent, e.g., provided by OpenAI -- watermarking failed, so need new robust solutions with zero-knowledge proofs)."
Stretching this even further, reputation and "Web of Trust" systems that we discuss with @mako yass in this thread are also probably important for creating the "stable equilibrium" of the civilisation on which AGI can land, and there are business opportunities there, such as combating spam on media platforms.
Further still from "AI safety" and towards differential technology development (newly branded as d/acc by Vitalik Buterin) and trying to create the aforementioned "stable equilibrium", we can talk about what Jim Rutt keeps calling a trillion-dollar opportunity, namely "info agents" that manage people's information intake. My Proposal for improving the global online discourse through personalised comment ordering on all websites is related to that, too.
Yeah I was giving a few examples of ideas there, but agreed that identity and proof of human/AI are important problems to tackle, and that the topic of filtering out harmful information and sorting through the sea of info can be a big opportunity.
Deterministic framework for agents
One of the primary blockers to using LLMs in production is that they can have all sorts of unexpected behavior. This will only increase in worlds where agents are widespread, which will be a big problem for companies looking to deploy LM agents. The idea here is to build a developer framework that puts LM agents on heavy guardrails, strictly defining the set of actions an LM agent can take given its state and environment. This set of actions will be deterministic, well understood by the developer, easy to use, and human legible. If this becomes the de facto standard for building agents in enterprise use cases, it will be a safer future.
Speaking of restricting LM agents/LM behaviour, Towards Realistic ODDs For Foundation Model Based AI Offerings project is precisely about that, though I don't know how to turn it into a business that is independent of a leading LLM vendor (OpenAI, Anthropic, Google).
But I think we shouldn't necessary constrain ourselves to "LM agents". Even though this is the most capable paradigm at the moment, it's not clear whether it will remain so in the future. E.g., Yann LeCun keeps repeating that LLMs as a paradigm are "doomed" (as a path to superior, human+ level of competence and reasoning robustness) and bets on his hierarchical representation learning and prediction agent architecture (H-JEPA). There are also other approaches, e.g., OpenCog Hyperon, or the approach of www.liquid.ai (from their name, as well as some earlier interviews of Joscha Bach who is apparently a part of their team, I can deduce that their approach involves artificial neurons who can reassign their connectivity with other neurons, somewhat in the spirit of Cooperative GNNs).
And even if LeCun is wrong in his prediction, from the AI Safety perspective, it might not make sense to "join the race" towards LLM-based AGI if we consider it in some ways (at least, practically) irreparably uncontrollable or uninterpretable. There is still a big scientific question mark about this, though, as well as about LeCun's prediction.
If you actually belief that the LM paradigm towards ubiquitous agency in the economy and society is flawed (as I do), pursuing alternative AI paradigms, even thinking your chances of global success are small, would save you some "dignity points". And this is the stance that Verses.ai, Digital Gaia, Gaia Consortium, and Bioform Labs, are taking, advocating for and developing the paradigm of Bayesian agents. Though, the key arguments for this paradigm (vs. language modelling) is not interpretability or "local controllability/robustness", but rather losing out information necessary for reliable cooperation, credit assignment, and "global" controllability/robustness[1] through "mixing" of Bayesian reference frames into a single bundle (LLM). This perhaps sounds cryptic, sorry. This deserves a much longer discussion and hopefully we will publish something about this soon.
OpenAI’s GPT framework is a potential competitor here. Building great developer frameworks for LM agents is part of their vision of the future. I think they may be optimizing first for ease of use out of the box and consumer use cases which could make them neglect some key features in a framework like this, but they pose a substantial threat to this business.
In this context, I want to make an analogy between programming language/framework/IDE/tooling competition and the competition of AI/agent platforms.
Programming languages and frameworks are more often discussed and compared in terms of:
This aligns quite accurately with the lines of comparison for agent and AI platforms:
Through modern control theory and theory of feedback, see the work by John Doyle's group.
If you actually belief that the LM paradigm towards ubiquitous agency in the economy and society is flawed (as I do), pursuing alternative AI paradigms, even thinking your chances of global success are small, would save you some "dignity points". And this is the stance that Verses.ai, Digital Gaia, Gaia Consortium, and Bioform Labs, are taking, advocating for and developing the paradigm of Bayesian agents. Though, the key arguments for this paradigm (vs. language modelling) is not interpretability or "local controllability/robustness", but rather losing out information necessary for reliable cooperation, credit assignment, and "global" controllability/robustness[1] through "mixing" of Bayesian reference frames into a single bundle (LLM). This perhaps sounds cryptic, sorry. This deserves a much longer discussion and hopefully we will publish something about this soon.
Just to develop on this in the context of this post (how can we make something for-profit to advance AI safety?), I want to highlight a direction of thought that I didn't notice in your post: creating economic value by developing the mechanisms for multi-agent coordination and cooperation. This is what falls under "understanding cooperation" and "understanding agency" categories in this agenda list, although I'd replace "understanding" with "building" there.
Solving practical problems is a great way to keep the research grounded to reality, but also battle-testing it.
There are plenty economically valuable and neglected opportunities for improving coordination:
Apart from just creating better mechanisms and algorithms for coordination from building businesses in all these diverse verticals (and hoping that these coordination algorithms will be transferable to some abstract "AI coordination" or "human-AI coordination"), there is a macro-strategy of sharing information between all these domain specific models, thus creating a loosely coupled, multi-way mega-model for the world as a whole. Our bet at Gaia Consortium is that this "world model merge" is very important in ameliorating multi-polar risks that @Andrew_Critch written about here and also Dan Hendrycks generally refers to as "AI Race" risks.
The strategy that I described above is also highly aligned with Earth Systems Predictability vision ("a roadmap for a planetary nervous system") by Trillium Tech, which is also a quasi-for-profit org.
the capabilities evaluator are two separate entities, in which case I think that the auditor is more scalable and the better business, as testing for dangerous capabilities will likely be quite manual for some time.
Strong upvote for considering an auditing direction. A possible future I can imagine is that AI alignment auditing acts a major part of what preserves the world's modern infrastructure - as we have done the same in finance, engineering and cybersecurity. Additionally, it might be the case in the future that AI lab work that can be audited using a robust and practical auditing standards - connected to an alignment theory.[1] Is it possible to create an organisation now for AI auditing? I think yes, we need more organisations like Apollo Research.
Lastly, It might just be me, but evaluations seems to be a weaker version of audits? the finance world coins these processes as "reviews", the stakes are not as high compared to how external or internal audits are performed.[2]
I recently replied on this in a different post and unfortunately what is keeping us from this future is we do not have yet this alignment theory.
Reviews are conducted with less stricter standards eg. Conflicts of interest (COI) need not be declared, as it may require only legal inquiry/ advice (page 42 of AICPA) such making reviews prone to issues with COI.
Summary
This is a brain dump of some for-profit AI alignment organization ideas, along with context for why I believe a for-profit alignment organization can make a big contribution to AI safety. This is far from a complete list, and I welcome ideas and feedback. Also, if anyone wants to or is working on any of these ideas, I’d be happy to support in any way I can!
Context
I'm Eric, formerly co-founder of RippleMatch, an AI recruiting company with ~$80M raised, millions of users, and ~10% of the Fortune 500 as customers. I made the difficult decision to leave RippleMatch this year because I'm concerned about catastrophic risk from AI, and have been spending the last year thinking about ways to help. Given my background, I’ve been thinking a lot about for-profit ideas to help with alignment – many that can be VC-backed. Some of these ideas speak more directly to reducing catastrophic risk than others, but I think that all can put a founder in a strong position to help in the future.
Why I believe for-profit alignment orgs are valuable
I don’t think for-profit approaches are inherently better than building non-profits, pursuing government regulation, or other approaches, but I think that for-profit orgs can make a substantial impact while attracting a different pool of talent eager to work on the problem.
With VC dollars, a for-profit organization can potentially scale far more quickly than a non-profit. It could make a huge impact and not have its growth capped by donor generosity. As a result, there can be far more organizations working on safety in the ecosystem tapping into a different pool of resources. That said, any VC-backed company has a relatively low chance of success, so it’s a riskier approach.
Fundamentally, I believe that risk and compliance spend will grow extremely quickly over the coming decade, scaling with generative AI revenue. With comps in finance and cybersecurity, I’d guess that mid to high single digit percentages of overall AI spend will be on risk and compliance, which would suggest big businesses can be built here. Many startups tackling alignment will need to start by addressing short term safety concerns, but in doing so will position themselves to tackle long-term risks over time.
Onto the actual ideas!
Robustness approaches
Testing / benchmarking software
Test case management needs to look very different for LLMs compared to typical software. The idea is to sell companies deploying LLMs a SaaS platform with the ability to generate and manage test cases for their LLMs to make sure they are performing properly and ensure that performance doesn’t drift from version to version. This startup would also incorporate a marketplace of common benchmarks that companies can pull off the shelf if relevant to their use case (e.g. common adversarial prompts).
Currently, my impression is that most companies don’t use any software to manage their language model test suites, which is a problem given how often an LLM can fail to produce a good result.
Red-teaming as a service
Just as software companies penetration test their software, companies that use LLMs as well as companies who build frontier models will need to red-team their models with a wide variety of adversarial prompts. This would mostly test models for how they handle misuse and make them more robust against jailbreaking. Just as a proper penetration test employs both manual and automated penetration testing, this startup would require building / fine-tuning the best automated red-teaming LLM that likely draws on multiple frontier models, as well as employ the best manual red-teamers in the space. Enterprises would likely pay a subscription depending on their usage, which would likely be spiky.
There is a substantial appetite from labs building frontier models for red-teaming services, and it appears to me that red-teaming, evals, and data labeling sum up the services that labs are interested in at the current moment. I think that’s a small market, but the dollar values could be high for each individual customer.
Evals / auditing
Lots of folks are thinking about evals right now, and I agree the theory of change is strong. Any evals business right now would face a large amount of competition from nonprofits offering services for free as well as government audits that could be mandatory.
That said, I still think there's a substantial need for companies deploying LLMs as well as labs building frontier models to be audited for a bunch of things, like dangerous capabilities, misuse, security, bias, and compliance. Given how easy it is to fine-tune RLHF safeguards from models, I think it’s likely that all companies deploying frontier models, not just the model producers, will need to be audited and pay money to reduce risk.
The product will be a mix of software (think Vanta for SOC II) that tracks compliance across a number of practices that the company needs to adhere to, as well as services to audit for compliance and test for dangerous capabilities. There is a chance that the auditor (audit for compliance) and the capabilities evaluator are two separate entities, in which case I think that the auditor is more scalable and the better business, as testing for dangerous capabilities will likely be quite manual for some time.
Monitoring
There are a bunch of startups cropping up to monitor and observe language models in production, vying for what could be thought of as Datadog for LLMs. I think this is a big problem that will clearly exist as a software business, since language model monitoring is both tractable and very different than application monitoring. There are quite a few startups with ~$50M - $100M raised that were doing this for ML models in general (Arthur, Arize, etc) and have spent a bunch of time building out their services for LLMs, and this startup would also face a bunch of competition from APM companies like Datadog and New Relic who are well positioned from a customer perspective to own this space.
The product would ingest every inference / response of an LLM, make it easy to create dashboards to monitor for certain behaviors / failures, and incorporate an API that can surface problems either in dashboards or in real time escalating to humans. It would be able to measure model drift and help developers debug the behavior of their LLM application.
I think the safety case here would have some similarity to that of evals, where we may get a warning shot through monitoring of dangerous capabilities in production. Most of the behaviors that companies would monitor for would be mundane, however.
AI agents approaches
All the ideas below are a bit early given the lack of economically valuable agents that currently exist. However, I would wager that the next iteration of LLMs (e.g. GPT-5) will unlock a world of enterprise automation, as well as be able to perform basic consumer tasks like booking a flight or scheduling a dinner with friends. As soon as this is possible, companies will be incredibly incentivized to pursue these use cases because they have the potential to heavily reduce the cost of their labor. It also opens up a world of multi-agent interaction and plenty of safety problems.
I think a world where agents are widespread and performing tasks on our behalf is coming soon, so building safer agents in that world is helpful. I’m also inspired by a Paul Christiano post that discusses advancing agent capabilities as neutral / positive, and so advancing safer agents seems pretty good overall to me.
Agent testing environments
This is similar to building testing software for LLMs, but once systems become agentic / multi-step, it’s even harder to build test cases. More importantly, one would likely need to be able to easily build agent environments and set them up and tear them down automatically in addition to managing test cases successfully.
The primary issue here is technical – is it possible to build a solution that fits the use cases of most companies given that the environments that these agents will be expected to perform in will be incredibly diverse?
Deterministic framework for agents
One of the primary blockers to using LLMs in production is that they can have all sorts of unexpected behavior. This will only increase in worlds where agents are widespread, which will be a big problem for companies looking to deploy LM agents. The idea here is to build a developer framework that puts LM agents on heavy guardrails, strictly defining the set of actions an LM agent can take given its state and environment. This set of actions will be deterministic, well understood by the developer, easy to use, and human legible. If this becomes the de facto standard for building agents in enterprise use cases, it will be a safer future.
OpenAI’s GPT framework is a potential competitor here. Building great developer frameworks for LM agents is part of their vision of the future. I think they may be optimizing first for ease of use out of the box and consumer use cases which could make them neglect some key features in a framework like this, but they pose a substantial threat to this business.
That said, interoperability is potentially quite important to companies, and building a model agnostic framework could have advantages.
Cybersecurity approaches
Security agent
As LM agents get more capable, they’ll also improve their abilities to find exploits in security systems. I expect the number and quality of AI scams phishing attempts, vulnerability scans, etc to increase sharply. To combat this, everyone should have their own security agent on each of their devices. This agent will stay silent in the background, watching the user’s activity, reading emails, and listening to calls until it detects a problem. Upon detection, the agent can intervene loudly by popping up a warning with advice on how to proceed or shutting off the interaction, or it can intervene softly by escalating a notice to a company’s IT team.
One could also imagine consumer applications, where someone may want to ensure that grandma doesn’t fall for any AI scams, so they install a security agent on her phone, escalating unsafe behavior.
Safety-wise, I think this is important because this can help labs building frontier models improve their safety practices and reduce the risk that model weights get stolen, while also helping companies that provide critical infrastructure like power be more robust against attacks.
Endpoint and application monitoring
Similarly, once LM agents get sufficiently powerful, the best way to prevent malicious usage on a website will be to have intelligent models monitor and identify harmful activity at scale. Whether it’s an attacker trying to penetrate security defenses or a bot misusing a platform, language models monitoring usage logs and user mouse movements / activity could identify and quarantine harmful behavior. Bot behavior should be relatively easy to detect unless the bot has a bunch of tech that helps it evade detection.
Research approaches
Build capabilities, do research
One way to advance the state of AI safety research is to build a company focused on automating work (such as a recruiter phone screen or talk therapy) and building an organization with safety at its core like Anthropic. This only works if it’s critical to do safety research to advance this organization’s capabilities. For example, automating a recruiter phone screen would likely require a high degree of explainability / interpretability (especially with respect to bias) in automating a decision, and automating talk therapy would require scalable oversight research to make sure the therapist is reaching the right conclusions.
The primary concern with these types of companies is finding a space where safety research is truly linked to the success of the business.
Interpretability software
Building interpretability software that assists mechanistic interpretability research and greatly accelerates progress. Could also be sold to companies that need strong explanations for why their models are performing the way they are. This eventually may be mandated by regulation.
This would likely need to be sold as downloaded software packages, likely open source, with an open source business model charging for enterprise security, support, and services. It’s quite early for interpretability research, but one would expect that as the space grows, explainability and interpretability would be increasingly important not just for alignment, but to explain model outputs, and many more researchers would need access to neural activations / interpretability techniques in order to accomplish their goals.
High quality human data labeling
Labs building frontier models currently have high demand for high quality data labelers for RLHF, and as we get into more dangerous territory and get more powerful systems, we’ll need increasingly large amounts of human data labeled by people with significant expertise and from a diverse set of backgrounds. When we start getting deep into domain specific agents (like finance, health, or legal) or dangerous capabilities (bio, chemical, nuclear), we’ll need experts to create examples of safe and unsafe behavior for scalable oversight and for RLHF.
This idea is to build a Scale AI competitor focused on expert data labeling, probably employing grad students with lots of context on specific fields or who have expertise in non-English languages. This may also just be an expert recruitment marketplace for data labeling rather than full service, such as GLG is for expert calls.
One thing I'm uncertain about here is how good synthetic data will be for data labeling in general. Perhaps synthetic data will replace most human generated data at some point.
Other thoughts about building a for-profit alignment org
One challenge of building any for-profit organization is aligning the mission of the organization with profit incentives. The best way to tackle this is to make sure that, as much as possible, the way that the org makes money is consistent with its mission.
With respect to governance, I believe incorporating as a public benefit corporation with the mission of building safe AI is the right move. While stronger safety-oriented governance structures may be preferable, I think the recent OpenAI debacle will make raising money and operating with a non-standard governance structure difficult – without much benefit for organizations that do not pose much catastrophic risk themselves.
Advancing safety without advancing capabilities, usability, or deployability is hard. We should advance safety alongside capabilities versus the alternative of advancing capabilities without advancing safety.
One of my core principles in thinking about the future is to approach with humility and a high degree of uncertainty as to how things will play out. I believe that there are reasons for both pessimism and optimism, and that no single person has a sure idea of how the next decade is going to unfold. Because of this uncertainty, I assume that many future scenarios will be risky to humanity, with some posing risks sooner rather than later.
I think we should build organizations that help with both fast and slow takeoffs, and unipolar and multipolar failures. The future will be surprising to us in many ways, so it’s best to create organizations that can iterate quickly based on how things play out. As a result, each of these company ideas should ideally bet strongly on one thing being true, while reducing vulnerability to other factors.
Please reach out!
Thanks for reading this post. If you’re someone who would like to join or start a for-profit alignment organization, please reach out! I’ll be starting an organization in the new year and looking to hire around April, and there are a few organizations popping up looking to fund these types of orgs and support folks making a career transition.