Summary
This is a brain dump of some for-profit AI alignment organization ideas, along with context for why I believe a for-profit alignment organization can make a big contribution to AI safety. This is far from a complete list, and I welcome ideas and feedback. Also, if anyone wants to or is working on any of these ideas, I’d be happy to support in any way I can!
Context
I'm Eric, formerly co-founder of RippleMatch, an AI recruiting company with ~$80M raised, millions of users, and ~10% of the Fortune 500 as customers. I made the difficult decision to leave RippleMatch this year because I'm concerned about catastrophic risk from AI, and have been spending the last year thinking about ways to help. Given my background, I’ve been thinking a lot about for-profit ideas to help with alignment – many that can be VC-backed. Some of these ideas speak more directly to reducing catastrophic risk than others, but I think that all can put a founder in a strong position to help in the future.
Why I believe for-profit alignment orgs are valuable
I don’t think for-profit approaches are inherently better than building non-profits, pursuing government regulation, or other approaches, but I think that for-profit orgs can make a substantial impact while attracting a different pool of talent eager to work on the problem.
With VC dollars, a for-profit organization can potentially scale far more quickly than a non-profit. It could make a huge impact and not have its growth capped by donor generosity. As a result, there can be far more organizations working on safety in the ecosystem tapping into a different pool of resources. That said, any VC-backed company has a relatively low chance of success, so it’s a riskier approach.
Fundamentally, I believe that risk and compliance spend will grow extremely quickly over the coming decade, scaling with generative AI revenue. With comps in finance and cybersecurity, I’d guess that mid to high single digit percentages of overall AI spend will be on risk and compliance, which would suggest big businesses can be built here. Many startups tackling alignment will need to start by addressing short term safety concerns, but in doing so will position themselves to tackle long-term risks over time.
Onto the actual ideas!
Robustness approaches
Testing / benchmarking software
Test case management needs to look very different for LLMs compared to typical software. The idea is to sell companies deploying LLMs a SaaS platform with the ability to generate and manage test cases for their LLMs to make sure they are performing properly and ensure that performance doesn’t drift from version to version. This startup would also incorporate a marketplace of common benchmarks that companies can pull off the shelf if relevant to their use case (e.g. common adversarial prompts).
Currently, my impression is that most companies don’t use any software to manage their language model test suites, which is a problem given how often an LLM can fail to produce a good result.
Red-teaming as a service
Just as software companies penetration test their software, companies that use LLMs as well as companies who build frontier models will need to red-team their models with a wide variety of adversarial prompts. This would mostly test models for how they handle misuse and make them more robust against jailbreaking. Just as a proper penetration test employs both manual and automated penetration testing, this startup would require building / fine-tuning the best automated red-teaming LLM that likely draws on multiple frontier models, as well as employ the best manual red-teamers in the space. Enterprises would likely pay a subscription depending on their usage, which would likely be spiky.
There is a substantial appetite from labs building frontier models for red-teaming services, and it appears to me that red-teaming, evals, and data labeling sum up the services that labs are interested in at the current moment. I think that’s a small market, but the dollar values could be high for each individual customer.
Evals / auditing
Lots of folks are thinking about evals right now, and I agree the theory of change is strong. Any evals business right now would face a large amount of competition from nonprofits offering services for free as well as government audits that could be mandatory.
That said, I still think there's a substantial need for companies deploying LLMs as well as labs building frontier models to be audited for a bunch of things, like dangerous capabilities, misuse, security, bias, and compliance. Given how easy it is to fine-tune RLHF safeguards from models, I think it’s likely that all companies deploying frontier models, not just the model producers, will need to be audited and pay money to reduce risk.
The product will be a mix of software (think Vanta for SOC II) that tracks compliance across a number of practices that the company needs to adhere to, as well as services to audit for compliance and test for dangerous capabilities. There is a chance that the auditor (audit for compliance) and the capabilities evaluator are two separate entities, in which case I think that the auditor is more scalable and the better business, as testing for dangerous capabilities will likely be quite manual for some time.
Monitoring
There are a bunch of startups cropping up to monitor and observe language models in production, vying for what could be thought of as Datadog for LLMs. I think this is a big problem that will clearly exist as a software business, since language model monitoring is both tractable and very different than application monitoring. There are quite a few startups with ~$50M - $100M raised that were doing this for ML models in general (Arthur, Arize, etc) and have spent a bunch of time building out their services for LLMs, and this startup would also face a bunch of competition from APM companies like Datadog and New Relic who are well positioned from a customer perspective to own this space.
The product would ingest every inference / response of an LLM, make it easy to create dashboards to monitor for certain behaviors / failures, and incorporate an API that can surface problems either in dashboards or in real time escalating to humans. It would be able to measure model drift and help developers debug the behavior of their LLM application.
I think the safety case here would have some similarity to that of evals, where we may get a warning shot through monitoring of dangerous capabilities in production. Most of the behaviors that companies would monitor for would be mundane, however.
AI agents approaches
All the ideas below are a bit early given the lack of economically valuable agents that currently exist. However, I would wager that the next iteration of LLMs (e.g. GPT-5) will unlock a world of enterprise automation, as well as be able to perform basic consumer tasks like booking a flight or scheduling a dinner with friends. As soon as this is possible, companies will be incredibly incentivized to pursue these use cases because they have the potential to heavily reduce the cost of their labor. It also opens up a world of multi-agent interaction and plenty of safety problems.
I think a world where agents are widespread and performing tasks on our behalf is coming soon, so building safer agents in that world is helpful. I’m also inspired by a Paul Christiano post that discusses advancing agent capabilities as neutral / positive, and so advancing safer agents seems pretty good overall to me.
Agent testing environments
This is similar to building testing software for LLMs, but once systems become agentic / multi-step, it’s even harder to build test cases. More importantly, one would likely need to be able to easily build agent environments and set them up and tear them down automatically in addition to managing test cases successfully.
This idea is a bit early because GPT-4 doesn’t seem quite capable of doing the planning necessary for most economically valuable multi-step workflows, but I would wager that GPT-5 will unlock a world of enterprise automation. In that world, this type of solution would probably be necessary for all companies looking to build agents.
The primary issue here is technical – is it possible to build a solution that fits the use cases of most companies given that the environments that these agents will be expected to perform in will be incredibly diverse?
Deterministic framework for agents
One of the primary blockers to using LLMs in production is that they can have all sorts of unexpected behavior. This will only increase in worlds where agents are widespread, which will be a big problem for companies looking to deploy LM agents. The idea here is to build a developer framework that puts LM agents on heavy guardrails, strictly defining the set of actions an LM agent can take given its state and environment. This set of actions will be deterministic, well understood by the developer, easy to use, and human legible. If this becomes the de facto standard for building agents in enterprise use cases, it will be a safer future.
OpenAI’s GPT framework is a potential competitor here. Building great developer frameworks for LM agents is part of their vision of the future. I think they may be optimizing first for ease of use out of the box and consumer use cases which could make them neglect some key features in a framework like this, but they pose a substantial threat to this business.
That said, interoperability is potentially quite important to companies, and building a model agnostic framework could have advantages.
Cybersecurity approaches
Security agent
As LM agents get more capable, they’ll also improve their abilities to find exploits in security systems. I expect the number and quality of AI scams phishing attempts, vulnerability scans, etc to increase sharply. To combat this, everyone should have their own security agent on each of their devices. This agent will stay silent in the background, watching the user’s activity, reading emails, and listening to calls until it detects a problem. Upon detection, the agent can intervene loudly by popping up a warning with advice on how to proceed or shutting off the interaction, or it can intervene softly by escalating a notice to a company’s IT team.
One could also imagine consumer applications, where someone may want to ensure that grandma doesn’t fall for any AI scams, so they install a security agent on her phone, escalating unsafe behavior.
Safety-wise, I think this is important because this can help labs building frontier models improve their safety practices and reduce the risk that model weights get stolen, while also helping companies that provide critical infrastructure like power be more robust against attacks.
Endpoint and application monitoring
Similarly, once LM agents get sufficiently powerful, the best way to prevent malicious usage on a website will be to have intelligent models monitor and identify harmful activity at scale. Whether it’s an attacker trying to penetrate security defenses or a bot misusing a platform, language models monitoring usage logs and user mouse movements / activity could identify and quarantine harmful behavior. Bot behavior should be relatively easy to detect unless the bot has a bunch of tech that helps it evade detection.
Research approaches
Build capabilities, do research
One way to advance the state of AI safety research is to build a company focused on automating work (such as a recruiter phone screen or talk therapy) and building an organization with safety at its core like Anthropic. This only works if it’s critical to do safety research to advance this organization’s capabilities. For example, automating a recruiter phone screen would likely require a high degree of explainability / interpretability (especially with respect to bias) in automating a decision, and automating talk therapy would require scalable oversight research to make sure the therapist is reaching the right conclusions.
The primary concern with these types of companies is finding a space where safety research is truly linked to the success of the business.
Interpretability software
Building interpretability software that assists mechanistic interpretability research and greatly accelerates progress. Could also be sold to companies that need strong explanations for why their models are performing the way they are. This eventually may be mandated by regulation.
This would likely need to be sold as downloaded software packages, likely open source, with an open source business model charging for enterprise security, support, and services. It’s quite early for interpretability research, but one would expect that as the space grows, explainability and interpretability would be increasingly important not just for alignment, but to explain model outputs, and many more researchers would need access to neural activations / interpretability techniques in order to accomplish their goals.
High quality human data labeling
Labs building frontier models currently have high demand for high quality data labelers for RLHF, and as we get into more dangerous territory and get more powerful systems, we’ll need increasingly large amounts of human data labeled by people with significant expertise and from a diverse set of backgrounds. When we start getting deep into domain specific agents (like finance, health, or legal) or dangerous capabilities (bio, chemical, nuclear), we’ll need experts to create examples of safe and unsafe behavior for scalable oversight and for RLHF.
This idea is to build a Scale AI competitor focused on expert data labeling, probably employing grad students with lots of context on specific fields or who have expertise in non-English languages. This may also just be an expert recruitment marketplace for data labeling rather than full service, such as GLG is for expert calls.
One thing I'm uncertain about here is how good synthetic data will be for data labeling in general. Perhaps synthetic data will replace most human generated data at some point.
Other thoughts about building a for-profit alignment org
One challenge of building any for-profit organization is aligning the mission of the organization with profit incentives. The best way to tackle this is to make sure that, as much as possible, the way that the org makes money is consistent with its mission.
With respect to governance, I believe incorporating as a public benefit corporation with the mission of building safe AI is the right move. While stronger safety-oriented governance structures may be preferable, I think the recent OpenAI debacle will make raising money and operating with a non-standard governance structure difficult – without much benefit for organizations that do not pose much catastrophic risk themselves.
Advancing safety without advancing capabilities, usability, or deployability is hard. We should advance safety alongside capabilities versus the alternative of advancing capabilities without advancing safety.
One of my core principles in thinking about the future is to approach with humility and a high degree of uncertainty as to how things will play out. I believe that there are reasons for both pessimism and optimism, and that no single person has a sure idea of how the next decade is going to unfold. Because of this uncertainty, I assume that many future scenarios will be risky to humanity, with some posing risks sooner rather than later.
I think we should build organizations that help with both fast and slow takeoffs, and unipolar and multipolar failures. The future will be surprising to us in many ways, so it’s best to create organizations that can iterate quickly based on how things play out. As a result, each of these company ideas should ideally bet strongly on one thing being true, while reducing vulnerability to other factors.
Please reach out!
Thanks for reading this post. If you’re someone who would like to join or start a for-profit alignment organization, please reach out! I’ll be starting an organization in the new year and looking to hire around April, and there are a few organizations popping up looking to fund these types of orgs and support folks making a career transition.