Quick Intro: My name is Strad and I am a new grad working in tech who wants to learn and write more about AI safety and how tech will affect our future. I'm challenging myself to write a short article a day to get back into writing. I would love any feedback on the article and any advice on writing in this field!
Recently, I have been very interested in AI interpretability research. If you have been keeping up with my writing, this is likely obvious based on my last several posts.
Part of this interest stems from the pure excitement and curiosity I have about gaining a deeper understanding of how AI models work. It’s fascinating to learn about the ways models encode and store the meaning of the world within mathematical structures.
But another part of this interest stems from the large, real-world impact interpretability research can have on the future safety of advanced AI systems. An essay written earlier this year by Anthropic’s co-founder and CEO, Dario Amodei, called “The Urgency of Interpretability,” helps emphasize just how important this research is for AI safety. Given that Anthropic is a heavily safety-conscious AI company and has even made its first-ever investment into an interpretability research startup called Goodfire, I found this essay very persuasive and informative on the importance of interpretability work.
I highly recommend going through the essay, as it is a quick and accessible read. In this article, I want to go over a few key points about how interpretability can help with AI safety, why we need to work on it quickly, and how to accelerate progress in the field, while also offering an additional strategy that I think would be beneficial.
How Interpretability Helps With AI Safety
Dario begins the essay by emphasizing what makes today’s AI models unique relative to other technologies we have created: we lack a proper understanding of how a model’s inner workings actually produce its behavior.
Chris Olah, a foundational figure in early interpretability work, attributes this to the fact that AI models are grown rather than built: their parameters are shaped through training rather than manually programmed by humans.
Because of this, the inside of a model acts as a black box that takes in inputs and returns outputs. Our limited understanding of this black box is what creates a lot of the risks and bottlenecks associated with AI safety today.
For one, there have been arguments and experiments showing that AI models have a tendency to lie to humans when doing so helps them achieve their objectives. Not understanding a model’s inner workings makes it harder to tell both when and why deception is occurring.
Furthermore, many fear that AI could be misused to extract dangerous information that would not normally be findable on the internet, such as instructions for creating nuclear or biological weapons. A big part of this risk comes from not being able to tell what dangerous information is encoded inside a model, and not being able to effectively secure that information from bad actors.
Finally, raising awareness of AI safety concerns within both the AI community and the public is difficult when the main evidence for these concerns comes from theoretical arguments or harms that have already occurred. Early empirical detection of intent to carry out misaligned behavior is hard when our understanding of the inner workings that would encode this intent is so fuzzy.
Creating methods to better interpret how these models are actually thinking would go a long way toward solving these issues. We could reliably detect misalignment and deception in models, and even alter the parts of a model associated with those behaviors to remove them. We could locate dangerous information stored within a model and remove it, or implement more robust, programmatic safeguards to secure it.
Dario also emphasizes how much progress the field of interpretability has already made, and he cites an experiment showing how these findings can help implement safety strategies in the real world.
For example, interpretability research relies on decoding features held in a model, which represent real-world concepts (dog, tree, politics), and circuits, which show how those features connect to form the steps of a model’s reasoning. Within just a year, the field went from detecting fewer than a couple million features in small models to detecting over 30 million features in larger, commercial-grade models such as Claude 3 Sonnet.
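To make the idea of “features” a bit more concrete, here is a minimal toy sketch of the kind of sparse autoencoder often used in this line of research to pull interpretable features out of a model’s internal activations. This is not Anthropic’s actual pipeline: the dimensions, hyperparameters, and random stand-in data are all made up for illustration, and real work would train on activations captured from a large language model rather than random numbers.

```python
# Toy sketch of a sparse autoencoder (SAE) for extracting "features"
# from a model's internal activations. Hypothetical sizes and data.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Encoder maps activations into a larger, overcomplete feature space.
        self.encoder = nn.Linear(d_model, d_features)
        # Decoder reconstructs the original activation from the active features.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps only positively-firing features, encouraging sparsity.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

# Stand-in for activations recorded at one layer of a model (random here).
d_model, d_features = 512, 4096
activations = torch.randn(1024, d_model)

sae = SparseAutoencoder(d_model, d_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(200):
    features, reconstruction = sae(activations)
    # Reconstruction loss keeps features faithful to the activations;
    # the L1 penalty pushes most features to zero so each one can stay interpretable.
    loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The sparsity penalty is the key design choice: each activation gets reconstructed from only a handful of active features, which is what makes the learned features plausible candidates for human-interpretable concepts like “dog” or “politics.”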
As for real-world safety applications, an experiment was run in which a misaligned feature was deliberately hidden inside a model and two sets of teams were tasked with finding it, one with interpretability tools and one without. Several of the teams with interpretability tools not only found the feature but did so efficiently because of the tools they were given.
This progress gives hope for Anthropic’s ultimate interpretability goal: one day being able to essentially run a “brain scan” on a model that reliably checks for any misaligned behaviors before deployment.
How to Make Progress in the Field
Because AI is progressing so fast, the risk of harm from AI is increasing. Despite rapid improvement, interpretability still lags significantly behind the progress of AI as a whole. Given the safety advantages that better interpretability methods would bring, Dario argues there is a strong incentive to speed up research in this field, and he emphasizes three main strategies for doing so.
The first is simply to get more people working directly in the field: scientists, independent researchers, nonprofits, labs, companies, and so on. He emphasizes that not only is it a very interesting field, it is also relatively accessible to independent researchers, since meaningful work relies on basic science rather than massive computational resources.
He also encourages the larger AI companies, such as OpenAI and Google DeepMind, to put more of their resources toward interpretability research. Doing so not only helps with safety efforts but also makes them more competitive, since the ability to understand and explain models is increasingly in demand across the industry.
His second strategy is for governments to implement regulations requiring AI companies to be more transparent about their safety efforts, including their use of interpretability methods. Companies could then not only learn from each other but also be further incentivized to prioritize interpretability, since the most responsible practices would be on display for everyone to see.
The third strategy is for governments to implement export controls on key AI-related products, such as chips, to countries like China. Because the AI race with China is a major justification for companies pushing AI forward so rapidly, any time advantage over China would greatly improve companies’ ability and willingness to slow down and invest more time in interpretability research.
Bonus Strategy: Education and Outreach
On top of these three strategies, I want to add my own: increased education and outreach about interpretability research and its importance to AI safety. I think outreach is an effective way to get more people talking about, and working on, interpretability research.
For example, my own interest in interpretability was sparked again about a week ago after watching Anthropic’s recent discussion with their interpretability team. It reminded me just how interesting and important the field is, which inspired me to do a deeper dive into the research. I believe outreach is especially valuable here because the field’s lower barrier to entry for independent researchers makes any outreach effort potentially more impactful.
On top of that, I think we should communicate the marketability and value of the skills attached to interpretability research. Technical efforts to help with AI safety can often seem esoteric, and academia can seem like the only real path to contribute. Interpretability is different: interpretability tools are already being built for commercial use, as shown by the success of startups such as Goodfire. Emphasizing this might convince more people who are interested in AI safety, but hesitant to work on it because of these perceived constraints, to join the field.
Finally, the more people who know about interpretability research and its importance to safe AI, the more effective we will be at implementing the other strategies Dario emphasizes. More companies might be incentivized to prioritize interpretability for the public relations advantage. And governments become more able and willing to pass regulations and export control policies that aid interpretability research when there is broader support and pressure from the public.
While creating safe AI in the future will rely on many different fields of research and strategies, it seems clear that interpretability will, at the least, have a significant impact.
Given that current models’ capabilities far exceed our understanding of them, it is important to grow the field of interpretability research so that our understanding can catch up and give us better control over our models going forward.
