Quick intro: My name is Strad, and I'm a new grad working in tech who wants to learn and write more about AI safety and how technology will affect our future. I'm challenging myself to write a short article a day to get back into writing. I'd love any feedback on the article and any advice on writing in this field!
AI models are becoming more ingrained in the functioning of society, yet we don't understand how they truly “think.” Their inner workings are still largely a black box to us.
However, some companies are digging deeper to understand the reasoning that goes on inside today’s top models. One company at the forefront of this work is a San Francisco-based startup called Goodfire.
Goodfire is on a mission to make AI models understandable through the research and development of interpretability tools.
Goodfire’s Rationale For Interpretability
On a recent Sequoia Capital podcast, Goodfire’s CEO, Eric Ho, discussed his company’s mission, progress, and plans for the future.
In the podcast, he laid out the case for interpretability by explaining why it is necessary for creating safe AI that we have intentional control over. Currently, we’re able to reap significant benefits from LLMs despite them being a black box. However, we can only rely on a black box for so long before the safety concerns and lack of control become a problem.
Ho uses the analogy of steam engines back in the 1700s to make this point clear. At the time, we did not have a full understanding of the physics that made steam engines work. Despite this, we benefited from them for a very long time. However, it wasn’t uncommon for steam engines to blow up, leading to all sorts of performance and safety issues. Once we understood thermodynamics better, we were able to make steam engines safer and more effective.
In a similar way, we can still benefit from AI without understanding its inner workings. However, if we could understand those inner workings, we could better control how these models behave. That control would also allow us to detect and stop safety and performance issues before they become a problem. In a future where we offload more mission-critical tasks to these models, Ho emphasizes just how important having this control would be.
How Does Goodfire Work Towards Interpretability?
Goodfire works towards their mission by conducting frontier research in the field of mechanistic interpretability (MI). Their findings not only contribute to the field of MI as a whole, but also feed into their product, Ember, which helps make AI interpretability more accessible as a tool for other researchers and developers.
MI research relies on building blocks called features: patterns of neuron activations within a model that represent a specific concept (such as “dog,” “tree,” or “economics”). In its infancy, MI research faced a critical challenge known as superposition, the phenomenon where a single neuron contributes to more than one feature. Because of this, the features within a model are all “tangled up” across its neurons, and distinguishing them from one another was a real challenge for the field.
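To make the idea concrete, here is a tiny toy sketch of superposition (my own illustration, not Goodfire’s code): a layer with only three neurons “stores” ten concept features as overlapping directions, so no single neuron cleanly corresponds to a single concept.

```python
# Toy illustration of superposition: 10 features packed into 3 neurons.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 3, 10

# Each feature is a direction in neuron space (more features than neurons).
feature_directions = rng.normal(size=(n_features, n_neurons))
feature_directions /= np.linalg.norm(feature_directions, axis=1, keepdims=True)

# An input that activates features 2 and 7 (say, "dog" and "economics").
feature_activations = np.zeros(n_features)
feature_activations[[2, 7]] = 1.0

# The neuron activations are a mixture of the active features' directions,
# so no individual neuron reads cleanly as "dog" or "economics".
neuron_activations = feature_activations @ feature_directions
print(neuron_activations)
```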
However, a breakthrough came with the discovery of an interpretability method that uses sparse autoencoders (SAEs). SAEs are essentially AI models that can disentangle the features hidden across those “tangled” neurons and represent them in a clearer way. With this breakthrough, the field of MI was able to advance, and Goodfire now works on researching and refining these methods so that interpretability work can be scaled and automated.
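For readers who want a feel for the mechanics, here is a minimal SAE sketch in PyTorch. This is only a conceptual illustration under simplified assumptions; real interpretability SAEs are far larger and are trained on activations collected from a live model.

```python
# Minimal sparse autoencoder (SAE) sketch, assuming PyTorch.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons: int, n_features: int):
        super().__init__()
        # Encoder maps tangled neuron activations into a wider, sparse feature space.
        self.encoder = nn.Linear(n_neurons, n_features)
        # Decoder reconstructs the original activations from those features.
        self.decoder = nn.Linear(n_features, n_neurons)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder(n_neurons=768, n_features=16384)
activations = torch.randn(32, 768)  # a batch of hidden-layer activations

features, reconstruction = sae(activations)
# Training objective: reconstruct faithfully while keeping features sparse.
loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
loss.backward()
```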
One example of Goodfire’s work in action is their partnership with the Arc Institute to help make the institute’s DNA foundation model more interpretable.
Goodfire’s interpretability work helped map common biological concepts that the model should know, such as tRNA and other RNAs, to the actual neuronal activity in the model that represents those concepts. Further collaboration with the Arc Institute aims to see whether new biological concepts can be found within the other features the model developed during training.
Ember — Goodfire’s Flagship Interpretability Platform
In addition to their research and collaboration efforts, Goodfire also has their own platform, Ember, which helps other researchers and developers leverage the power of interpretability themselves.
With the platform, users are able to interpret their models and extract relevant features, giving them more control over their AI systems. One way they can do this is through a capability called autosteering. A user can change the behavior of their model by simply prompting Ember to do so; Ember will then find the relevant features and alter their strengths to fit the user’s request.
For example, in their demo video of the feature, they show how a user can type in a prompt such as “wise sage” and get back descriptions of the associated features, such as “The assistant should generate philosophical, metaphorical, wisdom quotes.” Ember can then automatically alter the strength of these features, or the user can do so manually.
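The exact mechanics behind Ember aren’t described at this level of detail, but the general idea of feature steering can be sketched like this (a hypothetical illustration with made-up names and shapes, not Ember’s API): once a feature’s direction in activation space is known, its strength can be nudged by adding a scaled copy of that direction to the model’s hidden state.

```python
# Hypothetical feature-steering sketch (not Ember's actual implementation).
import torch

def steer(hidden_state: torch.Tensor,
          feature_direction: torch.Tensor,
          strength: float) -> torch.Tensor:
    """Push hidden activations along a feature direction."""
    direction = feature_direction / feature_direction.norm()
    return hidden_state + strength * direction

# Example with made-up shapes: boost a "wise sage" feature during generation.
hidden_state = torch.randn(1, 4096)        # one token's residual activation
wise_sage_direction = torch.randn(4096)    # direction learned by an SAE
steered = steer(hidden_state, wise_sage_direction, strength=4.0)
```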
Another feature of Ember is its ability to help prevent jailbreaks. Ember can determine which misleading prompts lower the activation of “refusal” features within a model and artificially increase those features’ strength in response to such prompts.
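A rough, hypothetical sketch of that idea (again, not Ember’s actual implementation, and with made-up thresholds and indices): monitor the refusal feature’s activation and, when a prompt suppresses it below some threshold, add the feature’s direction back into the hidden state.

```python
# Hedged sketch of the jailbreak-mitigation idea; all constants are placeholders.
import torch

REFUSAL_FEATURE_IDX = 123   # index of a refusal feature found by an SAE (made up)
REFUSAL_THRESHOLD = 0.5     # activation level expected on harmful prompts (made up)

def guard_refusal(hidden_state: torch.Tensor,
                  encoder_weight: torch.Tensor,
                  decoder_weight: torch.Tensor) -> torch.Tensor:
    """If the refusal feature is suspiciously weak, add its direction back in."""
    features = torch.relu(hidden_state @ encoder_weight.T)
    if features[REFUSAL_FEATURE_IDX] < REFUSAL_THRESHOLD:
        hidden_state = hidden_state + 2.0 * decoder_weight[:, REFUSAL_FEATURE_IDX]
    return hidden_state

# Example usage with made-up dimensions.
d_model, n_features = 4096, 16384
hidden = torch.randn(d_model)
W_enc = torch.randn(n_features, d_model)
W_dec = torch.randn(d_model, n_features)
guarded = guard_refusal(hidden, W_enc, W_dec)
```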
Ember also has the ability to create classifier models based on the features it finds. For example, by using features found with Ember that represented “partial ownership stakes,” “gradual improvement,” and “business expansion,” Goodfire was able to build a classifier for financial sentiment analysis that was 75% accurate. They didn’t need to teach the model anything new because the useful concepts were already inside it. This capability drastically improves the ease and speed of creating classifier models.
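The exact pipeline isn’t described in detail, but the general recipe can be sketched as follows: treat the activations of a few relevant SAE features as input columns and fit a small, cheap classifier on top. Everything below (feature indices, data) is a stand-in, not Goodfire’s pipeline.

```python
# Hedged sketch of a feature-based classifier built on SAE activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

SENTIMENT_FEATURE_IDXS = [101, 202, 303]  # hypothetical SAE feature indices

def to_feature_vector(sae_activations: np.ndarray) -> np.ndarray:
    """Keep only the handful of features relevant to financial sentiment."""
    return sae_activations[SENTIMENT_FEATURE_IDXS]

# Stand-in data: SAE activations for labeled sentences (1 = positive sentiment).
rng = np.random.default_rng(0)
X = np.stack([to_feature_vector(rng.random(16384)) for _ in range(200)])
y = rng.integers(0, 2, size=200)

clf = LogisticRegression().fit(X, y)
print("train accuracy:", clf.score(X, y))
```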
What Makes Goodfire Special?
Given that Goodfire recently raised $50 million in their Series A funding round and are backed by major players such as Anthropic, it’s clear that many see something special in the startup.
One big advantage Goodfire has is the depth of expertise and recognition their team carries within the AI industry. The team includes former researchers and engineers from major AI companies such as OpenAI and Google DeepMind, as well as the researchers responsible for some of the biggest discoveries in MI, such as the use of SAEs for feature discovery, auto-interpretability, and the revealing of hidden knowledge in AI systems.
Another advantage comes from Goodfire’s position in the field of MI research. Much of the work in MI today is done either by major AI companies such as Anthropic or by smaller research groups in academia and nonprofits. Goodfire has the unique advantage of being a commercial entity solely focused on interpretability.
As a company, they have access to far more funding and resources than many of the smaller labs. As for the major AI companies, interpretability is just one of several priorities they have to balance, which limits how much attention it can receive. They also have an incentive to focus their interpretability efforts on their own models and modalities rather than exploring the wider landscape of architectures and domains.
By being solely focused on interpretability, Goodfire not only works across multiple different models, but also across multiple fields and modalities (e.g., vision, text, audio, genomics, robotics). The ability to research interpretability from such a broad range of perspectives gives them an advantage in potentially discovering insights that other companies might not.
Goodfire’s mission of creating safer AI systems also seems to resonate with a lot of big players in the AI space. For example, Anthropic CEO Dario Amodei said this with regard to their investment in Goodfire:
“As AI capabilities advance, our ability to understand these systems must keep pace. Our investment in Goodfire reflects our belief that mechanistic interpretability is among the best bets to help us transform black-box neural networks into understandable, steerable systems — a critical foundation for the responsible development of powerful AI”
Coming from a leader in the AI industry, Amodei’s quote emphasizes how much importance the industry places on interpretability as a tool for AI safety. As people become more aware of the potential for AI to cause serious harm, companies working to mitigate those harms are becoming ever more important, helping Goodfire stand out.
How Goodfire Prioritizes Safety
Goodfire’s AI safety goals manifest through two areas of ongoing research that Eric Ho emphasized in his discussion with Sequoia Capital: auditing and model diffing.
Auditing aims to give developers the ability to find problematic behavior within a model’s internals so they can then remove it. This would allow them to detect potential misalignment in a model before it actually outputs or executes anything harmful.
Model diffing would allow developers to compare a model from checkpoint to checkpoint and see what changed, how it changed, and why. This would let them see why bad behavior developed in the first place and prevent those processes from occurring in future iterations of the model.
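A hypothetical sketch of what model diffing could look like in practice (not Goodfire’s tooling): run the same prompts through two checkpoints, record each feature’s average activation, and surface the features whose behavior shifted the most.

```python
# Hypothetical model-diffing sketch over SAE feature activations.
import numpy as np

def feature_diff(acts_old: np.ndarray, acts_new: np.ndarray, top_k: int = 5):
    """acts_*: (n_prompts, n_features) feature activations per checkpoint."""
    shift = acts_new.mean(axis=0) - acts_old.mean(axis=0)
    top = np.argsort(-np.abs(shift))[:top_k]
    return [(int(i), float(shift[i])) for i in top]

# Stand-in activations for 100 prompts and 1,000 features at two checkpoints.
rng = np.random.default_rng(0)
acts_old = rng.random((100, 1000))
acts_new = acts_old.copy()
acts_new[:, 42] += 0.8  # pretend feature 42 (e.g. "deception") got stronger

print(feature_diff(acts_old, acts_new))  # feature 42 tops the list
```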
On top of this, Goodfire continues to emphasize their commitment to collaborating with the AI safety community: seeking its help to evaluate and improve their safety measures, conducting thorough red-teaming exercises, and contributing their own research to the field.
There is no doubt that being able to better understand the AI systems people work with every day will become increasingly important as those systems progress. Goodfire understands this and is one of the few organizations solely focused on this goal. With their large backing, top-tier experts, and safety-driven mission, Goodfire is in a great position to be the team that finally deciphers the black boxes inside AI models.

I'll note that I've heard third-hand that Goodfire leadership (I believe the CEO but I could be mistaken) has pitched their interpretability work as capabilities-enhancing rather than primarily safety-focused to external people. Separately, and with lower confidence, I believe they may be actively fundraising on this framing.