Quick Intro: My name is Strad and I am a new grad working in tech, wanting to learn and write more about AI safety and how tech will affect our future. I'm challenging myself to write a short article a day to get back into writing. I'd love any feedback on the article and any advice on writing in this field!
As AI models get more complex, the ability to interpret how they reason about the world is becoming more sought after. Goodfire is a rising San Francisco-based startup working to turn interpretability research into useful products for AI developers.
Up until recently, interpretability work has mainly stayed in the realm of research, with few commercial products built on the field's findings. In a recent demo, however, Goodfire applied researcher Mark Bissell shows how the company's new product, “Ember,” works as a tool to help AI developers and users interpret models themselves.
The two main commercial use cases of Ember demonstrated are “robust development” and “novel user interfaces.” The demo also touches on other potential use cases and future directions.
Robust Development with AI
First, Mark goes over the downsides of developing and coding with LLMs. For one, many developers working with LLMs will experience “whack-a-mole” prompt edits, in which fixing one output issue leads to another.
A common solution to this is to use another LLM as a judge to monitor the accuracy of these outputs, but that approach brings its own problems around scalability and cost.
Another possible solution is fine-tuning the model, but this comes with its own host of issues, such as the need for domain-specific data, reward hacking, and mode collapse.
With Ember, a more robust solution to these issues is available through feature steering.
A fundamental part of interpretability research is the idea of features. Features are essentially the parts of an AI model’s neural network that correspond to a given concept the model has learned. For example, a group of neurons might activate every time the concept of a “dog” appears in the input or is used to generate an output.
Mapping these groups of neurons to their corresponding features is crucial to understanding how the model thinks. It is also very useful to developers, because these features give them more precise control over how their model behaves. This is done through feature steering: turning a feature of interest up or down in order to “steer” the model’s behavior in a certain direction.
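To make the idea concrete, here is a minimal sketch of feature steering as an intervention on a model’s activations. This is not Ember’s API: the feature direction and the steer function are made up for illustration, and real features are learned from the model rather than sampled at random.

```python
import numpy as np

# A "feature" can be thought of as a direction in the model's activation space
# that corresponds to a learned concept, e.g. "dog". (Illustrative only.)
hidden_dim = 768
rng = np.random.default_rng(0)

dog_feature = rng.normal(size=hidden_dim)
dog_feature /= np.linalg.norm(dog_feature)

def steer(hidden_state, feature, strength):
    """Nudge a hidden state along a feature direction.

    strength > 0 turns the concept up; strength < 0 turns it down.
    """
    return hidden_state + strength * feature

# During generation, the steered activation flows into the model's later
# layers, biasing outputs toward (or away from) the concept.
hidden_state = rng.normal(size=hidden_dim)
steered = steer(hidden_state, dog_feature, strength=4.0)
```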
A famous example of this mentioned in the demo comes from Anthropic’s model Claude, in which researchers altered the feature associated with the “Golden Gate Bridge.” They turned this feature all the way up in an alternate model they called “Golden Gate Claude” and found that, because of the steering, this version of Claude would bring up the Golden Gate Bridge in every response, regardless of whether it was mentioned in the input.
While a funny example, it demonstrates the ability to have more precise control over the behavior of a model, which has many practical applications. For example, in the demo, Ember is used to make a model better at protecting private information simply by finding the feature associated with “discussions of sensitive information” and turning its strength up.
The ability to find specific features based on the model’s output and steer them is just one of Ember’s capabilities. Another demonstrated capability is “Dynamic Prompting,” in which new prompts are inserted into the model when the feature for a given concept activates above a certain strength threshold. The example shown was a prompt advertising “Coca-Cola,” which gets inserted whenever the feature associated with “beverages” is activated by the model.
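A rough sketch of how such a dynamic-prompting check might look is below. The threshold, message format, and function name are assumptions for illustration, not Ember’s actual interface.

```python
# Monitor a concept feature's activation and inject an extra prompt when it
# crosses a threshold. All names here are hypothetical.
BEVERAGE_THRESHOLD = 0.6

def maybe_inject_prompt(feature_activations, messages):
    """Append an advertising instruction if the 'beverages' feature fires strongly."""
    if feature_activations.get("beverages", 0.0) > BEVERAGE_THRESHOLD:
        messages = messages + [{
            "role": "system",
            "content": "When relevant, mention Coca-Cola as a beverage option.",
        }]
    return messages

# Example: the monitored feature fired well above the threshold, so the
# advertising prompt gets appended before the next generation step.
messages = [{"role": "user", "content": "What should I drink with pizza?"}]
messages = maybe_inject_prompt({"beverages": 0.82}, messages)
```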
Novel User Interfaces
The other main commercial use demonstrated was novel user interfaces for AI models. By connecting a user interface to the actual weights inside a vision model, Goodfire was able to create a paint-like tool in which you can “paint” using features in the model and have an associated image generated.
In this paint UI, different features (lion, pyramid, wave, etc.) are displayed on the side. By clicking a given feature, you can “paint” on a blank canvas with that feature, and an image containing that feature in the position you painted is generated. You can try this demo for yourself right here. An example of my own “painting” can be seen below:

Image generated from “Ocean Wave,” “Tall Grass,” “Pyramid Structure,” and “Skyscraper Facades” features
The cool thing about this UI is that the strength of each feature can be adjusted, which changes how much it is emphasized in the image. Each feature can also be broken down into sub-features, which allows for more granular control over the output. For example, in the demo, they used the “Lion face” feature to paint a lion and then showed that if you lower the strength of its “Mane” sub-feature, the face becomes more tiger-like. This suggests the model may understand a tiger as something like a lion minus its mane.
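As a rough illustration (not Goodfire’s implementation), the painted canvas can be thought of as a list of strokes, each pairing a feature and a canvas region with a strength and optional sub-feature strengths:

```python
from dataclasses import dataclass, field

@dataclass
class Stroke:
    feature: str                       # e.g. "Lion face"
    region: tuple                      # (x0, y0, x1, y1) bounding box on the canvas
    strength: float = 1.0              # how strongly the feature is emphasized
    sub_features: dict = field(default_factory=dict)  # e.g. {"Mane": 0.2}

canvas = [
    Stroke("Ocean Wave", (0, 300, 512, 512), strength=0.8),
    Stroke("Pyramid Structure", (320, 60, 480, 260)),
    # Turning the "Mane" sub-feature down made the lion face look more tiger-like.
    Stroke("Lion face", (40, 40, 220, 220), sub_features={"Mane": 0.2}),
]
```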
Other Use Cases and Directions
Other exciting use cases for interpretability research are mentioned as well, such as explainable outputs, knowledge extraction, and faster inference.
Knowledge extraction is a particularly exciting use case, as it takes advantage of the superhuman nature of some of the models out there to extract novel information. For example, one of Goodfire’s partners is the Arc Institute, a biomedical research organization.
The Arc Institute developed Evo 2, an AI model with superhuman abilities in predicting genomic data. Because of this, Goodfire plans to work with the Arc Institute to use interpretability methods to potentially discover novel biological concepts encoded in the model’s neurons. This strategy could be applied across a vast range of domains where narrow superhuman models already exist, allowing for model-driven breakthroughs in many different fields of research.
Takeaway
The demo ends with a brief note from Mark emphasizing the fascination engineers have with taking things apart to truly understand how they work. The black-box nature of AI models makes them ripe for this kind of exploration, and their widespread deployment across society makes interpreting them a very practical and lucrative task.
Ember already shows how even a basic understanding of a model’s features can improve the control one has over its behavior. Progress in interpretability seems to be speeding up, so it will be exciting to see what new use cases arise in the coming months and years.
