
Quick Intro: My name is Strad and I am a new grad working in tech who wants to learn and write more about AI safety and how tech will affect our future. I'm challenging myself to write a short article a day to get back into writing. I would love any feedback on the article and any advice on writing in this field!

 

As AI models get larger, the infrastructure and engineering methods needed to run them properly become more complex. This is due, in part, to the sheer size of these models and the amount of computation they require.

Not only does an increase in scale affect how we create AI models, it also affects how we understand them. Mechanistic interpretability is a field of research that studies the way AI models think. The goal is to figure out how the inner parts of a model actually work to produce an output.

Anthropic, one of the leading AI companies and the maker of Claude, puts a lot of its resources into interpretability research to help ensure the safety of future models. Through this research, they have run into the issues and tradeoffs that come from scaling their methods up to larger models. These issues and tradeoffs, among other topics, come up in a discussion researchers from their team had about their scalability efforts. In this article, I go over four key insights I learned from watching this discussion.

1. LLMs are storing general concepts, not just regurgitating data

A lot of people assume that LLMs work by finding a piece of text in their training data that would best serve as a response to a given input. However, the research conducted by Anthropic's interpretability team suggests that LLMs are storing more abstract concepts that give them general knowledge of the world and greater flexibility in creating responses.

The researchers in the discussion support this idea with reference to multimodal and multilingual concepts found to be encoded in a model's inner parts. Multimodal in the sense that the parts of a model that activate for an image of a given concept also light up for text describing that same concept (e.g. a picture of a dog and a piece of text saying "dog"). Multilingual in the sense that the parts of a model that activate for a given concept light up regardless of the language used to express it.

These examples show that the model is forming some deeper understanding of these concepts that transcends modality and language, allowing it to apply them in more general contexts.

2. As models scale up, the math remains simple while the engineering becomes more complex

A Sparse Autoencoder (SAE) is a tool used in AI interpretability research that has been very important for dissecting and visualizing the features discovered within a model. In the discussion, the researchers explained how beautiful and relatively simple the math behind the SAE is compared to other, more complex interpretability methods they have explored, and how this simplicity scales well with model size.
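To make the "relatively simple math" concrete, here is a minimal sketch of what an SAE can look like in code. This is my own illustrative PyTorch version, not Anthropic's implementation; the class name, dimensions, and the ReLU-plus-L1 training objective are assumptions for the sketch. The core idea is just a linear encoder into a wide feature space, a nonlinearity, and a linear decoder trained to reconstruct the original activations while keeping most features at zero.

```python
# Minimal sketch of a sparse autoencoder (SAE), assuming PyTorch.
# Illustrative only; not Anthropic's actual architecture or hyperparameters.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Encoder maps model activations into a larger, overcomplete feature space.
        self.encoder = nn.Linear(d_model, d_features)
        # Decoder reconstructs the original activations from the sparse features.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps only positively activating features, encouraging sparsity.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features toward zero.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```

The whole method fits in a few lines: a reconstruction term and a sparsity penalty. That simplicity is what the researchers say carries over nicely as models grow.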

The real challenge in scaling this method up to larger models is the engineering involved in designing and working with the distributed infrastructure it requires. For example, in previous experiments on smaller models, the associated SAEs could be run on a single GPU. But in order to scale up to commercial LLMs, they needed to run SAEs across multiple GPUs, which led to a host of other engineering challenges, such as determining how to distribute the parameters of the SAE across those GPUs.
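As a rough illustration of what that distribution can look like, here is a toy sketch that splits an SAE's feature dictionary across devices along the feature dimension. This is purely a hypothetical example of mine (the device list, shard layout, and summed partial reconstructions are all assumptions), not the team's actual infrastructure, which in a real multi-GPU setup would also involve communication primitives like all-reduce.

```python
# Illustrative sketch (not Anthropic's code): splitting an SAE's feature
# dictionary across devices along the feature dimension. Each device holds
# a slice of the encoder/decoder weights and computes its own slice of features.
import torch

def shard_features(d_model: int, d_features: int, devices: list[str]):
    # Each shard owns d_features / len(devices) features.
    per_shard = d_features // len(devices)
    shards = []
    for dev in devices:
        enc = torch.randn(d_model, per_shard, device=dev) * 0.01
        dec = torch.randn(per_shard, d_model, device=dev) * 0.01
        shards.append((enc, dec))
    return shards

def forward_sharded(activations: torch.Tensor, shards):
    # Each device computes its slice of features; the partial reconstructions
    # are then summed (an all-reduce in a real multi-GPU setup).
    partial_recons = []
    for enc, dec in shards:
        x = activations.to(enc.device)
        feats = torch.relu(x @ enc)
        partial_recons.append((feats @ dec).to(shards[0][0].device))
    return sum(partial_recons)

# Toy usage on CPU: shards = shard_features(512, 4096, ["cpu", "cpu"])
```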

3. A lot of interpretability work is deciding how to make proper tradeoffs between the efficiency of code and the speed of results

The researchers explain how their team sits at a unique intersection between engineering and science. Their research is driven by the very pragmatic motivation of making their AI systems safer, which means they aren't focused on optimizing their research for publishing papers. However, they also aren't trying to release a fully finished software product, so they aren't necessarily trying to write the most polished, optimized code either.

A lot of their work comes down to deciding whether the code they are writing for a given experiment can be written quickly to get a fast result, or whether it needs to be thought out more thoroughly and optimized to save time later down the line.

One example that demonstrates this is the problem they ran into with shuffling large amounts of data. For a model not to learn any patterns related to the ordering of its training data, it is important for the data to be shuffled before being fed in. However, as the amount of data grows to petabytes, which is what they were dealing with for their LLMs, the best methods for efficiently shuffling it start to change.
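The discussion doesn't spell out the exact strategy they landed on, but a common pattern once data no longer fits in memory is a two-pass "bucket" shuffle: scatter records pseudo-randomly into many shards on disk, then shuffle each shard in memory. The sketch below is a generic illustration of that pattern, with made-up file paths and sizes, not Anthropic's implementation.

```python
# Generic two-pass shuffle sketch for data too large to shuffle in memory.
# Illustrative only; not the specific strategy described in the discussion.
import os
import pickle
import random

def two_pass_shuffle(records, num_buckets=64, tmp_dir="shuffle_tmp"):
    os.makedirs(tmp_dir, exist_ok=True)
    bucket_paths = [os.path.join(tmp_dir, f"bucket_{i}.pkl") for i in range(num_buckets)]
    buckets = [open(p, "wb") for p in bucket_paths]

    # Pass 1: scatter each record into a random bucket on disk.
    for record in records:
        pickle.dump(record, buckets[random.randrange(num_buckets)])
    for f in buckets:
        f.close()

    # Pass 2: each bucket is small enough to load, shuffle in memory, and emit.
    for path in bucket_paths:
        bucket = []
        with open(path, "rb") as f:
            while True:
                try:
                    bucket.append(pickle.load(f))
                except EOFError:
                    break
        random.shuffle(bucket)
        yield from bucket
```

At small scale the extra disk pass is pointless overhead; at petabyte scale some variant of this kind of staged approach becomes necessary, which is exactly the "best method changes with scale" point the researchers make.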

As they continued to scale up their models, the research team noticed that their shuffling methods were becoming a greater and greater bottleneck in how fast their experiments ran. They eventually reached a point where it made sense to take the time to redesign and implement a new shuffling strategy in order to allow faster run times for their later experiments.

This ability to look ahead and judge which parts of their code will be used once and thrown away, versus which will be useful for long enough to warrant optimization, is something the team emphasized as important for their researchers to have.

4. Interpretability research is a very interdisciplinary field where both research and engineering skills are crucial

One thing the researchers in this discussion really emphasized was just how important it is to have a breadth of knowledge when working in AI interpretability. The methods for interpretability are constantly changing as the scale increases and more ideas are tested. This means the type of skills and issues researchers in the field work with are also constantly changing.

This, along with the fact that the team is not trying to optimize for engineering or science alone, means it's often more valuable for a given employee to work well across a wide range of engineering and science tasks than to be highly skilled in one narrow domain of expertise. This applies even more to interpretability teams, where the synergies between people's different skillsets, such as engineering and science, can greatly advance interpretability work.

The researchers demonstrate this by pointing out that the engineers and scientists on their team are very much intertwined in their work rather than neatly separated into two groups. Collaboration across skillsets and professional backgrounds is constantly happening within Anthropic's interpretability team and has been one of the key contributors to its success so far.

 

While these were some of the insights I found most interesting, there was a lot more covered in the discussion, so I definitely encourage you to give it a listen here!
