While the creators of AI chatbots like ChatGPT can explain how they train their models and how the underlying technology works, they cannot fully explain what those models actually do with the information they are trained on.
In many cases, AI developers are surprised by what their own models can and cannot do, which makes this a key problem to solve. For example, the Udio team built an AI music model, only to find that it could also write and perform stand-up comedy.
Even leaders in the field are still struggling to work out how LLMs and other frontier models use the information they were trained on to do what they do, but OpenAI appears to have taken a first step toward deciphering this mystery.
Much remains unclear, but OpenAI researchers say they have discovered 16 million features in GPT-4, revealing something about what the model "thinks" about.
They did so using a technique called sparse autoencoders, machine learning models that learn to identify a small number of "most important" features. This is in contrast to conventional autoencoders, which use all of their features at once, making those features less useful for interpretation.
Suppose you're talking about cars with a friend. You still know how to cook your favorite dish, but that knowledge is unlikely to come up in a discussion about cars.
OpenAI says sparse autoencoders find a smaller, more useful set of important features and concepts that the model uses when generating an answer to a prompt, similar to the small set of concepts a person relies on in a particular discussion.
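To make that concrete, here is a minimal sketch of a sparse autoencoder in PyTorch. This is not OpenAI's code; the layer sizes, the top-k sparsity rule, and every name here are illustrative assumptions. The encoder projects a model's activation vector into a much wider feature space, only the k strongest feature activations are kept, and the decoder tries to reconstruct the original activation from that sparse code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative sparse autoencoder (not OpenAI's implementation).

    Encodes a model activation vector into a much wider feature space,
    keeps only the top-k strongest features (the sparsity constraint),
    and reconstructs the input from that sparse code.
    """

    def __init__(self, d_model: int = 768, n_features: int = 16384, k: int = 32):
        super().__init__()
        self.k = k                                   # features kept per input
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        pre = self.encoder(x)                        # dense feature activations
        topk = torch.topk(pre, self.k, dim=-1)       # keep k largest activations
        features = torch.zeros_like(pre)
        features.scatter_(-1, topk.indices, topk.values)  # sparse feature code
        reconstruction = self.decoder(features)
        return reconstruction, features


# Toy usage: train to reconstruct activations while staying sparse.
sae = SparseAutoencoder()
activations = torch.randn(8, 768)                    # stand-in for LLM activations
reconstruction, features = sae(activations)
loss = nn.functional.mse_loss(reconstruction, activations)
print(loss.item(), (features != 0).sum(dim=-1))      # loss and per-row sparsity
```

Because only a handful of features can be nonzero for any given input, each feature tends to correspond to a narrower, more human-readable concept than the densely mixed activations it was trained on.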
However, while sparse autoencoders can find the features inside a given model, that is only one step toward interpreting it. More work is needed to understand how the model actually uses those features.
OpenAI believes that understanding how models work is important because it can point to better ways of approaching model safety.
Another challenge is training the sparse autoencoders themselves, which demands considerable computing power to handle the required sparsity constraints and is complicated for a variety of reasons, such as avoiding overfitting.
However, OpenAI says it has developed new state-of-the-art methods that allow sparse autoencoders to scale to tens of millions of features in frontier AI models such as GPT-4 and GPT-4o.
To confirm that such features are interpretable, OpenAI says it enumerated fragments of documents in which the features activate. These included phrases related to price increases and to rhetorical questions.
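A rough illustration of that check, assuming a trained autoencoder like the sketch above and a hypothetical encode() helper that returns a fragment's sparse feature vector: scan a corpus and list the fragments where a chosen feature fires most strongly.

```python
# Hypothetical validation helper, not OpenAI's tooling: rank text
# fragments by how strongly they activate one chosen feature.
def top_activating_fragments(fragments, encode, feature_idx, n=5):
    # encode(fragment) is assumed to return the sparse feature vector
    # (e.g., the `features` output of the SparseAutoencoder sketch above).
    scored = [(encode(frag)[feature_idx].item(), frag) for frag in fragments]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # strongest first
    return scored[:n]
```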
This is a first step toward showing what a large language model focuses on, but OpenAI also acknowledges some limitations.
First, many of the features they discovered are still difficult to interpret, with many activating in no clear pattern. In addition, the researchers do not yet have a good way to check the validity of their interpretations.
In the short term, OpenAI hopes the features it has found will help monitor and steer the behavior of language models.
In the long run, OpenAI wants interpretability to provide new ways to reason about the safety and robustness of models. Understanding how and why AI models behave the way they do could help people trust them with important decisions.