In context: The constant improvements AI companies keep making to their models might lead you to think we’ve finally figured out how large language models (LLMs) work. Nope – LLMs remain one of the least understood mass-market technologies ever. Anthropic, however, is attempting to change that with a new technique called circuit tracing, which has helped the company map out some of the inner workings of its Claude 3.5 Haiku model.
Circuit tracing is a relatively new technique that lets researchers track how an AI model builds its answers step by step – like following the wiring in a brain. It works by chaining together the model’s internal components into circuits that show how information flows from prompt to response. Anthropic used it to spy on Claude’s inner workings, and it revealed some truly odd, sometimes inhuman ways of arriving at an answer that the bot wouldn’t even admit to using when asked.
In total, the team inspected 10 different behaviors in Claude. Three stood out.
One was pretty simple and involved answering the question “What’s the opposite of small?” in different languages. You’d think Claude might have separate components for English, French, and Chinese. But no – it first figures out the answer (something related to “bigness”) using language-neutral circuits, then picks the right words to match the language of the question.
This means Claude isn’t just regurgitating memorized translations – it’s applying abstract concepts across languages, almost like a human would.
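To make that two-step pattern concrete, here’s a toy Python sketch – a loose analogy only, with made-up dictionaries and a made-up function, not anything resembling Claude’s actual circuitry:

```python
# Toy analogy only: resolve the answer as a language-neutral concept first,
# then render that concept in the language of the question.

CONCEPT_ANTONYMS = {"SMALLNESS": "LARGENESS"}  # hypothetical abstract concepts

SURFACE_FORMS = {  # hypothetical per-language words for each concept
    "LARGENESS": {"en": "large", "fr": "grand", "zh": "大"},
}

def opposite_of_small(language: str) -> str:
    concept = CONCEPT_ANTONYMS["SMALLNESS"]   # step 1: language-neutral reasoning
    return SURFACE_FORMS[concept][language]   # step 2: match the question's language

print(opposite_of_small("en"))  # large
print(opposite_of_small("fr"))  # grand
```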
Then there’s math. Ask Claude to add 36 and 59, and instead of following the standard method (adding the ones digits, carrying the one, and so on), it does something way weirder. It starts approximating by adding “40ish and 60ish” or “57ish and 36ish” and eventually lands on “92ish.” Meanwhile, another part of the model focuses on the digits 6 and 9, realizing the answer must end in a 5. Combine those two weird steps, and it arrives at 95.
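Here’s a rough Python sketch of that two-path idea, purely for illustration – the functions below are invented stand-ins, not Anthropic’s account of how Claude actually computes:

```python
# Illustrative sketch: one path makes a fuzzy magnitude estimate, another
# pins down the exact last digit, and the two are combined at the end.

def last_digit_path(a: int, b: int) -> int:
    # Exact last-digit arithmetic: 6 + 9 ends in 5
    return (a + b) % 10

def combine(fuzzy_estimate: int, last_digit: int) -> int:
    # Snap the fuzzy estimate onto the nearest number with the right last digit
    base = fuzzy_estimate - fuzzy_estimate % 10
    candidates = (base - 10 + last_digit, base + last_digit, base + 10 + last_digit)
    return min(candidates, key=lambda c: abs(c - fuzzy_estimate))

# "40ish plus 60ish" lands somewhere around 92; the digit path says "ends in 5"
print(combine(fuzzy_estimate=92, last_digit=last_digit_path(36, 59)))  # -> 95
```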
However, if you ask Claude how it solved the problem, it’ll confidently describe the standard grade-school method, concealing its actual, bizarre reasoning process.
Poetry is even stranger. The researchers tasked Claude with writing a rhyming couplet, giving it the prompt “A rhyming couplet: He saw a carrot and had to grab it.” Here, the model settled on “rabbit” as the rhyme for “grab it” while it was still processing the first line. Then it appeared to construct the next line with that ending already decided, eventually spitting out “His hunger was like a starving rabbit.”
This suggests LLMs might have more foresight than we assumed and that they don’t always just predict one word after another to form a coherent answer.
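As a loose analogy, that “decide the ending first, then write toward it” behavior looks something like this toy Python sketch (the rhyme table and helper are entirely hypothetical):

```python
# Toy analogy only: pick the rhyming word before writing the line,
# then compose the rest of the line so it lands on that word.

RHYMES = {"grab it": ["rabbit", "habit"]}  # hypothetical rhyme lookup

def write_second_line(first_line_ending: str) -> str:
    target = RHYMES[first_line_ending][0]       # decide the ending word up front
    body = "His hunger was like a starving"     # then build the line toward it
    return f"{body} {target}."

print(write_second_line("grab it"))  # His hunger was like a starving rabbit.
```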
All in all, these findings are a big deal – they show that we can finally see how these models operate, at least in part.
Still, Joshua Batson, a research scientist at the company, told MIT Technology Review that this is just “tip-of-the-iceberg” stuff. Tracing even a single response takes hours, and there’s still a lot of figuring out left to do.