Gemma Scope: helping the safety community shed light on the inner workings of language models
With the recent announcement of Gemma Scope, a new set of tools to help researchers better understand and interpret open models, the interpretability community has made great progress in understanding small models with sparse autoencoders and in developing related techniques such as causal interventions, automatic circuit analysis, feature interpretation, and the evaluation of sparse autoencoders. These developments will significantly aid the mechanistic interpretability research community as it explores more complex behaviors, such as chain-of-thought reasoning and hallucination, that only arise in larger models. In particular, Gemma Scope’s new state-of-the-art JumpReLU architecture promises to make Gemma one of the best model families for studying problems like hallucination.

Gemma Scope is the product of a collective effort, made possible with help from Dr. Tom Lieberum, the AI researchers at Neuronpedia, Dr. Neel Sanseviero, Dr. Art Huang, Dr. Jennifer Lowe, and many others in the community.
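To make the JumpReLU idea concrete, here is a minimal sketch of a JumpReLU sparse autoencoder: a linear encoder whose pre-activations are zeroed wherever they fall below a learned per-feature threshold, followed by a linear decoder. The class name, dimensions, and initialization below are illustrative assumptions, not Gemma Scope’s released code, and the sketch covers inference only (training the threshold requires straight-through gradient estimators, which are omitted here).

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Minimal, illustrative JumpReLU sparse autoencoder.

    JumpReLU keeps a feature's raw pre-activation only where it exceeds a
    learned per-feature threshold theta; everything below theta is exactly
    zero, which encourages sparse, interpretable feature activations.
    """

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Learned per-feature threshold, stored in log space so it stays positive.
        self.log_theta = nn.Parameter(torch.zeros(d_sae))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = x @ self.W_enc + self.b_enc
        theta = self.log_theta.exp()
        # JumpReLU: pass the pre-activation through unchanged where it clears
        # the threshold; zero it out everywhere else.
        return pre * (pre > theta)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))

# Usage with stand-in residual-stream activations (sizes are illustrative):
sae = JumpReLUSAE(d_model=2304, d_sae=16384)
acts = torch.randn(8, 2304)
recon = sae(acts)
```

Compared with a plain ReLU, the jump discontinuity lets each feature learn how strongly it must fire before it counts as "on", which is the property the JumpReLU paper credits for better reconstruction at a given sparsity level.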