AI Is a Black Box. Anthropic Figured Out a Way to Look Inside

hedge ( @hedge@beehaw.org ) · 1 month ago

astronaut_sloth ( @astronaut_sloth@mander.xyz ) · 1 month ago

The original paper itself, for those who are interested.

Overall, this is really interesting research and a really good “first step.” I will be interested to see if this can be replicated on other models. One thing that really stood out, though, is that certain details are withheld because Sonnet is proprietary. Hopefully follow-on work gets done on one of the open-source models to confirm the method.

One of the notable limitations is quantifying how a feature’s activation strength maps to the meaning of the text, which will make any sort of fine-grained control difficult. Sure, you can just massively increase or decrease an activation, and for some things that will be fine, but for real manual fine-tuning, that will be a problem.
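To make the "just crank it up or down" part concrete, here is a toy sketch in plain Python. It assumes a feature is a direction in the model's activation space, read out by a small ReLU encoder, and that steering means shifting the activation vector along that direction until the feature hits a chosen value. Every name and number here (`encode`, `steer`, `ENC`, `DEC`, the 3-dimensional vectors) is made up for illustration; this is not the paper's code or anything from Sonnet.

```python
# Toy sketch: steer an activation vector along one "feature" direction.
# All names and values are hypothetical, chosen only for illustration.

def relu(values):
    return [max(0.0, v) for v in values]

def encode(x, enc_rows, enc_bias):
    """Feature activations, computed as ReLU(W_enc @ x + b_enc)."""
    pre = [sum(w * a for w, a in zip(row, x)) + b
           for row, b in zip(enc_rows, enc_bias)]
    return relu(pre)

def steer(x, feats, dec_dirs, index, target):
    """Shift x along feature `index`'s decoder direction so it reads `target`."""
    delta = target - feats[index]
    return [a + delta * d for a, d in zip(x, dec_dirs[index])]

# Two hypothetical features living in a 3-dimensional activation space.
ENC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
ENC_BIAS = [0.0, 0.0]
DEC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

x = [0.5, 2.0, -1.0]                     # activation vector before steering
feats = encode(x, ENC, ENC_BIAS)         # how strongly each feature fires
steered = steer(x, feats, DEC, index=0, target=5.0)
steered_feats = encode(steered, ENC, ENC_BIAS)

print("before:", feats)
print("after: ", steered_feats)
```

The mechanics are the easy part. The hard part is picking `target`: nothing in this sketch says how much the output text changes when the feature goes from 0.5 to 5.0, and that mapping is exactly what would be needed for careful manual tuning.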

I suspect this method is generalizable (maybe with some tweaks?), and I’d really be interested to see how this type of analysis could be applied to other neural networks.