AI Is a Black Box. Anthropic Figured Out a Way to Look Inside

hedge ( @hedge@beehaw.org ) · 28 days ago


Onno (VK6FLAB) ( @vk6flab@lemmy.radio ) · 28 days ago

The underlying issue with an LLM is that there is no “learning”. The model itself doesn’t dynamically change whilst it’s being used.

This article sets out a process that gives the ability to alter the model, by “dialling up” (or down) concepts. In other words, it’s changing the balance of the weight of concepts across the whole model.
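As a rough illustration of what “dialling up” a concept could look like, here is a toy sketch. It assumes a concept corresponds to a direction in the model’s internal activations, and that steering means shifting the activation along that direction until the concept’s strength hits a chosen value. The vectors and the `steer` helper are made up for illustration; this is not the code or the exact method from the article.

```python
import math

def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer(activation, direction, target):
    """Shift `activation` along `direction` so that its projection
    onto that direction equals `target`. Everything orthogonal to
    the direction is left untouched."""
    unit = normalize(direction)
    current = dot(activation, unit)  # how strongly the concept is present now
    return [a + (target - current) * u for a, u in zip(activation, unit)]

# Hypothetical values: a 3-dimensional "hidden state" and a "concept" direction.
activation = [0.2, -1.0, 0.5]
concept = [1.0, 0.0, 0.0]

dialled_up = steer(activation, concept, 5.0)    # concept forced high
dialled_down = steer(activation, concept, 0.0)  # concept suppressed
```

Note that in this sketch nothing about the model’s stored weights changes; only the activation for one input is nudged, and the nudge comes from outside the model.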

Altering one concept is hardly “learning”, especially since it’s being done externally by researchers, but it’s a start.

A much larger problem is that the energy consumption is several orders of magnitude larger than that of our brain. I’m not convinced that we have enough energy to make a standalone “AI”.

What machine learning actually gave us is the ability to automatically improve a digital model of things. Take weather prediction: a week of forecast that used to take hours on a supercomputer can now be produced on a laptop in minutes, with a much longer range and better accuracy. Machine learning made that possible.

An LLM is attempting the same thing with human language. It’s tantalising, but ultimately I think the idea applied to language to create “AI” is doomed.

Paragone ( @Paragone@beehaw.org ) · 8 days ago

To the best of my knowledge, back-propagation IS learning, whether it’s happening in a neural-net on a chip, or whether we’re doing it, through feedback, & altering our understanding ( so both hard-logic & our wetware use the method for learning, though we use a rather sloppy implementation of it. )

& altering the relative-significances of concepts IS learning.

( I’m not commenting on whether the new-relation-between-those-concepts is wrong or right, only on the mechanism )

so, I can’t understand your position.

Please don’t deem my comment worthy of answering: I’m only putting this here for the record, is all.

Everybody can downvote my comment into oblivion, & everything in the world’ll still be fine.

Onno (VK6FLAB) ( @vk6flab@lemmy.radio ) · 8 days ago

Back propagation happens during the creation of the model, not after it’s deployed.
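A minimal sketch of that distinction, assuming a toy one-weight model trained by plain gradient descent (the data, learning rate, and function names are made up for illustration): the weight only moves inside `train`, and using the model afterwards leaves it untouched.

```python
def predict(w, x):
    """Inference: read the weight, never write it."""
    return w * x

def train(w, data, lr=0.1, epochs=50):
    """Training: adjust the weight from the error on each example."""
    for _ in range(epochs):
        for x, y in data:
            error = predict(w, x) - y
            grad = 2 * error * x  # derivative of (w*x - y)**2 with respect to w
            w -= lr * grad        # the only place the weight changes
    return w

# Hypothetical training data following y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0)]

w = train(0.0, data)  # "creation of the model"
w_before = w

# "Deployment": however many predictions are made, w stays the same.
outputs = [predict(w, x) for x in (3.0, 4.0)]
```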

astronaut_sloth ( @astronaut_sloth@mander.xyz ) · 27 days ago

The original paper itself, for those who are interested.

Overall, this is really interesting research and a really good “first step.” I will be interested to see if this can be replicated on other models. One thing that really stood out, though, was that certain details are obfuscated because of Sonnet being proprietary. Hopefully follow-on work is done on one of the open source models to confirm the method.

One of the notable limitations is quantifying how activations correlate with text meaning, which will make any sort of control difficult. Sure, you can just massively increase or decrease a weight, and for some things that will be fine, but for real manual fine-tuning, that will prove to be a difficulty.

I suspect this method is likely generalizable (maybe with some tweaks?), and I’d really be interested to see how this type of analysis could be done on other neural networks.

Quexotic ( @Quexotic@beehaw.org ) · 28 days ago

https://archive.is/20240521152252/https://www.wired.com/story/anthropic-black-box-ai-research-neurons-features/

Ilandar ( @Ilandar@aussie.zone ) · 28 days ago

This sounds promising but I do wonder how undermined any progress they make will be by:

- the speed of advancements in AI
- the fact that this research doesn’t necessarily apply to other LLMs
- the fact that LLMs are being released/leaked to the public, so anyone who has access to them has the potential to jailbreak the AI and circumvent any safety precautions researchers implement as a result of this work