• Researchers who work on transformer models understand how the algorithm works, but they don’t yet know how their simple programs can generalize as much as they do.

    They do!

    You can even train small networks by hand with pen and paper. You can also manually design small models without training them at all.
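
    As a minimal sketch of the "designed by hand, no training" idea, here is a tiny two-layer network whose weights were picked by hand so that it computes XOR. The function names (`step`, `xor_net`) and the particular weights and thresholds are illustrative assumptions, not something from the post above; many other weight choices would work equally well.

```python
def step(x):
    """Threshold activation: fires (1) when the input is positive."""
    return 1 if x > 0 else 0


def xor_net(a, b):
    """Two-layer network with hand-picked weights (no training involved)."""
    # Hidden unit 1 acts as OR: fires if at least one input is 1.
    h_or = step(1.0 * a + 1.0 * b - 0.5)
    # Hidden unit 2 acts as AND: fires only if both inputs are 1.
    h_and = step(1.0 * a + 1.0 * b - 1.5)
    # Output fires when OR is on and AND is off, which is XOR.
    return step(1.0 * h_or - 1.0 * h_and - 0.5)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))
```

    Every number in it is small enough to check with pen and paper, which is the point: nothing in the mechanism itself is mysterious at this scale.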

    The interesting part is that this dated tech produces such good results now that we throw modern hardware at it.