• While sucky, this feels inevitable.

    With LLMs and the massive wave of spam coming out right now, caching content gets way more expensive, and Google gains no value from it in return. Long-tail spam attacks are already strangling Google lately.

    I think the only way to run a search engine in the mid 2020s is to download the data, process the page in memory, extract metadata + embeddings, and store only those. There’s no value in storing the rendered page offline for later analysis, since you’re likely never going to do that later analysis anyway.
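
    For what it’s worth, a rough sketch of what I mean by that kind of pipeline (library choices and the model name here are just assumptions on my part, not anything specific to how Google does it):

        # Minimal sketch of "process in memory, keep only metadata + embeddings".
        # requests / BeautifulSoup / sentence-transformers and the model name
        # are arbitrary choices for illustration.
        import requests
        from bs4 import BeautifulSoup
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer("all-MiniLM-L6-v2")  # any small embedding model works

        def index_page(url: str) -> dict:
            html = requests.get(url, timeout=10).text  # raw HTML lives only in memory
            soup = BeautifulSoup(html, "html.parser")
            text = soup.get_text(separator=" ", strip=True)
            return {
                "url": url,
                "title": soup.title.string if soup.title else "",
                "embedding": model.encode(text[:5000]).tolist(),  # truncate long pages
            }  # store this record; the rendered page itself gets thrown away

    The point is that the only thing that ever hits disk is the small record at the end, not the page.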

    The Internet Archive can hopefully fare better, since it’s curated by humans and stores data infrequently, when it matters, whereas Google needs to scan a huge amount of info frequently with nearly no human input.