
Recently, the New York Times reported on widespread discontent with Google's AI search helper. There are plenty of anecdotes about the feature incorrectly recommending that people add food-safe glue to their culinary endeavors, leave animals in hot cars, or smoke cigarettes while pregnant. While the plural of anecdote is not data, and while it's bad to dislike something simply because it's popular to dislike it, I think there's a compelling reason the Google AI in search is just bad.

I lied, there are two.

Let's start with the smaller of the two reasons: energy. If Google handled volumes like DuckDuckGo's or OpenAI's, it might not be an issue, but I think it's important to remember the scale of Google. Google gets roughly 100,000 searches per second, or about 8.5 billion searches per day.
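As a quick sanity check, those two figures are consistent with each other:

```python
# Sanity check on Google's scale (figures from the post, not measured by me):
searches_per_day = 8.5e9
seconds_per_day = 24 * 60 * 60  # 86,400
searches_per_second = searches_per_day / seconds_per_day
print(f"{searches_per_second:,.0f} searches/second")  # ≈ 98,380, i.e. ~100k/s
```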

Let's try some back-of-the-envelope math. The best-performing model I could find with under 1B parameters on the Hugging Face large language model leaderboard was a 7M. (7 million parameters, for the uninitiated.) Big caveat: this model was not fine-tuned for summarization; I don't know its overall performance, whether it will do summarization, or whether it will generate good embeddings. The important part is that we have a reasonable count for the minimum viable model. A batched inference, assuming a maximally efficient operation, comes to: qp = query*projmatrix, kp = key*projmatrix, vp = value*projmatrix (three flops per weight), qp dot kp (another flop per weight), softmax (another flop), and (qp dot kp) dot vp (another flop). That's six flops per weight, thereabouts, so a 7M model will take 42 MFlops per inference at optimal efficiency. It's important to reiterate that this is a lower bound: I'm not counting any normalization, I'm assuming a full batch, and I'm not counting any activations other than the softmax. At our 8.5 billion queries per day, that's an additional 357,000,000,000,000,000 flops (357 petaflops) per day. (See below for notes on caching.)
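The arithmetic above, spelled out (the six-flops-per-weight figure is the post's own lower-bound estimate, not a measurement):

```python
# Lower-bound estimate: ~6 FLOPs per weight per inference
# (3 projection multiplies, QK^T, softmax, attention-weighted V).
params = 7e6                # 7M-parameter model
flops_per_weight = 6        # the per-weight estimate from above
flops_per_query = params * flops_per_weight        # 42 MFLOPs per inference
queries_per_day = 8.5e9
flops_per_day = flops_per_query * queries_per_day  # total additional daily compute

print(f"{flops_per_query / 1e6:.0f} MFLOPs per query")  # 42
print(f"{flops_per_day / 1e15:.0f} PFLOPs per day")     # 357
```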

Admittedly 357 PFlops is a drop in the bucket compared to other compute, but nevertheless, let's look at it from an energy perspective: the most efficient supercomputer (JEDI, as of 2024) manages about 72.7 GFLOPS per watt (source: Green500, May 2024), which is the same thing as 72.7 gigaflops per joule. Dividing 357 petaflops per day by that efficiency gives a net additional cost of roughly 4.9 megajoules per day for their daily search, or an average continuous draw of about 57 watts.

57 watts for the feature, very rough ballpark, for a 7M-parameter model. That's a light bulb, not a power plant. If they use a 1B model, everything scales by about 143x: roughly 700 megajoules (about 195 kWh) per day, an average draw of around 8 kilowatts.

This figure ignores query caching, which would bring compute down, but it also ignores indexing for RAG, which would bring compute up. It also assumes TPUs are as efficient as the most efficient supercomputer in the world.

I think it's a judgement call at this point. On this lower bound the energy cost is negligible; the real question is how far above the lower bound the production system sits, since every order of magnitude in model size is another order of magnitude in energy. It doesn't seem worth it to me, but it's not so obscene that it's completely impossible to tolerate.
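Carrying the units through the division explicitly, so the dimensional analysis is checkable (same figures as above):

```python
# GFLOPS/watt = gigaflops per second per watt = gigaflops per joule,
# so (flops/day) / (flops/joule) yields joules per day, an energy, not a power.
flops_per_day = 3.57e17            # 357 PFLOPs/day from the estimate above
flops_per_joule = 72.7e9           # JEDI efficiency, Green500 May 2024
joules_per_day = flops_per_day / flops_per_joule
avg_watts = joules_per_day / 86_400  # spread evenly over a day

print(f"{joules_per_day / 1e6:.1f} MJ/day, {avg_watts:.0f} W average")  # 4.9 MJ/day, 57 W

# A 1B-parameter model is ~143x larger, so everything scales linearly:
scale = 1e9 / 7e6
print(f"{joules_per_day * scale / 1e6:.0f} MJ/day, {avg_watts * scale / 1e3:.1f} kW average")
```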

That brings us to the second and more important matter: hallucination.

In the best case, a user is smart enough to look up the answer and verify that it's correct. In the worst case, they take a wrong answer as gospel. The incidence rate of bad generations might be low, but it's high enough that people always have to check, which makes the feature useless. (Amusingly, since people need to check regardless and the AI response sits at the top, the feature actually costs them a little extra time scrolling past it.) Even if bad generations are rare, they're wrong often enough to be embarrassing to the industry. Google was always competent at the more subtle implementations: their computational photography was absolutely state of the art, photo search was top notch, and the clever algorithmic ranking used to be great. It was the quiet, carefully considered applications of ML that were most compelling. This feels like a vocal "behold my greatness" preceding a staggering display of mediocrity.
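One way to make the "people always have to check" argument concrete. This is my toy model, not the post's: if a bad answer is indistinguishable from a good one without checking, a careful user pays the verification cost on every query, so the summary can only ever add time.

```python
# Toy model (hypothetical costs, purely illustrative):
# t_read = seconds to read the AI summary, t_verify = seconds to verify via links.
def expected_time(t_read, t_verify, error_rate, careful=True):
    if careful:
        # Errors are undetectable without checking, so verify every time,
        # regardless of how low error_rate is.
        return t_read + t_verify
    # Careless user: accept the summary, pay the verification cost only
    # in expectation, when the answer happens to be wrong.
    return t_read + error_rate * t_verify

# With the feature (5s summary + 30s verification) vs. without it (30s):
print(expected_time(5, 30, 0.02))   # 35 — strictly worse than the 30s baseline
```

The careful user loses time on every query; the careless user saves time but absorbs the wrong answers. Neither outcome flatters the feature.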

A link to the original Reddit thread that held this comment: