Recently, the New York Times reported on widespread discontent with Google’s AI search helper. There are plenty of anecdotes about Google AI incorrectly recommending that people add food-safe glue to their culinary endeavors, leave animals in hot cars, or smoke cigarettes while pregnant. While the plural of anecdote is not data and while it’s bad to dislike something simply because it’s popular to dislike it, I think there’s a compelling reason the Google AI in search is just bad.

I lied, there are two.

Let’s start with the smaller of the two reasons: energy. If Google had volumes like DuckDuckGo or OpenAI it might not be an issue, but I think it’s important to remember the scale of Google. Google gets about 100k searches per second, or about 8.5 billion searches per day.

Let’s try some back-of-the-envelope math: The best-performing model I could find with fewer than 1B parameters on the Hugging Face large language model leaderboard was a 7M. (7 million parameters, for the uninitiated.) Big caveat: this model was not fine-tuned for summarization, I don’t know the overall performance, I don’t know if it will do summarization, and I don’t know if it will generate good embeddings. The important part is that we have a reasonable count for the minimum viable model. A batched inference, assuming a maximally efficient operation, will be qp = query*projmatrix, kp = key*projmatrix, vp = value*projmatrix (three flops per weight), qp dot kp (another flop per weight), softmax (another flop), and (qp dot kp) dot vp (another flop). That’s six flops per weight, thereabouts, so a 7M model will take 42 MFlop per inference at optimal efficiency. I think it’s also important to reiterate that this is a lower bound. I’m not counting any normalization, I’m assuming a full batch, and I’m not counting any activations other than the softmax. At our 8.5 billion queries per day that’s an additional 357,000,000,000,000,000 floating-point operations (357 PFlop) per day. (See below for notes on caching.)
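For anyone who wants to poke at the inputs, here is a minimal sketch of the operation count in Python. The constants are just the rough figures quoted above (8.5 billion searches, 7M parameters, six flops per weight); the function names are mine and purely illustrative.

```python
# Back-of-the-envelope operation count for a minimal AI search helper.
# Every input is a rough figure from the text, not a measurement.

SEARCHES_PER_DAY = 8.5e9   # roughly 100k searches per second
PARAMS = 7e6               # the 7M-parameter model
FLOPS_PER_WEIGHT = 6       # the rough per-weight estimate from the text


def flops_per_inference(params, flops_per_weight=FLOPS_PER_WEIGHT):
    """Floating-point operations for one inference, at six flops per weight."""
    return params * flops_per_weight


def flops_per_day(params, searches_per_day=SEARCHES_PER_DAY):
    """Total floating-point operations if every search triggers one inference."""
    return flops_per_inference(params) * searches_per_day


print(f"{flops_per_inference(PARAMS) / 1e6:.0f} MFlop per inference")
print(f"{flops_per_day(PARAMS) / 1e15:.0f} PFlop per day")
```

Swapping `PARAMS` for `1e9` gives the count for a 1B model under the same assumptions.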

Admittedly 357 PFlop is a drop in the bucket compared to other compute, but nevertheless, let’s look at it from an energy perspective: the most efficient supercomputer (JEDI, as of 2024) manages about 72.7 GFLOPS/watt. (Source: Green500, May 2024) Dividing an operation count by that efficiency gives watt-seconds, i.e. joules, so we have a net additional cost of about 4,910,591 joules for their daily search. Spread over the 86,400 seconds in a day, that is an average draw of roughly 57 watts.

4.9 megajoules (about 1.4 kWh) per day for the feature, very rough ballpark, for a 7M parameter model. If they use a 1B model, this value becomes around 701 megajoules (about 195 kWh) per day, or a bit over 8 kilowatts sustained.

This figure ignores query caching, which would bring compute down, but it also ignores indexing for RAG, which would bring compute up. It also assumes TPUs are as efficient as the most efficient supercomputer in the world.

I think it’s a judgement call at this point. Somewhere between 5 and 700 megajoules extra per day doesn’t seem like it’s worth it to me, but it’s not so obscene that it’s completely impossible to tolerate.
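A small Python sketch of the energy arithmetic, under the same assumptions as the text (six flops per weight, 8.5 billion searches per day, hardware as efficient as JEDI). One thing worth keeping straight: FLOPS per watt is the same thing as operations per joule, so dividing a daily operation count by it yields joules per day, and dividing again by the seconds in a day yields average power. The function names are illustrative.

```python
# Energy estimate for one day of AI-assisted search.
# Inputs are the rough figures from the text, not measurements.

SECONDS_PER_DAY = 86_400
GFLOPS_PER_WATT = 72.7     # JEDI's Green500 efficiency, as quoted in the text
SEARCHES_PER_DAY = 8.5e9
FLOPS_PER_WEIGHT = 6


def joules_per_day(params):
    """Energy per day: total operations divided by operations per joule."""
    total_flop = params * FLOPS_PER_WEIGHT * SEARCHES_PER_DAY
    # FLOPS/watt == (operations/second) / (joules/second) == operations/joule
    return total_flop / (GFLOPS_PER_WATT * 1e9)


def average_watts(params):
    """Average continuous power implied by the daily energy figure."""
    return joules_per_day(params) / SECONDS_PER_DAY


for name, params in (("7M", 7e6), ("1B", 1e9)):
    joules = joules_per_day(params)
    print(f"{name}: {joules / 1e6:.1f} MJ/day, "
          f"{joules / 3.6e6:.1f} kWh/day, "
          f"{average_watts(params):.0f} W average")
```

Since every term is a lower bound or a best case, the useful output here is the order of magnitude, not the digits.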

That brings us to the second and more important matter: hallucination.

In the best case, a user is smart enough to look up the answer and verify that it’s correct. In the worst case, they take a wrong answer and use it as gospel. The incidence rate of bad generations might be low, but it’s sufficiently high that people always have to check, making the feature useless. (Amusingly, since people need to check regardless and the AI response comes at the top, this forces them to spend that tiny bit of extra time scrolling down.) Even if the bad generations are rare, they happen often enough to be embarrassing to the industry. Google was always competent at more subtle implementations. Their computational photography was absolutely state of the art. Photo search was top notch. Clever algorithmic ranking used to be great. It was the quiet, carefully considered applications of ML that were most compelling. This feels like a vocal “behold my greatness” preceding a staggering display of mediocrity.

A link to the original Reddit thread that held this comment:

Motion capture is neat! There are a bunch of tools available now, iMotion, assorted Kinect libraries, etc. I’ve tried most of them but haven’t found anything that I particularly enjoy using. Most markerless systems suffer from drift, poor localization, jitter, and a host of other issues. Inertial motion capture units also have drift of their own and misbehave when there are metallic objects or magnetic fields nearby. I don’t think I’ll be able to get around any of the markerless issues unless I put together a fiducial system, but I can still try to make a useful piece of software.

Motion Capture Mk5 is a system that captures human poses and streams the data across the network to connected clients. The protocol should be absurdly simple so that a client can be written in almost any language in the span of a day. This makes the tool multi-purpose: it can be used to stream capture data to Blender for animation or VRChat for real-time interaction or some other software for some other reason. Letting the application run on a networked machine means if you, like me, have a living space that’s slightly constrained, you can capture from a volume that’s larger while your machine sits off to the side and out of the way.

What skills do I hope to practice along the way?

  • UI development in Rust with mixed 2D/3D
  • Computer vision (if fiducials come into play)
  • Model integration into Rust via Tract or Candle (for depth estimation or pose estimation)
  • Rust networking and IPC

What kinds of deliverables will we see at the end?

I’d like to finish the month with an application that captures and sends the estimated positions and rotations of 13 joints. Ideally, the model will have depth estimation internally, too, so I can duplicate and modify the code into a SLAM tool. A Blender client or Godot client would not be out of the question.
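To give a feel for how small an "absurdly simple" protocol could be, here is a sketch of one possible frame layout in Python. Nothing here is the actual Mk5 protocol: the magic bytes `MCP5`, the header fields, and the choice of a position plus a quaternion per joint are all assumptions made up for illustration. Only the count of 13 joints comes from the goal above.

```python
import struct
from dataclasses import dataclass

JOINT_COUNT = 13
MAGIC = b"MCP5"  # hypothetical frame marker, not a real spec

# Hypothetical little-endian layout:
#   header: 4-byte magic, u32 frame index, f64 timestamp in seconds
#   joint:  3 x f32 position (x, y, z) + 4 x f32 rotation quaternion (w, x, y, z)
HEADER = struct.Struct("<4sId")
JOINT = struct.Struct("<7f")
FRAME_SIZE = HEADER.size + JOINT_COUNT * JOINT.size


@dataclass
class Joint:
    position: tuple  # (x, y, z)
    rotation: tuple  # quaternion (w, x, y, z)


def encode_frame(index, timestamp, joints):
    """Pack one pose frame into a fixed-size byte string."""
    if len(joints) != JOINT_COUNT:
        raise ValueError(f"expected {JOINT_COUNT} joints, got {len(joints)}")
    parts = [HEADER.pack(MAGIC, index, timestamp)]
    for joint in joints:
        parts.append(JOINT.pack(*joint.position, *joint.rotation))
    return b"".join(parts)


def decode_frame(data):
    """Unpack a frame produced by encode_frame."""
    if len(data) != FRAME_SIZE:
        raise ValueError(f"expected {FRAME_SIZE} bytes, got {len(data)}")
    magic, index, timestamp = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("bad magic bytes")
    joints = []
    for i in range(JOINT_COUNT):
        values = JOINT.unpack_from(data, HEADER.size + i * JOINT.size)
        joints.append(Joint(values[:3], values[3:]))
    return index, timestamp, joints
```

A fixed-size frame like this means a client only needs to read a known number of bytes from a socket and unpack them, which most languages can do with their standard library.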

Open Questions:

Should we start by using a pre-canned pose model? There’s a risk it will take more time to get a model building and integrated than it will to train from scratch, but it could be a good jumping-off point.

Should we try finding fiducial markers? AprilTag-rs could be a good option, but it doesn’t build on Windows, last I checked.

I have a notebook with a backlog of projects on which to work. For August, I’d like something on the easier side since I’ll be moving for a good chunk of it and probably exhausted for the rest.

Possibility 1: Workout Calculator

Three to four times a week I find myself needing to throw together a workout plan. That means making a set of sets that covers the main muscles, rests the ones that need to recover, and gets the heart rate up. There are some existing solutions online but honestly they’re all apps that are covered in ads and they annoy the hell out of me. This would probably be a mobile-first app; I’m thinking written in Godot.
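The core of such a calculator could be a small covering problem. Below is a rough Python sketch using a greedy pick: repeatedly choose the exercise that hits the most still-uncovered target muscles while skipping anything that touches a resting muscle. The exercise table and muscle names are made-up placeholders, and heart rate is not modeled at all.

```python
# Hypothetical exercise table: exercise name -> muscles it works.
EXERCISES = {
    "push-up": {"chest", "triceps", "shoulders"},
    "squat": {"quads", "glutes"},
    "row": {"back", "biceps"},
    "plank": {"core"},
    "lunge": {"quads", "glutes", "hamstrings"},
    "burpee": {"chest", "quads", "core"},
}


def plan_workout(targets, resting, exercises=EXERCISES):
    """Greedily pick exercises covering `targets` without using `resting` muscles.

    Returns (plan, uncovered), where `uncovered` holds any target muscles
    that no allowed exercise could reach.
    """
    resting = set(resting)
    remaining = set(targets) - resting
    # Only exercises that leave every resting muscle alone are allowed.
    allowed = {name: muscles for name, muscles in exercises.items()
               if not muscles & resting}
    plan = []
    while remaining:
        unused = [name for name in allowed if name not in plan]
        best = max(unused, key=lambda n: len(allowed[n] & remaining), default=None)
        if best is None or not allowed[best] & remaining:
            break  # nothing left that helps
        plan.append(best)
        remaining -= allowed[best]
    return plan, remaining


plan, uncovered = plan_workout({"chest", "back", "quads", "core"}, resting={"glutes"})
print(plan, uncovered)
```

Greedy covering is not guaranteed to find the shortest plan, but for a table of a few dozen exercises it is likely good enough for a first version.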

Possibility 2: Sketch to Animation

I would very much like to set up a model which will take a character sheet and a pose sketch to make a finalized posed frame for an animation. This would be similar to the “three-quarters view from front and side” project because I hate drawing 3/4ths views. This is mostly targeted at people that like animation but aren’t really keen on detailed character work. Let someone do the detailed work and let another person make animations, then combine their efforts automatically while removing the tedium of redrawing lots of the same content. Sure, there exist animation solutions already which cover a lot of this, but the idea still captures my attention.

Possibility 3: OpenFX DaVinci Resolve Plugin

I’ve been meaning to put together a plugin for Resolve so I can do something like implement thin-plate splines (i.e., porting puppet warp from After Effects). I don’t think this would be a finished product in itself; I’d only want to get the .ofx file built and running in Resolve and showing some kind of IO. Could be as simple as ‘invert color’, as long as it runs.

Possibility 4: Easily Addressable Short Term Memory Networks (EAST MN?)

It’s been a while since I did anything with language models, and I’ve been thinking about options to make smaller networks that are more powerful. Transformers do not have any historical state (positional encoding passed as input doesn’t count), which is a blessing and a curse. I’m curious about whether it would be possible to add a memory component to transformers that’s addressable in the same way attention is.
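One way to picture "addressable in the same way attention is": keep a list of key/value slots that persists between calls, and read from it with scaled dot-product attention. The Python sketch below is only an illustration of that reading; it is untrained, the class and method names are hypothetical, and how writes would be learned is exactly the open question.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class KeyValueMemory:
    """Persistent slots read by scaled dot-product attention (illustrative only)."""

    def __init__(self, dim):
        self.dim = dim
        self.keys = []
        self.values = []

    def write(self, key, value):
        """Append one slot; a real model would have to learn what to store."""
        self.keys.append(list(key))
        self.values.append(list(value))

    def read(self, query):
        """Return the attention-weighted mix of stored values for `query`."""
        if not self.keys:
            return [0.0] * self.dim
        scale = math.sqrt(self.dim)
        weights = softmax([dot(query, key) / scale for key in self.keys])
        return [sum(w * value[i] for w, value in zip(weights, self.values))
                for i in range(len(self.values[0]))]


memory = KeyValueMemory(dim=2)
memory.write([10.0, 0.0], [1.0, 0.0])
memory.write([0.0, 10.0], [0.0, 1.0])
print(memory.read([10.0, 0.0]))  # weighted almost entirely toward the first slot
```

Unlike the attention inside a transformer layer, the slots here outlive a single forward pass, which is the "historical state" the paragraph above is asking about.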

I have 36 hours left to decide. I’m leaning towards #3 or #4 — #3 because I like writing Rust and #4 because it would be useful and would sate my curiosity.