Scott Adams, the author of Dilbert, passed away last month. While his work will live on, I was curious about the best way to build a Dilbert search engine.
The first step is to extract the text. Pavan tested over half a dozen LLMs on ~30 Dilbert strips to see which one did the best.
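As a rough illustration of what such a test harness sends per strip, here is a minimal sketch of building a request for an OpenAI-compatible vision chat API (e.g. OpenRouter). The model name, prompt wording, and payload shape here are my assumptions, not necessarily the exact setup used in the test:

```python
import base64
import json

def build_request(image_bytes: bytes,
                  model: str = "google/gemini-3-flash-preview") -> dict:
    """Build a chat-completions payload asking a vision model to transcribe
    one comic strip. (Hypothetical sketch; prompt and model id are assumptions.)"""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this comic strip. For each panel, list the "
                         "speaker and their dialogue; preserve capitalization "
                         "and note any captions."},
                # Images go inline as a base64 data URL in the standard
                # OpenAI-style image_url content part.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_request(b"fake-image-bytes")  # stand-in for real PNG bytes
print(json.dumps(payload)[:80])
```

The payload would then be POSTed to the provider's `/chat/completions` endpoint with an API key; swapping the `model` string is all it takes to compare providers.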
Summary: Gemini 3 Flash does the best, and would cost ~$20 to process the entire Dilbert archive. But if you want a local solution, Qwen 3 VL 32b is the best.
| Model | Score (%) | Text (40) | Speaker (25) | Captions (15) | Panel (10) | Hallucination (10) |
|---|---|---|---|---|---|---|
| gemini-3-flash-preview | 99.3% | 39.9 | 24.4 | 15.0 | 10.0 | 10.0 |
| qwen3-vl-32b-instruct | 96.0% | 39.8 | 21.6 | 15.0 | 9.9 | 9.7 |
| llama-4-maverick | 85.1% | 38.5 | 16.3 | 13.2 | 9.1 | 8.1 |
| llama-4-scout | 84.1% | 39.0 | 16.4 | 12.5 | 8.7 | 7.5 |
| gemma-3-27b-it | 81.3% | 37.8 | 13.1 | 14.4 | 8.4 | 7.6 |
| nemotron-nano-12b-v2-vl-free | 81.3% | 38.6 | 13.1 | 14.4 | 8.5 | 6.6 |
| molmo-2-8b-free | 70.4% | 36.2 | 16.4 | 0.5 | 8.8 | 8.4 |
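The overall score appears to be a straight sum of the five rubric components, whose maximums (40 + 25 + 15 + 10 + 10) total 100 points. For the top two rows the sums match exactly (the others differ by ≤0.1, presumably rounding in the displayed components):

```python
# Each row: (model, text/40, speaker/25, captions/15, panel/10, hallucination/10)
rows = [
    ("gemini-3-flash-preview", 39.9, 24.4, 15.0, 10.0, 10.0),
    ("qwen3-vl-32b-instruct",  39.8, 21.6, 15.0,  9.9,  9.7),
]
for model, *parts in rows:
    score = sum(parts)  # out of a maximum of 100 points
    print(f"{model}: {score:.1f}%")
# → gemini-3-flash-preview: 99.3%
# → qwen3-vl-32b-instruct: 96.0%
```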
That accuracy of 99.3% is impressive. Here's the biggest error it made:
Can you spot the error? The model attributed the text to Dogbert instead of the computer. (But you could argue that Dogbert is the one typing it...)
Here's another error:
Can you spot the error? In Panel 2, it's Dilbert speaking, not Dogbert.
In fact, the only transcription errors Gemini 3 Flash made were writing "McDONALD'S" instead of "MCDONALD'S" (see panel 2) and dropping the line-break hyphen in "PRESEN-TATION" (see panel 4).
Qwen 3 VL 32b made almost as few transcription errors. The bigger gap is in speaker detection, where the remaining models fall off steeply.
I spent 7 years typing out every one of the ~3,000 Calvin & Hobbes strips by hand. Doing the same for these ~12,000 Dilbert strips might take a few hours and a few dollars.
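Using the article's own figures (~$20 for ~12,000 strips with Gemini 3 Flash), the per-strip cost works out to a fraction of a cent:

```python
# Back-of-envelope from the figures above; both numbers are approximate.
total_cost_usd = 20
num_strips = 12_000
cost_per_strip = total_cost_usd / num_strips
print(f"~${cost_per_strip:.4f} per strip")  # → ~$0.0017 per strip
```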