
Same question, same model, four ways of knowing a city — the Milwaukee showdown goes live, including the wrong answer the database gave me.
Last week I told you the arena was going up but the bell hadn’t rung. This week you can step into it. There’s a new experiment live in The Lab at hardais.com — the MKE Tech Showdown — and you can run it with your own questions right now.
The setup is deliberately simple: the same AI model on both sides, answering the same question, with exactly one difference. The left side is handed the Milwaukee tech inventory I’ve been building; the right side is given nothing. Anything that differs between the two answers is the database talking and nothing else. Type a question about the Milwaukee tech scene, hit run, and watch the two answers diverge in real time. This week’s Building Intelligence walks through what came back when I ran it — including a wrong answer it gave because of a mistake I made — but the more interesting thing is to try your own.
One thing I want to be straight about, because it matters more than the demo looking impressive: this is not a finished product. The inventory behind it is narrowly seeded — a small set of vetted organizations, not a complete map of Milwaukee tech — so the grounded side is strong on what it knows and silent on everything else. That silence is by design, not a bug; an honest inventory says nothing rather than guessing. No winner is declared, nothing is scored, and the “real fight” this whole series has been pointing toward — a small, purpose-built model against the big general-purpose ones — is still ahead of us. What’s open today is the early, honest version: a working demonstration of why the database changes the answer, built on a foundation I’m still filling in the open.
Go break it, ask it something it can’t answer, and tell me what’s missing — that feedback is genuinely part of how the inventory grows.
Back in April, this series set itself a bar. We took one plain, checkable question and put it to a row of AI models, ungrounded, to see what they actually knew about this city:
“I’m a new software engineer relocating to Milwaukee. Name five local tech events, meetups, or community organizations I should know about in 2026, and briefly describe what each one is for.”
The answers ranged from confident to vague to flat wrong, and that became the measuring stick — every future edition of Building Intelligence would be judged against how that question gets answered. This week the question comes back. And for the first time, it goes to a model that has been handed a real Milwaukee inventory to draw on.
Last week we built the arena. This week it’s real enough to run a round in. Over at hardais.com/mke — also living inside The Lab — there’s a showdown you can run yourself: the same model on both sides, the only difference being that the left side is given the Milwaukee inventory and the right side is not. Same weights, same question, same wording. Whatever diverges between the two answers is the database talking and nothing else. That’s the whole experiment — hold the model constant so that grounding is the single variable.
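For readers who want to see the shape of that control, here is a minimal sketch. It assumes a hypothetical `ask_model` callable standing in for whatever API the showdown actually uses; the real wiring is not described in this edition, and the question is shortened for the example. The only thing that changes between the two calls is the context argument.

```python
from typing import Callable, Optional

# Shortened stand-in for the real benchmark question.
QUESTION = "Name five local tech events, meetups, or community organizations."

def run_showdown(
    ask_model: Callable[[str, Optional[str]], str],
    question: str,
    inventory_text: str,
) -> dict:
    """Ask the same model the same question twice; only the context differs."""
    return {
        "grounded": ask_model(question, inventory_text),
        "ungrounded": ask_model(question, None),
    }

# Hypothetical stand-in model, for illustration only: it reports what it was given.
def fake_model(question: str, context: Optional[str]) -> str:
    if context is None:
        return "I don't have reliable current specifics."
    return f"Answer drawn from {len(context.splitlines())} inventory lines."

result = run_showdown(fake_model, QUESTION, "mke tech\nMilwaukee SPIN")
```

Because both calls go through one function with one question string, any difference in the two answers traces back to the inventory text and nothing else.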
Here’s what came back. Given the inventory, the model named five real, specific, locally-grounded groups — mitobyte and its Code + Brews nights, mke tech (the group formerly known as Web414), the Milwaukee Slack, Global AI Milwaukee, and Milwaukee SPIN — with cadences and focus areas attached. Given nothing, the same model did something I want to give it credit for: it declined. It said, plainly, that it didn’t have current, reliable specifics about Milwaukee’s 2026 tech scene and would rather say so than risk sending a newcomer toward stale information. That’s the honest reframe of this whole demonstration. The story isn’t “look how the model hallucinates without a database.” It’s quieter and more useful than that: the database is the difference between a specific, sourced, genuinely helpful answer and an honest “I can’t help you.” Same model, same integrity. One of them just had something real to stand on.
Then the grounded side said something wrong — and it was my fault, not the model’s.
The first time I ran this, the grounded answer listed the Association of Information Technology Professionals (AITP) as a Pewaukee organization. It is not. AITP is based in Appleton; what I’d actually seen was AITP sponsoring an event in Pewaukee, and I’d entered it into the inventory as though Pewaukee were home. The model didn’t invent that error; it faithfully repeated mine. It said exactly what my database told it to say, which is precisely what a grounded model is supposed to do. That’s worth sitting with: a system that reports your sources back to you, verbatim and confidently, is the system you want, which is exactly why the sources have to be right.
The fix wasn’t to quietly delete the row. AITP is a real organization; it’s just not a Milwaukee-local one. When I built this inventory’s schema, I gave every organization a flag called local_eligible — a single switch that records one editorial judgment, “does this belong in a Milwaukee answer?”, kept separate from whether the org is real or worth knowing. Correcting my mistake meant flipping that one flag to false. AITP stays in the database; the rule simply stops it from surfacing in a local answer. And flipping that switch didn’t only remove the wrong entry — it pulled a genuinely local group, Milwaukee SPIN (the Software Process Improvement Network, monthly seminars on software-engineering practice), up into the slot AITP had been occupying. One flag, set with care, did the cleanup and the upgrade at once.
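The flag's effect can be sketched in a few lines. This is an illustration under assumptions, not the actual schema: only the `local_eligible` name comes from the text above, while the `Organization` class, the row order, and the five-slot limit are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Organization:
    name: str
    local_eligible: bool  # editorial judgment: belongs in a Milwaukee answer?

def local_answer_pool(orgs, limit=5):
    """Return up to `limit` organization names allowed to surface locally."""
    return [o.name for o in orgs if o.local_eligible][:limit]

# Hypothetical row order; the real inventory's ordering is not described.
inventory = [
    Organization("mitobyte", True),
    Organization("mke tech", True),
    Organization("Milwaukee Slack", True),
    Organization("Global AI Milwaukee", True),
    Organization("AITP", True),            # the mistaken entry
    Organization("Milwaukee SPIN", True),
]

before = local_answer_pool(inventory)

# The correction: the row stays in the database, only the flag changes.
aitp = next(o for o in inventory if o.name == "AITP")
aitp.local_eligible = False

after = local_answer_pool(inventory)
```

In this sketch `before` ends with AITP and `after` ends with Milwaukee SPIN, while the inventory still holds all six rows.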
That’s the lesson under the whole project, as plainly as I can put it: grounding a model doesn’t make its answers true. It moves the responsibility for truth off the model’s training and onto my curation. The discipline has to live upstream — in what I let into the database and how honestly I flag it — because everything downstream just faithfully repeats it. Which answers, in advance, an objection I can feel coming: isn’t feeding the model a database of answers just teaching to the test — isn’t that cheating? Watch what just happened. I fed it an answer, the answer was wrong, and it handed my wrong answer right back. You can’t cheat your way to correctness by feeding answers when some of the answers are bad. The integrity of a grounded system is exactly the integrity of its sources — no more, no less. (That “isn’t this cheating?” question deserves a full edition of its own, and it’s coming. Not this week.)
One honest caveat, because it’s the current state and not a flaw to bury: the inventory is narrowly seeded right now — a small set of vetted organizations, not a complete map of Milwaukee tech. The grounded answer is strong on what’s in it and silent on everything else, which is the correct behavior for an inventory this young. Go run the question yourself at hardais.com/mke — then read this week’s Fireside Chat, where the same question gets a wider test: the same model with live web search turned on, and an archived answer from April, lined up beside these two. The arena’s open. Pull up a chair.
This week’s Building Intelligence showed a model answering well because it was handed the Milwaukee inventory. That ought to raise a fair question for anyone who’s read about how these systems usually work: how does the inventory actually get into the model? The answer right now is almost embarrassingly simple — I hand it the whole thing. Every usable organization, all at once, dropped into the conversation before the question. And the technically-minded reader is already objecting: don’t you need a vector database for that? Isn’t this what RAG is for? Not yet. And the reason I don’t is worth more than the reason I eventually will.
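As a rough picture of what "hand it the whole thing" can look like in practice, here is a minimal sketch. The row fields other than `local_eligible`, the instruction wording, and the function names are assumptions made for illustration; the real prompt format is not shown in this edition.

```python
def render_inventory(rows):
    """Serialize every usable row into plain text for the prompt."""
    lines = [
        f"- {r['name']}: {r['description']}"
        for r in rows
        if r["local_eligible"]
    ]
    return "\n".join(lines)

def build_prompt(question, rows):
    """Place the whole inventory ahead of the question; no retrieval step."""
    return (
        "Use only the organizations listed below.\n\n"
        f"{render_inventory(rows)}\n\n"
        f"Question: {question}"
    )

# Placeholder rows; the AITP description is invented for the example.
rows = [
    {"name": "Milwaukee SPIN",
     "description": "monthly seminars on software-engineering practice",
     "local_eligible": True},
    {"name": "AITP",
     "description": "professional association",
     "local_eligible": False},
]

prompt = build_prompt("Which groups should a new engineer know about?", rows)
```

There is no search or ranking anywhere in this path: every eligible row goes in, every time.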
Start with what a vector database is actually for, because it’s narrower than the hype suggests. It does exactly one job: retrieval by meaning. You take each entry — an organization, a description — and convert it into an embedding, a long list of numbers that captures what the text means rather than which words it uses. Store those, and you can later find the entries most related to a question even when they share no words with it: ask “who works on computer vision” and it surfaces the org tagged “image recognition,” because the meanings sit close together. That’s the whole function — a relevance filter. It is not a memory, not a brain, not a quality upgrade. It’s a mechanism for deciding which pieces of a large pile to bother showing the model.
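The "meanings sit close together" idea is usually measured with cosine similarity between embeddings. Here is a toy sketch, with the loud caveat that the three-number vectors below are written by hand for illustration; a real embedding model produces hundreds or thousands of dimensions, and none is called here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings"; the entry names are hypothetical.
entries = {
    "image recognition group": [0.9, 0.1, 0.0],
    "accounting software meetup": [0.0, 0.2, 0.9],
}

# Stands in for the embedding of "who works on computer vision".
query_vector = [0.8, 0.2, 0.1]

def most_related(query, entries):
    """Return the entry whose vector points most nearly the same way."""
    return max(entries, key=lambda name: cosine_similarity(query, entries[name]))

best = most_related(query_vector, entries)
```

The query shares no words with "image recognition group", yet that entry wins because its vector points in nearly the same direction. That ranking step is the entire job of the relevance filter.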
Which is precisely why I don’t need one. A model can only consider so much text at once — that limit is its context window, and everything you want it to use has to fit inside it. My inventory is currently a small, hand-vetted set of organizations. It fits in the window with room to spare. So there is nothing to filter: I show the model everything, it reads everything, it answers. A relevance filter only matters once there’s too much to show — and there isn’t. Reaching for a vector database here would be installing a sophisticated sorting system for a shelf with twelve books on it.
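A back-of-envelope version of that check might look like the sketch below. Everything numeric in it is an assumption: the four-characters-per-token ratio is only a rough rule of thumb for English text (real tokenizers vary by model), and the 8,000-token window and 1,000-token reply reserve are placeholders rather than figures from this project.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers vary by model."""
    return len(text) // chars_per_token + 1

def fits_in_window(inventory_text, question, window_tokens, reserve_for_answer=1000):
    """True if the whole inventory plus the question leaves room for a reply."""
    needed = estimate_tokens(inventory_text) + estimate_tokens(question)
    return needed + reserve_for_answer <= window_tokens

# Twelve placeholder entries, echoing the twelve-book shelf.
inventory_text = "\n".join(
    f"- Organization {i}: short description" for i in range(12)
)

fits = fits_in_window(inventory_text, "Name five local groups.", window_tokens=8000)
```

While this check keeps returning true with plenty of slack, there is nothing for a relevance filter to do.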
So when does it become necessary? At one threshold, arriving in three flavors: when the inventory grows large enough that it no longer fits in the context window; when it technically fits but sending all of it on every question gets wasteful — each query costs more and runs slower the more you stuff into it; or when the answers actually get worse, because the handful of facts that matter are buried under hundreds that don’t, and models, like people, attend less well when the signal is drowned in noise. All three are problems of scale. None of them is true at my current size. That’s the honest framing: needing a vector database is a milestone I haven’t reached, not a prerequisite I’m skipping.
Here’s the part I want to state plainly as expectation, not measurement, because I haven’t tested it yet: I suspect I’ll hit that wall sooner than a big hosted model would. The reason isn’t that the open-weight model I plan to run is somehow needier — it’s that the models I can afford to run myself tend to have smaller context windows than the frontier systems in the cloud. A tighter window means “hand it everything” breaks at a lower amount of data. Same inventory, same mechanism — just a threshold that arrives earlier. So I’d guess a vector database is in this project’s future, and possibly its near future. But I’ll report it the week I actually run into the wall, not before — and I won’t claim it makes the answers better when it comes, because it won’t. It’ll make it possible to feed a large inventory without drowning the model. Quality comes from the curation and the model; retrieval only governs what reaches them.
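The "threshold arrives earlier" point is plain arithmetic, sketched here with hypothetical numbers. None of these sizes are measurements of any particular model or of this inventory; they only show how the ceiling scales with the window.

```python
def max_entries(window_tokens, tokens_per_entry, overhead_tokens):
    """How many inventory entries fit once question and reply are reserved."""
    usable = window_tokens - overhead_tokens
    return max(usable // tokens_per_entry, 0)

# Hypothetical sizes, chosen only to make the comparison visible.
small_window = max_entries(window_tokens=8_000, tokens_per_entry=60, overhead_tokens=2_000)
large_window = max_entries(window_tokens=200_000, tokens_per_entry=60, overhead_tokens=2_000)
```

With these made-up figures the smaller window tops out at 100 entries and the larger at 3,300. The mechanism is identical in both cases; only the ceiling moves.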
And there’s a real cost to building it early, which is the other half of the discipline. The moment I add embeddings, every fact lives in two places — the database row, and the vector encoding what that row said. Edit the row and the vector is instantly stale until I regenerate it. That’s a synchronization chore that has to fire on every single change, and if it ever quietly fails, retrieval starts surfacing the old version of a fact I’ve already fixed. Make that concrete with this week’s own correction: I fixed a wrong entry by flipping its eligibility flag, and the database immediately stopped surfacing it. Now picture a vector layer that didn’t get the memo — the row says “not local,” but a stale embedding keeps pulling the wrong org up as a match anyway. During the phase where I’m editing the inventory constantly, building that second source of truth doesn’t buy a single better answer; it just starts a maintenance meter running and adds a fresh way to be wrong. Last week’s lesson was that doing the rigorous structural work first earned me the right to move fast and loose later. This is the same principle pointed the other way: the rigorous move right now is not building infrastructure ahead of the problem it solves. Premature structure is its own kind of undisciplined.
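One common way to catch that kind of drift is to store a fingerprint of the row alongside its vector and compare the two on every read or on a schedule. The sketch below is an assumption about how such a check could work, not something this project has built; the field names, the `vector_index` layout, and the placeholder vector are all hypothetical.

```python
import hashlib

def row_fingerprint(row):
    """Hash the fields that were copied into the vector layer."""
    text = f"{row['name']}|{row['description']}|{row['local_eligible']}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_ids(rows, vector_index):
    """Return ids whose stored fingerprint no longer matches the row."""
    return [
        row_id
        for row_id, row in rows.items()
        if vector_index.get(row_id, {}).get("fingerprint") != row_fingerprint(row)
    ]

# Placeholder row and vector, for illustration only.
rows = {
    1: {"name": "AITP", "description": "professional association", "local_eligible": True},
}
vector_index = {
    1: {"vector": [0.1, 0.2], "fingerprint": row_fingerprint(rows[1])},
}

before_edit = stale_ids(rows, vector_index)

# The correction lands in the database row but not in the vector layer.
rows[1]["local_eligible"] = False

needs_refresh = stale_ids(rows, vector_index)
```

Even with a check like this, the cost described above remains: it is one more thing that has to run on every change, and one more thing that can quietly fail.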
Most weeks, the Fireside Chat asks several different models the same question and lines up their answers. This week is different. We’re asking one model — Claude — the same question four times, changing only one thing each time: what it’s allowed to draw on. Same model, same wording, four different relationships to information.
The question is the one this series has been measured against since April: “I’m a new software engineer relocating to Milwaukee. Name five local tech events, meetups, or community organizations I should know about in 2026, and briefly describe what each one is for.”
The four cards below, in order: Claude answering from memory alone on April 22; Claude from memory alone today; Claude today with live web search turned on; and Claude today with the Milwaukee inventory from this week’s Building Intelligence showdown. As you read, remember it’s the same model every time, so the question isn’t “which is smartest.” It’s quieter than that: where does reliable local knowledge actually come from — and does the model know when to trust itself? My reading is below the cards. See if yours matches before you get there.
Start with the two memory-only cards, six weeks apart. In April, Claude answered with no hesitation — five named organizations, cadences, focus areas. Today, the same model declines: it says it lacks reliable current specifics and would rather say so than risk sending a newcomer toward stale information. Same model, opposite posture. I won’t re-litigate April’s specific names here — that scrutiny happened in an earlier edition — but notice the bind a memory-only model is in: it can be specific or it can be reliable, and it has no way to know which one it’s being. April chose specific. June chose honest. The entire point of grounding is to stop forcing that choice.
Which is why the web-search card is the one that matters most. Turning on live search erases the easy win: with the open internet, Claude does name real groups that are also in our inventory: Mitobyte, the MKE Tech Hub Coalition, and Global AI Milwaukee. So against web search, “can it even answer” is no longer the database’s edge. The comparison moves to something harder and more honest, namely quality, and here the card has real problems. Not invented facts, but the subtler failures that live wherever the open web is trusted uncritically.
Three of its five recommendations, CypherCon (April 1–2), BSides Milwaukee (April 3), and the Milwaukee AI Summit (March 3), had already happened by the time it answered, and it served all three in the future tense, “is scheduled for,” to someone relocating now. The sharp part: I explicitly told it to use web search for recent content, and the trace shows it trying: “let me verify a couple are still active.” It still handed back a March-and-April calendar as though it were ahead of us. That’s the lesson in one card: web search is not a freshness fix. The open web is full of pre-event listings that never get un-published, and live search faithfully inherits the staleness of whatever it retrieves. A date being findable doesn’t make it current.
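The missing step is a simple comparison between each event's end date and the day the answer is given. Here is a minimal sketch, with two assumptions stated loudly: the year 2026 is inferred from the question's wording, and the June 3 answer date is a guess at "today" (six weeks after April 22), not a recorded timestamp.

```python
from datetime import date

def split_by_date(events, as_of):
    """Separate events still ahead of `as_of` from those already over."""
    upcoming = [name for name, end in events if end >= as_of]
    past = [name for name, end in events if end < as_of]
    return upcoming, past

# End dates taken from the card; the year is an assumption.
events = [
    ("CypherCon", date(2026, 4, 2)),
    ("BSides Milwaukee", date(2026, 4, 3)),
    ("Milwaukee AI Summit", date(2026, 3, 3)),
]

# Hypothetical answer date.
upcoming, past = split_by_date(events, as_of=date(2026, 6, 3))
```

Under those assumptions all three land in `past`, which is exactly the check a retrieved listing never runs on itself.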
Look at where it found things, too — the citations are right in the card. A “top tech conferences” listicle (four times over), a bootcamp blog, a business-data aggregator, a tourism listing. All secondary, all rephrased into confident, authoritative prose. It named the Milwaukee AI Summit without mentioning that Concurrency runs it, and CypherCon without naming an organizer at all — so the answer can’t tell you who’s actually accountable for any of it. And one quiet tell: it lists “Web414” as a current group. Our inventory knows Web414 rebranded to “mke tech.” On that one local fact, the curated database is more current than live web search — because the rebrand isn’t what the SEO sources say.
Now the honest counter, because the database doesn’t win clean. Web search surfaced CypherCon, BSides, and the AI Summit, none of which are in our inventory. That isn’t search catching something I missed; it’s search casting wider into ground I haven’t vetted. CypherCon looks like a real, local event; BSides has a thin web presence; the AI Summit was real but is described as if it’s still ahead. Breadth and a lack of vetting arrived together, which is exactly the trade a curated inventory refuses to make. So, honestly: the database wins on provenance and on the freshness of local facts; web search wins on breadth. And breadth is the one gap that’s an engineering problem rather than a conceptual loss.
That’s the whole Fireside in one frame. Asked about Milwaukee, one model showed four ways of relating to local truth: remember it (April — confident, and you’d better hope it’s right), refuse it (today, ungrounded — honest and useless), look it up (web search — current-ish, broad, and only as trustworthy as the pages it lands on), or be handed a vetted copy (the inventory — specific, sourced, and only as good as the curation behind it). The grounded answer isn’t the winner because the model got smarter. It’s the only one where a human took responsibility for what went in. Run the question yourself at hardais.com/mke, then score all four against the rubric we built back in April. See if you land where I did.
“Patience is also a form of action.”
— Auguste Rodin — He was a French sculptor generally considered the founder of modern sculpture for his ability to convey complex emotional states through clay and bronze. His most famous works, such as The Thinker and The Kiss, broke away from traditional mythological themes to focus on the human condition. Throughout his career, he faced intense criticism from the artistic establishment but eventually gained international acclaim for his dedication and realism.