
Following through on the commitment made two weeks ago — accountability for AI-assisted work isn't just a statement, it's a practice. Here's how this newsletter is actually architected, in plain terms, and what's happening behind the scenes this week.
The newsletter has four distinct AI-driven roles, and I want each one disclosed explicitly so you always know what produced what:
For this edition:
I reserve all editorial judgment — topic selection, content direction, Building Intelligence drafting, Under the Hood writing, and the final review on every section. The AI models are doing the work I ask them to do in the roles I've defined. The editor is me.
This week's Under the Hood shows you my three locally-installed models as a table — 8.0B, 8.2B, 8.9B parameters; different context lengths; different capabilities. The table is useful, but it raises a question: what do those numbers actually mean, and why would anyone care? That's what this section is for.
When you see a model described as "7B" or "8B" or "70B," that number refers to parameters — individual numerical values stored inside the model. Think of each parameter as a small dial. A model labeled 8B has roughly 8 billion dials. During training, the process turns those dials — very slightly, very carefully — until the model's outputs better match the patterns in its training data. Once training is done, those dials are locked in place. The exact configuration of all 8 billion dials is what the model knows.
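The dial-turning picture can be made concrete with a toy model that has exactly one parameter. This is a hypothetical, minimal sketch (real training adjusts billions of parameters at once via the same basic mechanic): gradient descent nudges the dial until the model's outputs match the training data, and after training the dial stays put.

```python
# Toy illustration of "turning a dial": a one-parameter model y = w * x,
# trained on data generated from the rule y = 3 * x.
# Hypothetical minimal example -- not how any specific model was trained.

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # (input, target) pairs

w = 0.0    # the single "dial", starting at an arbitrary value
lr = 0.01  # learning rate: how gently the dial is turned each step

for _ in range(500):                 # training loop
    for x, y in data:
        pred = w * x
        grad = 2 * (pred - y) * x    # gradient of squared error w.r.t. w
        w -= lr * grad               # nudge the dial slightly

print(round(w, 3))  # the dial settles near 3.0 and is then frozen
```

An 8B model is this same loop, repeated over 8 billion dials and trillions of words of training text.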
More parameters generally means more capacity — the model can represent more nuanced relationships, handle more varied inputs, and produce more sophisticated outputs. The cost is proportional. More parameters means more memory to load the model, more compute to run it, and dramatically more compute to train or fine-tune it. Rough size classes for context: models up to about 10B parameters can run on a capable consumer machine; the 13B-34B class generally wants a dedicated GPU with substantial VRAM; and 70B and beyond is server territory, typically requiring multiple high-end GPUs.
A question worth sitting with: if larger models are more capable, why not just fine-tune a 70B model for this Milwaukee project? Wouldn't it learn the same specific knowledge on top of its broader capabilities and just be better at everything?
The practical answer is that fine-tuning a 70B model requires hardware most practitioners don't have access to, and the cost makes experimentation impractical. The more interesting answer is that whether a larger general model beats a smaller focused one on a specific domain comes down to three things: data quality, training approach, and question specificity — not raw parameter count. That's an opportunity, not a limitation. A well-constructed dataset, a thoughtful fine-tuning approach, and clearly scoped evaluation questions can produce results that compete with — and sometimes outperform — much larger general-purpose models on narrow tasks. This is where real practitioners still have room to stand out in an industry often dominated by three major providers.
There's another distinction that matters as much as size, and it's not visible in the parameter count. The difference is between a base model and an instruct model.
A base model completes text. Feed it "The capital of Wisconsin is" and it continues with "Madison." It was trained on massive quantities of text from the internet, and its fundamental skill is predicting what word comes next. If you ask a base model a question in natural language, it may answer — but it may also just continue your question as if you were in the middle of writing an essay. It has no concept of "turn-taking" in a conversation.
An instruct model has been further trained — usually after the initial base training — on examples of human instructions paired with good responses. This second training phase teaches the model to follow directions, answer questions, and hold a conversation in a structured back-and-forth way. Every major AI assistant you've interacted with — ChatGPT, Claude, Gemini — is an instruct model under the hood.
For a fine-tuning project like the one this series is building toward, we want an instruct model as our starting point. We're teaching it new facts about Milwaukee, not teaching it how to have a conversation from scratch. Starting from an instruct model means the conversational scaffolding already exists, and fine-tuning just needs to add the specific knowledge we care about.
How do you tell whether a model is base or instruct? Sometimes the name makes it obvious — "Llama 3 8B Instruct" versus plain "Llama 3 8B." Sometimes you have to look at the metadata. Running `ollama show` on my three models reveals stop tokens like `<|start_header_id|>` and `<|im_start|>` — special markers used to indicate conversation turns. Base models have no need for those tokens. Their presence is the fingerprint of an instruct model. All three of my local models are instruct-tuned. That's what we want.
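That fingerprint check is mechanical enough to script. Here's a hypothetical helper that mirrors it — the marker list is illustrative (drawn from common chat templates), not exhaustive, and the function name is my own invention:

```python
# Hypothetical helper mirroring the "fingerprint" check described above:
# chat-template markers among a model's stop tokens suggest instruct-tuning.
# The marker list is illustrative, not exhaustive.

CHAT_MARKERS = ("<|start_header_id|>", "<|im_start|>", "[INST]", "<|user|>")

def looks_instruct_tuned(stop_tokens):
    """Return True if any stop token looks like a conversation-turn marker."""
    return any(marker in token for token in stop_tokens for marker in CHAT_MARKERS)

# Stop tokens as reported by `ollama show` for an instruct model (example):
print(looks_instruct_tuned(["<|start_header_id|>", "<|eot_id|>"]))  # True
# A bare end-of-text token alone doesn't imply instruct-tuning:
print(looks_instruct_tuned(["</s>"]))  # False
```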
The three models on my laptop are all within about a billion parameters of each other — but their context lengths range from 8,000 tokens to 262,000, and their licenses, capability profiles, and underlying architectures all differ. The parameter count is one variable. It's not the only one, and depending on the project, it may not even be the most important one.
Next week: with the concepts in hand, we look at each of the three candidates honestly, weigh them against each other, and commit to one.
Last week I said the next step was picking a model. What happened instead is that I took honest stock of what I'm already working with, and found the answer to "what model?" was already on my hard drive — along with a few surprises about what I thought I knew.
Before getting into this, a few terms worth defining clearly because they come up repeatedly. Inference is the act of running a trained model to get an answer — you give it a prompt, it produces a response. That's different from training, where the model is being taught by example and its internal values are being adjusted. Everything in this edition is about inference. Training comes in a later week, and it's where the real hardware constraints land.

GPU stands for Graphics Processing Unit — a specialized chip originally built for rendering graphics but now the workhorse for AI because the math involved is highly parallel. VRAM is Video RAM — memory that lives on the GPU itself, much faster than regular system RAM, and the main constraint on whether a given model will run at all.
Here are the specs: Samsung laptop running Windows 11, 11th-generation Intel i7-1165G7 processor, 16 GB of RAM, 1 TB of internal storage with about 557 GB free, and Intel Iris Xe integrated graphics with 128 MB of dedicated video memory. It sits squarely in the middle of what most students and early-career IT professionals are actually using. Not a gaming rig. Not a workstation. Just a capable everyday laptop built on 2021-era hardware.
The line that matters most for this project is the graphics. Intel Iris Xe is integrated graphics — the GPU shares a chip with the main processor and pulls from system RAM rather than having dedicated video memory of its own. The 128 MB listed is the minimum reservation. A dedicated GPU is a separate physical card with its own VRAM — typical AI-capable cards have 8, 16, 24, or 40+ GB of VRAM specifically built for the massively parallel math that language models require. Integrated graphics are functional for everyday computing and light 3D work. They are not what AI inference wants to be run on.
Having 557 GB free sounds like plenty until you realize a single 8-billion-parameter language model takes 5–6 GB, training data can run into the tens of gigabytes, and every tool in the pipeline (Python environments, model caches, working files, git history) adds up. The AI stack eats disk fast, and I didn't want to fill my C drive with it.
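The 5–6 GB figure falls out of simple arithmetic: bytes per weight times parameter count. A full-precision (fp16) 8B model is roughly 16 GB, and the files commonly distributed for local use are quantized down to around 4–5 bits per weight. A back-of-envelope sketch (real files add some overhead for metadata and mixed-precision layers):

```python
# Back-of-envelope model-size math. A sketch only -- real model files
# include embeddings, metadata, and mixed-precision layers on top of this.

def model_size_gb(params_billion, bits_per_weight):
    """Approximate raw weight storage for a parameter count and precision."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal gigabytes

for bits, label in [(16, "fp16"), (8, "8-bit"), (5, "~5-bit quant"), (4, "4-bit quant")]:
    print(f"8B model at {label}: ~{model_size_gb(8, bits):.1f} GB")
```

The ~5-bit row is what lands you in the 5–6 GB range the paragraph above describes.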
My solution was an external SSD. Specifically, a Samsung T7 Shield 2TB portable SSD, which I bought from Amazon for about $250. USB 3.2 Gen 2, roughly 1 GB/sec transfer speeds, IP65-rated against water and dust — physically tough enough to travel in a bag without worry. On that drive I created a folder called AI_Env, and everything AI-related lives there: Python, VS Code, Git, Ollama, the model weights, and the project work-in-progress.
Important to be clear about what this fixes and what it doesn't. The external drive solves the storage problem — I now have effectively 2.5 TB of working space instead of 557 GB. It does not solve the GPU problem. No amount of external storage gives me VRAM. For inference that's workable. For training, I'll still need the lab. The bonus is portability — this setup travels with me to any machine, which matters for a student who works from more than one location.
Ollama is a local runtime for language models. It handles downloading models, running an inference server in the background, and exposing a command-line interface so you can talk to the model from the terminal. It's not the only option — LM Studio is the main alternative with a graphical interface instead of a terminal — but Ollama is what I use, and it's what's installed in my AI_Env folder.
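Beyond the terminal, the background server Ollama runs also exposes a local HTTP API (on port 11434 by default), which is how other programs talk to it. Here's a minimal sketch using only Python's standard library — it assumes a running server and an already-pulled model, and the exact response fields may vary by Ollama version, so treat it as a starting point rather than a reference:

```python
# Minimal sketch of querying a local Ollama server over its HTTP API.
# Assumes the server is running on the default port (11434) and that the
# named model has already been pulled. Check Ollama's API docs for your
# installed version before relying on the field names used here.

import json
import urllib.request

def ask_ollama(model, prompt, host="http://localhost:11434"):
    """Send one prompt to a local Ollama server and return the response text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running server; not executed here):
#   print(ask_ollama("llama3:latest", "Hello, what year is it?"))
```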
Ollama pulls models from its own curated registry at ollama.com/library. That registry is a thin layer on top of the real source: Hugging Face, which is where essentially every major open-weight language model is originally published. When I want a wider selection — including variants, fine-tunes, and newer models that haven't made it into Ollama's curated list yet — I go "shopping" directly on Hugging Face. There are over 2 million models on the platform, and the number is growing fast. The vast majority aren't conversational language models — it hosts image generators, speech models, embedding models, specialist fine-tunes for every conceivable niche — but the subset relevant to any given project is still large enough to require real evaluation work. How to navigate that is its own topic, and one worth covering in its own edition.
Here's what I already had pulled:
| Name | Params | Context | License | Notable capabilities |
|---|---|---|---|---|
| llama3:latest | 8.0B | 8,192 | Meta Llama 3 Community | Text only |
| qwen3:8b | 8.2B | 40,960 | Apache 2.0 | Tools, extended reasoning |
| ministral-3:8b | 8.9B | 262,144 | Apache 2.0 | Tools, vision |
Three models, all roughly 8-billion-parameter class, from three different organizations — Meta, Alibaba's Qwen team, and Mistral AI. Each has a different license and a different capability profile. The context length range — 8K to 262K tokens — is the biggest practical difference. Context length is how much text a model can consider in a single prompt-response cycle, and that 32x spread between the smallest and largest here is significant for any use case involving long documents.
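To put that spread in everyday terms, a common rule of thumb (an approximation, not an exact rate) is roughly 0.75 English words per token:

```python
# Rough sense of the context-length spread. The 0.75 words-per-token figure
# is a common rule of thumb for English text, not an exact conversion rate.

WORDS_PER_TOKEN = 0.75

models = [("llama3:latest", 8_192), ("qwen3:8b", 40_960), ("ministral-3:8b", 262_144)]

for name, ctx in models:
    print(f"{name}: {ctx:,} tokens ~= {int(ctx * WORDS_PER_TOKEN):,} words")

print(262_144 // 8_192)  # the spread between largest and smallest: 32
```

By that estimate, the smallest context holds a long article; the largest holds several books' worth of text in a single prompt.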
One surprise is worth naming. The model labeled ministral-3:8b sounds like it should be Ministral 3B, a small 3-billion-parameter model Mistral AI released last year. It isn't. Running `ollama show` on it reveals an architecture of `mistral3` and 8.9 billion parameters — this is actually a variant of Mistral's newer Mistral 3 series (the multimodal family that includes Mistral Small 3). The name in the registry is one person's shorthand, not an official naming convention. That's a real-world lesson: a model's name is not its ground truth. The metadata in `ollama show` is. If I'd committed to fine-tuning "Ministral 3B" based on the name alone, I'd have been training against a model that isn't what the name suggests.
When I started the Ollama server, the startup logs told me exactly what to expect:
```
inference compute: id=cpu library=cpu
total="15.7 GiB" available="3.5 GiB"
entering low vram mode: total vram="0 B" threshold="20.0 GiB"
```

Translated: Ollama scanned the system for usable GPU memory and found zero bytes. Intel Iris Xe integrated graphics don't count as inference-capable hardware — they share system RAM, use drivers Ollama doesn't directly support, and aren't optimized for AI math. So inference runs entirely on the CPU, using system RAM instead of VRAM. Ollama also activated "low VRAM mode" — a degraded operating mode for hardware far below the 20 GB of VRAM it considers comfortable.
To see what this looked like in practice, I ran the same simple prompt through each model — "Hello, what year is it?" — and captured both the responses and the time it took each one to answer. The responses themselves are in this week's Fireside Chats section below, because the differences between them are worth reading on their own. The speeds tell the hardware story: Llama 3 answered in 18 seconds, Qwen 3 took 32 seconds, Ministral-3 took 1 minute 13 seconds. Usable. Not fast. A cloud model running on a purpose-built inference server would handle all three prompts in under a second combined.
Three models already on disk, all running (slowly) on CPU-only inference. No GPU acceleration and no way to train here. That's a cleaner picture than I had a week ago, and it means the next step is no longer "what do I need to buy" — it's "which of these three do I actually pick, and why?"
That's the next edition's topic. Before I can commit to one, there's conceptual work to do: understanding what 8B parameters actually buys you, what the difference between a base and an instruct model means, and what criteria separate a model worth building on from one that isn't. That's what Building Intelligence is covering — this week and next.
This week's Fireside Chat is a departure from the usual format. Instead of asking Gemini, Claude, and GPT the same question, I ran a five-word prompt — "Hello, what year is it?" — through the three local, open-source models running on my laptop. Same hardware. Same prompt. Same moment. Three completely different answers. Before the responses, brief background on who made each of these models, because the companies behind them shape what the models are.
Llama 3 was released by Meta, the American technology company headquartered in Menlo Park, California — the same company that owns Facebook, Instagram, and WhatsApp. Meta's Llama family is the most widely downloaded open-weight language model family to date. It is released under a custom Meta Llama community license that permits commercial use with some restrictions at very large scale. Meta has positioned open models as a core part of their AI strategy rather than keeping their best work locked behind an API.
Qwen 3 comes from Alibaba Cloud's Qwen team, part of the Chinese technology conglomerate best known for Alibaba.com, Taobao, Tmall, and their massive cloud business. Headquartered in Hangzhou, China. Qwen has emerged as one of the most competitive Chinese open-weight model families over the past year, released under the Apache 2.0 license — one of the most permissive open-source licenses in existence, with no commercial restrictions.
Mistral 3 is from Mistral AI, a French startup founded in 2023 by alumni of Google DeepMind and Meta's FAIR lab. Headquartered in Paris. Mistral has positioned itself as Europe's flagship open-weight model provider and has become a credible counterweight to the American and Chinese labs in the open-source conversation. Mistral 3 is their newer multimodal family, also released under Apache 2.0.
Why run this specific prompt? Because a local model has no access to the real date — no web search, no system clock, no tool use unless you explicitly build a system around it. Everything a local model "knows" comes from its training data. Ask it what year it is, and it has to do one of three things: refuse, admit its limits, or just answer based on what it was trained on. The three responses below show each model choosing a different path — and each choice reflects a different design philosophy from a different company in a different country.
As you read: which model refused to commit? Which one reasoned out loud? Which one just answered — and was that answer actually correct? The differences aren't quirks. They're design choices.
Hello, what year is it?
“Curiosity is the fuel that drives our pursuit of progress and understanding.”
— Ada Lovelace

Ada Lovelace was a pioneering mathematician and writer, best known for her work on Charles Babbage's early mechanical general-purpose computer, the Analytical Engine. She is often regarded as the first computer programmer due to her detailed notes on the engine, which include an algorithm considered to be the first intended for implementation on a computer. Her visionary thinking about the potential of computers went far beyond mere calculation, imagining how they could revolutionize fields like art and science.