A Plain Guide to Local AI

Part One

How It Works — Hardware, Models

LLM and Llama — Category and Instance

LLM — Large Language Model

The general category. Any neural network trained at scale to predict, understand, and generate text. GPT-4, Claude, Gemini, and Llama are all LLMs, the same way that iPhone, Android, and BlackBerry are all smartphones. The term describes the architecture and scale, not any one product.

Llama

Meta's specific family of open-weight LLMs — Llama 2, Llama 3, Llama 3.1, and so on. It matters in practice because Meta released the model weights publicly, which means anyone can download and run Llama locally. Most of the local AI ecosystem — the tools, the quantization libraries, the inference servers — was built around Llama as the reference implementation. When people talk about running AI locally on their own hardware, Llama is usually the underlying model, or a fine-tune built on top of one.

Parameters — The B Numbers

Every AI language model has parameters — numerical weights learned during training that collectively encode everything the model knows: grammar, facts, reasoning patterns, ways of following instructions. More parameters means more capacity to hold nuanced representations and handle complex tasks. The B numbers — 7B, 13B, 34B, 70B — count those parameters in billions.

The practical difference is not subtle. A 7B model is competent for simple tasks but announces its limits quickly when pushed. A 34B model handles multi-step reasoning, follows complex instructions, and maintains coherent context across long conversations. A 70B model approaches the capability of earlier generations of frontier commercial models on structured analytical tasks. For agentic work — where the model must plan, use tools, and recover when something goes wrong — the difference between 34B and 13B is the difference between a capable assistant and a frustrating one.

More parameters are not better in every context. A 7B model is faster, cheaper to run, and sufficient for many tasks. The right model size depends on what the model needs to do, not on what sounds most impressive.

VRAM — The Hard Constraint

VRAM

Video RAM: the memory on the graphics card. This is the binding constraint in local AI. A model that does not fit in VRAM cannot run entirely on that card; some tools can spill part of it into system RAM, but generation slows sharply when they do. When someone says a card "runs 34B models," they mean the model fits within the card's VRAM budget. An RTX 3090 has 24 gigabytes of VRAM. A 34B model at 16-bit precision requires roughly 68 gigabytes (34 billion parameters at 2 bytes each), far more than any single consumer card holds. That gap is bridged by quantization.
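
The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate only: it counts the weights alone, uses decimal gigabytes, and ignores the extra memory needed for context and runtime overhead. The function names are illustrative and do not come from any particular tool.

```python
def weight_memory_gb(params_billions, bits_per_param=16):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billions * bytes_per_param


def fits_in_vram(params_billions, vram_gb, bits_per_param=16):
    """True if the weights alone fit; ignores context and runtime overhead."""
    return weight_memory_gb(params_billions, bits_per_param) <= vram_gb


print(weight_memory_gb(34))   # 68.0
print(fits_in_vram(34, 24))   # False: a 24 GB card cannot hold it
```

Because the estimate leaves out overhead, a model that only just "fits" by this check may still fail to load in practice.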

Quantization — Fitting More Into Less

Quantization

A compression technique that reduces the numerical precision of a model's parameters so the model occupies less memory. A 70B model stored at full 16-bit precision requires around 140 gigabytes of VRAM. Compressed to 4-bit precision (called Q4), the same model requires roughly 40 gigabytes: about 35 gigabytes for the weights at exactly 4 bits each, plus some format and runtime overhead. That is still large, but achievable across two high-end consumer cards or one professional GPU. The quality tradeoff is real, but for most practical tasks it is usually acceptable.
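
To make "reducing precision" concrete, here is a toy sketch that squeezes a handful of weights into 4-bit integers and then restores them. It assumes the simplest possible scheme, one shared scale for all values; real Q4 formats are more elaborate, and nothing here is taken from a specific quantization library.

```python
def quantize_4bit(weights):
    """Map floats to integers in [-8, 7] using one shared scale."""
    scale = max(abs(w) for w in weights) / 7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.32, 0.12, 0.03]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

print(codes)     # [7, -3, 1, 0]
print(restored)  # close to the originals, but not identical
```

Each code needs only 4 bits instead of 16, which is where the memory saving comes from. The small gaps between the original and restored values are the quality tradeoff.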

Tokens Per Second — What Speed Feels Like

Tokens per second

The rate at which a model generates text. A token is roughly three-quarters of a word. Human reading pace is around 4–5 tokens per second. Below that threshold, a model feels slow and halting. Above 15–20 tokens per second, it feels responsive. Modern consumer GPUs running mid-size models typically land between 20 and 80 tokens per second depending on model size, quantization level, and hardware. Smaller models are faster; larger models trade speed for quality.
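
A quick way to turn tokens per second into waiting time, using the three-quarters-of-a-word rule of thumb from above. The 300-word reply and the three speeds are illustrative choices, and the real words-per-token ratio varies by tokenizer and language.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb; varies by tokenizer and language


def seconds_to_generate(words, tokens_per_second):
    """Estimate how long a reply of `words` words takes to generate."""
    tokens = words / WORDS_PER_TOKEN
    return tokens / tokens_per_second


for speed in (5, 20, 80):
    seconds = round(seconds_to_generate(300, speed))
    print(speed, "tokens/s ->", seconds, "seconds for a 300-word reply")
# 5 tokens/s -> 80 seconds, 20 tokens/s -> 20 seconds, 80 tokens/s -> 5 seconds
```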

Open-Weight vs Frontier

Frontier models

The leading edge of AI capability — GPT-4o, Claude, Gemini Ultra. These models run on their developers' servers. You interact with them through an API: your data travels to their infrastructure, is processed there, and returns. The model itself is never in your possession. The developer can see your usage, update or withdraw the model, change the pricing, and set the terms.

Open-weight models

Models whose trained parameters have been publicly released as downloadable files. You run them on your own hardware. Once downloaded, no one else will see your prompts or outputs, revoke your access, or change the model's behavior. "Open-weight" is more accurate than "open source" — the weights are shared, but the training data and pipeline usually are not. The key point for sovereignty: the model is yours to run, modify, and fine-tune on your own infrastructure, on your own terms.

Part Two

Who Built the Model, and Does It Matter?

Two Different Problems

When people ask whether a locally-run AI model poses sovereignty risks, they usually mean one of two things. The first is surveillance: can the company that made the model see what you're doing with it? The second is influence: has the company shaped what the model believes, how it reasons, and whose way of knowing it treats as the default?

The surveillance question has a clean answer: once a model's weights are downloaded and running on hardware you control, the company that built it cannot see your prompts, your outputs, or your data. The weights are inert data files; they contain no phone-home mechanism, no telemetry, and no connection back to the developer. The operational sovereignty is real, provided the software used to run the model makes no outside connections of its own.

The influence question is harder and more interesting. It cannot be resolved by where the hardware sits.

What Gets Baked In During Training

A language model learns from the data it was trained on. For most major open-weight models above 50 billion parameters, that training corpus was predominantly English-language text from the Western internet: Wikipedia, books, news archives, academic papers, code repositories, social media. The model's baseline sense of what is normal, what is authoritative, what deserves explanation and what can be assumed all reflects the world that produced that data.

This shows up in specific ways. A model trained on Western internet text may handle questions about Indigenous governance, land relationships, or oral knowledge traditions with less precision than it handles questions about corporate finance or European history — not because it was programmed to, but because its training data was thinner there. It may treat certain ways of knowing as needing justification while treating others as self-evident. It may apply Western legal and ethical frameworks as defaults in contexts where they do not apply.

This is not Meta watching. It is Meta, like every other large training effort, having made choices about whose knowledge counts as knowledge. Those choices were encoded into the model before anyone downloaded it.

The Players Above 50B

Model / Family	Made by	Origin	Sovereignty note
Llama 3.x	Meta	United States	Advertising-economy company. Training reflects predominantly Western, English-language web. Liberal commercial license with scale thresholds.
Mixtral / Mistral	Mistral AI	France	European company, no advertising-economy model. Genuinely open licensing on most releases. Mixture-of-experts architecture delivers 70B-class quality at lower compute cost per token, although the full set of experts must still fit in memory. Strongest non-Meta alternative at this tier.
Qwen 2.5 72B	Alibaba	China	Technically excellent — among the best open-weight models at this scale. Alibaba is subject to Chinese data governance law. The sovereignty concern shifts but does not disappear.
DeepSeek V3 / R1	DeepSeek	China	Exceptional reasoning capability, widely regarded as a breakthrough release. Chinese company with same jurisdictional concerns as Qwen. Open weights; same operational sovereignty as other options once running locally.
Falcon 40B / 180B	TII	UAE	Technology Innovation Institute is a UAE government research entity. No advertising model; different geopolitical profile. Less actively maintained than Llama or Mistral ecosystems.
Command R+	Cohere	Canada	Canadian company, strong on retrieval-augmented generation and agentic tasks. Weights available but ecosystem is smaller. Worth watching for document-heavy and knowledge-base applications.

The sovereignty notes above reflect the degree to which the originating organization's commercial or geopolitical interests could influence model behavior or create downstream risk. They do not concern operational surveillance, which local hosting resolves in all cases.

The Fine-Tuning Argument

Every open-weight model has two layers of identity: what the base training gave it, and what fine-tuning can add or change. Fine-tuning means continuing the model's training on a smaller, targeted dataset — in this case, a dataset that a community controls, in a community language, encoding community knowledge and values.

For the tasks that matter in Neon Forest Networks deployments — responding in a community language, reasoning about community governance and land relationships, handling cultural context accurately — the fine-tuned layer will dominate over the base model's generic tendencies. The base model provides general language capability: grammar, coherence, the ability to follow instructions. The fine-tune shapes the specific behavior that community members will actually experience.

This does not mean the base model's biases disappear. They remain in the substrate, and they will surface in contexts that the fine-tuning did not anticipate. A community that has fine-tuned a model on three hundred hours of elder recordings will get a model that handles those specific domains with care and precision. The same model, asked a question outside that domain, will fall back toward whatever its base training encoded.

The honest framing: fine-tuning narrows the problem without eliminating it. For the most important use cases, it narrows it enough to matter. And Neon Forest Networks can ensure that the fine-tuning process runs entirely on community-controlled hardware, which is more than any community gets from a cloud-based AI provider.

The Practical Conclusion

For Neon Forest Networks, the base model choice is a known constraint to manage, not a problem to solve once and set aside. Mistral is the most defensible default: a European company with liberal licensing, no advertising-economy alignment, and technically strong models. Where Mistral's capability gaps matter for a specific deployment, Llama remains the more practical choice given its larger ecosystem and tooling support. Either way, the fine-tuning process, conducted on community hardware, on community data, under community governance, is where Neon Forest Networks can make a real difference to AI sovereignty. The base model is the starting point. The community's knowledge is what shapes the model that community members actually talk to.

First articulated: 2026-03-15. Status: DRAFT — reference document, non-technical audience.