Why bother with local models?
Privacy, cost, latency and, let’s be honest, the sheer giddy thrill of hearing your laptop fans spin up like a small jet engine.
| Factor | Cloud LLM API | Local LLM |
|---|---|---|
| Unit cost | £0.002–£0.12 / 1k tokens (varies by model) | Electricity and a half-decent PSU |
| Latency | 100-400 ms | 5-20 ms (RAM permitting) |
| Data residency | Largely dependent on setup | Your own aluminium chassis |
| Fine-tuning | Often pay-walled | Full control, LoRA all the things |
| Scaling | Virtually infinite | Limited only by how loud your GPU fans get |
Prerequisites
- .NET 8 SDK (or newer)
- Git + LFS – most GGUF weights sit on Hugging Face
- CPU with AVX2 (good) or GPU with ≥ 6 GB VRAM (better)
- ~8 GB spare RAM for a 3-4 B model (double that for Llama 3 8B)
Tip: If your CPU dates from the Windows 7 era, treat this guide as a polite nudge to upgrade.
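Not sure whether your machine clears that bar? Here is a quick sanity-check sketch in plain .NET; the thresholds are just the rule-of-thumb numbers from the list above, and note that TotalAvailableMemoryBytes reports the memory limit the runtime sees (usually total physical RAM, or a container limit), not currently free RAM:

```csharp
using System.Runtime.Intrinsics.X86;

// Rough pre-flight check against the prerequisites above.
Console.WriteLine($"AVX2 supported : {Avx2.IsSupported}");

// TotalAvailableMemoryBytes is the limit the GC sees - usually total
// physical RAM (or the container limit) - so treat it as an upper bound.
var totalGb = GC.GetGCMemoryInfo().TotalAvailableMemoryBytes / (1024.0 * 1024 * 1024);
Console.WriteLine($"Visible RAM    : {totalGb:F1} GB");

if (!Avx2.IsSupported)
    Console.WriteLine("No AVX2 - plan on a GPU backend or a very patient afternoon.");
if (totalGb < 16)
    Console.WriteLine("Under 16 GB total - a 3-4 B parameter model is the safer bet.");
```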
Step 1 – choose your model
| Model | Params | Disk (Q4_K_M) | What it’s good at |
|---|---|---|---|
| Phi-2 | 2.7 B | ~1.8 GB | Tiny, creative writing, micro-tasks |
| Mistral-7B-Instruct | 7 B | ~4.2 GB | Chat, summarisation |
| Llama 3 8B-Instruct | 8 B | ~4.5 GB | General purpose, fewer hallucinations |
| Code Llama 7B | 7 B | ~4.3 GB | Code completion / explanation |
Grab one with:
git lfs install
git clone https://huggingface.co/TheBloke/phi-2-GGUF
Copy a *.gguf file (the q4_k_m variant keeps RAM usage sane) into a Models folder.
Step 2 – bootstrap a .NET console app
mkdir LocalLLM && cd LocalLLM
dotnet new console --framework net8.0
dotnet add package LLamaSharp --prerelease
dotnet add package LLamaSharp.Backend.Cpu # or a CUDA backend (e.g. LLamaSharp.Backend.Cuda12) for NVIDIA GPUs
Drop your chosen *.gguf into ./Models.
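One gotcha before wiring things up: the code in Step 3 resolves the model path against AppContext.BaseDirectory, which is the build output folder (bin/Debug/net8.0/…), not the project root. Either mark the file to copy to the output directory in your .csproj or IDE, or sanity-check it with a throwaway snippet like this:

```csharp
// Sanity check: list any GGUF files the app will actually be able to see.
// AppContext.BaseDirectory is the build output folder, not the project root,
// so the Models folder needs to end up next to the compiled binary.
var modelsDir = Path.Combine(AppContext.BaseDirectory, "Models");

if (!Directory.Exists(modelsDir))
{
    Console.WriteLine($"No Models folder at {modelsDir} - copy it next to the binary.");
    return;
}

foreach (var file in Directory.EnumerateFiles(modelsDir, "*.gguf"))
{
    var sizeGb = new FileInfo(file).Length / (1024.0 * 1024 * 1024);
    Console.WriteLine($"{Path.GetFileName(file)}  ({sizeGb:F1} GB)");
}
```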
Step 3 – wire in the runtime
using LLama;
using LLama.Common;

// Path to the quantised model you downloaded in Step 1.
var modelPath = Path.Combine(AppContext.BaseDirectory, "Models", "phi-2-q4_k_m.gguf");

var parameters = new ModelParams(modelPath)
{
    ContextSize = 1024,
    Seed = 1337,
    GpuLayerCount = 0 // bump this if you added the CUDA backend
};

using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Stop streaming once the model starts writing the next "You :" turn itself.
var inferenceParams = new InferenceParams
{
    AntiPrompts = new List<string> { "You :" },
    MaxTokens = 256
};

Console.Write("You : ");
while (true)
{
    var user = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(user)) break;

    Console.Write("LLM : ");
    await foreach (var chunk in executor.InferAsync(user, inferenceParams))
        Console.Write(chunk);

    Console.Write("\n\nYou : ");
}
First-run tax: the first launch is noticeably slower while the native backend initialises and the multi-gigabyte weights are read from disk; subsequent launches are much quicker once the OS has the file cached.
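Curious how fast it actually is before you start tweaking? You can get a ballpark figure by timing the stream. The sketch below reuses the executor and inferenceParams from the code above, and counts streamed chunks as a rough stand-in for tokens, so treat the number as an estimate rather than a benchmark:

```csharp
// Ballpark throughput: time the streamed chunks for a fixed prompt.
var sw = System.Diagnostics.Stopwatch.StartNew();
var chunks = 0;

await foreach (var chunk in executor.InferAsync("Explain LINQ in one paragraph.", inferenceParams))
{
    Console.Write(chunk);
    chunks++;
}

sw.Stop();
Console.WriteLine($"\n≈ {chunks / sw.Elapsed.TotalSeconds:F1} chunks/sec");
```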
Step 4 – run, chat, repeat
dotnet run
You : Why do programmers prefer dark mode?
LLM : Because light attracts bugs!
Yes, I know it’s one of the oldest jokes out there; I don’t apologise.
Performance tips
| Quick win | Impact |
|---|---|
| Quantise down (Q4_K_M) | 50 % less RAM, ≈ 95 % quality |
| Off-load N layers to GPU | 2-4× tokens/sec |
| Increase batch size | Better throughput on long prompts |
| Keep n_ctx ≤ 1k | Keeps VRAM sane on 6-8 GB GPUs |
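In LLamaSharp terms, most of those knobs live on ModelParams. Here is a rough sketch of what a 6-8 GB GPU set-up might look like; the layer count and batch size are illustrative guesses to tune rather than recommendations, and GPU off-loading assumes you installed a CUDA backend package in Step 2:

```csharp
using LLama.Common;

// Illustrative values only - tune per model and per GPU.
var tuned = new ModelParams("Models/phi-2-q4_k_m.gguf")
{
    ContextSize = 1024,  // small n_ctx keeps the KV cache from eating VRAM
    GpuLayerCount = 20,  // off-load some layers to the GPU; 0 = pure CPU
    BatchSize = 512      // larger batches speed up long prompt ingestion
};
```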
Troubleshooting & FAQ
Unsupported CPU instruction set
Your processor is pre-AVX2. Compile llama.cpp with -march=native, or, you know, embrace that new-hardware smell.
Memory usage explodes halfway through a prompt
Lower ContextSize, or grab a q5_0 / q4_0 model file. RAM is a harsh mistress.
Can I still use ML.NET + ONNX?
Sure. BERT-style encoders run great in ONNX. For chatty generative use-cases, GGUF + LLamaSharp is simply more ergonomic in 2025.
Next steps
- Wrap in ASP.NET – expose a /v1/chat/completions endpoint (see the minimal sketch after this list).
- Add function calling – LLamaSharp 0.12 ships native function-tool mapping.
- Fine-tune – LoRA adapters + mlf-core make weekend model-training a thing.
- Desktop ship – dotnet publish -r win-x64 -p:PublishSingleFile=true and hand QA an EXE.
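To make the first bullet concrete, here is a minimal, non-production sketch using ASP.NET Core minimal APIs. It assumes a project created with dotnet new web plus the same LLamaSharp packages from Step 2; ChatRequest and the response shape are simplified stand-ins for the real OpenAI schema:

```csharp
using System.Text;
using LLama;
using LLama.Common;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Load the weights once at startup; create a fresh context per request.
var parameters = new ModelParams("Models/phi-2-q4_k_m.gguf") { ContextSize = 1024 };
using var weights = LLamaWeights.LoadFromFile(parameters);

app.MapPost("/v1/chat/completions", async (ChatRequest request) =>
{
    using var context = weights.CreateContext(parameters);
    var executor = new InteractiveExecutor(context);

    var reply = new StringBuilder();
    await foreach (var chunk in executor.InferAsync(request.Prompt))
        reply.Append(chunk);

    // Simplified response shape - not the full OpenAI chat schema.
    return Results.Ok(new
    {
        choices = new[] { new { message = new { role = "assistant", content = reply.ToString() } } }
    });
});

app.Run();

// Hypothetical request DTO for this sketch.
record ChatRequest(string Prompt);
```

POST a JSON body like { "prompt": "Why is the build red?" } to the endpoint and you get a single, non-streamed completion back; streaming and the full OpenAI-compatible fields are left as the fun part.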
Wrap-up
Local LLMs have graduated from science-project status to genuine productivity boosters. With C#, the .NET SDK, and a few NuGet packages you can spin up a private, zero-API-cost chatbot before your CI pipeline finishes compiling main. (That’s a developer joke, I promise.)
Ping me on whatever social network is still cool by the time you read this and show me what you build!