
Building and Running Local Language Models in C# – Quickstart Edition

Published:  at  11:20 AM
4 min read

[Figure: Local Language Model Workflow]


Why bother with local models?

Privacy, cost, latency and, let’s be honest, the sheer giddy thrill of hearing your laptop fans spin up like a small jet engine.

| Factor | Cloud LLM API | Local LLM |
| --- | --- | --- |
| Unit cost | £0.002–£0.12 / 1k tokens (varies by model) | Electricity and a half-decent PSU |
| Latency | 100-400 ms | 5-20 ms (RAM permitting) |
| Data residency | Largely dependent on setup | Your own aluminium chassis |
| Fine-tuning | Often pay-walled | Full control, LoRA all the things |
| Scaling | Virtually infinite | Limited only by how loud your GPU fans get |

Prerequisites

  1. .NET 8 SDK (or newer)
  2. Git + LFS – most GGUF weights sit on Hugging Face
  3. CPU with AVX2 (good) or GPU with ≥ 6 GB VRAM (better)
  4. ~8 GB spare RAM for a 3-4 B model (double that for Llama 3 8B)

Tip: If your CPU dates from the Windows 7 era, treat this guide as a polite nudge to upgrade.
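
Not sure where your machine stands on points 3 and 4? The runtime can tell you directly; drop a throwaway check like this into any console project (plain BCL, nothing model-specific):

using System.Runtime.Intrinsics.X86;

// Quick hardware sanity check for the prerequisites above.
Console.WriteLine($"AVX2 supported : {Avx2.IsSupported}");

// Total memory visible to the runtime (physical RAM or container limit), not free RAM.
var totalGb = GC.GetGCMemoryInfo().TotalAvailableMemoryBytes / 1_073_741_824.0;
Console.WriteLine($"Memory visible : {totalGb:F1} GiB");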


Step 1 – choose your model

| Model | Params | Disk (Q4_K_M) | What it’s good at |
| --- | --- | --- | --- |
| Phi-2 | 2.7 B | ~1.8 GB | Tiny; creative writing, micro-tasks |
| Mistral-7B-Instruct | 7 B | ~4.2 GB | Chat, summarisation |
| Llama 3 8B-Instruct | 8 B | ~4.5 GB | General purpose, fewer hallucinations |
| Code Llama 7B | 7 B | ~4.3 GB | Code completion / explanation |

Grab one with:

git lfs install
git clone https://huggingface.co/TheBloke/phi-2-GGUF

Copy a *.gguf file (the q4_k_m variant keeps RAM usage sane) into a Models folder.


Step 2 – bootstrap a .NET console app

mkdir LocalLLM && cd LocalLLM
dotnet new console --framework net8.0
dotnet add package LLamaSharp --prerelease
dotnet add package LLamaSharp.Backend.Cpu      # or LLamaSharp.Backend.Cuda12 for NVIDIA GPUs

Drop your chosen *.gguf into ./Models.
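
One wiring detail before Step 3: the code there builds the model path from AppContext.BaseDirectory (the build output folder), so the Models folder has to end up next to the binary. A project-file entry along these lines does that at build time; it’s my own addition rather than something the template generates, and if you’d rather not copy multi-gigabyte weights around, point modelPath at an absolute location instead.

<ItemGroup>
  <!-- Copy GGUF weights next to the compiled binary so the
       AppContext.BaseDirectory-based path in Step 3 resolves. -->
  <Content Include="Models\**\*.gguf" CopyToOutputDirectory="PreserveNewest" />
</ItemGroup>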


Step 3 – wire in the runtime

using LLama;
using LLama.Common;

var modelPath = Path.Combine(AppContext.BaseDirectory, "Models", "phi-2.Q4_K_M.gguf");

var parameters = new ModelParams(modelPath)
{
    ContextSize   = 1024,
    Seed          = 1337,
    GpuLayerCount = 0            // bump this if you added the CUDA backend
};

using var model   = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor      = new InteractiveExecutor(context);

Console.Write("You   : ");
while (true)
{
    var user = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(user)) break;

    Console.Write("LLM   : ");
    await foreach (var chunk in executor.InferAsync(user))   // InteractiveExecutor streams tokens from InferAsync
        Console.Write(chunk);

    Console.Write("\n\nYou   : ");   // Write (not WriteLine) keeps the next input on the same line
}

First-run tax: the first launch takes noticeably longer while LLamaSharp initialises its native backend and reads the weights from disk; subsequent launches are much quicker.
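
The loop above leans on LLamaSharp’s generation defaults. To cap reply length and make the model stop cleanly at the end of each turn, you can hand InferAsync an InferenceParams instance; the values below are illustrative only, and newer LLamaSharp releases move sampling settings such as temperature into a separate sampling pipeline, so check the version you’re on.

// Replaces the plain InferAsync(user) call inside the Step 3 loop.
var inferenceParams = new InferenceParams
{
    MaxTokens   = 256,                                 // cap each reply
    AntiPrompts = new List<string> { "You   : " }      // stop when the model starts writing the user's turn
};

await foreach (var chunk in executor.InferAsync(user, inferenceParams))
    Console.Write(chunk);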


Step 4 – run, chat, repeat

dotnet run
You   : Why do programmers prefer dark mode?
LLM   : Because light attracts bugs!

Yes, I know it’s one of the oldest jokes out there; I don’t apologise.


Performance tips

| Quick win | Impact |
| --- | --- |
| Quantise down (Q4_K_M) | 50 % less RAM, ≈ 95 % quality |
| Off-load N layers to GPU | 2-4× tokens/sec |
| Increase batch size | Better throughput on long prompts |
| n_ctx ≤ 1 k | Keeps VRAM sane on 6-8 GB GPUs |
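
Most of these tips map straight onto the ModelParams object from Step 3. Here’s a sketch of what a GPU-assisted configuration might look like; the layer count and batch size are illustrative starting points rather than tuned values, so expect to experiment per model and card.

// Assumes the CUDA backend package is installed in place of LLamaSharp.Backend.Cpu,
// and modelPath is the same path used in Step 3.
var gpuParams = new ModelParams(modelPath)
{
    ContextSize   = 1024,   // n_ctx: keep modest on 6-8 GB cards
    GpuLayerCount = 20,     // off-load this many transformer layers to the GPU
    BatchSize     = 512     // prompt-processing batch size; larger helps long prompts
};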

Troubleshooting & FAQ

Unsupported CPU instruction set

Your processor is pre-AVX2. Compile llama.cpp with -march=native, or, you know, embrace that new-hardware smell.



Memory usage explodes halfway through a prompt

Lower ContextSize, or grab a q5_0 / q4_0 model file. RAM is a harsh mistress.



Can I still use ML.NET + ONNX?

Sure—BERT-style encoders run great in ONNX. For chatty generative use-cases, GGUF + LlamaSharp is simply more ergonomic in 2025.


Next steps

  1. Wrap in ASP.NET – expose a /v1/chat/completions endpoint (see the sketch after this list).
  2. Add function calling – LLamaSharp 0.12 ships native function-tool mapping.
  3. Fine-tune – LoRA adapters + mlf-core make weekend model-training a thing.
  4. Desktop ship – dotnet publish -r win-x64 -p:PublishSingleFile=true and hand QA an EXE.
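
For the first item, here’s a rough minimal-API sketch to give the shape of it; it assumes the Web SDK (dotnet new web), reuses the Step 3 model setup, and the ChatRequest / ChatResponse records are simplified stand-ins I’ve invented rather than the full OpenAI schema.

using System.Text;
using LLama;
using LLama.Common;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Load the model once at startup; the executor is not thread-safe,
// so a real service should serialise access to it.
var parameters = new ModelParams(Path.Combine(AppContext.BaseDirectory, "Models", "phi-2.Q4_K_M.gguf"));
using var model   = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor      = new InteractiveExecutor(context);

// Simplified request/response shapes, not the full OpenAI chat schema.
app.MapPost("/v1/chat/completions", async (ChatRequest request) =>
{
    var reply = new StringBuilder();
    await foreach (var chunk in executor.InferAsync(request.Prompt))
        reply.Append(chunk);

    return Results.Ok(new ChatResponse(reply.ToString()));
});

app.Run();

record ChatRequest(string Prompt);
record ChatResponse(string Reply);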

Wrap-up

Local LLMs have graduated from science-project status to genuine productivity boosters. With C#, the .NET SDK, and a few NuGet packages, you can spin up a private chatbot with zero per-token cost before your CI pipeline finishes compiling main. (That’s a developer joke, I promise.)

Ping me on whatever social network is still cool by the time you read this and show me what you build!


