Installing a Local LLM on Windows using llama.cpp
Large Language Models (LLMs) such as ChatGPT usually run on powerful cloud servers. However, modern optimization techniques allow smaller models to run directly on personal computers. This guide explains how beginners can install and run a local LLM on Windows using the lightweight runtime llama.cpp.
What is llama.cpp?
llama.cpp is an open-source inference engine written in C/C++ designed to run LLMs efficiently on CPUs. It supports quantized models in the GGUF format, which reduces memory usage at a small cost in output quality.
Basic workflow:
User Prompt
↓
llama.cpp runtime
↓
GGUF Model
↓
Generated Response
System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| Operating System | Windows 10 / 11 | Windows 11 |
| RAM | 8 GB | 16 GB |
| Disk Space | 10 GB | 20+ GB on an SSD |
| CPU | AVX capable | Modern multi-core CPU |
Step 1 — Install Required Tools
Before building llama.cpp, install the following tools.
Install Git
Download Git from https://git-scm.com
Verify installation:
git --version
Install CMake
Download CMake from https://cmake.org/download
Verify installation:
cmake --version
Install Visual Studio Build Tools
Download Visual Studio Community or Build Tools and enable:
- Desktop Development with C++
- MSVC Compiler
- Windows SDK
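Once all three tools are installed, a quick check from a new terminal confirms Git and CMake are on the PATH. The sketch below uses a POSIX shell (e.g. Git Bash); in PowerShell, `Get-Command` serves the same purpose. The winget package IDs in the comments are the commonly published ones and are an assumption — verify them with `winget search` on your machine.

```shell
# Alternatively, install Git and CMake via winget:
#   winget install Git.Git
#   winget install Kitware.CMake
# Confirm both tools are reachable from a fresh terminal:
status=$(for tool in git cmake; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool found"
  else
    echo "$tool missing"
  fi
done)
echo "$status"
```

If either tool reports missing, re-run its installer and make sure the "add to PATH" option is selected, then open a new terminal.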
Step 2 — Download llama.cpp
Clone the repository using Git.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Step 3 — Build llama.cpp
Create a build directory and compile the project.
mkdir build
cd build
cmake ..
cmake --build . --config Release
After compilation, executables will appear in:
build/bin/Release
Step 4 — Download a GGUF Model
llama.cpp requires models in GGUF format. These models are usually downloaded from model repositories such as Hugging Face.
Common quantization formats:
| Quantization | Approx RAM Usage (7B model) |
|---|---|
| Q4 | ~4 GB |
| Q5 | ~5 GB |
| Q8 | ~8 GB |
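The figures above follow a simple rule of thumb: memory for the weights is roughly the parameter count times the bits per weight, divided by 8. A sketch of the arithmetic (ignoring GGUF metadata and KV-cache overhead, which add more on top):

```shell
# Rough weight-memory estimate for a 7B model at Q4 (~4 bits per weight)
params=7000000000   # 7 billion parameters
bits=4              # Q4 quantization
bytes=$((params * bits / 8))
gib=$((bytes / 1024 / 1024 / 1024))
echo "~${gib} GiB for the weights alone"   # prints ~3 GiB
```

The gap between this ~3 GiB estimate and the ~4 GB in the table is the runtime overhead: context buffers and metadata grow with the context size you choose.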
Place the downloaded model inside a folder such as:
models/
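As a sketch, a GGUF file can be fetched directly with curl; Hugging Face serves raw files under a `/resolve/main/` path. The repository and file names below are hypothetical placeholders — substitute a model you have chosen.

```shell
# Hypothetical names -- replace with a real repository and file from Hugging Face
REPO="example-org/example-7b-GGUF"
FILE="example-7b.Q4_K_M.gguf"
URL="https://huggingface.co/${REPO}/resolve/main/${FILE}"
echo "$URL"
# Then download into the models folder:
# mkdir -p models
# curl -L -o "models/${FILE}" "$URL"
```

The `-L` flag matters because Hugging Face redirects download requests to its CDN.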
Step 5 — Run the Model
Navigate to the compiled binary folder.
cd build/bin/Release
Run the CLI interface:
llama-cli.exe -m ../../../models/model.gguf
Example prompt:
User: Explain how DNS works
Assistant:
Step 6 — Run llama.cpp as an API Server
You can also run llama.cpp as a local API server.
llama-server.exe -m ../../../models/model.gguf -c 4096
Server endpoint:
http://localhost:8080
This allows integration with web applications, automation scripts, or development tools.
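Recent llama-server builds expose an OpenAI-compatible chat endpoint at `/v1/chat/completions`. The sketch below builds a request payload and shows the curl call to send once the server is running; the endpoint path and field names come from that compatibility layer, so check your build's documentation if the request is rejected.

```shell
# JSON body for the OpenAI-compatible chat endpoint
PAYLOAD='{"messages":[{"role":"user","content":"Explain how DNS works"}],"max_tokens":256}'
echo "$PAYLOAD"
# With the server running on the default port:
# curl http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```

Because the API shape matches OpenAI's, many existing client libraries can talk to the local server simply by pointing their base URL at http://localhost:8080.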
Choosing the Right Model
| Model Size | RAM Needed | Use Case |
|---|---|---|
| 3B | 4-6 GB | Basic chat |
| 7B | 8-10 GB | Programming and reasoning |
| 13B | 16+ GB | Advanced tasks |
Advantages of Local LLMs
- Complete privacy — data never leaves your computer
- No API usage costs
- Works offline after downloading the model
- Fully customizable for development
Conclusion
Running a local LLM on Windows using llama.cpp allows developers and enthusiasts to experiment with AI without relying on cloud services. The setup involves installing build tools, compiling the runtime, downloading a GGUF model, and launching the model through the CLI or API server.
As model optimization improves, local AI systems are becoming increasingly practical for learning, experimentation, and development.

