Running LLMs and Stable Diffusion Locally: My Setup & Getting Started
For a while now, I've been running generative AI locally – specifically Large Language Models (LLMs) and image generation models. The idea of running these powerful tools on my own hardware, without relying on cloud services, is incredibly appealing. It's about control, privacy, and frankly, a bit of tech nerdiness! This post details how I've set up my local LLM environment, including integrating Stable Diffusion for some amazing image creation.
My Local Setup
Let’s start with the basics. My system is running Ubuntu 25.04 (codename "Plucky") on an Intel-based machine. Here's a breakdown:
- Operating System: Ubuntu 25.04
- Kernel: 6.14.0-33-generic
- GPU: NVIDIA RTX 500 Ada Generation (Driver Version: 580.95.05, CUDA Version: 13.0)
- Display Server: Wayland (reported by `$XDG_SESSION_TYPE`). Crucially, this choice affected GPU utilization: initially I saw limited GPU performance under Wayland compared to X11, but I found settings that mitigate this (see the Wayland section below).
You can verify these details yourself with commands like `uname -a` and `nvidia-smi`.
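For reference, these are the exact checks on my machine (all standard tools, nothing llama.cpp-specific):

```bash
# Print the kernel version and architecture
uname -a
# Print the NVIDIA driver version, CUDA version, and current GPU usage
nvidia-smi
# Confirm the session type (should print "wayland")
echo $XDG_SESSION_TYPE
```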
The Core: llama.cpp
At the heart of my setup is llama.cpp, a fantastic project developed by Georgi Gerganov. It's designed to run LLMs efficiently on CPUs and, crucially, with GPU acceleration using NVIDIA CUDA. It's remarkably easy to get running, and it's what allows me to run models like DeepSeek-Coder and others locally.
Here's the breakdown of my installation process (based on the instructions found here: https://github.com/chandpriyankara/llama.cpp-setup):
- Install Dependencies:
  - `git clone https://github.com/ggerganov/llama.cpp` – Downloads the llama.cpp repository.
  - `cd llama.cpp/` – Navigates into the cloned directory.
  - `sudo apt install cmake` – Installs CMake, the build tool.
  - `sudo apt install nvidia-cuda-toolkit` – Installs the NVIDIA CUDA Toolkit for GPU support.
  - `sudo apt install libcurl4-openssl-dev` – Installs the libcurl development headers needed for network requests (e.g. downloading models).
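For convenience, here are those dependency steps as a single copy-paste block (the same commands as above, with the three apt installs combined into one):

```bash
# Clone llama.cpp and install the build dependencies
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp/
sudo apt install cmake nvidia-cuda-toolkit libcurl4-openssl-dev
```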
- GPU Setup (Critical – NVIDIA Drivers are Key!): This was the biggest hurdle.
  - `groups $USER` – Lists the user's current group memberships (check whether `video` is present).
  - `newgrp video` – Refreshes the group membership in the current shell.
  - `sudo usermod -aG video $USER` – Adds the user to the `video` group.
  - `sudo reboot` – Reboots the system to ensure the group membership takes effect.
  - Installing Proprietary NVIDIA Drivers (Version 580): This was absolutely vital. I used Ubuntu 25.04's firmware tool to load it. This provided the necessary CUDA support and resolved many performance issues.
  - `nvcc --version` – Verifies the CUDA installation.
  - `nvidia-smi` – Shows the GPU information.
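After the reboot, a quick sanity check confirms the group membership and the CUDA toolchain are in place:

```bash
# Should list "video" among the user's groups
groups $USER
# CUDA compiler version, installed by nvidia-cuda-toolkit
nvcc --version
# Driver 580.95.05 / CUDA 13.0 on my machine
nvidia-smi
```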
- CMake Configuration:
  - `cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_SERVER=ON -DGGML_CUDA_FORCE_MMQ=ON -DGGML_CUDA_FORCE_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=89 -G Ninja` – Configures the build environment, enabling CUDA support.
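The `89` in `CMAKE_CUDA_ARCHITECTURES` corresponds to the Ada generation's compute capability of 8.9. If you're on different hardware, newer drivers let you query it directly:

```bash
# Prints the GPU's CUDA compute capability, e.g. "8.9" for Ada cards;
# drop the dot to get the value for CMAKE_CUDA_ARCHITECTURES
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```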
- Build:
  - `cmake --build build --config Release` – Builds the llama.cpp project in Release mode for optimized performance. You can add `-j N` to set the number of parallel build jobs.
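For example, to cap the build at eight parallel jobs and confirm the server binary was produced:

```bash
# Build with 8 parallel jobs; compiled binaries land in build/bin/
cmake --build build --config Release -j 8
ls build/bin/llama-server
```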
Running the Models – Finding the Sweet Spot
- DeepSeek-Coder: I've found that for the RTX 500, maximizing GPU layers isn't always the best approach. Initially I aimed for 36 layers, but that consumed too much memory. Experimentation is key (see the layer sweep sketch at the end of the Wayland section below)!
- Stable Diffusion: I'm using stable-diffusion.cpp to run Stable Diffusion locally. This is a bit more involved, requiring the download of a checkpoint file and some additional steps; a minimal example follows this list. You can find more details and instructions on the project's GitHub page: https://github.com/leejet/stable-diffusion.cpp
- Gemma 3: This is my current stable model, giving good enough intelligence and speed (~40 tokens/sec).
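For the Stable Diffusion side, here's a minimal text-to-image run as a sketch – the checkpoint filename is a placeholder for whatever model you download, and the project's README is the authority on current flags:

```bash
# Generate an image from a text prompt with stable-diffusion.cpp;
# sd-v1-4.ckpt is a placeholder for your downloaded checkpoint
./build/bin/sd -m sd-v1-4.ckpt -p "a photo of a red fox in the snow" -o fox.png
```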
My Final Running Command (and why I tweaked it):
`llama-server -fa on --mlock --n-gpu-layers 36 -m gemma-3-4b-it-Q4_K_M.gguf --host 127.0.0.1 --port 11435`
I found that `--n-gpu-layers 36` gave me the best balance of performance and memory usage on my system.
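Once the server is running, a quick smoke test via its OpenAI-compatible HTTP API (the prompt is just an example):

```bash
# Send a chat completion request to the local llama-server instance
curl http://127.0.0.1:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}]}'
```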
Addressing Wayland Limitations & Optimizations
Initially, I encountered limitations with GPU utilization under
Wayland compared to X11. I discovered that simply switching to Wayland
wasn't enough. By ensuring the correct NVIDIA drivers were installed
and experimenting with the --n-gpu-layers
setting, I was able to achieve significant performance gains. The mlock
flag is also crucial for consistent performance.
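If you want to repeat the layer experiments on your own hardware, llama.cpp's bundled llama-bench tool makes the sweep easy. A sketch, assuming the same Gemma model file as above; watch `nvidia-smi` in a second terminal to track VRAM while it runs:

```bash
# Benchmark generation speed at several GPU offload levels in one run;
# llama-bench accepts comma-separated values and tests each in turn
./build/bin/llama-bench -m gemma-3-4b-it-Q4_K_M.gguf -ngl 20,28,36
```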
Resources
- llama.cpp Repository: https://github.com/ggerganov/llama.cpp
- DeepSeek-Coder: https://huggingface.co/bartowski/DeepSeek-Coder-V2-Instruct-GGUF
- DeepSeek-Coder 6.7B Instruct (GGUF): https://huggingface.co/TheBloke/deepseek-coder-6.7B-instruct-GGUF
- Stable Diffusion.cpp: https://github.com/leejet/stable-diffusion.cpp