Running LLMs and Stable Diffusion Locally: My Setup & Getting Started


For a while now, I've been running generative AI locally – specifically Large Language Models (LLMs) and image generation models. The idea of running these powerful tools on my own hardware, without relying on cloud services, is incredibly appealing: it's about control, privacy, and, frankly, a bit of tech nerdiness! This post details how I set up my local LLM environment, including integrating Stable Diffusion for some amazing image creation.

My Local Setup

Let’s start with the basics. My system runs Ubuntu 25.04 (codename "Plucky Puffin") on an Intel-based machine. Here's a breakdown:

  • Operating System: Ubuntu 25.04
  • Kernel: 6.14.0-33-generic
  • GPU: NVIDIA RTX 500 Ada Generation (Driver Version: 580.95.05, CUDA Version: 13.0)
  • Display Server: Wayland (check yours with echo $XDG_SESSION_TYPE) – this mattered more than I expected. Initially, GPU utilization under Wayland lagged behind X11, but the driver and settings tweaks described below mitigated this.

You can verify these details yourself using commands like uname -a and nvidia-smi.
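
If you want to run the same sanity checks on your own machine, these standard commands cover it:

uname -a                  # kernel version and architecture
nvidia-smi                # GPU model, driver version, CUDA version
echo $XDG_SESSION_TYPE    # prints "wayland" or "x11"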

The Core: llama.cpp

At the heart of my setup is llama.cpp, a fantastic project started by Georgi Gerganov. It’s designed to run LLMs efficiently on CPUs and, crucially, with GPU acceleration using NVIDIA CUDA. It’s remarkably easy to get running, and it’s what allows me to run models like DeepSeek-Coder and others locally.

Here’s a breakdown of my installation process (based on the instructions found here: https://github.com/chandpriyankara/llama.cpp-setup):

  1. Install Dependencies:

    • git clone https://github.com/ggerganov/llama.cpp – Clones the llama.cpp repository.
    • cd llama.cpp/ – Moves into the cloned directory.
    • sudo apt install cmake – Installs CMake, the build system generator.
    • sudo apt install ninja-build – Installs Ninja, needed because the build below is configured with -G Ninja.
    • sudo apt install nvidia-cuda-toolkit – Installs the NVIDIA CUDA Toolkit for GPU support.
    • sudo apt install libcurl4-openssl-dev – Installs the libcurl development headers, used for downloading models over HTTP.
  2. GPU Setup (Critical – NVIDIA Drivers are Key!): This was the biggest hurdle.

    • groups $USER – Lists the groups your user currently belongs to (check whether video is already present).
    • sudo usermod -aG video $USER – Adds your user to the video group.
    • newgrp video – Applies the new group membership in the current shell.
    • sudo reboot – Reboots the system to ensure the group membership takes effect everywhere.
    • Installing the proprietary NVIDIA driver (version 580): This was absolutely vital. I installed it through Ubuntu 25.04's driver management tool, which provided the necessary CUDA support and resolved many performance issues.
    • nvcc --version – Verifies the CUDA toolkit installation.
    • nvidia-smi – Confirms the driver is loaded and shows GPU status.
  3. CMake Configuration:

    • cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_SERVER=ON -DGGML_CUDA_FORCE_MMQ=ON -DGGML_CUDA_FORCE_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=89 -G Ninja – Configures the build with CUDA support enabled; CMAKE_CUDA_ARCHITECTURES=89 targets the RTX 500's Ada Lovelace architecture (compute capability 8.9), and -G Ninja selects the Ninja build system.
  4. Build:

    • cmake --build build --config Release – Builds llama.cpp in Release mode for optimized performance. You can add -j N to set the number of parallel build jobs.
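
To see the whole flow in one place, here's the sequence as a single script – a sketch of the steps above for Ubuntu 25.04, not a polished installer:

# Dependencies
sudo apt install cmake ninja-build nvidia-cuda-toolkit libcurl4-openssl-dev

# Fetch and build llama.cpp with CUDA enabled
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp/
cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_SERVER=ON \
      -DGGML_CUDA_FORCE_MMQ=ON -DGGML_CUDA_FORCE_CUBLAS=ON \
      -DCMAKE_CUDA_ARCHITECTURES=89 -G Ninja
cmake --build build --config Release

# The binaries (including llama-server) land in build/bin
ls build/bin/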

Running the Models – Finding the Sweet Spot

  • DeepSeek-Coder: I’ve found that on the RTX 500, maximizing GPU layers isn’t always the best approach. Initially I aimed for 36 offloaded layers, but with DeepSeek-Coder that consumed too much memory. Experimentation is key!
  • Stable Diffusion: I'm using stable-diffusion.cpp to run Stable Diffusion locally. This is a bit more involved, requiring a checkpoint download and a separate build (see the sketch after this list). You can find more details and instructions on the project’s GitHub page: https://github.com/leejet/stable-diffusion.cpp
  • Gemma 3: This is my current stable model; it gives good enough intelligence and speed (~40 tokens/sec) for daily use.
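
Since stable-diffusion.cpp follows a very similar CMake workflow, here's a sketch of building it and generating an image, based on the project's README (flag names can differ between versions, and the checkpoint filename is just an example):

git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp/
cmake -B build -DSD_CUDA=ON
cmake --build build --config Release

# Generate an image from a downloaded checkpoint (example filename)
./build/bin/sd -m sd-v1-5.safetensors -p "a watercolor fox in a forest" -o output.png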

My Final Running Command (and why I tweaked it):

llama-server -fa on --mlock --n-gpu-layers 36 -m gemma-3-4b-it-Q4_K_M.gguf --host 127.0.0.1 --port 11435

I found that --n-gpu-layers 36 gave me the best balance of performance and memory usage on my system.
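
Once the server is running, it exposes an OpenAI-compatible HTTP API, so a quick test with curl looks like this (the prompt is just an example):

curl http://127.0.0.1:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a haiku about local LLMs."}]}'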

Addressing Wayland Limitations & Optimizations

Initially, I encountered limited GPU utilization under Wayland compared to X11, and I discovered that simply switching sessions wasn't enough. Making sure the correct proprietary NVIDIA driver was installed and experimenting with the --n-gpu-layers setting is what produced the real performance gains. The --mlock flag is also crucial for consistent performance: it pins the model's weights in RAM so the OS can't swap them out mid-generation.
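
One caveat with --mlock: it only works if your user is allowed to lock that much memory. If llama-server warns that it failed to lock the model buffer, check your limit – the persistent fix below is a sketch, as the exact limits configuration varies by setup:

ulimit -l    # max locked memory in KB; "unlimited" is ideal

# In /etc/security/limits.conf (then log out and back in):
#   <your-username>  soft  memlock  unlimited
#   <your-username>  hard  memlock  unlimited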

Resources

  • llama.cpp: https://github.com/ggerganov/llama.cpp
  • My llama.cpp setup notes: https://github.com/chandpriyankara/llama.cpp-setup
  • stable-diffusion.cpp: https://github.com/leejet/stable-diffusion.cpp