
Deploy a ChatGPT‑like LLM service on Jetstream

Tutorial last updated in September 2025

In this tutorial we deploy a Large Language Model (LLM) on Jetstream, run inference locally on the smallest currently available GPU node (g3.medium, 10 GB VRAM), then install a web chat interface (Open WebUI) and serve it with HTTPS using Caddy.

Before spinning up your own GPU, consider the managed Jetstream LLM inference service. It may be more cost‑ and time‑effective if you just need API access to standard models.

We will deploy a single (quantized) model: Meta Llama 3.1 8B Instruct Q3_K_M (GGUF). This quantized 8B model fits comfortably in ~8 GB of GPU memory, so it runs on a g3.medium (10 GB) with a little headroom. (The older g3.small flavor has been retired.)

If you later choose a different quantization or a larger context length, or move to an unquantized 8B / 70B model, you’ll need a larger flavor—adjust accordingly.

This tutorial is adapted from work by Tijmen de Haan, the author of Cosmosage.

Model choice & sizing

Jetstream GPU flavors (current key options):

Instance Type    Approx. GPU Memory (GB)
g3.medium        10
g3.large         20
g3.xl            40 (full A100)

We pick the quantized Llama 3.1 8B Instruct Q3_K_M variant (GGUF format). Its VRAM footprint during inference is roughly 8 GB with default context settings, leaving some margin on g3.medium. Always keep a couple of GB free to avoid OOM errors when increasing context length or concurrency.

Ensure the model is an Instruct fine‑tuned variant (it is) so it responds well to chat prompts.

Create a Jetstream instance

Log in to Exosphere, request an Ubuntu 24 g3.medium instance (name it chat) and SSH into it using either your SSH key or the passphrase generated by Exosphere.

Load Miniforge

A centrally provided Miniforge module is available on Jetstream images. Load it in each new shell, then create the two Conda environments used below (one for the model server, one for the web UI).

module load miniforge
conda init

After running conda init, reload your shell so conda is available: run exec bash -l (avoids logging out and back in).

Serve the model with llama.cpp (OpenAI‑compatible server)

We use llama.cpp via the llama-cpp-python package, which provides an OpenAI‑style HTTP API (default port 8000) that Open WebUI can connect to.

Create an environment and install (remember to module load miniforge first in any new shell).

The last pip install step may take several minutes to compile llama.cpp from source, so please be patient.

conda create -y -n llama python=3.11
conda activate llama
conda install -y cmake ninja scikit-build-core huggingface_hub
module load nvhpc/24.7/nvhpc
# Enable CUDA acceleration with explicit compilers, arch, release build
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_COMPILER=$(which nvcc) -DCMAKE_C_COMPILER=$(which gcc) -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_BUILD_TYPE=Release" \
    pip install --no-cache-dir --no-build-isolation --force-reinstall "llama-cpp-python[server]==0.3.16"

Download the quantized GGUF file (Q3_K_M variant) from the QuantFactory model page: https://huggingface.co/QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF

mkdir -p ~/models
hf download QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf \
    --local-dir ~/models
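
Optionally confirm the download completed (the Q3_K_M file is roughly 4 GB):

ls -lh ~/models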

Test run (Ctrl-C to stop):

python -m llama_cpp.server \
    --model /home/exouser/models/Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf \
    --chat_format llama-3 \
    --n_ctx 8192 \
    --n_gpu_layers -1 \
    --port 8000

--n_gpu_layers -1 tells llama.cpp to offload all model layers to the GPU (full GPU inference). Without this flag the default is CPU layers (n_gpu_layers=0), which results in only ~1 GB of VRAM being used and much slower generation. Full offload of this 8B Q3_K_M model plus context buffers should occupy roughly 8–9 GB VRAM at --n_ctx 8192 on first real requests. If it fails to start with an out‑of‑memory (OOM) error you have a few mitigation options (apply one, then retry):

  • Lower context length: e.g. --n_ctx 4096 (largest single lever; roughly linear VRAM impact for KV cache).
  • Partially offload: replace --n_gpu_layers -1 with a number (e.g. --n_gpu_layers 20). Remaining layers will run on CPU (slower, but reduces VRAM need).
  • Use a lower‑bit quantization (e.g. Q2_K) or a smaller model.

You can inspect VRAM usage with watch -n 2 nvidia-smi after starting the server.
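
While the test server is running, you can also query the API from a second SSH session to confirm it responds. A minimal sketch (the endpoints follow the OpenAI convention; the model field can typically be omitted because only one model is loaded):

# List the loaded model(s)
curl http://localhost:8000/v1/models

# Request a short chat completion
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Say hello in one short sentence."}], "max_tokens": 50}'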

Quick note on the “KV cache”: During generation the model reuses previously computed attention Key and Value tensors (instead of recalculating them each new token). These tensors are stored per layer and per processed token; as your prompt and conversation grow, the cache grows linearly with the number of tokens kept in context. That’s why idle VRAM (~weights only) is lower (~6 GB) and rises toward the higher number (up to ~8–9 GB here) only after longer prompts / chats. Reducing --n_ctx caps the maximum KV cache size; clearing history or restarting frees it.
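
As a rough sanity check of those numbers (assuming a 16-bit KV cache and Llama 3.1 8B’s architecture of 32 layers, 8 key/value heads, and head dimension 128): each cached token stores about 2 × 32 × 8 × 128 × 2 bytes ≈ 128 KiB of keys and values, so a full 8192-token context adds roughly 1 GB on top of the weights. Exact figures vary with the llama.cpp version and settings, but this is the order of magnitude to plan for.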

If it starts without errors, create a systemd service so it restarts automatically.

Quick option: If you prefer a single copy/paste that creates both the llama and open-webui systemd services at once, skip the next two manual unit file sections and jump ahead to the subsection titled “(Optional) One-liner to create both services” below. You can always come back here for the longer, step-by-step version and troubleshooting notes.

Using sudo to run your preferred text editor, create /etc/systemd/system/llama.service with the following contents:

[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
ExecStart=/bin/bash -lc "module load nvhpc/24.7/nvhpc miniforge && conda run -n llama python -m llama_cpp.server --model /home/exouser/models/Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf --chat_format llama-3 --n_ctx 8192 --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl enable llama
sudo systemctl start llama

Troubleshooting:

  • Logs: sudo journalctl -u llama -f
  • Status: sudo systemctl status llama
  • GPU usage: nvidia-smi (≈6 GB idle right after start with full offload; can grow toward ~8–9 GB under long prompts/conversations as KV cache fills)

Configure the chat interface

The chat interface is provided by Open WebUI.

Create the environment (in a new shell remember to module load miniforge first):

module load miniforge
conda create -y -n open-webui python=3.11
conda activate open-webui
pip install open-webui
open-webui serve

If it starts without errors, stop it with Ctrl-C and create a service for it.

Using sudo to run your preferred text editor, create /etc/systemd/system/webui.service with the following contents:

[Unit]
Description=Open Web UI serving
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser

# Activating the conda environment and starting the service
ExecStart=/bin/bash -lc "module load miniforge && conda run -n open-webui open-webui serve"
Restart=always
# PATH managed by module + conda

[Install]
WantedBy=multi-user.target

Then enable and start the service:

sudo systemctl enable webui
sudo systemctl start webui
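
Open WebUI listens on port 8080 by default; that is the port Caddy will proxy to in the next section. Once the service is up (the first start can take a little while), you can check that it responds locally; a quick sketch:

sudo systemctl status webui
# Should print 200 once the UI is ready
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080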

(Optional) One-liner to create both services

If you already created the Conda environments (llama and open-webui) and downloaded the model, you can create, enable, and start both systemd services (model server + Open WebUI) in a single copy/paste. Adjust MODEL, N_CTX, USER, and NVHPC_MOD if needed before running:

MODEL=/home/exouser/models/Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf N_CTX=8192 USER=exouser NVHPC_MOD=nvhpc/24.7/nvhpc ; sudo tee /etc/systemd/system/llama.service >/dev/null <<EOF && sudo tee /etc/systemd/system/webui.service >/dev/null <<EOF2 && sudo systemctl daemon-reload && sudo systemctl enable --now llama webui
[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=$USER
Group=$USER
WorkingDirectory=/home/$USER
ExecStart=/bin/bash -lc "module load $NVHPC_MOD miniforge && conda run -n llama python -m llama_cpp.server --model $MODEL --chat_format llama-3 --n_ctx $N_CTX --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target
EOF
[Unit]
Description=Open Web UI serving
After=network.target

[Service]
User=$USER
Group=$USER
WorkingDirectory=/home/$USER
ExecStart=/bin/bash -lc "module load miniforge && conda run -n open-webui open-webui serve"
Restart=always

[Install]
WantedBy=multi-user.target
EOF2

To later change context length: edit /etc/systemd/system/llama.service, modify --n_ctx, then run:

sudo systemctl daemon-reload
sudo systemctl restart llama

Configure web server for HTTPS

Finally we can use Caddy to serve the web interface with HTTPS.

Install Caddy. Note that the version of Caddy available in the Ubuntu APT repositories is often outdated. Follow the instructions to install Caddy on Ubuntu. You can copy-paste all the lines at once.
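
At the time of writing, the official Ubuntu/Debian instructions from the Caddy documentation look roughly like this (verify against the current docs before copying):

sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy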

Modify the Caddyfile to serve the web interface. (Note that sensible-editor will prompt you to choose a text editor; select the number for /bin/nano if you aren’t sure what else to pick.)

sudo sensible-editor /etc/caddy/Caddyfile

and set its contents to:

chat.xxx000000.projects.jetstream-cloud.org {
        reverse_proxy localhost:8080
}

Here chat is the name of your instance and xxx000000 is the allocation code. You can find the full hostname (e.g. chat.xxx000000.projects.jetstream-cloud.org) in Exosphere: open your instance's details page, scroll to the Credentials section, and copy the value shown under Hostname.

Then reload Caddy:

sudo systemctl reload caddy
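
If HTTPS does not come up right away, check Caddy's status and logs; automatic certificate issuance typically needs port 80 and/or 443 reachable from the internet:

sudo systemctl status caddy
sudo journalctl -u caddy -f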

Connect the model and test the chat interface

Point your browser to https://chat.xxx000000.projects.jetstream-cloud.org and you should see the chat interface.

Create an account, then click the profile icon at the top right, open the “Admin panel”, and go to “Settings”, then “Connections”. The first account created becomes the admin; anyone else who signs up is a regular user and must be approved by the admin. That approval step is the only protection in this setup, and an attacker could still exploit vulnerabilities in Open WebUI to gain access. If you need more security, the simplest option is to restrict the firewall (using ufw) so that only your IP can connect; a sketch follows below.
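
For example, a minimal ufw sketch, assuming your public IP is 203.0.113.10 (a placeholder). Keep SSH open so you do not lock yourself out, and note that blocking ports 80/443 for everyone else can interfere with Caddy's automatic certificate issuance and renewal:

sudo ufw allow OpenSSH
sudo ufw allow from 203.0.113.10 to any port 80 proto tcp
sudo ufw allow from 203.0.113.10 to any port 443 proto tcp
sudo ufw enable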

Under “OpenAI API” enter the URL http://localhost:8000/v1 and leave the API key empty (the local llama.cpp server is unsecured by default on localhost).

Click the “Verify connection” button, then click “Save” at the bottom.

Finally you can start chatting with the model!

If you change context length (--n_ctx) or increase concurrent users you may approach the 10 GB limit. Reduce --n_ctx (e.g. 4096) if you encounter out‑of‑memory errors.

Scaling up or changing models

Want a larger model or higher quality? Options:

  • Use a higher‑bit quantization (Q4_K_M / Q5_K_M) for better quality (needs more VRAM); see the example after this list.
  • Move to an unquantized FP16 8B model (≈16 GB VRAM for the weights alone) on g3.large or bigger.
  • Increase context length (for this model, each additional 1k tokens of context adds roughly 128 MB of KV cache). If you see OOM, lower --n_ctx.
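
For example, to switch to the Q4_K_M quantization (the filename below assumes the QuantFactory repository follows the same naming pattern as the Q3_K_M file; at --n_ctx 8192 this may be tight on g3.medium, so lower the context or move to g3.large if you hit OOM):

hf download QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf \
    --local-dir ~/models
# Point --model in the service file at the new .gguf, then restart:
sudo sensible-editor /etc/systemd/system/llama.service
sudo systemctl daemon-reload
sudo systemctl restart llama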

For production workloads, consider the managed Jetstream inference service or frameworks like vLLM on larger GPUs for higher throughput.