
Deploy a ChatGPT‑like LLM service on Jetstream

Tutorial last updated in September 2025

In this tutorial we deploy a Large Language Model (LLM) on Jetstream, run inference locally on the smallest currently available GPU node (g3.medium, 10 GB VRAM), then install a web chat interface (Open WebUI) and serve it with HTTPS using Caddy.

Before spinning up your own GPU, consider the managed Jetstream LLM inference service. It may be more cost‑ and time‑effective if you just need API access to standard models.

We will deploy a single (quantized) model: Meta Llama 3.1 8B Instruct Q3_K_M (GGUF). This quantized 8B model fits comfortably in ~8 GB of GPU memory, so it runs on a g3.medium (10 GB) with a little headroom. (The older g3.small flavor has been retired.)

If you later choose a different quantization or a larger context length, or move to an unquantized 8B / 70B model, you’ll need a larger flavor—adjust accordingly.

This tutorial is adapted from work by Tijmen de Haan, the author of Cosmosage.

Model choice & sizing

Jetstream GPU flavors (current key options):

Instance Type    Approx. GPU Memory (GB)
g3.medium        10
g3.large         20
g3.xl            40 (full A100)

We pick the quantized Llama 3.1 8B Instruct Q3_K_M variant (GGUF format). Its VRAM footprint during inference is roughly 8 GB with default context settings, leaving some margin on g3.medium. Always keep a couple of GB free to avoid OOM errors when increasing context length or concurrency.

Ensure the model is an Instruct fine‑tuned variant (it is) so it responds well to chat prompts.

Create a Jetstream instance

Log in to Exosphere, request an Ubuntu 24 g3.medium instance (name it chat) and SSH into it using either your SSH key or the passphrase generated by Exosphere.

Load Miniforge

A centrally provided Miniforge module is available on Jetstream images. Load it in each new shell, then create the two Conda environments used below (one for the model server, one for the web UI).

module load miniforge
conda init

After running conda init, reload your shell so conda is available: run exec bash -l (avoids logging out and back in).

Serve the model with llama.cpp (OpenAI‑compatible server)

We use llama.cpp via the llama-cpp-python package, which provides an OpenAI‑style HTTP API (default port 8000) that Open WebUI can connect to.

Create an environment and install (remember to module load miniforge first in any new shell).

The last pip install step may take several minutes to compile llama.cpp from source, so please be patient.

conda create -y -n llama python=3.11
conda activate llama
conda install -y cmake ninja scikit-build-core huggingface_hub
module load nvhpc/24.7/nvhpc
# Enable CUDA acceleration with explicit compilers, arch, release build
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_COMPILER=$(which nvcc) -DCMAKE_C_COMPILER=$(which gcc) -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_BUILD_TYPE=Release" \
    pip install --no-cache-dir --no-build-isolation --force-reinstall "llama-cpp-python[server]==0.3.16"

Download the quantized GGUF file (Q3_K_M variant) from the QuantFactory model page: https://huggingface.co/QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF

mkdir -p ~/models
hf download QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf \
    --local-dir ~/models
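
Optionally confirm the download completed (the Q3_K_M file is roughly 4 GB):

ls -lh ~/models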

Test run (Ctrl-C to stop):

python -m llama_cpp.server \
    --model /home/exouser/models/Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf \
    --chat_format llama-3 \
    --n_ctx 8192 \
    --n_gpu_layers -1 \
    --port 8000

--n_gpu_layers -1 tells llama.cpp to offload all model layers to the GPU (full GPU inference). Without this flag the default is CPU layers (n_gpu_layers=0), which results in only ~1 GB of VRAM being used and much slower generation. Full offload of this 8B Q3_K_M model plus context buffers should occupy roughly 8–9 GB VRAM at --n_ctx 8192 on first real requests. If it fails to start with an out‑of‑memory (OOM) error you have a few mitigation options (apply one, then retry):

  • Lower context length: e.g. --n_ctx 4096 (largest single lever; roughly linear VRAM impact for KV cache).
  • Partially offload: replace --n_gpu_layers -1 with a number (e.g. --n_gpu_layers 20). Remaining layers will run on CPU (slower, but reduces VRAM need).
  • Use a lower‑bit quantization (e.g. Q2_K) or a smaller model.

You can inspect VRAM usage with watch -n 2 nvidia-smi after starting the server.
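
While the test server is running, you can also query the API from a second SSH session to confirm it responds. A minimal sketch (the endpoints follow the OpenAI convention; the model field can typically be omitted because only one model is loaded):

# List the loaded model(s)
curl http://localhost:8000/v1/models

# Request a short chat completion
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Say hello in one short sentence."}], "max_tokens": 50}'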

Quick note on the “KV cache”: During generation the model reuses previously computed attention Key and Value tensors (instead of recalculating them each new token). These tensors are stored per layer and per processed token; as your prompt and conversation grow, the cache grows linearly with the number of tokens kept in context. That’s why idle VRAM (~weights only) is lower (~6 GB) and rises toward the higher number (up to ~8–9 GB here) only after longer prompts / chats. Reducing --n_ctx caps the maximum KV cache size; clearing history or restarting frees it.
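
As a rough sanity check of those numbers (assuming a 16-bit KV cache and Llama 3.1 8B’s architecture of 32 layers, 8 key/value heads, and head dimension 128): each cached token stores about 2 × 32 × 8 × 128 × 2 bytes ≈ 128 KiB of keys and values, so a full 8192-token context adds roughly 1 GB on top of the weights. Exact figures vary with the llama.cpp version and settings, but this is the order of magnitude to plan for.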

If it starts without errors, create a systemd service so it restarts automatically.

Quick option: If you prefer a single copy/paste that creates both the llama and open-webui systemd services at once, skip the next two manual unit file sections and jump ahead to the subsection titled “(Optional) One-liner to create both services” below. You can always come back here for the longer, step-by-step version and troubleshooting notes.

Using sudo to run your preferred text editor, create /etc/systemd/system/llama.service with the following contents:

[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
ExecStart=/bin/bash -lc "module load nvhpc/24.7/nvhpc miniforge && conda run -n llama python -m llama_cpp.server --model /home/exouser/models/Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf --chat_format llama-3 --n_ctx 8192 --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl enable llama
sudo systemctl start llama

Troubleshooting:

  • Logs: sudo journalctl -u llama -f
  • Status: sudo systemctl status llama
  • GPU usage: nvidia-smi (≈6 GB idle right after start with full offload; can grow toward ~8–9 GB under long prompts/conversations as KV cache fills)

Configure the chat interface

The chat interface is provided by Open WebUI.

Create the environment (in a new shell remember to module load miniforge first):

module load miniforge
conda create -y -n open-webui python=3.11
conda activate open-webui
pip install open-webui
open-webui serve

If it starts without errors, stop it with Ctrl-C and create a service for it.

Using sudo to run your preferred text editor, create /etc/systemd/system/webui.service with the following contents:

[Unit]
Description=Open Web UI serving
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser

# Activating the conda environment and starting the service
ExecStart=/bin/bash -lc "module load miniforge && conda run -n open-webui open-webui serve"
Restart=always
# PATH managed by module + conda

[Install]
WantedBy=multi-user.target

Then enable and start the service:

sudo systemctl enable webui
sudo systemctl start webui
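
Open WebUI listens on port 8080 by default; that is the port Caddy will proxy to in the next section. Once the service is up (the first start can take a little while), you can check that it responds locally; a quick sketch:

sudo systemctl status webui
# Should print 200 once the UI is ready
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080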

(Optional) One-liner to create both services

If you already created the Conda environments (llama and open-webui) and downloaded the model, you can create, enable, and start both systemd services (model server + Open WebUI) in a single copy/paste. Adjust MODEL, N_CTX, USER, and NVHPC_MOD if needed before running:

MODEL=/home/exouser/models/Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf N_CTX=8192 USER=exouser NVHPC_MOD=nvhpc/24.7/nvhpc ; sudo tee /etc/systemd/system/llama.service >/dev/null <<EOF && sudo tee /etc/systemd/system/webui.service >/dev/null <<EOF2 && sudo systemctl daemon-reload && sudo systemctl enable --now llama webui
[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=$USER
Group=$USER
WorkingDirectory=/home/$USER
ExecStart=/bin/bash -lc "module load $NVHPC_MOD miniforge && conda run -n llama python -m llama_cpp.server --model $MODEL --chat_format llama-3 --n_ctx $N_CTX --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target
EOF
[Unit]
Description=Open Web UI serving
After=network.target

[Service]
User=$USER
Group=$USER
WorkingDirectory=/home/$USER
ExecStart=/bin/bash -lc "module load miniforge && conda run -n open-webui open-webui serve"
Restart=always

[Install]
WantedBy=multi-user.target
EOF2

To later change context length: edit /etc/systemd/system/llama.service, modify --n_ctx, then run:

sudo systemctl daemon-reload
sudo systemctl restart llama

Configure web server for HTTPS

Finally we can use Caddy to serve the web interface with HTTPS.

Install Caddy. Note that the version of Caddy available in the Ubuntu APT repositories is often outdated. Follow the instructions to install Caddy on Ubuntu. You can copy-paste all the lines at once.
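
At the time of writing, the official Ubuntu/Debian instructions from the Caddy documentation look roughly like this (verify against the current docs before copying):

sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy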

Modify the Caddyfile to serve the web interface. (Note that sensible-editor will prompt you to choose a text editor; select the number for /bin/nano if you aren’t sure what else to pick.)

sudo sensible-editor /etc/caddy/Caddyfile

and set its contents to:

chat.xxx000000.projects.jetstream-cloud.org {
        reverse_proxy localhost:8080
}

Here chat is the name of your instance and xxx000000 is the allocation code. You can find the full hostname (e.g. chat.xxx000000.projects.jetstream-cloud.org) in Exosphere: open your instance's details page, scroll to the Credentials section, and copy the value shown under Hostname.

Then reload Caddy:

sudo systemctl reload caddy
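
If HTTPS does not come up right away, check Caddy's status and logs; automatic certificate issuance typically needs port 80 and/or 443 reachable from the internet:

sudo systemctl status caddy
sudo journalctl -u caddy -f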

Connect the model and test the chat interface

Point your browser to https://chat.xxx000000.projects.jetstream-cloud.org and you should see the chat interface.

Create an account, then click the profile icon at the top right, open the “Admin panel”, and go to “Settings”, then “Connections”. The first account created becomes the admin; anyone else who signs up is a regular user and must be approved by the admin. That approval step is the only protection in this setup, and an attacker could still exploit vulnerabilities in Open WebUI to gain access. If you need more security, the simplest option is to restrict the firewall (using ufw) so that only your IP can connect; a sketch follows below.
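
For example, a minimal ufw sketch, assuming your public IP is 203.0.113.10 (a placeholder). Keep SSH open so you do not lock yourself out, and note that blocking ports 80/443 for everyone else can interfere with Caddy's automatic certificate issuance and renewal:

sudo ufw allow OpenSSH
sudo ufw allow from 203.0.113.10 to any port 80 proto tcp
sudo ufw allow from 203.0.113.10 to any port 443 proto tcp
sudo ufw enable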

Under “OpenAI API” enter the URL http://localhost:8000/v1 and leave the API key empty (the local llama.cpp server is unsecured by default on localhost).

Click the “Verify connection” button, then click “Save” at the bottom.

Finally you can start chatting with the model!

If you change context length (--n_ctx) or increase concurrent users you may approach the 10 GB limit. Reduce --n_ctx (e.g. 4096) if you encounter out‑of‑memory errors.

Scaling up or changing models

Want a larger model or higher quality? Options:

  • Use a higher‑bit quantization (Q4_K_M / Q5_K_M) for better quality (needs more VRAM); see the example after this list.
  • Move to an unquantized FP16 8B model (≈16 GB VRAM for the weights alone) on g3.large or bigger.
  • Increase context length (for this model, each additional 1k tokens of context adds roughly 128 MB of KV cache). If you see OOM, lower --n_ctx.
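
For example, to switch to the Q4_K_M quantization (the filename below assumes the QuantFactory repository follows the same naming pattern as the Q3_K_M file; at --n_ctx 8192 this may be tight on g3.medium, so lower the context or move to g3.large if you hit OOM):

hf download QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf \
    --local-dir ~/models
# Point --model in the service file at the new .gguf, then restart:
sudo sensible-editor /etc/systemd/system/llama.service
sudo systemctl daemon-reload
sudo systemctl restart llama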

For production workloads, consider the managed Jetstream inference service or frameworks like vLLM on larger GPUs for higher throughput.