Large Language Model Inference Service¶
We host a large language model (LLM) inference service for the Jetstream2 and IU Research Cloud communities.
This is currently in an early prototype stage, and we welcome suggestions for future refinement, either as a support ticket to help@jetstream-cloud.org or in our community chat on Matrix.
What does the service provide?¶
First, vLLM serves an OpenAI-compatible inference API to integrate with your projects and applications.
Second, Open WebUI provides a browser-based chat interface, similar to ChatGPT.
Which model is it?¶
We currently offer NVIDIA’s fine-tuning of Llama 3.1, Llama-3.1-Nemotron-70B-Instruct. We chose this model for two reasons:
- As of November 2024, it is the highest-rated non-proprietary (self-hostable) model on Chatbot Arena (Overall category).
- It fits on the hardware that we have available at 8-bit quantization while leaving room for a relatively large (48k token) context window.
As an instruct-tuned model, it is best-suited for following instructions and conversation-style prompts. It may not work well for completion or fill-in-middle tasks.
Why is this valuable / worth using?¶
It’s an unlimited-use API that we provide at no cost to our communities. (APIs from OpenAI, Anthropic, and similar providers all cost money to use.) The inference service provides a larger, more capable LLM than would fit on a g3.xl-size Jetstream2 instance (and much larger than will run on most personal computers). Also, your prompt and response data does not leave IU systems, and nobody will use it for training or data mining purposes.
If you are a Jetstream2 user, this service does not cost any SUs to use. It is available as long as you have an active allocation (and an instance to connect from).
What can I do with it?¶
Sky’s the limit!
- Programming and debugging assistant.
- Access it from your preferred code editor using Continue.
- Use it with LangChain to develop LLM-powered applications (see the sketch after this list).
- Literature review assistant; give it a scientific paper and ask for a summary.
- Brainstorming assistant to help develop hypotheses, experimental protocols, and approaches to data analysis.
- Writing and proofreading assistant.
- Tutor, surrogate thesis advisor.
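As one example of the LangChain route, here is a minimal sketch that points LangChain’s OpenAI-compatible chat client at the inference service. It assumes the langchain-openai package, and uses the model identifier and no-key convention described under “How to use the API” below.
from langchain_openai import ChatOpenAI

# Minimal sketch (assumes `pip install langchain-openai`).
# Any non-empty API key works; access is controlled by network, not keys.
llm = ChatOpenAI(
    model="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
    base_url="https://llm.jetstream-cloud.org/v1",
    api_key="empty",
)

response = llm.invoke("Suggest three ways to analyze a time-series dataset.")
print(response.content)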
Just remember that an LLM will readily hallucinate (“make things up”) while performing all of these tasks. Think of it as a confident, well-read intern with a complete lack of epistemic awareness. If you open a support ticket saying that it told you the ball is still in the cup, or that there are only two Rs in the word strawberry, we won’t be able to help.
How to access it?¶
Connections to the inference service are limited to Jetstream2 or IU Research Cloud networks and instances. This is how we are limiting access to authorized users. If you try to connect from anywhere else, the server will return an HTTP 401 (unauthorized) response.
(If you think you’re receiving the unauthorized message in error, please create a support ticket specifying the public IP address that you’re connecting from, i.e. the output of curl ifconfig.co.)
Accessing From a Jetstream2 instance¶
Connecting to the API: use curl (or any other HTTP client) to connect to https://llm.jetstream-cloud.org/v1/. An example query:
curl https://llm.jetstream-cloud.org/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
"prompt": "What is the difference between SSH and SSL",
"max_tokens": 64,
"temperature": 0.7
}'
Connecting to the chat UI: open a web desktop session on your instance. Then, inside the web desktop, open a web browser (like Firefox) and browse to llm.jetstream-cloud.org.
Accessing from your own computer¶
You can make connections to the inference service from a computer that is not a Jetstream2 or IU Research Cloud instance, but you must tunnel the connection through an existing Jetstream2 or IU Research Cloud instance that you have access to.
There are several ways to do this; here are two examples. The sshuttle method is simpler but requires installing software (sshuttle) on the client computer. The port forwarding method requires root access on the client computer, but requires no additional client-side software.
Tunneling via sshuttle¶
First, install sshuttle if you haven’t already (sudo apt install sshuttle on Ubuntu; brew install sshuttle or sudo port install sshuttle on macOS).
Then, run this command:
sshuttle -r exouser@your-instance-floating-ip-here 149.165.157.253/32
This directs sshuttle to connect to your instance and forward all connections to 149.165.157.253 (the inference server) through the instance.
Now you can connect to the API at https://llm.jetstream-cloud.org/v1, or open your browser to https://llm.jetstream-cloud.org. Note that you must leave the sshuttle connection open while you’re using the inference service.
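Once the tunnel is up, you can sanity-check it from the same machine. The sketch below uses the OpenAI Python client (as in the Python example later on this page) to list the models the server advertises; it assumes the server exposes the standard /v1/models endpoint, which vLLM’s OpenAI-compatible server normally does.
from openai import OpenAI

# Minimal sketch: verify the sshuttle tunnel by listing the served models.
# Requires `pip install openai`; any non-empty API key works.
client = OpenAI(base_url="https://llm.jetstream-cloud.org/v1", api_key="empty")

for model in client.models.list():
    print(model.id)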
Tunneling via SSH Port Forwarding¶
First, add this to your local computer’s /etc/hosts file:
127.0.0.1 llm.jetstream-cloud.org
This directs your computer to resolve network connections to llm.jetstream-cloud.org to itself (the loopback address). Note that you usually need to become the root user (i.e. use sudo) in order to modify your computer’s /etc/hosts file.
Next, create an SSH connection with TCP port forwarding:
ssh -L 1234:149.165.157.253:443 exouser@your-instance-floating-ip-here
In this example, we’re forwarding local TCP port 1234 (on your computer) through the SSH server (i.e. your instance) to the destination 149.165.157.253:443 (i.e. the inference server). You do not need to use the shell inside this SSH session, but you must leave the connection open while you’re using the inference service. (If the connection closes or breaks, e.g. because you close your laptop and go somewhere else, you must re-start it in order to continue using the service.)
Now you can connect to the API at https://llm.jetstream-cloud.org:1234/v1, or open your browser to https://llm.jetstream-cloud.org:1234.
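With the port forward in place, the Python client shown later on this page also works if you point it at the forwarded port. This is a minimal sketch; it assumes the /etc/hosts entry above is active (so the TLS certificate name still matches) and that the SSH session stays open.
from openai import OpenAI

# Minimal sketch: reach the service through the forwarded local port (1234 in the ssh example above).
# Requires `pip install openai` and the /etc/hosts entry described earlier.
client = OpenAI(base_url="https://llm.jetstream-cloud.org:1234/v1", api_key="empty")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)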
How to use the API¶
We use vLLM to expose an OpenAI-compatible API. Generally, it works as a drop-in replacement for applications that integrate with the OpenAI API. The basics of OpenAI’s API reference documentation apply.
Note that you do not need to specify an API key. At this early stage, we are controlling access by restricting the networks that clients can request from, not by issuing API keys. If your application insists that you provide an API key, any non-empty string should work.
curl example¶
curl https://llm.jetstream-cloud.org/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
"prompt": "What is the difference between SSH and SSL",
"max_tokens": 64,
"temperature": 0.7
}'
{"id":"cmpl-5acaad6cdd144b6b9369d06ee10096da","object":"text_completion","created":1732310616,"model":"neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic","choices":[{"index":0,"text":"/TLS?\nSSH (Secure Shell) and SSL/TLS (Secure Sockets Layer/Transport Layer Security) are both cryptographic protocols used to provide secure communication over a network. However, they serve different purposes and are used in different contexts. Here are the main differences between SSH and SSL/TLS:\n1. **Purpose","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":9,"total_tokens":73,
Python example¶
Run pip install openai, then create a Python script with these contents, and run it.
from openai import OpenAI

# Any non-empty API key works; access is controlled by network restrictions, not keys.
client = OpenAI(base_url="https://llm.jetstream-cloud.org/v1", api_key="empty")

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "What is the difference between SSH and SSL",
        }
    ],
    model="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
)
print(chat_completion.choices[0].message.content)
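If you would rather see tokens as they are generated instead of waiting for the whole response, the same client supports the standard OpenAI streaming interface, which vLLM generally implements. A minimal sketch:
from openai import OpenAI

client = OpenAI(base_url="https://llm.jetstream-cloud.org/v1", api_key="empty")

# Stream the response and print tokens as they arrive.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is the difference between SSH and SSL"}],
    model="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()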
Using API with Your IDE¶
Using with VSCode or VSCodium¶
Install the Continue extension. In the extension’s config.json, set the models like so:
"models": [
{
"provider": "openai",
"title": "Jetstream2 Inference Service",
"apiBase": "https://llm.jetstream-cloud.org/v1/",
"model": "neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
"useLegacyCompletionsEndpoint": true
}
],
The chat pane should now work.
How to use Open WebUI¶
The first time you access the UI, you will need to sign up for an account. Please sign up with the same email address that’s associated with your personal ACCESS ID. (We may delete any accounts not associated with an ACCESS ID.) Use any unique password; it is separate from your ACCESS account.
Once you’re signed in, there are several ways to interact.
- You can chat with it via text.
- You can provide audio input (which it will transcribe to text), or start a “call” where you speak your prompt and it will speak a response.
- You can upload a file and ask questions about its contents.
- You can set up Retrieval-Augmented Generation with your own source documents.
Consult the Open WebUI documentation for more detail.
What hardware is behind this service?¶
The service runs on an NVIDIA Grace Hopper (GH200) server with an NVIDIA H100 GPU (96 GB of VRAM). Based on our own testing, users can expect inference at 35 tokens per second. (Prompt evaluation is much faster.)
The service supports up to 4 simultaneous requests; additional requests are queued and processed as slots open up. If you notice a long delay before receiving a response, the server is likely operating at capacity.
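Because a queued request can sit for a while before it starts generating, it may help to give your client a generous timeout. A minimal sketch using the OpenAI Python client’s timeout and retry options (the specific values are illustrative assumptions, not service recommendations):
from openai import OpenAI

# Allow extra time for queued requests; values are illustrative assumptions.
client = OpenAI(
    base_url="https://llm.jetstream-cloud.org/v1",
    api_key="empty",
    timeout=300.0,   # seconds to wait for a response
    max_retries=2,   # retry transient connection errors
)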
Terms of use¶
Use of this service is subject to Jetstream2 and IU’s acceptable use policies. If what you’re doing is not for a research, education, or learning purpose, please take it somewhere else. Systems administrators are able to view all user interactions.
The chat history in Open WebUI is not backed up and could be lost at any time. So, if you want to keep anything important from your chat sessions, you should copy it somewhere else.