Large Language Model Inference Service¶
We host a large language model (LLM) inference service for the Jetstream2 and IU Research Cloud communities.
This is currently in an early prototype stage, and we welcome suggestions for future refinement, either as a support ticket to help@jetstream-cloud.org or in our community chat on Matrix.
What does the service provide?¶
First, vLLM serves an OpenAI-compatible inference API that you can integrate with your projects and applications.
Second, Open WebUI provides a browser-based chat interface, similar to ChatGPT.
Which model is it?¶
Starting 2024 December 9, we offer Llama-3.3-70B-Instruct at 8-bit quantization. We chose it because:
- It benchmarks better than our previous model.
- It approaches or exceeds the capability of leading proprietary models like GPT-4o.
- It fits on the hardware that we have available while leaving room for a relatively large (48k token) context window.
As an instruct-tuned model, it is best-suited for following instructions and conversation-style prompts. It may not work well for completion or fill-in-middle tasks.
The models that we offer are subject to change as the state of the art improves rapidly.
- Prior to 2024 December 9, we offered Llama-3.1-Nemotron-70B-Instruct.
Why is this valuable / worth using?¶
It’s an unlimited-use API that we provide at no cost to our communities. (APIs from OpenAI, Anthropic, and similar providers all cost money to use.) The inference service provides a larger, more capable LLM than would fit on a g3.xl-size Jetstream2 instance (and much larger than will run on most personal computers). Also, your prompt and response data does not leave IU systems, and nobody will use it for training or data mining purposes.
If you are a Jetstream2 user, this service does not cost any SUs to use. It is available as long as you have an active allocation (and an instance to connect from).
What can I do with it?¶
Sky’s the limit!
- Programming and debugging assistant.
- Access it from your preferred code editor using Continue.
- Use it with LangChain to develop LLM-powered applications (a minimal sketch appears at the end of this section).
- Literature review assistant; give it a scientific paper and ask for a summary.
- Brainstorming assistant to help develop hypotheses, experimental protocols, and approaches to data analysis.
- Writing and proofreading assistant.
- Tutor, surrogate thesis advisor.
Just remember that an LLM will readily hallucinate (“make things up”) while performing all of these tasks. Think of it as a confident, well-read intern with a complete lack of epistemic awareness. If you open a support ticket saying that it told you the ball is still in the cup, or that there are only two Rs in the word strawberry, we won’t be able to help.
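As one concrete illustration of the LangChain use case above, here is a minimal sketch using LangChain's OpenAI-compatible chat wrapper. It assumes you have installed the langchain-openai package and are connecting from a Jetstream2 instance; the base URL and model name are the same ones used in the API examples later on this page.
from langchain_openai import ChatOpenAI

# Point the OpenAI-compatible wrapper at the inference service.
llm = ChatOpenAI(
    base_url="https://llm.jetstream-cloud.org/v1",
    api_key="empty",  # any non-empty string; access is controlled by network, not keys
    model="Llama-3.3-70B-Instruct-FP8-dynamic",
)

# Ask for brainstorming help; .invoke() returns a message whose .content is the reply.
response = llm.invoke("Suggest three hypotheses for why coral bleaching varies with depth.")
print(response.content)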
How to access it?¶
Connections to the inference service are limited to Jetstream2 or IU Research Cloud networks and instances. This is how we are limiting access to authorized users. If you try to connect from anywhere else, the server will return an HTTP 401 (unauthorized) response.
(If you think you’re receiving the unauthorized message in error, please create a support ticket specifying the public IP address that you’re connecting from, i.e. the output of curl ifconfig.co.)
Accessing from a Jetstream2 instance¶
Connecting to the API: use curl or any other HTTP client to connect to https://llm.jetstream-cloud.org/v1/. An example query:
curl https://llm.jetstream-cloud.org/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Llama-3.3-70B-Instruct-FP8-dynamic",
"prompt": "What is the difference between SSH and SSL",
"max_tokens": 64,
"temperature": 0.7
}'
Connecting to the chat UI: open a web desktop session on your instance. Then, inside the web desktop, open a web browser (like Firefox) and browse to llm.jetstream-cloud.org.
Accessing from your own computer¶
You can make connections to the inference service from a computer that is not a Jetstream2 or IU Research Cloud instance, but you must tunnel the connection through an existing Jetstream2 or IU Research Cloud instance that you have access to.
There are several ways to do this; here are two examples. The sshuttle method is simpler but requires installing software (sshuttle) on the client computer. The port forwarding method requires root access on the client computer but no additional client-side software.
Tunneling via sshuttle¶
First, install sshuttle if you haven’t already (sudo apt install sshuttle on Ubuntu; brew install sshuttle or sudo port install sshuttle on macOS).
Then, run this command:
sshuttle -r exouser@your-instance-floating-ip-here 149.165.157.253/32
This directs sshuttle to connect to your instance, and forwards all connections to 149.165.157.253 (the inference server) through the instance.
Now you can connect to the API at https://llm.jetstream-cloud.org/v1, or open your browser to https://llm.jetstream-cloud.org. Note that you must leave the sshuttle connection open while you’re using the inference service.
Tunneling via SSH Port Forwarding¶
First, add this to your local computer’s /etc/hosts file:
127.0.0.1 llm.jetstream-cloud.org
This directs your computer to resolve network connections to llm.jetstream-cloud.org to itself (the loopback address). Note that you usually need to become the root user (i.e. use sudo) in order to modify your computer’s /etc/hosts file.
Next, create an SSH connection with TCP port forwarding:
ssh -L 1234:149.165.157.253:443 exouser@your-instance-floating-ip-here
In this example, we’re forwarding local TCP port 1234 (on your computer) through the SSH server (i.e. your instance) to the destination 149.165.157.253:443 (i.e. the inference server). You do not need to use the shell inside this SSH session, but you must leave the connection open while you’re using the inference service. (If the connection closes or breaks, e.g. because you close your laptop and go somewhere else, you must re-start it in order to continue using the service.)
Now you can connect to the API at https://llm.jetstream-cloud.org:1234/v1, or open your browser to https://llm.jetstream-cloud.org:1234.
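For example, here is a minimal sketch of reaching the tunneled API with the OpenAI Python client (assuming pip install openai, and that the /etc/hosts entry and SSH session above are in place). It is the same pattern as the Python example later on this page, just with port 1234 in the base URL.
from openai import OpenAI

# llm.jetstream-cloud.org resolves to 127.0.0.1 via /etc/hosts, and local
# port 1234 is forwarded through SSH to the inference server's port 443.
client = OpenAI(
    base_url="https://llm.jetstream-cloud.org:1234/v1",
    api_key="empty",  # any non-empty string works; no API keys are issued
)

chat_completion = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is the difference between SSH and SSL"}],
    model="Llama-3.3-70B-Instruct-FP8-dynamic",
)
print(chat_completion.choices[0].message.content)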
How to use the API¶
We use vLLM to expose an OpenAI-compatible API. Generally, it works as a drop-in replacement for applications that integrate with the OpenAI API. The basics of OpenAI’s API reference documentation apply.
Note that you do not need to specify an API key. At this early stage, we are controlling access by restricting the networks that clients can request from, not by issuing API keys. If your application insists that you provide an API key, any non-empty string should work.
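Because the model we serve is subject to change, it can be handy to ask the server which model names it currently offers. Here is a minimal sketch using the OpenAI Python client's standard model-listing call against the service's /v1/models endpoint (assuming pip install openai):
from openai import OpenAI

# Any non-empty string works as the API key; access is controlled by network.
client = OpenAI(base_url="https://llm.jetstream-cloud.org/v1", api_key="empty")

# Print the model names currently served, e.g. "Llama-3.3-70B-Instruct-FP8-dynamic".
for model in client.models.list():
    print(model.id)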
curl example¶
curl https://llm.jetstream-cloud.org/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Llama-3.3-70B-Instruct-FP8-dynamic",
"prompt": "What is the difference between SSH and SSL",
"max_tokens": 64,
"temperature": 0.7
}'
{"id":"cmpl-5acaad6cdd144b6b9369d06ee10096da","object":"text_completion","created":1732310616,"model":"Llama-3.3-70B-Instruct-FP8-dynamic","choices":[{"index":0,"text":"/TLS?\nSSH (Secure Shell) and SSL/TLS (Secure Sockets Layer/Transport Layer Security) are both cryptographic protocols used to provide secure communication over a network. However, they serve different purposes and are used in different contexts. Here are the main differences between SSH and SSL/TLS:\n1. **Purpose","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":9,"total_tokens":73,
Python example¶
Run pip install openai, then create a Python script with these contents and run it.
from openai import OpenAI
client = OpenAI(base_url="https://llm.jetstream-cloud.org/v1", api_key="empty")
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": "What is the difference between SSH and SSL",
}
],
model="Llama-3.3-70B-Instruct-FP8-dynamic",
)
print(chat_completion.choices[0].message.content)
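If you want output as it is generated rather than all at once, the same client can request a streamed response. This is a sketch assuming the standard OpenAI streaming interface (stream=True), which vLLM's OpenAI-compatible server also implements:
from openai import OpenAI

client = OpenAI(base_url="https://llm.jetstream-cloud.org/v1", api_key="empty")

# Request a streamed response and print tokens as they arrive.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is the difference between SSH and SSL"}],
    model="Llama-3.3-70B-Instruct-FP8-dynamic",
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()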
Command line example¶
You can also use the llm command-line tool to access the LLM from the command line. This is particularly convenient because you can integrate it with other shell commands.
First, install llm in your favorite Python virtual environment:
pip install llm
Then, find where the configuration files are located:
dirname "$(llm logs path)"
Add a file named extra-openai-models.yaml to the directory that was printed by the previous command, with the following content:
- model_id: llama3.370B
model_name: "Llama-3.3-70B-Instruct-FP8-dynamic"
api_base: "https://llm.jetstream-cloud.org/v1/"
And set it as the default model:
llm models default llama3.370B
Finally, you can use it directly from the shell (-s sets the system prompt):
curl https://docs.jetstream-cloud.org/general/inference-service/ | html2text | llm -s "make a 1 paragraph summary"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 69412 100 69412 0 0 105k 0 --:--:-- --:--:-- --:--:-- 105k
Here is a 1-paragraph summary of the Jetstream2 Large Language Model Inference Service documentation:
**Summary**: Jetstream2 offers a free, unlimited-use Large Language Model (LLM) Inference Service, powered by Llama 3.3, for its community. The service provides an OpenAI-compatible API and a browser-based chat interface (Open WebUI) for tasks like programming assistance, literature reviews, brainstorming, and writing aid. Access is restricted to Jetstream2 or IU Research Cloud networks and instances, but can be tunneled through from external computers. The service runs on an NVIDIA Grace Hopper server with an H100 GPU, supporting up to 4 simultaneous requests, and is subject to Jetstream2's acceptable use policies, primarily for research, education, or learning purposes.
or you can start a chat session on the command line (-c continues the conversation):
llm chat -c
Using the API with Your IDE¶
Using with VSCode or VSCodium¶
Install the Continue extension. In the extension’s config.json, set the models like so:
"models": [
{
"provider": "openai",
"title": "Jetstream2 Inference Service",
"apiBase": "https://llm.jetstream-cloud.org/v1/",
"model": "Llama-3.3-70B-Instruct-FP8-dynamic",
"useLegacyCompletionsEndpoint": true
}
],
The chat pane should now work.
How to use Open WebUI¶
The first time you access the UI, you will need to sign up for an account. Please sign up with the same email address that’s associated with your personal ACCESS ID. (We may delete any accounts not associated with an ACCESS ID.) Use any unique password; it is separate from your ACCESS account.
Once you’re signed in, there are several ways to interact.
- You can chat with it via text.
- You can provide audio input (which it will transcribe to text), or start a “call” where you speak your prompt and it will speak a response.
- You can upload a file and ask questions about its contents.
- You can set up Retrieval-Augmented Generation with your own source documents.
Consult the Open WebUI documentation for more detail.
What hardware is behind this service?¶
The service runs on an NVIDIA Grace Hopper (GH200) server with an NVIDIA H100 GPU (96 GB of VRAM). Based on our own testing, users can expect inference at up to 35 tokens per second. (Prompt evaluation is much faster.) The service supports hundreds of simultaneous requests.
Terms of use¶
Use of this service is subject to Jetstream2 and IU’s acceptable use policies. If what you’re doing is not for a research, education, or learning purpose, please take it somewhere else. Systems administrators are able to view all user interactions.
The chat history in Open WebUI is not backed up and could be lost at any time. So, if you want to keep anything important from your chat sessions, you should copy it somewhere else.