Large Language Model Inference Service¶
We host a large language model (LLM) inference service for the Jetstream2 and IU Research Cloud communities.
This service is evolving. It may occasionally go offline for a few minutes at a time, as we make updates and improvements. We welcome suggestions for future refinement, either as a support ticket to help@jetstream-cloud.org or in our community chat on Matrix.
What does the service provide?¶
Two things:
- A browser-based chat interface via Open WebUI, similar to ChatGPT.
- OpenAI-compatible inference APIs to integrate with your projects and applications.
Which models do you offer?¶
As of 2025 February 15, we offer three models:
- DeepSeek R1, a chain-of-thought reasoning model, at native (FP8) quantization.
- R1 is now the world’s most capable open-weights LLM, and nearly the most capable LLM overall.
- We serve the full 671-billion-parameter DeepSeek model, not one of the (much) smaller R1 distillations of Llama or Qwen.
- Llama-3.3-70B-Instruct, a general-purpose instruct-tuned model, at 8-bit quantization.
- Qwen2.5-VL-72B-Instruct, a vision-language model that accepts image uploads, at 8-bit quantization.
The models that we offer are subject to change as the state of the art improves rapidly.
- On 2025 February 15, we added Qwen2.5-VL-72B-Instruct.
- On 2025 January 29, we added DeepSeek R1.
- On 2024 December 9, we added Llama-3.3-70B-Instruct and discontinued Llama-3.1-Nemotron-70B-Instruct.
- Prior to 2024 December 9, we offered Llama-3.1-Nemotron-70B-Instruct.
Which model should I use?¶
Use DeepSeek R1 for your most complex, nuanced questions and tasks, when you don’t mind waiting a minute for the best answer. R1 does well at following detailed instructions, ideally placed in the first message of a given chat window. As a chain-of-thought reasoning model, DeepSeek R1 performs similarly to OpenAI o1, with the additional benefit of exposing the “thinking” behind its final answers.
Use Llama 3.3 for faster, general-purpose interactions where a “good-enough” answer will suffice. Llama 3.3 will also do better with extended back-and-forth conversation, where you ask for refinement or clarification several times. As an instruct-tuned model, Llama 3.3 is best-suited for following instructions and conversation-style prompts. It may not work well for completion or fill-in-middle tasks.
Use Qwen2.5-VL to work with images. In your prompt, you can upload an image and also provide questions or instructions. It can recognize objects and features in a photo, transcribe text, and much more.
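If you are calling Qwen2.5-VL through the API rather than the chat UI (see the API sections below), images go in the standard OpenAI vision message format. This is a minimal sketch, assuming a connection from a Jetstream2 instance or tunnel; the exact model name served at the Qwen endpoint may differ, so check https://llm.jetstream-cloud.org/qwen2.5-vl/v1/models for the current identifier.

```python
import base64
from openai import OpenAI

# Direct connection to the Qwen2.5-VL endpoint (no token needed on Jetstream2 networks).
client = OpenAI(base_url="https://llm.jetstream-cloud.org/qwen2.5-vl/v1", api_key="empty")

# Encode a local image as a base64 data URL.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen2.5-VL-72B-Instruct",  # assumed name; confirm via the /models endpoint
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What objects are in this photo?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```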
Why is this valuable / worth using?¶
It’s an unlimited-use API that we provide at no cost to our communities. (APIs from OpenAI, Anthropic, DeepSeek, and similar providers all cost money to use.) The inference service provides larger, more capable LLMs than would fit on a g3.xl-size Jetstream2 instance (and much larger than will run on most personal computers).
Your prompt and response data are encrypted in transit, and processed only on systems located at IU. Nobody will use it for AI training or data mining purposes.
This service does not cost any Jetstream2 SUs to use. It is available as long as you have an ACCESS account (sign up here if needed).
What can I do with it?¶
Sky’s the limit!
- Programming and debugging assistant.
- Access it from your preferred code editor using Continue.
- Use it with LangChain to develop LLM-powered applications (see the sketch after this list).
- Literature review assistant; give it a scientific paper and ask for a summary.
- Brainstorming assistant to help develop hypotheses, experimental protocols, and approaches to data analysis.
- Writing and proofreading assistant.
- Tutor, surrogate thesis advisor.
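As a sketch of the LangChain bullet above: LangChain’s OpenAI integration can point at this service by overriding the base URL. This assumes the langchain-openai package and a connection from a Jetstream2 instance or tunnel; it is one possible wiring, not the only way to use an OpenAI-compatible endpoint with LangChain.

```python
from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI chat wrapper at the Jetstream2 inference service.
# Any non-empty api_key works for direct vLLM connections.
llm = ChatOpenAI(
    base_url="https://llm.jetstream-cloud.org/vllm/v1",
    api_key="empty",
    model="Llama-3.3-70B-Instruct-FP8-Dynamic",
)

# Simple one-shot invocation; the same object plugs into chains, agents, etc.
reply = llm.invoke("Suggest three hypotheses about why coral reefs bleach.")
print(reply.content)
```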
Just remember that an LLM will readily hallucinate (“make things up”) while performing all of these tasks. Think of it as a confident, well-read intern with a complete lack of epistemic awareness. If you open a support ticket saying that it told you the ball is still in the cup, or that there are only two Rs in the word strawberry, we won’t be able to help.
Accessing and Using the Chat UI¶
Connections to the chat interface are no longer limited to devices on specific networks. You can connect from anywhere on the internet, including from your phone.
Browse to llm.jetstream-cloud.org. At the login page, click “Continue with ACCESS single sign-on”. Log in with the “ACCESS CI (XSEDE)” identity provider, the same way that you log into other Jetstream2 interfaces like Exosphere.
Once you’re signed in, there are several ways to interact.
- You can chat with it via text.
- You can provide audio input (which it will transcribe to text), or start a “call” where you speak your prompt and it will speak a response.
- You can upload a file and ask questions about its contents.
- You can set up Retrieval-Augmented Generation with your own source documents.
Consult the Open WebUI documentation for more detail.
Accessing the APIs¶
There are two different methods for accessing the APIs.
- First method: Using Open WebUI as a proxy to the inference back-ends.
- This allows you to make API calls from anywhere on the internet, but it requires you to use an authenticated API token.
- Open WebUI exposes a more limited API surface compared to vLLM or SGLang.
- To generate an API token from within the chat UI:
- First log in with your ACCESS account.
- Then click your user ID (lower-left corner), then Settings, then Account, then API keys, then create a new secret key.
- Copy out the resulting key.
- Treat this key like a password; do not share it.
- Second method: Direct connections to the vLLM and SGLang inference servers.
- This allows you to make API calls with no token at all, but to prevent abuse, access is limited to Jetstream2 or IU Research Cloud networks and instances.
- If you try to connect from anywhere else, the server will return an HTTP 401 (unauthorized) response.
- It is possible to tunnel connections from a different computer through a Jetstream2 instance; instructions follow.
- vLLM and SGLang expose a more featureful API surface than Open WebUI exposes.
API Endpoints¶
- If connecting via Open WebUI proxy (first method above), use https://llm.jetstream-cloud.org/api/.
- You must pass an Authorization header containing your API token (e.g. `-H "Authorization: bearer your-token-here"` if using curl).
- To access DeepSeek R1 directly (second method above), use https://llm.jetstream-cloud.org/sglang/v1/.
- To access Llama 3.3 directly (second method above), use https://llm.jetstream-cloud.org/vllm/v1/.
- To access Qwen2.5-VL directly (second method above), use https://llm.jetstream-cloud.org/qwen2.5-vl/v1/.
You can exchange these in the examples below to access different models.
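If you are unsure which model name a given endpoint expects, the direct endpoints (second method) serve the standard OpenAI models list. A minimal sketch, assuming a connection from a Jetstream2 instance or tunnel:

```python
from openai import OpenAI

# Query the vLLM endpoint's model list; swap the base_url for the SGLang or Qwen endpoints.
client = OpenAI(base_url="https://llm.jetstream-cloud.org/vllm/v1", api_key="empty")

for model in client.models.list():
    print(model.id)  # e.g. Llama-3.3-70B-Instruct-FP8-Dynamic
```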
Accessing APIs From a Jetstream2 instance¶
`curl` or otherwise connect to https://llm.jetstream-cloud.org/vllm/v1/. An example query directly to vLLM:
curl https://llm.jetstream-cloud.org/vllm/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Llama-3.3-70B-Instruct-FP8-Dynamic",
"prompt": "What is the difference between SSH and SSL",
"max_tokens": 64,
"temperature": 0.7
}'
Accessing APIs from your own computer (via Open WebUI proxy)¶
`curl` or otherwise connect to https://llm.jetstream-cloud.org/api/, passing your API token in an Authorization header. An example query:
curl https://llm.jetstream-cloud.org/api/chat/completions \
-H "Authorization: bearer your-token-here" \
-H 'Content-Type: application/json' \
-d '{
"model": "Llama-3.3-70B-Instruct-FP8-Dynamic",
"messages": [
{
"role": "user",
"content": "What is the difference between SSH and SSL"
}
],
"max_tokens": 64
}'
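The same request from Python, using the OpenAI client pointed at the Open WebUI proxy. This is a sketch under the assumption that the proxy accepts the standard client against its /api base path, as in the curl example above; replace your-token-here with your own API token.

```python
from openai import OpenAI

# The Open WebUI proxy requires a real API token (see "Accessing the APIs" above).
client = OpenAI(
    base_url="https://llm.jetstream-cloud.org/api",
    api_key="your-token-here",
)

chat_completion = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct-FP8-Dynamic",
    messages=[{"role": "user", "content": "What is the difference between SSH and SSL"}],
    max_tokens=64,
)
print(chat_completion.choices[0].message.content)
```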
Accessing APIs from your own computer (tunnel to access vLLM / SGLang directly)¶
You can make connections to vLLM and SGLang from a computer that is not a Jetstream2 or IU Research Cloud instance, but you must tunnel the connection through an existing Jetstream2 or IU Research Cloud instance that you have access to.
There are several ways to do this; here are two examples. The sshuttle method is simpler, but requires installing software (sshuttle) on the client computer. The port forwarding method requires root access on the client computer, but no additional client-side software.
Tunneling via sshuttle¶
First, install sshuttle if you haven’t already (`sudo apt install sshuttle` on Ubuntu, `brew install sshuttle` or `sudo port install sshuttle` on macOS).
Then, run this command:
sshuttle -r exouser@your-instance-floating-ip-here 149.165.157.253/32
This directs sshuttle to connect to your instance and forward all connections to 149.165.157.253 (the inference server) through the instance.
Now you can connect to the API at (e.g.) https://llm.jetstream-cloud.org/vllm/v1, or open your browser to https://llm.jetstream-cloud.org. Note that you must leave the sshuttle connection open while you’re using the inference service.
Tunneling via SSH Port Forwarding¶
First, add this to your local computer’s `/etc/hosts` file:
127.0.0.1 llm.jetstream-cloud.org
This directs your computer to resolve network connections to llm.jetstream-cloud.org to itself (the loopback address). Note that you usually need to become the root user (i.e. `sudo`) in order to modify your computer’s `/etc/hosts` file.
Next, create an SSH connection with TCP port forwarding:
ssh -L 1234:149.165.157.253:443 exouser@your-instance-floating-ip-here
In this example, we’re forwarding local TCP port 1234 (on your computer) through the SSH server (i.e. your instance) to the destination 149.165.157.253:443 (i.e. the inference server). You do not need to use the shell inside this SSH session, but you must leave the connection open while you’re using the inference service. (If the connection closes or breaks, e.g. because you close your laptop and go somewhere else, you must re-start it in order to continue using the service.)
Now you can connect to the API at (e.g.) https://llm.jetstream-cloud.org:1234/vllm/v1, or open your browser to https://llm.jetstream-cloud.org:1234.
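Once the /etc/hosts entry and the forwarded port are in place, API clients just need the port added to the base URL. A minimal sketch with the OpenAI Python client:

```python
from openai import OpenAI

# The hostname resolves to 127.0.0.1 via /etc/hosts, and local port 1234 is forwarded
# through your instance to the inference server, so TLS validation still succeeds.
client = OpenAI(base_url="https://llm.jetstream-cloud.org:1234/vllm/v1", api_key="empty")

response = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct-FP8-Dynamic",
    messages=[{"role": "user", "content": "What is the difference between SSH and SSL"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```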
Using the APIs¶
All of the API connection options (Open WebUI proxy or direct to vLLM/SGLang) expose OpenAI-compatible APIs, but there may be nuances specific to your chosen connection; see the Open WebUI, vLLM, or SGLang documentation for details.
If you are connecting directly to vLLM or SGLang without an API token, but your application insists that you provide one anyway, any non-empty string should work.
Python example (from Jetstream2 instance or tunnelled connection)¶
`pip install openai`, then create a Python script with these contents, and run it.
from openai import OpenAI
client = OpenAI(base_url="https://llm.jetstream-cloud.org/vllm/v1", api_key="empty")
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": "What is the difference between SSH and SSL",
}
],
model="Llama-3.3-70B-Instruct-FP8-Dynamic",
)
print(chat_completion.choices[0].message.content)
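If you want tokens to appear as they are generated (useful for long R1 responses), the OpenAI-compatible back-ends also support streaming. A small variation on the script above, assuming the same connection:

```python
from openai import OpenAI

client = OpenAI(base_url="https://llm.jetstream-cloud.org/vllm/v1", api_key="empty")

# stream=True yields chunks as the model generates them.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is the difference between SSH and SSL"}],
    model="Llama-3.3-70B-Instruct-FP8-Dynamic",
    stream=True,
)
for chunk in stream:
    if chunk.choices:  # guard against chunks with no choices
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```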
Command line example (from Jetstream2 instance or tunnelled connection)¶
You can also use the `llm` command-line tool to access the LLM from the command line. This is particularly convenient because you can integrate it with bash commands.
First, install `llm` in your favorite Python virtual environment:
pip install llm
Then, find where the configuration files are located:
dirname "$(llm logs path)"
Add a file named `extra-openai-models.yaml` to the directory that was printed by the previous command, with the following content:
- model_id: llama3.370B
model_name: "Llama-3.3-70B-Instruct-FP8-Dynamic"
api_base: "https://llm.jetstream-cloud.org/vllm/v1/"
Then set it as the default model:
llm models default llama3.370B
Finally, you can use it interactively (`-s` sets the system prompt):
curl https://docs.jetstream-cloud.org/general/inference-service/ | html2text | llm -s "make a 1 paragraph summary"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 69412 100 69412 0 0 105k 0 --:--:-- --:--:-- --:--:-- 105k
Here is a 1-paragraph summary of the Jetstream2 Large Language Model Inference Service documentation:
**Summary**: Jetstream2 offers a free, unlimited-use Large Language Model (LLM) Inference Service, powered by Llama 3.3, for its community. The service provides an OpenAI-compatible API and a browser-based chat interface (Open WebUI) for tasks like programming assistance, literature reviews, brainstorming, and writing aid. Access is restricted to Jetstream2 or IU Research Cloud networks and instances, but can be tunneled through from external computers. The service runs on an NVIDIA Grace Hopper server with an H100 GPU, supporting up to 4 simultaneous requests, and is subject to Jetstream2's acceptable use policies, primarily for research, education, or learning purposes.
or you can start a chat session on the command line (`-c` continues the conversation):
llm chat -c
Using API with Your IDE¶
Using with VSCode or VSCodium (from Jetstream2 instance or tunnelled connection)¶
Install the Continue extension. In the extension’s `config.json`, set the `models` like so:
"models": [
{
"provider": "openai",
"title": "Jetstream2 Inference Service",
"apiBase": "https://llm.jetstream-cloud.org/vllm/v1/",
"model": "Llama-3.3-70B-Instruct-FP8-Dynamic",
"useLegacyCompletionsEndpoint": true
}
],
The chat pane should now work.
Using with JupyterLab via JupyterAI (from Jetstream2 instance or tunnelled connection)¶
Install the `jupyter-ai` package, version `2.29.1` or higher, and `langchain-openai`.
In JupyterLab, open the JupyterAI settings, and configure:
- Completion model = `OpenRouter :: *`
- API Base url = `https://llm.jetstream-cloud.org/sglang/v1/` for DeepSeek or `https://llm.jetstream-cloud.org/vllm/v1/` for Llama.
- Local model ID = currently `Deepseek R1` or `Llama-3.3-70B-Instruct-FP8-Dynamic`; you can find the most recent available models by appending `models` to the API Base url and checking the output in your browser, for example https://llm.jetstream-cloud.org/vllm/v1/models for vLLM.
- `OPENROUTER_API_KEY` = `EMPTY`
Now you should be able to use the JupyterLab chat and the code assistant in the notebooks.
What hardware is behind this service?¶
The service now runs across two servers:
- vLLM serves Llama 3.3 from an NVIDIA Grace Hopper (GH200) server with an NVIDIA H100 GPU (96 GB of VRAM).
- Users can expect inference at up to 35 tokens per second.
- SGLang serves DeepSeek R1 from a server with 8 AMD MI300X GPUs (192 GB of VRAM each, 1536 GB total).
- Users can expect inference at up to 28 tokens per second. We expect to improve this as SGLang continues to implement optimizations for serving DeepSeek.
Each back-end supports hundreds of simultaneous requests, though per-request inference speed will decrease under heavy load.
Terms of use¶
Use of this service is subject to Jetstream2 and IU’s acceptable use policies. If what you’re doing is not for a research, education, or learning purpose, please take it somewhere else. Systems administrators are able to view all user interactions.
The chat history in Open WebUI is not backed up and could be lost at any time. So, if you want to keep anything important from your chat sessions, you should copy it somewhere else.