
Inference Service Overview

We host a large language model (LLM) inference service for the Jetstream2 and IU Research Cloud communities.

Under Construction

This service is evolving. It may occasionally go offline for a few minutes at a time, as we make updates and improvements. Get the most current status information in the inference-service channel of our community chat.

We provide the latest, most-capable open-weights LLMs via two interfaces:

  • A web-based chat interface (Open WebUI), pictured below.
  • An API for programmatic access, compatible with OpenAI-style clients.

[Screenshot: Open WebUI chat interface]

Which models do you offer?

As of February 15, 2025, we offer three models:

  • DeepSeek R1, a chain-of-thought reasoning model, at native (FP8) quantization.
    • R1 is now the world’s most capable open-weights LLM, and nearly the most capable LLM overall.
    • We serve the full 671-billion-parameter DeepSeek model, not one of the (much) smaller R1 distillations of Llama or Qwen.
  • Llama-3.3-70B-Instruct, a general-purpose instruct-tuned model, at 8-bit quantization.
  • Qwen2.5-VL-72B-Instruct, a vision-language model that accepts image uploads, at 8-bit quantization.

The models that we offer are subject to change as the state of the art improves rapidly.

Which model should I use?

Use DeepSeek R1 for your most complex, nuanced questions and tasks, when you don’t mind waiting a minute for the best answer. R1 does well at following detailed instructions, ideally given in the first message of a chat. As a chain-of-thought reasoning model, DeepSeek R1 performs similarly to OpenAI o1, with the additional benefit of exposing the “thinking” behind its final answers.
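The open-weights R1 release emits that reasoning wrapped in <think>...</think> tags, so API callers often want to split the reasoning from the final answer. Below is a minimal sketch assuming those delimiters appear verbatim in the response text (server configuration can change how reasoning is returned):

```python
# Minimal sketch: separate R1's chain-of-thought from its final answer.
# Assumes the response text contains literal <think>...</think> delimiters,
# as in the open-weights DeepSeek R1 release; actual deployments may strip
# or relocate the reasoning.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw R1 completion."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = text[match.end():].strip()
        return reasoning, answer
    return "", text.strip()  # no tags found; treat everything as the answer
```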

Use Llama 3.3 for faster, general-purpose interactions where a “good-enough” answer will suffice. Llama 3.3 will also do better with extended back-and-forth conversation, where you ask for refinement or clarification several times. As an instruct-tuned model, Llama 3.3 is best suited to following instructions and conversation-style prompts. It may not work well for completion or fill-in-the-middle tasks.

Use Qwen2.5-VL to work with images. In your prompt, you can upload an image and also provide questions or instructions. It can recognize objects and features in a photo, transcribe text, and much more.
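When calling Qwen2.5-VL through the API rather than the chat UI, an image can be sent inline as a base64 data URL in an OpenAI-style vision payload. This is a sketch only; the base URL, API key, and exact model ID are placeholders, not the service’s real connection details:

```python
# Hedged sketch: send an image plus a question to a vision-language model
# through an OpenAI-compatible chat endpoint. The endpoint, key, and model
# ID are placeholders.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.edu/api/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                     # placeholder credential
)

# Encode a local photo as a base64 data URL, the inline-image format
# accepted by OpenAI-style vision APIs.
with open("field_sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen2.5-VL-72B-Instruct",  # placeholder; match the served model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "List the objects in this photo and transcribe any visible text."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```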

Why is this valuable / worth using?

It’s an unlimited-use API that we provide at no cost to our communities. (APIs from OpenAI, Anthropic, DeepSeek, and similar providers all cost money to use.) The inference service provides larger, more capable LLMs than would fit on a g3.xl-size Jetstream2 instance (and much larger than will run on most personal computers).
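To give a flavor of programmatic access, here is a minimal sketch using the openai Python client against an OpenAI-compatible endpoint. The base URL, API key handling, and model ID are illustrative assumptions; ask in the community chat for the actual connection details:

```python
# Minimal sketch of a chat completion request. The endpoint, key, and
# model ID below are placeholders, not the service's real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.edu/api/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                     # placeholder credential
)

response = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",  # placeholder; match the served model ID
    messages=[
        {"role": "user",
         "content": "Explain the difference between FP8 and 8-bit integer quantization."},
    ],
)
print(response.choices[0].message.content)
```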

It is private compared to services operated by OpenAI, and sovereign compared to services operated by DeepSeek.

This service does not cost any Jetstream2 SUs to use. It is available as long as you have an ACCESS account (sign up for one if needed).

What can I do with it?

Sky’s the limit!

  • Programming and debugging assistant.
    • Access it from your preferred code editor using Continue.
  • Use it with LangChain to develop LLM-powered applications (see the sketch after this list).
  • Literature review assistant; give it a scientific paper and ask for a summary.
  • Brainstorming assistant to help develop hypotheses, experimental protocols, and approaches to data analysis.
  • Writing and proofreading assistant.
  • Tutor, surrogate thesis advisor.
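
As promised above, here is a minimal LangChain sketch pointed at an OpenAI-compatible endpoint. The base URL, API key, and model ID are placeholders, not the service’s real values:

```python
# Hedged sketch: a small LangChain pipeline over an OpenAI-compatible
# endpoint. The endpoint, key, and model ID are placeholders.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(
    model="Llama-3.3-70B-Instruct",             # placeholder model ID
    base_url="https://llm.example.edu/api/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                     # placeholder credential
)

# Prompt template piped into the model using LangChain's composition operator.
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise research assistant."),
    ("user", "{question}"),
])
chain = prompt | llm

print(chain.invoke({"question": "Suggest three ways to validate a noisy sensor dataset."}).content)
```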

Hallucinations

An LLM will readily hallucinate (“make things up”) while performing all of these tasks. Think of it as a confident, well-read intern with a complete lack of epistemic awareness. If you open a support ticket saying that it told you the ball is still in the cup, or that there are only two Rs in the word strawberry, we won’t be able to help.

Join the Community

Join the inference-service channel in our community chat to connect with the Jetstream2 team and others using the inference service. This is a good place to ask for help and share feedback, and discussion also extends to the world of LLMs in general.

This channel also provides the most up-to-date information on the status of the inference service, including automated alerts when something goes offline.

(More about Jetstream2 community chat)

What hardware is behind this service?

The service now runs across two servers:

  • vLLM serves Llama 3.3 from an NVIDIA Grace Hopper (GH200) server with an NVIDIA H100 GPU (96 GB of VRAM).
    • Users can expect inference at up to 35 tokens per second.
  • SGLang serves DeepSeek R1 from a server with 8 AMD MI300X GPUs (192 GB of VRAM each, 1536 GB total).
    • Users can expect inference at up to 36 tokens per second. We expect to improve this as SGLang continues to implement optimizations for serving DeepSeek.

Each back-end supports hundreds of simultaneous requests, though per-request inference speed will decrease under heavy load.

Data Handling and Sovereignty

When using our inference service, data processing occurs exclusively on systems located within the IU Bloomington Data Center. All prompt and response data are encrypted in transit. We do not use your prompt or response data for any AI training or data mining. (We may collect metadata about your usage for service improvement purposes, reporting it only in aggregate.)

DeepSeek Privacy

We operate our own local copy of the DeepSeek R1 model, ensuring data sovereignty within US-based research infrastructure. We do not send your prompt or response data to any service operated by Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. (or to any parties other than IU).

The only exception to local processing occurs if you enable the web search feature in the chat UI, which will use DuckDuckGo to perform web searches on your behalf, using the information you provide in the prompt.

Terms of use

This service is governed by Jetstream2 and IU’s acceptable use policies. We do encourage playful exploration for learning purposes, but if your use is not in support of research or education, please take it elsewhere. Systems administrators are able to view user interactions to ensure compliance with these policies.

Warning

The chat history in Open WebUI is not backed up and could be lost at any time. So, if you want to keep anything important from your chat sessions, you should copy it somewhere else.