
Inference Service Overview

We host a large language model (LLM) inference service for the Jetstream2 and IU Research Cloud communities.

Under Construction

This service is evolving. It may occasionally go offline for a few minutes at a time, as we make updates and improvements. Get the most current status information in the inference-service channel of our community chat.

We provide the latest, most-capable open-weights LLMs via two interfaces:

  • A web-based chat interface (Open WebUI), pictured below.
  • An API for programmatic access, compatible with OpenAI-style clients.

[Screenshot: Open WebUI chat interface]

Which models do you offer?

As of February 15, 2025, we offer three models:

  • DeepSeek R1, a chain-of-thought reasoning model, at native (FP8) quantization.
    • R1 is now the world’s most capable open-weights LLM, and nearly the most capable LLM overall.
    • We serve the full 671-billion-parameter DeepSeek model, not one of the (much) smaller R1 distillations of Llama or Qwen.
  • Llama-3.3-70B-Instruct, a general-purpose instruct-tuned model, at 8-bit quantization.
  • Qwen2.5-VL-72B-Instruct, a vision-language model that accepts image uploads, at 8-bit quantization.

The models that we offer are subject to change as the state of the art improves rapidly.

Which model should I use?

Use DeepSeek R1 for your most complex, nuanced questions and tasks, when you don’t mind waiting a minute for the best answer. R1 does well at following detailed instructions, ideally given in the first message of a chat. As a chain-of-thought reasoning model, DeepSeek R1 performs similarly to OpenAI o1, with the additional benefit of exposing the “thinking” behind its final answers.
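The open-weights R1 release emits that reasoning wrapped in <think>...</think> tags, so API callers often want to split the reasoning from the final answer. Below is a minimal sketch assuming those delimiters appear verbatim in the response text (server configuration can change how reasoning is returned):

```python
# Minimal sketch: separate R1's chain-of-thought from its final answer.
# Assumes the response text contains literal <think>...</think> delimiters,
# as in the open-weights DeepSeek R1 release; actual deployments may strip
# or relocate the reasoning.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw R1 completion."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = text[match.end():].strip()
        return reasoning, answer
    return "", text.strip()  # no tags found; treat everything as the answer
```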

Use Llama 3.3 for faster, general-purpose interactions where a “good-enough” answer will suffice. Llama 3.3 will also do better with extended back-and-forth conversation, where you ask for refinement or clarification several times. As an instruct-tuned model, Llama 3.3 is best suited to following instructions and conversation-style prompts. It may not work well for completion or fill-in-the-middle tasks.

Use Qwen2.5-VL to work with images. In your prompt, you can upload an image and also provide questions or instructions. It can recognize objects and features in a photo, transcribe text, and much more.
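When calling Qwen2.5-VL through the API rather than the chat UI, an image can be sent inline as a base64 data URL in an OpenAI-style vision payload. This is a sketch only; the base URL, API key, and exact model ID are placeholders, not the service’s real connection details:

```python
# Hedged sketch: send an image plus a question to a vision-language model
# through an OpenAI-compatible chat endpoint. The endpoint, key, and model
# ID are placeholders.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.edu/api/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                     # placeholder credential
)

# Encode a local photo as a base64 data URL, the inline-image format
# accepted by OpenAI-style vision APIs.
with open("field_sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen2.5-VL-72B-Instruct",  # placeholder; match the served model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "List the objects in this photo and transcribe any visible text."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```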

Why is this valuable / worth using?

It’s an unlimited-use API that we provide at no cost to our communities. (APIs from OpenAI, Anthropic, DeepSeek, and similar providers all cost money to use.) The inference service provides larger, more capable LLMs than would fit on a g3.xl-size Jetstream2 instance (and much larger than will run on most personal computers).
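To give a flavor of programmatic access, here is a minimal sketch using the openai Python client against an OpenAI-compatible endpoint. The base URL, API key handling, and model ID are illustrative assumptions; ask in the community chat for the actual connection details:

```python
# Minimal sketch of a chat completion request. The endpoint, key, and
# model ID below are placeholders, not the service's real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.edu/api/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                     # placeholder credential
)

response = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",  # placeholder; match the served model ID
    messages=[
        {"role": "user",
         "content": "Explain the difference between FP8 and 8-bit integer quantization."},
    ],
)
print(response.choices[0].message.content)
```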

It is private compared to services operated by OpenAI, and sovereign compared to services operated by DeepSeek.

This service does not cost any Jetstream2 SUs to use. It is available as long as you have an ACCESS account (sign up for one if needed).

What can I do with it?

Sky’s the limit!

  • Programming and debugging assistant.
    • Access it from your preferred code editor using Continue.
  • Use it with LangChain to develop LLM-powered applications (see the sketch after this list).
  • Literature review assistant; give it a scientific paper and ask for a summary.
  • Brainstorming assistant to help develop hypotheses, experimental protocols, and approaches to data analysis.
  • Writing and proofreading assistant.
  • Tutor, surrogate thesis advisor.
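
As promised above, here is a minimal LangChain sketch pointed at an OpenAI-compatible endpoint. The base URL, API key, and model ID are placeholders, not the service’s real values:

```python
# Hedged sketch: a small LangChain pipeline over an OpenAI-compatible
# endpoint. The endpoint, key, and model ID are placeholders.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(
    model="Llama-3.3-70B-Instruct",             # placeholder model ID
    base_url="https://llm.example.edu/api/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                     # placeholder credential
)

# Prompt template piped into the model using LangChain's composition operator.
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise research assistant."),
    ("user", "{question}"),
])
chain = prompt | llm

print(chain.invoke({"question": "Suggest three ways to validate a noisy sensor dataset."}).content)
```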

Hallucinations

An LLM will readily hallucinate (“make things up”) while performing all of these tasks. Think of it as a confident, well-read intern with a complete lack of epistemic awareness. If you open a support ticket saying that it told you the ball is still in the cup, or that there are only two Rs in the word strawberry, we won’t be able to help.

Join the Community

Join the inference-service channel in our community chat to connect with the Jetstream2 team and others using the inference service. This is a good place to ask for help and share feedback, and discussion also extends to the world of LLMs in general.

This channel also provides the most up-to-date information on the status of the inference service, including automated alerts when something goes offline.

(More about Jetstream2 community chat)

What hardware is behind this service?

The service now runs across two servers:

  • vLLM serves Llama 3.3 from an NVIDIA Grace Hopper (GH200) server with an NVIDIA H100 GPU (96 GB of VRAM).
    • Users can expect inference at up to 35 tokens per second.
  • SGLang serves DeepSeek R1 from a server with 8 AMD MI300X GPUs (192 GB of VRAM each, 1536 GB total).
    • Users can expect inference at up to 36 tokens per second. We expect to improve this as SGLang continues to implement optimizations for serving DeepSeek.

Each back-end supports hundreds of simultaneous requests, though per-request inference speed will decrease under heavy load.

Data Handling and Sovereignty

When using our inference service, data processing occurs exclusively on systems located within the IU Bloomington Data Center. All prompt and response data are encrypted in transit. We do not use your prompt or response data for any AI training or data mining. (We may collect metadata about your usage for service improvement purposes, reporting it only in aggregate.)

DeepSeek Privacy

We operate our own local copy of the DeepSeek R1 model, ensuring data sovereignty within US-based research infrastructure. We do not send your prompt or response data to any service operated by Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. (or to any parties other than IU).

The only exception to local processing occurs if you enable the web search feature in the chat UI, which will use DuckDuckGo to perform web searches on your behalf, using the information you provide in the prompt.

Terms of use

This service is governed by Jetstream2 and IU’s acceptable use policies. We do encourage playful exploration for learning purposes, but if your use is not in support of research or education, please take it elsewhere. Systems administrators are able to view user interactions to ensure compliance with these policies.

Warning

The chat history in Open WebUI is not backed up and could be lost at any time. So, if you want to keep anything important from your chat sessions, you should copy it somewhere else.