Inference Service API Usage Examples

Info

First read Access the APIs for guidance on how to connect to the APIs.

Python example (from Jetstream2 instance or tunnelled connection)

Install the OpenAI Python client with pip install openai, then create a Python script with the following contents and run it.

from openai import OpenAI

# The API key is not checked by the service, but the client requires a value.
client = OpenAI(base_url="https://llm.jetstream-cloud.org/vllm/v1", api_key="empty")

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "What is the difference between SSH and SSL",
        }
    ],
    model="Llama-3.3-70B-Instruct-FP8-Dynamic",
)

print(chat_completion.choices[0].message.content)
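
The same endpoint also supports streamed responses through the standard OpenAI client interface. Below is a minimal sketch, assuming the same base URL and model name as above; it prints tokens as they are generated instead of waiting for the full reply.

from openai import OpenAI

client = OpenAI(base_url="https://llm.jetstream-cloud.org/vllm/v1", api_key="empty")

# Ask for a streamed response and print each piece of content as it arrives.
stream = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Explain SSH key pairs in two sentences.",
        }
    ],
    model="Llama-3.3-70B-Instruct-FP8-Dynamic",
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()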

Command line example (from Jetstream2 instance or tunnelled connection)

You can also use the llm command-line tool to access the LLM from the command line. This is particularly convenient because you can combine it with other shell commands.

First, install llm in your favorite Python virtual environment:

pip install llm

Then, find where the configuration files are located:

dirname "$(llm logs path)"

Add a file named extra-openai-models.yaml to the directory that was printed by the previous command, with the following content:

- model_id: llama3.370B
  model_name: "Llama-3.3-70B-Instruct-FP8-Dynamic"
  api_base: "https://llm.jetstream-cloud.org/vllm/v1/"

Then set it as the default:

llm models default llama3.370B

Finally, you can use it, for example by piping in text to summarize (-s sets the system prompt):

curl https://docs.jetstream-cloud.org/general/inference-service/ | html2text | llm -s "make a 1 paragraph summary"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69412  100 69412    0     0   105k      0 --:--:-- --:--:-- --:--:--  105k
Here is a 1-paragraph summary of the Jetstream2 Large Language Model Inference Service documentation:

**Summary**: Jetstream2 offers a free, unlimited-use Large Language Model (LLM) Inference Service, powered by Llama 3.3, for its community. The service provides an OpenAI-compatible API and a browser-based chat interface (Open WebUI) for tasks like programming assistance, literature reviews, brainstorming, and writing aid. Access is restricted to Jetstream2 or IU Research Cloud networks and instances, but can be tunneled through from external computers. The service runs on an NVIDIA Grace Hopper server with an H100 GPU, supporting up to 4 simultaneous requests, and is subject to Jetstream2's acceptable use policies, primarily for research, education, or learning purposes.

or you can start a chat session on the command line (-c continues the conversation):

llm chat -c
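
For reference, the -s system prompt used above corresponds to a message with the "system" role in the OpenAI-compatible API. A minimal sketch of the equivalent call in Python, assuming the same endpoint and model as in the earlier example:

from openai import OpenAI

client = OpenAI(base_url="https://llm.jetstream-cloud.org/vllm/v1", api_key="empty")

# A "system" message plays the same role as llm's -s option.
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "make a 1 paragraph summary"},
        {"role": "user", "content": "<text to summarize>"},
    ],
    model="Llama-3.3-70B-Instruct-FP8-Dynamic",
)

print(chat_completion.choices[0].message.content)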

Using API with Your IDE

Using with Cline extension for VSCode or VSCodium (from Jetstream2 instance or tunnelled connection)

Cline is a semi-autonomous coding agent with two modes (“plan” and “act”). Cline can also use your terminal to test and debug its own code changes.

  • Install the “Cline” extension, if you haven’t already. Then, in the extension pane:
  • If you’re starting the extension for the first time, choose “Use your own API key”. Otherwise, go to the settings part of the Extension pane (gear icon).
  • Choose “OpenAI Compatible” API Provider
  • For Base URL, enter
    • https://llm.jetstream-cloud.org/sglang/v1 to use DeepSeek R1
      • (slower but higher-quality model, verbose thinking output in extension pane)
    • https://llm.jetstream-cloud.org/v1 to use Llama 3.3
      • (faster but lower-quality model, more concise output in extension pane)
  • For API key, enter anything (like a space character)
  • For Model ID, enter
    • DeepSeek-R1 if using DeepSeek R1
    • Llama-3.3-70B-Instruct if using Llama 3.3
  • Click “let’s go!” or “Done”

You should now be able to give Cline a task to do.
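
If you are unsure which Model ID each endpoint currently serves, you can query the models listing of the OpenAI-compatible API. The following is a minimal sketch using the openai Python client; the base URLs are the DeepSeek R1 and Llama endpoints used elsewhere on this page, and the API key value is arbitrary.

from openai import OpenAI

# Print the model IDs currently served on each OpenAI-compatible endpoint.
for base_url in (
    "https://llm.jetstream-cloud.org/sglang/v1",  # DeepSeek R1
    "https://llm.jetstream-cloud.org/vllm/v1",    # Llama 3.3
):
    client = OpenAI(base_url=base_url, api_key="empty")
    for model in client.models.list():
        print(base_url, "->", model.id)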

Using with Continue extension for VSCode or VSCodium (from Jetstream2 instance or tunnelled connection)

Continue is an in-editor AI assistant.

Install the Continue extension. In the extension’s config.json, set the models like so:

  "models": [
    {
      "provider": "openai",
      "title": "Jetstream2 Inference Service",
      "apiBase": "https://llm.jetstream-cloud.org/vllm/v1/",
      "model": "Llama-3.3-70B-Instruct-FP8-Dynamic",
      "useLegacyCompletionsEndpoint": true
    }
  ],

The chat pane should now work.
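
The useLegacyCompletionsEndpoint option tells Continue to call the older completions endpoint rather than the chat endpoint. If you want to check outside the editor that this endpoint responds, here is a minimal sketch with the openai Python client, assuming the same base URL and model as in the config above:

from openai import OpenAI

client = OpenAI(base_url="https://llm.jetstream-cloud.org/vllm/v1/", api_key="empty")

# Call the legacy (non-chat) completions endpoint with a code-style prompt.
completion = client.completions.create(
    model="Llama-3.3-70B-Instruct-FP8-Dynamic",
    prompt="def fibonacci(n):",
    max_tokens=64,
)

print(completion.choices[0].text)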

Using with JupyterLab via JupyterAI (from Jetstream2 instance or tunnelled connection)

Install the jupyter-ai package, version 2.29.1 or higher, and langchain-openai. In JupyterLab, open the JupyterAI settings, and configure:

  • Completion model = OpenRouter :: *
  • API Base url = https://llm.jetstream-cloud.org/sglang/v1/ for DeepSeek or https://llm.jetstream-cloud.org/vllm/v1/ for Llama.
  • Local model ID = currently DeepSeek R1 or Llama-3.3-70B-Instruct-FP8-Dynamic. You can find the currently available models by appending models to the API Base url and checking the output in your browser, for example https://llm.jetstream-cloud.org/vllm/v1/models for vLLM.
  • OPENROUTER_API_KEY = “EMPTY”

Now you should be able to use the JupyterLab chat and the code assistant in the notebooks.
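
Since JupyterAI reaches the service through langchain-openai, you can also verify connectivity from a notebook cell before (or after) configuring the settings. A minimal sketch, assuming the Llama endpoint and model ID listed above:

from langchain_openai import ChatOpenAI

# Point langchain-openai at the same endpoint JupyterAI will use.
chat = ChatOpenAI(
    base_url="https://llm.jetstream-cloud.org/vllm/v1/",
    api_key="EMPTY",
    model="Llama-3.3-70B-Instruct-FP8-Dynamic",
)

print(chat.invoke("Say hello in one short sentence.").content)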