Orientation to Running Large Language Models on Jetstream2¶
First, if you just need API or chat UI access to the latest, most-capable open-source LLMs, consider using Jetstream2’s centrally-hosted inference service. This can serve many inference use cases, and it costs nothing in terms of SUs from your allocation.
You only need to set up an LLM on your own instances in order to:
- Use specific models that the inference service doesn’t offer
- Change the model weights (training and fine-tuning)
- Make changes to the software that runs and serves the model (e.g. PyTorch, Transformers)
Epistemic warning
This page was written in early 2025, by a domain non-expert, to an audience of domain non-experts, about a set of extremely fast-moving technologies. It definitely contains at least one error (and we welcome corrections). In a few months, it will be at least mildly out-of-date.
Anthropomorphization
It is common to describe the behavior of LLMs using terms that we apply to people, like ‘creative’, ‘confused’, and ‘thinking’. The LLM does not ‘think’ or get ‘confused’ in the way that human minds do. It is a neural network generating output tokens that appear similar to what a thinking or confused person would write. The anthropomorphization is a convenient shorthand for qualitative properties of the output text. This page uses these words in single-quotes.
Training versus Inference¶
Training (which includes pre-training, reinforcement learning, and fine-tuning) is like baking the cake – it produces the LLM. Inference is eating the cake, using the LLM for something. When you talk to ChatGPT, or use AI auto-complete in your code editor, that is inference.
Training generally requires orders-of-magnitude more computational resources than inference. Pre-training (making a new model from scratch) is likely too compute-intensive to do on a typical Jetstream2 allocation (except for very small models and proofs-of-concept). Fine-tuning an existing model on Jetstream2 is more achievable, though outside the scope of this page. Running a model for inference is very achievable, and is the focus of this FAQ.
How do I choose an LLM?¶
Several factors inform the decision!
- Factors that depend on your intended task or workload:
- Media that the LLM supports (text, images, etc.)
- Type of training (base, instruct-tuned, etc.)
- Factors that determine the hardware required to run the LLM:
- Model size, or the number of parameters
- Quantization of model weights
- Size of the context window
- Factors that determine the software required to run the LLM:
- Parameter format
- Factors that influence the LLM’s generation speed:
- Model Granularity (dense or mixture-of-experts)
- Factors that affect your legal obligations
- License and availability of the model weights
Let’s review each of them in detail.
Input and Output Media¶
Consider the kind of data you want the LLM to work with.
- Text-only LLMs (e.g. most Llama variants) take text (or code, etc.) as input, and generate text as output.
- Multi-modal vision models (e.g. Qwen QVQ, Llama 3.2) can accept input of both images and text, but generally only output text.
- Image generation models (e.g. Janus-Pro) can accept images and text, and also output images.
For text-only workloads, vision and image generation models are generally less capable than text-only LLMs. If your work is a mix of image processing/generation and text processing, consider using multiple models specialized for each task.
Training type¶
If you want the LLM to directly complete a partial sequence of text (or code), or finish a story, use a base or foundation model. A base model is the most ‘raw’ form of an LLM. It has been trained using unsupervised learning to predict the next token (i.e. word) in a sequence. If your prompt ends in a half-finished sentence, the model will begin its output with its prediction of the rest of the sentence. If you give a base model some instructions, it might generate further instructions instead of following yours. If you try to converse with one, it might get ‘confused’ and generate several rounds of conversation, including what it expects you will say next. A base model may output something that offends you, because its training data includes (e.g.) millions of web pages written by humans who don’t necessarily share your values. Subjectively, these models tend to be the most ‘creative’, but also the least ‘domesticated’. (If you remember Bing Chat a.k.a. Sydney, that was likely close to a base model.)
If you want to have a conversation with the LLM, ask it questions, or give it instructions to follow, use an instruct-tuned model. There are several approaches to instruct-tuning a base model, but they often involve fine-tuning a base model with a human-curated set of example chat exchanges with a helpful assistant. The tuning process adjusts the model weights until it generates output similar to the examples. The result is that the model will chat with you as a helpful assistant! Many instruct-tuned models are also trained to avoid producing output that its maker deems offensive or harmful.
An instruct-tuned model is likely the best choice if you aren’t yet sure how you want to use the LLM.
If you want an LLM with the best ‘reasoning’ and problem-solving ability, use a chain-of-thought model. These models will ‘think’ to themselves, often for many paragraphs, before providing their final answer. The model will ‘try’ several approaches and ‘second-guess’ itself, using output phrases like “But wait!” and “Alternatively,”. This ‘cautious’ behavior improves the quality of the output, because the model can avoid getting ‘stuck’ on an incorrect or unhelpful strategy. An open-source chain-of-thought model will expose its ‘thinking’ in the output, so you can see its internal monologue. This is a useful prompt debugging tool (and fascinating to watch). If the model focuses too much on an unhelpful strategy, you can stop the generation and revise your original prompt. The cost of using a chain-of-thought model is much longer generation time (and more compute usage) for each request, compared to other models. A chain-of-thought model may ‘think’ for several minutes before returning its final answer, depending on the complexity of the prompt.
If you want in-line AI suggestions in your code editor, use a model with fill-in-middle capability. Fill-in-middle is a combined training and prompting technique that allows an LLM to predict the middle of a sequence instead of the end. Imagine you are writing code in your editor, and your keyboard cursor is in the middle of a file. When you ask your editor’s AI plugin (e.g. Continue) for a code suggestion, it will prompt the model with the code before the cursor, followed by whatever code is after the cursor, separated by special tokens. The model will output a prediction of what would go in the middle, i.e. what it would insert at the cursor position.
(Other model types can still generate code, but the style of prompting is different. For example, to use an instruct-tuned model as a coding assistant, you generally copy and paste sections of code between your editor and the LLM chat window. Fill-in-middle is simply a technique that enables GitHub Copilot-style suggestions as you type.)
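To make this concrete, here is a rough sketch of the prompt an editor plugin might assemble for a fill-in-middle request. The special tokens shown are the StarCoder family’s; other models (e.g. CodeLlama, Qwen2.5-Coder) use different token names, so treat this as an illustration rather than a recipe.

```python
# Sketch of how an editor plugin might build a fill-in-middle prompt.
# The special tokens below are the StarCoder family's; other model
# families use different token names, so check your model's documentation
# before reusing this verbatim.

code_before_cursor = "def mean(values):\n    total = sum(values)\n    "
code_after_cursor = "\n    return result\n"

fim_prompt = (
    "<fim_prefix>" + code_before_cursor +
    "<fim_suffix>" + code_after_cursor +
    "<fim_middle>"
)

# The model's completion is its prediction of the code at the cursor,
# e.g. something like: "result = total / len(values)"
print(fim_prompt)
```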
Model Size (Parameter Count)¶
Parameter count is the quantity of weights in the model (roughly, the strengths of the connections between ‘neurons’ in the neural network). Each weight is a number that is not very meaningful to a human on its own, but used in combination with the other weights in the network, it enables the model to predict text. Holding all other factors constant, more parameters can accommodate a more capable model, so LLMs tend to have many billions of parameters. The spectrum runs from tiny models with 1-8 billion parameters (intended to run on phones and personal computers), to 30-70 billion parameter models that you can run on a Jetstream2 instance, to 400-700+ billion parameter models that do not fit on Jetstream2 instances at this time (but which you can still access, running on more capable hardware, via the inference service!).
Generally, you will find that larger models can:
- ‘Recall’ more information about the world, with greater accuracy
- Generate correct solutions to more complex, harder problems
- Hallucinate (somewhat) less
- Provide better answers in (and translations to/from) non-English languages
- ‘Recognize’ more nuance in the prompt, and provide more-nuanced responses
That said, parameter count is just one of many factors of capability. Advances in architecture and training techniques (and larger, higher-quality sets of training data) also enable more capable models of a given size. For example, Llama 3.3 (70 billion parameters, released late 2024) performs much better on LLM benchmarks than the 2.5x larger GPT-3 (175 billion parameters, released in 2020).
Parameter count is one of several considerations that determine the hardware you need to run a given model. This is because all of the model parameters must fit in working memory of whatever computer you run it on.
If you run the LLM on one or more GPUs, it needs to fit in the combined GPU memory (a.k.a. VRAM or GPU RAM) – the amount of base system RAM is not relevant. If you’re running the LLM on a CPU, then it is stored in system RAM (but you can expect orders-of-magnitude slower inference than you would obtain with GPU hardware).
The parameter count interacts with the Quantization and Context Size (next two sections) to determine how much memory you need.
Quantization¶
LLMs are typically trained and released with each parameter represented as a 16-bit (or more recently, 8-bit) floating point number. These number formats can express a lot of precision, but with billions of parameters, the model will occupy a lot of memory.
We can convert a model’s parameters to less-precise numbers, such as 4-bit integers. These ‘shorter’ numbers occupy less space, both on disk (when you download the model) and in memory (when you run it). This form of data compression is called quantization, and the result is called a quantized model, typically specified with the resulting number format.
In other words, quantization makes a model smaller and less resource-intensive to run. It often means the difference between running a model successfully, and getting an out-of-memory error when you try to load it. Also, quantization will often increase a model’s inference speed on a given piece of hardware, because a CPU or GPU can often do faster arithmetic on simpler number formats.
The downside of quantization (of course there’s a downside!) is that it decreases a model’s capability. When the parameters are represented as less-precise numbers, there is less capacity to represent nuanced or complex information. The effect is somewhat like decreasing the model’s parameter count. As you quantize a model down to lower-precision weights, the benefits of larger models listed in the previous section apply in reverse. Quantizing a model from 16 to 8 bits is often a great tradeoff, because the difference in output quality tends to be small, perhaps imperceptible outside of benchmarks, and you halve the memory requirement. Below 8 bits, output quality suffers more. A 4-bit quantized model produces subjectively-worse output in side-by-side comparisons with the model’s original format. You can expect anything less than 4 bits to be severely worse.
There are several recent, more sophisticated quantization schemes, such as 5 bits with K-means clustering, and Activation-aware Weight Quantization (AWQ). These can expand the Pareto frontier of the tradeoff between memory-efficiency and capability, but the general principle (simpler numbers degrade the LLM) still holds.
With an understanding of parameter count and quantization, you can roughly estimate how much memory you’ll need to run a given model! Multiply the number of bits per parameter by the number of parameters, then divide by eight to convert bits to bytes.

(4 bits per parameter) × (70 billion parameters) ÷ (8 bits per byte) ≈ 35 gigabytes of memory

So, this model will (perhaps just barely) fit on a g3.xl Jetstream2 instance, which has 40 GB of GPU RAM.
To refine this estimate (a bit upward), there is one more thing we need to store in memory, covered in the next section.
Context Size¶
The context window is analogous to an LLM’s short-term working memory. Everything you provide in your prompt, and everything the LLM returns as output, is added into the model’s context window. When using a chat-style interface, the context window generally holds the entire conversation.
With a large context window, you can include a large text corpus (or a modest-size book, many-page PDF file, etc.) in your prompt, and the LLM can look at the entire thing at once when responding. More recent LLMs support a 128,000 token context window, or even much larger. Smaller, older LLMs tend to have small context windows, such as 2,048 tokens for GPT-3.
The context is stored in memory alongside the model parameters, primarily in the key-value cache (or simply KV cache) used by the attention mechanism. A larger context window requires more memory to fit: the KV cache grows in proportion to the context length (and the compute cost of attention grows even faster). For a 70 billion parameter model with a 128,000 token context size, the context window may occupy up to 40 GB of memory (an entire A100 GPU)!
If you are memory-constrained, and you do not need the model’s full context window, the software that hosts your LLM typically allows you to specify a smaller context size to fit in memory.
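For example, here is a minimal sketch of capping the context window in vLLM, one of several inference servers. The model ID is just an example (and this particular one is gated, so it requires Hugging Face access); other software has equivalent settings, such as llama.cpp’s `--ctx-size` flag or Ollama’s `num_ctx` option.

```python
# Minimal sketch: capping the context window in vLLM to save GPU memory.
# Assumes vLLM is installed and the model fits in VRAM at this setting.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model ID
    max_model_len=8192,  # reserve KV cache memory for 8K tokens instead of the model's full window
)

outputs = llm.generate(
    ["Summarize the context window tradeoff in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```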
Parameter format¶
The parameter format determines how an LLM’s weights are stored on disk and loaded into memory. This factor affects your choice of inference software, because each software option is compatible with different formats. Common LLM weight formats include:
- `.safetensors`, developed by Hugging Face as a more secure alternative to PyTorch checkpoints. This format is usable with vLLM and SGLang.
- GGUF, the native format for llama.cpp and its derivatives (like Ollama and Llamafile).
- PyTorch checkpoint (`.bin` or `.pth` format), usable with PyTorch. This format can include arbitrary code that your computer will execute, so use caution when consuming LLMs in this format from untrusted parties.
- AWQ, the format for models converted to Activation-aware Weight Quantization, usable with vLLM.
- GPTQ, a format for GPU-optimized quantization, usable with vLLM.
If you want to run a particular LLM, and you cannot find it in a weight format compatible with your desired inference software, there is often a way to convert it.
Model Granularity¶
There are two options here: dense, and sparse mixture-of-experts (MoE). With a dense model, all parameters are “active” at all times (for every token) during inference. A mixture-of-experts model is split into a set of several smaller sub-models, called “experts”. The LLM dynamically activates only a few of these experts per token (often just one or two), which means that at any given time, only a small fraction of the model’s total weights are active.
MoE is essentially a performance optimization: fewer active parameters per token require fewer compute cycles, so an MoE model generally runs faster than a dense model with the same total parameter count (running on the same hardware).
License and availability of weights¶
The general categories are:
- Models with a clean open-source license, like DeepSeek V3 or Qwen 2.5-Coder. Generally, you can download the model weights (typically from Hugging Face) without creating an account or agreeing to anything. You can do approximately anything you want with these models.
- Models with an “open-weights” but somewhat-restrictive license, like the Llama models from Meta. You can use these models on Jetstream2, but you may need to provide contact information or sign an agreement before the creator will make them available for you to download. The creator’s license terms limit certain uses of the model. (For example, Meta forbids licensees from using the generated output of Llama 3.3 to train or improve any other LLM.) You, not Jetstream2 staff, are responsible for your compliance with the license terms.
- Proprietary models like GPT-4, Claude, and Gemini. The companies who create these models do not make them available for download, so you’re unable to run them directly on Jetstream2. You can only consume them as a service hosted by the creator.
How do I compare and shop for LLMs?¶
Benchmarks¶
- Chatbot Arena, for comparative rankings based on user-rated outputs
- Private, uncontaminated benchmarks:
    - https://livebench.ai
    - Kagi LLM Benchmarking Project
- Benchmarks with public problem sets (warning: models include these in their training sets):
    - MMLU, etc.
How do I reason about system requirements?¶
The most important thing is that the model weights plus KV cache (for the context window) must fit in VRAM.
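Here is a rough back-of-envelope sketch of that check. The architecture numbers (layers, KV heads, head dimension) approximate a 70-billion-parameter model with grouped-query attention; they are illustrative assumptions, not exact values for any particular release.

```python
# Back-of-envelope VRAM estimate: model weights + KV cache must fit.
# The architecture numbers below are illustrative assumptions for a
# 70B-class model with grouped-query attention.

params = 70e9            # parameter count
bits_per_weight = 4      # 4-bit quantization
weights_gb = params * bits_per_weight / 8 / 1e9   # ~35 GB

# KV cache: 2 tensors (key + value) per layer, per token
layers = 80
kv_heads = 8             # grouped-query attention
head_dim = 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # 2 bytes per fp16 value
context_tokens = 8192
kv_cache_gb = kv_bytes_per_token * context_tokens / 1e9    # ~2.7 GB

print(f"Weights:  ~{weights_gb:.0f} GB")
print(f"KV cache: ~{kv_cache_gb:.1f} GB for {context_tokens} tokens")
print(f"Total:    ~{weights_gb + kv_cache_gb:.0f} GB of VRAM needed")
```

At a full 128,000-token context, the same arithmetic gives a KV cache of roughly 40 GB, consistent with the figure in the Context Size section above.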
What is a token?¶
A token is a fundamental unit of text that the LLM processes. It could be a whole short word (like “cat” or “boss”), part of a longer word (like the “non” in “non-expert”), or a punctuation mark. One token averages out to approximately 0.75 words. The LLM’s tokenizer converts input text into tokens for the model to process.
Several aspects of LLM capacity and performance (e.g. context window size and generation speed) are measured in tokens.
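If you want to see tokenization in action, the Hugging Face `transformers` library makes it easy to inspect. The GPT-2 tokenizer below is just a convenient, ungated example; every model family has its own tokenizer and will split the same text differently.

```python
# Quick way to see how text splits into tokens, using a GPT-2 tokenizer
# as an example. Other models' tokenizers will split text differently.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Jetstream2 instances can run large language models."
tokens = tokenizer.tokenize(text)

print(tokens)        # e.g. ['Jet', 'stream', '2', 'Ġinstances', ...]
print(len(tokens), "tokens for", len(text.split()), "words")
```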
For models that come in multiple parameter counts, how do I balance model size with quantization?¶
To fit the most-capable model in a given memory size, the rule of thumb is to prefer a larger, more-heavily-quantized model over a smaller, less-quantized model. But perhaps try both and compare the output!
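As a quick worked example of that rule of thumb, suppose you have roughly 40 GB of VRAM and are choosing between two hypothetical sizes within the same model family:

```python
# Quick comparison for ~40 GB of VRAM. The 70B and 34B sizes are
# hypothetical options within one model family, not specific releases.
def weights_gb(params_billions, bits_per_weight):
    """Approximate memory needed for the weights alone, in gigabytes."""
    return params_billions * bits_per_weight / 8

print(weights_gb(70, 4))  # 35.0 GB -- larger model, 4-bit quantization
print(weights_gb(34, 8))  # 34.0 GB -- smaller model, 8-bit quantization
# Both fit in 40 GB (before accounting for the KV cache); the rule of
# thumb favors the 70B model at 4 bits, but comparing actual output
# quality on your own task is the real test.
```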
How do I get a quantized model?¶
You can either download a model that someone else has already quantized for you, or quantize it yourself.
To find a pre-quantized model, browse to the model page on Hugging Face, and look for “Quantizations” in the model tree. (You likely also need to know which weight format will work with your chosen inference software.)
To quantize the model yourself, you generally use a Python script with the `transformers` package. A tutorial is beyond the scope of this page, but it’s generally pretty simple to do.
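As one illustration (not the only route), the `transformers` library can quantize a model to 4 bits on the fly at load time via its `bitsandbytes` integration. This sketch assumes a CUDA GPU, that the `bitsandbytes` package is installed, and an arbitrary example model ID; schemes like AWQ and GPTQ instead produce quantized files you can save and share.

```python
# Minimal sketch: loading a model with on-the-fly 4-bit quantization via
# transformers + bitsandbytes. Assumes a CUDA GPU and that bitsandbytes
# is installed; the model ID is just an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # place layers on the available GPU(s)
)

inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```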
Which model serving options should I use?¶
For consumption by a single user:

- llama.cpp and its derivatives (e.g. Ollama, Llamafile), which are simple to install and run on a single instance

For serving to multiple, potentially-untrusted users:

- vLLM or SGLang, which are designed for high-throughput serving and expose an HTTP API endpoint
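Whichever serving option you choose, several of them (vLLM, SGLang, Ollama, llama.cpp’s server) expose an OpenAI-compatible HTTP API, so client code can look roughly like the sketch below. The base URL and model name are assumptions matching a default vLLM server on the same instance; adjust both for your setup.

```python
# Sketch of querying a locally-hosted model through an OpenAI-compatible
# endpoint. The base_url and model name assume a vLLM server started with
# its defaults on this instance; Ollama, for example, listens on port 11434.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # vLLM's default port
    api_key="not-needed-for-local-use",    # local servers usually ignore this
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match what the server loaded
    messages=[{"role": "user", "content": "Hello! What model are you?"}],
)
print(response.choices[0].message.content)
```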
What are some good learning resources?¶
- https://simonwillison.net for brief, rapid updates on the state of the art
- 3blue1brown Deep Learning video lecture series to gain a mental model of how LLMs work.
- Everything I’ve learned so far about running LLMs