REFERENCE
LLM Glossary
201 terms covering AI fundamentals, model architecture, training, inference, evaluation, and more. Written for developers.
Fundamentals
Chain-of-Thought
A prompting technique that asks the model to show its reasoning step-by-step before answering.
Chat Template
A specific format for structuring messages in chat-based interactions with LLMs.
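The message structure behind a chat template is a list of role-tagged messages; a minimal sketch (the `<|role|>` markers below are illustrative, not any real model's format):

```python
# Chat messages as a list of role/content dicts, the common API shape.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is a token?"},
]

def render(messages):
    """Flatten a message list into a single prompt string using
    hypothetical role markers; real templates vary by model family."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(parts) + "\n<|assistant|>\n"

prompt = render(messages)
```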
Completion
The text output generated by an LLM in response to a prompt.
Context Window
The maximum amount of text (tokens) an LLM can process in a single request.
Embeddings
Numerical vector representations of text that capture semantic meaning.
Few-Shot Learning
Providing a few examples in the prompt to teach the model how to perform a task.
Fine-Tuning
Retraining a pretrained model on domain-specific data to improve performance on specific tasks.
Function Calling
A feature allowing LLMs to request function execution by returning structured outputs.
Grounding
Providing an LLM with factual reference documents to reduce hallucination and improve accuracy.
Hallucination
When an LLM generates plausible-sounding but factually incorrect or fabricated information.
Instruction Following
An LLM's ability to understand and follow explicit instructions in prompts.
JSON Mode
A model behavior setting that constrains outputs to valid JSON format.
Max Tokens
The maximum number of tokens the model will generate in a completion.
Multi-Turn Conversation
An extended dialogue between user and model with multiple back-and-forth exchanges.
Multimodal
An LLM that processes multiple types of input: text, images, audio, or video.
One-Shot Learning
Providing a single example in the prompt to teach the model how to perform a task.
Prompt
Text input provided to an LLM to generate a completion or response.
Retrieval Ranking
Ordering retrieved documents by relevance score in information retrieval systems.
Retrieval-Augmented Generation (RAG)
A technique that retrieves relevant documents and provides them to an LLM for grounded generation.
Semantic Search
Finding documents based on meaning rather than keyword matching.
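At its core, semantic search ranks documents by embedding similarity; a minimal sketch using cosine similarity over toy vectors (real embeddings come from an embedding model):

```python
import numpy as np

def cosine_search(query_vec, doc_vecs, top_k=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy 3-d "embeddings"; in practice these are hundreds of dimensions.
docs = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
idx, scores = cosine_search(np.array([1.0, 0.05, 0.0]), docs)
```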
Stop Sequence
A string that signals the model to stop generating when encountered.
Streaming
Receiving LLM outputs token-by-token as they're generated rather than waiting for completion.
Structured Output
A general feature for constraining models to return outputs matching a specified schema.
System Prompt
A special prompt that sets the context and behavior guidelines for the entire conversation.
Temperature
A sampling parameter controlling the randomness of model outputs: values near 0 make generation nearly deterministic, higher values make it more varied.
Token
A unit of text that an LLM processes. Typically represents a word, subword, or character sequence.
Tool Use
When an LLM uses external tools, APIs, or functions to accomplish tasks.
Top-K Sampling
A sampling method that selects from the K most likely next tokens.
Top-P (Nucleus Sampling)
A sampling method that selects from the smallest set of tokens with cumulative probability P.
Vector Database
A database optimized for storing and searching high-dimensional embedding vectors.
Vision-Language Model
A multimodal model that understands both images and text, enabling visual reasoning.
Zero-Shot Learning
Asking an LLM to perform a task without providing any examples.
Architecture
Attention Mechanism
A neural network component that weights the importance of different input tokens.
Byte-Pair Encoding
A tokenization algorithm that iteratively merges the most frequent byte/character pairs.
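One merge iteration of the algorithm can be sketched in a few lines (toy corpus, word frequencies as a dict):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Merge every occurrence of `pair` into a single new symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (as a tuple of symbols) -> frequency.
corpus = {("h", "u", "g"): 5, ("p", "u", "g"): 3}
pair = most_frequent_pair(corpus)   # ("u", "g") occurs 8 times
corpus = merge_pair(corpus, pair)
```

Real BPE repeats this merge loop until the vocabulary reaches a target size.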
Cross-Entropy Loss
A loss function measuring the difference between predicted and actual probability distributions.
Feed-Forward Network
A layer of dense transformations between attention layers in transformers.
Flash Attention
An optimized attention algorithm that reduces memory I/O and increases GPU utilization.
Grouped-Query Attention
An attention variant where multiple query heads share key and value heads, reducing memory.
Key-Value Cache
Storing pre-computed keys and values from previous tokens to speed up inference.
Layer Normalization
A normalization technique that stabilizes training by normalizing activations across features.
Long Context
LLMs with very large context windows, enabling processing of long documents or conversations.
Mixture of Experts (MoE)
An architecture where different experts handle different parts of the input conditionally.
Multi-Head Attention
Applying attention multiple times in parallel with different learned representations.
Perplexity
A metric measuring how well a model predicts text, calculated as the exponential of cross-entropy loss.
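The definition above translates directly to code, given the probability the model assigned to each observed token:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token has perplexity 4:
# it is as "surprised" as a uniform guess over 4 options.
```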
Positional Encoding
A technique for encoding position information into transformer embeddings.
Residual Connection
A shortcut path allowing gradients and features to bypass processing layers.
RoPE (Rotary Position Embeddings)
Abbreviation for Rotary Position Embedding.
Rotary Position Embedding (RoPE)
A positional encoding technique that rotates vectors proportionally to position.
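A minimal sketch of the rotation (split-half pairing convention; real implementations vectorize over whole sequences and heads, and some interleave the pairs instead):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by angles proportional to pos.
    Each pair rotates at a different frequency set by `base`."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # per-pair rotation speed
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

The useful property: rotations preserve vector norms, and the dot product between a rotated query and key depends only on their relative position.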
Self-Attention
Attention where tokens attend to other tokens in the same sequence.
SentencePiece
A language-independent tokenizer that operates on raw text without assuming word boundaries, commonly implementing BPE or unigram models.
Sliding Window Attention
An attention mechanism where tokens only attend to a fixed window of recent previous tokens.
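The attention pattern is just a causal mask intersected with a fixed lookback window; a sketch of the mask construction:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: position i may attend to positions j with
    i - window < j <= i (causal, plus a fixed lookback window)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(5, 3)
```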
Softmax
A function that converts raw scores into a probability distribution.
Sparse Mixture of Experts
A MoE variant where only a small number of experts activate per token.
Speculative Decoding
A technique where a smaller draft model proposes tokens that the larger model verifies in parallel, speeding up generation.
Tokenizer
An algorithm that converts text into tokens for model input.
Vocabulary
The set of all tokens a model can output, learned during training.
Training
Catastrophic Forgetting
When training on new data severely degrades performance on previously learned tasks.
Constitutional AI
A training approach using a set of principles to guide model behavior without extensive human feedback.
Continual Learning
Training models on sequences of tasks without forgetting previously learned knowledge.
Curriculum Learning
Training on progressively harder examples, starting with simple examples and advancing.
Data Augmentation
Techniques for creating variations of training data to improve model robustness and generalization.
Direct Preference Optimization (DPO)
A training method that directly optimizes for human preferences without training a separate reward model.
DPO
Abbreviation for Direct Preference Optimization.
Full Fine-Tuning
Training all model parameters, contrasted with parameter-efficient methods like LoRA.
Instruction Tuning
Fine-tuning on instruction-following examples to teach models to follow user directions.
Knowledge Distillation
Training a smaller student model to mimic a larger teacher model's behavior.
LoRA
Low-Rank Adaptation: a parameter-efficient fine-tuning method adding small trainable matrices.
Model Merging
Combining weights from multiple models to create new models with combined capabilities.
Parameter-Efficient Fine-Tuning
Methods that adapt models to new tasks by updating only a small fraction of total parameters.
Pretraining
The initial training phase where models learn language from vast unlabeled text corpora.
Proximal Policy Optimization (PPO)
A reinforcement learning algorithm used to optimize models based on reward signals.
QLoRA
A LoRA variant that quantizes the frozen base model (typically to 4-bit) to greatly reduce the memory cost of fine-tuning.
Reinforcement Learning from Human Feedback (RLHF)
A training method using human preferences to fine-tune models beyond supervised learning.
Reward Model
A model trained to predict human preference, used to guide policy optimization in RLHF.
RLHF
Abbreviation for Reinforcement Learning from Human Feedback.
Supervised Fine-Tuning (SFT)
Training a pretrained model on labeled examples to improve performance on specific tasks.
Synthetic Data
Training data generated by models or algorithms rather than manually created.
Inference
Batch Inference
Processing multiple inputs together to improve overall throughput efficiency.
Cold Start
The latency delay when a model is loaded into memory before it can serve requests.
Concurrent Requests
Multiple requests being processed simultaneously, enabled by batching and system design.
GGUF
A file format for quantized models, supporting multiple quantization levels and efficient inference.
GPU Memory
The VRAM available on GPUs, a key constraint for model loading and inference.
Inference
The process of running a trained model to generate outputs given inputs.
INT4 Quantization
Quantizing model weights to 4-bit integers, roughly 8x smaller than float32 with some quality trade-off.
INT8 Quantization
Quantizing model weights to 8-bit integers, roughly 4x smaller than float32 with minimal quality loss.
Latency
The time delay from input to first output (time-to-first-token) or complete output.
Model Serving
The infrastructure for deploying models in production, handling requests at scale.
ONNX
Open Neural Network Exchange: a standard format for representing trained models across frameworks.
Pipeline Parallelism
Distributing different layers across multiple GPUs and streaming micro-batches through them so all devices stay busy.
Quantization
Reducing model precision (e.g., from float32 to int8) to reduce memory and computation.
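A sketch of the simplest scheme, symmetric per-tensor int8 quantization (production schemes add per-channel scales, zero points, and calibration):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: int8 weights plus one scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)   # reconstruction error bounded by half a step
```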
Tensor Parallelism
Distributing model computations across multiple GPUs by splitting tensors.
Throughput
The number of tokens generated per unit time, measuring inference speed at scale.
Time to First Token (TTFT)
The latency from sending a request to receiving the first output token.
Tokens Per Second (TPS)
A throughput metric measuring how many tokens the model generates per second.
Triton Inference Server
NVIDIA's inference server supporting multiple frameworks and models with advanced scheduling.
TTFT
Abbreviation for Time to First Token.
vLLM
A high-throughput, memory-efficient LLM inference engine with optimized batching and caching.
Evaluation
ARC
AI2 Reasoning Challenge: multiple-choice science questions requiring knowledge and reasoning.
Arena Elo
A ranking system comparing models based on human preference judgments in pairwise comparisons.
Benchmark
A standardized test dataset used to compare model performance across different models.
BLEU Score
A metric measuring similarity between machine translation output and reference translations.
Calibration
How well a model's confidence scores match actual correctness probability.
Chatbot Arena
A crowdsourced platform where users compare models through pairwise contests.
Evals
OpenAI's open-source framework for designing and running custom evaluations on language models.
Exact Match
An evaluation metric where answers are marked correct only if they exactly match the reference.
F1 Score
A metric combining precision and recall, useful for evaluating QA and information extraction.
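A sketch of token-overlap F1 as used in SQuAD-style QA scoring (real scorers also normalize case, articles, and punctuation first):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```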
GPQA
Graduate-level Google-Proof Q&A: a benchmark of graduate-level science questions written by domain experts, designed to resist answering via web search.
GSM8K
Grade School Math 8K: a benchmark of 8,500 grade school math word problems.
HellaSwag
A commonsense-reasoning benchmark where models pick the most plausible continuation of everyday scenarios.
HumanEval
A benchmark evaluating code generation capability through functional correctness on programming tasks.
LM Eval Harness
A flexible framework for evaluating language models on diverse benchmarks using consistent methodology.
MMLU
Massive Multitask Language Understanding: a broad benchmark covering 57 academic subjects.
Model Evaluation
The systematic process of measuring model quality and capabilities using metrics and benchmarks.
Out-of-Distribution
Data or scenarios different from the training distribution, testing model generalization.
Pass@K
For code generation, the probability that at least one of K samples passes tests.
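The standard unbiased estimator for this quantity, given n total samples of which c pass:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations, c correct, passes."""
    if n - c < k:
        return 1.0   # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```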
ROUGE Score
A metric measuring recall of n-grams between model output and reference summaries.
SWE-Bench
Software Engineering Benchmark: evaluating models on real GitHub issues and pull requests.
TruthfulQA
A benchmark measuring whether models avoid reproducing common human misconceptions and imitative falsehoods.
Pricing & Cost
API Rate Limits
Restrictions on how many requests or tokens can be processed per time unit.
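The usual client-side response to rate limits is retry with exponential backoff and jitter; a generic sketch (the exception type to catch depends on your client library, so a broad `Exception` stands in here):

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=0.5):
    """Retry a zero-argument callable with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # in practice, catch your client's rate-limit error
            if attempt == max_retries - 1:
                raise
            # delay doubles each attempt, scaled by random jitter in [0.5, 1.0)
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)
```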
Batch API
An API for asynchronous batch processing of many requests at discounted rates.
Cached Tokens
Previously processed tokens stored and reused, charged at lower rates (prompt caching).
Context Caching
The capability to cache long context and reuse it across multiple requests.
Context Compression
Techniques for reducing context size while preserving necessary information.
Cost Per Million Tokens (CPM)
The pricing metric showing cost for processing one million tokens.
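Cost estimation from this metric is simple arithmetic; the rates below are hypothetical, not any provider's actual pricing:

```python
def request_cost(input_tokens, output_tokens, input_cpm, output_cpm):
    """Cost of one request given per-million-token prices for input and output."""
    return input_tokens / 1e6 * input_cpm + output_tokens / 1e6 * output_cpm

# Hypothetical pricing: $3 per 1M input tokens, $15 per 1M output tokens.
cost = request_cost(input_tokens=2000, output_tokens=500,
                    input_cpm=3.0, output_cpm=15.0)
# 0.006 + 0.0075 = $0.0135
```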
Input Tokens
Tokens in the prompt, billed separately (usually cheaper than output tokens).
LLM Cost Optimization
Strategies for reducing API costs through model selection, caching, and prompt engineering.
Model Routing
Automatically selecting the most cost-effective model based on task requirements.
Output Tokens
Tokens in the model's response, typically billed at higher rate than input tokens.
Pay-Per-Token
A billing model charging per token processed (input and output separately).
Prompt Caching
An API feature enabling reuse of previously processed prompt tokens at lower cost.
Requests Per Minute (RPM)
A rate limit measuring maximum requests allowed per minute.
Token Counting
Accurately determining how many tokens text will consume before submitting requests.
Tokens Per Minute (TPM)
A rate limit measuring maximum tokens processed per minute.
Deployment
Auto-Scaling
Automatically adjusting compute resources based on demand to maintain performance and efficiency.
Blue-Green Deployment
A deployment strategy maintaining two identical environments, switching traffic for zero-downtime updates.
Canary Deployment
Gradually rolling out a new version to a small percentage of users before full rollout.
Dedicated GPU
Provisioning physical GPU hardware exclusively for model inference, ensuring predictable performance.
Inference Endpoint
A deployed model accessible via API for making predictions in production.
Load Balancing
Distributing requests across multiple instances to ensure efficient resource usage and reliability.
Model Deployment
The process of putting a trained model into production for real-world use.
Model Versioning
Tracking different versions of models and managing their deployment, updates, and rollback.
Serverless Inference
Running inference without managing servers, using managed services that auto-scale.
Spot Instances
Discounted cloud compute instances that can be terminated, useful for cost-sensitive inference.
Prompting
Adversarial Prompting
Crafting prompts designed to expose model weaknesses or cause failures.
Assistant Prompt
Text providing hints or examples of desired model behavior within a conversation.
Chain-of-Thought Prompting
Requesting explicit step-by-step reasoning to improve accuracy on complex tasks.
Delimiters
Special characters or markers used to separate sections or mark boundaries in prompts.
Dynamic Prompting
Generating or adapting prompts at runtime based on input or task characteristics.
Jailbreaking
Attempting to bypass safety guidelines through prompting techniques.
Markdown Prompting
Using markdown formatting to structure prompts with clear headings and sections.
Meta-Prompting
Prompting the model to improve its own prompts or reasoning strategies.
Output Formatting
Instructing models to return output in specific formats (JSON, markdown, lists, etc.).
Persona
A defined character or professional identity assigned to the model to shape its responses.
Plan and Execute
First planning the approach, then executing step-by-step to solve complex tasks.
Prompt Compression Strategy
Techniques for reducing prompt size while preserving information needed for quality answers.
Prompt Engineering
The practice of crafting effective prompts to elicit desired behavior from LLMs.
Prompt Injection
A security vulnerability where attacker-controlled text overrides intended instructions.
ReAct Prompting
A technique where the model interleaves reasoning, acting (tool use), and observation.
Role-Playing
Instructing the model to adopt a specific role or persona to improve response quality.
Scratchpad
Asking the model to use an intermediate space to work through problems before final answers.
Self-Consistency
Sampling multiple reasoning paths and taking the majority vote for more reliable answers.
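Once final answers are extracted from each sampled reasoning path, the aggregation step is a plain majority vote:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers from independently sampled chains."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled chains-of-thought produced these final answers:
votes = ["42", "42", "17", "42", "17"]
best = self_consistency(votes)   # "42"
```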
Step-by-Step
Instructing the model to break down problems into sequential steps before answering.
Structured Prompting
Using consistent, templated prompt formats with clear sections for instructions and context.
Tree-of-Thought
A prompting technique where the model explores multiple reasoning branches before deciding.
User Prompt
Text input from the user within a conversation, as opposed to system context.
XML Tags Prompting
Using XML tags in prompts to structure information and guide model behavior.
Safety & Alignment
Alignment
Training models to behave in ways consistent with human values and intentions.
Bias in LLMs
Systematic prejudices in model outputs reflecting biases in training data or design.
Constitutional AI Principles
Explicit principles guiding model behavior (e.g., be helpful, harmless, honest).
Content Moderation
Evaluating and filtering model outputs or user inputs for inappropriate or harmful content.
Guardrails
Systems or rules that prevent LLMs from generating harmful, toxic, or inappropriate content.
Harmlessness
A key alignment goal: ensuring models don't cause harm through their outputs.
Helpfulness
An alignment goal: ensuring models provide useful, accurate, and appropriate responses.
Honesty
A core alignment principle: models should provide truthful information and acknowledge uncertainty.
Jailbreak
Successfully bypassing model safety mechanisms or constraints through prompting or other means.
PII Detection
Identifying and protecting personally identifiable information in text.
Prompt Injection Attack
A security attack where malicious input overrides model instructions, causing unintended behavior.
Red-Teaming
Adversarial testing to discover vulnerabilities, failure modes, and safety gaps in systems.
Refusal
When an LLM declines to respond to a request, typically for safety or ethical reasons.
Safety
Designing and training LLMs to avoid harmful outputs and handle edge cases responsibly.
Toxicity
Model outputs containing abusive, offensive, or hateful language.
Tools & Frameworks
A/B Testing LLMs
Comparing model outputs against baselines through controlled experiments to measure improvements.
AI Gateway
A gateway service providing unified access to multiple LLM providers with advanced features.
Anthropic API
Anthropic's REST API providing access to Claude models with a focus on safety and quality.
DSPy Framework
A framework for building LLM programs declaratively and improving them with automated optimizers that tune prompts and few-shot examples, rather than manual prompt engineering.
Haystack Framework
A framework for building retrieval-augmented generation (RAG) and search applications.
Helicone Observability
An observability platform for LLM applications providing monitoring, logging, and analytics.
Hugging Face Hub
A platform hosting open-source models, datasets, and spaces for LLM development.
LangChain Framework
A popular Python framework for building LLM applications with composable chains and agents.
Langfuse Observability
An observability platform providing tracing, logging, and analytics for LLM applications.
LiteLLM Gateway
A gateway for unified API access across multiple LLM providers with cost tracking.
LlamaIndex Framework
A data framework for indexing and querying documents with LLMs, enabling RAG applications.
LLM Proxy
A middleware service that intercepts and manages requests to LLM APIs.
LLM Router
A system that intelligently routes requests to different models based on criteria.
Model Registry
A system for cataloging, versioning, and managing deployed models.
Ollama
A tool for running large language models locally, simplifying local model deployment.
OpenAI API
OpenAI's REST API providing access to models like GPT-4 and GPT-3.5-turbo.
OpenRouter Platform
A platform providing unified API access to multiple LLM providers and open-source models.
Prompt Management
Tools and systems for organizing, versioning, and tracking prompts used in applications.
Semantic Kernel Framework
Microsoft's framework for orchestrating LLM plugins and functions, originating in .NET with Python and Java support.
Shadow Mode Testing
Running new models in parallel with production without affecting user experience to validate improvements.