REFERENCE
LLM Glossary
201 terms covering AI fundamentals, model architecture, training, inference, evaluation, and more. Written for developers.
Fundamentals
Chain-of-Thought
A prompting technique that asks the model to show its reasoning step-by-step before answering.
Chat Template
A specific format for structuring messages in chat-based interactions with LLMs.
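The message structure behind a chat template is a list of role-tagged messages; a minimal sketch (the `<|role|>` markers below are illustrative, not any real model's format):

```python
# Chat messages as a list of role/content dicts, the common API shape.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is a token?"},
]

def render(messages):
    """Flatten a message list into a single prompt string using
    hypothetical role markers; real templates vary by model family."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(parts) + "\n<|assistant|>\n"

prompt = render(messages)
```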
Completion
The text output generated by an LLM in response to a prompt.
Context Window
The maximum amount of text (tokens) an LLM can process in a single request.
Embeddings
Numerical vector representations of text that capture semantic meaning.
Few-Shot Learning
Providing a few examples in the prompt to teach the model how to perform a task.
Fine-Tuning
Retraining a pretrained model on domain-specific data to improve performance on specific tasks.
Function Calling
A feature allowing LLMs to request function execution by returning structured outputs.
Grounding
Providing an LLM with factual reference documents to reduce hallucination and improve accuracy.
Hallucination
When an LLM generates plausible-sounding but factually incorrect or fabricated information.
Instruction Following
An LLM's ability to understand and follow explicit instructions in prompts.
JSON Mode
A model behavior setting that constrains outputs to valid JSON format.
Max Tokens
The maximum number of tokens the model will generate in a completion.
Multi-Turn Conversation
An extended dialogue between user and model with multiple back-and-forth exchanges.
Multimodal
An LLM that processes multiple types of input: text, images, audio, or video.
One-Shot Learning
Providing a single example in the prompt to teach the model how to perform a task.
Prompt
Text input provided to an LLM to generate a completion or response.
Retrieval Ranking
Ordering retrieved documents by relevance score in information retrieval systems.
Retrieval-Augmented Generation (RAG)
A technique that retrieves relevant documents and provides them to an LLM for grounded generation.
Semantic Search
Finding documents based on meaning rather than keyword matching.
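At its core, semantic search ranks documents by embedding similarity; a minimal sketch using cosine similarity over toy vectors (real embeddings come from an embedding model):

```python
import numpy as np

def cosine_search(query_vec, doc_vecs, top_k=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy 3-d "embeddings"; in practice these are hundreds of dimensions.
docs = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
idx, scores = cosine_search(np.array([1.0, 0.05, 0.0]), docs)
```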
Stop Sequence
A string that signals the model to stop generating when encountered.
Streaming
Receiving LLM outputs token-by-token as they're generated rather than waiting for completion.
Structured Output
A general feature for constraining models to return outputs matching a specified schema.
System Prompt
A special prompt that sets the context and behavior guidelines for the entire conversation.
Temperature
A sampling parameter controlling the randomness of model outputs: values near 0 make generation nearly deterministic, higher values make it more varied.
Token
A unit of text that an LLM processes. Typically represents a word, subword, or character sequence.
Tool Use
When an LLM uses external tools, APIs, or functions to accomplish tasks.
Top-K Sampling
A sampling method that selects from the K most likely next tokens.
Top-P (Nucleus Sampling)
A sampling method that selects from the smallest set of tokens with cumulative probability P.
Vector Database
A database optimized for storing and searching high-dimensional embedding vectors.
Vision-Language Model
A multimodal model that understands both images and text, enabling visual reasoning.
Zero-Shot Learning
Asking an LLM to perform a task without providing any examples.
Architecture
Attention Mechanism
A neural network component that weights the importance of different input tokens.
Byte-Pair Encoding
A tokenization algorithm that iteratively merges the most frequent byte/character pairs.
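One merge iteration of the algorithm can be sketched in a few lines (toy corpus, word frequencies as a dict):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Merge every occurrence of `pair` into a single new symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (as a tuple of symbols) -> frequency.
corpus = {("h", "u", "g"): 5, ("p", "u", "g"): 3}
pair = most_frequent_pair(corpus)   # ("u", "g") occurs 8 times
corpus = merge_pair(corpus, pair)
```

Real BPE repeats this merge loop until the vocabulary reaches a target size.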
Cross-Entropy Loss
A loss function measuring the difference between predicted and actual probability distributions.
Feed-Forward Network
A layer of dense transformations between attention layers in transformers.
Flash Attention
An optimized attention algorithm that reduces memory I/O and increases GPU utilization.
Grouped-Query Attention
An attention variant where multiple query heads share key and value heads, reducing memory.
Key-Value Cache
Storing pre-computed keys and values from previous tokens to speed up inference.
Layer Normalization
A normalization technique that stabilizes training by normalizing activations across features.
Long Context
LLMs with very large context windows, enabling processing of long documents or conversations.
Mixture of Experts (MoE)
An architecture where different experts handle different parts of the input conditionally.
Multi-Head Attention
Applying attention multiple times in parallel with different learned representations.
Perplexity
A metric measuring how well a model predicts text, calculated as the exponential of cross-entropy loss.
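The definition above translates directly to code, given the probability the model assigned to each observed token:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token has perplexity 4:
# it is as "surprised" as a uniform guess over 4 options.
```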
Positional Encoding
A technique for encoding position information into transformer embeddings.
Residual Connection
A shortcut path allowing gradients and features to bypass processing layers.
RoPE (Rotary Position Embeddings)
Abbreviation for Rotary Position Embedding.
Rotary Position Embedding (RoPE)
A positional encoding technique that rotates vectors proportionally to position.
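A minimal sketch of the rotation (split-half pairing convention; real implementations vectorize over whole sequences and heads, and some interleave the pairs instead):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by angles proportional to pos.
    Each pair rotates at a different frequency set by `base`."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # per-pair rotation speed
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

The useful property: rotations preserve vector norms, and the dot product between a rotated query and key depends only on their relative position.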
Self-Attention
Attention where tokens attend to other tokens in the same sequence.
SentencePiece
A language-independent tokenizer that operates on raw text without assuming word boundaries, commonly implementing BPE or unigram models.
Sliding Window Attention
An attention mechanism where tokens only attend to a fixed window of recent previous tokens.
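The attention pattern is just a causal mask intersected with a fixed lookback window; a sketch of the mask construction:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: position i may attend to positions j with
    i - window < j <= i (causal, plus a fixed lookback window)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(5, 3)
```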
Softmax
A function that converts raw scores into a probability distribution.
Sparse Mixture of Experts
A MoE variant where only a small number of experts activate per token.
Speculative Decoding
A technique where a smaller draft model proposes tokens that the larger model verifies in parallel, speeding up generation.
Tokenizer
An algorithm that converts text into tokens for model input.
Vocabulary
The set of all tokens a model can output, learned during training.
Training
Catastrophic Forgetting
When training on new data severely degrades performance on previously learned tasks.
Constitutional AI
A training approach using a set of principles to guide model behavior without extensive human feedback.
Continual Learning
Training models on sequences of tasks without forgetting previously learned knowledge.
Curriculum Learning
Training on progressively harder examples, starting with simple examples and advancing.
Data Augmentation
Techniques for creating variations of training data to improve model robustness and generalization.
Direct Preference Optimization (DPO)
A training method that directly optimizes for human preferences without training a separate reward model.
DPO
Abbreviation for Direct Preference Optimization.
Full Fine-Tuning
Training all model parameters, contrasted with parameter-efficient methods like LoRA.
Instruction Tuning
Fine-tuning on instruction-following examples to teach models to follow user directions.
Knowledge Distillation
Training a smaller student model to mimic a larger teacher model's behavior.
LoRA
Low-Rank Adaptation: a parameter-efficient fine-tuning method adding small trainable matrices.
Model Merging
Combining weights from multiple models to create new models with combined capabilities.
Parameter-Efficient Fine-Tuning
Methods that adapt models to new tasks by updating only a small fraction of total parameters.
Pretraining
The initial training phase where models learn language from vast unlabeled text corpora.
Proximal Policy Optimization (PPO)
A reinforcement learning algorithm used to optimize models based on reward signals.
QLoRA
A LoRA variant that quantizes the frozen base model (typically to 4-bit) to greatly reduce the memory cost of fine-tuning.
Reinforcement Learning from Human Feedback (RLHF)
A training method using human preferences to fine-tune models beyond supervised learning.
Reward Model
A model trained to predict human preference, used to guide policy optimization in RLHF.
RLHF
Abbreviation for Reinforcement Learning from Human Feedback.
Supervised Fine-Tuning (SFT)
Training a pretrained model on labeled examples to improve performance on specific tasks.
Synthetic Data
Training data generated by models or algorithms rather than manually created.
Inference
Batch Inference
Processing multiple inputs together to improve overall throughput efficiency.
Cold Start
The latency delay when a model is loaded into memory before it can serve requests.
Concurrent Requests
Multiple requests being processed simultaneously, enabled by batching and system design.
GGUF
A file format for quantized models, supporting multiple quantization levels and efficient inference.
GPU Memory
The VRAM available on GPUs, a key constraint for model loading and inference.
Inference
The process of running a trained model to generate outputs given inputs.
INT4 Quantization
Quantizing model weights to 4-bit integers, roughly 8x smaller than float32 with some quality trade-off.
INT8 Quantization
Quantizing model weights to 8-bit integers, roughly 4x smaller than float32 with minimal quality loss.
Latency
The time delay from input to first output (time-to-first-token) or complete output.
Model Serving
The infrastructure for deploying models in production, handling requests at scale.
ONNX
Open Neural Network Exchange: a standard format for representing trained models across frameworks.
Pipeline Parallelism
Distributing different layers across multiple GPUs and streaming micro-batches through them so all devices stay busy.
Quantization
Reducing model precision (e.g., from float32 to int8) to reduce memory and computation.
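A sketch of the simplest scheme, symmetric per-tensor int8 quantization (production schemes add per-channel scales, zero points, and calibration):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: int8 weights plus one scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)   # reconstruction error bounded by half a step
```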
Tensor Parallelism
Distributing model computations across multiple GPUs by splitting tensors.
Throughput
The number of tokens generated per unit time, measuring inference speed at scale.
Time to First Token (TTFT)
The latency from sending a request to receiving the first output token.
Tokens Per Second (TPS)
A throughput metric measuring how many tokens the model generates per second.
Triton Inference Server
NVIDIA's inference server supporting multiple frameworks and models with advanced scheduling.
TTFT
Abbreviation for Time to First Token.
vLLM
A high-throughput, memory-efficient LLM inference engine with optimized batching and caching.
Evaluation
ARC
AI2 Reasoning Challenge: multiple-choice science questions requiring knowledge and reasoning.
Arena Elo
A ranking system comparing models based on human preference judgments in pairwise comparisons.
Benchmark
A standardized test dataset used to compare model performance across different models.
BLEU Score
A metric measuring similarity between machine translation output and reference translations.
Calibration
How well a model's confidence scores match actual correctness probability.
Chatbot Arena
A crowdsourced platform where users compare models through pairwise contests.
Evals
OpenAI's open-source framework for designing and running custom evaluations on language models.
Exact Match
An evaluation metric where answers are marked correct only if they exactly match the reference.
F1 Score
A metric combining precision and recall, useful for evaluating QA and information extraction.
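A sketch of token-overlap F1 as used in SQuAD-style QA scoring (real scorers also normalize case, articles, and punctuation first):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```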
GPQA
Graduate-level Google-Proof Q&A: a benchmark of graduate-level science questions written by domain experts, designed to resist answering via web search.
GSM8K
Grade School Math 8K: a benchmark of 8,500 grade school math word problems.
HellaSwag
A commonsense-reasoning benchmark where models pick the most plausible continuation of everyday scenarios.
HumanEval
A benchmark evaluating code generation capability through functional correctness on programming tasks.
LM Eval Harness
A flexible framework for evaluating language models on diverse benchmarks using consistent methodology.
MMLU
Massive Multitask Language Understanding: a broad benchmark covering 57 academic subjects.
Model Evaluation
The systematic process of measuring model quality and capabilities using metrics and benchmarks.
Out-of-Distribution
Data or scenarios different from the training distribution, testing model generalization.
Pass@K
For code generation, the probability that at least one of K samples passes tests.
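The standard unbiased estimator for this quantity, given n total samples of which c pass:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations, c correct, passes."""
    if n - c < k:
        return 1.0   # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```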
ROUGE Score
A metric measuring recall of n-grams between model output and reference summaries.
SWE-Bench
Software Engineering Benchmark: evaluating models on real GitHub issues and pull requests.
TruthfulQA
A benchmark measuring whether models avoid reproducing common human misconceptions and imitative falsehoods.
Pricing & Cost
API Rate Limits
Restrictions on how many requests or tokens can be processed per time unit.
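The usual client-side response to rate limits is retry with exponential backoff and jitter; a generic sketch (the exception type to catch depends on your client library, so a broad `Exception` stands in here):

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=0.5):
    """Retry a zero-argument callable with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # in practice, catch your client's rate-limit error
            if attempt == max_retries - 1:
                raise
            # delay doubles each attempt, scaled by random jitter in [0.5, 1.0)
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)
```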
Batch API
An API for asynchronous batch processing of many requests at discounted rates.
Cached Tokens
Previously processed tokens stored and reused, charged at lower rates (prompt caching).
Context Caching
The capability to cache long context and reuse it across multiple requests.
Context Compression
Techniques for reducing context size while preserving necessary information.
Cost Per Million Tokens (CPM)
The pricing metric showing cost for processing one million tokens.
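Cost estimation from this metric is simple arithmetic; the rates below are hypothetical, not any provider's actual pricing:

```python
def request_cost(input_tokens, output_tokens, input_cpm, output_cpm):
    """Cost of one request given per-million-token prices for input and output."""
    return input_tokens / 1e6 * input_cpm + output_tokens / 1e6 * output_cpm

# Hypothetical pricing: $3 per 1M input tokens, $15 per 1M output tokens.
cost = request_cost(input_tokens=2000, output_tokens=500,
                    input_cpm=3.0, output_cpm=15.0)
# 0.006 + 0.0075 = $0.0135
```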
Input Tokens
Tokens in the prompt, billed separately (usually cheaper than output tokens).
LLM Cost Optimization
Strategies for reducing API costs through model selection, caching, and prompt engineering.
Model Routing
Automatically selecting the most cost-effective model based on task requirements.
Output Tokens
Tokens in the model's response, typically billed at higher rate than input tokens.
Pay-Per-Token
A billing model charging per token processed (input and output separately).
Prompt Caching
An API feature enabling reuse of previously processed prompt tokens at lower cost.
Requests Per Minute (RPM)
A rate limit measuring maximum requests allowed per minute.
Token Counting
Accurately determining how many tokens text will consume before submitting requests.
Tokens Per Minute (TPM)
A rate limit measuring maximum tokens processed per minute.
Deployment
Auto-Scaling
Automatically adjusting compute resources based on demand to maintain performance and efficiency.
Blue-Green Deployment
A deployment strategy maintaining two identical environments, switching traffic for zero-downtime updates.
Canary Deployment
Gradually rolling out a new version to a small percentage of users before full rollout.
Dedicated GPU
Provisioning physical GPU hardware exclusively for model inference, ensuring predictable performance.
Inference Endpoint
A deployed model accessible via API for making predictions in production.
Load Balancing
Distributing requests across multiple instances to ensure efficient resource usage and reliability.
Model Deployment
The process of putting a trained model into production for real-world use.
Model Versioning
Tracking different versions of models and managing their deployment, updates, and rollback.
Serverless Inference
Running inference without managing servers, using managed services that auto-scale.
Spot Instances
Discounted cloud compute instances that can be terminated, useful for cost-sensitive inference.
Prompting
Adversarial Prompting
Crafting prompts designed to expose model weaknesses or cause failures.
Assistant Prompt
Text providing hints or examples of desired model behavior within a conversation.
Chain-of-Thought Prompting
Requesting explicit step-by-step reasoning to improve accuracy on complex tasks.
Delimiters
Special characters or markers used to separate sections or mark boundaries in prompts.
Dynamic Prompting
Generating or adapting prompts at runtime based on input or task characteristics.
Jailbreaking
Attempting to bypass safety guidelines through prompting techniques.
Markdown Prompting
Using markdown formatting to structure prompts with clear headings and sections.
Meta-Prompting
Prompting the model to improve its own prompts or reasoning strategies.
Output Formatting
Instructing models to return output in specific formats (JSON, markdown, lists, etc.).
Persona
A defined character or professional identity assigned to the model to shape its responses.
Plan and Execute
First planning the approach, then executing step-by-step to solve complex tasks.
Prompt Compression Strategy
Techniques for reducing prompt size while preserving information needed for quality answers.
Prompt Engineering
The practice of crafting effective prompts to elicit desired behavior from LLMs.
Prompt Injection
A security vulnerability where attacker-controlled text overrides intended instructions.
ReAct Prompting
A technique where the model interleaves reasoning, acting (tool use), and observation.
Role-Playing
Instructing the model to adopt a specific role or persona to improve response quality.
Scratchpad
Asking the model to use an intermediate space to work through problems before final answers.
Self-Consistency
Sampling multiple reasoning paths and taking the majority vote for more reliable answers.
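Once final answers are extracted from each sampled reasoning path, the aggregation step is a plain majority vote:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers from independently sampled chains."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled chains-of-thought produced these final answers:
votes = ["42", "42", "17", "42", "17"]
best = self_consistency(votes)   # "42"
```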
Step-by-Step
Instructing the model to break down problems into sequential steps before answering.
Structured Prompting
Using consistent, templated prompt formats with clear sections for instructions and context.
Tree-of-Thought
A prompting technique where the model explores multiple reasoning branches before deciding.
User Prompt
Text input from the user within a conversation, as opposed to system context.
XML Tags Prompting
Using XML tags in prompts to structure information and guide model behavior.
Safety & Alignment
Alignment
Training models to behave in ways consistent with human values and intentions.
Bias in LLMs
Systematic prejudices in model outputs reflecting biases in training data or design.
Constitutional AI Principles
Explicit principles guiding model behavior (e.g., be helpful, harmless, honest).
Content Moderation
Evaluating and filtering model outputs or user inputs for inappropriate or harmful content.
Guardrails
Systems or rules that prevent LLMs from generating harmful, toxic, or inappropriate content.
Harmlessness
A key alignment goal: ensuring models don't cause harm through their outputs.
Helpfulness
An alignment goal: ensuring models provide useful, accurate, and appropriate responses.
Honesty
A core alignment principle: models should provide truthful information and acknowledge uncertainty.
Jailbreak
Successfully bypassing model safety mechanisms or constraints through prompting or other means.
PII Detection
Identifying and protecting personally identifiable information in text.
Prompt Injection Attack
A security attack where malicious input overrides model instructions, causing unintended behavior.
Red-Teaming
Adversarial testing to discover vulnerabilities, failure modes, and safety gaps in systems.
Refusal
When an LLM declines to respond to a request, typically for safety or ethical reasons.
Safety
Designing and training LLMs to avoid harmful outputs and handle edge cases responsibly.
Toxicity
Model outputs containing abusive, offensive, or hateful language.
Tools & Frameworks
A/B Testing LLMs
Comparing model outputs against baselines through controlled experiments to measure improvements.
AI Gateway
A gateway service providing unified access to multiple LLM providers with advanced features.
Anthropic API
Anthropic's REST API providing access to Claude models with a focus on safety and quality.
DSPy Framework
A framework for building LLM programs declaratively and improving them with automated optimizers that tune prompts and few-shot examples, rather than manual prompt engineering.
Haystack Framework
A framework for building retrieval-augmented generation (RAG) and search applications.
Helicone Observability
An observability platform for LLM applications providing monitoring, logging, and analytics.
Hugging Face Hub
A platform hosting open-source models, datasets, and spaces for LLM development.
LangChain Framework
A popular Python framework for building LLM applications with composable chains and agents.
Langfuse Observability
An observability platform providing tracing, logging, and analytics for LLM applications.
LiteLLM Gateway
A gateway for unified API access across multiple LLM providers with cost tracking.
LlamaIndex Framework
A data framework for indexing and querying documents with LLMs, enabling RAG applications.
LLM Proxy
A middleware service that intercepts and manages requests to LLM APIs.
LLM Router
A system that intelligently routes requests to different models based on criteria.
Model Registry
A system for cataloging, versioning, and managing deployed models.
Ollama
A tool for running large language models locally, simplifying local model deployment.
OpenAI API
OpenAI's REST API providing access to models like GPT-4 and GPT-3.5-turbo.
OpenRouter Platform
A platform providing unified API access to multiple LLM providers and open-source models.
Prompt Management
Tools and systems for organizing, versioning, and tracking prompts used in applications.
Semantic Kernel Framework
Microsoft's framework for orchestrating LLM plugins and functions, originating in .NET with Python and Java support.
Shadow Mode Testing
Running new models in parallel with production without affecting user experience to validate improvements.