An Agent's Guide to Large Language Models
From transformer architecture to inference optimization — how LLMs actually work under the hood, what they can and cannot do, and how to reason about their behavior in production systems.
Introduction
This guide is written for agents and the humans who build them. It starts with what LLMs are — then moves into how they work, where they fail, and what that means for systems built on top of them.
Large Language Models are probabilistic text generators trained on vast corpora of human-written data. They predict the next token in a sequence — and from that deceptively simple objective, emergent capabilities arise: reasoning, translation, code generation, summarization, and open-ended conversation.
But beneath the surface, an LLM is not a knowledge base, not a search engine, and not a reasoning engine in the classical sense. It is a compressed, lossy map of language patterns — and understanding this distinction is critical for anyone building systems that rely on them.
The Transformer Architecture
The modern LLM is built on the transformer architecture, introduced in 2017. At its core, the transformer replaces sequential processing (RNNs, LSTMs) with a parallelizable mechanism called self-attention — allowing the model to weigh relationships between all tokens in a sequence simultaneously.
Self-Attention
Self-attention computes three vectors for each input token: a Query, a Key, and a Value. The dot product of queries and keys, divided by the square root of the key dimension, produces attention scores: a matrix that tells the model how much each token should attend to every other token. These scores are normalized via softmax and used to produce a weighted sum of values.
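# Simplified single-head self-attention (2-D inputs: seq_len x d_k)
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # scaled dot-product scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V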
Multi-head attention runs this process in parallel across multiple "heads", each learning different relational patterns. One head might track syntactic dependencies, another semantic similarity, another positional relationships. The outputs are concatenated and projected back to the model's hidden dimension.
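The split, attend, concatenate, and project steps described above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the weight matrices `W_q`, `W_k`, `W_v`, `W_o` are random placeholders, the sizes are arbitrary, and causal masking is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """x: (seq_len, d_model); each W_*: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (n_heads, seq, d_head)
    # Concatenate the heads back to (seq_len, d_model), then project
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Illustrative sizes and random weights (assumptions, not real model values)
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 16, 4, 5
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (5, 16)
```

Each head works on a `d_model / n_heads` slice of the projected vectors, so the total cost stays close to that of one full-width attention.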
Layers, Residuals, and Normalization
A transformer stacks identical layers — each containing a multi-head attention block followed by a feed-forward network. Residual connections and layer normalization stabilize training across depth. Modern LLMs use 32 to 128+ layers, with hidden dimensions ranging from 4096 to 16384.
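A single layer and a small stack can be sketched as follows. The sketch assumes a pre-normalization ordering (normalize, apply the sublayer, add the residual), single-head attention, and a ReLU feed-forward network; real models differ in these choices, and all parameter names and sizes here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift omitted for brevity)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def feed_forward(x, W1, W2):
    return np.maximum(0.0, x @ W1) @ W2  # two-layer MLP with ReLU

def transformer_block(x, p):
    # Residual connections: each sublayer adds to its input rather than replacing it
    x = x + attention(layer_norm(x), p["W_q"], p["W_k"], p["W_v"])
    x = x + feed_forward(layer_norm(x), p["W1"], p["W2"])
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len, n_layers = 8, 32, 4, 3

def init_params():
    return {
        "W_q": rng.normal(size=(d_model, d_model)) * 0.1,
        "W_k": rng.normal(size=(d_model, d_model)) * 0.1,
        "W_v": rng.normal(size=(d_model, d_model)) * 0.1,
        "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
        "W2": rng.normal(size=(d_ff, d_model)) * 0.1,
    }

layers = [init_params() for _ in range(n_layers)]
h = rng.normal(size=(seq_len, d_model))
for p in layers:
    h = transformer_block(h, p)  # same structure, different weights per layer
print(h.shape)  # (4, 8)
```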
The depth of a transformer is not about "more processing"; it is about representational refinement. As a rough generalization, early layers capture surface patterns (syntax, word identity), middle layers build semantic relationships, and late layers synthesize task-specific outputs. This is why intermediate layer representations are often useful for embeddings.
Training: Pre-training, Fine-tuning, RLHF
LLM training happens in stages, each with different objectives, data, and compute requirements. Understanding these stages explains why models behave the way they do — and where their biases and limitations originate.
Pre-training
The foundation. A model is trained on trillions of tokens from the open web, books, code repositories, and curated datasets. The objective is next-token prediction: given a sequence, predict the most likely continuation. This self-supervised phase (the labels are simply the next tokens in the text) consumes the vast majority of compute: thousands of GPUs running for weeks to months.
- Dataset scale: 1–15 trillion tokens typical for frontier models
- Compute cost: $10M–$100M+ for a single training run
- Key decisions: data mix ratios, deduplication, tokenizer design
- Output: a base model with broad language competence but no alignment
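The next-token objective can be written as a cross-entropy loss in which the target at each position is the token that follows it. The sketch below is a minimal NumPy version; the tiny vocabulary and the all-zero logits are illustrative assumptions chosen so the result is easy to check by hand.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy for predicting token t+1 from position t.

    logits: (seq_len, vocab_size) model outputs, token_ids: (seq_len,) ints.
    """
    preds = logits[:-1]       # the last position has no next token to predict
    targets = token_ids[1:]   # targets are the inputs shifted left by one
    shifted = preds - preds.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size = 10
token_ids = np.array([3, 1, 4, 1, 5])
uniform_logits = np.zeros((len(token_ids), vocab_size))
loss = next_token_loss(uniform_logits, token_ids)
print(loss)  # equals ln(10), about 2.3026, for a uniform prediction
```

A model that has learned nothing scores `ln(vocab_size)`; training pushes the loss below that baseline.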
Supervised Fine-Tuning (SFT)
The base model is further trained on curated instruction-response pairs. This teaches the model to follow instructions, adopt conversational formats, and produce structured outputs. SFT data is typically human-written or human-validated, ranging from thousands to millions of examples.
Reinforcement Learning from Human Feedback
RLHF refines the model's outputs by training a reward model on human preference rankings, then using reinforcement learning (typically PPO) to optimize the language model against that reward signal. DPO (Direct Preference Optimization) is a common alternative that trains on the preference pairs directly, with no separate reward model or reinforcement learning loop. This stage is responsible for the "helpfulness" and safety characteristics that differentiate chat models from base models.
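One common way to train a reward model on preference rankings is a pairwise (Bradley-Terry style) loss, which pushes the score of the preferred response above the score of the rejected one. The sketch below assumes that formulation; the source does not say which loss is used, and the scores are made-up numbers.

```python
import numpy as np

def pairwise_preference_loss(r_chosen, r_rejected):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs."""
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log(sigmoid(d)) == log(1 + exp(-d)), computed stably with logaddexp
    return float(np.mean(np.logaddexp(0.0, -diff)))

# Reward scores for two preference pairs (illustrative values)
agrees = pairwise_preference_loss([2.0, 1.5], [0.0, -0.5])     # ranks pairs correctly
disagrees = pairwise_preference_loss([0.0, -0.5], [2.0, 1.5])  # ranks pairs backwards
print(agrees < disagrees)  # True
```

The loss depends only on the difference between the two scores, which is why reward values are meaningful relative to each other rather than on an absolute scale.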
RLHF is where the model learns what humans consider "good" output, but it is also where subtle biases in annotator preferences get baked in. A model trained with RLHF from primarily English-speaking annotators will have different behavioral priors than one trained with globally diverse feedback.
Inference: From Weights to Words
Inference is the process of generating text from a trained model. It's autoregressive — the model generates one token at a time, feeding each output back as input for the next step. This sequential dependency is the fundamental bottleneck of LLM serving.
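The autoregressive loop can be sketched in a few lines. Here `toy_model` is a hypothetical stand-in for a real forward pass: it simply favors the token after the last one, which keeps the output predictable.

```python
import numpy as np

def generate(next_token_logits, prompt_ids, max_new_tokens, eos_id=None):
    """Greedy autoregressive decoding: one forward pass per generated token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)   # depends on everything generated so far
        next_id = int(np.argmax(logits))
        ids.append(next_id)               # the output becomes part of the next input
        if eos_id is not None and next_id == eos_id:
            break
    return ids

VOCAB_SIZE = 5

def toy_model(ids):
    # Hypothetical model: always prefers (last token + 1) modulo the vocabulary size
    logits = np.zeros(VOCAB_SIZE)
    logits[(ids[-1] + 1) % VOCAB_SIZE] = 1.0
    return logits

print(generate(toy_model, [0], max_new_tokens=6))            # [0, 1, 2, 3, 4, 0, 1]
print(generate(toy_model, [0], max_new_tokens=6, eos_id=3))  # [0, 1, 2, 3]
```

Step N cannot start until step N-1 has produced its token, so output tokens cannot be computed in parallel the way prompt tokens can.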
Sampling Strategies
The model outputs a probability distribution over its vocabulary at each step. How you sample from that distribution controls the behavior of the output:
- Temperature: divides logits by T before softmax. Low (0.1–0.3) = near-deterministic, high (0.8–1.2) = more varied and creative
- Top-k: only consider the k most probable tokens
- Top-p (nucleus): only consider the smallest set of most probable tokens whose cumulative probability reaches p
- Greedy: always pick the most probable token (the limit as temperature approaches 0)
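The strategies above compose naturally into a single sampling function. The following is a sketch under simple assumptions (a 1-D logits vector, filters applied in the order temperature, top-k, top-p); serving frameworks may order or combine these differently.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))        # greedy decoding
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(-probs)               # token ids, most probable first
    keep = np.ones(len(probs), dtype=bool)   # mask over the sorted order
    if top_k is not None:
        keep[top_k:] = False
    if top_p is not None:
        cum = np.cumsum(probs[order])
        # Smallest prefix whose cumulative probability reaches top_p
        cutoff = int(np.searchsorted(cum, top_p)) + 1
        keep[cutoff:] = False
    kept = order[keep]
    kept_probs = probs[kept] / probs[kept].sum()  # renormalize the survivors
    return int(rng.choice(kept, p=kept_probs))

logits = [2.0, 1.0, 0.1, -1.0]
rng = np.random.default_rng(0)
print(sample_token(logits, temperature=0))        # 0
print(sample_token(logits, top_k=1, rng=rng))     # 0
```

With these logits the most probable token has a probability of roughly 0.64, so any `top_p` at or below that value also reduces to picking token 0.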
KV Cache and Memory
During autoregressive generation, each new token attends to every previous token. Without caching, the model would recompute the Key and Value projections for the entire prefix at every step. The KV cache stores the Key and Value matrices from prior steps, avoiding that redundant computation. This is a time-memory tradeoff: caching accelerates generation but requires memory proportional to sequence length × model dimension × number of layers.
# KV cache memory estimate
def kv_cache_size_gb(seq_len, d_model, n_layers, n_heads, dtype_bytes=2):
    """Estimate KV cache size (GiB) for one sequence, assuming full multi-head attention."""
    d_head = d_model // n_heads
    # Leading 2 = one K tensor plus one V tensor per layer
    total_bytes = 2 * seq_len * n_heads * d_head * n_layers * dtype_bytes
    return total_bytes / (1024 ** 3)

# Example: 128k context, 8192 dim, 80 layers, 64 heads, fp16
print(kv_cache_size_gb(131072, 8192, 80, 64))  # 320.0 GiB per sequence without GQA
The KV cache is why long-context models are expensive to serve. A 128k context window with a large model can consume tens of gigabytes of GPU memory per concurrent sequence, and hundreds if every attention head keeps its own keys and values. This is the driving force behind innovations like GQA (grouped-query attention), sliding window attention, and context compression.
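To see why grouped-query attention matters, the cache estimate can be parameterized by the number of key/value heads instead of the number of query heads. This is a back-of-the-envelope sketch; the 8 KV-head configuration is a hypothetical example, not a claim about any specific model.

```python
def kv_cache_gib(seq_len, n_layers, n_kv_heads, d_head, dtype_bytes=2):
    # 2 tensors (K and V) per layer, each of shape seq_len x n_kv_heads x d_head
    total_bytes = 2 * seq_len * n_layers * n_kv_heads * d_head * dtype_bytes
    return total_bytes / (1024 ** 3)

seq_len, n_layers, d_head = 131072, 80, 128  # 128k context, fp16

mha = kv_cache_gib(seq_len, n_layers, n_kv_heads=64, d_head=d_head)  # every head has its own K/V
gqa = kv_cache_gib(seq_len, n_layers, n_kv_heads=8, d_head=d_head)   # 8 query heads share each K/V head
print(mha, gqa)  # 320.0 40.0
```

Cache size scales linearly with the number of KV heads, so sharing each KV head across 8 query heads cuts memory by 8x.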
Limitations and Failure Modes
Understanding where LLMs fail is more operationally important than understanding where they succeed. These failure modes are not bugs — they are structural consequences of how the technology works.
Hallucination
LLMs generate plausible-sounding text that may be factually incorrect. This is not a malfunction of the model; it is a consequence of its design. The model was trained to predict likely continuations, not to verify truth. Hallucination tends to be more frequent on topics that are sparsely represented in the training data.
Context Window Boundaries
Information that is neither encoded in the weights nor present in the context window does not exist to the model. There is no persistent memory, no live knowledge retrieval, and no awareness of prior conversations, unless explicitly engineered into the surrounding system. This is why agentic architectures with external memory are essential for production use cases.
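Because anything that does not fit in the window is invisible to the model, the surrounding system has to decide what goes in. A minimal sketch of one policy, keeping the system prompt plus the most recent messages that fit, is shown below. The whitespace-based `rough_token_count` is a crude stand-in for a real tokenizer, and all names are illustrative.

```python
def fit_to_context(system_prompt, messages, max_tokens, count_tokens):
    """Keep the system prompt plus the newest messages that fit in the budget."""
    budget = max_tokens - count_tokens(system_prompt)
    kept = []
    for msg in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(msg)
        if cost > budget:
            break                    # older messages are dropped entirely
        kept.append(msg)
        budget -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order

def rough_token_count(text):
    return len(text.split())  # assumption: real tokenizers count differently

history = ["first question here", "first answer here", "second question", "second answer"]
window = fit_to_context("You are helpful", history, max_tokens=10,
                        count_tokens=rough_token_count)
print(window)
# ['You are helpful', 'first answer here', 'second question', 'second answer']
```

The dropped message is gone as far as the model is concerned, which is the gap that external memory or retrieval has to fill.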
Reasoning vs Pattern Matching
LLMs can produce outputs that look like reasoning — step-by-step derivations, logical chains, mathematical proofs. But the mechanism is pattern completion, not symbolic logic. This means they can fail silently on novel problem structures, even when they solve similar-looking problems correctly. Chain-of-thought prompting improves reliability by making the reasoning trace explicit, but does not change the underlying mechanism.
- Verify critical outputs with deterministic systems — never trust LLM output as ground truth
- Use structured output formats (JSON, function calls) to constrain generation
- Implement retrieval-augmented generation for factual accuracy
- Design systems that degrade gracefully when the model is wrong
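The first, second, and fourth points can be combined in a small validation layer: ask the model for JSON, check it deterministically, and return an error instead of raising when the output is malformed. The schema below (a `sentiment` label and a `confidence` score) is a hypothetical example, not a standard format.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}  # hypothetical schema

def parse_model_output(raw_text):
    """Return (result, error). Never raises on malformed model output."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        return None, "sentiment missing or not an allowed value"
    conf = data.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        return None, "confidence missing or outside [0, 1]"
    return data, None

ok, err = parse_model_output('{"sentiment": "positive", "confidence": 0.9}')
print(ok, err)   # {'sentiment': 'positive', 'confidence': 0.9} None

bad, err2 = parse_model_output("Sure! Here is the JSON you asked for...")
print(bad, err2 is not None)  # None True
```

The caller can then retry, fall back to a default, or route the request to a human, rather than passing unchecked text downstream.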
Operating LLMs in Production
Moving from prototype to production with LLMs involves a shift in priorities: from capability to reliability, from accuracy to cost, from speed to consistency. The model itself is the easy part — the infrastructure around it is where the engineering happens.
Cost Architecture
LLM costs scale with token volume. Input tokens (prompt) and output tokens (completion) are priced differently, with output typically 3–5x more expensive. Prompt caching, shorter prompts, and output length limits are the primary cost levers. At scale, the difference between a well-optimized and naive implementation can be 10–50x in cost.
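A rough cost model makes these levers concrete. The prices and traffic numbers below are placeholders chosen for illustration (with output priced at 5x input), not quotes from any provider.

```python
def monthly_cost_usd(requests_per_day, input_tokens, output_tokens,
                     input_price_per_mtok, output_price_per_mtok, days=30):
    """Estimate monthly spend; prices are in USD per million tokens."""
    price_weighted_tokens = (input_tokens * input_price_per_mtok
                             + output_tokens * output_price_per_mtok)
    return requests_per_day * days * price_weighted_tokens / 1_000_000

# Hypothetical prices: $3 per million input tokens, $15 per million output tokens
naive = monthly_cost_usd(10_000, input_tokens=8_000, output_tokens=1_000,
                         input_price_per_mtok=3.0, output_price_per_mtok=15.0)
lean = monthly_cost_usd(10_000, input_tokens=1_500, output_tokens=300,
                        input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(round(naive), round(lean))  # 11700 2700
```

In this example, trimming the prompt and capping the output length cuts the bill by more than 4x before any caching is applied.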
Latency Budgets
Time-to-first-token (TTFT) and tokens-per-second (TPS) are the two latency dimensions. TTFT depends on prompt length and model size. TPS depends on model size and serving infrastructure. For interactive applications, TTFT under 500ms and TPS above 30 are typical targets. Streaming output to the user while generation continues masks perceived latency.
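Both metrics can be measured from any token stream by timestamping the first token and the end of the stream. The sketch below assumes the stream is a plain Python iterable; the injectable `clock` parameter is an illustrative design choice that makes the function testable without real delays.

```python
import time

def measure_stream(token_stream, clock=time.perf_counter):
    """Return time-to-first-token and decode throughput for a token iterable."""
    start = clock()
    first_token_at = None
    n_tokens = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = clock()
        n_tokens += 1
    end = clock()
    if first_token_at is None:
        return {"ttft_s": None, "tps": 0.0, "tokens": 0}
    decode_time = end - first_token_at
    # Throughput counts the tokens produced after the first one
    tps = (n_tokens - 1) / decode_time if decode_time > 0 and n_tokens > 1 else 0.0
    return {"ttft_s": first_token_at - start, "tps": tps, "tokens": n_tokens}

# Deterministic demo with a fake clock: start at 0.0 s, first token at 0.5 s, end at 1.0 s
ticks = iter([0.0, 0.5, 1.0])
stats = measure_stream(["Hello", ",", " world"], clock=lambda: next(ticks))
print(stats)  # {'ttft_s': 0.5, 'tps': 4.0, 'tokens': 3}
```

Reporting the two numbers separately matters because they respond to different fixes: shorter prompts help TTFT, while faster serving infrastructure helps TPS.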
Evaluation and Monitoring
LLM outputs are non-deterministic and difficult to test with traditional unit tests. Production monitoring requires: output quality scoring (automated + human), latency and cost tracking per endpoint, drift detection across model versions, and structured logging of input-output pairs for debugging. Treat your LLM integration like an external dependency — because it is one.
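Structured logging can start as one JSON line per call, carrying enough fields to track latency, cost, and model version later. The field names below are illustrative assumptions; adapt them to whatever your logging pipeline expects.

```python
import json

def make_log_record(timestamp, model_version, prompt, completion,
                    latency_s, input_tokens, output_tokens):
    """Serialize one LLM call as a single JSON line."""
    record = {
        "timestamp": timestamp,
        "model_version": model_version,  # lets you compare behavior across versions
        "prompt": prompt,
        "completion": completion,
        "latency_s": latency_s,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    return json.dumps(record, sort_keys=True)

line = make_log_record(1700000000.0, "model-v1", "Summarize this ticket.",
                       "A short summary.", 0.82, 120, 15)
parsed = json.loads(line)
print(parsed["model_version"], parsed["input_tokens"] + parsed["output_tokens"])
# model-v1 135
```

Logging full prompts and completions may capture sensitive user data, so redaction or access controls are worth considering before enabling this in production.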
The most common production failure is not hallucination — it's cost overrun. Teams that prototype with unlimited token budgets are surprised when their 100-user pilot costs $50k/month. Design your token budget before you design your prompts.