Models & AI Tech

AI tech topics

Concept pages for the techniques and failure modes that cut across models — each one a lens over the findings, methods and references tagged to it.

AI agents & tool use

Agents let a model plan and call tools to act in the world. The autonomy that makes them useful also makes them dangerous: untrusted input can hijack instructions (prompt injection), tool arguments can be hallucinated, and errors compound across steps. The linked findings show concrete failure modes to test for before deploying an agent on untrusted data.

Agents

Bias & fairness

Models inherit the statistics of their training data, including its stereotypes — and alignment training redistributes bias rather than deleting it. The failures are often quiet: different tone for different names, different refusal rates across dialects, skewed defaults in generated personas. Because the effects are distributional, they only show up under aggregated, controlled comparisons, not spot checks. The linked findings document measured disparities; the methods show the comparison designs that surface them.

Bias

Evals & benchmarks

Evals are how the field measures progress, and they are easy to fool — including by accident. Benchmark data leaks into training sets (contamination), scores saturate while real-world capability lags, and optimizing for a metric corrupts the metric. A reported score is a claim about a dataset, not about your use case. The linked methods cover how to evaluate probabilistic systems honestly; the findings document where headline numbers and observed behavior came apart.

Evals

Hallucination & confabulation

Hallucination is the failure mode that made LLM testing a discipline: the model states something false with the same fluency and confidence as something true. It is not a bug to be patched but a consequence of how generative models work — they optimize for plausible continuations, not verified facts. Mitigations (RAG, citations, abstention training) reduce the rate without eliminating it, so testing has to measure it: the linked findings document fabricated citations, invented APIs and confident wrong answers across vendors.

Hallucination

Jailbreaking

A jailbreak is a prompt that talks a model out of its safety training — role-play framings, encodings, many-shot patterns, or automated search over adversarial suffixes. Unlike prompt injection (which hijacks an application's instructions), jailbreaking targets the model's own refusal behavior. Defenses improve every generation and none have held: each new model ships with new bypasses found within days. The linked findings document techniques that worked and how vendors responded.

Jailbreak

Long context windows

Bigger context windows do not guarantee reliable use of everything in them. Models recall information best at the start and end of the input and worst in the middle, and performance degrades as the relevant span moves. Don't assume a large window means a fact buried mid-context will be used — the linked findings quantify the effect.

Context window

Model drift

The model behind an API endpoint is not a fixed artifact: providers retrain, swap checkpoints and adjust system prompts, and behavior changes without a version bump. A prompt that worked in March can silently degrade by June — the linked findings include measured drift on identical prompts across months. For anything in production this turns testing from a release activity into a monitoring activity: pin versions where you can, and re-run your evals on a schedule where you can't.

Drift

Prompt injection

Prompt injection is the LLM-era analogue of SQL injection: once a model processes attacker-controlled text, that text can override the developer's instructions. Indirect injection — payloads hidden in retrieved or browsed content — is especially hard to defend. There is no robust general fix; the linked findings document direct, indirect and exfiltration variants.

Prompt injection

Reasoning & chain-of-thought

Chain-of-thought and dedicated reasoning models trade tokens for accuracy: the model writes out intermediate steps before answering. The gains are real but so are the new failure modes — the stated reasoning is not always the real reasoning (unfaithful chains), performance collapses on problems slightly outside the training distribution, and small irrelevant changes to a question can flip the answer. The linked findings are the largest cluster in the catalog: reasoning is where confident output and actual capability diverge most visibly.

Reasoning failure

Refusals & over-caution

Safety training has a false-positive side: models refuse legitimate requests — medical questions, security research, fiction involving conflict — because they pattern-match to something forbidden. Over-refusal is a real quality defect, in tension with jailbreak resistance: tuning toward one moves the other. It is also unevenly distributed across topics and phrasings, so it needs its own evaluation rather than being treated as the safe default. The linked findings document refusals of clearly benign requests.

Refusal

Retrieval-Augmented Generation (RAG)

RAG grounds a model's output in retrieved documents to reduce hallucination and add fresh knowledge. It is not a cure-all: retrieval quality bounds answer quality, models can ignore or misread retrieved context, and long contexts suffer recall degradation (see the linked findings). Treat retrieval as a noisy oracle, not ground truth.

RAG

Robustness & input perturbations

A robust system gives the same answer to the same question asked slightly differently. LLMs often don't: typos, reordered options, added whitespace or an irrelevant sentence can change the output — which means a single passing test proves little. Robustness testing makes the perturbation systematic (metamorphic relations, paraphrase sets, character-level noise) and measures the flip rate. The linked methods and findings show how small the perturbation can be and still matter.

Robustness

Safety & alignment

Alignment training teaches a model to refuse harmful requests and follow a policy — but the policy lives in the same soft, steerable substrate as everything else the model does. That makes safety a property to be tested, not assumed: behavior shifts across versions, guardrails that hold in English fail in other languages or encodings, and vendors' own system cards acknowledge residual risks. The linked findings collect both independently demonstrated failures and vendor-acknowledged limitations.

Safety

Tool calling

Tool calling is the contract surface between a probabilistic model and deterministic software: the model emits a structured call, your code executes it. The contract is enforced by training, not by a type system — so models hallucinate functions that don't exist, fabricate plausible argument values, and mishandle tool errors. Where the agents topic covers compounding multi-step autonomy, the findings here sit at the single-call interface: schema violations and invented arguments you can test for directly.

Tool use