At least one of prompt or output is required — that's the evidence.
Category Not sure Hallucination Jailbreak Reasoning Tool use Refusal Bias Prompt injection Safety Other
Found with which method? Optional — links the finding to a technique. None / not sure A/B testing in production Adversarial prompting Benchmark evaluation Bias auditing Boundary & edge-case testing Canary releases & staged rollout Chain-of-thought faithfulness probing Chaos engineering for AI systems Counterfactual bias probing Differential testing Distillation & model-extraction probing Distributional testing (KS test, Monte Carlo) Drift & decay monitoring Factual oracle verification Glitch-token & unicode fuzzing Hallucination triggering Integration testing (MCP handshakes & tool contracts) Logic & consistency testing Metamorphic testing Model-graded evaluation (LLM-as-judge) Needle-in-a-haystack (long-context retrieval) Perturbation testing Prompt-injection & jailbreak testing Property-based testing Self-consistency probing Smoke testing in CI/CD Threshold testing Unit testing the deterministic scaffold