Evidence and cost

Conversation traces, pass/fail judges, token accounting, retries, and reproducible prompts.

AI security coverage tests LLM endpoints, chatbots, RAG workflows, tool-calling agents, memory, connectors, runtime guardrails, and policy controls against realistic adversarial prompts and workflows.

OWASP LLM Top 10: judged

8 coverage areas

5 operator steps

4 evidence fields

Transcript, Judge, Tokens, Guardrail

Coverage maps to the OWASP LLM Top 10 categories tested below, with judge-backed verdicts.

Scope: LLM Red Team

Section: AI Security

Method: Deterministic-first

Output: Unified evidence

Profile: AI security

Coverage

What does Evidence and cost test?

Conversation traces, pass/fail judges, token accounting, retries, and reproducible prompts.
This page is part of AI Security, under LLM Red Team, and links back to the broader red-teaming coverage for models, agents, tools, and guardrails.
OWASP LLM Top 10 coverage for prompt injection, sensitive information disclosure, supply chain, data leakage, plugins, agency, overreliance, and model theft.
Jailbreak strategies, roleplay, encoding, payload splitting, multilingual variants, custom datasets, and judge-backed scoring.
Agentic tests for tool authorization, memory poisoning, context exfiltration, planner hijacking, and unsafe side effects.
Sentry runtime guardrails, HTTP sidecars, LiteLLM plugins, MCP middleware, PII, secrets, unsafe HTML, and tool authorization checks.
AI governance mapping to OWASP LLM, MITRE ATLAS, NIST AI RMF, EU AI Act, ISO/IEC 42001, GDPR, and SOC 2.

Execution

How does Pencheff run this?

Register an LLM endpoint, chatbot, model gateway, MCP host, or agent workflow.
Choose built-in categories, datasets, guardrails, custom prompts, and optional judge settings.
Run adversarial campaigns across prompt, tool, memory, retrieval, output, and policy paths.
Classify failures by category, strategy, severity, transcript, token cost, and guardrail recommendation.
Turn passing and failing prompts into regression suites for releases and model upgrades.

Evidence

What evidence does this produce?

Prompt, response, tool call, policy decision, transcript, category, strategy, judge result, and confidence.
Recommended guardrails with exact unsafe behavior, enforcement point, and regression prompt.
Token usage, model/provider metadata, retry behavior, and cost-oriented observability.
Governance mappings for AI risk, safety, privacy, and compliance programs.
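
As a rough illustration of how the evidence fields listed above could sit together in one record, here is a minimal Python sketch. The `EvidenceRecord` class, its field names, and the per-1k-token prices are assumptions made for this example; they are not Pencheff's actual schema or pricing.

```python
from dataclasses import dataclass


@dataclass
class EvidenceRecord:
    """Hypothetical per-probe evidence record (not Pencheff's real schema)."""
    prompt: str
    response: str
    category: str          # e.g. an OWASP LLM category label
    strategy: str          # e.g. "roleplay" or "encoding"
    judge_result: str
    confidence: float
    prompt_tokens: int     # assumed to already include retried attempts
    completion_tokens: int
    retries: int = 0


def estimate_cost_usd(record: EvidenceRecord,
                      usd_per_1k_prompt: float,
                      usd_per_1k_completion: float) -> float:
    """Estimate spend for one probe from its token counts.

    The prices are caller-supplied assumptions; real provider pricing varies.
    """
    return (record.prompt_tokens / 1000 * usd_per_1k_prompt
            + record.completion_tokens / 1000 * usd_per_1k_completion)
```

Summing `estimate_cost_usd` over all records in a scan gives a simple cost figure to track alongside failures.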

Controls

How is this kept safe to run?

Tests can be run through HTTP, chat-completions, LiteLLM, MCP, or custom adapters.
Guardrail recommendations stay tied to the scan that exposed the failure.
Agentic testing focuses on authorization, context boundaries, and side-effect control.
Runtime policy checks can be placed before prompts, after responses, or around tools.
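
The three enforcement points mentioned above (before prompts, after responses, around tools) can be sketched as plain functions wrapped around a model call. Everything in this example is hypothetical: the secret pattern, the tool allowlist, and the function names are placeholders, not Pencheff's or Sentry's API.

```python
import re
from typing import Callable

# Hypothetical secret format and tool allowlist, for illustration only.
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{8,}")
ALLOWED_TOOLS = {"search_docs"}


class PolicyViolation(Exception):
    """Raised when a check blocks a prompt or a tool call."""


def check_prompt(prompt: str) -> str:
    # Pre-prompt check: refuse to forward prompts that carry a secret.
    if SECRET_PATTERN.search(prompt):
        raise PolicyViolation("secret detected in prompt")
    return prompt


def check_response(response: str) -> str:
    # Post-response check: redact anything that looks like a secret.
    return SECRET_PATTERN.sub("[REDACTED]", response)


def check_tool_call(tool_name: str) -> None:
    # Around-tool check: only allowlisted tools may run.
    if tool_name not in ALLOWED_TOOLS:
        raise PolicyViolation(f"tool not authorized: {tool_name}")


def guarded_call(model: Callable[[str], str], prompt: str) -> str:
    """Run the pre-prompt check, call the model, then the post-response check."""
    return check_response(model(check_prompt(prompt)))
```

In this sketch the pre-prompt and tool checks block outright, while the post-response check redacts instead of failing; which behavior fits each point is a design choice for the deployment.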

From the Pencheff docs

Verdict pipeline

LLM Red Team — adversarial testing for chat endpoints

For each probe, the engine evaluates verdicts in order:

Regex — success_indicators ∧ ¬refusal_patterns → VULNERABLE. Refusal beats success.
Embedding similarity (optional) — when a TestCase declares success_embeddings: [text, …] and an embedder is configured, an AMBIGUOUS verdict can be promoted by cosine match against any anchor.
LLM-as-judge (optional) — still-AMBIGUOUS verdicts go to a judge model. Judge confidence ≥ min_confidence is required to override.
Factuality (LLM09 only) — KB-grounded contradiction check via the judge.

REFUSED beats every promotion path. An AMBIGUOUS verdict emits no Finding, so only confirmed VULNERABLE results are reported; that is how the pipeline keeps unconfirmed results from surfacing as false positives.
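
The ordering described above can be sketched in a few lines of Python. This is a simplified illustration, not the engine's code: `ProbeCase`, `regex_verdict`, and `evaluate` are hypothetical names, the judge is any callable returning a verdict and a confidence, and the embedding and factuality stages are left out for brevity.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple


class Verdict(Enum):
    VULNERABLE = "vulnerable"
    REFUSED = "refused"
    AMBIGUOUS = "ambiguous"


@dataclass
class ProbeCase:
    """Hypothetical stand-in for a test case with regex indicators."""
    success_indicators: Sequence[str]
    refusal_patterns: Sequence[str]


def regex_verdict(case: ProbeCase, response: str) -> Verdict:
    # Refusal is checked first, so it beats any success indicator.
    if any(re.search(p, response, re.IGNORECASE) for p in case.refusal_patterns):
        return Verdict.REFUSED
    if any(re.search(p, response, re.IGNORECASE) for p in case.success_indicators):
        return Verdict.VULNERABLE
    return Verdict.AMBIGUOUS


Judge = Callable[[str], Tuple[Verdict, float]]


def evaluate(case: ProbeCase, response: str,
             judge: Optional[Judge] = None,
             min_confidence: float = 0.65) -> Verdict:
    verdict = regex_verdict(case, response)
    # Only still-AMBIGUOUS verdicts are escalated; REFUSED is never promoted.
    if verdict is not Verdict.AMBIGUOUS or judge is None:
        return verdict
    judged, confidence = judge(response)
    # The judge may override only when it is confident enough.
    return judged if confidence >= min_confidence else Verdict.AMBIGUOUS
```

A caller would emit a Finding only when `evaluate` returns `Verdict.VULNERABLE`, which mirrors the rule that AMBIGUOUS produces nothing.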

LLM-as-judge

Five judge providers ship out of the box:

| Provider | Notes |
| --- | --- |
| `openai-chat` | Any OpenAI-compatible chat endpoint. JSON-protocol baked into the system prompt. |
| `executable` | Local command receives JSON on stdin, returns JSON on stdout. Air-gapped friendly. |
| `llama-guard` | Llama Guard 3 (8B). Parses the official `safe`/`unsafe S1..S14` reply and maps S-codes onto OWASP LLM categories. |
| `granite-guardian` | IBM Granite Guardian 3.x. Yes/No protocol with optional risk dimension. |
| `openai-moderation` | OpenAI `/moderations` API. Threshold-graded; cheap and unaffected by `<think>` traces, so recommended for reasoning-model targets. |

redteam:
  judge:
    enabled: true
    provider: openai-moderation
    endpoint: https://api.openai.com/v1/moderations
    model: omni-moderation-latest
    headers:
      Authorization: "Bearer sk-…"
    min_confidence: 0.65
    unsafe_threshold: 0.4
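
To make the `llama-guard` row above more concrete, here is a minimal sketch of parsing a Llama Guard 3 style reply, which is `safe` on its own or `unsafe` followed by a line of comma-separated S-codes. The function name is hypothetical, and the step that maps S-codes onto OWASP LLM categories is deliberately left out because that mapping is not given here.

```python
import re
from typing import List, Tuple

# Matches S1 through S14 as whole tokens.
S_CODE = re.compile(r"S(?:1[0-4]|[1-9])\b")


def parse_llama_guard(reply: str) -> Tuple[bool, List[str]]:
    """Return (is_unsafe, s_codes) for a Llama Guard style reply.

    Hypothetical helper for illustration; not the provider's actual parser.
    """
    lines = [line.strip() for line in reply.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty judge reply")
    head = lines[0].lower()
    if head == "safe":
        return False, []
    if head != "unsafe":
        raise ValueError(f"unexpected judge reply: {lines[0]!r}")
    # Remaining lines carry the violated categories, e.g. "S1,S10".
    return True, S_CODE.findall(" ".join(lines[1:]))
```

Raising on anything other than `safe` or `unsafe` keeps a malformed judge reply from being silently treated as a verdict.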

Reporting

| Format | Where | Notes |
| --- | --- | --- |
| Markdown | `reporting.render_red_team_markdown` | CI comments, Slack |
| HTML | `reporting_extras.render_html` | Self-contained, embedded CSS, no JS, email-able |
| CSV | `reporting_extras.render_csv` | One row per Finding; stable columns |
| JSON | `--output-format json` from CLI | Full Finding shape + summary + optional regression diff |
| JUnit XML | `reporting.render_junit_xml` | CI fail-on-threshold |
| Prometheus | `reporting.render_prometheus_metrics` | Pair with the Grafana dashboard |
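
For a sense of what a Prometheus rendering of findings might look like, the sketch below emits the Prometheus text exposition format from a list of finding dicts. The metric name `redteam_findings`, the label names, and the dict keys are assumptions for this example; the output of `reporting.render_prometheus_metrics` may differ.

```python
from collections import Counter
from typing import Dict, Iterable


def render_metrics(findings: Iterable[Dict[str, str]]) -> str:
    """Render finding counts in the Prometheus text exposition format.

    Hypothetical helper; assumes label values contain no quotes or backslashes.
    """
    counts = Counter((f["category"], f["severity"]) for f in findings)
    lines = [
        "# HELP redteam_findings Findings by category and severity.",
        "# TYPE redteam_findings gauge",
    ]
    for (category, severity), n in sorted(counts.items()):
        lines.append(
            f'redteam_findings{{category="{category}",severity="{severity}"}} {n}'
        )
    return "\n".join(lines) + "\n"
```

A gauge is used here because the count describes a single scan and can go down on the next one.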

A/B comparison & regression detection

GET /scans/{a}/compare/{b} returns a structured diff (regressions, fixes, common failures). The web UI exposes the same diff at /scans/compare?a=…&b=…. Use it to gate PRs on safety regressions or to A/B different model versions on the same suite.
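
The three buckets in that diff can be thought of as set operations over the failing probes of each scan. The sketch below assumes scan `a` is the baseline, scan `b` is the candidate, and both ran the same suite; the function names and the returned dict shape are illustrative and not the endpoint's actual response schema.

```python
from typing import Dict, List, Set


def compare_scans(a_failed: Set[str], b_failed: Set[str]) -> Dict[str, List[str]]:
    """Diff two scans by the IDs of the probes that failed in each.

    Assumes `a` is the baseline and `b` is the candidate (hypothetical shape).
    """
    return {
        "regressions": sorted(b_failed - a_failed),      # newly failing in b
        "fixes": sorted(a_failed - b_failed),            # failing in a, passing in b
        "common_failures": sorted(a_failed & b_failed),  # failing in both
    }


def should_block_merge(diff: Dict[str, List[str]]) -> bool:
    # A simple PR gate: block whenever the candidate introduces a regression.
    return bool(diff["regressions"])
```

Common failures do not block in this sketch, since they predate the change under review.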

Share-by-link

POST /scans/{id}/share?ttl_seconds=604800 returns a Fernet-encrypted token. The public route GET /share/llm/{token} renders the report as HTML / Markdown / CSV / JSON without auth — token expiry is the only revocation. Available only for kind: "llm" scans.
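
Because expiry is the only revocation, it can help to know when a shared link will stop working. Per the Fernet spec, a token is URL-safe base64 whose first byte is the version `0x80`, followed by an 8-byte big-endian creation timestamp. The sketch below reads that timestamp without the key. It assumes the caller knows the TTL that was requested, and it does not authenticate the token, so it is a convenience check only; the helper names are hypothetical.

```python
import base64
import struct

FERNET_VERSION = 0x80  # version byte defined by the Fernet spec


def token_issued_at(token: str) -> int:
    """Read the (unauthenticated) creation timestamp from a Fernet token."""
    padded = token + "=" * (-len(token) % 4)  # tolerate stripped padding
    raw = base64.urlsafe_b64decode(padded)
    if len(raw) < 9 or raw[0] != FERNET_VERSION:
        raise ValueError("not a Fernet token")
    (issued_at,) = struct.unpack(">Q", raw[1:9])
    return issued_at


def is_expired(token: str, ttl_seconds: int, now: int) -> bool:
    # Fernet treats a token as expired once issue time plus TTL is in the past.
    return now > token_issued_at(token) + ttl_seconds
```

Only the server, which holds the key, can actually verify and decrypt the token; this check is for planning when to re-share.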

Grafana dashboard

The repo ships a canonical dashboard at docs/grafana/pencheff-llm-redteam.json: total failures, per-OWASP-LLM breakdown, per-strategy table, severity donut, latency p50/p95/p99, regression rate, cost trend.
