Test AI products before attackers do: prompt attacks, tool abuse, data leakage, unsafe output, guardrail bypass, multi-agent workflows, and runtime policy enforcement.
AI security
Evidence and cost
Conversation traces, pass/fail judges, token accounting, retries, and reproducible prompts.
Findings, reports, dashboards, exports, integrations, and retests all read from the same normalized record.
Pencheff favors repeatable checks, then uses AI for triage, enrichment, orchestration, and remediation where it adds signal.
From the Pencheff docs
Verdict pipeline
LLM Red Team — adversarial testing for chat endpoints
/features/llm-redteam
For each probe, the engine evaluates verdicts in order:
- Regex: `success_indicators` ∧ ¬`refusal_patterns` → VULNERABLE. Refusal beats success.
- Embedding similarity (optional): when a TestCase declares `success_embeddings: [text, …]` and an embedder is configured, an AMBIGUOUS verdict can be promoted by cosine match against any anchor.
- LLM-as-judge (optional): still-AMBIGUOUS verdicts go to a judge model. Judge confidence ≥ `min_confidence` is required to override.
- Factuality (LLM09 only): KB-grounded contradiction check via the judge.
REFUSED beats every promotion path. AMBIGUOUS emits no Finding — that's how the false-positive rate stays at zero.
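The ordering above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Pencheff's implementation: the `TestCase` dataclass, the `evaluate` function, the `embed` and `judge` callables, and the `sim_threshold` default are all hypothetical, and the LLM09 factuality stage is left out.

```python
import math
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Verdict labels come from the text above; every other name is hypothetical.
VULNERABLE, REFUSED, AMBIGUOUS = "VULNERABLE", "REFUSED", "AMBIGUOUS"

Embedder = Callable[[str], Sequence[float]]
Judge = Callable[[str], Tuple[str, float]]  # returns (verdict, confidence)


@dataclass
class TestCase:
    success_indicators: Sequence[str]
    refusal_patterns: Sequence[str]
    success_embeddings: Sequence[str] = ()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def evaluate(
    reply: str,
    case: TestCase,
    embed: Optional[Embedder] = None,
    judge: Optional[Judge] = None,
    sim_threshold: float = 0.8,   # assumed value, not from the docs
    min_confidence: float = 0.65,
) -> str:
    # Stage 1: regex. A refusal match wins even if a success pattern also
    # matches, and a REFUSED verdict is never promoted by later stages.
    if any(re.search(p, reply, re.IGNORECASE) for p in case.refusal_patterns):
        return REFUSED
    if any(re.search(p, reply, re.IGNORECASE) for p in case.success_indicators):
        return VULNERABLE

    # Stage 2 (optional): promote AMBIGUOUS on a cosine match with any anchor.
    if embed is not None and case.success_embeddings:
        reply_vec = embed(reply)
        if any(cosine(reply_vec, embed(anchor)) >= sim_threshold
               for anchor in case.success_embeddings):
            return VULNERABLE

    # Stage 3 (optional): the judge overrides only when it is confident enough.
    if judge is not None:
        verdict, confidence = judge(reply)
        if confidence >= min_confidence:
            return verdict

    # Still AMBIGUOUS: the caller emits no Finding.
    return AMBIGUOUS
```

Returning early on a refusal match is what keeps REFUSED ahead of every promotion path in this sketch: the embedding and judge stages only ever see replies that the regex stage left AMBIGUOUS.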
LLM-as-judge
Five judge providers ship out of the box:

| Provider | Notes |
|---|---|
| openai-chat | Any OpenAI-compatible chat endpoint. JSON protocol baked into the system prompt. |
| executable | Local command receives JSON on stdin, returns JSON on stdout. Air-gapped friendly. |
| llama-guard | Llama Guard 3 (8B). Parses the official safe/unsafe S1..S14 reply and maps S-codes onto OWASP LLM categories. |
| granite-guardian | IBM Granite Guardian 3.x. Yes/No protocol with optional risk dimension. |
| openai-moderation | OpenAI /moderations API. Threshold-graded; cheap and unaffected by `<think>` traces, so recommended for reasoning-model targets. |

    redteam:
      judge:
        enabled: true
        provider: openai-moderation
        endpoint: https://api.openai.com/v1/moderations
        model: omni-moderation-latest
        headers:
          Authorization: "Bearer sk-…"
        min_confidence: 0.65
        unsafe_threshold: 0.4
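To make "threshold-graded" concrete, here is a minimal sketch of grading a moderation response against `unsafe_threshold`. The `results[0].category_scores` shape follows the public OpenAI /moderations response; the `grade_moderation` function and the rule of comparing the highest category score to the threshold are assumptions for illustration, not Pencheff's documented behavior. The HTTP request itself is omitted.

```python
from typing import Any, Mapping, Tuple


def grade_moderation(
    response: Mapping[str, Any],
    unsafe_threshold: float = 0.4,
) -> Tuple[bool, str, float]:
    """Return (is_unsafe, top_category, top_score) for one moderated input.

    Assumed rule: the reply counts as unsafe when its highest category
    score reaches the threshold.
    """
    scores = response["results"][0]["category_scores"]
    top_category, top_score = max(scores.items(), key=lambda kv: kv[1])
    return top_score >= unsafe_threshold, top_category, top_score


# Example with a hand-written response body (scores are made up).
sample = {"results": [{"category_scores": {"violence": 0.52, "hate": 0.01}}]}
is_unsafe, category, score = grade_moderation(sample, unsafe_threshold=0.4)
```

Because the grade depends only on numeric scores, any reasoning text the target emits is scored like the rest of the reply and cannot derail a judge's output format.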
Reporting
| Format | Where | Notes |
|---|---|---|
| Markdown | reporting.render_red_team_markdown | CI comments, Slack |
| HTML | reporting_extras.render_html | Self-contained, embedded CSS, no JS, email-able |
| CSV | reporting_extras.render_csv | One row per Finding; stable columns |
| JSON | --output-format json from CLI | Full Finding shape + summary + optional regression diff |
| JUnit XML | reporting.render_junit_xml | CI fail-on-threshold |
| Prometheus | reporting.render_prometheus_metrics | Pair with the Grafana dashboard |
A/B comparison & regression detection
GET /scans/{a}/compare/{b} returns a structured diff (regressions,
fixes, common failures). The web UI exposes the same diff at
/scans/compare?a=…&b=…. Use it to gate PRs on safety regressions
or to A/B different model versions on the same suite.
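A PR gate built on this endpoint might look like the sketch below. Only the route `GET /scans/{a}/compare/{b}` comes from the docs; the `regressions` key name, the bearer-token header, and the `fetch_diff` and `gate` helpers are assumptions, since the response schema is not shown here.

```python
import json
import urllib.request
from typing import Any, Mapping, Optional


def fetch_diff(base_url: str, scan_a: str, scan_b: str,
               token: Optional[str] = None) -> Mapping[str, Any]:
    """Fetch the structured diff between two scans (auth scheme assumed)."""
    url = f"{base_url.rstrip('/')}/scans/{scan_a}/compare/{scan_b}"
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def gate(diff: Mapping[str, Any], max_regressions: int = 0) -> int:
    """Return a process exit code: 0 to pass the PR, 1 to block it."""
    regressions = diff.get("regressions", [])  # key name is an assumption
    return 0 if len(regressions) <= max_regressions else 1
```

In CI, the baseline scan would be `a` and the PR's scan `b`; the job exits with `gate(fetch_diff(...))` so that any new regression fails the check while fixes and already-known failures do not.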
Share-by-link
POST /scans/{id}/share?ttl_seconds=604800 returns a Fernet-encrypted
token. The public route GET /share/llm/{token} renders the report
as HTML / Markdown / CSV / JSON without auth — token expiry is the
only revocation. Available only for kind: "llm" scans.
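Since expiry is the only way a link stops working, it helps to derive `ttl_seconds` from an explicit duration rather than a magic number. The sketch below only builds the request URL; the `share_request_url` helper and the example host are hypothetical.

```python
from datetime import timedelta
from urllib.parse import quote, urlencode


def share_request_url(base_url: str, scan_id: str, ttl: timedelta) -> str:
    """Build the POST URL for a share token that expires after `ttl`."""
    query = urlencode({"ttl_seconds": int(ttl.total_seconds())})
    return f"{base_url.rstrip('/')}/scans/{quote(scan_id, safe='')}/share?{query}"


# Seven days is 604800 seconds, the value used in the route above.
url = share_request_url("https://pencheff.example", "abc123", timedelta(days=7))
```

Keeping the TTL short is the practical mitigation here: a leaked link stays readable without auth until the token expires.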
Grafana dashboard
The repo ships a canonical dashboard at
docs/grafana/pencheff-llm-redteam.json:
total failures, per-OWASP-LLM breakdown, per-strategy table, severity
donut, latency p50/p95/p99, regression rate, cost trend.