Pencheff

AI security

Evidence and cost

Conversation traces, pass/fail judges, token accounting, retries, and reproducible prompts.

ScopeLLM Red Team

Test AI products before attackers do: prompt attacks, tool abuse, data leakage, unsafe output, guardrail bypass, multi-agent workflows, and runtime policy enforcement.

OutputUnified evidence

Findings, reports, dashboards, exports, integrations, and retests all read from the same normalized record.

MethodDeterministic first

Pencheff favors repeatable checks, then uses AI for triage, enrichment, orchestration, and remediation where it adds signal.

From the Pencheff docs

Verdict pipeline

LLM Red Team — adversarial testing for chat endpoints

/features/llm-redteam

For each probe, the engine evaluates verdicts in order:

  1. Regexsuccess_indicators ∧ ¬refusal_patterns → VULNERABLE. Refusal beats success.
  2. Embedding similarity (optional) — when a TestCase declares success_embeddings: [text, …] and an embedder is configured, an AMBIGUOUS verdict can be promoted by cosine match against any anchor.
  3. LLM-as-judge (optional) — still-AMBIGUOUS verdicts go to a judge model. Judge confidence ≥ min_confidence is required to override.
  4. Factuality (LLM09 only) — KB-grounded contradiction check via the judge.

REFUSED beats every promotion path. AMBIGUOUS emits no Finding — that's how the false-positive rate stays at zero.

From the Pencheff docs

LLM-as-judge

LLM Red Team — adversarial testing for chat endpoints

/features/llm-redteam

Five judge providers ship out of the box:

ProviderNotes
openai-chatAny OpenAI-compatible chat endpoint. JSON-protocol baked into the system prompt.
executableLocal command receives JSON on stdin, returns JSON on stdout. Air-gapped friendly.
llama-guardLlama Guard 3 (8B). Parses the official safe/unsafe S1..S14 reply and maps S-codes onto OWASP LLM categories.
granite-guardianIBM Granite Guardian 3.x. Yes/No protocol with optional risk dimension.
openai-moderationOpenAI /moderations API. Threshold-graded; cheap and unaffected by <think> traces — recommended for reasoning-model targets.
redteam:
  judge:
    enabled: true
    provider: openai-moderation
    endpoint: https://api.openai.com/v1/moderations
    model: omni-moderation-latest
    headers:
      Authorization: "Bearer sk-…"
    min_confidence: 0.65
    unsafe_threshold: 0.4

From the Pencheff docs

Reporting

LLM Red Team — adversarial testing for chat endpoints

/features/llm-redteam
FormatWhereNotes
Markdownreporting.render_red_team_markdownCI comments, Slack
HTMLreporting_extras.render_htmlSelf-contained, embedded CSS, no JS, email-able
CSVreporting_extras.render_csvOne row per Finding; stable columns
JSON--output-format json from CLIFull Finding shape + summary + optional regression diff
JUnit XMLreporting.render_junit_xmlCI fail-on-threshold
Prometheusreporting.render_prometheus_metricsPair with the Grafana dashboard

A/B comparison & regression detection

GET /scans/{a}/compare/{b} returns a structured diff (regressions, fixes, common failures). The web UI exposes the same diff at /scans/compare?a=…&b=…. Use it to gate PRs on safety regressions or to A/B different model versions on the same suite.

Share-by-link

POST /scans/{id}/share?ttl_seconds=604800 returns a Fernet-encrypted token. The public route GET /share/llm/{token} renders the report as HTML / Markdown / CSV / JSON without auth — token expiry is the only revocation. Available only for kind: "llm" scans.

Grafana dashboard

The repo ships a canonical dashboard at docs/grafana/pencheff-llm-redteam.json: total failures, per-OWASP-LLM breakdown, per-strategy table, severity donut, latency p50/p95/p99, regression rate, cost trend.

Related

Keep exploring AI Security.