AI agents

Tool-calling scan agent for testing LLM apps, chatbots, and agentic workflows.

AI security coverage tests LLM endpoints, chatbots, RAG workflows, tool-calling agents, memory, connectors, runtime guardrails, and policy controls against realistic adversarial prompts and workflows.

OWASP LLM Top 10: judged

8 coverage areas

5 operator steps

4 evidence fields

Transcript · Judge · Tokens · Guardrail

Coverage maps to the OWASP LLM Top 10 categories tested below, with judge-backed verdicts.

Scope: AI Security

Section: Platform

Method: Deterministic-first

Output: Unified evidence

Profile: AI security

Coverage

What does the AI agents scan test?

OWASP LLM Top 10 coverage for prompt injection, sensitive information disclosure, supply chain, data leakage, plugins, agency, overreliance, and model theft.
Jailbreak strategies, roleplay, encoding, payload splitting, multilingual variants, custom datasets, and judge-backed scoring.
Agentic tests for tool authorization, memory poisoning, context exfiltration, planner hijacking, and unsafe side effects.
Sentry runtime guardrails, HTTP sidecars, LiteLLM plugins, MCP middleware, PII, secrets, unsafe HTML, and tool authorization checks.
AI governance mapping to OWASP LLM, MITRE ATLAS, NIST AI RMF, EU AI Act, ISO/IEC 42001, GDPR, and SOC 2.

Execution

How does Pencheff run this?

Register an LLM endpoint, chatbot, model gateway, MCP host, or agent workflow.
Choose built-in categories, datasets, guardrails, custom prompts, and optional judge settings.
Run adversarial campaigns across prompt, tool, memory, retrieval, output, and policy paths.
Classify failures by category, strategy, severity, transcript, token cost, and guardrail recommendation.
Turn passing and failing prompts into regression suites for releases and model upgrades.

Evidence

What evidence does this produce?

Prompt, response, tool call, policy decision, transcript, category, strategy, judge result, and confidence.
Recommended guardrails with exact unsafe behavior, enforcement point, and regression prompt.
Token usage, model/provider metadata, retry behavior, and cost-oriented observability.
Governance mappings for AI risk, safety, privacy, and compliance programs.

Controls

How is this kept safe to run?

Tests can be run through HTTP, chat-completions, LiteLLM, MCP, or custom adapters.
Guardrail recommendations stay tied to the scan that exposed the failure.
Agentic testing focuses on authorization, context boundaries, and side-effect control.
Runtime policy checks can be placed before prompts, after responses, or around tools.

From the Pencheff docs

AI agent swarm — parallel multi-agent scanning

Pencheff's swarm mode replaces the legacy single-agent loop as the default execution path for every scan. Instead of one agent iterating through tool calls sequentially, the orchestrator fans out work across 19 specialised agents arranged in three phases. Recon runs first and produces a frozen snapshot of the attack surface; 12 breaker agents then attack the target in parallel from that snapshot; 6 synthesis agents then process the merged findings in parallel to produce chains, compliance mappings, impact proofs, PoCs, screenshots, and admin access evidence.

The net effect is substantially deeper coverage in roughly the same wallclock time as the legacy loop, plus structured operator output that maps every finding through the full exploit chain → compliance → reproducibility path.

The swarm is opt-out: set SWARM_ENABLED=false to instantly revert every scan to the legacy single-agent path with no other changes required.

Pipeline shape

┌─────────────────────────────────────────────────────────────┐
│  Phase 1: ReconAgent (1 agent)                              │
│  Map attack surface → frozen ReconSnapshot                  │
└────────────────────────┬────────────────────────────────────┘
                         │ snapshot (read-only)
         ┌───────────────▼───────────────────────────────────┐
         │  Phase 2: Breakers (12 agents, fully parallel)    │
         │                                                    │
         │  InjectionAgent   ClientSideAgent   AuthAgent      │
         │  AuthzAgent       APIAgent          InfraAgent     │
         │  CloudAgent       LLMRedTeamAgent   SupplyChainAgent│
         │  K8sAgent   ActiveDirectoryAgent   MobileAppAgent  │
         └───────────────┬───────────────────────────────────┘
                         │ merge findings from all breakers
         ┌───────────────▼───────────────────────────────────┐
         │  Phase 3: Synthesis (6 agents, fully parallel)    │
         │                                                    │
         │  ChainAgent         ComplianceAgent                │
         │  ProofOfImpactAgent PayloadCraftingAgent           │
         │  EvidenceCaptureAgent AdminAccessAgent             │
         └───────────────────────────────────────────────────┘

The 19 agents

Agent	Phase	Mandate	Tools
ReconAgent	1	Map attack surface; produce frozen snapshot	`recon_passive`, `recon_active`, `recon_api_discovery`, `scan_waf`, `authenticated_crawl`
InjectionAgent	2	SQLi/NoSQLi/XXE/SSTI/cmdi + path traversal + file upload	`scan_injection`, `scan_file_handling`, `oast_*`, `test_endpoint`
ClientSideAgent	2	Reflected/DOM XSS, CSRF, open redirect, CORS	`scan_client_side`, `scan_dom_xss`, `test_endpoint`
AuthAgent	2	Login weakness, JWT, OAuth, MFA bypass	`scan_auth`, `scan_oauth`, `scan_mfa_bypass`, `test_endpoint`
AuthzAgent	2	IDOR, vertical/horizontal privesc — quiet-quits when no credentials supplied	`scan_authz`, `test_endpoint`
APIAgent	2	API/GraphQL flaws, websocket, business logic	`scan_api`, `scan_websocket`, `scan_business_logic`, `test_endpoint`
InfraAgent	2	TLS/headers, HTTP smuggling, CRLF, subdomain takeover	`scan_infrastructure`, `scan_advanced`, `scan_subdomain_takeover`, `run_security_tool`
CloudAgent	2	Cloud misconfig, IAM, public blobs, blind SSRF callbacks	`scan_cloud`, `oast_*`, `test_endpoint`
LLMRedTeamAgent	2	Prompt injection, jailbreak, system-prompt extraction on AI/LLM endpoints	`scan_llm_red_team`, `test_endpoint`
SupplyChainAgent	2	Exposed dependency manifests, outdated client-side libraries	`run_security_tool`, `test_endpoint`
K8sAgent	2	Kubernetes control-plane exposure, RBAC misconfig, exposed metrics	`run_security_tool`, `test_endpoint`
ActiveDirectoryAgent	2	Active Directory attack paths: BloodHound relationship graph, Certipy ESC1–ESC8 cert template abuse, CrackMapExec SMB enum, Impacket secretsdump	`scan_active_directory`, `test_endpoint`
MobileAppAgent	2	Android/iOS static analysis: MobSF enrichment, manifest exported-component check, secrets sweep across decompiled output	`scan_mobile_app`
ChainAgent	3	Multi-step attack chains; blast-radius scoring; cross-system chain detection	`exploit_chain_suggest`, `test_chain`, `test_endpoint`
ComplianceAgent	3	Map findings to PCI-DSS/HIPAA/SOC2/GDPR controls (read-only)	`get_findings`
ProofOfImpactAgent	3	Schema-only impact assessment via sqlmap (`--dbs/--tables/--columns/--count` only). No row data extracted.	`run_security_tool`, `test_endpoint`
PayloadCraftingAgent	3	Generate `curl` + Python `requests` PoCs per finding (read-only synthesis)	`get_findings`
EvidenceCaptureAgent	3	Playwright screenshot per verified high/critical finding with PII redaction	`capture_evidence`
AdminAccessAgent	3	Per-finding gated; when verified admin access exists, drives Playwright into the admin panel read-only: front-page screenshot, ≤ 5 menu links enumerated, then immediate logout. No state-changing tools in registry.	`playwright_navigate` (GET-only), `playwright_screenshot`, `playwright_enumerate_links`, `playwright_logout`

Phase 1: Recon

The ReconAgent runs first and exclusively. It calls recon_passive (DNS, HTTP headers, technology fingerprinting, Shodan metadata if configured), recon_active (path enumeration, port probe on the primary host), and recon_api_discovery (OpenAPI/GraphQL schema fetch, common API prefixes). It also runs scan_waf to detect WAF presence and authenticated_crawl when credentials are supplied.

The result is a frozen ReconSnapshot: a serialised record of discovered endpoints, technology stack fingerprints, WAF type, and discovered API specs. All Phase 2 agents receive an identical read-only copy of this snapshot — they cannot extend it or communicate with each other.

Graceful degradation: if any individual recon call returns a transient error (network timeout, tool error), the agent retries once with a 5 s backoff. If the retry also fails, the snapshot is emitted with the successfully-collected endpoints and a partial=true flag. The orchestrator proceeds — Phase 2 agents see the partial snapshot and work on what exists. Only a fully empty snapshot (zero endpoints) triggers the catastrophic fallback (see below).
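
The retry-then-degrade behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Pencheff's implementation: `call_with_retry`, `collect_snapshot`, and the dict-shaped snapshot are hypothetical names. Only the single retry, the 5 s backoff, and the `partial` flag come from the text.

```python
import time
from typing import Callable, Dict, List


def call_with_retry(call: Callable[[], List[str]], backoff_sec: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Run one recon call; on failure wait once and retry a single time."""
    try:
        return call()
    except Exception:
        sleep(backoff_sec)  # single backoff before the one retry
        return call()       # a second failure propagates to the caller


def collect_snapshot(calls: Dict[str, Callable[[], List[str]]],
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Hypothetical snapshot builder: keep what succeeded, flag what did not."""
    endpoints: List[str] = []
    partial = False
    for name, call in calls.items():
        try:
            endpoints.extend(call_with_retry(call, sleep=sleep))
        except Exception:
            partial = True  # this recon call failed twice; continue without it
    return {"endpoints": endpoints, "partial": partial}
```

Passing `sleep` as a parameter is only a convenience for testing the sketch without waiting; a snapshot with `partial` set and at least one endpoint would still be handed to Phase 2.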

Phase 2: Breakers (parallel fan-out)

Once the ReconSnapshot is sealed, the orchestrator spawns all 12 breaker agents simultaneously via asyncio.gather. Each agent receives:

A fresh, isolated pencheff session seeded with the read-only snapshot (so every breaker starts from the same known state).
Its own per-agent tool registry — each agent is granted only the tools it needs, reducing the chance of accidental cross-domain tool calls.
A turn budget drawn from the SWARM_TURNS_* env-var family (see Configuration).

Notable behaviours:

AuthzAgent quiet-quit: if the scan has no credentials, AuthzAgent emits a single informational note ("no credentials — skipping authz scan") and exits cleanly. Its absence is non-fatal.
Per-breaker retry: each breaker retries its first failing tool call once with a 10 s backoff. After that, partial findings are committed and the agent exits — a breaker crash does not bring down the swarm.
Partial-failure tolerance: the orchestrator waits for all 12 breakers, collects results from those that succeeded, and logs a warning for those that failed. The merge step proceeds on whatever findings exist.
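
One way to get this partial-failure tolerance out of `asyncio.gather` is its `return_exceptions=True` flag. The sketch below assumes each breaker is an async callable returning a list of finding dicts; `run_breakers` and that shape are illustrative assumptions, not the orchestrator's actual interface.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


async def run_breakers(
    breakers: Dict[str, Callable[[], Awaitable[List[dict]]]],
) -> Tuple[List[dict], List[str]]:
    """Run every breaker concurrently; return (findings, failed agent names)."""
    names = list(breakers)
    # return_exceptions=True keeps one crashed breaker from failing the gather.
    results = await asyncio.gather(
        *(breakers[name]() for name in names), return_exceptions=True
    )
    findings: List[dict] = []
    failed: List[str] = []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            failed.append(name)      # an orchestrator would log a warning here
        else:
            findings.extend(result)  # keep whatever this breaker produced
    return findings, failed
```

With this shape, a crash in one agent shows up as an entry in `failed` while the findings from the other agents are still returned.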

After all breakers finish, their findings are de-duplicated by a deterministic key (endpoint|parameter|technique|title) and merged into a single findings list that Phase 3 agents consume.
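
A minimal sketch of de-duplication on that key, assuming findings are plain dicts with `endpoint`, `parameter`, `technique`, and `title` fields. The field names, the first-finding-wins policy, and the absence of any normalisation are assumptions made for illustration.

```python
from typing import Dict, List


def finding_key(finding: dict) -> str:
    """Build the endpoint|parameter|technique|title key for one finding."""
    fields = ("endpoint", "parameter", "technique", "title")
    return "|".join(str(finding.get(field, "")) for field in fields)


def merge_findings(per_agent: List[List[dict]]) -> List[dict]:
    """Flatten per-agent findings, keeping the first finding seen per key."""
    merged: Dict[str, dict] = {}
    for findings in per_agent:
        for finding in findings:
            merged.setdefault(finding_key(finding), finding)
    return list(merged.values())
```

Because the key is built only from the four fields, two agents reporting the same issue on the same endpoint and parameter collapse into one entry regardless of which agent reported it.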

Phase 3: Synthesis (6 agents in parallel)

All six synthesis agents start simultaneously once the merged findings list is available. Each reads the merged findings and nothing else. None of them probes the target again except to capture a specific screenshot, run a schema-only sqlmap call (--dbs/--tables/--columns/--count), or make the read-only admin panel visit.

Each agent writes its output into a named section of the operator summary. Failure of any one synthesis agent is non-fatal: the other five still deliver their sections, and the failed section is noted as "unavailable" in the report rather than crashing the whole scan.

Operator-visible output sections

The final operator summary stitches the Phase 3 agent outputs into a structured document in this order:

Lead paragraph (from ChainAgent) — the top attack chain with its blast-radius score and any cross-system chain it detected.
## Compliance mapping (from ComplianceAgent) — which PCI-DSS, HIPAA, SOC 2, and GDPR controls are affected.
## Proof of Impact (from ProofOfImpactAgent) — schema-level evidence: database names, table names, column names, row counts. No customer data is extracted.
## Reproducible PoCs (from PayloadCraftingAgent) — one curl command and one Python requests snippet per verified high/critical finding.
## Evidence Screenshots (from EvidenceCaptureAgent) — inline PNG thumbnails of every verified high/critical finding with PII redacted.
## Admin Panel Access (Verified) (from AdminAccessAgent) — a screenshot of the admin panel front page plus up to 5 enumerated menu links. This section is only present when verified admin access was confirmed by a Phase 2 breaker.

Catastrophic fallback

If the ReconAgent produces a snapshot with zero endpoints, or if all 12 Phase 2 breakers fail, the orchestrator falls back automatically to the legacy single-agent loop (agent_runner.run_agent). The scan continues — no operator action required. A swarm_fallback: true flag is set on the scan record and visible in the scan-detail API response and the UI banner.
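
The two fallback triggers reduce to a small predicate. The sketch below is an assumption about how the decision could be expressed; `should_fall_back` and its arguments are hypothetical. How a quiet-quit AuthzAgent is counted is not stated in the text, so the caller decides what goes into `breaker_ok`.

```python
from typing import List


def should_fall_back(endpoint_count: int, breaker_ok: List[bool]) -> bool:
    """True when the swarm cannot proceed and the legacy loop should take over."""
    if endpoint_count == 0:
        return True  # recon found nothing to attack
    # Every breaker failed; an empty list also counts as "no breaker succeeded".
    return not any(breaker_ok)
```

A partial snapshot with at least one endpoint does not trigger the fallback under this predicate, which matches the graceful-degradation rule for recon.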

Killswitch: setting SWARM_ENABLED=false in the API environment immediately disables swarm mode for all new scans. In-flight scans are unaffected. This is the fastest path to reverting to the legacy path if an unexpected issue arises.

Cost and performance

Typical numbers for a deep scan against a medium-complexity target:

Metric	Typical value
Wallclock time	~33 minutes
Total input tokens	~411 K
Total output tokens	~86 K
Total LLM calls	~109 calls

Per-tier turn budgets (controlled by SWARM_TURNS_* env vars):

Tier	ReconAgent	Each breaker	Each synthesis agent
`quick`	8	12	6
`standard`	15	25	10
`deep`	25	50	20

Configuration

All swarm behaviour is driven by environment variables on the API container. Field naming follows apps/api/pencheff_api/config.py.

Variable	Default	Description
`SWARM_ENABLED`	`true`	Master on/off switch. `false` reverts every scan to the legacy single-agent loop.
`SWARM_TURNS_RECON_QUICK`	`8`	Turn budget for `ReconAgent` on `quick` profile.
`SWARM_TURNS_RECON_STANDARD`	`15`	Turn budget for `ReconAgent` on `standard` profile.
`SWARM_TURNS_RECON_DEEP`	`25`	Turn budget for `ReconAgent` on `deep` profile.
`SWARM_TURNS_BREAKER_QUICK`	`12`	Turn budget per Phase 2 breaker on `quick`.
`SWARM_TURNS_BREAKER_STANDARD`	`25`	Turn budget per Phase 2 breaker on `standard`.
`SWARM_TURNS_BREAKER_DEEP`	`50`	Turn budget per Phase 2 breaker on `deep`.
`SWARM_TURNS_SYNTHESIS_QUICK`	`6`	Turn budget per Phase 3 synthesis agent on `quick`.
`SWARM_TURNS_SYNTHESIS_STANDARD`	`10`	Turn budget per Phase 3 synthesis agent on `standard`.
`SWARM_TURNS_SYNTHESIS_DEEP`	`20`	Turn budget per Phase 3 synthesis agent on `deep`.
`SWARM_TURNS_CHAIN_QUICK`	`6`	Override for `ChainAgent` specifically on `quick` (defaults to synthesis budget if unset).
`SWARM_TURNS_CHAIN_DEEP`	`30`	Override for `ChainAgent` on `deep`.
`SWARM_BREAKER_RETRY_ATTEMPTS`	`1`	How many times a failing breaker tool call is retried.
`SWARM_BREAKER_RETRY_BACKOFF_SEC`	`10`	Seconds to wait between retry attempts.
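
As an illustration of how these variables could be resolved, the sketch below looks up a turn budget by role and tier, with defaults copied from the table. It is not the contents of config.py: `turn_budget`, `swarm_enabled`, and the rule that only the literal string `false` disables the swarm are assumptions.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the table above.
DEFAULT_TURNS = {
    ("RECON", "QUICK"): 8, ("RECON", "STANDARD"): 15, ("RECON", "DEEP"): 25,
    ("BREAKER", "QUICK"): 12, ("BREAKER", "STANDARD"): 25, ("BREAKER", "DEEP"): 50,
    ("SYNTHESIS", "QUICK"): 6, ("SYNTHESIS", "STANDARD"): 10, ("SYNTHESIS", "DEEP"): 20,
    ("CHAIN", "QUICK"): 6, ("CHAIN", "DEEP"): 30,
}


def turn_budget(role: str, tier: str, env: Optional[Mapping[str, str]] = None) -> int:
    """Resolve SWARM_TURNS_<ROLE>_<TIER>, falling back to the documented default."""
    env = os.environ if env is None else env
    role, tier = role.upper(), tier.upper()
    raw = env.get(f"SWARM_TURNS_{role}_{tier}")
    if raw is not None:
        return int(raw)
    if (role, tier) in DEFAULT_TURNS:
        return DEFAULT_TURNS[(role, tier)]
    if role == "CHAIN":
        # No chain override listed for this tier: use the synthesis budget.
        return turn_budget("SYNTHESIS", tier, env)
    raise KeyError(f"no turn budget for {role}/{tier}")


def swarm_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Assumed parsing: anything other than 'false' leaves the swarm on."""
    env = os.environ if env is None else env
    return env.get("SWARM_ENABLED", "true").strip().lower() != "false"
```

For example, `turn_budget("chain", "standard", {})` falls through to the synthesis budget because the table lists no `SWARM_TURNS_CHAIN_STANDARD` default.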

Consent screen

Because the swarm calls significantly more external endpoints and can demonstrate real proof-of-impact (schema enumeration, admin access), Pencheff requires explicit operator consent before any scan is created.

The scan-creation form (and the POST /scans API body) now includes a consent_payload block. The operator must:

Review the disclosed-actions catalogue for the agent classes they are enabling. Each agent class lists exactly what it will probe and what data it may read.
Paste or type an authorization statement of at least 50 characters (typically: "I am authorised to test [target] as of [date] and I accept the disclosed actions above.").
Tick the "I confirm" checkbox.

The consent_payload is persisted on Scan.consent_payload (JSONB) and is included in every audit export. The API rejects POST /scans if consent_payload is absent or if the authorization text is shorter than 50 characters.
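
The two rejection rules can be sketched as a small validator. The field name `authorization_text` and the function `consent_error` are hypothetical; the text does not give the actual schema of consent_payload, only the two conditions checked here.

```python
from typing import Optional

MIN_AUTHORIZATION_CHARS = 50  # minimum length stated in the text


def consent_error(consent_payload: Optional[dict]) -> Optional[str]:
    """Return a rejection reason, or None when the payload is acceptable."""
    if not consent_payload:
        return "consent_payload is required"
    text = consent_payload.get("authorization_text", "")  # hypothetical field name
    if len(text) < MIN_AUTHORIZATION_CHARS:
        return "authorization text must be at least 50 characters"
    return None
```

Whether whitespace counts toward the 50 characters is not specified, so this sketch measures the raw string length.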

Note: The consent model described here covers the current non-destructive swarm only. Agents that mutate target state, extract row data, or impact availability require a separate expanded consent flow documented in docs/superpowers/specs/2026-05-06-destructive-agents-blueprint.md (repo path, not a published docs route) and are not enabled in any current release.

LLM trace persistence

Every LLM call made by every swarm agent is recorded in the scan_llm_traces database table. Each row stores:

agent — which agent made the call (InjectionAgent, ChainAgent, etc.)
turn — the agent's conversation turn number at the time of the call
request_messages — the full messages array sent to the LLM (JSONB)
response — the raw response (JSONB), including tool-call blocks
input_tokens, output_tokens, cache_read_tokens — token counts
reasoning — the reasoning/thinking block if the model returned one

Traces are accessible via GET /scans/{id}/llm-traces (auth required). They are also summarised inline in the scan assessment log:

[InjectionAgent] LLM turn=3 in=1234t out=567t cached=800t · calls=[scan_injection]
[ChainAgent] LLM turn=1 in=3421t out=912t cached=2800t · calls=[exploit_chain_suggest,test_chain]
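
If you need to pull token counts out of the assessment log rather than the API, the summary lines can be parsed with a regular expression. The pattern below is inferred from the two sample lines only, so treat it as an assumption about the format rather than a documented contract; `parse_trace_line` is a hypothetical helper.

```python
import re
from typing import Optional

# Pattern inferred from the two sample log lines above.
TRACE_LINE = re.compile(
    r"^\[(?P<agent>\w+)\] LLM turn=(?P<turn>\d+) "
    r"in=(?P<input>\d+)t out=(?P<output>\d+)t cached=(?P<cached>\d+)t"
    r" · calls=\[(?P<calls>[^\]]*)\]$"
)


def parse_trace_line(line: str) -> Optional[dict]:
    """Parse one inline trace summary; return None if the line does not match."""
    match = TRACE_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "agent": match["agent"],
        "turn": int(match["turn"]),
        "input_tokens": int(match["input"]),
        "output_tokens": int(match["output"]),
        "cache_read_tokens": int(match["cached"]),
        "calls": [name for name in match["calls"].split(",") if name],
    }
```

Summing `input_tokens` and `output_tokens` over the parsed rows gives a per-agent token total without querying scan_llm_traces directly.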

Evidence screenshots

When EvidenceCaptureAgent runs, it opens a Playwright browser context, navigates to the vulnerable URL with the session's auth cookies, and captures a full-page PNG. PII is redacted before the PNG is stored.

Screenshots are stored at ~/.pencheff/evidence/<scan_id>/<finding_id>.png inside the API worker container.

They are served via GET /scans/{id}/evidence/{finding_id}.png (auth required). A 404 is returned if no screenshot exists for that finding.

The Evidence Screenshots section of the operator summary embeds each PNG inline via a signed URL that expires after 24 hours.
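
The text does not say how the signed URLs are built. As a generic illustration of an expiring signed URL, the sketch below uses an HMAC-SHA256 over the path and an expiry timestamp. The query parameter names, the helper functions, and the signing scheme are all assumptions; only the 24-hour lifetime comes from the text.

```python
import hashlib
import hmac
import time
from typing import Optional

TTL_SECONDS = 24 * 60 * 60  # 24-hour expiry stated in the text


def sign_evidence_url(path: str, secret: bytes, now: Optional[float] = None) -> str:
    """Append an expiry timestamp and an HMAC signature to a URL path."""
    expires = int(time.time() if now is None else now) + TTL_SECONDS
    payload = f"{path}?expires={expires}".encode()
    sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"


def verify_evidence_url(path: str, expires: int, sig: str, secret: bytes,
                        now: Optional[float] = None) -> bool:
    """Accept only unexpired URLs whose signature matches."""
    current = time.time() if now is None else now
    if current > expires:
        return False
    payload = f"{path}?expires={expires}".encode()
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)  # constant-time comparison
```

Signing the expiry together with the path means a client cannot extend the lifetime by editing the `expires` parameter.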

What we DON'T do

The current swarm is explicitly non-destructive:

No row data is extracted from databases — ProofOfImpactAgent uses sqlmap with --dbs/--tables/--columns/--count only.
No state-changing requests are issued — every breaker and synthesis agent operates read-only.
No availability degradation — no slow-loris, query-of-death, or resource exhaustion testing.
No out-of-scope lateral movement.

Capabilities that would change any of these properties require a separate expanded consent model, legal review, and additional infrastructure described in docs/superpowers/specs/2026-05-06-destructive-agents-blueprint.md (repo path). None of those capabilities are implemented in the current release.

From the Pencheff docs

Pencheff Sentry — runtime LLM guardrail

Sentry is a runtime LLM guardrail that sits between your application and the model provider. It blocks prompt injection, PII / secret exfiltration, unsafe HTML in model output, and unbounded consumption as they happen, rather than catching them post-hoc on the next Pencheff red-team scan.

Same OWASP-LLM-Top-10 (2025) taxonomy as the offline scanner. Same detector library. Inline.

Modes

Mode	What it is	Best for
HTTP proxy sidecar	A FastAPI service in front of an OpenAI-compatible upstream	Drop-in URL change for any OpenAI-compatible provider
LiteLLM plugin	`pre_call` / `post_call` hooks	Stacks already running LiteLLM
MCP middleware	Wraps the MCP tool-call path	LLM agents that call tools — blocks unsafe tool args inline

The Cloudflare Worker mode (edge deployment) is on the v0.8 roadmap.

Hosted gateway (per-target)

If you'd rather not run the sidecar, register an LLM target in Pencheff, configure its guardrails, and point your app at the hosted gateway — no install, policy managed in the UI:

POST https://api.pencheff.com/proxy/<TARGET_ID>/v1/chat/completions
Authorization: Bearer <PENCHEFF_API_KEY>

The gateway runs the same OWASP-LLM detector chain on the prompt and response, plus two capabilities that build on it:

Agent firewall — gate the tool calls the model makes (block SSRF / secret exfil / destructive actions, require approval, or redact credential-shaped args). Off by default, per target.
Runtime traces — every request recorded as a span tree (LLM call · detector verdict · firewall decision), viewable on the target page.

Configure guardrails at Targets → (LLM target) → Edit → Guardrails, and the firewall just below it. See also the memory scanner for auditing agent memory / vector stores.

Quick start

pip install pencheff-sentry

pencheff-sentry serve \
  --upstream https://api.openai.com/v1 \
  --port 4242 \
  --max-output-tokens 4000

Then change your application's OpenAI base URL from https://api.openai.com/v1 to http://localhost:4242. Sentry forwards allowed requests verbatim and blocks unsafe ones with a clean 403 sentry_blocked response that includes the OWASP-LLM category.

{
  "error": {
    "message": "Pencheff Sentry blocked: prompt injection (direct-override)",
    "type": "guardrail_block",
    "code": "sentry_blocked",
    "pencheff_sentry": {
      "category": "LLM01",
      "detector": "direct-override"
    }
  }
}
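
On the client side, a block can be told apart from other 403 responses by checking the `code` field shown above. The helper below is a sketch that relies only on that sample body; `sentry_block_info` is a hypothetical name and the check assumes the body shape does not vary.

```python
import json
from typing import Optional


def sentry_block_info(status_code: int, body: str) -> Optional[dict]:
    """Return the pencheff_sentry details if this response is a Sentry block."""
    if status_code != 403:
        return None
    try:
        error = json.loads(body).get("error", {})
    except (ValueError, AttributeError):
        return None  # not JSON, or not a JSON object
    if not isinstance(error, dict) or error.get("code") != "sentry_blocked":
        return None
    return error.get("pencheff_sentry", {})
```

An application could use the returned `category` to show a tailored refusal message instead of a generic upstream error.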

What it detects

OWASP LLM	Detector	Examples
LLM01	Prompt injection	`ignore previous instructions`, `pretend to be DAN`, `print your system prompt`, encoded variants
LLM02	PII / secrets	SSN, credit card, email, phone, AWS access key, OpenAI sk-, GitHub PAT shapes
LLM05	Unsafe output handling	`<script>` / `<iframe>` / `javascript:` / inline event handlers in model response
LLM10	Unbounded consumption	Output token ceiling configurable via `--max-output-tokens`

The full pattern set lives in pencheff_sentry/core.py — pure Python, no I/O, easy to extend.

LiteLLM plugin

import litellm
from pencheff_sentry.litellm_plugin import register

register(litellm)

# Sentry now intercepts every litellm.completion() call.
response = litellm.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "..."}],
)

pre_call raises litellm.BadRequestError on a blocked prompt. The post_call hook mutates a blocked response into a safe refusal string and stamps response.pencheff_sentry = {blocked, category, detector, reason} so downstream code can distinguish a guardrail-driven refusal from a model-native refusal.
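
Downstream code can check for that stamp before treating a refusal as model behaviour. The sketch below assumes the stamp is a dict with a `blocked` key, as the field list above suggests. `is_guardrail_refusal`, the stand-in response objects, and the `unsafe-html` detector name are hypothetical.

```python
from types import SimpleNamespace
from typing import Any


def is_guardrail_refusal(response: Any) -> bool:
    """True if the response carries a Sentry stamp marking it as blocked."""
    stamp = getattr(response, "pencheff_sentry", None)
    return isinstance(stamp, dict) and bool(stamp.get("blocked"))


# Stand-in objects for illustration; real code would pass the litellm response.
blocked = SimpleNamespace(pencheff_sentry={
    "blocked": True, "category": "LLM05",
    "detector": "unsafe-html", "reason": "unsafe output",
})
native = SimpleNamespace()  # a response Sentry did not stamp
```

Here `is_guardrail_refusal(blocked)` is true and `is_guardrail_refusal(native)` is false, so the two kinds of refusal can be counted separately.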

Audit log

Sentry never persists prompt or response bodies by default — auditors asking "did you log my customer's prompt?" get a clean answer. The opt-in audit log (--audit-log path.jsonl) records decisions only: verdict, detector, category, plus a SHA-256 hash of the prompt/response for correlation. Never the body itself.

{"ts":"2026-05-08T15:00:01Z","side":"prompt","verdict":"block","category":"LLM01","detector":"direct-override","reason":"prompt injection (direct-override)","prompt_hash":"a7c2..."}
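
A record of this shape can be produced without ever writing the body to disk. The sketch below mirrors the fields in the sample line; `audit_record` is a hypothetical helper, and naming the hash field `<side>_hash` for the response side is an assumption, since only `prompt_hash` appears in the sample.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(side: str, body: str, verdict: str, category: str,
                 detector: str, reason: str) -> str:
    """Build one JSONL audit line that stores a SHA-256 digest, not the body."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "side": side,
        "verdict": verdict,
        "category": category,
        "detector": detector,
        "reason": reason,
        f"{side}_hash": digest,  # e.g. prompt_hash; other sides are assumed
    }
    return json.dumps(record, separators=(",", ":"))
```

Because the digest is deterministic, the same prompt seen twice yields the same hash, which is what makes correlation possible without retaining the text.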

Default judge

The default judge is IBM Granite Guardian (Apache-2.0). Llama Guard 3 is opt-in via PENCHEFF_LLAMA_GUARD_ENABLED=1 — it ships under the Llama Community License (≤700 M MAU + attribution required), and Pencheff surfaces the license notice in every JudgeResult.reason so downstream consumers can reproduce it.

See features/llm-redteam for the full judge ensemble.

Extending the detector chain

from pencheff_sentry.core import GuardrailConfig, evaluate_prompt

# Register one extra regex detector alongside the built-in pattern set.
cfg = GuardrailConfig(
    extra_patterns=[
        # (regex, detector_name, owasp_category)
        (r"(?i)\binternal[- ]doc:[a-z0-9-]+\b", "internal-doc-leak", "LLM02"),
    ],
)
# user_prompt and refuse() are supplied by your application.
decision = evaluate_prompt(user_prompt, config=cfg)
if decision.verdict == "block":
    refuse(decision.reason)

Source

Package: pencheff-sentry on PyPI (separate from the main pencheff package).
Source tree: plugins/sentry/.
License: MIT.