# Issues and Judging
Investigator agents can be configured to produce either issues or scores, depending on what you need.
## Issues
Issues are structured findings about problems in your eval or agent. Investigator agents write issues when they detect:
- Test mis-specifications — Tasks with vague requirements or overly strict tests
- Environment problems — Missing dependencies, incorrect setup, or broken tooling
- Agent failures — Bugs in agent behavior or reasoning errors
Each issue includes:
- Description — What the problem is
- Evidence — Message references and supporting data
- Confidence score — How certain the investigator is about the finding
- Reproduction steps — Commands to verify the issue
Validator agents critique these findings and filter for high-quality results, reducing false positives.
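The exact schema is tool-specific, but a minimal Python sketch can make the shape concrete. All names below (`Issue`, `filter_validated`, the field names, the 0.8 threshold) are illustrative assumptions, not the actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Issue:
    """One structured finding from an investigator agent (hypothetical schema)."""
    description: str       # what the problem is
    evidence: list[str]    # references to specific trajectory messages
    confidence: float      # how certain the investigator is, 0.0-1.0
    repro_steps: list[str] = field(default_factory=list)  # commands to verify

def filter_validated(issues: list[Issue], min_confidence: float = 0.8) -> list[Issue]:
    """Approximate the validator's role: keep only high-confidence findings
    to reduce false positives."""
    return [issue for issue in issues if issue.confidence >= min_confidence]
```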
## Judging (Scores)
Investigator agents can also track and score specific behaviors across trajectories. This is useful when you want to measure:
- Agent verbosity, reasoning quality, or other qualitative aspects
- Any behavior that's hard to capture with automated metrics
Scores can be:
- Binary — Pass/fail judgments
- Scalar — Numeric ratings (0-1, 1-10, etc.)
- Multi-metric — Multiple scores for different criteria
Like issues, scores come with evidence and reasoning, making the judgment transparent and auditable.
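To make the three shapes concrete, here is a hedged sketch. The `Score` record and the example metric names are assumptions for illustration, not the tool's real schema:

```python
from dataclasses import dataclass

@dataclass
class Score:
    """One judgment from an investigator agent (hypothetical schema)."""
    metric: str           # e.g. "verbosity" or "reasoning_quality"
    value: float          # binary collapses to 1.0/0.0; scalar is any rubric value
    evidence: list[str]   # trajectory messages the judgment is based on
    reasoning: str        # why this value was assigned

# Binary: a pass/fail judgment
tests_pass = Score("tests_pass", 1.0, ["msg_12"], "All tests ran and exited 0.")

# Multi-metric: several scores for the same trajectory
report = [
    Score("verbosity", 0.3, ["msg_4", "msg_9"], "Responses were terse but complete."),
    Score("reasoning_quality", 0.8, ["msg_7"], "The plan was explicit and followed."),
]
```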
## How It Works
When you launch an investigation:
1. Investigator agents read trajectories and execute commands in the eval environment
2. They write either issues (if looking for problems) or scores (if evaluating performance)
3. Validator agents critique the findings
4. High-quality results are surfaced in the web UI
Both types of output reference specific messages in the trajectory, so you can see exactly what led to each finding.
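As a rough mental model, the pipeline can be sketched as a loop over trajectories. Everything here is a placeholder (the tool drives these steps for you; `investigate`, `validate`, and `Finding` are invented names):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    kind: str                # "issue" or "score"
    message_refs: list[str]  # the specific trajectory messages behind the finding
    confidence: float

def investigate(trajectory: list[dict], mode: str) -> list[Finding]:
    """Stub for the investigator agent: reads the trajectory and executes
    commands in the eval environment (steps 1-2)."""
    return []

def validate(finding: Finding) -> bool:
    """Stub for the validator agent's critique (step 3)."""
    return finding.confidence >= 0.8

def run_investigation(trajectories: list[list[dict]], mode: str = "issues") -> list[Finding]:
    surfaced = []
    for trajectory in trajectories:
        for finding in investigate(trajectory, mode):
            if validate(finding):
                surfaced.append(finding)  # step 4: surfaced in the web UI
    return surfaced
```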