Finding Broken Tasks in SWE-bench

SWE-bench Verified is the most popular software engineering benchmark, but it still contains unsolvable tasks: those with broken tests, missing dependencies, or ambiguous specifications. This walkthrough shows how to find them.

1. Run SWE-bench

lunette eval swebench --model anthropic/claude-sonnet-4 --limit 50

This runs 50 SWE-bench instances with full environment capture. Every command, file change, and model response is recorded.

2. View your results

Go to lunette.dev and find your run. You'll see:

Pass/fail for each task
The full trajectory (what the agent did)
Access to the original environment
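
Failed tasks are the natural candidates for investigation, since a broken task shows up as a failure no matter how capable the agent is. The sketch below shows one way to pull those candidates out of a run's pass/fail results. The `TaskResult` record and the instance IDs are hypothetical and used only for illustration; they are not Lunette's export format.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record shape (an assumption), not Lunette's actual format.
    instance_id: str
    passed: bool


def failed_instances(results: list[TaskResult]) -> list[str]:
    """Return the IDs of tasks the agent did not solve, in input order."""
    return [r.instance_id for r in results if not r.passed]


def pass_rate(results: list[TaskResult]) -> float:
    """Return the fraction of tasks that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


# Made-up example data for illustration only.
example = [
    TaskResult("repo-a__issue-1", True),
    TaskResult("repo-b__issue-2", False),
    TaskResult("repo-c__issue-3", False),
]

print(failed_instances(example))
print(f"pass rate: {pass_rate(example):.0%}")
```

A failure alone does not mean the task is broken; it only marks the task as worth a closer look.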

3. Launch an investigation

Click "Investigate", select the default prompt, and start the run. The investigator agent will:

Read what the agent tried to do
Access the sandbox environment
Run commands to test hypotheses
Report what it finds

Check back in a few minutes and you'll see issues start to come in.
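
To give a feel for what testing a hypothesis can look like, here is a minimal sketch that sorts test output into rough categories matching the kinds of broken tasks named in the introduction. The patterns and category labels are illustrative assumptions, not the rules the investigator agent applies; they rely only on standard Python and pytest error messages.

```python
import re

# Heuristic patterns, checked in order. These are illustrative assumptions,
# not the investigator agent's actual logic.
PATTERNS = [
    # Python reports a missing package as ModuleNotFoundError / ImportError.
    ("missing_dependency",
     re.compile(r"ModuleNotFoundError|ImportError|No module named")),
    # pytest reports undefined fixtures and collection failures like this.
    ("broken_test",
     re.compile(r"fixture '\w+' not found|errors? during collection",
                re.IGNORECASE)),
]


def classify_log(log: str) -> str:
    """Return the first matching category, or 'needs_review' if none match."""
    for label, pattern in PATTERNS:
        if pattern.search(log):
            return label
    return "needs_review"


print(classify_log("E   ModuleNotFoundError: No module named 'yaml'"))
print(classify_log("E       fixture 'db' not found"))
print(classify_log("AssertionError: 1 != 2"))
```

Keyword matching like this can only flag candidates. Ambiguous specifications in particular do not leave a recognizable error message, which is why an agent that can read the trajectory and rerun commands in the sandbox is useful.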