hermes-tool-test-suite¶
pytest harness for validating Hermes AI agent tool-calling reliability.
github.com/LegionForge/hermes-tool-test-suite
What it does¶
The hermes-tool-test-suite is a pytest harness that validates AI agent tool-calling reliability across providers. It's used to certify that a given model + prompt-config combination actually executes tool calls end-to-end, not just generates text that looks like a tool call.
The default validation set covers 10 scenarios across:
| Category | Scenarios |
|---|---|
| Terminal | shell exec, cwd handling |
| File ops | read, write, edit, list |
| Code execution | inline Python, subprocess Python |
| Web tools | fetch, search |
Each scenario has a deterministic expected output. The model's tool-call sequence is run and the resulting state (filesystem, stdout, etc.) is compared.
Status¶
Active. Public.
Why this exists¶
A model that generates a tool_calls block in its API response hasn't actually proven it can drive tools. Real failures observed in practice:
- Model generates
tool_callsbut the framework's tool dispatcher trips on a tool name that's been mangled (e.g., underscore stripping) - Model returns text describing a tool call instead of using the structured
tool_callsfield - Model loops on the same call without integrating the result
A test harness that runs the entire loop catches all three.
Use cases¶
- Pre-deployment certification — before shipping a new model to production, run the suite. If 9/10 pass, you know what the failure mode is and whether it's blocking.
- Model comparison — A/B testing tool-call reliability between providers (Ollama llama3.1 vs qwen2.5 vs cloud Claude vs OpenAI).
- Regression detection — after upgrading a framework version, re-run the suite to confirm no regression.
Integration with LegionForge¶
LegionForge's primary inference path runs through llm_factory which is tested by this suite. The framework's CI runs a subset of this harness on every merge.