hermes-tool-test-suite

pytest harness for validating Hermes AI agent tool-calling reliability.

github.com/LegionForge/hermes-tool-test-suite

What it does

The hermes-tool-test-suite is a pytest harness that validates AI agent tool-calling reliability across providers. It is used to certify that a given model + prompt-config combination actually executes tool calls end-to-end, rather than merely generating text that looks like a tool call.

The default validation set covers 10 scenarios across:

Category        Scenarios
Terminal        shell exec, cwd handling
File ops        read, write, edit, list
Code execution  inline Python, subprocess Python
Web tools       fetch, search

Each scenario has a deterministic expected output. The harness runs the model's tool-call sequence and compares the resulting state (filesystem, stdout, etc.) against that expected output.
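
As a rough illustration of the state comparison described above, here is a minimal sketch for a file-write scenario. It is not the suite's actual code: run_tool_calls, snapshot, check_scenario, and the write_file / read_file tool names are hypothetical stand-ins, and the tool calls are hard-coded where a real run would take them from the model.

```python
import tempfile
from pathlib import Path


def run_tool_calls(tool_calls, workdir):
    """Hypothetical dispatcher: executes each tool call inside workdir."""
    for call in tool_calls:
        name, args = call["name"], call["arguments"]
        target = Path(workdir) / args["path"]
        if name == "write_file":
            target.write_text(args["content"])
        elif name == "read_file":
            target.read_text()
        else:
            raise ValueError(f"unknown tool: {name}")


def snapshot(workdir):
    """Capture the resulting filesystem state as {relative path: contents}."""
    root = Path(workdir)
    return {
        p.relative_to(root).as_posix(): p.read_text()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_scenario(tool_calls, expected_state):
    """Run the calls in a fresh temp directory and compare the end state."""
    with tempfile.TemporaryDirectory() as workdir:
        run_tool_calls(tool_calls, workdir)
        return snapshot(workdir) == expected_state


def test_write_scenario():
    # Stand-in for the tool calls a model would emit for this scenario.
    calls = [
        {"name": "write_file",
         "arguments": {"path": "hello.txt", "content": "hi\n"}},
    ]
    assert check_scenario(calls, {"hello.txt": "hi\n"})
```

Comparing the end state rather than the raw model output means a scenario passes only if the tools actually ran; a model that emits no calls leaves the directory empty and fails the comparison.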

Status

Active. Public.

Why this exists

A model that generates a tool_calls block in its API response hasn't actually proven it can drive tools. Real failures observed in practice:

  • Model generates tool_calls but the framework's tool dispatcher trips on a tool name that's been mangled (e.g., underscore stripping)
  • Model returns text describing a tool call instead of using the structured tool_calls field
  • Model loops on the same call without integrating the result

A test harness that runs the entire loop catches all three.
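
The three failure modes listed above could be told apart with a check along these lines. This is a sketch under stated assumptions, not the suite's implementation: the turn format (dicts with optional "tool_calls" and "content" keys), the REGISTERED_TOOLS set, the failure labels, and the max_repeats threshold are all hypothetical, and the text-detection rule is a crude heuristic.

```python
import json
from collections import Counter

# Hypothetical set of tool names the dispatcher knows about.
REGISTERED_TOOLS = {"read_file", "write_file", "run_shell"}


def classify_turns(turns, registered=REGISTERED_TOOLS, max_repeats=2):
    """Return a failure label for a list of assistant turns, or "ok".

    Each turn is a dict with optional keys:
      "tool_calls": list of {"name": str, "arguments": dict}
      "content":    free-text reply
    """
    seen = Counter()
    for turn in turns:
        calls = turn.get("tool_calls") or []
        content = turn.get("content") or ""

        # Heuristic: the reply text looks like a serialized tool call,
        # but the structured field is empty.
        if not calls and '"name"' in content and '"arguments"' in content:
            return "text_instead_of_tool_call"

        for call in calls:
            # A mangled name (e.g. "readfile") never matches a registered tool.
            if call["name"] not in registered:
                return "unknown_tool_name"

            # Identical name + arguments repeated too often suggests a loop.
            key = (call["name"], json.dumps(call["arguments"], sort_keys=True))
            seen[key] += 1
            if seen[key] > max_repeats:
                return "repeated_call_loop"
    return "ok"
```

For example, a turn whose only tool call is named "readfile" would be labelled unknown_tool_name, and three identical read_file calls in a row would be labelled repeated_call_loop under the default threshold.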

Use cases

  • Pre-deployment certification: before shipping a new model to production, run the suite. If 9 of 10 scenarios pass, the one failure identifies the failure mode and lets you judge whether it blocks the release.
  • Model comparison: A/B test tool-call reliability between providers (Ollama llama3.1 vs qwen2.5 vs cloud Claude vs OpenAI).
  • Regression detection: after upgrading a framework version, re-run the suite to confirm that no previously passing scenario now fails.
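
For the certification and regression use cases, per-scenario results can be reduced to a pass count and a list of newly failing scenarios. The sketch below assumes results are available as a plain {scenario name: passed} mapping; the function names, scenario names, and sample values are illustrative only and are not output from the suite.

```python
def summarize(results):
    """Return (passed, total, failed scenario names) for one run."""
    passed = sum(1 for ok in results.values() if ok)
    failed = sorted(name for name, ok in results.items() if not ok)
    return passed, len(results), failed


def regressions(before, after):
    """Scenarios that passed in `before` but fail or are missing in `after`."""
    return sorted(
        name for name, ok in before.items()
        if ok and not after.get(name, False)
    )


# Made-up example data: two runs over the same three scenarios.
old_run = {"shell_exec": True, "file_read": True, "web_fetch": False}
new_run = {"shell_exec": True, "file_read": False, "web_fetch": False}

passed, total, failed = summarize(new_run)
print(f"{passed}/{total} passed; failing: {failed}")
print("regressions:", regressions(old_run, new_run))
```

Keeping the failing scenario names, not just the count, is what makes a 9/10 result actionable: the name shows which tool category broke.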

Integration with LegionForge

LegionForge's primary inference path runs through llm_factory, which this suite tests. The framework's CI runs a subset of the harness on every merge.