llm-valet
Cross-platform LLM lifecycle manager — auto-pause and resume Ollama based on system resource pressure and gaming detection.
github.com/LegionForge/llm-valet
The problem it solves
Running a local LLM is expensive in RAM. A 7B-parameter model at 4-bit quantization takes ~4.5 GB; an 8B model at q4 takes ~5.5 GB. On a 16 GB machine, keeping Ollama warm leaves ~10 GB for everything else: a Chromium tab tree, a game, a video editor, Slack. The moment the system starts swapping, the model gets evicted unpredictably and has to reload on the next request (~30 s).
llm-valet sits in the middle. It watches:
- Memory pressure — if the system is paging or close to it, pause Ollama models cleanly.
- GPU pressure — if another GPU-hungry process spins up (a game, a render), pause the model.
- Gaming detection — recognized game processes are a strong signal that the user wants their hardware back.
When the pressure subsides, llm-valet warms the model back up automatically.
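The pause/resume behavior described above can be sketched as a small decision function. This is a minimal illustration, not llm-valet's actual implementation: the `Signals` type, the `next_action` and `ollama_request` helpers, and both thresholds are hypothetical names and values chosen for the example. The request bodies rely on Ollama's documented `POST /api/generate` behavior, where a body without a prompt loads the model and `keep_alive: 0` unloads it; llm-valet may pause models by a different mechanism.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; llm-valet's real defaults may differ.
PAUSE_MEM_PCT = 85.0   # pause once RAM use reaches this level
RESUME_MEM_PCT = 70.0  # resume only after RAM use falls below this level


@dataclass(frozen=True)
class Signals:
    mem_used_pct: float  # system RAM in use, 0-100
    gpu_contended: bool  # another GPU-heavy process is active
    game_running: bool   # a recognized game process is present


def next_action(paused: bool, s: Signals) -> str:
    """Return "pause", "resume", or "hold" for one sample of the signals."""
    under_pressure = (
        s.mem_used_pct >= PAUSE_MEM_PCT or s.gpu_contended or s.game_running
    )
    if not paused:
        return "pause" if under_pressure else "hold"
    # Already paused: resume only when every signal is clear and memory has
    # dropped below the lower threshold, so the model does not flap on and off.
    clear = (
        s.mem_used_pct < RESUME_MEM_PCT
        and not s.gpu_contended
        and not s.game_running
    )
    return "resume" if clear else "hold"


def ollama_request(action: str, model: str) -> dict:
    """Build a JSON body for Ollama's POST /api/generate.

    A body with no prompt loads the model; keep_alive=0 unloads it.
    """
    body = {"model": model}
    if action == "pause":
        body["keep_alive"] = 0
    return body
```

The gap between the two thresholds is a design choice: with a single threshold, memory use hovering near it would trigger repeated unload and reload cycles, each costing a full model load.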
Status
Active. Initial public version. Cross-platform (macOS, Linux, Windows) but tested most heavily on Apple Silicon.
When to use it
- You run Ollama locally and use the same machine for other workloads.
- You don't want LLM memory pressure to thrash the rest of your system.
- You want the model to come back automatically when there's room, rather than reloading during the next request and slowing it down.
When not to use it
- Dedicated inference box (no other workloads). Just let Ollama hold memory.
- Cloud-only LLM use. There's nothing to pause.
Integration with LegionForge
LegionForge can run alongside llm-valet without coordination; the two work independently. For tighter integration (e.g., having the framework respect llm-valet's pause state instead of making a request that triggers an unwanted reload), see the llm-valet README.
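llm-valet's own interface for exposing pause state is covered in its README and is not reproduced here. As a rough stand-in, a client can ask Ollama which models are currently loaded before sending a request, using Ollama's `GET /api/ps` endpoint. The sketch below assumes Ollama is listening on its default address; `fetch_running` and `is_loaded` are hypothetical helper names, not part of llm-valet or LegionForge.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumed here)


def fetch_running(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> dict:
    """Fetch and parse the list of currently loaded models from /api/ps."""
    with urlopen(f"{base_url}/api/ps", timeout=timeout) as resp:
        return json.load(resp)


def is_loaded(ps_response: dict, model: str) -> bool:
    """True if `model` appears in a parsed /api/ps response.

    Names in the response include the tag, e.g. "llama3:latest".
    """
    models = ps_response.get("models") or []
    return any(m.get("name") == model for m in models)
```

A caller could check `is_loaded(fetch_running(), "llama3:latest")` and defer or reroute work when the model is not resident, rather than sending a request that forces a reload.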