llm-valet

Cross-platform LLM lifecycle manager — auto-pause and resume Ollama based on system resource pressure and gaming detection.

github.com/LegionForge/llm-valet

The problem it solves

Running a local LLM is expensive in RAM. A 7B-parameter model at 4-bit quantization takes ~4.5 GB; an 8B model at q4 takes ~5.5 GB. On a 16 GB machine, keeping Ollama warm leaves roughly 10 GB for everything else: a Chromium tab tree, a game, a video editor, Slack. Once the system starts swapping, the model gets evicted unpredictably and has to reload on the next request (~30 s).
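The figures above can be sanity-checked with a little arithmetic. The sketch below is a rough estimate, not a measurement: it assumes a 4-bit block format that stores one 16-bit scale per 32 weights (an effective 4.5 bits per weight), and the function names are hypothetical. It covers the weights only; the gap between its output and the ~4.5 GB and ~5.5 GB quoted above is assumed here to come from the context cache, runtime buffers, and higher-rate q4 variants.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def headroom_gb(total_ram_gb: float, model_resident_gb: float) -> float:
    """RAM left for everything else, before the OS takes its own share."""
    return total_ram_gb - model_resident_gb


# Assumption: 4.5 effective bits per weight (4-bit values plus per-block scales).
print(estimate_weights_gb(7, 4.5))   # weights only, 7B model
print(estimate_weights_gb(8, 4.5))   # weights only, 8B model

# Using the resident sizes quoted in the text rather than the estimate:
print(headroom_gb(16, 4.5))          # 7B model warm on a 16 GB machine
print(headroom_gb(16, 5.5))          # 8B model warm on a 16 GB machine
```

With the 8B figure, a 16 GB machine is left with 10.5 GB, and the operating system still has to fit inside that.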

llm-valet sits in the middle. It watches:

Memory pressure — if the system is paging or close to it, pause Ollama models cleanly.
GPU pressure — if another GPU-hungry process spins up (a game, a render), pause the model.
Gaming detection — recognized game processes are a strong signal that the user wants their hardware back.

When the pressure subsides, llm-valet warms the model back up automatically.
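The pause and resume behaviour described above can be sketched as a small state function plus two Ollama requests. This is an illustration under stated assumptions, not llm-valet's actual implementation: the `Snapshot` and `Thresholds` types, the threshold values, and the separate lower resume thresholds (hysteresis, so the model does not flap around a single cutoff) are all hypothetical. The Ollama side uses the documented `/api/generate` behaviour: a request with no prompt loads the model, and `keep_alive: 0` unloads it.

```python
import json
import urllib.request
from dataclasses import dataclass

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


@dataclass(frozen=True)
class Snapshot:
    mem_used: float      # fraction of RAM in use, 0.0 to 1.0
    gpu_used: float      # fraction of GPU capacity in use, 0.0 to 1.0
    game_running: bool   # True if a recognized game process is present


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only, not llm-valet defaults.
    pause_mem: float = 0.85
    resume_mem: float = 0.70
    pause_gpu: float = 0.80
    resume_gpu: float = 0.60


def should_be_paused(paused: bool, snap: Snapshot,
                     t: Thresholds = Thresholds()) -> bool:
    """Return the desired state after one snapshot (True means paused)."""
    if snap.game_running:
        return True  # a running game always wins
    if paused:
        # Stay paused until both signals drop below the lower resume marks.
        return snap.mem_used >= t.resume_mem or snap.gpu_used >= t.resume_gpu
    return snap.mem_used >= t.pause_mem or snap.gpu_used >= t.pause_gpu


def generate_payload(model: str, paused: bool) -> dict:
    """Body for POST /api/generate: no prompt, so it only loads or unloads."""
    payload = {"model": model, "stream": False}
    if paused:
        payload["keep_alive"] = 0  # ask Ollama to unload the model now
    return payload


def apply_state(model: str, paused: bool, base_url: str = OLLAMA_URL) -> dict:
    """Send the load or unload request to a running Ollama server."""
    body = json.dumps(generate_payload(model, paused)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

A polling loop would call `should_be_paused` on each fresh snapshot and call `apply_state` only when the result differs from the current state. Collecting the snapshot itself (RAM, GPU, process list) is platform-specific and left out of the sketch.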

Status

Active. Initial public version. Cross-platform (macOS, Linux, Windows) but tested most heavily on Apple Silicon.

When to use it

You run Ollama locally and use the same machine for other workloads.
You don't want LLM memory pressure to thrash the rest of your system.
You want the model to come back automatically when there's room — not on the next slow request.

When not to use it

Dedicated inference box (no other workloads). Just let Ollama hold memory.
Cloud-only LLM use. There's nothing to pause.

Integration with LegionForge

LegionForge can run alongside llm-valet without coordination; the two work independently. For tighter integration (e.g., having the framework respect llm-valet's pause state instead of sending a request that triggers an unwanted reload), see the llm-valet README.
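llm-valet's own integration interface is documented in its README and is not reproduced here. As a stand-in, a client can avoid triggering a reload by asking Ollama which models are currently loaded before sending a request. The sketch below uses Ollama's `GET /api/ps` endpoint; the helper names are hypothetical, and it cannot tell a valet pause apart from an ordinary idle unload.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


def loaded_models(ps_response: dict) -> set:
    """Extract model names from a parsed /api/ps response."""
    return {m.get("name", "") for m in ps_response.get("models", [])}


def fetch_ps(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> dict:
    """Ask a running Ollama server which models are in memory."""
    with urllib.request.urlopen(base_url + "/api/ps", timeout=timeout) as resp:
        return json.load(resp)


def safe_to_request(model: str, ps_response: dict) -> bool:
    """True if the model is already resident, so a request will not reload it.

    Names are compared exactly, including the tag (for example ":latest").
    """
    return model in loaded_models(ps_response)
```

A caller would run `safe_to_request(name, fetch_ps())` and defer or reroute the request when it returns False.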