For any of you who run systems with LLM interactions in production: how do you monitor the quality of LLM outputs continuously?

Do you use another LLM to evaluate whether the response was hallucinated, and grade it across a set of metrics?
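For concreteness, here is a minimal sketch of that LLM-as-judge idea, assuming the OpenAI Python SDK; the judge model name, the rubric (faithfulness, relevance, completeness), and the 1-5 score scale are all illustrative choices, not a standard:

```python
# A minimal LLM-as-judge sketch. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; the model, rubric, and scale are
# illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading another model's answer.
Question: {question}
Answer: {answer}
Return a JSON object with integer scores from 1 to 5 for:
faithfulness (is the answer grounded, or hallucinated?),
relevance, and completeness."""

def grade_response(question: str, answer: str) -> dict:
    """Ask a judge model to score a sampled production response."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(result.choices[0].message.content)

# Example: score one interaction and log the result
scores = grade_response(
    "What year was Rust 1.0 released?",
    "Rust 1.0 was released in 2015.",
)
print(scores)  # e.g. {"faithfulness": 5, "relevance": 5, "completeness": 5}
```

In practice you would probably run something like this asynchronously over a sample of production traffic rather than on every request, and push the scores into whatever metrics pipeline you already have.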
