NorthStar Labs - AskHR Policy Assistant
Employees spend hours searching through scattered policy documents for answers to routine questions - parental leave eligibility, 401(k) matching, expense limits. HR teams field the same questions repeatedly. The information exists, but finding it is the bottleneck. This project built a RAG-powered assistant that retrieves the right policy sections, generates cited answers, and evaluates its own quality in production.
- Build hybrid search (semantic + keyword) so exact terms and concepts are both covered
- Use hierarchical chunking that preserves document structure for precise citations
- Add automated evaluation so quality is measured, not assumed
- Instrument everything with tracing so production issues are debuggable
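Hierarchical chunking can be implemented in several ways; as a minimal sketch (not the project's actual splitter), one approach is to split a markdown policy document on headings and tag each chunk with its heading path, so an answer can cite something like "Leave Policy > Parental Leave":

```python
import re

def chunk_by_headings(markdown_text):
    """Split a markdown policy doc into chunks, each tagged with the
    heading path it sits under, for use as a precise citation."""
    chunks, path, body = [], {}, []

    def flush():
        text = "\n".join(body).strip()
        if text:
            citation = " > ".join(path[lvl] for lvl in sorted(path))
            chunks.append({"citation": citation, "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        m = re.match(r"^(#+)\s+(.*)", line)
        if m:
            flush()  # emit the chunk accumulated under the previous heading
            level = len(m.group(1))
            path[level] = m.group(2)
            # entering a new heading invalidates any deeper headings
            for lvl in [l for l in path if l > level]:
                del path[lvl]
        else:
            body.append(line)
    flush()
    return chunks
```

The citation string travels with the chunk through retrieval, so the generator can quote it verbatim instead of inventing a source.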
- Hybrid search over semantic-only: BM25 catches exact policy codes and abbreviations that embeddings miss
- 3-level eval scoring (1.0/0.5/0.0) over binary: industry-standard scale, more actionable than pass/fail
- Arize Phoenix over LangSmith: Open standard (OpenInference), no vendor lock-in, span-level annotations
- Voyage AI over OpenAI embeddings: Better retrieval benchmarks, asymmetric query/document encoding
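The BM25 rationale is easy to demonstrate with a toy example. The sketch below is a compact BM25 implementation scored over a made-up three-document corpus; the policy code "pol-7b" is hypothetical. An exact-token match like this scores decisively under BM25, while a dense embedding may rank it below semantically similar but wrong passages:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

corpus = [
    "policy pol-7b covers parental leave eligibility".split(),
    "the 401k match vests after one year".split(),
    "expense limits for travel are set per region".split(),
]
scores = bm25_scores("pol-7b parental leave".split(), corpus)
```

Only the first document contains the query terms, so it is the only one with a nonzero score; an embedding model has no such guarantee for rare identifiers it never saw in training.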
Evaluation is what separates a prototype from a production system. Automated scoring for accuracy, completeness, and grounding gives you confidence in every answer and a clear signal when something needs fixing.
- User asks a question. React frontend sends the query to the FastAPI backend.
- Hybrid retrieval finds relevant policies. Voyage AI embeddings search Supabase pgvector (semantic), while BM25 searches an in-memory index (keyword). Results are merged via Reciprocal Rank Fusion with k=60.
- Claude generates a cited answer. Top 5 chunks are passed to Claude Sonnet 4.5 with a system prompt that mandates source citations and restricts answers to provided context only.
- Traces flow to Phoenix Cloud. Every query is fully traceable - which documents were retrieved, what the LLM generated, and how long each step took.
- On-demand evaluation scores the answer. Arize Phoenix scores each answer on Faithfulness and Completeness using custom 3-level eval templates, then attaches the results directly to the trace for easy debugging.
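The fusion step in the pipeline above is small enough to show in full. This is a generic Reciprocal Rank Fusion sketch (the doc IDs are illustrative): each document earns 1/(k + rank) from every list it appears in, so documents ranked well by both semantic and keyword search rise to the top:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists; each doc scores sum of 1/(k + rank)
    over every list it appears in. Higher total = better."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]   # pgvector order
keyword  = ["doc_c", "doc_a", "doc_d"]   # BM25 order
fused = reciprocal_rank_fusion([semantic, keyword])
```

The constant k=60 dampens the advantage of a single #1 ranking, which is why `doc_a` (ranked 1st and 2nd) edges out `doc_c` (ranked 3rd and 1st).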
Each metric catches failures the others miss. A faithful but incomplete answer looks perfect without the completeness check. A complete but hallucinated answer looks comprehensive without the faithfulness check.
| Metric | What It Measures | Scoring |
|---|---|---|
| Faithfulness | Is every claim grounded in the retrieved context? | Factual (1.0) / Mixed (0.5) / Hallucinated (0.0) |
| Completeness | Does the answer address all parts of the question? | Complete (1.0) / Partial (0.5) / Incomplete (0.0) |
| QA Correctness | Does the answer match the expected ground truth? | Correct (1.0) / Partially Correct (0.5) / Incorrect (0.0) |
Why 3-level scoring? Binary (pass/fail) is too coarse - it can't distinguish "mostly correct with a minor gap" from "completely wrong." The 1.0/0.5/0.0 scale is the industry standard used by RAGAS, LangSmith, and Braintrust. It provides enough granularity to be actionable without the subjectivity of 5-level scales.
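In practice a 3-level eval reduces to an LLM-judge prompt whose output labels map to fixed scores. The template below is a hypothetical illustration in the style of Phoenix's classification evals, not the project's actual prompt; only the label names and scores mirror the table above:

```python
# Hypothetical judge prompt for the Faithfulness metric.
FAITHFULNESS_TEMPLATE = """\
Given the retrieved context and the answer, label the answer as one of:
- factual: every claim is supported by the context
- mixed: some claims are supported, some are not
- hallucinated: the claims are not supported by the context

Context: {context}
Answer: {answer}
Label:"""

# The 3-level scale: label -> numeric score attached to the trace.
LABEL_SCORES = {"factual": 1.0, "mixed": 0.5, "hallucinated": 0.0}

def score_label(label):
    """Normalize a judge label and map it to the 3-level score."""
    return LABEL_SCORES[label.strip().lower()]
```

Constraining the judge to named labels, then mapping labels to numbers, keeps the scores deterministic even when the judge's free-text reasoning varies between runs.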
- Retrieval quality determines everything. The best LLM in the world can't compensate for retrieving the wrong documents. I spent more time tuning retrieval than generation.
- Hybrid search is worth the complexity. Semantic search alone missed exact policy codes and abbreviations. Adding keyword search caught these cases with almost no added latency.
- You can't improve what you can't measure. Without evals, there is no way to know if a code change improved or broke answer quality. The eval framework turns subjective "looks good" into measurable scores you can track over time.
- Tracing makes debugging possible. With Phoenix traces, I can click on any query and see exactly which documents were retrieved, what the LLM generated, and how the evals scored it. Without this, debugging is guesswork.