NorthStar Labs - AskHR Policy Assistant
Employees spend hours searching through scattered policy documents for answers to routine questions - parental leave eligibility, 401(k) matching, expense limits. HR teams field the same questions repeatedly. The information exists, but finding it is the bottleneck. This project built a RAG-powered assistant that retrieves the right policy sections, generates cited answers, and evaluates its own quality in production.
- Build hybrid search (semantic + keyword) so exact terms and concepts are both covered
- Use hierarchical chunking that preserves document structure for precise citations
- Add automated evaluation so quality is measured, not assumed
- Instrument everything with tracing so production issues are debuggable
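Hierarchical chunking can be implemented in several ways; as a minimal sketch (not the project's actual splitter), one approach is to split a markdown policy document on headings and tag each chunk with its heading path, so an answer can cite something like "Leave Policy > Parental Leave":

```python
import re

def chunk_by_headings(markdown_text):
    """Split a markdown policy doc into chunks, each tagged with the
    heading path it sits under, for use as a precise citation."""
    chunks, path, body = [], {}, []

    def flush():
        text = "\n".join(body).strip()
        if text:
            citation = " > ".join(path[lvl] for lvl in sorted(path))
            chunks.append({"citation": citation, "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        m = re.match(r"^(#+)\s+(.*)", line)
        if m:
            flush()  # emit the chunk accumulated under the previous heading
            level = len(m.group(1))
            path[level] = m.group(2)
            # entering a new heading invalidates any deeper headings
            for lvl in [l for l in path if l > level]:
                del path[lvl]
        else:
            body.append(line)
    flush()
    return chunks
```

The citation string travels with the chunk through retrieval, so the generator can quote it verbatim instead of inventing a source.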
- Hybrid search over semantic-only: BM25 catches exact policy codes and abbreviations that embeddings miss
- 3-level eval scoring (1.0/0.5/0.0) over binary: industry-standard scale, more actionable than pass/fail
- Arize Phoenix over LangSmith: Open standard (OpenInference), no vendor lock-in, span-level annotations
- Voyage AI over OpenAI embeddings: Better retrieval benchmarks, asymmetric query/document encoding
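The BM25 rationale is easy to demonstrate with a toy example. The sketch below is a compact BM25 implementation scored over a made-up three-document corpus; the policy code "pol-7b" is hypothetical. An exact-token match like this scores decisively under BM25, while a dense embedding may rank it below semantically similar but wrong passages:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

corpus = [
    "policy pol-7b covers parental leave eligibility".split(),
    "the 401k match vests after one year".split(),
    "expense limits for travel are set per region".split(),
]
scores = bm25_scores("pol-7b parental leave".split(), corpus)
```

Only the first document contains the query terms, so it is the only one with a nonzero score; an embedding model has no such guarantee for rare identifiers it never saw in training.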
Evaluation is what separates a prototype from a production system. Automated scoring for accuracy, completeness, and grounding gives you confidence in every answer and a clear signal when something needs fixing.
- User asks a question. React frontend sends the query to the FastAPI backend.
- Hybrid retrieval finds relevant policies. Voyage AI embeddings search Supabase pgvector (semantic), while BM25 searches an in-memory index (keyword). Results are merged via Reciprocal Rank Fusion with k=60.
- Claude generates a cited answer. Top 5 chunks are passed to Claude Sonnet 4.5 with a system prompt that mandates source citations and restricts answers to provided context only.
- Traces flow to Phoenix Cloud. Every query is fully traceable - which documents were retrieved, what the LLM generated, and how long each step took.
- On-demand evaluation scores the answer. Arize Phoenix scores each answer on Faithfulness and Completeness using custom 3-level eval templates, then attaches the results directly to the trace for easy debugging.
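The fusion step in the pipeline above is small enough to show in full. This is a generic Reciprocal Rank Fusion sketch (the doc IDs are illustrative): each document earns 1/(k + rank) from every list it appears in, so documents ranked well by both semantic and keyword search rise to the top:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists; each doc scores sum of 1/(k + rank)
    over every list it appears in. Higher total = better."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]   # pgvector order
keyword  = ["doc_c", "doc_a", "doc_d"]   # BM25 order
fused = reciprocal_rank_fusion([semantic, keyword])
```

The constant k=60 dampens the advantage of a single #1 ranking, which is why `doc_a` (ranked 1st and 2nd) edges out `doc_c` (ranked 3rd and 1st).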
Each metric catches failures the others miss. A faithful but incomplete answer looks perfect without the completeness check. A complete but hallucinated answer looks comprehensive without the faithfulness check.
| Metric | What It Measures | Scoring |
|---|---|---|
| Faithfulness | Is every claim grounded in the retrieved context? | Factual (1.0) / Mixed (0.5) / Hallucinated (0.0) |
| Completeness | Does the answer address all parts of the question? | Complete (1.0) / Partial (0.5) / Incomplete (0.0) |
| QA Correctness | Does the answer match the expected ground truth? | Correct (1.0) / Partially Correct (0.5) / Incorrect (0.0) |
Why 3-level scoring? Binary (pass/fail) is too coarse - it can't distinguish "mostly correct with a minor gap" from "completely wrong." The 1.0/0.5/0.0 scale is the industry standard used by RAGAS, LangSmith, and Braintrust. It provides enough granularity to be actionable without the subjectivity of 5-level scales.
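In practice a 3-level eval reduces to an LLM-judge prompt whose output labels map to fixed scores. The template below is a hypothetical illustration in the style of Phoenix's classification evals, not the project's actual prompt; only the label names and scores mirror the table above:

```python
# Hypothetical judge prompt for the Faithfulness metric.
FAITHFULNESS_TEMPLATE = """\
Given the retrieved context and the answer, label the answer as one of:
- factual: every claim is supported by the context
- mixed: some claims are supported, some are not
- hallucinated: the claims are not supported by the context

Context: {context}
Answer: {answer}
Label:"""

# The 3-level scale: label -> numeric score attached to the trace.
LABEL_SCORES = {"factual": 1.0, "mixed": 0.5, "hallucinated": 0.0}

def score_label(label):
    """Normalize a judge label and map it to the 3-level score."""
    return LABEL_SCORES[label.strip().lower()]
```

Constraining the judge to named labels, then mapping labels to numbers, keeps the scores deterministic even when the judge's free-text reasoning varies between runs.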
- Retrieval quality determines everything. The best LLM in the world can't compensate for retrieving the wrong documents. I spent more time tuning retrieval than generation.
- Hybrid search is worth the complexity. Semantic search alone missed exact policy codes and abbreviations. Adding keyword search caught these cases with almost no added latency.
- You can't improve what you can't measure. Without evals, there is no way to know if a code change improved or broke answer quality. The eval framework turns subjective "looks good" into measurable scores you can track over time.
- Tracing makes debugging possible. With Phoenix traces, I can click on any query and see exactly which documents were retrieved, what the LLM generated, and how the evals scored it. Without this, debugging is guesswork.