NorthStar Labs - AskHR Policy Assistant

Employees spend hours searching through scattered policy documents for answers to routine questions - parental leave eligibility, 401(k) matching, expense limits. HR teams field the same questions repeatedly. The information exists, but finding it is the bottleneck. This project built a RAG-powered assistant that retrieves the right policy sections, generates cited answers, and evaluates its own quality in production.

Goals:

  • Build hybrid search (semantic + keyword) so exact terms and concepts are both covered
  • Use hierarchical chunking that preserves document structure for precise citations
  • Add automated evaluation so quality is measured, not assumed
  • Instrument everything with tracing so production issues are debuggable

Key design decisions:

  • Hybrid search over semantic-only: BM25 catches exact policy codes and abbreviations that embeddings miss
  • 3-level eval scoring (1.0/0.5/0.0) over binary: industry standard, more actionable than pass/fail
  • Arize Phoenix over LangSmith: open standard (OpenInference), no vendor lock-in, span-level annotations
  • Voyage AI over OpenAI embeddings: better retrieval benchmarks, asymmetric query/document encoding
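To make the keyword side of that trade-off concrete, here is a minimal Okapi BM25 scorer in pure Python. The policy codes and snippets are invented for illustration; the real system keeps an in-memory BM25 index over the full chunk corpus. An exact token like a policy code matches or it doesn't, which is exactly the behavior embeddings can blur:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency: in how many docs does each term appear?
    df = Counter()
    for d in docs_tokens:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "remote work stipend policy hr-112 covers equipment".split(),
    "parental leave eligibility is described in policy pol-204".split(),
]
# The exact code "pol-204" matches only the second document.
print(bm25_scores("pol-204 eligibility".split(), docs))
```

A query with no overlapping tokens scores exactly zero against a document, which is why the keyword leg never "hallucinates" a near-match the way nearest-neighbor embedding search can.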

Evaluation is what separates a prototype from a production system. Automated scoring for accuracy, completeness, and grounding gives you confidence in every answer and a clear signal when something needs fixing.

  • User asks a question. React frontend sends the query to the FastAPI backend.
  • Hybrid retrieval finds relevant policies. Voyage AI embeddings search Supabase pgvector (semantic), while BM25 searches an in-memory index (keyword). Results are merged via Reciprocal Rank Fusion with K=60.
  • Claude generates a cited answer. Top 5 chunks are passed to Claude Sonnet 4.5 with a system prompt that mandates source citations and restricts answers to provided context only.
  • Traces flow to Phoenix Cloud. Every query is fully traceable - which documents were retrieved, what the LLM generated, and how long each step took.
  • On-demand evaluation scores the answer. Arize Phoenix scores each answer on Faithfulness and Completeness using custom 3-level eval templates, then attaches the results directly to the trace for easy debugging.
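The merge step in the pipeline above can be sketched in a few lines: Reciprocal Rank Fusion scores each chunk as the sum of 1/(k + rank) across the ranked lists it appears in. The chunk IDs below are made up for illustration:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked result lists: each doc accumulates 1/(k + rank)
    for every list it appears in; higher total wins."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["c12", "c07", "c33"]   # pgvector similarity order
keyword  = ["c07", "c44", "c12"]   # BM25 order
print(reciprocal_rank_fusion([semantic, keyword]))
# → ['c07', 'c12', 'c44', 'c33']
```

Chunks that rank well in both lists (here c07 and c12) float to the top, while a chunk that appears in only one list still survives the merge - no score normalization across the two retrievers is needed, which is the main appeal of RRF.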

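The generation step's contract with Claude - cite your sources, stay inside the provided context - can be sketched as a system prompt plus a message builder. The wording and function names below are illustrative, not the production prompt:

```python
# Illustrative system prompt; the exact production wording differs.
SYSTEM_PROMPT = """You are AskHR, an assistant that answers employee policy questions.

Rules:
- Answer ONLY from the policy excerpts provided in the context.
- Cite the source section for every claim, e.g. [Parental Leave Policy, §2.1].
- If the context does not contain the answer, say so instead of guessing.
"""

def build_messages(question: str, chunks: list[str]) -> list[dict]:
    """Assemble the user message: retrieved chunks, then the question."""
    context = "\n\n---\n\n".join(chunks)
    return [
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

Keeping the citation and grounding rules in the system prompt (rather than the per-turn message) means every query inherits them, and the trace for any answer shows exactly which chunks were in the context window.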
Each metric catches failures the others miss. A faithful but incomplete answer looks perfect without the completeness check. A complete but hallucinated answer looks comprehensive without the faithfulness check.

| Metric | What it measures | Scoring |
| --- | --- | --- |
| Faithfulness | Is every claim grounded in the retrieved context? | Factual (1.0) / Mixed (0.5) / Hallucinated (0.0) |
| Completeness | Does the answer address all parts of the question? | Complete (1.0) / Partial (0.5) / Incomplete (0.0) |
| QA Correctness | Does the answer match the expected ground truth? | Correct (1.0) / Partially Correct (0.5) / Incorrect (0.0) |

Why 3-level scoring? Binary (pass/fail) is too coarse - it can't distinguish "mostly correct with a minor gap" from "completely wrong." The 1.0/0.5/0.0 scale is the industry standard used by RAGAS, LangSmith, and Braintrust. It provides enough granularity to be actionable without the subjectivity of 5-level scales.
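As a sketch, the label-to-score mapping behind these templates might look like the following. The label strings mirror the table above; the dictionary and function names are mine, not Phoenix API:

```python
# Map a grading-LLM label to the 3-level numeric score (names illustrative).
SCORE_MAP = {
    "faithfulness":   {"factual": 1.0, "mixed": 0.5, "hallucinated": 0.0},
    "completeness":   {"complete": 1.0, "partial": 0.5, "incomplete": 0.0},
    "qa_correctness": {"correct": 1.0, "partially correct": 0.5, "incorrect": 0.0},
}

def score(metric: str, label: str) -> float:
    """Normalize the label and look up its numeric score."""
    return SCORE_MAP[metric][label.strip().lower()]

# A faithful-but-incomplete answer is distinguishable from a wrong one:
print(score("faithfulness", "Factual"), score("completeness", "Partial"))
# → 1.0 0.5
```

The 0.5 tier is what makes the score trendable: a regression from "complete" to "partial" moves the average, where a binary pass/fail might not move at all.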

  • Retrieval quality determines everything. The best LLM in the world can't compensate for retrieving the wrong documents. I spent more time tuning retrieval than generation.
  • Hybrid search is worth the complexity. Semantic search alone missed exact policy codes and abbreviations. Adding keyword search caught these cases with almost no added latency.
  • You can't improve what you can't measure. Without evals, there is no way to know if a code change improved or broke answer quality. The eval framework turns subjective "looks good" into measurable scores you can track over time.
  • Tracing makes debugging possible. With Phoenix traces, I can click on any query and see exactly which documents were retrieved, what the LLM generated, and how the evals scored it. Without this, debugging is guesswork.