Case Study · RAG

DocChat RAG (Production)

PDF/doc upload, chunking, embeddings, semantic retrieval, and answer generation with explicit citations and source panels. Per-session conversation history supports follow-up questions.

Problem: Answers from internal docs are inconsistent and hard to verify.
Solution: Grounded RAG with citations and source confidence signals.
Impact: Higher trust, lower hallucination risk, faster onboarding answers.

Architecture

System Design

Keep the runtime simple for users but strict internally: versioned ingestion, hybrid retrieval, citation-aligned generation, and release safety checks.

sources -> ingestion contracts -> chunk/version store
                     |                    |
                     v                    v
               lexical index         vector index
                     \                  /
                      \                /
                      hybrid retrieval + rerank
                               |
                               v
                  citation-constrained answer runtime
                               |
                               v
                 response + citations + source panels

Why this matters: answer quality stays stable as the corpus and traffic grow.

Ingestion Service

Extract text, normalize artifacts, chunk by boundaries, and persist source/version lineage.
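
A minimal sketch of the boundary-aware chunking step, assuming the extractor already yields (section, text) pairs; names like Chunk and chunk_document are illustrative, not the production API.

from dataclasses import dataclass
from hashlib import sha256

@dataclass(frozen=True)
class Chunk:
    source_id: str
    revision: str   # content hash; re-ingesting an unchanged section is a no-op
    section: str
    text: str
    chunk_id: str

def chunk_document(source_id: str, sections: list[tuple[str, str]]) -> list[Chunk]:
    """Split on section boundaries and persist source/version lineage per chunk."""
    chunks = []
    for section, text in sections:
        revision = sha256(text.encode()).hexdigest()[:12]
        chunks.append(Chunk(
            source_id=source_id,
            revision=revision,
            section=section,
            text=text,
            chunk_id=f"{source_id}:{section}:{revision}",
        ))
    return chunks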

Hybrid Retrieval

Dense + lexical retrieval with reranking and confidence filtering before context assembly.
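
One common way to fuse the two rankings is reciprocal rank fusion; this sketch assumes each retriever returns chunk IDs in rank order, and k = 60 is the conventional constant, not a tuned value.

from collections import defaultdict

def fuse_rankings(dense: list[str], lexical: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each chunk by 1 / (k + rank) per list,
    then sort by combined score before reranking and confidence filtering."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in (dense, lexical):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)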

Answer + Guardrails

Citation-constrained generation with fallback behavior when grounding confidence is low.

Evidence

Evaluation and Reliability

Faithfulness

Tracked against reference answers with citation match checks.
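
A sketch of one such citation match check; the exact scoring used in the evaluation harness may differ.

def citation_match_rate(cited_chunk_ids: list[str], retrieved_ids: set[str]) -> float:
    """Share of cited chunk IDs that resolve to chunks actually retrieved
    for the query; low values flag answers leaning on unsupported claims."""
    if not cited_chunk_ids:
        return 0.0
    return sum(cid in retrieved_ids for cid in cited_chunk_ids) / len(cited_chunk_ids)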

p95 Latency

Measured end-to-end from query to sourced answer delivery.

Fallback Rate

Tracks how often retrieval confidence drops below threshold and the answer routes to fallback.

Regression Safety

Dataset-based tests run on retrieval and final answer quality before deploy.

Operational Guardrails

Timeout paths, empty-context handling, and structured error envelopes for client apps.
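
A sketch of what the structured error envelope might carry; the field names here are assumptions, not the actual schema.

from dataclasses import dataclass, field

@dataclass
class ErrorEnvelope:
    request_id: str
    code: str         # e.g. "timeout", "empty_context", "guardrail_block"
    message: str      # human-readable summary safe to show in the client
    retryable: bool   # lets client apps decide whether to retry automatically
    details: dict = field(default_factory=dict)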

Technical Peek

Reliability-Gated Answer Controller

def answer_with_policy(env: QueryEnvelope) -> AnswerPayload:
    retrieval = retrieve_context(query=env.query, ...)
    if not retrieval.context:
        # No grounding evidence at all: deterministic fallback, never forced generation.
        return fallback_payload(env=env, reason="empty_context", ...)

    # Draft an answer grounded in retrieved context plus recent session turns.
    draft = llm.generate(
        prompt=build_prompt_contract(retrieval.query),
        context=retrieval.context,
        history=sessions.get_recent_turns(env.session_id),
    )

    # Map citation tokens in the draft back to the fused retrieval hits.
    citations = align_citations(draft.answer, retrieval.fused_hits)
    report = build_guardrail_report(
        answer=draft.answer,
        retrieval=retrieval,
        citations=citations,
        policy=policy,              # policy objects resolved outside this handler
        policy_store=policy_store,
    )

    if report.blocked:
        # Guardrail verdict gates the response path, not just the model output.
        return fallback_payload(env=env, reason="guardrail_block", ...)

    return AnswerPayload(
        request_id=env.request_id,
        answer=draft.answer,
        citations=citations,
        sources=build_sources(citations),
        confidence=compute_confidence(report, retrieval),
        metrics=collect_runtime_metrics(...),
        guardrails=report,
    )

Why this matters: the answer path is constrained by runtime policy, not just model output.

Advanced Breakdown

Most Important Engineering Decisions

1. Retrieval Confidence Thresholding

Results below the relevance policy threshold are excluded; weak evidence routes to fallback instead of forced generation.

Why this matters: reduces unsupported answers under difficult queries.

Quality Gate
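
A minimal sketch of the gate, assuming reranker scores normalized to [0, 1]; the threshold value is an illustrative policy setting.

from dataclasses import dataclass

@dataclass
class ScoredHit:
    chunk_id: str
    score: float   # reranker relevance, normalized to [0, 1]

def gate_retrieval(hits: list[ScoredHit], min_relevance: float = 0.35) -> list[ScoredHit]:
    """Keep only hits that meet the relevance policy; an empty result means
    the caller routes to fallback instead of forcing generation."""
    return [h for h in hits if h.score >= min_relevance]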

2. Citation-to-Chunk Mapping

Every citation token maps to chunk IDs, source title, section, and revision so claims remain auditable.

Why this matters: users can verify answers instantly.

Traceability
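
A sketch of the mapping record and the source panel it feeds; field names are illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    token: str         # marker emitted in the answer text, e.g. "[2]"
    chunk_id: str
    source_title: str
    section: str
    revision: str      # document revision the claim was grounded on

def render_source_panel(citations: list[Citation]) -> list[str]:
    """One auditable line per citation: which doc, section, and revision backs it."""
    return [f"{c.token} {c.source_title} / {c.section} (rev {c.revision})"
            for c in citations]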

3. Hybrid Dense + Lexical Retrieval

Semantic retrieval captures meaning, lexical retrieval catches exact policy terms, and a reranker merges both signals.

Why this matters: improves recall on both narrative and exact-match queries.

Retrieval Quality

4. Runtime Guardrails + Fallback Lanes

Empty context, low citation precision, and low groundedness each trigger deterministic fallback behavior.

Why this matters: failure behavior is predictable in production.

Resilience
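
A sketch of deterministic lane selection; the condition names and thresholds are illustrative, not the production policy.

from dataclasses import dataclass

@dataclass
class GuardrailSignals:
    empty_context: bool
    citation_precision: float
    groundedness: float

def select_fallback(signals: GuardrailSignals) -> str | None:
    """First matching condition wins, so the same inputs always pick the same lane."""
    if signals.empty_context:
        return "empty_context"
    if signals.citation_precision < 0.8:   # illustrative policy thresholds
        return "low_citation_precision"
    if signals.groundedness < 0.7:
        return "low_groundedness"
    return None   # no lane triggered; serve the answer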

5. Regression Evaluation Harness

Benchmark suites compare the current release candidate against the baseline before deploy.

Why this matters: quality drift is caught before users see it.

Release Safety
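
A sketch of the deploy gate; metric names and numbers are made up for illustration.

def regression_gate(baseline: dict[str, float], candidate: dict[str, float],
                    tolerance: float = 0.02) -> list[str]:
    """Metrics where the candidate fell more than `tolerance` below baseline;
    a non-empty list blocks the release."""
    return [metric for metric, base in baseline.items()
            if candidate.get(metric, 0.0) < base - tolerance]

failures = regression_gate(
    baseline={"faithfulness": 0.91, "retrieval_recall": 0.87},
    candidate={"faithfulness": 0.88, "retrieval_recall": 0.88},  # illustrative runs
)
if failures:
    raise SystemExit(f"Blocking deploy: regression on {failures}")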

Implementation Notes

FastAPI · Postgres + pgvector · Hybrid Retrieval · Chunk Versioning · Trace Logging

Presentation path: projects/docchat-rag-production/presentations/upcoming/