AI Application Engineer · LLM + RAG shipped with eval gates
Most AI teams ship features that work on day one and quietly break by day thirty. Six end-to-end case studies, each with real architecture decisions, evaluation results, and deployment evidence you can inspect yourself.
How I Work
I don't need a long ramp-up. But I won't ship something I can't measure.
I read the code, talk to the people closest to the problem, and find the gap between what the system does and what the team actually needs it to do. That usually takes a week, not a month.
Working AI features with eval gates already in place, so the team can keep improving without me in the room. Not just code, but a system the team can own.
Capabilities Snapshot
Multi-source ingestion, chunk strategy, retrieval tuning, and citation-grounded answer flows.
Prompt routing, function/tool calling, schema-constrained outputs, and deterministic APIs.
Offline/online eval loops, observability, release checks, and rollback-safe deployment flow.
Role Fit Guide
It depends on your team's focus. Here's a quick guide.
Lead with Customer Support AI Triage and DocChat RAG to show speed gains with human-review safeguards.
Start with DocChat RAG and Knowledge Base Builder to demonstrate grounded answers and scalable ingestion.
Share RAG for Teams SaaS and RAG Evaluation Lab to highlight multi-tenant architecture and release-quality governance.
Projects Snapshot
Each one covers the same ground: what the problem was, how I built it, and what proves it worked.
Grounded internal knowledge assistant with citations, retrieval confidence, and session memory.
Automated faithfulness, relevance, and regression scoring used as release gates.
Multi-tenant AI app surface with auth, org isolation, billing flow, rate limits, and audit logs.
The Proof
Every case study includes the actual architecture, real code you can read, and the specific thresholds used — not a summary of what was done, but evidence of how.
Not pseudocode or high-level diagrams. Real implementation patterns — answer controllers, eval orchestrators, tenant isolation logic — that you can open and discuss line by line.
Faithfulness ≥ 0.88. p95 latency ≤ 1900 ms. Pass rate ≥ 0.90. These are the actual gate values configured in the eval lab, specific enough to block a bad release.
Each architecture choice in the case studies includes a "Why this matters" note — what problem it solves and what breaks if you skip it. The thinking is documented, not just the outcome.
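To make the gate values quoted above concrete, here is a minimal sketch of how thresholds like these could block a release. It is an illustration, not the eval lab's actual code: the names `GateThresholds` and `check_release_gates` and the metric keys are assumptions. Only the three threshold values come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Threshold values quoted on this page; field names are hypothetical.
    min_faithfulness: float = 0.88
    max_p95_latency_ms: float = 1900.0
    min_pass_rate: float = 0.90


def check_release_gates(metrics, gates=GateThresholds()):
    """Return a list of human-readable gate failures (empty list means pass).

    `metrics` is assumed to be a dict with the keys "faithfulness",
    "p95_latency_ms", and "pass_rate", produced by an offline eval run.
    """
    failures = []
    if metrics["faithfulness"] < gates.min_faithfulness:
        failures.append(
            f"faithfulness {metrics['faithfulness']:.2f} "
            f"< {gates.min_faithfulness:.2f}"
        )
    if metrics["p95_latency_ms"] > gates.max_p95_latency_ms:
        failures.append(
            f"p95 latency {metrics['p95_latency_ms']:.0f} ms "
            f"> {gates.max_p95_latency_ms:.0f} ms"
        )
    if metrics["pass_rate"] < gates.min_pass_rate:
        failures.append(
            f"pass rate {metrics['pass_rate']:.2f} < {gates.min_pass_rate:.2f}"
        )
    return failures


if __name__ == "__main__":
    # Made-up candidate numbers, purely for illustration.
    candidate = {"faithfulness": 0.91, "p95_latency_ms": 2100, "pass_rate": 0.93}
    problems = check_release_gates(candidate)
    if problems:
        # A CI job could exit non-zero here to block the deploy.
        print("Release blocked:", "; ".join(problems))
    else:
        print("All gates passed")
```

Returning the list of failures, rather than a single boolean, lets a release check report every violated gate in one run instead of stopping at the first.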
Process Snapshot
Define user questions, source-of-truth data, and quality targets before model tuning.
Implement deterministic interfaces and guardrails so behavior stays predictable.
Compare outputs, inspect failures, and tune retrieval/orchestration before release.
Contact Snapshot
If you're building AI that needs to hold up under real conditions, not just in a demo, I'd like to hear about it.
Tell me what you're trying to ship and what's in the way. I'll reply within a day with how I'd approach it and where I can move fastest.