About

I Close the Gap Between AI Demos and Production Systems.

Most AI features look great on demo day. I work on the part that comes after: making them reliable, measurable, and safe to keep in production a month later.

What pulled me toward this work was a pattern I kept seeing: AI features that looked great in the demo, then quietly became a trust problem once real users hit them. Nobody could explain why, because nobody had been measuring. I built my whole workflow around making sure that doesn't happen.

  • Before I write a prompt: I define what success looks like and where the data comes from.
  • While I build: I wrap model behavior in guardrails so it stays predictable under pressure.
  • Before I ship: I score it, compare it to the last version, and only release if it passes.
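
As a rough illustration of the last step, a release gate of this kind can be sketched in a few lines of Python. Everything named here is hypothetical and chosen only for the example: the `release_gate` function, the metric names, and the thresholds (`min_score`, `max_regression`). It assumes each version has already been scored on the same eval set, with higher scores being better.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def release_gate(
    candidate: dict[str, float],
    baseline: dict[str, float],
    min_score: float = 0.8,        # assumed absolute floor per metric
    max_regression: float = 0.02,  # assumed tolerated drop vs. the last version
) -> GateResult:
    """Compare candidate eval scores to the previous release and decide pass/fail."""
    reasons: list[str] = []
    for metric, base in baseline.items():
        cand = candidate.get(metric)
        if cand is None:
            reasons.append(f"{metric}: missing from candidate scores")
            continue
        if cand < min_score:
            reasons.append(f"{metric}: {cand:.2f} is below the floor {min_score:.2f}")
        if base - cand > max_regression:
            reasons.append(f"{metric}: dropped from {base:.2f} to {cand:.2f}")
    # The gate passes only when no metric raised a reason to block.
    return GateResult(passed=not reasons, reasons=reasons)


# Hypothetical scores for the last release and a new candidate.
baseline = {"faithfulness": 0.90, "answer_relevance": 0.85}
candidate = {"faithfulness": 0.85, "answer_relevance": 0.86}

result = release_gate(candidate, baseline)
print(result.passed)
for reason in result.reasons:
    print(reason)
```

Returning the reasons alongside the pass/fail flag keeps a blocked release explainable: the output says which metric regressed and by how much.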

What I'm Actually Good At

RAG and Retrieval Quality

Chunking strategies, hybrid retrieval, reranking, and citation-grounded response patterns.

RAG Architecture · Search Quality
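
To make "hybrid retrieval" concrete, here is a minimal sketch of one common way to merge a keyword ranking with a vector ranking: reciprocal rank fusion. This is an illustrative choice, not a description of any specific system. The function name, the document IDs, and the constant `k=60` (a conventional default for this technique) are assumptions for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by more than one retriever rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# → ['b', 'a', 'd', 'c']
```

Rank-based fusion avoids having to put keyword scores and vector similarities on one scale, which is why it is a frequent first choice before a reranker is added.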

LLM Workflow Engineering

Prompt contracts, function-calling, fallback behavior, and stable API outputs for apps.

Tool Calling · FastAPI · Automation

Evaluation and Reliability

Regression suites, faithfulness tracking, and trace-based debugging in pre-release workflows.

Evals · Observability · Release Gates

Stack

Technologies I Use Most

Application Layer

Python-first APIs and workflow systems: the tools I reach for when something needs to ship fast and stay maintainable.

Python · FastAPI · Pydantic · n8n · Docker

Data + AI Layer

Everything needed to make retrieval accurate, answers grounded, and quality measurable before release.

Postgres · pgvector · OpenAI APIs · Reranking · Eval Harnesses