Building production-grade agentic systems, RAG pipelines & LLM safety architectures — across banking, automotive & pharmaceutical domains. Not demos.
These aren't principles from a paper — they're scars from real production systems that broke, were fixed, and survived.
LLMs are non-deterministic. Your system cannot be.
Every LLM call is wrapped in typed validation, retries, and schema enforcement. The AI is a powerful component — not the controller.
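The wrap-every-call pattern can be sketched in a few lines. This is a minimal illustration, not the production code: `call_llm`, the `Verdict` schema, and the label set are hypothetical, and a hand-rolled validator stands in for a typed-model library.

```python
import json
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str
    confidence: float

ALLOWED_LABELS = {"approve", "reject", "escalate"}

def parse_verdict(raw: str) -> Verdict:
    """Enforce the schema; raise on any deviation instead of passing raw text on."""
    data = json.loads(raw)
    label = data["label"]
    confidence = float(data["confidence"])
    if label not in ALLOWED_LABELS or not 0.0 <= confidence <= 1.0:
        raise ValueError(f"schema violation: {data!r}")
    return Verdict(label=label, confidence=confidence)

def call_with_validation(call_llm, prompt: str, max_retries: int = 3) -> Verdict:
    """Retry until the model output passes schema enforcement."""
    last_err = None
    for _ in range(max_retries):
        try:
            return parse_verdict(call_llm(prompt))
        except (ValueError, KeyError) as err:
            last_err = err  # malformed output: retry, never propagate free text
    raise RuntimeError(f"LLM never produced valid output: {last_err}")

# Stub model: answers with free text once, then with valid JSON.
responses = iter(['Sure! The answer is approve.',
                  '{"label": "approve", "confidence": 0.93}'])
verdict = call_with_validation(lambda p: next(responses), "Classify this request.")
print(verdict.label)  # approve
```

The AI stays a component: the caller only ever sees a typed `Verdict`, never an unvalidated string.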
Guardrails are the architecture, not an afterthought.
Presidio PII detection runs before any data reaches an LLM. Unmasking happens only after compliance verification. Zero exposure by design.
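The mask-before, unmask-after-verification flow looks roughly like this. A toy regex detector stands in for Presidio so the sketch stays dependency-free; the control flow, not the detector, is the point.

```python
import re

class PIIMasker:
    """Mask before the LLM, unmask only after verification passes.
    A toy regex detector stands in for Presidio's recognizers here."""
    def __init__(self):
        self.vault = {}   # placeholder -> original value; never sent to the LLM
        self.counter = 0

    def mask(self, text: str) -> str:
        def replace(match):
            self.counter += 1
            token = f"<PERSON_{self.counter}>"
            self.vault[token] = match.group(0)
            return token
        # Stand-in pattern; Presidio's analyzers replace this in production.
        return re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", replace, text)

    def unmask(self, text: str, compliance_ok: bool) -> str:
        if not compliance_ok:
            return text  # placeholders stay masked until verification passes
        for token, original in self.vault.items():
            text = text.replace(token, original)
        return text

masker = PIIMasker()
safe = masker.mask("Name: John Smith. Refill request pending.")
print(safe)  # Name: <PERSON_1>. Refill request pending.
print(masker.unmask("Refill approved for <PERSON_1>.", compliance_ok=True))
```

The vault never leaves the service boundary, so even a fully compromised prompt can only leak placeholders.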
Every system must have a fallback path.
DLQs, circuit breakers, human review queues. When an LLM call fails — and it will — the system degrades predictably, not catastrophically.
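A circuit breaker plus DLQ handoff can be sketched as follows; thresholds, reset windows, and the fallback message are illustrative assumptions, not values from the real systems.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated LLM errors instead of retrying forever."""
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means circuit closed (calls allowed)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

def answer(query, call_llm, breaker, dlq):
    """Degrade predictably: failed or blocked queries go to the review queue."""
    if not breaker.allow():
        dlq.append(query)
        return "A specialist will follow up shortly."
    try:
        result = call_llm(query)
        breaker.record(success=True)
        return result
    except RuntimeError:
        breaker.record(success=False)
        dlq.append(query)
        return "A specialist will follow up shortly."

def flaky(query):
    raise RuntimeError("provider timeout")

breaker, dlq = CircuitBreaker(), []
for i in range(4):
    answer(f"query {i}", flaky, breaker, dlq)
print(len(dlq), breaker.allow())  # all four queued; circuit now open
```

After the third consecutive failure the breaker opens, so the fourth query never touches the provider at all.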
Latency is a product decision, not a technical one.
Took CortexIQ from 7s to 3s by mapping the agent dependency graph and parallelizing independent nodes. Users feel the difference.
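The parallelization idea reduces to running dependency-free nodes concurrently. A minimal sketch with `asyncio`, using sleeps as stand-ins for LLM calls; the node names mirror the pipeline but the timings are invented.

```python
import asyncio
import time

async def intent_node(query):          # independent of clarification
    await asyncio.sleep(0.2)           # stands in for an LLM call
    return "refill_request"

async def clarification_node(query):   # independent of intent
    await asyncio.sleep(0.2)
    return "no clarification needed"

async def generation_node(intent, clarification):  # depends on both
    await asyncio.sleep(0.2)
    return f"[{intent}] drafted reply"

async def run_pipeline(query):
    # Nodes with no mutual dependency run concurrently, not back-to-back.
    intent, clarification = await asyncio.gather(
        intent_node(query), clarification_node(query)
    )
    return await generation_node(intent, clarification)

start = time.perf_counter()
reply = asyncio.run(run_pipeline("Can I refill my prescription?"))
elapsed = time.perf_counter() - start
print(reply, f"({elapsed:.2f}s)")  # ~0.4s instead of ~0.6s sequential
```

Mapping the dependency graph first is what makes this safe: only provably independent nodes get gathered.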
Observability from the first commit.
Token cost, latency, confidence scores — logged from day one. Not retrofitted after something breaks at 2 AM.
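Day-one observability can be as small as a decorator emitting structured records per call. A sketch with assumed field names; the token count here is a word-count proxy, whereas a real client reports exact usage.

```python
import functools
import json
import time

def observed(fn):
    """Log latency and a token-cost proxy for every LLM call from the first commit."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        record = {
            "call": fn.__name__,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
            # Rough proxy for the sketch; a real client reports usage exactly.
            "approx_tokens": len(str(result).split()),
        }
        print(json.dumps(record))  # stands in for a structured log sink
        return result
    return wrapper

@observed
def summarize(text: str) -> str:
    return "Short summary of the complaint."  # stands in for a model call

summary = summarize("The brakes made a grinding noise at low speed.")
```

Because the decorator wraps every call site, cost and latency trends exist from the first deploy, not from the first incident.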
Build for the regulated domain, not the demo.
Banking, pharma, automotive — these domains don't tolerate hallucinations. RLHF-ready feedback loops and auditable compliance outputs are the minimum.
Every item below has been used in a shipped, production system — selected for reliability, safety, and real-world scale.
Real architecture flows — not marketing diagrams. Each node is a production component.
Each entry shows the real engineering decision — not just what was built, but why the architecture choices were made.
Pharmacy domain needed LLM-powered dialogue under simultaneous PII/PHI, hallucination, and regulatory constraints — no framework handled all three in a multi-agent flow.
Defense-in-depth: Presidio masking → LangGraph agent flow (intent → clarification → grounding → generation → compliance) → controlled unmask. Redis session state.
MongoDB → Redis for AgentState gave 2× session retrieval speed. RLHF-ready feedback model built from day one — enables reward modeling without rearchitecting.
Legal teams manually reviewing large-scale banking documents for compliance risks — high error rate under deadline pressure, regulatory exposure when risks were missed.
Vertex AI extraction → semantic RAG → risk classification with citation grounding. Every flagged risk traces to a source document — no black-box decisions for legal teams.
Hybrid retrieval (keyword + semantic) for legal precision. Pure cosine similarity missed technical clause matches. Thresholds calibrated on historical risk labels by domain experts.
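One common way to fuse keyword and semantic rankings is reciprocal rank fusion; this sketch assumes RRF was the scheme and uses invented document ids, so treat it as an illustration of the hybrid idea rather than the exact production fusion.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each list contributes 1/(k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical doc ids: keyword search nails the exact clause term,
# semantic search surfaces paraphrases; fusion keeps both strengths.
keyword_hits  = ["clause_7b", "clause_12", "annex_a"]
semantic_hits = ["annex_a", "clause_7b", "intro"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)  # ['clause_7b', 'annex_a', 'clause_12', 'intro']
```

A clause that both retrievers rank highly (`clause_7b`) beats one that only a single retriever found, which is exactly the behavior pure cosine similarity couldn't give.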
Thousands of unstructured NHTSA safety complaints required consistent categorization into 30+ issue domains for regulatory reporting — impossible at scale manually.
Fine-tuned multi-label classifier on automotive vocabulary. Batch pipeline on SageMaker endpoints. Output aligned to NHTSA taxonomy with confidence scoring per label.
Prompt-only: 73% accuracy. Fine-tuned: 92%. Training cost justified at 100k+ records/month. Batch over real-time for cost efficiency at scale.
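Multi-label output with per-label confidence looks roughly like this at inference time. The label subset, logits, and thresholds are invented for illustration; the key point is independent sigmoids per label rather than a single softmax.

```python
import math

NHTSA_LABELS = ["brakes", "steering", "electrical", "airbag"]  # illustrative subset

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, thresholds):
    """Multi-label: each label gets an independent probability, not a softmax share."""
    confidences = {lbl: sigmoid(z) for lbl, z in zip(NHTSA_LABELS, logits)}
    assigned = [lbl for lbl, p in confidences.items() if p >= thresholds[lbl]]
    return assigned, confidences

# Hypothetical classifier logits for one complaint; thresholds calibrated per label.
logits = [2.1, -1.3, 0.4, -3.0]
thresholds = {"brakes": 0.5, "steering": 0.5, "electrical": 0.6, "airbag": 0.5}
labels, conf = predict_labels(logits, thresholds)
print(labels)  # ['brakes']
```

Per-label thresholds matter: `electrical` here scores just under its calibrated cutoff and is correctly held back, which a single global threshold would have missed.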
Fragmented banking product information across documents. Query resolution was slow, inconsistent, and dependent on human subject-matter experts for every answer.
RAG pipeline over banking corpus. Hybrid retrieval for precision. Responses grounded with citations. Confidence gating routes low-certainty queries to human agents.
Citation grounding non-negotiable for banking — hallucination without attribution is a compliance failure, not just a quality issue. Every answer traces to a source document.
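Confidence gating plus the no-citation rule can be sketched as a single routing function; the threshold, route names, and citation format are illustrative assumptions.

```python
def route(answer: str, confidence: float, citations: list,
          threshold: float = 0.75):
    """Gate: low-certainty or uncited answers never reach the customer directly."""
    if not citations:
        return ("human_review", "Answer lacks source attribution.")
    if confidence < threshold:
        return ("human_review", f"Confidence {confidence:.2f} below {threshold}.")
    return ("auto_reply", answer)

# Hypothetical pipeline outputs for the same draft answer.
print(route("Fees are waived above $5,000.", 0.91, ["fee_schedule.pdf#p3"]))
print(route("Fees are waived above $5,000.", 0.52, ["fee_schedule.pdf#p3"]))
print(route("Fees are waived above $5,000.", 0.91, []))
```

Note the ordering: a missing citation overrides even a high confidence score, because attribution is the compliance requirement.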
Most engineers hide their failures. I document them — because real credibility is built on post-mortems, not highlight reels.
Sequential agent execution looked fine in testing. Under real conversational load, each agent waiting for the previous made the UX feel broken. Users dropped queries mid-flow.
Mapped full dependency graph in LangGraph. Parallelized independent nodes — intent + clarification ran simultaneously. Added Redis session caching. Final: ~3s avg response time.
Cosine similarity returned confident scores for mismatched context. LLM answered confidently with wrong information — dangerous in regulated banking where users act on answers.
Switched to hybrid retrieval (BM25 + semantic). Added cross-encoder reranking. Implemented citation grounding validation — an answer that can't be traced to a source chunk doesn't go out.
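The grounding validation can be approximated with a crude lexical-overlap check, sketched here as a stand-in for the real validator: if too few of the answer's content words appear in any retrieved chunk, the answer is blocked.

```python
def grounded(answer: str, source_chunks: list, min_overlap: float = 0.5) -> bool:
    """Crude grounding check: enough of the answer's content words must appear
    in at least one retrieved chunk. A toy stand-in for the real validator."""
    answer_terms = {w.lower().strip(".,") for w in answer.split() if len(w) > 3}
    if not answer_terms:
        return False
    for chunk in source_chunks:
        chunk_terms = {w.lower().strip(".,") for w in chunk.split()}
        overlap = len(answer_terms & chunk_terms) / len(answer_terms)
        if overlap >= min_overlap:
            return True
    return False

chunks = ["Overdraft fees are waived for accounts holding above 5,000 dollars."]
print(grounded("Overdraft fees are waived above 5,000 dollars.", chunks))  # True
print(grounded("Wire transfers are free on weekends.", chunks))            # False
```

The second answer is confident-sounding but untraceable, so it never ships — which is the whole point of the gate.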
Early regex-based masking had gaps — composite patient codes bypassed sanitization. Pre-launch QA caught it before production, but the near-miss was sobering.
Replaced all custom regex with Microsoft Presidio as sole masking authority. Made PII scanning the mandatory first node — if it fails, request is rejected. No exceptions.
Six weeks post-deployment, NHTSA classifier output format drifted. No code changes. Root cause: provider silently updated the underlying model version affecting token distributions.
Pydantic schema validation on every LLM response. Automated regression tests run daily against production endpoints. Schema mismatch triggers an alert before users are affected.
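The drift-detection loop can be sketched as a frozen contract plus canned regression prompts. A hand-rolled validator stands in for Pydantic to keep the sketch dependency-free; the field names, prompts, and endpoint stubs are invented.

```python
import json

EXPECTED_SCHEMA = {"labels": list, "model_version": str}  # contract frozen in tests

def validate(raw: str) -> dict:
    """Reject any response whose shape drifts from the frozen contract."""
    data = json.loads(raw)
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in data or not isinstance(data[field], ftype):
            raise ValueError(f"schema drift on field {field!r}: {data!r}")
    return data

def daily_regression(endpoint) -> list:
    """Run canned prompts against the live endpoint; collect drift alerts."""
    alerts = []
    for prompt in ["complaint: brakes grinding", "complaint: airbag warning light"]:
        try:
            validate(endpoint(prompt))
        except ValueError as err:
            alerts.append(str(err))  # page on-call before users are affected
    return alerts

healthy = lambda p: '{"labels": ["brakes"], "model_version": "v2"}'
drifted = lambda p: '{"categories": ["brakes"]}'  # silent provider-side change
print(daily_regression(healthy))       # no alerts
print(len(daily_regression(drifted)))  # every canned prompt fires an alert
```

Because the regression runs against the live endpoint, a provider-side model swap surfaces as alerts the same day, not six weeks later.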
AgentState stored in MongoDB added round-trip overhead per agent step. In a 7-node pipeline, small latencies compound. Fine in design — painful in production profiling.
Migrated to Redis. Key insight: conversational memory is ephemeral and latency-sensitive — MongoDB is wrong for this. Redis key structure optimized to eliminate JSON deserialization overhead.
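The flat-key, field-per-entry layout can be illustrated with an in-memory stand-in for Redis: per-field keys with a TTL mean a node fetches and deserializes only the field it needs, not the whole state blob. The key format and TTL are assumptions for the sketch.

```python
import json
import time

class SessionStore:
    """In-memory stand-in for the Redis AgentState store: flat keys with TTL,
    one field per key so each read deserializes only what it needs."""
    def __init__(self, ttl_seconds: float = 1800.0):
        self.ttl = ttl_seconds
        self._data = {}  # key -> (expires_at, serialized_value)

    @staticmethod
    def key(session_id: str, field: str) -> str:
        # Flat, predictable key structure, e.g. "agentstate:abc123:intent".
        return f"agentstate:{session_id}:{field}"

    def set(self, session_id: str, field: str, value) -> None:
        self._data[self.key(session_id, field)] = (
            time.monotonic() + self.ttl, json.dumps(value))

    def get(self, session_id: str, field: str):
        entry = self._data.get(self.key(session_id, field))
        if entry is None or time.monotonic() > entry[0]:
            return None  # expired: conversational memory is ephemeral by design
        return json.loads(entry[1])

store = SessionStore()
store.set("abc123", "intent", {"label": "refill_request", "confidence": 0.94})
print(store.get("abc123", "intent"))
```

The TTL encodes the key insight directly: session memory that nobody touches for half an hour is garbage, and the store should treat it that way.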
Technical writing on agentic AI, LLM safety, and high-impact production systems on Medium. No content marketing — just engineering depth.
A deep dive into agent dependency graphs, parallelization strategies, and why sequential execution is the silent killer of agentic AI in production.
How hybrid retrieval, cross-encoder reranking, and citation grounding reduced our banking RAG hallucination rate from 12% to under 2%.
The architecture decisions behind CortexIQ's zero-PHI-exposure guarantee — and why regex-based masking is never enough for regulated domains.
The 73% → 92% jump that justified fine-tuning, and the decision framework I now use to choose between prompt engineering and supervised training.
Every system listed under Projects was built here. One company, two-plus years, five production systems across three regulated domains.
Open to Senior Generative AI, Agentic AI Architect, and AI Engineering roles. If you're building AI that needs to make a real impact — let's talk.