Building production-grade agentic systems, RAG pipelines & LLM safety architectures — across banking, automotive & pharmaceutical domains. Not demos.
These aren't principles from a paper — they're scars from real production systems that broke, were fixed, and survived.
LLMs are non-deterministic. Your system cannot be.
Every LLM call is wrapped in typed validation, retries, and schema enforcement. The AI is a powerful component — not the controller.
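The wrap-every-call pattern can be sketched in a few lines. This is a minimal illustration, not the production code: `call_llm`, the `Verdict` schema, and the label set are hypothetical, and a hand-rolled validator stands in for a typed-model library.

```python
import json
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str
    confidence: float

ALLOWED_LABELS = {"approve", "reject", "escalate"}

def parse_verdict(raw: str) -> Verdict:
    """Enforce the schema; raise on any deviation instead of passing raw text on."""
    data = json.loads(raw)
    label = data["label"]
    confidence = float(data["confidence"])
    if label not in ALLOWED_LABELS or not 0.0 <= confidence <= 1.0:
        raise ValueError(f"schema violation: {data!r}")
    return Verdict(label=label, confidence=confidence)

def call_with_validation(call_llm, prompt: str, max_retries: int = 3) -> Verdict:
    """Retry until the model output passes schema enforcement."""
    last_err = None
    for _ in range(max_retries):
        try:
            return parse_verdict(call_llm(prompt))
        except (ValueError, KeyError) as err:
            last_err = err  # malformed output: retry, never propagate free text
    raise RuntimeError(f"LLM never produced valid output: {last_err}")

# Stub model: answers with free text once, then with valid JSON.
responses = iter(['Sure! The answer is approve.',
                  '{"label": "approve", "confidence": 0.93}'])
verdict = call_with_validation(lambda p: next(responses), "Classify this request.")
print(verdict.label)  # approve
```

The AI stays a component: the caller only ever sees a typed `Verdict`, never an unvalidated string.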
Guardrails are the architecture, not an afterthought.
Presidio PII detection runs before any data reaches an LLM. Unmasking happens only after compliance verification. Zero exposure by design.
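The mask-before, unmask-after-verification flow looks roughly like this. A toy regex detector stands in for Presidio so the sketch stays dependency-free; the control flow, not the detector, is the point.

```python
import re

class PIIMasker:
    """Mask before the LLM, unmask only after verification passes.
    A toy regex detector stands in for Presidio's recognizers here."""
    def __init__(self):
        self.vault = {}   # placeholder -> original value; never sent to the LLM
        self.counter = 0

    def mask(self, text: str) -> str:
        def replace(match):
            self.counter += 1
            token = f"<PERSON_{self.counter}>"
            self.vault[token] = match.group(0)
            return token
        # Stand-in pattern; Presidio's analyzers replace this in production.
        return re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", replace, text)

    def unmask(self, text: str, compliance_ok: bool) -> str:
        if not compliance_ok:
            return text  # placeholders stay masked until verification passes
        for token, original in self.vault.items():
            text = text.replace(token, original)
        return text

masker = PIIMasker()
safe = masker.mask("Name: John Smith. Refill request pending.")
print(safe)  # Name: <PERSON_1>. Refill request pending.
print(masker.unmask("Refill approved for <PERSON_1>.", compliance_ok=True))
```

The vault never leaves the service boundary, so even a fully compromised prompt can only leak placeholders.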
Every system must have a fallback path.
DLQs, circuit breakers, human review queues. When an LLM call fails — and it will — the system degrades predictably, not catastrophically.
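A circuit breaker plus DLQ handoff can be sketched as follows; thresholds, reset windows, and the fallback message are illustrative assumptions, not values from the real systems.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated LLM errors instead of retrying forever."""
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means circuit closed (calls allowed)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

def answer(query, call_llm, breaker, dlq):
    """Degrade predictably: failed or blocked queries go to the review queue."""
    if not breaker.allow():
        dlq.append(query)
        return "A specialist will follow up shortly."
    try:
        result = call_llm(query)
        breaker.record(success=True)
        return result
    except RuntimeError:
        breaker.record(success=False)
        dlq.append(query)
        return "A specialist will follow up shortly."

def flaky(query):
    raise RuntimeError("provider timeout")

breaker, dlq = CircuitBreaker(), []
for i in range(4):
    answer(f"query {i}", flaky, breaker, dlq)
print(len(dlq), breaker.allow())  # all four queued; circuit now open
```

After the third consecutive failure the breaker opens, so the fourth query never touches the provider at all.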
Latency is a product decision, not a technical one.
Took CortexIQ from 7s to 3s by mapping the agent dependency graph and parallelizing independent nodes. Users feel the difference.
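The parallelization idea reduces to running dependency-free nodes concurrently. A minimal sketch with `asyncio`, using sleeps as stand-ins for LLM calls; the node names mirror the pipeline but the timings are invented.

```python
import asyncio
import time

async def intent_node(query):          # independent of clarification
    await asyncio.sleep(0.2)           # stands in for an LLM call
    return "refill_request"

async def clarification_node(query):   # independent of intent
    await asyncio.sleep(0.2)
    return "no clarification needed"

async def generation_node(intent, clarification):  # depends on both
    await asyncio.sleep(0.2)
    return f"[{intent}] drafted reply"

async def run_pipeline(query):
    # Nodes with no mutual dependency run concurrently, not back-to-back.
    intent, clarification = await asyncio.gather(
        intent_node(query), clarification_node(query)
    )
    return await generation_node(intent, clarification)

start = time.perf_counter()
reply = asyncio.run(run_pipeline("Can I refill my prescription?"))
elapsed = time.perf_counter() - start
print(reply, f"({elapsed:.2f}s)")  # ~0.4s instead of ~0.6s sequential
```

Mapping the dependency graph first is what makes this safe: only provably independent nodes get gathered.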
Observability from the first commit.
Token cost, latency, confidence scores — logged from day one. Not retrofitted after something breaks at 2 AM.
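Day-one observability can be as small as a decorator emitting structured records per call. A sketch with assumed field names; the token count here is a word-count proxy, whereas a real client reports exact usage.

```python
import functools
import json
import time

def observed(fn):
    """Log latency and a token-cost proxy for every LLM call from the first commit."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        record = {
            "call": fn.__name__,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
            # Rough proxy for the sketch; a real client reports usage exactly.
            "approx_tokens": len(str(result).split()),
        }
        print(json.dumps(record))  # stands in for a structured log sink
        return result
    return wrapper

@observed
def summarize(text: str) -> str:
    return "Short summary of the complaint."  # stands in for a model call

summary = summarize("The brakes made a grinding noise at low speed.")
```

Because the decorator wraps every call site, cost and latency trends exist from the first deploy, not from the first incident.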
Build for the regulated domain, not the demo.
Banking, pharma, automotive — these domains don't tolerate hallucinations. RLHF-ready feedback loops and auditable compliance outputs are the minimum.
Every item below has been used in a shipped, production system — selected for reliability, safety, and real-world scale.
Real architecture flows — not marketing diagrams. Each node is a production component.
Each entry shows the real engineering decision — not just what was built, but why the architecture choices were made.
Pharmacy domain needed LLM-powered dialogue under simultaneous PII/PHI, hallucination, and regulatory constraints — no framework handled all three in a multi-agent flow.
Defense-in-depth: Presidio masking → LangGraph agent flow (intent → clarification → grounding → generation → compliance) → controlled unmask. Redis session state.
MongoDB → Redis for AgentState gave 2× session retrieval speed. RLHF-ready feedback model built from day one — enables reward modeling without rearchitecting.
Legal teams manually reviewing large-scale banking documents for compliance risks — high error rate under deadline pressure, regulatory exposure when risks were missed.
Vertex AI extraction → semantic RAG → risk classification with citation grounding. Every flagged risk traces to a source document — no black-box decisions for legal teams.
Hybrid retrieval (keyword + semantic) for legal precision. Pure cosine similarity missed technical clause matches. Thresholds calibrated on historical risk labels by domain experts.
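One common way to fuse keyword and semantic rankings is reciprocal rank fusion; this sketch assumes RRF was the scheme and uses invented document ids, so treat it as an illustration of the hybrid idea rather than the exact production fusion.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each list contributes 1/(k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical doc ids: keyword search nails the exact clause term,
# semantic search surfaces paraphrases; fusion keeps both strengths.
keyword_hits  = ["clause_7b", "clause_12", "annex_a"]
semantic_hits = ["annex_a", "clause_7b", "intro"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)  # ['clause_7b', 'annex_a', 'clause_12', 'intro']
```

A clause that both retrievers rank highly (`clause_7b`) beats one that only a single retriever found, which is exactly the behavior pure cosine similarity couldn't give.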
Thousands of unstructured NHTSA safety complaints required consistent categorization into 30+ issue domains for regulatory reporting — impossible at scale manually.
Fine-tuned multi-label classifier on automotive vocabulary. Batch pipeline on SageMaker endpoints. Output aligned to NHTSA taxonomy with confidence scoring per label.
Prompt-only: 73% accuracy. Fine-tuned: 92%. Training cost justified at 100k+ records/month. Batch over real-time for cost efficiency at scale.
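Multi-label output with per-label confidence looks roughly like this at inference time. The label subset, logits, and thresholds are invented for illustration; the key point is independent sigmoids per label rather than a single softmax.

```python
import math

NHTSA_LABELS = ["brakes", "steering", "electrical", "airbag"]  # illustrative subset

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, thresholds):
    """Multi-label: each label gets an independent probability, not a softmax share."""
    confidences = {lbl: sigmoid(z) for lbl, z in zip(NHTSA_LABELS, logits)}
    assigned = [lbl for lbl, p in confidences.items() if p >= thresholds[lbl]]
    return assigned, confidences

# Hypothetical classifier logits for one complaint; thresholds calibrated per label.
logits = [2.1, -1.3, 0.4, -3.0]
thresholds = {"brakes": 0.5, "steering": 0.5, "electrical": 0.6, "airbag": 0.5}
labels, conf = predict_labels(logits, thresholds)
print(labels)  # ['brakes']
```

Per-label thresholds matter: `electrical` here scores just under its calibrated cutoff and is correctly held back, which a single global threshold would have missed.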
Fragmented banking product information across documents. Query resolution was slow, inconsistent, and dependent on human subject-matter experts for every answer.
RAG pipeline over banking corpus. Hybrid retrieval for precision. Responses grounded with citations. Confidence gating routes low-certainty queries to human agents.
Citation grounding non-negotiable for banking — hallucination without attribution is a compliance failure, not just a quality issue. Every answer traces to a source document.
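Confidence gating plus the no-citation rule can be sketched as a single routing function; the threshold, route names, and citation format are illustrative assumptions.

```python
def route(answer: str, confidence: float, citations: list,
          threshold: float = 0.75):
    """Gate: low-certainty or uncited answers never reach the customer directly."""
    if not citations:
        return ("human_review", "Answer lacks source attribution.")
    if confidence < threshold:
        return ("human_review", f"Confidence {confidence:.2f} below {threshold}.")
    return ("auto_reply", answer)

# Hypothetical pipeline outputs for the same draft answer.
print(route("Fees are waived above $5,000.", 0.91, ["fee_schedule.pdf#p3"]))
print(route("Fees are waived above $5,000.", 0.52, ["fee_schedule.pdf#p3"]))
print(route("Fees are waived above $5,000.", 0.91, []))
```

Note the ordering: a missing citation overrides even a high confidence score, because attribution is the compliance requirement.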
Most engineers hide their failures. I document them — because real credibility is built on post-mortems, not highlight reels.
Sequential agent execution looked fine in testing. Under real conversational load, each agent waiting for the previous made the UX feel broken. Users dropped queries mid-flow.
Mapped full dependency graph in LangGraph. Parallelized independent nodes — intent + clarification ran simultaneously. Added Redis session caching. Final: ~3s avg response time.
Cosine similarity returned confident scores for mismatched context. LLM answered confidently with wrong information — dangerous in regulated banking where users act on answers.
Switched to hybrid retrieval (BM25 + semantic). Added cross-encoder reranking. Implemented citation grounding validation — an answer that can't be traced to a source chunk doesn't go out.
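The grounding validation can be approximated with a crude lexical-overlap check, sketched here as a stand-in for the real validator: if too few of the answer's content words appear in any retrieved chunk, the answer is blocked.

```python
def grounded(answer: str, source_chunks: list, min_overlap: float = 0.5) -> bool:
    """Crude grounding check: enough of the answer's content words must appear
    in at least one retrieved chunk. A toy stand-in for the real validator."""
    answer_terms = {w.lower().strip(".,") for w in answer.split() if len(w) > 3}
    if not answer_terms:
        return False
    for chunk in source_chunks:
        chunk_terms = {w.lower().strip(".,") for w in chunk.split()}
        overlap = len(answer_terms & chunk_terms) / len(answer_terms)
        if overlap >= min_overlap:
            return True
    return False

chunks = ["Overdraft fees are waived for accounts holding above 5,000 dollars."]
print(grounded("Overdraft fees are waived above 5,000 dollars.", chunks))  # True
print(grounded("Wire transfers are free on weekends.", chunks))            # False
```

The second answer is confident-sounding but untraceable, so it never ships — which is the whole point of the gate.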
Early regex-based masking had gaps — composite patient codes bypassed sanitization. Pre-launch QA caught it before production, but the near-miss was sobering.
Replaced all custom regex with Microsoft Presidio as sole masking authority. Made PII scanning the mandatory first node — if it fails, request is rejected. No exceptions.
Six weeks post-deployment, NHTSA classifier output format drifted. No code changes. Root cause: provider silently updated the underlying model version affecting token distributions.
Pydantic schema validation on every LLM response. Automated regression tests run daily against production endpoints. Schema mismatch triggers an alert before users are affected.
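The drift-detection loop can be sketched as a frozen contract plus canned regression prompts. A hand-rolled validator stands in for Pydantic to keep the sketch dependency-free; the field names, prompts, and endpoint stubs are invented.

```python
import json

EXPECTED_SCHEMA = {"labels": list, "model_version": str}  # contract frozen in tests

def validate(raw: str) -> dict:
    """Reject any response whose shape drifts from the frozen contract."""
    data = json.loads(raw)
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in data or not isinstance(data[field], ftype):
            raise ValueError(f"schema drift on field {field!r}: {data!r}")
    return data

def daily_regression(endpoint) -> list:
    """Run canned prompts against the live endpoint; collect drift alerts."""
    alerts = []
    for prompt in ["complaint: brakes grinding", "complaint: airbag warning light"]:
        try:
            validate(endpoint(prompt))
        except ValueError as err:
            alerts.append(str(err))  # page on-call before users are affected
    return alerts

healthy = lambda p: '{"labels": ["brakes"], "model_version": "v2"}'
drifted = lambda p: '{"categories": ["brakes"]}'  # silent provider-side change
print(daily_regression(healthy))       # no alerts
print(len(daily_regression(drifted)))  # every canned prompt fires an alert
```

Because the regression runs against the live endpoint, a provider-side model swap surfaces as alerts the same day, not six weeks later.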
AgentState stored in MongoDB added round-trip overhead per agent step. In a 7-node pipeline, small latencies compound. Fine in design — painful in production profiling.
Migrated to Redis. Key insight: conversational memory is ephemeral and latency-sensitive — MongoDB is wrong for this. Redis key structure optimized to eliminate JSON deserialization overhead.
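The flat-key, field-per-entry layout can be illustrated with an in-memory stand-in for Redis: per-field keys with a TTL mean a node fetches and deserializes only the field it needs, not the whole state blob. The key format and TTL are assumptions for the sketch.

```python
import json
import time

class SessionStore:
    """In-memory stand-in for the Redis AgentState store: flat keys with TTL,
    one field per key so each read deserializes only what it needs."""
    def __init__(self, ttl_seconds: float = 1800.0):
        self.ttl = ttl_seconds
        self._data = {}  # key -> (expires_at, serialized_value)

    @staticmethod
    def key(session_id: str, field: str) -> str:
        # Flat, predictable key structure, e.g. "agentstate:abc123:intent".
        return f"agentstate:{session_id}:{field}"

    def set(self, session_id: str, field: str, value) -> None:
        self._data[self.key(session_id, field)] = (
            time.monotonic() + self.ttl, json.dumps(value))

    def get(self, session_id: str, field: str):
        entry = self._data.get(self.key(session_id, field))
        if entry is None or time.monotonic() > entry[0]:
            return None  # expired: conversational memory is ephemeral by design
        return json.loads(entry[1])

store = SessionStore()
store.set("abc123", "intent", {"label": "refill_request", "confidence": 0.94})
print(store.get("abc123", "intent"))
```

The TTL encodes the key insight directly: session memory that nobody touches for half an hour is garbage, and the store should treat it that way.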
Technical writing on agentic AI, LLM safety, and high-impact production systems on Medium. No content marketing — just engineering depth.
A deep dive into agent dependency graphs, parallelization strategies, and why sequential execution is the silent killer of agentic AI in production.
How hybrid retrieval, cross-encoder reranking, and citation grounding reduced our banking RAG hallucination rate from 12% to under 2%.
The architecture decisions behind CortexIQ's zero-PHI-exposure guarantee — and why regex-based masking is never enough for regulated domains.
The 73% → 92% jump that justified fine-tuning, and the decision framework I now use to choose between prompt engineering and supervised training.
Every system listed under Projects was built here. One company, two-plus years, five production systems across three regulated domains.
Open to Senior Generative AI, Agentic AI Architect, and AI Engineering roles. If you're building AI that needs to make a real impact — let's talk.