Why Building AI Agents Is 5% Models and 100% Engineering

What a doc-to-chat pipeline actually does

A doc-to-chat pipeline ingests enterprise documents, standardizes and normalizes them, enforces governance, indexes embeddings alongside relational features, and serves retrieval + generation through authenticated APIs with human-in-the-loop (HITL) checkpoints. This is the practical reference architecture for agentic Q&A, copilots, and workflow automation where answers must respect permissions and be audit-ready.

In production this pattern looks like a hardened RAG (retrieval-augmented generation) system with guardrails, governance, and OpenTelemetry-backed tracing to make behavior reproducible and auditable.

Integrating with an existing stack

Keep service boundaries standard (REST/JSON, gRPC) and use storage layers your organization already trusts. For tabular data, Iceberg provides ACID semantics, schema and partition evolution, and snapshots, all critical for reproducible retrieval and backfills. For vectors, pick a strategy that coexists with SQL filters.

Many production teams run both: Postgres with pgvector for transactional joins and policy enforcement, and Milvus for heavy retrieval workloads.
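As a minimal sketch of the pgvector half of that strategy, the query below combines an ACL filter and vector similarity in a single statement. The `chunks` table, its columns, and the embedding dimension are hypothetical; it assumes psycopg 3 and the pgvector extension are installed.

```python
# Hypothetical schema: chunks(id, doc_id, content, acl_group, embedding vector(1536)).
import psycopg  # psycopg 3

def policy_aware_search(conn: psycopg.Connection,
                        query_embedding: list[float],
                        user_groups: list[str],
                        k: int = 10):
    """Return the k nearest chunks the calling user is allowed to see."""
    sql = """
        SELECT id, doc_id, content,
               embedding <=> %s::vector AS cosine_distance
        FROM chunks
        WHERE acl_group = ANY(%s)           -- permission filter in the same statement
        ORDER BY embedding <=> %s::vector   -- pgvector cosine-distance operator
        LIMIT %s
    """
    emb = str(query_embedding)  # pgvector accepts '[0.1, 0.2, ...]' text literals
    with conn.cursor() as cur:
        cur.execute(sql, (emb, user_groups, emb, k))
        return cur.fetchall()
```

Milvus then serves the high-QPS ANN path, while queries like this keep permission checks and relational joins inside Postgres.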

Key components and their properties

Coordination between agents, humans, and workflows

Production agents need defined coordination points where humans can approve, correct, or escalate outputs. Managed services like AWS A2I (Amazon Augmented AI) provide HITL loops (private workforces, flow definitions) and act as a concrete gating mechanism for low-confidence outputs.

Frameworks such as LangGraph treat approvals as first-class steps in agent graphs so human checkpoints become explicit graph nodes, not ad hoc callbacks. Use these gates for actions like publishing summaries, filing tickets, or committing code. Persist every artifact—prompts, retrieval sets, and decisions—for auditing and reproducibility.

Pattern: LLM → confidence/guardrail checks → HITL gate → side-effects.
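A minimal sketch of that gate, assuming a confidence score already produced by the guardrail stage and a pre-created A2I flow definition (the ARN and threshold below are placeholders); it uses boto3's `sagemaker-a2i-runtime` client.

```python
# Sketch of the LLM → guardrail → HITL gate → side-effect pattern. The flow
# definition ARN, confidence threshold, and payload shape are placeholders.
import json
import uuid
import boto3

a2i = boto3.client("sagemaker-a2i-runtime")
FLOW_DEFINITION_ARN = "arn:aws:sagemaker:us-east-1:123456789012:flow-definition/doc-chat-review"
CONFIDENCE_THRESHOLD = 0.8  # tune against your own evaluation data

def answer_with_gate(question: str, draft_answer: str, confidence: float) -> dict:
    """Publish high-confidence answers; route the rest to human review."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"status": "published", "answer": draft_answer}

    # Low confidence: start a human loop instead of executing side effects.
    loop_name = f"doc-chat-{uuid.uuid4().hex[:16]}"
    a2i.start_human_loop(
        HumanLoopName=loop_name,
        FlowDefinitionArn=FLOW_DEFINITION_ARN,
        HumanLoopInput={"InputContent": json.dumps(
            {"question": question, "draft_answer": draft_answer, "confidence": confidence}
        )},
    )
    return {"status": "pending_review", "human_loop": loop_name}
```

Teams on LangGraph can express the same branch as an approval node in the agent graph; the gating logic stays identical.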

Enforcing reliability before models

Reliability is a layered defense: validated ingestion, ACID storage, ACL-aware catalogs, PII guardrails, hybrid retrieval, telemetry, and human gates each catch a different failure mode before the model ever sees a request.

Most outages and trust failures are caused by data plumbing, permissioning gaps, retrieval decay, or missing telemetry, not by model choice.
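As a minimal sketch of those layers in code: the schema, ACL rule, and freshness threshold below are hypothetical stand-ins for your own, with pydantic handling validation.

```python
# Layered checks that run before any model call. RetrievedChunk, the ACL rule,
# and MAX_INDEX_AGE are hypothetical stand-ins for your own schema and policies.
import logging
from datetime import datetime, timedelta, timezone
from pydantic import BaseModel

log = logging.getLogger("doc_to_chat.reliability")

class RetrievedChunk(BaseModel):
    doc_id: str
    content: str
    acl_group: str
    indexed_at: datetime  # assumed timezone-aware

MAX_INDEX_AGE = timedelta(days=30)  # guards against silent retrieval decay

def admit_chunk(raw: dict, user_groups: set[str]) -> RetrievedChunk | None:
    chunk = RetrievedChunk.model_validate(raw)              # layer 1: schema validation
    if chunk.acl_group not in user_groups:                  # layer 2: permission check
        log.warning("acl_rejected doc_id=%s", chunk.doc_id)
        return None
    if datetime.now(timezone.utc) - chunk.indexed_at > MAX_INDEX_AGE:
        log.warning("stale_index doc_id=%s", chunk.doc_id)  # layer 3: freshness check
        return None
    return chunk  # only now is the chunk eligible for the prompt
```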

Scaling indexing and retrieval

Two axes matter: ingest throughput and query concurrency.

For structured + unstructured fusion, favor hybrid retrieval (BM25 + ANN + reranker) and store structured features next to vectors so they are available for filtering and re-ranking at query time.
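A minimal sketch of that fusion, assuming rank_bm25 for the lexical side; brute-force NumPy cosine similarity stands in for a real ANN index, and reciprocal rank fusion stands in for a learned reranker.

```python
# Hybrid retrieval sketch: BM25 + vector similarity fused with reciprocal rank
# fusion (RRF). Swap the NumPy stand-in for your ANN index and the RRF step
# for a learned reranker in production.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query: str,
                  query_vec: np.ndarray,
                  docs: list[str],
                  doc_vecs: np.ndarray,
                  k: int = 5,
                  rrf_k: int = 60) -> list[int]:
    # Lexical ranking.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_rank = np.argsort(-bm25_scores)

    # Vector ranking (brute-force cosine similarity as the ANN stand-in).
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    vec_rank = np.argsort(-sims)

    # Reciprocal rank fusion: documents ranked high by either list win.
    fused = np.zeros(len(docs))
    for rank_list in (bm25_rank, vec_rank):
        for position, doc_idx in enumerate(rank_list):
            fused[doc_idx] += 1.0 / (rrf_k + position + 1)
    return list(np.argsort(-fused)[:k])
```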

Observability beyond logs

Stitch traces, metrics, and evaluations together: emit OTEL traces for every request, export latency and error metrics to your APM, and run scheduled RAG evaluations against the same traced data so quality regressions show up next to performance regressions.

Add schema profiling/mapping on ingestion so observability stays attached to data-shape changes and explains retrieval regressions when upstream sources shift.
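A minimal sketch of that stitching with the OpenTelemetry Python API; the span names, attributes (including the data-shape attribute), and the injected `retrieve`/`generate` callables are hypothetical, and exporter setup for LangSmith or your APM is assumed to happen at process startup.

```python
# Spans around the retrieval and generation steps, with a data-shape attribute
# so schema drift shows up next to latency. Names and callables are hypothetical.
from typing import Callable
from opentelemetry import trace

tracer = trace.get_tracer("doc_to_chat")

def traced_answer(question: str,
                  user_groups: list[str],
                  retrieve: Callable[[str, list[str]], list[dict]],
                  generate: Callable[[str, list[dict]], str]) -> str:
    with tracer.start_as_current_span("rag.request") as request_span:
        request_span.set_attribute("rag.user_group_count", len(user_groups))

        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            chunks = retrieve(question, user_groups)
            retrieve_span.set_attribute("rag.chunk_count", len(chunks))
            retrieve_span.set_attribute(
                "rag.source_schemas",
                sorted({c.get("schema_version", "unknown") for c in chunks}),
            )

        with tracer.start_as_current_span("rag.generate") as generate_span:
            answer = generate(question, chunks)
            generate_span.set_attribute("rag.answer_chars", len(answer))
            return answer
```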

Reference doc-to-chat flow

Ingest: connectors → text extraction → normalization → Iceberg write (ACID, snapshots).
Govern: PII scan (Presidio) → redact/mask → catalog registration with ACL policies.
Index: embedding jobs → pgvector (policy-aware joins) and Milvus (high-QPS ANN).
Serve: REST/gRPC → hybrid retrieval → guardrails → LLM → tool use.
HITL: low-confidence paths route to A2I/LangGraph approval steps.
Observe: OTEL traces to LangSmith/APM + scheduled RAG evaluations.
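The Govern step is the one teams most often under-specify, so here is a minimal sketch of it with Presidio's default analyzer and anonymizer; catalog registration is elided, and Presidio's default spaCy model is assumed to be installed.

```python
# Sketch of the Govern step: Presidio PII scan + redaction before the chunk is
# registered in the catalog. The catalog call itself is omitted here.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()      # loads the default recognizers (names, emails, phones, ...)
anonymizer = AnonymizerEngine()

def govern_chunk(text: str) -> str:
    findings = analyzer.analyze(text=text, language="en")
    redacted = anonymizer.anonymize(text=text, analyzer_results=findings)
    return redacted.text         # entity spans replaced with <ENTITY_TYPE> placeholders

clean = govern_chunk("Contact Jane Doe at jane.doe@example.com for the Q3 report.")
print(clean)  # e.g. "Contact <PERSON> at <EMAIL_ADDRESS> for the Q3 report."
```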

Why it’s ‘5% AI, 100% software engineering’

The hard work that makes agents reliable, auditable, and safe is systems engineering: data plumbing, ACID tables, ACL catalogs, PII guardrails, hybrid retrieval, telemetry, and human gates. These controls determine whether even a stable base model is safe, fast, and credible for users. Swap models later as needed, but invest in engineering first.