Why Building AI Agents Is 5% Models and 100% Engineering
What a doc-to-chat pipeline actually does
A doc-to-chat pipeline ingests enterprise documents, normalizes them into a common representation, enforces governance, indexes embeddings alongside relational features, and serves retrieval + generation through authenticated APIs with human-in-the-loop (HITL) checkpoints. This is the practical reference architecture for agentic Q&A, copilots, and workflow automation where answers must respect permissions and be audit-ready.
In production this pattern looks like a hardened RAG (retrieval-augmented generation) system with guardrails, governance, and OpenTelemetry-backed tracing to make behavior reproducible and auditable.
Integrating with an existing stack
Keep service boundaries standard (REST/JSON, gRPC) and use storage layers your organization already trusts. For tabular data, Apache Iceberg provides ACID semantics, schema and partition evolution, and snapshots, all critical for reproducible retrieval and backfills. For vectors, pick a strategy that coexists with SQL filters:
- pgvector collocates embeddings with business keys and ACL tags inside PostgreSQL, enabling precise joins and policy enforcement in one query plan.
- Dedicated vector engines like Milvus deliver high-QPS ANN with disaggregated storage/compute and horizontal scaling.
Many production teams run both: use SQL+pgvector for transactional joins and policies, and Milvus for heavy retrieval workloads.
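The pgvector side of this split can be sketched as a single policy-aware query. The table and column names below (chunks, acl_tags, tenant_id) are illustrative, not a prescribed schema; the query string would be executed through any PostgreSQL driver with bound parameters.

```python
# Sketch of a single-plan pgvector query: tenant and ACL filters are
# applied alongside vector similarity in one statement, so policy
# enforcement and ranking happen in the same query plan.
def policy_aware_ann_query(k: int = 10) -> str:
    # <-> is pgvector's distance operator; @> tests array containment,
    # so rows are filtered by tenant and ACL tags before ranking.
    return f"""
        SELECT doc_id, chunk_text
        FROM chunks
        WHERE tenant_id = %(tenant_id)s
          AND acl_tags @> %(user_tags)s
        ORDER BY embedding <-> %(query_embedding)s
        LIMIT {int(k)}
    """
```

Keeping the filter and the similarity ranking in one statement is what avoids the N+1 round trips discussed later under scaling.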
Key components and their properties
- Iceberg tables: ACID guarantees, hidden partitioning, snapshot isolation, and vendor support across warehouses.
- pgvector: lets you run SQL + vector similarity in one plan for precise joins and enforcement of access policies.
- Milvus: layered, horizontally scalable architecture built for large-scale similarity search.
Coordination between agents, humans, and workflows
Production agents need defined coordination points where humans can approve, correct, or escalate outputs. Managed services like AWS A2I (Amazon Augmented AI) provide HITL loops (private workforces, flow definitions) and act as a concrete gating mechanism for low-confidence outputs.
Frameworks such as LangGraph treat approvals as first-class steps in agent graphs so human checkpoints become explicit DAG nodes, not ad hoc callbacks. Use these gates for actions like publishing summaries, filing tickets, or committing code. Persist every artifact—prompts, retrieval sets, and decisions—for auditing and reproducibility.
Pattern: LLM → confidence/guardrail checks → HITL gate → side-effects.
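A minimal sketch of that routing pattern, assuming a hypothetical confidence score and guardrail verdict are already computed upstream; the threshold is illustrative:

```python
# Toy confidence-gated HITL router matching the pattern above.
CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff, tune per workload

def route(answer: str, confidence: float, passes_guardrails: bool) -> str:
    if not passes_guardrails:
        return "blocked"        # policy violation: never emit
    if confidence < CONFIDENCE_THRESHOLD:
        return "human_review"   # queue for approval (e.g. an A2I flow)
    return "auto_approve"       # safe to trigger side-effects
```

The key property is that side-effects (publishing, ticket filing, commits) only fire on the auto_approve path; everything else lands in an auditable queue.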
Enforcing reliability before models
Reliability is a layered defense:
- Language and content guardrails: pre-validate inputs/outputs for safety and policy (managed options like Bedrock Guardrails or OSS options like NeMo Guardrails, Guardrails AI, Llama Guard).
- PII detection/redaction: analyze both source docs and model I/O; tools like Microsoft Presidio can recognize and mask PII but should be combined with additional controls.
- Access control and lineage: enforce row-/column-level ACLs and audit across catalogs (Unity Catalog) so retrieval respects permissions; unify lineage and access policies across workspaces.
- Retrieval quality gates: evaluate RAG with reference-free metrics (faithfulness, context precision/recall) and block or down-rank poor contexts.
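To make the PII layer concrete, here is a deliberately minimal regex-based redactor. It is a toy stand-in for a real engine such as Presidio, which uses recognizers and NLP models rather than two patterns; as the list above notes, any such tool should be combined with additional controls.

```python
import re

# Minimal regex-based redaction; patterns are illustrative and far
# from exhaustive compared to a real PII engine like Presidio.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    # Replace each match with a typed placeholder so downstream
    # consumers can see that something was removed, and what kind.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```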
Most outages and trust failures stem from data plumbing, permissioning, retrieval decay, or missing telemetry, not from model choice.
Scaling indexing and retrieval
Two axes matter: ingest throughput and query concurrency.
- Ingest: normalize at the lakehouse edge and write to Iceberg for versioned snapshots. Embed asynchronously to enable deterministic rebuilds and point-in-time re-indexing.
- Vector serving: Milvus’s shared-storage, disaggregated compute model supports horizontal scaling; use HNSW/IVF/Flat hybrids and replica sets to balance recall and latency.
- SQL + vector: keep business joins server-side (pgvector). Example pattern: WHERE tenant_id = ? AND acl_tag @> … ORDER BY embedding <-> :q LIMIT k, which avoids N+1 trips and respects policies in a single plan.
- Chunking/embedding strategy: tune chunk size, overlap, and semantic boundaries; poor chunking silently kills recall.
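As a baseline for the chunking point above, a character-based sliding window with overlap looks like this; the sizes are illustrative, and production chunkers usually also respect semantic boundaries (headings, sentences):

```python
# Character-based sliding-window chunker with overlap. Overlap keeps
# context that straddles a boundary retrievable from both chunks.
def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```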
For structured + unstructured fusion, favor hybrid retrieval (BM25 + ANN + reranker) and store structured features next to vectors for filtering and re-ranking at query time.
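One common way to fuse a lexical (BM25) ranking with an ANN ranking before the reranker is reciprocal rank fusion (RRF); a compact sketch, with k=60 as the commonly used smoothing constant:

```python
# Reciprocal rank fusion: each ranking contributes 1/(k + rank) per
# document, so documents ranked well by multiple retrievers win.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score calibration between retrievers, which is why it is a popular default before a learned reranker takes over.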
Observability beyond logs
Stitch traces, metrics, and evaluations together:
- Distributed tracing: emit OpenTelemetry spans across ingestion, retrieval, model calls, and tools. Tools like LangSmith ingest OTEL traces and interoperate with external APMs (Jaeger, Datadog, Elastic) to provide end-to-end timing, prompts, contexts, and per-request cost.
- LLM observability platforms: compare LangSmith, Arize Phoenix, Langfuse, and Datadog on tracing, evals, cost tracking, and enterprise readiness.
- Continuous evaluation: schedule RAG evals on canary sets and live traffic replays; track faithfulness and grounding drift over time.
Add schema profiling/mapping on ingestion so observability stays attached to data-shape changes and explains retrieval regressions when upstream sources shift.
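The span tree such a request should emit can be illustrated with a pure-stdlib stand-in; a real system would use the OpenTelemetry SDK, which handles context propagation, exporters, and sampling for you.

```python
import time
from contextlib import contextmanager

# Stand-in span recorder to show the nesting a doc-to-chat request
# should produce (request wrapping retrieval and the model call).
SPANS: list[tuple[str, float]] = []

@contextmanager
def span(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Inner spans close (and record) before their parents.
        SPANS.append((name, time.perf_counter() - start))

def handle_request():
    with span("request"):
        with span("retrieval"):
            pass  # hybrid retrieval would run here
        with span("llm_call"):
            pass  # model invocation with prompt + contexts
```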
Reference doc-to-chat flow
- Ingest: connectors → text extraction → normalization → Iceberg write (ACID, snapshots).
- Govern: PII scan (Presidio) → redact/mask → catalog registration with ACL policies.
- Index: embedding jobs → pgvector (policy-aware joins) and Milvus (high-QPS ANN).
- Serve: REST/gRPC → hybrid retrieval → guardrails → LLM → tool use.
- HITL: low-confidence paths route to A2I/LangGraph approval steps.
- Observe: OTEL traces to LangSmith/APM + scheduled RAG evaluations.
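The control flow of that reference pipeline can be stubbed end to end; every step below is a hypothetical placeholder (the real stages are the systems named above), shown only to make the HITL branch explicit.

```python
# Stub wiring of the doc-to-chat stages; all logic is placeholder.
def run_pipeline(doc: str, question: str) -> str:
    normalized = doc.strip().lower()                 # Ingest: normalize
    governed = normalized.replace("ssn", "<PII>")    # Govern: toy redaction
    index = {question: [governed]}                   # Index: toy single-doc index
    contexts = index.get(question, [])               # Serve: retrieval
    answer, confidence = f"grounded in {len(contexts)} contexts", 0.6
    if confidence < 0.8:                             # HITL: low confidence
        return "queued_for_review"                   # routes to approval
    return answer                                    # Observe: spans omitted
```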
Why it’s ‘5% AI, 100% software engineering’
The hard work that makes agents reliable, auditable, and safe is systems engineering: data plumbing, ACID tables, ACL catalogs, PII guardrails, hybrid retrieval, telemetry, and human gates. These controls determine whether even a stable base model is safe, fast, and credible for users. Swap models later as needed, but invest in engineering first.