Agentic Decision-Tree RAG: Smart Routing, Self-Checks & Iterative Refinement
A hands-on guide to building an agentic decision-tree RAG system that routes queries, retrieves relevant context, generates answers, and refines them via self-checks and iterations.
What this system does
This tutorial walks through building an agentic Retrieval-Augmented Generation (RAG) pipeline that routes queries to appropriate knowledge sources, retrieves relevant context with FAISS, generates answers using Flan-T5, performs self-checks, and iteratively refines responses. The system combines lightweight local models and libraries (SentenceTransformers, FAISS, Transformers) into a decision-tree-like flow that mimics agentic reasoning.
Dependencies and setup
Install and verify dependencies at the start so the pipeline runs locally and reliably. The snippet below shows a setup routine that installs necessary Python packages and imports modules used throughout the project.
print(" Setting up dependencies...")
import subprocess
import sys
def install_packages():
packages = ['sentence-transformers', 'transformers', 'torch', 'faiss-cpu', 'numpy', 'accelerate']
for package in packages:
print(f"Installing {package}...")
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])
try:
import faiss
except ImportError:
install_packages()
print("✓ All dependencies installed! Importing modules...\n")
import torch
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import faiss
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')
print("✓ All modules loaded successfully!\n")This ensures Transformers, FAISS, SentenceTransformers, NumPy and PyTorch are available, and it silences noisy warnings to keep console output clear.
Vector store and embedding
The VectorStore component embeds documents using a SentenceTransformer and indexes them with FAISS for fast similarity search. It keeps document text and source metadata for contextual grounding.
class VectorStore:
    def __init__(self, embedding_model='all-MiniLM-L6-v2'):
        print(f"Loading embedding model: {embedding_model}...")
        self.embedder = SentenceTransformer(embedding_model)
        self.documents = []
        self.index = None

    def add_documents(self, docs: List[str], sources: List[str]):
        self.documents = [{"text": doc, "source": src} for doc, src in zip(docs, sources)]
        embeddings = self.embedder.encode(docs, show_progress_bar=False)
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(embeddings.astype('float32'))
        print(f"✓ Indexed {len(docs)} documents\n")

    def search(self, query: str, k: int = 3) -> List[Dict]:
        query_vec = self.embedder.encode([query]).astype('float32')
        distances, indices = self.index.search(query_vec, k)
        return [self.documents[i] for i in indices[0]]

Key points: embeddings turn text into vectors, FAISS performs fast nearest-neighbor search, and each document retains a 'source' field that later appears in the generated context.
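Before wiring the store into the full agent, it can be exercised on its own. The snippet below is a minimal usage sketch; the two documents and their sources are placeholder examples, not part of the tutorial's knowledge base.

# Minimal usage sketch with placeholder documents and sources.
store = VectorStore()
store.add_documents(
    docs=[
        "FAISS is a library for efficient similarity search over dense vectors.",
        "SentenceTransformers produces sentence embeddings for semantic search."
    ],
    sources=["FAISS Notes", "Embedding Notes"]
)
hits = store.search("Which library performs vector similarity search?", k=1)
print(hits[0]["source"], "->", hits[0]["text"])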
Query routing (intent classification)
Routing decides how many documents to fetch and tailors downstream behavior. A simple keyword-based router provides a quick way to detect technical, factual, comparative, or procedural intents.
class QueryRouter:
    def __init__(self):
        self.categories = {
            'technical': ['how', 'implement', 'code', 'function', 'algorithm', 'debug'],
            'factual': ['what', 'who', 'when', 'where', 'define', 'explain'],
            'comparative': ['compare', 'difference', 'versus', 'vs', 'better', 'which'],
            'procedural': ['steps', 'process', 'guide', 'tutorial', 'how to']
        }

    def route(self, query: str) -> str:
        query_lower = query.lower()
        scores = {}
        for category, keywords in self.categories.items():
            score = sum(1 for kw in keywords if kw in query_lower)
            scores[category] = score
        best_category = max(scores, key=scores.get)
        return best_category if scores[best_category] > 0 else 'factual'

This lightweight approach is easy to extend with more keywords, or it can be swapped for an ML classifier for higher accuracy.
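To see what swapping in an ML classifier could look like, here is a sketch of a zero-shot router built on the Transformers pipeline. The model choice (facebook/bart-large-mnli) and the label names are assumptions; it keeps the same route() interface, so AgenticRAG would not need any other changes.

# Sketch of an ML-based router using zero-shot classification.
# Model choice and labels are illustrative assumptions, not part of the tutorial.
from transformers import pipeline

class ZeroShotRouter:
    def __init__(self):
        self.classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
        self.labels = ['technical', 'factual', 'comparative', 'procedural']

    def route(self, query: str) -> str:
        result = self.classifier(query, candidate_labels=self.labels)
        return result['labels'][0]  # labels come back sorted by score, highest first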
Generation and self-checking
The AnswerGenerator uses a seq2seq model (Flan-T5) to produce an answer given the retrieved context. After generation, a self_check routine evaluates length, grounding in retrieved documents, and relevance to the query. If checks fail, the system can iterate.
class AnswerGenerator:
    def __init__(self, model_name='google/flan-t5-base'):
        print(f"Loading generation model: {model_name}...")
        self.generator = pipeline('text2text-generation', model=model_name, device=0 if torch.cuda.is_available() else -1, max_length=256)
        device_type = "GPU" if torch.cuda.is_available() else "CPU"
        print(f"✓ Generator ready (using {device_type})\n")

    def generate(self, query: str, context: List[Dict], query_type: str) -> str:
        # Assemble a prompt from the retrieved, source-tagged context
        context_text = "\n\n".join([f"[{doc['source']}]: {doc['text']}" for doc in context])
        prompt = f"""Context:
{context_text}

Question: {query}

Answer:"""
        answer = self.generator(prompt, max_length=200, do_sample=False)[0]['generated_text']
        return answer.strip()

    def self_check(self, query: str, answer: str, context: List[Dict]) -> Tuple[bool, str]:
        if len(answer) < 10:
            return False, "Answer too short - needs more detail"
        context_keywords = set()
        for doc in context:
            context_keywords.update(doc['text'].lower().split()[:20])
        answer_words = set(answer.lower().split())
        overlap = len(context_keywords.intersection(answer_words))
        if overlap < 2:
            return False, "Answer not grounded in context - needs more evidence"
        query_keywords = set(query.lower().split())
        if len(query_keywords.intersection(answer_words)) < 1:
            return False, "Answer doesn't address the query - rephrase needed"
        return True, "Answer quality acceptable"

Note: the self-check logic is intentionally simple and conservative; it can be enhanced with entailment checks, factuality scoring, or model-based verifiers.
Orchestrator: AgenticRAG
This class wires routing, retrieval, generation, and self-evaluation into an iterative loop. Based on the routing result, it selects how many documents to fetch, generates an answer, runs self-checks, and optionally refines the query or expands context across iterations.
class AgenticRAG:
    def __init__(self):
        self.vector_store = VectorStore()
        self.router = QueryRouter()
        self.generator = AnswerGenerator()
        self.max_iterations = 2

    def add_knowledge(self, documents: List[str], sources: List[str]):
        self.vector_store.add_documents(documents, sources)

    def query(self, question: str, verbose: bool = True) -> Dict:
        if verbose:
            print(f"\n{'='*60}")
            print(f" Query: {question}")
            print(f"{'='*60}")
        query_type = self.router.route(question)
        if verbose:
            print(f" Route: {query_type.upper()} query detected")
        k_docs = {'technical': 2, 'comparative': 4, 'procedural': 3}.get(query_type, 3)
        iteration = 0
        answer_accepted = False
        while iteration < self.max_iterations and not answer_accepted:
            iteration += 1
            if verbose:
                print(f"\n Iteration {iteration}")
            context = self.vector_store.search(question, k=k_docs)
            if verbose:
                print(f" Retrieved {len(context)} documents from sources:")
                for doc in context:
                    print(f" - {doc['source']}")
            answer = self.generator.generate(question, context, query_type)
            if verbose:
                print(f" Generated answer: {answer[:100]}...")
            answer_accepted, feedback = self.generator.self_check(question, answer, context)
            if verbose:
                status = "✓ ACCEPTED" if answer_accepted else "✗ REJECTED"
                print(f" Self-check: {status}")
                print(f" Feedback: {feedback}")
            if not answer_accepted and iteration < self.max_iterations:
                question = f"{question} (provide more specific details)"
                k_docs += 1
        return {'answer': answer, 'query_type': query_type, 'iterations': iteration, 'accepted': answer_accepted, 'sources': [doc['source'] for doc in context]}

Demo and usage
A simple main() demonstrates loading a tiny knowledge base, registering it with the agent, and running a few example queries. The console prints routing decisions, retrieved sources, generation previews, self-check results, and the final accepted answer.
def main():
    print("\n" + "="*60)
    print(" AGENTIC RAG WITH ROUTING & SELF-CHECK")
    print("="*60 + "\n")
    documents = [
        "RAG (Retrieval-Augmented Generation) combines information retrieval with text generation. It retrieves relevant documents and uses them as context for generating accurate answers.",
        # ... five more documents go here, paired one-to-one with the sources listed below ...
    ]
    sources = ["Python Documentation", "ML Textbook", "Neural Networks Guide", "Deep Learning Paper", "Transformer Architecture", "RAG Research Paper"]
    rag = AgenticRAG()
    rag.add_knowledge(documents, sources)
    test_queries = ["What is Python?", "How does machine learning work?", "Compare neural networks and deep learning"]
    for query in test_queries:
        result = rag.query(query, verbose=True)
        print(f"\n{'='*60}")
        print(f" FINAL RESULT:")
        print(f" Answer: {result['answer']}")
        print(f" Query Type: {result['query_type']}")
        print(f" Iterations: {result['iterations']}")
        print(f" Accepted: {result['accepted']}")
        print(f"{'='*60}\n")

if __name__ == "__main__":
    main()

Run this locally to observe routing, retrieval, generation, and iterative refinement end-to-end. Replace or expand the knowledge base, swap in larger models, or upgrade the self-check routine to increase robustness and factuality.
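As one possible next step, a subclass can load stronger components while reusing the same orchestration. This is only a sketch: the checkpoint names below (all-mpnet-base-v2, google/flan-t5-large) are standard Hugging Face models, and whether they fit your latency and memory budget is an assumption you should verify.

# Sketch: a variant of AgenticRAG that loads larger checkpoints.
# Checkpoint names are standard Hugging Face models; hardware fit is not guaranteed.
class LargerAgenticRAG(AgenticRAG):
    def __init__(self):
        self.vector_store = VectorStore(embedding_model='all-mpnet-base-v2')
        self.router = QueryRouter()
        self.generator = AnswerGenerator(model_name='google/flan-t5-large')
        self.max_iterations = 3  # allow one extra refinement pass

Documents are registered with add_knowledge() exactly as before; nothing else in the loop changes.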