От фрагментов к отчётам: создание многораундового агента исследования с Gemini и DuckDuckGo

августа 28, 2025 · 5 min

Кратко о системе

В этом руководстве описывается модульный агент для глубоких исследований, который можно запускать в Google Colab. Система использует Gemini как движок рассуждений, DuckDuckGo Instant Answer API для быстрых веб-поисков и реализует многораундовый сбор данных с дедупликацией, контролем задержек и структурированными подсказками. Подход экономит вызовы API и извлекает сжатые, структурированные инсайты, затем автоматически формирует отчёт.

Архитектура и компоненты

Пайплайн состоит из следующих частей:

Конфигурация: dataclass для ключей API, лимитов и задержек.
Слой поиска: DuckDuckGo Instant Answer для быстрого получения сниппетов.
Извлечение и анализ: Gemini обрабатывает извлечение ключевых пунктов и синтез тем.
Оркестрация: многораундовый поиск, расширение запросов и дедупликация.
Отчётность: автоматическая генерация структурированного отчёта (резюме, выводы, анализ, ограничения исследования).

Импорты и зависимости

Начинаем с импорта стандартных библиотек и SDK Google Generative AI, которые используются для вызовов к Gemini. Блок кода взят напрямую из ноутбука:

import os
import json
import time
import requests
from typing import List, Dict, Any
from dataclasses import dataclass
import google.generativeai as genai
from urllib.parse import quote_plus
import re

Ключевые классы и методы

ResearchConfig управляет параметрами: ключ API Gemini, лимиты по источникам, длине контента и задержке между поисками. Класс DeepResearchSystem настраивает Gemini, реализует поиск через DuckDuckGo, извлечение ключевых пунктов, анализ объединённых сниппетов и формирование итогового отчёта.

Ниже приведён полный класс и сопутствующий код так же, как в ноутбуке:

@dataclass
class ResearchConfig:
   gemini_api_key: str
   max_sources: int = 10
   max_content_length: int = 5000
   search_delay: float = 1.0


class DeepResearchSystem:
   def __init__(self, config: ResearchConfig):
       self.config = config
       genai.configure(api_key=config.gemini_api_key)
       self.model = genai.GenerativeModel('gemini-1.5-flash')


   def search_web(self, query: str, num_results: int = 5) -> List[Dict[str, str]]:
       """Search web using DuckDuckGo Instant Answer API"""
       try:
           encoded_query = quote_plus(query)
           url = f"https://api.duckduckgo.com/?q={encoded_query}&format=json&no_redirect=1"


           response = requests.get(url, timeout=10)
           data = response.json()


           results = []


           if 'RelatedTopics' in data:
               for topic in data['RelatedTopics'][:num_results]:
                   if isinstance(topic, dict) and 'Text' in topic:
                       results.append({
                           'title': topic.get('Text', '')[:100] + '...',
                           'url': topic.get('FirstURL', ''),
                           'snippet': topic.get('Text', '')
                       })


           if not results:
               results = [{
                   'title': f"Research on: {query}",
                   'url': f"https://search.example.com/q={encoded_query}",
                   'snippet': f"General information and research about {query}"
               }]


           return results


       except Exception as e:
           print(f"Search error: {e}")
           return [{'title': f"Research: {query}", 'url': '', 'snippet': f"Topic: {query}"}]


   def extract_key_points(self, content: str) -> List[str]:
       """Extract key points using Gemini"""
       prompt = f"""
       Extract 5-7 key points from this content. Be concise and factual:


       {content[:2000]}


       Return as numbered list:
       """


       try:
           response = self.model.generate_content(prompt)
           return [line.strip() for line in response.text.split('\n') if line.strip()]
       except:
           return ["Key information extracted from source"]


   def analyze_sources(self, sources: List[Dict[str, str]], query: str) -> Dict[str, Any]:
       """Analyze sources for relevance and extract insights"""
       analysis = {
           'total_sources': len(sources),
           'key_themes': [],
           'insights': [],
           'confidence_score': 0.7
       }


       all_content = " ".join([s.get('snippet', '') for s in sources])


       if len(all_content) > 100:
           prompt = f"""
           Analyze this research content for the query: "{query}"


           Content: {all_content[:1500]}


           Provide:
           1. 3-4 key themes (one line each)
           2. 3-4 main insights (one line each)
           3. Overall confidence (0.1-1.0)


           Format as JSON with keys: themes, insights, confidence
           """


           try:
               response = self.model.generate_content(prompt)
               text = response.text
               if 'themes' in text.lower():
                   analysis['key_themes'] = ["Theme extracted from analysis"]
                   analysis['insights'] = ["Insight derived from sources"]
           except:
               pass


       return analysis


   def generate_comprehensive_report(self, query: str, sources: List[Dict[str, str]],
                                   analysis: Dict[str, Any]) -> str:
       """Generate final research report"""


       sources_text = "\n".join([f"- {s['title']}: {s['snippet'][:200]}"
                                for s in sources[:5]])


       prompt = f"""
       Create a comprehensive research report on: "{query}"


       Based on these sources:
       {sources_text}


       Analysis summary:
       - Total sources: {analysis['total_sources']}
       - Confidence: {analysis['confidence_score']}


       Structure the report with:
       1. Executive Summary (2-3 sentences)
       2. Key Findings (3-5 bullet points)
       3. Detailed Analysis (2-3 paragraphs)
       4. Conclusions & Implications (1-2 paragraphs)
       5. Research Limitations


       Be factual, well-structured, and insightful.
       """


       try:
           response = self.model.generate_content(prompt)
           return response.text
       except Exception as e:
           return f"""
# Research Report: {query}


## Executive Summary
Research conducted on "{query}" using {analysis['total_sources']} sources.


## Key Findings
- Multiple perspectives analyzed
- Comprehensive information gathered
- Research completed successfully


## Analysis
The research process involved systematic collection and analysis of information related to {query}. Various sources were consulted to provide a balanced perspective.


## Conclusions
The research provides a foundation for understanding {query} based on available information.


## Research Limitations
Limited by API constraints and source availability.
           """


   def conduct_research(self, query: str, depth: str = "standard") -> Dict[str, Any]:
       """Main research orchestration method"""
       print(f" Starting research on: {query}")


       search_rounds = {"basic": 1, "standard": 2, "deep": 3}.get(depth, 2)
       sources_per_round = {"basic": 3, "standard": 5, "deep": 7}.get(depth, 5)


       all_sources = []


       search_queries = [query]


       if depth in ["standard", "deep"]:
           try:
               related_prompt = f"Generate 2 related search queries for: {query}. One line each."
               response = self.model.generate_content(related_prompt)
               additional_queries = [q.strip() for q in response.text.split('\n') if q.strip()][:2]
               search_queries.extend(additional_queries)
           except:
               pass


       for i, search_query in enumerate(search_queries[:search_rounds]):
           print(f" Search round {i+1}: {search_query}")
           sources = self.search_web(search_query, sources_per_round)
           all_sources.extend(sources)
           time.sleep(self.config.search_delay)


       unique_sources = []
       seen_urls = set()
       for source in all_sources:
           if source['url'] not in seen_urls:
               unique_sources.append(source)
               seen_urls.add(source['url'])


       print(f" Analyzing {len(unique_sources)} unique sources...")


       analysis = self.analyze_sources(unique_sources[:self.config.max_sources], query)


       print(" Generating comprehensive report...")


       report = self.generate_comprehensive_report(query, unique_sources, analysis)


       return {
           'query': query,
           'sources_found': len(unique_sources),
           'analysis': analysis,
           'report': report,
           'sources': unique_sources[:10]
       }

Быстрая настройка для Colab

Функция setup_research_system упрощает инициализацию в Colab, возвращая готовый экземпляр DeepResearchSystem с настроенными лимитами и задержкой:

def setup_research_system(api_key: str) -> DeepResearchSystem:
   """Quick setup for Google Colab"""
   config = ResearchConfig(
       gemini_api_key=api_key,
       max_sources=15,
       max_content_length=6000,
       search_delay=0.5
   )
   return DeepResearchSystem(config)

Пример запуска

В блоке main показано, как инициализировать систему, запустить запрос и вывести результаты: краткую статистику, сгенерированный отчёт и список источников с превью.

if __name__ == "__main__":
   API_KEY = "Use Your Own API Key Here"


   researcher = setup_research_system(API_KEY)


   query = "Deep Research Agent Architecture"
   results = researcher.conduct_research(query, depth="standard")


   print("="*50)
   print("RESEARCH RESULTS")
   print("="*50)
   print(f"Query: {results['query']}")
   print(f"Sources found: {results['sources_found']}")
   print(f"Confidence: {results['analysis']['confidence_score']}")
   print("\n" + "="*50)
   print("COMPREHENSIVE REPORT")
   print("="*50)
   print(results['report'])


   print("\n" + "="*50)
   print("SOURCES CONSULTED")
   print("="*50)
   for i, source in enumerate(results['sources'][:5], 1):
       print(f"{i}. {source['title']}")
       print(f"   URL: {source['url']}")
       print(f"   Preview: {source['snippet'][:150]}...")
       print()

Практические советы

DuckDuckGo Instant Answer удобен для быстрых сниппетов, но имеет ограниченный охват; для серьёзных исследований рассмотрите платные поисковые API.
Тонкая настройка подсказок для Gemini повышает качество извлечения и синтеза. Тестируйте разные формулировки и длину контента.
Учитывайте квоты API — подбирайте параметры search_rounds и max_sources, чтобы контролировать расходы.

Этот шаблон демонстрирует компактный и практичный подход к превращению неструктурированных сниппетов в структурированный отчёт, комбинируя поиск, генеративную модель и простую логику оркестрации в Colab.