Mastering AI-Powered Web Scraping: BrightData and Google Gemini Integration
Discover how to create a sophisticated web scraper combining BrightData proxies with Google's Gemini AI for enhanced data extraction and analysis.
Building a Powerful Web Scraper with BrightData and Google Gemini
This tutorial guides you through creating an advanced web scraping tool that combines BrightData's extensive proxy network with Google's Gemini API for intelligent data extraction. The project is structured in Python, using essential libraries to build a clean and reusable BrightDataScraper class.
Setting Up the Environment
Install all necessary libraries in one step:
!pip install langchain-brightdata langchain-google-genai langgraph langchain-core google-generativeai
Import the required modules to handle system operations, data serialization, and the BrightData and Google Gemini integrations:
import os
import json
from typing import Dict, Any, Optional
from langchain_brightdata import BrightDataWebScraperAPI
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent
The BrightDataScraper Class
This class encapsulates all web scraping logic and optional AI-powered intelligence.
class BrightDataScraper:
    """Enhanced web scraper using BrightData API"""

    def __init__(self, api_key: str, google_api_key: Optional[str] = None):
        """Initialize scraper with API keys"""
        self.api_key = api_key
        self.scraper = BrightDataWebScraperAPI(bright_data_api_key=api_key)
        if google_api_key:
            self.llm = ChatGoogleGenerativeAI(
                model="gemini-2.0-flash",
                google_api_key=google_api_key
            )
            self.agent = create_react_agent(self.llm, [self.scraper])

    def scrape_amazon_product(self, url: str, zipcode: str = "10001") -> Dict[str, Any]:
        """Scrape Amazon product data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product",
                "zipcode": zipcode
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_amazon_bestsellers(self, region: str = "in") -> Dict[str, Any]:
        """Scrape Amazon bestsellers"""
        try:
            url = f"https://www.amazon.{region}/gp/bestsellers/"
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_linkedin_profile(self, url: str) -> Dict[str, Any]:
        """Scrape LinkedIn profile data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "linkedin_person_profile"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def run_agent_query(self, query: str) -> None:
        """Run AI agent with natural language query"""
        if not hasattr(self, 'agent'):
            print("Error: Google API key required for agent functionality")
            return
        try:
            for step in self.agent.stream(
                {"messages": query},
                stream_mode="values"
            ):
                step["messages"][-1].pretty_print()
        except Exception as e:
            print(f"Agent error: {e}")

    def print_results(self, results: Dict[str, Any], title: str = "Results") -> None:
        """Pretty print results"""
        print(f"\n{'='*50}")
        print(f"{title}")
        print(f"{'='*50}")
        if results["success"]:
            print(json.dumps(results["data"], indent=2, ensure_ascii=False))
        else:
            print(f"Error: {results['error']}")
        print()

This class provides methods to scrape Amazon product details, bestseller lists, and LinkedIn profiles, and to run natural-language queries using Google's Gemini model. Error handling and JSON output formatting are included for smooth operation.
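For example, once the class is defined, a single product lookup can stand on its own. The key placeholder and URL below are illustrative, not values from the tutorial:

    # Minimal usage sketch: scrape one product and pretty-print the result.
    scraper = BrightDataScraper(api_key="YOUR_BRIGHT_DATA_API_KEY")
    result = scraper.scrape_amazon_product("https://www.amazon.com/dp/B08L5TNJHG", zipcode="10001")
    scraper.print_results(result, "Single Product Lookup")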
Main Execution Flow
The main function demonstrates usage by scraping Amazon India bestsellers, a specific Amazon product, and a LinkedIn profile, and then running an AI agent query:
def main():
    """Main execution function"""
    BRIGHT_DATA_API_KEY = "Use Your Own API Key"
    GOOGLE_API_KEY = "Use Your Own API Key"

    scraper = BrightDataScraper(BRIGHT_DATA_API_KEY, GOOGLE_API_KEY)

    print("Scraping Amazon India Bestsellers...")
    bestsellers = scraper.scrape_amazon_bestsellers("in")
    scraper.print_results(bestsellers, "Amazon India Bestsellers")

    print("Scraping Amazon Product...")
    product_url = "https://www.amazon.com/dp/B08L5TNJHG"
    product_data = scraper.scrape_amazon_product(product_url, "10001")
    scraper.print_results(product_data, "Amazon Product Data")

    print("Scraping LinkedIn Profile...")
    linkedin_url = "https://www.linkedin.com/in/satyanadella/"
    linkedin_data = scraper.scrape_linkedin_profile(linkedin_url)
    scraper.print_results(linkedin_data, "LinkedIn Profile Data")

    print("Running AI Agent Query...")
    agent_query = """
    Scrape Amazon product data for https://www.amazon.com/dp/B0D2Q9397Y?th=1
    in New York (zipcode 10001) and summarize the key product details.
    """
    scraper.run_agent_query(agent_query)

Running the Script
The entry point installs the required packages quietly, sets the BrightData API key as an environment variable, and then runs main():
if __name__ == "__main__":
    print("Installing required packages...")
    os.system("pip install -q langchain-brightdata langchain-google-genai langgraph")
    os.environ["BRIGHT_DATA_API_KEY"] = "Use Your Own API Key"
    main()

This setup ensures all dependencies are installed and the API keys are correctly configured before the workflow runs.
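If you prefer not to hard-code credentials, one option is to read them from environment variables before constructing the scraper. This is a minimal sketch, not part of the original script, and the variable names are assumptions you can adapt:

    import os

    # Read API keys from the environment instead of hard-coding them
    # (variable names are assumptions; adjust to your own setup).
    BRIGHT_DATA_API_KEY = os.environ.get("BRIGHT_DATA_API_KEY", "")
    GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")

    if not BRIGHT_DATA_API_KEY:
        raise RuntimeError("Set BRIGHT_DATA_API_KEY before running the scraper")

    # Gemini features stay optional: pass None when no Google key is available.
    scraper = BrightDataScraper(BRIGHT_DATA_API_KEY, GOOGLE_API_KEY or None)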
Expanding Capabilities
This foundation can be extended by adding support for more dataset types, integrating additional large language models, or deploying the scraper as part of a larger pipeline or web service. With AI-powered scraping and modular design, data collection and analysis become more efficient and adaptable to various use cases.
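As a sketch of one such extension, a generic method could be added to BrightDataScraper that accepts any dataset type exposed through the BrightData Web Scraper API. The dataset type shown in the usage comment is only a placeholder; consult BrightData's documentation for the types actually available on your plan:

    def scrape_dataset(self, url: str, dataset_type: str, **params) -> Dict[str, Any]:
        """Generic helper: scrape any dataset type supported for your account."""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": dataset_type,
                **params
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    # Hypothetical usage -- "instagram_profiles" is a placeholder dataset type:
    # data = scraper.scrape_dataset("https://www.instagram.com/example/", "instagram_profiles")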