Unlock Natural Language Data Analysis with Gemini-Powered Pandas Agent

Leveraging Google Gemini with Pandas for Data Analysis

This tutorial demonstrates how to integrate Google's Gemini models with Pandas by using LangChain's experimental DataFrame agent. We focus on the Titanic dataset to perform both basic and advanced data analysis using natural language queries.

Setting Up the Environment

First, install the necessary libraries: langchain_experimental, langchain_google_genai, and pandas.

!pip install langchain_experimental langchain_google_genai pandas

Import the essential modules and configure the Google API key.

import os
import pandas as pd
import numpy as np
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_google_genai import ChatGoogleGenerativeAI
 
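# Replace the placeholder with your own Gemini API key, or remove this line
# and export GOOGLE_API_KEY in your shell so the key stays out of the notebook.
os.environ["GOOGLE_API_KEY"] = "Use Your Own API Key"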
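
Because the line above always assigns some string to GOOGLE_API_KEY, a plain "is it set?" check cannot tell a real key from the placeholder. The following is a minimal sketch of a stricter check; the helper name api_key_is_configured and the PLACEHOLDER constant are hypothetical, and the check only inspects the string, it does not contact Google to validate the key.

```python
import os

# Hypothetical constant: the placeholder text used in this tutorial.
PLACEHOLDER = "Use Your Own API Key"


def api_key_is_configured(env=os.environ):
    """Return True only if GOOGLE_API_KEY is non-empty and not the placeholder."""
    key = env.get("GOOGLE_API_KEY", "").strip()
    return bool(key) and key != PLACEHOLDER


# Plain dicts stand in for os.environ so the sketch runs anywhere.
print(api_key_is_configured({"GOOGLE_API_KEY": PLACEHOLDER}))
print(api_key_is_configured({"GOOGLE_API_KEY": "abc123"}))
```

Passing the environment as an argument keeps the function easy to test without touching real environment variables.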

Initializing the Gemini-Powered Pandas Agent

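Define a helper function that creates a LangChain Pandas DataFrame agent powered by Gemini. Setting allow_dangerous_code=True is an explicit opt-in: the agent executes model-generated Python on your machine, so run it only in an environment you trust, ideally a sandbox.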

def setup_gemini_agent(df, temperature=0, model="gemini-1.5-flash"):
    llm = ChatGoogleGenerativeAI(
        model=model,
        temperature=temperature,
        convert_system_message_to_human=True
    )
   
    agent = create_pandas_dataframe_agent(
        llm=llm,
        df=df,
        verbose=True,
        agent_type=AgentType.OPENAI_FUNCTIONS,
        allow_dangerous_code=True
    )
    return agent

Loading and Exploring the Titanic Dataset

Load the dataset directly from the Pandas GitHub repository and display its dimensions and columns.

def load_and_explore_data():
    print("Loading Titanic Dataset...")
    df = pd.read_csv(
        "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
    )
    print(f"Dataset shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    return df
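
Beyond shape and column names, it can help to know which columns have missing values before asking the agent about them. The sketch below is one way to collect that overview; the summarize helper is hypothetical, and the three-row frame is made-up sample data that stands in for the Titanic file so the example runs offline.

```python
import numpy as np
import pandas as pd


def summarize(df):
    """Return shape, column names, and per-column missing-value counts."""
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        "missing": df.isna().sum().to_dict(),
    }


# Made-up rows, not real Titanic records.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
})
print(summarize(sample))
```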

Performing Basic Data Analysis

Run simple exploratory queries such as counting rows, calculating survival rates, and examining passenger class distributions.

def basic_analysis_demo(agent):
    print("\nBASIC ANALYSIS DEMO")
    print("=" * 50)
   
    queries = [
        "How many rows and columns are in the dataset?",
        "What's the survival rate (percentage of people who survived)?",
        "How many people have more than 3 siblings?",
        "What's the square root of the average age?",
        "Show me the distribution of passenger classes"
    ]
   
    for query in queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
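
Since the agent's answers come from model-generated code, it is worth knowing what the same questions look like in plain pandas so the results can be spot-checked. The sketch below assumes the column names found in the Titanic CSV (Survived, Pclass, Age, SibSp) and runs on a small made-up frame; the basic_checks helper is hypothetical. Note that SibSp counts siblings and spouses together, so it only approximates "siblings".

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the Titanic column layout.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass":   [3, 1, 3, 1, 2],
    "Age":      [22.0, 38.0, 26.0, 35.0, np.nan],
    "SibSp":    [1, 1, 0, 4, 0],
})


def basic_checks(df):
    """Compute the quantities the basic demo queries ask about."""
    return {
        "shape": df.shape,
        "survival_rate_pct": df["Survived"].mean() * 100,
        "more_than_3_siblings": int((df["SibSp"] > 3).sum()),
        # Series.mean() skips NaN ages by default.
        "sqrt_mean_age": float(np.sqrt(df["Age"].mean())),
        "class_counts": df["Pclass"].value_counts().sort_index().to_dict(),
    }


print(basic_checks(sample))
```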

Advanced Data Analysis

Explore correlations, survival rates by demographic segments, median statistics, and detailed filtering.

def advanced_analysis_demo(agent):
    print("\nADVANCED ANALYSIS DEMO")
    print("=" * 50)
   
    advanced_queries = [
        "What's the correlation between age and fare?",
        "Create a survival analysis by gender and class",
        "What's the median age for each passenger class?",
        "Find passengers with the highest fares and their details",
        "Calculate the survival rate for different age groups (0-17, 18-64, 65+)"
    ]
   
    for query in advanced_queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
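
For comparison, here is a rough pandas sketch of what these advanced queries compute. It assumes the Titanic column names (Survived, Pclass, Sex, Age, Fare), uses made-up rows, and treats the age bands as half-open intervals so that no passenger falls into two groups; the agent may bin ages differently.

```python
import pandas as pd

# Made-up rows that mimic the Titanic column layout.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [10.0, 38.0, 18.0, 70.0, 30.0, 65.0],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
})

# Pearson correlation between age and fare.
age_fare_corr = sample["Age"].corr(sample["Fare"])

# Survival rate for each (sex, class) combination.
survival_by_sex_class = sample.groupby(["Sex", "Pclass"])["Survived"].mean()

# Median age per passenger class.
median_age_by_class = sample.groupby("Pclass")["Age"].median()

# right=False yields the bins [0, 18), [18, 65), [65, inf).
age_group = pd.cut(
    sample["Age"],
    bins=[0, 18, 65, float("inf")],
    right=False,
    labels=["0-17", "18-64", "65+"],
)
survival_by_age_group = sample.groupby(age_group, observed=True)["Survived"].mean()

print(age_fare_corr)
print(survival_by_sex_class)
print(median_age_by_class)
print(survival_by_age_group)
```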

Comparing Multiple DataFrames

Demonstrate how to compare the original Titanic dataset with a version where missing ages are filled, using natural language prompts.

def multi_dataframe_demo():
    print("\nMULTI-DATAFRAME DEMO")
    print("=" * 50)
   
    df = pd.read_csv(
        "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
    )
   
    df_filled = df.copy()
    df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].mean())
   
    agent = setup_gemini_agent([df, df_filled])
   
    queries = [
        "How many rows in the age column are different between the two datasets?",
        "Compare the average age in both datasets",
        "What percentage of age values were missing in the original dataset?",
        "Show summary statistics for age in both datasets"
    ]
   
    for query in queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
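
The answers to these comparison queries can be predicted from how mean imputation works: the number of changed rows equals the number of missing ages, the average age stays the same, and the spread shrinks. The sketch below illustrates this on a made-up five-row Age column rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing values.
original = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 42.0]})

filled = original.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

# Share of ages missing in the original, as a percentage.
missing_pct = original["Age"].isna().mean() * 100

# Rows that were missing in the original and now hold a value.
changed_rows = int((original["Age"].isna() & filled["Age"].notna()).sum())

print(f"Missing: {missing_pct:.1f}%  Changed rows: {changed_rows}")
print(f"Mean before/after: {original['Age'].mean():.2f} / {filled['Age'].mean():.2f}")
print(f"Std before/after:  {original['Age'].std():.2f} / {filled['Age'].std():.2f}")
```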

Custom and Complex Data Analyses

Build risk scoring models, analyze deck-based survival rates, and find patterns in surnames and ticket prices.

def custom_analysis_demo(agent):
    print("\nCUSTOM ANALYSIS DEMO")
    print("=" * 50)
   
    custom_queries = [
        "Create a risk score for each passenger based on: Age (higher age = higher risk), Gender (male = higher risk), Class (3rd class = higher risk), Family size (alone or large family = higher risk). Then show the top 10 highest risk passengers who survived",
       
        "Analyze the 'deck' information from the cabin data: Extract deck letter from cabin numbers, Show survival rates by deck, Which deck had the highest survival rate?",
       
        "Find interesting patterns: Did people with similar names (same surname) tend to survive together? What's the relationship between ticket price and survival? Were there any age groups that had 100% survival rate?"
    ]
   
    for i, query in enumerate(custom_queries, 1):
        print(f"\nCustom Analysis {i}:")
        print(f"Query: {query[:100]}...")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
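
The custom prompts depend on a few derived features: a deck letter taken from the cabin number, a surname taken from the name, and a family size. A minimal sketch of how these could be derived in pandas follows. It assumes the Titanic column names (Name, Cabin, SibSp, Parch, Survived) and the "Surname, Title. Given names" name format, the passengers shown are made up, and the agent may derive these features differently.

```python
import numpy as np
import pandas as pd

# Made-up passengers that mimic the Titanic column layout.
sample = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Smith, Mrs. Jane", "Jones, Miss. Amy", "Brown, Mr. Tom"],
    "Cabin": ["C85", "C85", np.nan, "E46"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Survived": [0, 1, 1, 0],
})

# Deck is the first character of the cabin string; missing cabins stay NaN.
sample["Deck"] = sample["Cabin"].str[0]

# Surname is the text before the first comma.
sample["Surname"] = sample["Name"].str.split(",").str[0].str.strip()

# Family size counts the passenger plus siblings/spouses and parents/children.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

survival_by_deck = sample.dropna(subset=["Deck"]).groupby("Deck")["Survived"].mean()
survival_by_surname = sample.groupby("Surname")["Survived"].agg(["size", "mean"])

print(survival_by_deck)
print(survival_by_surname)
```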

Running the Tutorial

The main() function checks that GOOGLE_API_KEY is set (it does not verify that the key is valid), loads the data, initializes the agent, and runs all demos sequentially.

def main():
    print("Advanced Pandas Agent with Gemini Tutorial")
    print("=" * 60)
   
    if not os.getenv("GOOGLE_API_KEY"):
        print("Warning: GOOGLE_API_KEY not set!")
        print("Please set your Gemini API key as an environment variable.")
        return
   
    try:
        df = load_and_explore_data()
        print("\nSetting up Gemini Agent...")
        agent = setup_gemini_agent(df)
       
        basic_analysis_demo(agent)
        advanced_analysis_demo(agent)
        multi_dataframe_demo()
        custom_analysis_demo(agent)
       
        print("\nTutorial completed successfully!")
       
    except Exception as e:
        print(f"Error: {e}")
        print("Make sure you have installed all required packages and set your API key.")
 
 
if __name__ == "__main__":
    main()

Additional One-Off Queries

You can also directly query the agent for specific insights without rerunning the demos.

df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv")
agent = setup_gemini_agent(df)
 
agent.invoke("What factors most strongly predicted survival?")
agent.invoke("Create a detailed survival analysis by port of embarkation")
agent.invoke("Find any interesting anomalies or outliers in the data")
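
As in the demo functions, agent.invoke returns a dictionary whose "output" entry holds the final answer, so one-off calls in a script need to print that entry to show anything. Below is a small sketch of a wrapper for this; the ask helper and the EchoAgent stand-in are hypothetical, and EchoAgent only exists so the example runs without an API key.

```python
class EchoAgent:
    """Hypothetical stand-in for the real agent; it just echoes the query."""

    def invoke(self, query):
        return {"input": query, "output": f"(answer to: {query})"}


def ask(agent, query):
    """Send one query and return only the final answer text."""
    result = agent.invoke(query)
    return result["output"]


print(ask(EchoAgent(), "What factors most strongly predicted survival?"))
```

With the real agent, the same call would be ask(agent, "...") using the agent created by setup_gemini_agent.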

The Gemini-powered LangChain DataFrame agent turns data exploration into a conversation: a natural language query is translated into pandas code, executed, and summarized, so statistics and insights arrive without writing the code by hand. Because the model writes that code, spot-check any number that matters directly in pandas.