LLMs + Pandas: Generate DataFrame Summaries 📊

For this guide, we're using an MBA admissions dataset from Kaggle. Download it if you'd like to follow along. The dataset is released under the Apache 2.0 license, which means you can use it for free in both personal and commercial projects.

Step 1: Introducing and Exploring the Dataset 🔍

To get started, you'll need to install a few Python libraries on your system.

pip install langchain pandas langchain-ollama

You'll need these versions:

langchain 0.3.25
pandas 2.2.3
langchain-ollama 0.3.3
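
If you want to match these exact versions, you can pin them at install time (same packages as in the command above):

pip install langchain==0.3.25 pandas==2.2.3 langchain-ollama==0.3.3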

Once everything is installed, import the required libraries in a new script or notebook:

import pandas as pd
from langchain_ollama import ChatOllama
from typing import Literal

Dataset Loading and Preprocessing

Start by loading the dataset with Pandas. This snippet loads the CSV file, prints basic information about the dataset's shape, and shows how many missing values each column has:

df = pd.read_csv("data/MBA.csv")

# Basic dataset info
print(f"Dataset shape: {df.shape}\n")
print("Missing value stats:")
print(df.isnull().sum())
print("-" * 25)
df.sample(5)

Since data cleaning isn't the main focus of this article, we'll keep preprocessing minimal. The dataset has only a few missing values that need attention:

df["race"] = df["race"].fillna("Unknown")
df["admission"] = df["admission"].fillna("Deny")

That's it! Now let's see how to turn this into a meaningful report.

Step 2: The Boring Part - Extracting Summary Statistics 📉

Despite all the progress in AI capability and availability, you probably don't want to send your entire dataset to an LLM provider, for reasons such as context-window limits, cost, and data privacy.

So, some manual work is still necessary.

Building the Summary Function

With this approach, you write a function that extracts the essential elements and statistics from your Pandas DataFrame. You'll need to write this function from scratch for different datasets, but the core idea transfers easily between projects.

The get_summary_context_message() function takes a DataFrame and returns a formatted multi-line string containing a detailed summary. It covers:

- Total application count and gender distribution
- International vs. domestic applicant breakdown
- GPA and GMAT statistics (mean and quartiles)
- Admission rates by academic major and by work industry
- The impact of work experience on admissions

Here's the full source code of the function:

def get_summary_context_message(df: pd.DataFrame) -> str:
    """
    Generate a comprehensive summary report of MBA admissions dataset statistics.
    
    This function analyzes MBA application data to provide detailed statistics on
    applicant demographics, academic performance, professional backgrounds, and
    admission rates across various categories. The summary includes gender and
    international status distributions, GPA and GMAT score statistics, admission
    rates by academic major and work industry, and work experience impact analysis.
    
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing MBA admissions data with the following expected columns:
        - 'gender', 'international', 'gpa', 'gmat', 'major', 'work_industry', 'work_exp', 'admission'
    
    Returns
    -------
    str
        A formatted multi-line string containing comprehensive MBA admissions
        statistics.
    """
    # Basic application statistics
    total_applications = len(df)

    # Gender distribution
    gender_counts = df["gender"].value_counts()
    male_count = gender_counts.get("Male", 0)
    female_count = gender_counts.get("Female", 0)

    # International status
    international_count = (
        df["international"].sum()
        if df["international"].dtype == bool
        else (df["international"] == True).sum()
    )

    # GPA statistics
    gpa_data = df["gpa"].dropna()
    gpa_avg = gpa_data.mean()
    gpa_25th = gpa_data.quantile(0.25)
    gpa_50th = gpa_data.quantile(0.50)
    gpa_75th = gpa_data.quantile(0.75)

    # GMAT statistics
    gmat_data = df["gmat"].dropna()
    gmat_avg = gmat_data.mean()
    gmat_25th = gmat_data.quantile(0.25)
    gmat_50th = gmat_data.quantile(0.50)
    gmat_75th = gmat_data.quantile(0.75)

    # Major analysis - admission rates by major
    major_stats = []
    for major in df["major"].unique():
        major_data = df[df["major"] == major]
        admitted = len(major_data[major_data["admission"] == "Admit"])
        total = len(major_data)
        rate = (admitted / total) * 100
        major_stats.append((major, admitted, total, rate))

    # Sort by admission rate (descending)
    major_stats.sort(key=lambda x: x[3], reverse=True)

    # Work industry analysis - admission rates by industry
    industry_stats = []
    for industry in df["work_industry"].unique():
        if pd.isna(industry):
            continue
        industry_data = df[df["work_industry"] == industry]
        admitted = len(industry_data[industry_data["admission"] == "Admit"])
        total = len(industry_data)
        rate = (admitted / total) * 100
        industry_stats.append((industry, admitted, total, rate))

    # Sort by admission rate (descending)
    industry_stats.sort(key=lambda x: x[3], reverse=True)

    # Work experience analysis
    work_exp_data = df["work_exp"].dropna()
    avg_work_exp_all = work_exp_data.mean()

    # Work experience for admitted students
    admitted_students = df[df["admission"] == "Admit"]
    admitted_work_exp = admitted_students["work_exp"].dropna()
    avg_work_exp_admitted = admitted_work_exp.mean()

    # Work experience ranges analysis
    def categorize_work_exp(exp):
        if pd.isna(exp):
            return "Unknown"
        elif exp < 2:
            return "0-1 years"
        elif exp < 4:
            return "2-3 years"
        elif exp < 6:
            return "4-5 years"
        elif exp < 8:
            return "6-7 years"
        else:
            return "8+ years"

    df["work_exp_category"] = df["work_exp"].apply(categorize_work_exp)
    work_exp_category_stats = []

    for category in ["0-1 years", "2-3 years", "4-5 years", "6-7 years", "8+ years"]:
        category_data = df[df["work_exp_category"] == category]
        if len(category_data) > 0:
            admitted = len(category_data[category_data["admission"] == "Admit"])
            total = len(category_data)
            rate = (admitted / total) * 100
            work_exp_category_stats.append((category, admitted, total, rate))

    # Build the summary message
    summary = f"""MBA Admissions Dataset Summary (2025)
    
Total Applications: {total_applications:,} people applied to the MBA program.

Gender Distribution:
- Male applicants: {male_count:,} ({male_count/total_applications*100:.1f}%)
- Female applicants: {female_count:,} ({female_count/total_applications*100:.1f}%)

International Status:
- International applicants: {international_count:,} ({international_count/total_applications*100:.1f}%)
- Domestic applicants: {total_applications-international_count:,} ({(total_applications-international_count)/total_applications*100:.1f}%)

Academic Performance Statistics:

GPA Statistics:
- Average GPA: {gpa_avg:.2f}
- 25th percentile: {gpa_25th:.2f}
- 50th percentile (median): {gpa_50th:.2f}
- 75th percentile: {gpa_75th:.2f}

GMAT Statistics:
- Average GMAT: {gmat_avg:.0f}
- 25th percentile: {gmat_25th:.0f}
- 50th percentile (median): {gmat_50th:.0f}
- 75th percentile: {gmat_75th:.0f}

Major Analysis - Admission Rates by Academic Background:"""

    for major, admitted, total, rate in major_stats:
        summary += (
            f"\n- {major}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += (
        "\n\nWork Industry Analysis - Admission Rates by Professional Background:"
    )

    # Show top 8 industries by admission rate
    for industry, admitted, total, rate in industry_stats[:8]:
        summary += (
            f"\n- {industry}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += "\n\nWork Experience Impact on Admissions:\n\nOverall Work Experience Comparison:"
    summary += (
        f"\n- Average work experience (all applicants): {avg_work_exp_all:.1f} years"
    )
    summary += f"\n- Average work experience (admitted students): {avg_work_exp_admitted:.1f} years"

    summary += "\n\nAdmission Rates by Work Experience Range:"
    for category, admitted, total, rate in work_exp_category_stats:
        summary += (
            f"\n- {category}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    # Key insights
    best_major = major_stats[0]
    best_industry = industry_stats[0]

    summary += "\n\nKey Insights:"
    summary += (
        f"\n- Highest admission rate by major: {best_major[0]} at {best_major[3]:.1f}%"
    )
    summary += f"\n- Highest admission rate by industry: {best_industry[0]} at {best_industry[3]:.1f}%"

    if avg_work_exp_admitted > avg_work_exp_all:
        summary += f"\n- Admitted students have slightly more work experience on average ({avg_work_exp_admitted:.1f} vs {avg_work_exp_all:.1f} years)"
    else:
        summary += "\n- Work experience shows minimal difference between admitted and all applicants"

    return summary

Once the function is defined, simply call it and print the results:

print(get_summary_context_message(df))

Step 3: The Cool Part - Working with LLMs 🤖

This is where things get interesting and your manual data-extraction work pays off.

A Python Helper Function for Working with LLMs

If you have decent hardware, I strongly recommend using local LLMs for simple tasks like this. I use Ollama and the latest version of the Mistral model for the actual LLM processing.

If you'd rather use something like ChatGPT through the OpenAI API, you can do that too. You just need to modify the function below so that it sets your API key and returns the appropriate instance from Langchain (a rough sketch of that variant follows the code block below).

Whichever option you choose, calling get_llm() with a test message should not raise an error:

def get_llm(model_name: str = "mistral:latest") -> ChatOllama:
    """
    Local LLM inference ke liye ek ChatOllama instance banata aur configure karta hai.
    
    Yeh function ek ChatOllama client ko initialize karta hai jo local Ollama server se
    connect karne ke liye configured hota hai. Client ko deterministic output
    (temperature=0) ke saath set kiya gaya hai taaki same input ke liye consistent
    responses milen.
    
    Parameters
    ----------
    model_name : str, optional
        Use karne wale Ollama model ka naam. Default "mistral:latest" hai.
    
    Returns
    -------
    ChatOllama
        Ek configured ChatOllama instance.
    """
    return ChatOllama(
        model=model_name, base_url="http://localhost:11434", temperature=0
    )


print(get_llm().invoke("test").content)
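
If you'd rather call the OpenAI API instead of Ollama, a minimal sketch of the alternative helper might look like this. It assumes the langchain-openai package is installed and that your API key is available in the OPENAI_API_KEY environment variable; the model name here is only an example:

from langchain_openai import ChatOpenAI

def get_llm(model_name: str = "gpt-4o-mini") -> ChatOpenAI:
    # ChatOpenAI reads the API key from the OPENAI_API_KEY environment variable by default
    return ChatOpenAI(model=model_name, temperature=0)

print(get_llm().invoke("test").content)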

Summarization Prompt

This is where you can get creative and write ultra-specific instructions for your LLM. I've kept things light for demonstration purposes, but feel free to experiment here; there's no single right or wrong prompt.

Whatever you do, don't forget to include the format arguments in curly brackets, as these values will be filled in dynamically later:

SUMMARIZE_DATAFRAME_PROMPT = """
You are an expert data analyst and data summarizer. Your task is to take in complex datasets
and return user-friendly descriptions and findings.

You were given this dataset:
- Name: {dataset_name}
- Source: {dataset_source}

This dataset was analyzed in a pipeline before it was given to you.
These are the findings returned by the analysis pipeline:

<context>
{context}
</context>

Based on these findings, write a detailed report in {report_format} format.
Give the report a meaningful title and separate findings into sections with headings and subheadings.
Output only the report in {report_format} and nothing else.

Report:
"""

Summarization Python Function

With the prompt and the get_llm() function declared, all that's left is to tie everything together. The get_report_summary() function takes arguments that fill the format placeholders in the prompt, then invokes the LLM with that prompt to generate a report.

You can choose between Markdown and HTML formats:

def get_report_summary(
    dataset: pd.DataFrame,
    dataset_name: str,
    dataset_source: str,
    report_format: Literal["markdown", "html"] = "markdown",
) -> str:
    """
    Ek pandas DataFrame se AI-powered summary report generate karta hai.
    
    Yeh function ek dataset ko analyze karta hai aur ek LLM ka use karke ek
    comprehensive summary report generate karta hai.
    
    Parameters
    ----------
    dataset : pd.DataFrame
        Analyze aur summarize karne ke liye pandas DataFrame.
    dataset_name : str
        Dataset ke liye ek descriptive naam.
    dataset_source : str
        Dataset ke source ya origin ke baare me jaankari.
    report_format : {"markdown", "html"}, optional
        Report ke liye desired output format. Default "markdown" hai.
    
    Returns
    -------
    str
        Ek formatted summary report.
    """
    context_message = get_summary_context_message(df=dataset)
    prompt = SUMMARIZE_DATAFRAME_PROMPT.format(
        dataset_name=dataset_name,
        dataset_source=dataset_source,
        context=context_message,
        report_format=report_format,
    )
    return get_llm().invoke(input=prompt).content

Using the function is straightforward: just pass in the dataset, its name, and its source. The report format defaults to Markdown:

md_report = get_report_summary(
    dataset=df, 
    dataset_name="MBA Admissions (2025)",
    dataset_source="https://www.kaggle.com/datasets/taweilo/mba-admission-dataset"
)
print(md_report)
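
Since get_report_summary() also supports HTML, you could generate that variant and save either report to disk. This is an optional follow-up; the file names are placeholders:

html_report = get_report_summary(
    dataset=df,
    dataset_name="MBA Admissions (2025)",
    dataset_source="https://www.kaggle.com/datasets/taweilo/mba-admission-dataset",
    report_format="html",
)

# Write both reports to files so they can be shared or rendered elsewhere
with open("mba_report.md", "w") as f:
    f.write(md_report)
with open("mba_report.html", "w") as f:
    f.write(html_report)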