by : Aashiya Mittal
July 28th 2023

Prepare for efficient, automated, and advanced insights with Pandas AI, and see generative AI capabilities in action.

Have you ever imagined talking to your data the way you talk to a close friend? Probably not.

What if I told you that you can do it now?

Well, this is what Pandas AI is for. It is a Python library that adds generative AI capabilities to your data frames. Gone are the days of spending hours staring at complex rows and columns without making any meaningful progress.

So, does it replace Pandas?

Worry not. Pandas AI is not here to replace Pandas; think of it as an extension of Pandas. Imagine having a data frame that can write its own reports, or one that can analyze complex data and present you with easily understandable summaries. The possibilities are exciting!

In this concise guide, we’ll take you through a step-by-step journey of harnessing the power of this library, regardless of your experience level. Whether you’re an experienced data analyst or just starting out, this guide gives you the tools you need to dive into Pandas AI with confidence.

So sit back, relax, and let’s explore the possibilities it has to offer! Before we dive deep into Pandas AI, let’s brush up on Pandas basics and key features.

What is Pandas and What Are its Key Features?

Pandas is a powerful open-source Python library that provides high-performance data manipulation and analysis tools. It introduces two fundamental data structures, DataFrame and Series, which enable efficient handling of structured data.

Let’s explore some of the key features of pandas.

  • It provides high-performance, easy-to-use data structures like DataFrames, which are similar to tables in a relational database.
  • Pandas allows you to read and write data in various formats, including CSV, Excel, SQL databases, and more.
  • It offers flexible data cleaning and preprocessing capabilities, enabling you to handle missing values, duplicate data, and other common data issues.
  • Pandas provides powerful indexing and slicing functions, allowing you to extract, filter, and transform data efficiently.
  • It supports statistical operations such as grouping, aggregation, and calculation of summary statistics.
  • Pandas offers a wide range of data visualization options, including line plots, scatter plots, bar charts, and histograms.
  • It integrates well with other popular Python libraries like NumPy and Matplotlib.
  • Pandas is widely used in data analysis, scientific research, finance, and other fields where working with structured data is required.
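To make a few of these features concrete, here is a minimal sketch in plain Pandas. The sales data, column names, and the 90-unit threshold are made up for illustration; the point is only to show a DataFrame, a derived column, boolean indexing, and a grouped aggregation side by side.

```python
import pandas as pd

# Hypothetical sales data, made up for illustration.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "units": [120, 85, 95, 60, 110],
    "price": [9.5, 12.0, 9.5, 15.0, 12.0],
})

# Derive a new column from existing ones.
sales["revenue"] = sales["units"] * sales["price"]

# Boolean indexing: keep only rows with at least 90 units sold.
large_orders = sales[sales["units"] >= 90]

# Grouping and aggregation: total revenue per region.
revenue_by_region = sales.groupby("region")["revenue"].sum()

print(large_orders)
print(revenue_by_region)
```

These are the kinds of operations you would otherwise describe to Pandas AI in natural language.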

Pandas AI is an extension of Pandas with the capabilities of generative AI, taking data analysis to another level. Now, let’s get started with it.

Pandas AI: A Step Ahead in the Data Analysis Game

Pandas AI is a Python library that adds generative artificial intelligence capabilities to Pandas, the popular data manipulation and analysis library.

It is an open-source project that acts as a user-friendly interface on top of Pandas. By sending smart prompts to LLM APIs, it lets you ask questions about your data in natural language. This means you can directly engage with your data, making data exploration more intuitive and interactive.

The best part? With it, you don’t have to build custom in-house LLMs, saving both money and resources.

Extensive Role of Pandas AI in Data Analysis

As we have already mentioned, Pandas AI extends the capabilities of Pandas. But how? Let’s explore its role in improving data analysis.

Leveraging Automation Power

It brings the power of artificial intelligence and machine learning to the existing Pandas library, making it a next-generation tool for simplifying data analysis. It cuts down the time analysts spend on repetitive, complex tasks by automating them within minutes. Pandas AI enhances analysts’ productivity, as they can now focus on high-level decision-making.

It reduces the time and effort analysts spend on the following operations in the data analysis pipeline.

  • Data filtering
  • Data sorting
  • Data grouping
  • Data restructuring
  • Data cleaning
  • Data integration
  • Data manipulation
  • DataFrame description
  • Data standardization
  • Time series analysis

Imagine applying AI to the above operations. Start thinking about where you can implement AI and automate your daily tasks.

Next-level Exploratory Data Analysis

When it comes to analyzing data, Exploratory Data Analysis (EDA) is a critical step. It helps analysts uncover insights, spot patterns, and catch any unusual data points. Now, imagine taking EDA to the next level with the help of Pandas AI. This incredible tool automates tasks like data profiling and visualization. It digs deep into the data, creating summary statistics and interactive visuals. This means analysts can quickly understand the nature and spread of different variables. With this automation, the data exploration process becomes faster, making it easier to discover hidden patterns and relationships efficiently.

Advanced Data Imputation and Feature Engineering

Dealing with missing data is a frequent hurdle in data analysis, and filling in those gaps accurately can greatly affect the reliability of our findings. Here’s where Pandas AI steps in, harnessing the power of AI algorithms to cleverly impute missing values. By detecting patterns and relationships within the dataset, it fills in the gaps intelligently. 

But that’s not all! It takes a step further by automating feature engineering. It identifies and creates new variables that capture complex connections, interactions, and non-linear patterns in the data. This automated feature engineering boosts the accuracy of predictive models and saves valuable time for analysts.
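As a rough illustration of what imputation and feature engineering look like in code, here is a sketch in plain Pandas. It is the kind of code an LLM-backed tool might generate for a prompt such as "fill missing salaries with the department median", not a description of Pandas AI's internals. The staff data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data with gaps, made up for illustration.
staff = pd.DataFrame({
    "department": ["IT", "IT", "IT", "Sales", "Sales", "Sales"],
    "salary": [5000.0, None, 7000.0, 4000.0, 4400.0, None],
    "years": [2, 4, 5, 1, 2, 3],
})

# Impute each missing salary with the median of its own department.
dept_median = staff.groupby("department")["salary"].transform("median")
staff["salary"] = staff["salary"].fillna(dept_median)

# A simple engineered feature: salary per year of experience.
staff["salary_per_year"] = staff["salary"] / staff["years"]

print(staff)
```

Filling by group rather than with one global value keeps the imputed numbers closer to what similar rows look like, which is the "patterns and relationships" idea in its simplest form.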

Predictive Modeling and Machine Learning

Pandas AI effortlessly blends with machine learning libraries, empowering analysts to construct predictive models and unlock profound data insights. It simplifies the machine learning process by automating model selection, hyperparameter tuning, and evaluation. Analysts can now swiftly test various algorithms, assess their effectiveness, and pinpoint the best model for a specific challenge. The beauty of Pandas AI lies in its accessibility, allowing even non-coders to harness the power of machine learning for data analysis.

Accelerating Decision-making with Simulations

With Pandas AI, decision-makers gain the power to explore potential outcomes through simulations. By adjusting data and introducing different factors, this library enables users to investigate “what-if” situations and assess the effects of different strategies. By simulating real-world scenarios, Pandas AI helps make informed decisions and identify the best possible courses of action. It’s like having a crystal ball that guides you toward optimal choices.

Get Started with Pandas AI

Here’s how you can get started with Pandas AI, including some examples and their corresponding output.

Installation

Before you start using PandasAI, you need to install it. Open your terminal or command prompt and run the following command.

pip install pandasai

Import Pandas AI using OpenAI

Once you have completed the installation, you’ll need to connect to a language model on the backend, in this case an OpenAI model. To do this, follow these steps.

  • Visit OpenAI and sign up using your email or connect your Google Account.
  • In your Account Settings, look for “View API keys” on the left side.
  • Click on “Create new Secret key”.
  • Once you have your API keys, import the required libraries into your project notebook.

These steps will allow you to obtain the necessary API key from OpenAI and set up your project notebook to connect with the OpenAI language model.

Now, import the following and create the LLM object.

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# Replace the placeholder with your own OpenAI API key
llm = OpenAI(api_token="YOUR_API_KEY")
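Hard-coding a secret key in a notebook makes it easy to leak by accident. One common alternative is to read it from an environment variable. The helper below is a small sketch of that idea; `load_api_key` is a hypothetical name and not part of Pandas AI, and it assumes you have exported `OPENAI_API_KEY` in your shell beforehand.

```python
import os


def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or raise if it is unset."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage (assumes the variable was exported beforehand):
# llm = OpenAI(api_token=load_api_key())
```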

Running Model on the DataFrame with Pandas AI

Pass the OpenAI model to Pandas AI using the command below.

pandas_ai = PandasAI(llm)

Run the model on the data frame by passing two parameters, the DataFrame and your question.

For example-

 

pandas_ai.run(df, prompt='the question you would like to ask?')

Now that we have everything in place, let’s start asking questions.

Let’s interact with DataFrames using Pandas AI

To ask questions using Pandas AI, you can use the “run” method of the PandasAI object. This method requires two inputs: the DataFrame containing your data and a natural language prompt that represents the question or commands you want to execute on your data.

To verify the accuracy of the results, you can compare the output of Pandas AI with what plain Pandas returns for the same question.

Querying data

You can ask Pandas AI to return DataFrame rows with a column’s value greater than a specific value, or to rank rows by a column.

For example-

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# Sample DataFrame
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

# Instantiate an LLM
llm = OpenAI(api_token="YOUR_API_TOKEN")

pandas_ai = PandasAI(llm)
pandas_ai(df, prompt='Which are the 5 happiest countries?')

Output-

6            Canada
7         Australia
1    United Kingdom
3           Germany
0     United States
Name: country, dtype: object
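For comparison, the same question can be answered in plain Pandas. This sketch reuses the sample data from the example above and assumes that "happiest" means the highest `happiness_index`.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# Sort by happiness, highest first, and keep the top five country names.
top5 = df.sort_values("happiness_index", ascending=False).head(5)["country"]

print(top5)
```

The result lists Canada, Australia, the United Kingdom, Germany, and the United States, in that order, which matches the Pandas AI output.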

Asking Complex Queries

Continuing the above example, to find the sum of the GDPs of the two unhappiest countries, run the following code.

For example-

pandas_ai(df, prompt='What is the sum of the GDPs of the 2 unhappiest countries?')

Output-

19012600725504
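Here is the plain Pandas equivalent as a cross-check, again using the sample data from the example above and treating the two lowest `happiness_index` values as the "unhappiest" countries.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# Pick the two rows with the lowest happiness index (China and Japan here).
unhappiest = df.nsmallest(2, "happiness_index")

# Add up their GDPs.
gdp_sum = int(unhappiest["gdp"].sum())

print(gdp_sum)  # 19012600725504
```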

Data Visualization with Pandas AI

Visualizing data is essential for understanding patterns and relationships. Pandas AI can perform data visualization tasks, such as creating plots, charts, and graphs, from a natural language prompt. By visualizing data, you can gain insights and make informed decisions about modeling and analysis.

For example-

pandas_ai(
    df,
    "Plot the histogram of countries showing for each the gdp, using different colors for each bar",
)

For example-

prompt = "plot the histogram for this dataset"

response = pandas_ai.run(df, prompt=prompt)

print(f"** PANDAS AI: {response}")
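If you want to see what such a chart looks like when written by hand, here is a sketch using Matplotlib directly. It is an assumption about the kind of plotting code an LLM might produce for the GDP prompt, not the code Pandas AI actually generates. Strictly speaking, one bar per country is a bar chart rather than a histogram. The output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
})

fig, ax = plt.subplots(figsize=(10, 5))

# One distinct color per bar, taken from the 10-color "tab10" palette.
colors = plt.cm.tab10(list(range(len(df))))
bars = ax.bar(df["country"], df["gdp"], color=colors)

ax.set_ylabel("GDP")
ax.set_title("GDP by country")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()

# Save the chart to an image file (arbitrary file name).
fig.savefig("gdp_by_country.png")
```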



Handling Multiple DataFrames Together Using Pandas AI

Pandas AI allows you to pass multiple DataFrames and ask questions based on them.

For example-

# Example of using PandasAI on multiple Pandas DataFrames
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

employees_data = {
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
}

salaries_data = {
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
}

employees_df = pd.DataFrame(employees_data)
salaries_df = pd.DataFrame(salaries_data)

# With no api_token argument, the key is read from the OPENAI_API_KEY environment variable
llm = OpenAI()
pandas_ai = PandasAI(llm, verbose=True, conversational=True)

# Pass the DataFrames as a list, followed by the question
response = pandas_ai([employees_df, salaries_df], "Who gets paid the most?")
print(response)
# Output: Olivia

Code source- GitHub
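To check that answer by hand, the two DataFrames can be joined on their shared `EmployeeID` column in plain Pandas. This sketch reuses the data from the example above.

```python
import pandas as pd

employees_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
})
salaries_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
})

# Join the two tables on the shared key.
merged = employees_df.merge(salaries_df, on="EmployeeID")

# Look up the name in the row with the highest salary.
top_earner = merged.loc[merged["Salary"].idxmax(), "Name"]

print(top_earner)  # Olivia
```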

Enforcing Security

To generate the Python code it will run, Pandas AI first takes a small portion of the dataframe (its head), mixes up the data (using random numbers for sensitive information and shuffling for non-sensitive information), and sends only that portion to the LLM.

If you want to protect your privacy even more, you can use PandasAI with a setting called enforce_privacy = True. This setting ensures that only the names of the columns are sent to the LLM, without sending any actual data from the data frame.

For example-

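# Example of using PandasAI with a Pandas DataFrame
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# Sample data from the example's own package; this relative import
# only works when the script runs from inside that package
from .data.sample_dataframe import dataframe

df = pd.DataFrame(dataframe)

# With no api_token argument, the key is read from the OPENAI_API_KEY environment variable
llm = OpenAI()
pandas_ai = PandasAI(llm, verbose=True, enforce_privacy=True)

response = pandas_ai(
    df,
    "Calculate the sum of the gdp of north american countries",
)
print(response)
# Output: 20901884461056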

Code source- GitHub
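To make the privacy idea more tangible, here is a rough sketch of the two levels described above: sending a shuffled sample versus sending only column names. This is not Pandas AI's actual implementation. Both helper functions are hypothetical and only illustrate the concept, and the sketch covers shuffling only, not the random-number replacement of sensitive values.

```python
import pandas as pd


def shuffled_sample(df, n=5, seed=0):
    """Hypothetical helper: take the first n rows and shuffle every column
    independently, so values in a row no longer belong together."""
    head = df.head(n)
    return pd.DataFrame({
        col: head[col].sample(frac=1, random_state=seed + i).to_numpy()
        for i, col in enumerate(head.columns)
    })


def columns_only(df):
    """Hypothetical helper: expose the column names and no values at all."""
    return list(df.columns)


df = pd.DataFrame({
    "country": ["United States", "Canada", "Japan"],
    "gdp": [19294482071552, 1607402389504, 4380756541440],
})

print(shuffled_sample(df))  # same values per column, in scrambled order
print(columns_only(df))     # ['country', 'gdp']
```

The second form corresponds to the spirit of enforce_privacy=True: the model learns what the columns are called but sees none of the data.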

Pandas AI with other LLMs

GooglePalm

PaLM 2 is a language model made by Google. It is good at advanced reasoning tasks such as understanding code and math, answering questions, translating languages, and generating natural-sounding text. Google describes it as better at these tasks than its previous language models, thanks to improved technology and training.

To use this model, you need a Google Cloud API key. After getting the key, create an instance of the GooglePalm object.

Use the below example to call the GooglePalm model.

from pandasai import PandasAI

from pandasai.llm.google_palm import GooglePalm

llm = GooglePalm(google_cloud_api_key="my-google-cloud-api-key")

pandas_ai = PandasAI(llm=llm)

Google VertexAI

If you want to use the Google PaLM models through the Vertex AI API, you must have the following.

  • A Google Cloud project
  • The region of the project set up
  • The optional dependency google-cloud-aiplatform installed
  • gcloud authentication

After setting everything up, you can create an instance of Google PaLM using Vertex AI. Use the below example to call Google VertexAI.

from pandasai import PandasAI

from pandasai.llm.google_palm import GoogleVertexai

llm = GoogleVertexai(project_id="generative-ai-training",

                     location="us-central1",

                     model="text-bison@001")

pandas_ai = PandasAI(llm=llm)

HuggingFace models

As with OpenAI, you need an API key to use the HuggingFace models.

Once you have the key, use it to instantiate one of the HuggingFace models. PandasAI supports the following HuggingFace models-

  • Starcoder: bigcode/starcoder
  • OpenAssistant: OpenAssistant/oasst-sft-1-pythia-12b
  • Falcon: tiiuae/falcon-7b-instruct

 

For example-

 

from pandasai import PandasAI

from pandasai.llm.starcoder import Starcoder

from pandasai.llm.open_assistant import OpenAssistant

from pandasai.llm.falcon import Falcon

llm = Starcoder(huggingface_api_key="my-huggingface-api-key")

# or

llm = OpenAssistant(huggingface_api_key="my-huggingface-api-key")

# or

llm = Falcon(huggingface_api_key="my-huggingface-api-key")

pandas_ai = PandasAI(llm=llm)

If you prefer not to pass the key in your code, set the HUGGINGFACE_API_KEY environment variable and instantiate the models without arguments.

from pandasai import PandasAI

from pandasai.llm.starcoder import Starcoder

from pandasai.llm.open_assistant import OpenAssistant

from pandasai.llm.falcon import Falcon

llm = Starcoder() # no need to pass the API key, it will be read from the environment variable

# or

llm = OpenAssistant() # no need to pass the API key, it will be read from the environment variable

# or

llm = Falcon() # no need to pass the API key, it will be read from the environment variable

pandas_ai = PandasAI(llm=llm)

Challenges Ahead of Pandas AI

As we delve into Pandas AI and its potential to transform data analysis, it’s crucial to address certain challenges and ethical considerations. Automating data analysis highlights important concerns regarding transparency, accountability, and bias. Analysts need to be cautious when interpreting and validating the results produced by Pandas AI, as they retain the responsibility for critical decision-making based on the insights derived. 

Let’s remember that while Pandas AI offers incredible possibilities, human judgment and careful assessment remain indispensable for making informed choices.

Below are some other challenges that you must consider for better data analysis.

  • Interpretation of Prompts- The results generated by Pandas AI heavily rely on how the AI interprets the prompts given by users. In some cases, it may not provide the expected answers, leading to potential discrepancies or confusion.
  • Contextual Understanding- Pandas AI may struggle with understanding the contextual nuances of specific datasets or domain-specific terminology. This can sometimes result in inaccurate or incomplete insights.
  • Limited Coverage- Pandas AI’s effectiveness is influenced by the breadth and depth of the training data behind the underlying language model. If the model hasn’t been extensively trained on certain types of datasets or domains, its performance in those areas may be limited.
  • Handling Ambiguity- Ambiguous or poorly defined prompts can pose challenges for Pandas AI, potentially leading to inconsistent or unreliable outcomes. Clear and precise instructions are crucial to ensure accurate results.
  • Dependency on Training Data- The quality and diversity of the training data used to build the underlying language model can impact the performance of Pandas AI. Biases or limitations in that data may influence its ability to handle certain scenarios or produce unbiased insights.

Consider potential challenges and exercise caution when relying on Pandas AI for critical decision-making or sensitive data analysis. Consistent evaluation and validation of the generated results help mitigate these challenges and ensure the reliability of the analysis.
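One lightweight way to validate results is to compare the answer from Pandas AI with a value you computed independently in plain Pandas. The helper below is a sketch of that idea. `matches_expected` is a hypothetical function, and it assumes the response contains the number you are checking as its first numeric value.

```python
import math
import re


def matches_expected(response, expected, rel_tol=1e-9):
    """Return True if the first number found in `response` is close to `expected`."""
    # Drop thousands separators, then grab the first integer or decimal number.
    text = str(response).replace(",", "")
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        return False
    return math.isclose(float(match.group()), expected, rel_tol=rel_tol)


# Compare an AI-style answer with a value computed separately in Pandas.
print(matches_expected("The sum is 19,012,600,725,504", 19012600725504))
print(matches_expected("About 19 trillion", 19012600725504))
```

A check like this will not catch every problem, but it flags the cases where the model misread the prompt and returned a different quantity.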

Pandas AI with Solid Future Prospects

PandasAI holds the potential to revolutionize the ever-changing world of data analysis. If you’re a data analyst focused on extracting insights and creating plots based on user needs, this library can automate the process efficiently. However, there are a few challenges to be aware of while using PandasAI.

The results obtained heavily rely on how the AI interprets your instructions, and sometimes it may not give the expected answers. For example, in the Olympics dataset, the AI occasionally got confused between “Olympic games” and “Olympic events,” leading to potentially different responses. 

Nevertheless, its advantages in simplifying and streamlining data analysis make it a valuable tool. Its advanced functionalities and efficient capabilities make it a useful asset in a data scientist’s toolkit.

FAQs

Q1: What is Pandas AI and how can it help me with my data analysis?

Pandas AI is an extension of the Pandas library that applies artificial intelligence (AI) to make data analysis easier and quicker. It can automate tasks such as data cleaning and offers smarter insights with better visualizations.

Q2: How does traditional Pandas differ from the AI version?

Pandas AI goes beyond standard Pandas by incorporating artificial intelligence into its features. For instance, it automates data cleaning, has advanced visualizations, offers predictive analytics, and allows for querying of data in natural language.

Q3: Can I use this tool alongside other tools in my existing workflow?

Yes, you can use it alongside other tools such as the traditional Pandas library itself, NumPy, Matplotlib, or Seaborn.

Q4: Who can benefit from using Pandas AI? 

Pandas AI is beneficial for:

  • Data Scientists and Analysts
  • Business Analysts
  • Researchers
  • Developers

Q5: What are the advantages of using Pandas AI over traditional Pandas? 

Advantages of Pandas AI over traditional Pandas include:

  • Increased Automation: Reduces the need for manual data preparation.
  • Enhanced Insights: Provides deeper and more accurate analysis using AI.
  • Time Savings: Speeds up the data analysis process significantly.
  • User-Friendly: Easier to use for both novice and experienced users with its NLP features.

ABOUT THE AUTHOR

Aashiya Mittal

A computer science engineer with a strong understanding of programming languages. Has been in the writing world for more than 4 years, creating valuable content for all tech stacks.
