Analyzing data using ChatGPT
Updated: Jul 9
Or "Analysing data using ChatGPT" if you find yourself in the Commonwealth :) In my previous blog, How will ChatGPT change data analytics, I discuss the potential disruption that generative AI, such as ChatGPT, will bring to data analytics. In summary, these impacts are:
Generative AI models may move data analytics even further away from traditional reporting as users may increasingly rely on simply asking the model to interpret large and complex data and the patterns within.
Data engineers and data scientists will increasingly rely on generative AI to expedite the code and models required within data analytics workloads. It may also put these fields in the hands of people without formal training as data engineers, scientists and statisticians.
Hypothesis and sandbox environments will be easier to spin up due to the ease with which synthetic data can be produced.
In this short article I show how ChatGPT (or Bard) can be used to access, analyse and interpret data.
Important update 9 July 2023 - ChatGPT now includes Code Interpreter, a sidekick for all data analysts with huge potential. Please see the follow-up article: https://www.makingmeaning.info/post/chatgpt-code-interpreter-this-changes-everything
Important update 21 June 2023 - the concepts and examples in this article are still valid, although OpenAI has blocked the ability to analyse data via a URL as described here. However, Microsoft now allows for the easy creation of a data analysis solution, securely within Azure, that achieves the same. Please see this follow-up article: https://www.makingmeaning.info/post/chatgpt-on-your-own-data-azure-openai-on-your-data Some additional updates about this topic can be found at the end of this article.
Let's analyze data with ChatGPT.
I will start with some simple examples, then ask ChatGPT to do something a little more complicated, then discuss what this means for corporate and organisational data, and conclude with some key observations.
Simple example using global video game sales dataset
For this example, I used a video games sales dataset from Kaggle - Global Video Game Sales | Kaggle.
I initially tried to paste the whole dataset into ChatGPT for analysis but ran into a limitation. I tried 16,000, 1,000, and 200 records unsuccessfully and eventually got going with 100 records. Clearly not a great representative sample of the original 16,000.
So, I instead simply added the URL to the dataset containing all the records, and, voila!
Here is a snippet of the 16.6K dataset
Now let's put ChatGPT through its paces with some simple questions.
Question: Can you analyse this dataset https://www.kaggle.com/datasets/thedevastator/global-video-game-sales?select=vgsales.csv
ChatGPT Answer: Yes, I can help you analyze the dataset you mentioned. The dataset you referred to contains data on global video game sales from 1980 to 2016. It includes information on the game's name, platform, year of release, genre, publisher, and sales in various regions of the world.
Here are some initial observations and insights from the dataset:
The dataset contains 16,598 entries (video games) and 11 columns (variables).
The platform variable contains information about the console or platform on which the game was released. There are 31 unique platforms in the dataset, with the most common ones being the PlayStation 2, Xbox 360, and PlayStation 3.
The genre variable contains information about the game's genre, with the most common ones being Action, Sports, and Misc (Miscellaneous).
The publisher variable contains information about the company that published the game. The most common publishers in the dataset are Electronic Arts, Activision, and Namco Bandai Games.
The dataset includes information on video game sales in various regions of the world, including North America, Europe, Japan, and other regions. The total sales column represents the total number of units sold worldwide.
The dataset includes video games released from 1980 to 2016, with the majority of games released after the year 2000.
The dataset has missing values in the year, genre, and publisher variables.
Here are some potential questions that could be explored with this dataset:
What are the top-selling video games of all time, and how much revenue did they generate?
How have video game sales changed over time, and what factors may have contributed to these changes?
Which platforms and genres are most popular among gamers, and how do these preferences vary by region?
Which publishers have the most successful video games, and are there any trends or patterns in their success?
How well do video games sell in different regions of the world, and are there any notable differences in sales patterns across regions?
Overall, this dataset provides a valuable resource for analyzing trends and patterns in global video game sales, as well as understanding the preferences and behaviors of gamers worldwide.
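The headline figures in ChatGPT's summary (row and column counts, most common platforms, columns with missing values) can be checked locally once the CSV has been downloaded. The following is a minimal sketch, not the code ChatGPT used. It assumes the column names Platform, Genre, Year and Publisher, assumes that missing values appear as empty strings or "N/A", and uses a tiny made-up sample in place of the real file.

```python
import csv
import io
from collections import Counter

# Tiny inline stand-in for vgsales.csv. Column names are assumed and the
# values are made up for illustration; they are not real sales data.
SAMPLE_CSV = """Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Game A,PS2,2004,Action,Publisher X,1.0,0.5,0.2,0.1,1.8
2,Game B,X360,2008,Sports,Publisher Y,0.8,0.4,0.0,0.1,1.3
3,Game C,PS2,N/A,Action,Publisher X,0.3,0.2,0.1,0.0,0.6
4,Game D,PS3,2010,Misc,,0.2,0.1,0.0,0.0,0.3
"""

# Assumption: these markers denote a missing value in the file.
MISSING = {"", "N/A"}


def profile(rows):
    """Return row/column counts, value frequencies and missing-value counts."""
    columns = list(rows[0].keys()) if rows else []
    missing = {c: sum(1 for r in rows if r[c].strip() in MISSING) for c in columns}
    return {
        "n_rows": len(rows),
        "n_columns": len(columns),
        "platforms": Counter(r["Platform"] for r in rows),
        "genres": Counter(r["Genre"] for r in rows),
        # Keep only the columns that actually have missing values.
        "missing": {c: n for c, n in missing.items() if n},
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = profile(rows)
print(summary["n_rows"], "rows,", summary["n_columns"], "columns")
print("Most common platforms:", summary["platforms"].most_common(3))
print("Columns with missing values:", summary["missing"])
```

To run this against the real file, replace io.StringIO(SAMPLE_CSV) with an open file handle for the downloaded CSV and compare the printed figures with the ones ChatGPT reported.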
Given that ChatGPT made some suggestions of questions I could ask of the data, here goes -
Question: Which publishers have the most successful video games, and are there any trends or patterns in their success?
ChatGPT Answer: To explore which publishers have the most successful video games and any trends or patterns in their success, we can analyze the dataset by publisher and look at the number of games published and total sales for each publisher. Here are some observations based on the dataset:
Electronic Arts (EA) is the publisher with the most games in the dataset, with 1351 games, followed by Activision with 975 games and Namco Bandai Games with 932 games.
EA is also the publisher with the highest total sales in the dataset, with 1.12 billion units sold, followed by Nintendo with 932 million units and Activision with 721 million units.
There are some publishers that have consistently performed well over time, such as EA and Nintendo, while others have seen more variability in their success. For example, SquareSoft (now Square Enix) had a high number of sales in the 1990s but has since declined in popularity.
Some publishers have had success with certain franchises or series. For example, Nintendo has had tremendous success with its Mario and Pokemon franchises, while Activision has had success with its Call of Duty franchise.
There is some variation in the types of games that different publishers produce, which may contribute to their success. For example, EA has had success with sports games such as FIFA and Madden, while Ubisoft has had success with action-adventure games such as Assassin's Creed and Far Cry.
Overall, these observations suggest that there are differences in the success of different video game publishers, with some publishers consistently performing well over time and others experiencing more variability in their success. The type of games produced and the success of certain franchises or series may contribute to a publisher's success.
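The two rankings in this answer (publishers by number of games and publishers by total sales) are simple aggregations that can be reproduced to verify the numbers. The sketch below is one way to do that, assuming columns named Publisher and Global_Sales; the sample rows and publisher names are placeholders, not real data.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; column names are assumed to match the dataset.
SAMPLE_CSV = """Name,Publisher,Global_Sales
Game A,Publisher X,1.8
Game B,Publisher Y,1.3
Game C,Publisher X,0.6
Game D,Publisher Z,3.0
"""


def publisher_stats(rows):
    """Count titles and sum Global_Sales per publisher."""
    stats = defaultdict(lambda: {"games": 0, "global_sales": 0.0})
    for row in rows:
        entry = stats[row["Publisher"]]
        entry["games"] += 1
        entry["global_sales"] += float(row["Global_Sales"])
    return dict(stats)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
stats = publisher_stats(rows)

# Two separate rankings: most titles published vs. highest total sales.
by_games = sorted(stats, key=lambda p: stats[p]["games"], reverse=True)
by_sales = sorted(stats, key=lambda p: stats[p]["global_sales"], reverse=True)

print("Most games:", by_games[0], stats[by_games[0]]["games"])
print("Highest sales:", by_sales[0], round(stats[by_sales[0]]["global_sales"], 2))
```

In the sample, the publisher with the most titles is not the one with the highest sales, which is why the two rankings are worth computing separately.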
Let's try visualising the data.
Question: Can you create a chart of Global_Sales by Genre?
Here a small issue occurred whereby ChatGPT stated that a chart was created, but there was an error.
I have not yet been able to figure out the issue; it might be that imgur.com has blocked ChatGPT. Still, some explanatory analysis was provided:
ChatGPT Answer: As we can see from the chart, Action games have the highest global sales, followed by Sports and Shooter games. Role-Playing games and Platform games also have relatively high global sales, while Strategy, Adventure, and Puzzle games have lower global sales. It is important to note that this chart only shows the global sales for each genre and does not take into account any regional differences in sales patterns.
Asking the same question as a table gave a perfect result set:
Something a little bit more complicated
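The result table is not reproduced in this text. As a rough sketch of the same aggregation done locally, the code below sums Global_Sales per Genre and prints a table with a simple text bar in place of a chart. The column names come from the question above; the sample rows are made up.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; Genre and Global_Sales are the columns named in the question.
SAMPLE_CSV = """Name,Genre,Global_Sales
Game A,Action,1.8
Game B,Sports,1.3
Game C,Action,0.6
Game D,Puzzle,0.3
"""


def sales_by_genre(rows):
    """Sum Global_Sales per genre, sorted from highest to lowest total."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["Genre"]] += float(row["Global_Sales"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


table = sales_by_genre(csv.DictReader(io.StringIO(SAMPLE_CSV)))

print(f"{'Genre':<10}{'Global_Sales':>14}")
for genre, total in table:
    bar = "#" * round(total * 10)  # crude text bar, one '#' per 0.1 units
    print(f"{genre:<10}{total:>14.2f}  {bar}")
```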
For this example, I used an Olympic Games dataset from Kaggle - Olympics 124 years Dataset(till 2020) | Kaggle
Firstly, a simple question to find out more about the dataset and its metadata.
Question: Tell me about the dataset can you analyse dataset https://www.kaggle.com/datasets/nitishsharma01/olympics-124-years-datasettill-2020
ChatGPT Answer: Sure! The dataset is called "Olympics 124 years dataset till 2020" and it contains information about all Olympic Games from 1896 to 2020. The dataset has 17 columns and 31169 rows, with each row representing a medal-winning athlete in a specific Olympic event.
The columns in the dataset include:
Athlete: The name of the athlete who won the medal
Age: The age of the athlete at the time of the Olympic event
Country: The country of the athlete who won the medal
Year: The year in which the Olympic event took place
Closing Ceremony Date: The date of the closing ceremony for the Olympic event
Sport: The sport in which the athlete won the medal
Gold Medals: The number of gold medals won by the athlete
Silver Medals: The number of silver medals won by the athlete
Bronze Medals: The number of bronze medals won by the athlete
Total Medals: The total number of medals won by the athlete
Summer/Winter: Whether the Olympic event was held in the summer or winter
Gender: The gender of the athlete who won the medal
Event: The specific event in which the athlete won the medal
Medal: The type of medal won by the athlete (gold, silver, or bronze)
Venue: The name of the venue where the Olympic event took place
City: The city where the Olympic event took place
Region: The region where the Olympic event took place
With this dataset, it is possible to analyze various trends and patterns in Olympic history, including the performance of individual countries, athletes, and sports over time.
Now let's do something more complicated:
Question: Can you compare the performance of male and female athletes, looking at the number of medals won by each gender, as well as any trends or differences in performance over time?
ChatGPT Answer: Yes, I can analyze the performance of male and female athletes in the Olympics using the dataset you provided. Here are some insights and trends related to gender differences in Olympic medal-winning:
Total medals won: According to the dataset, male athletes have won significantly more medals than female athletes. From 1896 to 2020, male athletes won a total of 24,367 medals, while female athletes won 9,606 medals. This means that male athletes have won almost two and a half times more medals than female athletes.
Distribution of medals: When looking at the distribution of medals won by gender, we can see that male athletes tend to win more gold medals, while female athletes tend to win more silver and bronze medals. This trend holds true across most sports and events.
Trends over time: When looking at the trends in Olympic medal-winning over time, we can see that the gap between male and female athletes has been narrowing. In the early years of the Olympics, very few events were open to women, which meant that they had fewer opportunities to win medals. However, as more events were added and more countries started sending female athletes, the number of medals won by women has increased steadily over time.
Recent trends: In recent years, female athletes have been closing the gap with male athletes in terms of overall medal-winning. In the 2016 Rio Olympics, for example, female athletes won almost 45% of the total medals, which is the highest percentage in Olympic history. This trend is expected to continue in the future, as more events become open to women and more countries invest in women's sports.
Overall, while male athletes have historically won more medals than female athletes, the gap between the two genders has been narrowing over time, and female athletes are becoming increasingly competitive in the Olympics.
What does this mean for corporate and organisational data and how would one go about operationalising ChatGPT in an organisational data analytics environment?
ChatGPT is clearly not going to reach behind the firewall and access data that should not be accessed. It is not clear what OpenAI (and Microsoft), or Google with Bard, its counterpart to ChatGPT, have in store for these generative AI engines, but one would assume that integration into a corporate environment is at least being considered. Until then, ChatGPT can be used right now to expedite the code that delivers the analysis above.
Let's use the final question "Can you compare the performance of male and female athletes, looking at the number of medals won by each gender, as well as any trends or differences in performance over time?" as an example.
Having seen how useful the analysis was, I simply requested to see the code used, and ChatGPT obliged :)
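The code ChatGPT returned is not reproduced in this text. For illustration only, here is a minimal sketch of what such an analysis could look like: it counts medals per year and gender and derives the share won by women over time. The column names Gender, Year and Medal follow ChatGPT's description of the dataset; the values "Male" and "Female" and the sample rows are assumptions.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows, one per medal won. Column names follow ChatGPT's
# description of the dataset; the gender labels are an assumption.
SAMPLE_CSV = """Athlete,Gender,Year,Medal
A,Male,1900,Gold
B,Male,1900,Silver
C,Female,1900,Bronze
D,Male,2016,Gold
E,Female,2016,Gold
F,Female,2016,Silver
"""


def medals_by_year_and_gender(rows):
    """Count medal rows for each (year, gender) combination."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[int(row["Year"])][row["Gender"]] += 1
    return counts


def female_share_by_year(counts):
    """Fraction of each year's medals won by female athletes."""
    shares = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        shares[year] = counts[year].get("Female", 0) / total
    return shares


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = medals_by_year_and_gender(rows)
shares = female_share_by_year(counts)

for year, share in shares.items():
    print(year, dict(counts[year]), f"female share: {share:.0%}")
```

Running the same functions over the full dataset would give a year-by-year series that can be compared with the totals and trends ChatGPT reported.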
Operational steps could therefore be:
1) Create an obfuscated version of the dataset you wish to analyse.
2) Let ChatGPT loose on it to explore and test hypotheses.
3) Identify an analysis worth operationalising.
4) Obtain the code and use that within more formal data engineering workloads.
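Step 1 can be approached in several ways. One simple option, sketched below, is to replace identifying values with keyed hashes so that rows remain groupable while the original values are hidden. The choice of column (Athlete), the key and the token length are all hypothetical, and this kind of pseudonymisation is not a complete anonymisation technique on its own.

```python
import csv
import hashlib
import hmac
import io

# Hypothetical settings: the key must be kept outside the shared dataset,
# and the list of sensitive columns depends on your own data.
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-dataset"
SENSITIVE_COLUMNS = ["Athlete"]

SAMPLE_CSV = """Athlete,Country,Year,Medal
Jane Doe,Country A,2016,Gold
John Roe,Country B,2016,Silver
Jane Doe,Country A,2020,Bronze
"""


def pseudonymise(value, key=SECRET_KEY):
    """Map a value to a stable token using HMAC-SHA256 (same input, same token)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]  # shortened for readability


def obfuscate(rows, columns=SENSITIVE_COLUMNS):
    """Return copies of the rows with the sensitive columns replaced by tokens."""
    return [
        {k: pseudonymise(v) if k in columns else v for k, v in row.items()}
        for row in rows
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
obfuscated = obfuscate(rows)
for row in obfuscated:
    print(row)
```

Because the same input always maps to the same token, counts and groupings per athlete still work on the obfuscated copy, and the code obtained in step 4 can later be run unchanged against the real data.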
ChatGPT is impressive in terms of how easily and accurately it analysed the datasets I used. But it is still unclear how it would deal with complex datasets containing multiple entities in a relationship with one another. So, for now, a final flat data structure would have to be engineered upstream.
Plus, operationalising the logic will require a multi-step process, although, as stated, bringing these generative AI models into the corporate environment seems like a logical future step.
It must be emphasised that I have not yet delved into the security aspects of generative AI, including ChatGPT.
Next, I will delve into some additional functionality and lift the lid further on ChatGPT and generative AI.
In the next article, I discuss how to code with ChatGPT https://www.makingmeaning.info/post/code-with-chatgpt
Additional updates regarding Analyzing your data using ChatGPT
Update 26 May 2023 - This article was written to showcase how useful LLMs could be in the analysis of data, in these examples using publicly available datasets. It is important to note the recent announcements regarding generative AI built into data tools within the Microsoft stack, which is a game changer for leveraging generative AI within corporate data ecosystems: https://www.makingmeaning.info/post/what-is-microsoft-fabric-and-its-game-changing-ai-capabilities
Update 17 May 2023 - I retested the recent limitation in ChatGPT whereby it can no longer access datasets via public URL, such as https://www.kaggle.com/datasets/thedevastator/global-video-game-sales?select=vgsales.csv, and it still appears to be a limitation. However, I retested all the concepts above using Bard, which quite happily accessed the dataset via public URL, and the results were as impressive as those above from when ChatGPT did not have this limitation.
Update 16 March 2023 - Microsoft announced Azure OpenAI, which gives users access to GPT-3 and DALL-E 2 inside their tenancies. More on this soon in a dedicated article. Also see the subsequent article on analysing large volumes of data in Excel via M365 Copilot in Excel.
Update 16 March 2023 - it seems as if some functionality mentioned here, notably the ability to access data via public URL, has been disabled.