Code with ChatGPT
Updated: May 26
In a previous blog, How will ChatGPT change data analytics, I discussed the potential disruption that generative AI, such as ChatGPT, will bring to data analytics. In summary, these impacts are:
Generative AI models such as ChatGPT may move data analytics even further away from traditional reporting as users may increasingly rely on simply asking the model to interpret large and complex data and the patterns within.
Data engineers and data scientists will increasingly rely on generative AI to expedite the code and models required within data analytic workloads. It may also put these fields in the hands of less formally trained data engineers, scientists and statisticians.
Hypothesis and sandbox environments will be easier to spin up due to the ease with which synthetic data can be produced.
In my previous blog, I showed how to analyze data with ChatGPT. Here, I look at the second topic: how to code with ChatGPT.
I use two examples: (a) I show how ChatGPT can expedite laborious TSQL by having it develop a common table expression (CTE), and (b) I use it to teach me some Python basics by leading me through tasks such as joining different datasets, selecting only a few attributes and finally creating a result similar to the CTE example in SQL.
Update 26 May 2023 - This article was written to showcase how useful LLMs could be in coding activities. It is important to note the recent announcements regarding Generative AI built into data tools within the Microsoft stack, which is a game changer for leveraging Generative AI within corporate data ecosystems: https://www.makingmeaning.info/post/what-is-microsoft-fabric-and-its-game-changing-ai-capabilities
Update 16 March 2023 - Microsoft announced Azure OpenAI, which gives users access to GPT-3 and DALL·E 2 inside their own tenancies. More on this soon in a dedicated article.
Use ChatGPT to write my SQL code (TSQL).
Using the trusty old AdventureWorks database, I created a view of order transactions with 8,845 records as shown below.
As I showed in my previous article, using ChatGPT to analyze data is pretty straightforward, but those investigations showed me that there is a limit to the number of rows of data I can paste into my question (a workaround being to simply point ChatGPT directly at the dataset).
Of course, in this case I am not interested in analyzing the order transactions; I want to do some data engineering over the data, and I want ChatGPT to supply the code, so I still need to give it a frame of reference. Pasting the first 100 records into the question is perfectly fine, as I just want ChatGPT to understand the structure of the data. So, I first ask ChatGPT to familiarize itself with the dataset:
It gave me a result (note that I exclude the actual dataset analysis from the image below as it is irrelevant here; I am not looking for ChatGPT to analyze the data, I am interested in it coding for me). If you want to see how to use ChatGPT to analyze data, feel free to read my previous article here.
Once it is familiar with the data, I want ChatGPT to generate a common table expression (CTE) in TSQL to determine how many individual orders were received for each product per order year. How the question is asked matters, so it is important to understand the dataset. In this case I want to know how many times each product was ordered by order year. I know this calls for a CTE, that the number of times each product is bought likely relates to the PurchaseOrderDetailID, and that it needs to include the order year.
Here is my question:
Please code for me and generate a common table expression in TSQL that shows the average count of purchase order detail ID for each product name, product id and year of order date.
And here is the response plus the explanation:
And here is the explanation given by ChatGPT:
This query first creates a common table expression named CTE_ProductOrders that joins the Products table with the PurchaseOrderDetails table and calculates the count of PurchaseOrderDetailID for each combination of ProductID, ProductName, and year of OrderDate.
The main query then selects the ProductID, ProductName, OrderYear, and the average of OrderCount from the CTE_ProductOrders CTE, grouping the results by ProductID, ProductName, and OrderYear. The results are ordered by ProductID, ProductName, and OrderYear.
Now of course in this instance, I am not pointing ChatGPT to an actual dataset, but to the frame of reference I showed it in the first question. So I now have to replace some of the assumed logic in my actual common table expression. In this case only the from clause needs changing, plus removing table aliases. My final query is shown below, and as you can see it is over 90% similar to the code suggested by ChatGPT.
Here are the results:
A second step is to thoroughly test the results, and in this case, ChatGPT got it spot on. Here are the test results:
Secondly, let's look at a Python example.
Use ChatGPT to write my Python code.
Let's assume very little experience with Python and ask ChatGPT how to join three datasets. In this scenario I want to create a data frame equivalent to the dbo.vw_Orders view used in the preceding SQL example, which is a join between the Product, PurchaseOrderDetail and PurchaseOrderHeader datasets.
Here is my question:
Be my data scientist. I have three data frames in Python. Product, PurchaseOrderDetail and PurchaseOrderHeader. How do I join them if Product and PurchaseOrderDetail joins on ProductID and PurchaseOrderHeader and PurchaseOrderDetail joins on PurchaseOrderID?
And here is the response:
Of course, I now have to replace some of the items with my actual data frame names, but I largely kept the code recommended by ChatGPT intact as is shown here:
This returned the expected 8,845 rows and 47 columns.
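The chained merge pattern ChatGPT recommended can be sketched as follows. This is an illustrative reconstruction with toy data and assumed column names, not the exact code from the session:

```python
import pandas as pd

# Toy stand-ins for the three AdventureWorks tables (illustrative columns only)
product = pd.DataFrame({"ProductID": [1, 2],
                        "Name": ["Chain", "Decal"]})
purchase_order_detail = pd.DataFrame({"PurchaseOrderID": [10, 10, 11],
                                      "PurchaseOrderDetailID": [100, 101, 102],
                                      "ProductID": [1, 2, 1]})
purchase_order_header = pd.DataFrame({"PurchaseOrderID": [10, 11],
                                      "OrderDate": pd.to_datetime(["2013-05-01",
                                                                   "2014-06-01"])})

# Chain two inner joins, mirroring the keys from the question:
# Product <-> PurchaseOrderDetail on ProductID,
# PurchaseOrderDetail <-> PurchaseOrderHeader on PurchaseOrderID
orders = (purchase_order_detail
          .merge(product, on="ProductID")
          .merge(purchase_order_header, on="PurchaseOrderID"))
```

Note that pandas merge defaults to an inner join, which matches the behaviour of the SQL view's joins.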
I next asked ChatGPT how to select only the columns I need.
Be my data scientist. How do I select only certain columns from a data frame using Python?
And here is the response:
Again (and of course), I had to replace some items with my actual data frame and column names, but I largely kept the code recommended by ChatGPT intact, as shown here:
Results, spot on!!
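For readers following along, the pattern ChatGPT describes, passing a list of column names into the indexing brackets, looks roughly like this. The data and column names here are my own toy examples:

```python
import pandas as pd

# A toy data frame with more columns than we need
orders = pd.DataFrame({"ProductID": [1, 2],
                       "Name": ["Chain", "Decal"],
                       "ListPrice": [12.50, 8.99],
                       "Color": ["Silver", None]})

# Pass a list of column names to keep only those columns
subset = orders[["ProductID", "Name", "ListPrice"]]
```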
Now lastly, remember this scenario assumes very little experience with Python, so I wondered if there was a common table expression equivalent in Python.
Is there a Python equivalent to a common table expression?
And here is the response:
And here is my very simplified version of that, with my Python CTE results:
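The two-step shape of the SQL CTE, an inner query that counts order lines per product per order year and an outer query that averages those counts, maps naturally onto chained pandas groupby operations. A minimal sketch with made-up data (column names assumed to match the view):

```python
import pandas as pd

# Made-up order lines standing in for the real view
orders = pd.DataFrame({
    "ProductID": [1, 1, 2],
    "Name": ["Chain", "Chain", "Decal"],
    "PurchaseOrderDetailID": [100, 101, 102],
    "OrderDate": pd.to_datetime(["2013-05-01", "2013-07-01", "2014-06-01"]),
})

# Step 1 (the "CTE"): count order lines per product per order year
cte = (orders
       .assign(OrderYear=orders["OrderDate"].dt.year)
       .groupby(["ProductID", "Name", "OrderYear"], as_index=False)
       .agg(OrderCount=("PurchaseOrderDetailID", "count")))

# Step 2 (the outer query): average the counts per group
result = (cte
          .groupby(["ProductID", "Name", "OrderYear"], as_index=False)
          .agg(AvgOrderCount=("OrderCount", "mean")))
```

Because both steps group on the same keys, just as the original SQL does, the average in step 2 simply echoes each group's count, faithfully mirroring the TSQL query.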
This looks okay, but I need to test it. In the previous section, Use ChatGPT to write my SQL code, I used the same data as I did here to create the previous CTE in TSQL. The results should therefore be the same. But the results shown above are clearly just a small subset. So again, I ask ChatGPT:
Be my data scientist. Show me how to output a data frame to a file.
Here is the response:
Applying it to my scenario:
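In essence it comes down to a one-liner with to_csv. The data frame and file name below are my own illustrative stand-ins, not the actual output from the session:

```python
import pandas as pd

# A stand-in for the Python CTE result computed earlier
result = pd.DataFrame({"ProductID": [1, 2],
                       "AvgOrderCount": [2.0, 1.0]})

# index=False keeps the pandas row index out of the file,
# so the CSV can be compared column-for-column with the SQL output
result.to_csv("python_cte_results.csv", index=False)
```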
And I am happy to report that the results of the Python-based CTE match the preceding TSQL version 100%.
That is pretty good!!
I am impressed. Even though I have been using SQL for years, I know how some new data engineers struggle with more complex constructs such as CTEs. So, to be guided by ChatGPT, and to see how quickly it got to an accurate result, was impressive.
BUT even more so, I am a complete novice with Python. To have been able to rely on ChatGPT to arrive at a result set matching the one generated and tested in SQL shows how simple it really is.
It must be emphasised that ChatGPT should be used as an accelerator to code, which must then still be debugged, and results must always be tested.
Some responses to a previous article, Analyze data using ChatGPT, questioned whether ChatGPT will ever be able to replace skilled data engineering or data science. I do not think it will. When logarithmic and trigonometric tables were replaced with scientific calculators at school, it sped up trigonometry, and students could focus on higher-order thinking skills once they mastered the basics, leaving the calculator to handle the mundane tasks. I feel the same applies here: truly understanding the code behind a CTE is not really the value add, applying it correctly is, leaving the data engineer and data scientist free for higher-order thinking.