When dealing with tabular data, you often will have that data laid out in a format that best suits day-to-day operations. For example, rows that represent purchase transactions for different customers.
A pivot table allows you to rotate rows into columns, creating a more organized and digestible format for large datasets. You can also aggregate data, such as calculating sums, averages, or counts, and display these results as unique columns for each summarized category.
Let's take a look at a visual that will help describe it. Here, we're grouping by the user value, then taking the sum of the count of each transaction and creating a unique column for each product.
Most of us will approach this by writing SQL with case statements for each category. While this works for simple cases, if you have a more dynamic set of values that you need to include in the case statement, maintaining this can be a nightmare, leading to troubles with data analysis where you have incomplete data.
SELECT
User,
SUM(CASE WHEN Product = 'Apple' THEN Count ELSE 0 END) AS Apple,
SUM(CASE WHEN Product = 'Orange' THEN Count ELSE 0 END) AS Orange,
SUM(CASE WHEN Product = 'Cherry' THEN Count ELSE 0 END) AS Cherry
FROM
your_table_name
GROUP BY
User;
I've tried solving this problem using raw SQL and creating a stored procedure that handles this. However, I could never solve this until I ran across dbt and leveraged some of its more advanced capabilities.
dbt is an incredibly popular tool in the data community today. If you search the internet for terms like Modern Data Stack in almost every image or article, you'll see a reference to dbt. It's everywhere; you must learn it if you're a data practitioner.
dbt stands for Data Build Tool, and according to their website, they describe it as:
dbt™ is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation. Anyone on the data team can safely contribute to production-grade data pipelines.
By leveraging dbt, you can create dynamic pivots! Let's take a look.
As you become more proficient in DBT, you will probably run across the concepts of Jina and Macros. Utilizing Jinja is a templating language used throughout the Python ecosystem. dbt supports using Jinja to create scripts that, when compiled, output standard SQL code for your database's or warehouse's interpretation. You might recognize it from its syntax utilizing curly braces { }. Jinja allows us to provide logic-based interaction with our code creation to build these dynamic queries.
If we listed off the steps to build that query above dynamically, it might look like this:
Let's see how we do that in dbt.
First, let's get the list of all products - we will wrap this as a Jinja function, which won't end up being in the final SQL code but will be executed when dbt compiles the SQL file.
That's a very simple select statement to get all distinct product names. However, you can put any SQL select logic in that query you need. Note that the function's name is in the first line after the keyword set.
Next, we need to run that query and convert the results into a list we can loop or iterate over later. The first line will execute the query we just defined, and the following code will convert it into a list of values. The else statement is needed for this to run successfully, where you need to generate an empty list if, for some reason, the query returns no results.
We must build our logic to loop over the list and dynamically create our CASE statement. Notice that we have a normal query that starts with the first two lines, and then we see additional Jinja syntax inserted that will create our dynamic SQL. There are two additional tricks in here:
When you compile all of the above code, you get the following perfect SQL syntax
select
user,
count(case when product = 'Apple' then 1 end) as apple,
count(case when product = 'Orange' then 1 end) as orange
from pivot_example
group by user
The output is a perfectly formed pivot of the original data.
As the article suggests, we want to make these dynamic. Let's try this by inserting two new rows with two new products. Notice that the second product has invalid spaces in its name, which cannot be used as a column name. The replace commands will take care of that for us.
INSERT INTO pivot_example (user, product, count) VALUES
('C', 'Cherry', 1),
('C', 'Lemon and Lime', 1);
Next, we need to build the table again in dbt, and the Pivot code will be updated in the data warehouse. Notice we have our two new products automatically added!
As mentioned above, we needed to rebuild the pivot view in dbt for the new values to be recognized. This can be achieved with simple automation in dbt, known as a Job. You can create one that runs nightly or whatever interval makes sense for you, and it will run just the new pivoted view build command. It will be a fast and efficient job ensuring your views always reflect the latest values.
If you want to see more complex example code, please look at this Gist on GitHub. You can also see the fully compiled version of the query in this Gist. Finally, the simple example used in this blog can also be found on GitHub.
In conclusion, mastering pivot tables with dbt simplifies transforming and organizing large datasets and enhances the flexibility and maintainability of your data workflows. By utilizing Jinja and macros in dbt, you can create dynamic, error-free SQL queries that adapt to changes in your data, ensuring your analytics remain accurate and up-to-date. This approach saves time and effort and empowers data practitioners to focus more on deriving insights rather than wrestling with complex SQL scripts.