Scaling Statistics: Incremental Standard Deviation in SQL with dbt

January 1, 2025

Yuval Gorchover
Software Engineer

*Published on Medium by Yuval Gorchover on 01/01/2025*

Introduction

SQL aggregation functions can be computationally expensive when applied to large datasets. As datasets grow, recalculating metrics over the entire dataset repeatedly becomes inefficient. To address this challenge, incremental aggregation is often employed — a method that involves maintaining a previous state and updating it with new incoming data. While this approach is straightforward for aggregations like COUNT or SUM, the question arises: how can it be applied to more complex metrics like standard deviation?

Standard deviation is a statistical metric that measures the extent of variation or dispersion in a variable’s values relative to its mean.
It is derived by taking the square root of the variance.
The formula for calculating the variance of a sample is as follows:

Sample variance formula
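For reference, the textbook sample variance of n values can be written as below. The notation is an assumption chosen to match the symbols used later in this article (S for the variance, µ for the mean); the original figure may write it as S² instead.

$$
S_n = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \mu_n\right)^2,
\qquad
\mu_n = \frac{1}{n}\sum_{i=1}^{n} x_i
$$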

Calculating standard deviation incrementally is less obvious, because it involves updating both the mean and the sum of squared differences across all data points. However, with some algebraic manipulation, we can derive a formula for incremental computation that combines the statistics of an existing dataset with those of newly arrived data. This avoids recalculating from scratch whenever new data is added, making the process much more efficient. (A detailed derivation is available on my GitHub.)

Derived sample variance formula

The formula breaks down into three parts:
1. The existing set's weighted variance
2. The new set’s weighted variance
3. The mean difference variance, accounting for between-group variance.

This method enables incremental variance computation by retaining the COUNT (k), AVG (µk), and VAR (Sk) of the existing set, and combining them with the COUNT (n), AVG (µn), and VAR (Sn) of the new set. As a result, the updated standard deviation can be calculated efficiently without rescanning the entire dataset.
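Written out with the symbols above, the three parts can be combined as follows. This is the standard pairwise form for merging two sample variances, shown here as a sketch; the author's own figure and derivation may arrange the terms differently.

$$
\mu_{n+k} = \frac{k\,\mu_k + n\,\mu_n}{k+n},
\qquad
S_{n+k} = \frac{(k-1)\,S_k + (n-1)\,S_n + \dfrac{k\,n}{k+n}\left(\mu_k - \mu_n\right)^2}{n+k-1}
$$

As a quick check, take an existing set {2, 4} (k = 2, µk = 3, Sk = 2) and a new set {6, 8, 10} (n = 3, µn = 8, Sn = 4):

$$
S_{n+k} = \frac{(1)(2) + (2)(4) + \frac{2\cdot 3}{5}\,(3-8)^2}{3+2-1} = \frac{2 + 8 + 30}{4} = 10
$$

This matches the sample variance of {2, 4, 6, 8, 10} computed directly: the mean is 6 and the squared deviations sum to 40, so the variance is 40 / 4 = 10.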

Now that we’ve wrapped our heads around the math behind incremental standard deviation (or at least caught the gist of it), let’s dive into the dbt SQL implementation. In the following example, we’ll walk through how to set up an incremental model to calculate and update these statistics for a user’s transaction data.

Consider a transactions table named stg__transactions, which tracks user transactions (events). Our goal is to create a time-static table, int__user_tx_state, that aggregates the ‘state’ of user transactions. The column details for both tables are provided in the picture below.

To make the process efficient, we aim to update the state table incrementally by combining the new incoming transactions data with the existing aggregated data (i.e. the current user state). This approach allows us to calculate the updated user state without scanning through all historical data.

Now Let's Dive Into Code

The code below assumes some familiarity with dbt concepts. If you're unfamiliar with dbt, you may still be able to follow the code, although I strongly encourage going through dbt's incremental guide or reading this awesome post.

We'll construct a full dbt SQL model step by step, aiming to calculate incremental aggregations efficiently without repeatedly scanning the entire table. The process begins by defining the model as incremental in dbt and using unique_key to update existing rows rather than inserting new ones.
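A minimal sketch of such a config block might look like this. The key column name `user_id` is an assumption, since the table's column listing is not reproduced here; `materialized` and `unique_key` are standard dbt config options.

```sql
-- Top of models/int__user_tx_state.sql (the file path is illustrative).
-- unique_key tells dbt to update the existing row for a user
-- instead of appending a second one.
{{
    config(
        materialized='incremental',
        unique_key='user_id'  -- hypothetical column name
    )
}}
```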


Next, we fetch records from the stg__transactions table. The is_incremental block filters transactions with timestamps later than the latest user update, effectively including only new transactions.
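The filtering step could be sketched roughly as follows. The CTE name `new_user_tx_data` and the columns `user_id`, `tx_timestamp`, `tx_value` and `updated_at` are hypothetical stand-ins; `ref`, `is_incremental()` and `this` are standard dbt constructs.

```sql
with new_user_tx_data as (
    select
        user_id,       -- hypothetical column names throughout
        tx_timestamp,
        tx_value
    from {{ ref('stg__transactions') }}
    {% if is_incremental() %}
    -- On incremental runs, keep only rows newer than the latest stored update.
    -- {{ this }} resolves to the already-built int__user_tx_state table.
    where tx_timestamp > (select max(updated_at) from {{ this }})
    {% endif %}
)

select * from new_user_tx_data
```

Note that this sketch compares against a single global maximum, so late-arriving transactions with older timestamps would be skipped; a per-user cutoff is a possible alternative.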


After retrieving the new transaction records, we aggregate them by user, allowing us to incrementally update each user’s state in the following CTEs.
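As a sketch, the per-user aggregation might look like the query below. It assumes the new transactions are available in a CTE named `new_user_tx_data` and that `tx_value` is numeric and non-null (all hypothetical names). `var_samp` is available in most warehouses, such as Postgres, Snowflake and BigQuery, but it is worth checking the function name on yours.

```sql
-- Body of the INCREMENTAL_USER_TX_DATA CTE, shown as a plain select.
select
    user_id,
    max(tx_timestamp)               as updated_at,
    count(tx_value)                 as incremental_count,  -- n
    sum(tx_value)                   as incremental_sum,
    avg(tx_value)                   as incremental_avg,    -- mean of the new set
    -- var_samp returns NULL for a single row, so treat that case as 0.
    coalesce(var_samp(tx_value), 0) as incremental_var     -- variance of the new set
from new_user_tx_data  -- hypothetical CTE holding only the new transactions
group by user_id
```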


Now we get to the heavy part where we need to actually calculate the aggregations. When we’re not in incremental mode (i.e. we don’t have any “state” rows yet) we simply select the new aggregations.


But when we’re in incremental mode, we need to join past data and combine it with the new data we created in the INCREMENTAL_USER_TX_DATA CTE based on the formula described above. We start by calculating the new SUM, COUNT and AVG:


We then calculate the variance formula’s three parts:

1. The existing weighted variance, which is truncated to 0 if the previous set is composed of one or fewer items:


2. The incremental weighted variance in the same way:


3. The mean difference variance, as outlined earlier, along with SQL join terms to include past data.
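The three parts can be tried out in isolation with a plain SQL query over literal values. This is a sketch with hypothetical column names, not the author's original snippet: the `stats` CTE stands in for the join between the stored state (existing set {2, 4}) and the new aggregates (new set {6, 8, 10}). It keeps the three terms unnormalized and divides once at the end.

```sql
with stats as (
    -- Existing set {2, 4}: count 2, mean 3, variance 2.
    -- New set {6, 8, 10}: count 3, mean 8, variance 4.
    select
        2 as existing_count,    3.0 as existing_avg,    2.0 as existing_var,
        3 as incremental_count, 8.0 as incremental_avg, 4.0 as incremental_var
),

parts as (
    select
        existing_count + incremental_count as new_count,
        -- 1. Existing weighted variance, truncated to 0 for one or fewer items.
        case when existing_count > 1
             then (existing_count - 1) * existing_var else 0 end as existing_weighted_var,
        -- 2. Incremental weighted variance, handled the same way.
        case when incremental_count > 1
             then (incremental_count - 1) * incremental_var else 0 end as incremental_weighted_var,
        -- 3. Mean difference term; * 1.0 avoids integer division.
        (existing_count * incremental_count * 1.0 / (existing_count + incremental_count))
            * (existing_avg - incremental_avg)
            * (existing_avg - incremental_avg) as mean_diff_var
    from stats
)

select
    existing_weighted_var,
    incremental_weighted_var,
    mean_diff_var,
    (existing_weighted_var + incremental_weighted_var + mean_diff_var)
        / (new_count - 1) as new_var
from parts
```

With these inputs the three terms work out to 2, 8 and 30, giving a combined variance of 40 / 4 = 10, which is the sample variance of {2, 4, 6, 8, 10}. Because the result is a sum of non-negative terms, it cannot go negative through cancellation.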


Finally, we select the table’s columns, accounting for both incremental and non-incremental cases:


By combining all these steps, we arrive at the final SQL model:
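The original listing is not reproduced here, so what follows is a reconstruction sketch rather than the author's exact model. All column names (`user_id`, `tx_timestamp`, `tx_value`, `updated_at`, `tx_count`, `tx_sum`, `tx_avg`, `tx_var`, `tx_std`) and every CTE name other than INCREMENTAL_USER_TX_DATA are assumptions, and `tx_value` is assumed to be numeric and non-null.

```sql
{{ config(materialized='incremental', unique_key='user_id') }}

with new_user_tx_data as (
    select user_id, tx_timestamp, tx_value
    from {{ ref('stg__transactions') }}
    {% if is_incremental() %}
    -- Only transactions newer than the latest stored update.
    where tx_timestamp > (select max(updated_at) from {{ this }})
    {% endif %}
),

incremental_user_tx_data as (
    select
        user_id,
        max(tx_timestamp)               as updated_at,
        count(tx_value)                 as incremental_count,
        sum(tx_value)                   as incremental_sum,
        avg(tx_value)                   as incremental_avg,
        coalesce(var_samp(tx_value), 0) as incremental_var
    from new_user_tx_data
    group by user_id
),

{% if is_incremental() %}
joined as (
    -- Left join so that users without a stored state are kept (count 0).
    select
        inc.*,
        coalesce(st.tx_count, 0)                 as existing_count,
        coalesce(st.tx_sum, 0)                   as existing_sum,
        coalesce(st.tx_avg, inc.incremental_avg) as existing_avg,
        coalesce(st.tx_var, 0)                   as existing_var
    from incremental_user_tx_data as inc
    left join {{ this }} as st
        on st.user_id = inc.user_id
),

variance_parts as (
    select
        user_id,
        updated_at,
        existing_count + incremental_count as new_count,
        existing_sum + incremental_sum     as new_sum,
        case when existing_count > 1
             then (existing_count - 1) * existing_var else 0 end as existing_weighted_var,
        case when incremental_count > 1
             then (incremental_count - 1) * incremental_var else 0 end as incremental_weighted_var,
        (existing_count * incremental_count * 1.0 / (existing_count + incremental_count))
            * (existing_avg - incremental_avg)
            * (existing_avg - incremental_avg) as mean_diff_var
    from joined
),
{% endif %}

final as (
    select
        user_id,
        updated_at,
    {% if is_incremental() %}
        new_count                 as tx_count,
        new_sum                   as tx_sum,
        new_sum * 1.0 / new_count as tx_avg,
        -- A single observation has no sample variance; store 0 in that case.
        coalesce(
            (existing_weighted_var + incremental_weighted_var + mean_diff_var)
                / nullif(new_count - 1, 0),
            0
        ) as tx_var
    from variance_parts
    {% else %}
        incremental_count as tx_count,
        incremental_sum   as tx_sum,
        incremental_avg   as tx_avg,
        incremental_var   as tx_var
    from incremental_user_tx_data
    {% endif %}
)

select
    user_id, updated_at, tx_count, tx_sum, tx_avg, tx_var,
    sqrt(tx_var) as tx_std
from final
```

On the first run (or with `--full-refresh`) only the non-incremental branch is compiled, so the state is built directly from the aggregates. On later runs dbt uses `unique_key` to replace each affected user's row with the merged statistics.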

Conclusion

Throughout this process, we demonstrated how to handle both non-incremental and incremental modes effectively, leveraging mathematical techniques to update metrics like variance and standard deviation efficiently. By combining historical and new data seamlessly, we achieved an optimized, scalable approach for real-time data aggregation.

In this article, we explored the mathematical technique for incrementally calculating standard deviation and how to implement it using dbt’s incremental models. The approach lets us process large datasets without re-scanning all historical data on every run, since each update touches only the new transactions and one state row per affected user. If you’d like to discuss this further, feel free to reach out. I’d love to hear your thoughts!
