Power BI Exact Distinct Count and Aggregations

It’s not news that Power BI supports aggregations. What might be news is that these aggregations support distinct count measures. Today.

The aggregation dialog has “Summarization” method. If you select “GroupBy” as the summarization method for a column, not only are all unique values of the column retained, but Power BI is clever enough to use the those column values for a distinct count measure on that column.

Consider a clickstream analysis model with a fact table containing the following columns:

  • DateId
  • EventId
  • MarketId
  • RegionId
  • UserId

There are four dimensions, Event, Date, Region, and Market and the goal is to get the distinct count of users. Assume that the table is very large – so we’re going to go with a directquery approach. The fact and dimension tables are modeled as below:

where our distinct count measure is simply:

Creating a simple grid visual, I can inspect the distinct count of users by event:

I can inspect the visualization’s query to my sql datasource from Power BI’s Performance Analyzer:

As expected, the distinct count is querying the fact table with a join to the dimension table (an inner join because the “Assume Referential Integrity” setting was set on the relationship between the two tables in my model).

But say I’m unhappy with the visualization’s performance. If that’s the case I can create an aggregation. To do so, I first create an aggregation table in my data source:

Note with a group by on the UserId column – this retains each UserId in the aggregate table that can then be used to calculate the distinct count.

A word of caution as there is a nuance here that might prevent you from seeing the performance gains you might expect. Compare the number of rows in the aggregation table to the fact table. That group by on UserID might be generating a very large table and if the row count of this new agg table and the original fact table are similar, you won’t see much in the way of a performance gain.

I added this new table to the Power BI model and created the relationship between it and the dim_Event dimension.

This allowed me to define this table as an aggregation (right click on the agg_UserEvent table and select “Manage Aggregations”:

Note the third row in the dialog where the UserId is stored as a GroupBy. This is the magic that enables Power BI to use the aggregation when the DistinctCount measure is evaluated (assuming that the dimensionality of the query matches the aggregation table).

(There was one thing that confused me at first. Why isn’t EventId mapped in this dialog? It’s because that relationship is defined in the model and doesn’t need to be redefined here. You can if you want to, but it’s redundant. Confusing, but makes sense.)

After refreshing the same visualization shown earlier, the query now hits the agg table instead of the fact table (again, the sql was taken from the performance analyzer):

Great!

Should you do this to improve the performance of your distinct count measures that use directquery? Well, it depends. As mentioned above, the size of the aggregation table might be pretty large depending on the cardinality of your distinct count column. That and the fact that Power BI still has to use the distinct keyword in the query when it hits the agg table (go back and take a look – it’s there!) might limit the utility of the aggregation.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: