Distinct count is a common use case for analytics – show me someone with a web site or service and I’ll show you someone that wants to know how many unique customers they have.

The problem with distinct count is that it’s expensive with respect to both CPU and memory. When you have datasets in the billions of rows, an exact distinct count simply costs too much to compute, and visualizations built on these expressions just don’t perform the way you want them to.

I have four billion rows in my fact table, covering over a billion distinct users. I started with a simple card visualization showing this value – it took over 40 minutes to return. No one is going to wait that long:

So enter approximate distinct count. Companies like Google, Facebook, and Twitter use algorithms that produce a (remarkably accurate!) estimate of distinct count. If the companies that practically invented big data are using approximations for distinct count on big data, the same approach should be good enough for many of our use cases as well.

Neither Power BI nor Analysis Services *really* supports approximate distinct count, so this will require a little bit of assembly on your part. But the results should be worth your while.

Aside: some of you might point out that DAX does have the APPROXIMATEDISTINCTCOUNT function, which works with DirectQuery sources. True! But I’m not a fan of this function just yet – it only works with DirectQuery, so it supports neither import models nor import aggregations on DirectQuery sources.

A few words on the approach before we get into the implementation. We will be using two algorithms to estimate distinct count. The first, Linear Counting, is ideal for estimating smaller distinct counts but doesn’t scale to larger datasets. The second, HyperLogLog, is ideal for larger datasets but is quite inaccurate on smaller ones. But together, they’re magic.

Let’s dive in.

**Linear Counting** – Consider the situation where you start with a list of values whose distinct count you want and a hash table with m empty rows:

**Hash functions and tables** are not commonly used by most Power BI users, but the concept is simple enough. A hash function converts a string or value to a binary value of a fixed length. We need this for a couple of reasons. First, we can then deal with arbitrary data types the same way – no special handling of integers, strings, numerics, etc. Second, hash functions have the special characteristic that their results are spread randomly over the destination range, so when the hash values are assigned to rows in the hash table, they are uniformly distributed across the rows of that table.
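To make that concrete, here is a tiny Python sketch (using MD5 as the hash, since that’s the hash the SQL later in the post uses; the bucket experiment is purely illustrative): every input maps to a digest of the same fixed length, and the digests spread evenly over the output range.

```python
import hashlib
from collections import Counter

def md5_hex(value) -> str:
    """Hash any value (via its string form) to a fixed-length 128-bit hex digest."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()

# Different data types get identical treatment -- each digest is 32 hex characters.
for v in [42, "alice@example.com", 3.14159]:
    print(md5_hex(v))

# And the digests are uniformly distributed: bucket 10,000 integers into 16
# buckets by their first hex character -- every bucket lands near 10,000/16 = 625.
counts = Counter(md5_hex(i)[0] for i in range(10_000))
print(sorted(counts.values()))
```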

Use a hash function to map each value to a row in the hash table and set the value to 1 for that row:

Do this for every value:

And when done, the expression for the approximate distinct count of values is:

**DC _{estimated}= -m ln( U / m )**

where m is the number of rows in our hash table and U is the number of rows still empty.
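A minimal Python sketch of Linear Counting (the function name and the table size m = 2^14 are my choices for illustration, not anything prescribed by the algorithm):

```python
import hashlib
import math

def linear_count(values, m: int = 2 ** 14) -> float:
    """Estimate the number of distinct values using a hash table of m rows."""
    table = [0] * m
    for v in values:
        # Hash each value and map it to one of the m rows.
        row = int(hashlib.md5(str(v).encode()).hexdigest(), 16) % m
        table[row] = 1
    u = table.count(0)            # U: rows still empty after the pass
    return -m * math.log(u / m)   # DC_estimated = -m * ln(U / m)

# 20,000 rows but only 5,000 distinct users:
values = [f"user{i % 5_000}" for i in range(20_000)]
print(round(linear_count(values)))   # lands very close to 5,000
```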

Intuitively, this approach can be relied on until there are almost no empty rows left in our hash table. At that point, either we need to increase the size of our hash table or we need another way to estimate large values. Enter HyperLogLog…

**HyperLogLog**

Before I begin, please check out Philip Seamark’s blog post here. It was this writeup that got me thinking that this approach could be applied to aggregations. The paper behind Philip’s blog is here.

The concept behind HyperLogLog (HLL) is simple. If a hash value is computed for each value of a set and every hash value has the same number of bits, then the likelihood of finding a run of n consecutive zeros starting from a predefined point (the beginning, the end, or anywhere in the middle) is 1 in 2^n:

This is intuitive – the more hashed values you look at, the higher the likelihood you’ll find one with a longer sequence of zeros.

But this has enormous potential error! Just one value that hashes to a long run of zeros could completely throw the result off. So something else is added to the algorithm – the sample is subdivided into many buckets, and we take the maximum number of zeros within each bucket. (Hmm, you’re saying to yourself, buckets? Both the LC and HLL approaches use buckets. Perhaps there is a way to unify the two. But we’re getting ahead of ourselves.)

Getting buckets is easy. We simply draw a line through our hash value, count the number of zeros to the right, and compute the bucket from the bits to the left (the graphic below is shamelessly stolen from Philip’s blog).
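In Python, with MD5 again, drawing that line looks like this (the choice of 14 bucket bits, giving 2^14 buckets, is mine; counting zeros from the most significant remaining bit is just one of the “predefined points” mentioned above):

```python
import hashlib

BUCKET_BITS = 14   # 2^14 = 16,384 buckets

def bucket_and_zeros(value):
    """Split a 128-bit MD5 hash into a bucket index and a run-of-zeros count."""
    h = int(hashlib.md5(str(value).encode()).hexdigest(), 16)
    bucket = h >> (128 - BUCKET_BITS)             # bits left of the line
    rest = h & ((1 << (128 - BUCKET_BITS)) - 1)   # bits right of the line
    zeros, bit = 0, 128 - BUCKET_BITS - 1
    while bit >= 0 and not (rest >> bit) & 1:     # count zeros until the first 1
        zeros += 1
        bit -= 1
    return bucket, zeros

print(bucket_and_zeros("user42"))   # (bucket in 0..16383, usually a small zero count)
```

About half of all values have a zero run of length zero, a quarter have length one, and so on – exactly the 1-in-2^n behavior described above.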

Once we have computed the bucket and maximum number of zeros for each fact table row, the HLL estimate for distinct values is:
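Restating the source paper’s estimate (Flajolet et al.) in the notation used here – with m buckets and M_j the maximum zero count observed in bucket j – and treating my notation mapping as a sketch rather than the paper’s exact form:

**DC _{estimated}= α_m m^2 / Σ_{j=1..m} 2^{-(M_j+1)}**

where α_m ≈ 0.7213 / (1 + 1.079/m) is a bias-correction constant, and a bucket that never received a value contributes 2^0 = 1 to the sum.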

**HyperLogLog and Linear Counting Together**

The source paper states that when the estimated distinct count is less than 5/2 times the number of buckets, the LC approach should be used; otherwise, HLL. After my own testing, I found greater accuracy using LC whenever the ratio of empty rows to the total rows in our hash table was greater than 5%, and HLL otherwise. In other words:
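Putting the two estimators together in a Python sketch (the names, the 2^14 bucket count, and the HLL bias constant are my choices; the 5% switch is the threshold described above):

```python
import hashlib
import math

BUCKET_BITS = 14
M = 2 ** BUCKET_BITS   # 16,384 buckets / hash-table rows

def approx_distinct_count(values) -> float:
    """Linear Counting while the table is mostly empty, HyperLogLog otherwise."""
    max_zeros = {}                                    # bucket -> longest zero run
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16)
        bucket = h >> (128 - BUCKET_BITS)             # left bits pick the bucket
        rest = h & ((1 << (128 - BUCKET_BITS)) - 1)   # right bits hold the zero run
        zeros, bit = 0, 128 - BUCKET_BITS - 1
        while bit >= 0 and not (rest >> bit) & 1:
            zeros += 1
            bit -= 1
        max_zeros[bucket] = max(zeros, max_zeros.get(bucket, 0))
    empty = M - len(max_zeros)                        # U: untouched buckets
    if empty / M > 0.05:                              # sparse table -> Linear Counting
        return -M * math.log(empty / M)
    alpha = 0.7213 / (1 + 1.079 / M)                  # HLL bias correction
    total = sum(2.0 ** -(z + 1) for z in max_zeros.values()) + empty
    return alpha * M * M / total                      # HyperLogLog estimate

for n in (1_000, 200_000):
    print(n, round(approx_distinct_count(f"user{i}" for i in range(n))))
```

The small count takes the Linear Counting branch; the larger one switches to HLL, and both land within a few percent of the true value.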

**Application to Power BI**

So how can I apply this approach to Power BI? It’s straightforward. The first task is to add the bucket (or row number) and the count of zeros from the hash of the UserId. The SQL to do this is (taken from P. Seamark’s blog):

SELECT
    CONVERT(int,CONVERT(binary(2),'0x' + SUBSTRING(md5,15,4),1)) % POWER(2,14) AS HLL_HashBucket,
    CHARINDEX('1',
          CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,19,1),1)) / 8) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,19,1),1)) / 4) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,19,1),1)) / 2) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,19,1),1)) / 1) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,20,1),1)) / 8) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,20,1),1)) / 4) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,20,1),1)) / 2) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,20,1),1)) / 1) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,21,1),1)) / 8) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,21,1),1)) / 4) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,21,1),1)) / 2) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,21,1),1)) / 1) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,22,1),1)) / 8) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,22,1),1)) / 4) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,22,1),1)) / 2) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,22,1),1)) / 1) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,23,1),1)) / 8) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,23,1),1)) / 4) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,23,1),1)) / 2) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,23,1),1)) / 1) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,24,1),1)) / 8) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,24,1),1)) / 4) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,24,1),1)) / 2) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,24,1),1)) / 1) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,25,1),1)) / 8) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,25,1),1)) / 4) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,25,1),1)) / 2) % 2 AS CHAR(1))
        + CAST((CONVERT(int,CONVERT(binary(1),'0x0' + SUBSTRING(md5,25,1),1)) / 1) % 2 AS CHAR(1))
        + '') - 1 AS HLL_FirstZero
FROM (
    SELECT *
        , CONVERT(VARCHAR(50),HASHBYTES('MD5',cast(UserID as VarChar)),1) as MD5
    FROM [fact]
) AS A

Add the following DAX expression to the Power BI model:

That’s it! Check out the approximate results compared to exact results over a wide range of distinct count results:

Isn’t that amazing? Performance is pretty good too – that forty-minute query is now down to 60 seconds! A 40x improvement:

And finally – how does this work with aggregations? If you go back to the expressions used for the distinct count, the HyperLogLog approach requires only the maximum number of zeros by bucket, and the Linear Counting approach simply requires the number of empty buckets. In other words: expressions that work with aggregations!
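A quick Python sketch makes the point (standing in for the SQL GROUP BY; the toy sizes are mine): the estimate can be computed from the aggregation table alone, one row per bucket holding the MAX of the zero run, with no need to revisit the raw fact rows.

```python
import hashlib
import math

BUCKET_BITS = 14
M = 2 ** BUCKET_BITS                  # 16,384 buckets

def bucket_and_zeros(value):
    """The two columns the SQL adds to each fact row: bucket and zero-run length."""
    h = int(hashlib.md5(str(value).encode()).hexdigest(), 16)
    bucket = h >> (128 - BUCKET_BITS)
    rest = h & ((1 << (128 - BUCKET_BITS)) - 1)
    zeros, bit = 0, 128 - BUCKET_BITS - 1
    while bit >= 0 and not (rest >> bit) & 1:
        zeros += 1
        bit -= 1
    return bucket, zeros

# A toy fact table: 50,000 rows over 10,000 distinct users.
rows = [bucket_and_zeros(f"user{i % 10_000}") for i in range(50_000)]

# The aggregation table: GROUP BY bucket, MAX(zero run). At most M rows,
# no matter how many billions of fact rows feed it.
agg = {}
for bucket, zeros in rows:
    agg[bucket] = max(zeros, agg.get(bucket, 0))

# Both estimators need only this table: U = empty buckets, M_j = agg[j].
empty = M - len(agg)
estimate = -M * math.log(empty / M)   # table is still sparse, so Linear Counting
print(round(estimate))                # close to 10,000
```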

So, step 1: create the aggregation table in the data source as follows:

Then add the aggregation table to the Power BI model and define the aggregation:

Note that I have to tell Power BI how the columns aggregate so it can use the values properly. (The C aggregation column can be ignored. I generally add a rowcount column whenever I design an aggregation because I always seem to need it – but it’s not necessary here.)

Now refresh the visualization:

Yes – that’s from 40 minutes to less than a second!

So a couple takeaways from this:

- Try to convince your users that approximate distinct count is good enough for big data – it’s very accurate, with less than 2% error. If your users object, consider two pages for your report: one with the speedy approximate values and the other with the exact values for occasional validation.
- Use aggregations with approximate distinct count. Power BI aggregations don’t get a lot of love just yet, but as your Power BI models get bigger, you’ll find them very, very useful.