Understanding how parallelism works when your tabular model gets processed is confusing. How can parallelism be increased? Where are the potential bottlenecks?
Before we get into how parallelism works, there are several different types of jobs that happen and each is handled a bit differently. Let’s look at these different jobs that run when a table is processed:
- Data reader
- Calculated columns, hierarchies, and relationships are processed (Process Recalc)
The “Maximum Number of Connections” property on the datasource object defines the maximum parallelism for all jobs – not just the data reader jobs, but all jobs – for the datasource. For example, setting this value to 1 on the datasource forces all jobs in the pipeline to run sequentially (ie, each job in the above diagram is executed sequentially for a table and then the jobs are run sequentially for the next table). You can this property on the datasource by right-clicking on the datasource in SSMS and clicking properties:
The Data Reader job is straightforward – one is spun up for each partition being processed (constrained by maximum parallelism and available threads) and reads the data from your table in chunks and passes each chunk down the pipeline to the Encoding and Compression jobs. After a chunk is passed off to the jobs next in the pipeline, the data reader reads the next chunk while the previous one is being encoded and compressed.
The encoding process either hash encodes or value encodes each value. Hash encoding involves creating an entry in a dictionary table and storing the index of that entry into the segment. Value encoding stores the value itself or some mathematical operation on it. Sometimes the encoder decides that whatever algorithm it’s using no longer works and replaces it – and if this happens, all column values are re-encoded and all segments and all partitions are updated (experienced modelers can supply an encoding hint to avoid this situation).
After every column for every row is encoded and enough rows have been read to fill a segment (a segment is basic unit of the AS column storage engine with a default size of 8M rows – you can monkey with this by setting the DefaultSegmentRowSize server setting), the values are compressed ready to be saved.
There is one encoder and compression job per table regardless of how many partitions that table may have.
When all the processing is done and the dependencies are ready, the process recalc step recalculates and stores calc columns per segment and builds indexes for any affected relationships and hierarchies. This step normally begins after all the table’s partitions are read, but there can be dependencies on other tables that can prevent this from beginning until those dependencies are satisfied. This step does not require access to the data source but uses the data already read.
That’s the basics – now let’s look at where the typical bottlenecks lie.
Partition your data. If you have a medium to large source table (say more than 10M rows) consider partitioning your data to take advantage of parallelism. (The other reason you might want to partition your data is for incremental processing – this is outside the scope of this blog.) Partitions are a basic unit of parallelism for data reader jobs and simply creating several can improve processing performance times.
Encoder Bottleneck. You may see a bottleneck at the encoding/compression stage if there are enough readers and enough partitions such that more work gets queued up than it can accomplish. If this happens, you’ll need to optimize the encoding stage.
Remove unused columns. There is a single encoder job per table – so this can become a bottleneck with a large number of columns because the more columns you have, the more encoding is done. So remove any column that is not used – they are not free.
Numeric values process faster than strings String values are always hash encoded and this takes longer than value encoding. This has a couple of ramifications. First, use integer data types when you have a choice, eg, using 1 and 0 for binary values is preferable to “T” and “F”. The former can be value encoded whereas the latter cannot. Second, when you have a large fact table with string values, consider storing those string values in a dimension table and replacing the values in the fact table with an integer key.
Replace calc columns with expressions in the source db. When the data in any partition is refreshed from the source, all calc columns for all partitions have to be reevaluated. This might be expensive – so if you have a calc column expression that can be pushed upstream to the data source, you might see your processing times decrease.
Turn off MDX Attribute Hierarchies. Both Power BI and Analysis Services create attribute hierarchies for each column for client tools that use MDX instead of DAX for querying tabular models (this includes Excel and Tableau). Processing these attribute hierarchies is done during the ProcessRecalc job and can be expensive. If you don’t use client tools that require attribute hierarchies you can turn off this extra work by setting the property isAvailableInMdx = false on the column. And even it you do use such client tools, these attribute hierarchies are rarely if ever used for measures in a fact table – in this case, set isAvailableInMdx = true for dimension columns and set to false for fact table columns.
To diagnose if creating the attribute hierarchies are a problem – create a trace and launch a ProcessFull on the table. Look for EventSubclass “59 – Tabular object processing” signifying a hierarchy processing and look for the hierarchies with the same name as your columns:
The isAvailableInMDx property is not exposed in SSDT nor in Power BI. Either script out the TMSL script and set it in SQL Server Management Studio:
or update with Tabular Editor which does expose the setting.
Avoid Multiple ProcessRecalcs. Each time a partition is processed, any affected calculated columns, relationships and hiearchies are reprocessed as well. So if you process multiple partitions, do it inside a transaction.