An optimized QVD loads at blazing speed: you can load 50 to 100 million rows in a matter of minutes. So it always makes sense to create a common data layer of QVDs containing the shared transformations that many QlikView applications can consume.
For starters, an optimized QVD load is one that applies no transformations other than column aliasing, plus a few other exceptions the QlikView engine allows, such as a WHERE Exists() clause on a single field.
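As a quick illustration (the file and field names here are hypothetical, not from the actual project), the first load below stays optimized, while the commented-out second one does not:

```
// Stays optimized: only column aliasing and a single-field Exists()
Employee:
LOAD EmployeeID,
     Title AS JobTitle
FROM Employee.qvd (qvd)
WHERE Exists(EmployeeID);

// Breaks optimization: any expression, such as Upper(),
// forces a standard (much slower) load
// Employee:
// LOAD EmployeeID,
//      Upper(Title) AS JobTitle
// FROM Employee.qvd (qvd);
```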
Having said that, there are instances in which you have to think creatively to make sure you can load millions of rows in a reasonable time frame.
In a recent project, one of my teammates needed to load an optimized QVD of a reference table with over 22 million rows. Unlike a type 2 slowly changing dimension, in which a new row is added only when an attribute value changes, every row in this table is repopulated monthly with a date key.
He faced a real conundrum: QlikView's natural join created duplicate rows in the fact table, because the same key field existed in every monthly partition of the reference table.
The challenge was to find the latest date key in the reference table and then left join it with the fact table to avoid the duplicates.
Since I can't share the actual data model, I have created a sample data set for this QlikView tutorial to show how we managed not only to handle the duplicate values but also to improve data load time. For the sake of the tutorial, I am not using a QVD as the source file either.
Let’s create a simple data set containing Employee and Sales tables.
For each month, the same employee record is repopulated with the respective title.
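A minimal version of such a data set, built inline with illustrative field names (EmployeeID, Title, DateKey and so on are my own placeholders), might look like this:

```
// Reference table: each employee repeats once per monthly partition
Employee:
LOAD * INLINE [
EmployeeID, Title, DateKey
101, Analyst, 1/1/2013
101, Analyst, 2/1/2013
101, Senior Analyst, 3/1/2013
102, Manager, 1/1/2013
102, Manager, 2/1/2013
102, Manager, 3/1/2013
];

// Fact table: one row per sale, keyed by EmployeeID
Sales:
LOAD * INLINE [
SaleID, EmployeeID, SaleAmount
1, 101, 500
2, 102, 750
];
```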
The monthly QVD load should pick the highest date key for that month and use it as a filter when joining the reference table to the Sales fact table.
So in April, the date filter should use 3/1/2013 to join the Employee table with the Sales table.
Now, let’s load the reference table and find the latest date key.
As you can see, the latest date key value from a resident load will be used as a filter to join the reference table with the fact table and avoid duplicate rows.
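A sketch of this first, Max()-based approach, assuming the sample fields above and a DateKey that QlikView interprets as a date:

```
// Aggregate over the resident reference table to find the latest date key
MaxDate:
LOAD Max(DateKey) AS MaxDateKey
RESIDENT Employee;

// Store the value in a variable, then drop the helper table
LET vMaxDateKey = Peek('MaxDateKey', 0, 'MaxDate');
DROP TABLE MaxDate;
```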
Although this works, running a Max() function over 22 million rows still takes several minutes, and there were multiple large reference tables in this data model.
It soon became apparent that we needed a better way to find the latest key.
We decided to use a simple ORDER BY instead of the Max() function, and it did the trick. With an optimized QVD of about 22 million rows, finding the latest date key with Max() takes over 2 minutes; a simple ORDER BY took only 15 seconds.
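The ORDER BY variant we settled on looked roughly like this (table and field names are again illustrative): sort the resident table descending, so the first row holds the latest date key, then read it with Peek().

```
// Sort the resident reference table by date key, newest first
SortedEmployee:
LOAD DateKey AS SortedDateKey
RESIDENT Employee
ORDER BY DateKey DESC;

// Row 0 now holds the latest date key; no aggregation needed
LET vMaxDateKey = Peek('SortedDateKey', 0, 'SortedEmployee');
DROP TABLE SortedEmployee;
```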
Now, let's left join the reference table with the fact table.
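With the latest date key stored in a variable (hypothetically named vMaxDateKey, obtained as described above), the join can filter the reference table down to a single monthly partition so no duplicate rows are created:

```
// Keep only the latest month's rows, then left join onto the fact table
LEFT JOIN (Sales)
LOAD EmployeeID,
     Title
RESIDENT Employee
WHERE DateKey = '$(vMaxDateKey)';

// The full reference table is no longer needed
DROP TABLE Employee;
```

Each Sales row now carries exactly one Title, the one in effect as of the latest date key.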
On the surface, the difference between 2 minutes and 15 seconds seems meager.
Nonetheless, if you have 10 or 15 reference tables with similar requirements, the performance difference becomes obvious.
Readers: Have you dealt with a similar situation? If so, I'd love to hear your thoughts, as they might benefit someone else trying to solve the same conundrum.