Parquet, ADLS Gen2, ETL, and Incremental Refresh in one Power BI Dataset
A year ago, I was developing a solution for collecting and analyzing usage data of a Power BI Premium capacity. There were not only some simple log files, but also data that I had to convert into a slowly changing dimension type 2 (SCD). Therefore, I decided on the following architecture: Azure Data Factory pipelines collect data on a daily basis, the raw data is stored in a data lake forever, and the cleansed data is then moved to a SQL Server database. Because the data is stored in a SQL Server database, I can use incremental refresh in the Power BI service. It works perfectly.
But times are changing, new requirements are coming, and I have found a new, more direct way to load logs from the data lake straight into a Power BI dataset. And all that incrementally! Where I need an SCD, a SQL Server database will stay in the middle; for all other data, I can use this new approach.
In this article I want to show you how to load parquet files stored in an Azure data lake directly into your Power BI dataset. It involves an incremental refresh and an ETL process, too!
Checking all the ingredients
Before we start cooking, let’s check all the ingredients in our recipe.
We need an Azure Blob Storage account or, even better, an Azure Data Lake Storage Gen2 (ADLS Gen2) account, which adds a hierarchical namespace. The performance benefit of the latter option is described in a great blog post written by Chris Webb.
Do not forget to obtain sufficient rights for the data you want to access. You need at least the RBAC role Storage Blob Data Reader or Read + Execute permissions in the ACL. For more information, please see Access control model for Azure Data Lake Storage Gen2 | Microsoft Docs.
While we are in the Azure Portal, let's find the URL of the storage account. Go to the Settings group in the left pane and navigate to Properties. You will see a lot of URLs; copy the one of the Data Lake Storage endpoint, which has the form https://youraccount.dfs.core.windows.net.
The next ingredient is mastering incremental refresh together with ETL. If you have not read my previous deep-dive blog post Power BI Incremental Refresh with ETL yet, I highly recommend that you do so now.
And the last important ingredient is the brand-new parquet connector. I talk about parquet files all the time, but the same approach works with CSV, JSON, or XML files as well.
Let’s start cooking!
We create an empty dataset in Power BI Desktop and open Power Query Editor. Then, we want to get new data from our ADLS Gen2.
The connector expects the URL of a container in our storage account, but we can also navigate to a folder and load files from there recursively. Let's navigate to the folder refreshables/cleansed; it contains the cleansed files we want to load into our empty dataset.
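In M, this first step is a single call to the ADLS Gen2 connector. Here is a minimal sketch; the storage account and container names are placeholders you have to replace with your own:

let
    // lists all files under the given folder, recursively
    Source = AzureStorage.DataLake("https://youraccount.dfs.core.windows.net/yourcontainer/refreshables/cleansed")
in
    Source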
After a click on OK and then on Transform data, Power Query Editor shows a preview of files from our data lake.
We can see that there are two columns that can be used to filter by date: Name and Date modified. Which column you use as the filter for the incremental refresh depends on your use case.
It would also be great to see the content of the parquet files. As you can see in the screenshot above, there is a column called Content with the data type binary. We need to interpret that data as parquet files.
Unfortunately, there is no column transformation that converts binaries directly into parquet tables. Do not panic: we can still use one of the other options and modify the M code in the formula bar. There is a function called Parquet.Document, which converts a parquet binary into a table.
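For example, applied to the whole Content column in the formula bar, it could look like this (the name of the previous step is a placeholder):

= Table.TransformColumns(#"Filtered Rows", {{"Content", Parquet.Document}})

After this step, each binary in the Content column is a table.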
After this short detour, let's go back to incremental refresh. As we have already seen, we have two columns that are good candidates for the incremental refresh filter. We choose the more complicated way now; that means we take the text column Name as the filter column.
First, our query adlsRefreshables_E (E is short for extract) loads the metadata from the data lake. Then it creates two variables, FirstFileName and LastFileName, which identify the first and the last file according to the incremental refresh parameters RangeStart and RangeEnd. (Do you feel lost? Do not worry and visit my previous blog post Power BI Incremental Refresh with ETL for a deep-dive explanation.) Line 7 of the M code contains the condition for the incremental refresh. It says: load only files with names >= FirstFileName and < LastFileName. According to the official documentation, it does not matter on which side you place the equal sign, as long as it appears on only one side.
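To make this concrete, here is a minimal sketch of such an extract query. The storage URL and the file naming convention refreshables_yyyyMMdd.parquet are assumptions for illustration; adapt them to your own files. Remember that RangeStart and RangeEnd must be parameters of type Date/Time:

let
    // load only the file metadata from the data lake (the URL is a placeholder)
    Source = AzureStorage.DataLake("https://youraccount.dfs.core.windows.net/yourcontainer/refreshables/cleansed"),
    // derive the boundary file names from the incremental refresh parameters
    FirstFileName = "refreshables_" & Date.ToText(Date.From(RangeStart), [Format = "yyyyMMdd"]) & ".parquet",
    LastFileName = "refreshables_" & Date.ToText(Date.From(RangeEnd), [Format = "yyyyMMdd"]) & ".parquet",
    // the incremental refresh condition: names >= FirstFileName and < LastFileName
    Filtered = Table.SelectRows(Source, each [Name] >= FirstFileName and [Name] < LastFileName)
in
    Filtered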
The next steps are nothing spectacular. We select the two columns Content and Name, parse the parquet binaries with Parquet.Document, and expand the table columns.
You may have noticed on the last screenshot that the transformation query is just a reference to the query adlsRefreshables_E. Yes, that is true. We do not do any additional transformations in this example, to keep the solution as simple as possible.
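In M, such a transformation query (let's call it adlsRefreshables_T by analogy; the name is only my convention here) is nothing more than:

let
    Source = adlsRefreshables_E
in
    Source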
After this boring transformation step, we have some fun in the load part of ETL again. We have to apply the incremental refresh filter already used in the extraction step once more. This repeated filter does not change the set of rows, so it is functionally useless. Why do we still apply it? Because if we do not, Power BI Desktop does not allow us to set up incremental refresh.
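A sketch of this final load query, reusing the assumed names from the sketches above:

let
    Source = adlsRefreshables_T,
    // recompute the boundary file names from the incremental refresh parameters
    FirstFileName = "refreshables_" & Date.ToText(Date.From(RangeStart), [Format = "yyyyMMdd"]) & ".parquet",
    LastFileName = "refreshables_" & Date.ToText(Date.From(RangeEnd), [Format = "yyyyMMdd"]) & ".parquet",
    // the repeated filter: it removes no rows, but without it Power BI Desktop
    // refuses to let us configure incremental refresh on this table
    Filtered = Table.SelectRows(Source, each [Name] >= FirstFileName and [Name] < LastFileName)
in
    Filtered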
Power BI Desktop says that it is unable to confirm whether the query can be folded. That is not bad news, because the ADLS Gen2 connector was developed by smart people. It downloads only the metadata at first and loads the binaries later, when we really need them. Long live lazy evaluation!
Do you need proof? Here we go!
I have prepared some proof because I wanted to see with my own eyes that it works exactly the way I expected.
Proof no. 1: Fiddler
The first thing to point out is that Power BI downloads only metadata. That is the request highlighted in red at the top of all the HTTP requests. It asks for all files in the folder refreshables/cleansed recursively. The response is an array of paths; look at the boxes highlighted in orange. There is a size, a name, and a last modified timestamp for each file. It is the same data we saw at the beginning of this article in Power Query Editor.
A little bit later, when the Power Query engine needs the content of the files, it downloads them one by one; look at the blue box. There are always two requests per file. The first one is a HEAD request, the second one is a GET request. You can also recognize them by the size of the HTTP response: the HEAD responses have a length of 0 bytes, while the GET responses contain actual data in the vast majority of cases. You can see that in the green box.
Proof no. 2: Refresh in Power BI Desktop
If you do not have Fiddler and want to see whether your filter works well, you can observe the amount of data downloaded from your data lake in Power BI Desktop.
The first animation shows the download of all data, without any filter.
The second animation shows the download of the filtered rows. It proves that only a subset of the files is downloaded, not all of them.
Proof no. 3: Partitions
And the last check examines the incremental refresh in the Power BI service. I want to see that the model contains partitions that were created at different timestamps.
We have configured the incremental refresh to refresh only the last two whole days. That means we should see different timestamps on the last two partitions compared to the previous ones. Let's check it.
First, we need to refresh our dataset twice in the Power BI service. Please remember the approximate timestamps of the refreshes: 15:35 and 15:37 in my local time zone. Subtract one hour to get UTC.
Then we go to the settings of our dataset in the Power BI service and copy the connection string. You can do that by clicking the Copy button.
We open Tabular Editor and click Connect to Tabular Server in the menu. We paste the connection string from the previous step into the Server field and click OK. Tabular Editor connects to our dataset and shows us its model. We navigate to the table Refreshables and expand its partitions.
All partitions were initially created at approximately 14:35 UTC. This is the timestamp of the first refresh of the dataset.
But the last screenshot of this article shows that the partition 2021Q10224 was created approximately two minutes later. Exactly as we expected: only the last two partitions were processed again in the second refresh.
A short conclusion
That is all for today. Now you know how to combine parquet, ADLS Gen2, ETL, and an incremental refresh in one Power BI dataset. If you have any questions, please let me know down in the comments.