Fabric Data Factory: Issues with CU usage in copy activities


Issue

At several customer sites, we migrate existing BI platforms from Azure services such as Azure Synapse, Azure Data Factory, or Azure Databricks to Fabric. One key step in the migration process is transferring Azure Data Factory pipelines to Fabric Data Factory.

In some cases, we work with a large number of small source tables (e.g., from an Azure SQL Database). After the migration, I reviewed the Fabric Capacity Metrics report and was surprised to see that a single execution of the daily load process consumed nearly 30% of the available capacity on an F8 instance. The majority of this usage was attributed to pipeline operations.

Given the size of the capacity, I initially believed that an F8 instance would be more than sufficient for the customer, considering the relatively small amount of data and the complexity of the calculations. So, why was the capacity usage so high?

Test Environment Setup

Next, I conducted an investigation on a Fabric pipeline with a Copy Data task that loads 12 tables from a test database into Parquet. The Copy Data task is executed within a ForEach loop. The goal was to explore ways to optimize the CU (Compute Unit) usage of Copy Data tasks.

What does Microsoft say?
According to the pricing page for Fabric pipelines, the following statement is provided for “Data Movement” tasks (Copy Data activity):
“Data Movement service for Copy activity runs. You are charged based on the Capacity Units consumed during the Copy activity execution duration.”

In the pricing breakdown for how “Data Movement” is charged, Microsoft states:
Metrics are based on the Copy activity run duration (in hours) and the intelligent optimization throughput resources used.

[Screenshot: pricing breakdown for the Data Movement meter]
Source: Pricing for data pipelines – Microsoft Fabric | Microsoft Learn, 18.12.2024
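
To make that formula concrete, here is a minimal sketch of how I read the charging model. The function and parameter names are my own, not an official API; the 1.5 CU-hours rate is taken from the pricing page, and how the ITO value is determined remains unclear at this point.

    # My reading of the documented model: CU-hours consumed =
    # run duration (in hours) x ITO resources used x 1.5, which the
    # Capacity Metrics app then reports as CU (s), i.e. CU-seconds.
    # Names are my own; this is not an official API.
    def expected_cu_seconds(duration_seconds: float, ito_resources: float) -> float:
        duration_hours = duration_seconds / 3600
        cu_hours = duration_hours * ito_resources * 1.5
        return cu_hours * 3600  # convert CU-hours to CU-seconds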

But what exactly is “intelligent optimization”? According to Microsoft’s “Copy Activity Performance and Scalability Guide”, several factors are involved, such as parallel copy for partitioned sources and intelligent throughput optimization.

Source: Copy activity performance and scalability guide – Microsoft Fabric | Microsoft Learn

To investigate further, I ran three tests with different settings, comparing the intelligent throughput optimization (ITO) option set to “Max” versus “Auto” and raising the batch count of the ForEach loop to 6. The results showed that the batch count significantly impacts the execution duration, while the ITO setting has little to no effect.


Now, let’s turn our attention to the Fabric Metrics App to examine the consumed CUs. What insights does it reveal about the resource usage?

All pipelines are charged the same. However, by examining the details more closely, we can see how many CUs are used by each individual activity. According to Microsoft’s pricing calculation, the duration of the operations is a key factor in determining the cost.

Source: https://learn.microsoft.com/en-us/fabric/data-factory/pricing-pipelines

This suggests that the duration should directly impact the CU calculation and costs. However, when we examine the individual operations, they all consume 360 CUs, regardless of the runtime.

This was quite unexpected.

A statement from a forum post sums up what I, too, assumed to be the basis for the calculation:

In my eyes:

  1. 1.5 CU per hour gives 0.0004166 CU per second.
  2. Say 30 s duration. 30 * 0.0004166 = 0.0125.
  3. Now how many intelligent optimization throughput resources are used? Was set to Auto, so unclear. But even assuming a maximum of 256, we only get 256 * 0.0125 = 3.2 CU (s). Far from the listed 360!

Source: Solved: Minimum CU (s) billing per copy? Or am I just bad … – Microsoft Fabric Community
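
Running the same kind of back-of-the-envelope calculation with my reading of the formula above, and assuming a small ITO value of 4 (an assumption on my part, since “Auto” does not reveal the value actually used), the expected figures for short runs still do not line up with a flat 360:

    # Expected CU-seconds for short copy runs, assuming ITO resources = 4.
    for duration_s in (5, 30, 60):
        cu_s = (duration_s / 3600) * 4 * 1.5 * 3600
        print(f"{duration_s:>3} s -> {cu_s:6.1f} CU-seconds")
    #   5 s ->   30.0 CU-seconds
    #  30 s ->  180.0 CU-seconds
    #  60 s ->  360.0 CU-seconds  (only a full minute lands exactly on 360)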

Let’s take a look at the real-life scenario at the customer mentioned at the start of this post. When we examine the correlation between the duration of the operation and the CUs consumed, we find that nearly all data movement operations are consuming 360 CUs!

In fact, 99% of the operations at the customer result in 360 CUs.

When I look at the duration, it’s clear that the operations with higher CU usage are generally the “long-running” ones as well, but there are only a few of them.

Here, we observe another interesting pattern: It appears that the CUs are calculated in 360-unit increments. This could potentially be linked to a time calculation in seconds, perhaps something like ((60 * 60) / 10)?
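
One explanation that would fit this pattern, assuming that the billed duration is rounded up to whole minutes and that the ITO resources used are floored at 4 (both of which are my assumptions, not documented facts), is that every billed minute contributes exactly 360 CU-seconds:

    import math

    # Hypothesis only: duration rounded up to whole minutes, ITO floored at 4.
    def billed_cu_seconds(duration_seconds: float, ito_resources: float = 4) -> float:
        billed_minutes = math.ceil(duration_seconds / 60)
        return (billed_minutes / 60) * ito_resources * 1.5 * 3600

    for d in (5, 30, 61, 130):
        print(f"{d:>3} s -> {billed_cu_seconds(d):6.0f} CU-seconds")
    #   5 s ->    360 CU-seconds
    #  30 s ->    360 CU-seconds
    #  61 s ->    720 CU-seconds
    # 130 s ->   1080 CU-seconds

Under this hypothesis, the 360-unit steps observed in the Metrics App fall out naturally.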

Conclusion

Based on the findings, it appears that Microsoft’s pricing for data copy activities within Fabric pipelines may not accurately reflect the true consumption based on task duration.

It seems that very small copy tasks are rounded up to at least one minute of usage, leading to an inflated cost. This rounding effect significantly impacts customers with a large number of small objects, as even though the data is minimal, the usage calculation results in high consumption of Capacity Units (CUs).

The implication is that optimizing individual tasks may not have as much impact on billing as expected, while reducing the number of tasks to be processed could have a more substantial effect on overall costs. This discrepancy in how CUs are calculated warrants further clarification from Microsoft, particularly for scenarios involving many small data movements.
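
To illustrate the point with rough numbers: the table count below is purely hypothetical, and the F8 figure simply assumes 8 CUs available around the clock.

    # Purely illustrative figures, not taken from the customer scenario.
    tables = 500                                    # hypothetical number of small source tables
    per_copy_cu_s = 360                             # observed floor per Copy activity
    daily_copy_cu_s = tables * per_copy_cu_s        # 180,000 CU-seconds per daily load

    f8_daily_budget_cu_s = 8 * 24 * 3600            # 691,200 CU-seconds on an F8 per day
    share = daily_copy_cu_s / f8_daily_budget_cu_s  # roughly 26 %

    # Halving each copy's duration changes nothing while the 360 floor applies;
    # consolidating 500 copies into 50 would cut the figure by 90 %.
    print(daily_copy_cu_s, f"{share:.0%}")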


Be cautious when working with Data Factory copy activities, especially during the migration of pipelines from Azure Data Factory to Fabric Data Factory. The way usage and costs are calculated differs significantly between the two platforms!

PS: This is a re-written post based on my contribution in Microsoft’s Data Factory forums: Is the pricing of Fabric pipeline data copy activi… – Microsoft Fabric Community

Enhancing Power BI Visuals for Color Accessibility: Balancing Color and Clarity


In the quest to make data visualizations more inclusive, it’s crucial to address the needs of color-blind people while still preserving the effective storytelling that colors provide. In this post, we’ll explore some practical adjustments to Power BI’s built-in visuals that not only enhance accessibility but also retain the meaningful use of color.

Understanding Color Blindness

Color blindness affects approximately 8% of men and 0.5% of women globally. It occurs when the eye’s color-detecting cells fail to respond to certain wavelengths of light. This condition can make it difficult to distinguish between colors, particularly reds and greens, or blues and yellows.

For many color-blind individuals, visuals that rely solely on color for differentiation can be challenging to interpret. To address this, we need design strategies that enhance clarity. Often, a color-blind-friendly palette is used, but this can come at the expense of the meaning the original colors carry.

The Importance of Color in Storytelling

Colors in data visualization do more than just decorate: they convey meaning and context. Let’s look at the following example of three visuals displaying the age categories of Switzerland’s population:

Note: While there are many ways to enhance visuals further, this post focuses on color-related adjustments to improve accessibility.

  • Yellow signifies the “golden age” of seniors (65+), symbolizing experience and wisdom.
  • Green represents vitality and growth for individuals aged 19-64, reflecting an active and productive life stage.
  • Light blue (symbolizing youth under 20) conveys freshness, new beginnings, and potential. Light blue often evokes a sense of calm and clarity, making it ideal for representing the younger demographic.

These colors help users quickly understand and remember the context of the data. However, for individuals with color blindness, distinguishing these categories becomes problematic if color is used alone. As someone with “normal” color vision, I can only imagine how such visuals would look. In monochrome, differentiating between categories becomes difficult without color cues, underscoring the need for enhanced visual clarity.

Addressing Color Accessibility

To ensure that our visuals are accessible to everyone, including those with color blindness, I have implemented several key adjustments:
Legends make it much easier to interpret individual visuals. For the line and bar charts, legends were added to clearly identify categories, while the pie chart now uses data labels instead of a legend, providing direct category information. To further enhance accessibility, different line styles—dotted, dashed, and solid—were applied to the line chart for each category. In the stacked column chart, borders were introduced:

  • The youth category keeps the default formatting
  • The age group 20-64 has a grayscale border
  • The senior category features a solid black border

These adjustments collectively improve the clarity and readability of the visuals, even for those viewing them in grayscale.

Before Adjustments: In grayscale, it’s difficult to distinguish between categories due to the lack of color cues.

After Adjustments: With the addition of distinct line styles and borders, the grayscale visuals become much clearer, showing how these changes enhance readability.

Final Visuals

The final updated visuals incorporate color and additional design elements to improve accessibility while preserving the storytelling aspect. These adjustments not only help those with color blindness but also streamline the user experience for all viewers.

In summary, while these adjustments involve a minor increase in visual space for the additional legends, they significantly enhance both accessibility and overall clarity. By retaining the original color storytelling and incorporating complementary visual aids like borders and line styles, we ensure that our Power BI reports are both inclusive and effective.

Swiss federal elections 2023: Displaying results [demo report in German]


Data visualization has revolutionized the way we understand and interpret election results. In this context, I created a Power BI report to showcase the results of the Swiss National Council elections held in 2015 and 2019. With data directly ingested from open platforms provided by the Swiss Government, this report gives a comprehensive view of election results down to each municipality.

I recommend viewing the report on a desktop-sized screen. Many of the visuals are interactive; detailed results for individual municipalities can be displayed by hovering over the map of Switzerland (image 1).

The second page will display the results on election day (22nd October 2023). Display the report in full screen to get the best experience. In upcoming posts, I will explain the technical background and add new content.

Live report (select in the bottom right corner for full screen)

(Mis-)use custom map visuals in Power BI


Use case

In this blog post, I will show you how to use custom map visuals in Power BI to display something different than a region on a world map.

I first stumbled upon custom map visuals years ago while using Reporting Services Mobile Reports. Back then, my boss used them to visualize figures for different regions of Switzerland based on custom region definitions provided by the business.

Out of personal interest, I created a dataset from open data containing the votes of the members of the Swiss National Council. I wanted to display the members at their actual seats in the parliament’s seating plan. As I looked at the image of the parliament, I suddenly thought of it as a map: it places certain objects at their respective locations, even if there is no need for longitude and latitude in terms of placement on our world map.

Swiss Parliament (Source: https://www.parlament.ch)

Approach

I often try to avoid third-party visuals and first check whether the requirement can be fulfilled with a built-in visual. So I gave the custom map visual a go!

First, you need to activate the Shape Map visual in the preview features settings of Power BI Desktop. It will show up in your visual selection pane afterwards.

The custom map visual needs a GeoJSON file to display a custom map, so I had to generate one from the parliament’s seating plan.

The seating plan above needs to be represented as a GeoJSON file defining the individual seats as objects. Each seat also needs an ID so it can be identified and used in the report. In this specific example, I needed third-party tools to convert the seating plan first to an SVG and then to a JSON file (I used the following converter: Online GIS/CAD Data Converter | SHP, KML, KMZ, TAB, CSV, … (mygeodata.cloud)).

In the resulting JSON file, an id has to be added to every polygon. These ids are later connected to the respective data in Power BI. Adjustments like this, as well as other transformations, can be done with the online tool mapshaper.
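
To give an idea of the structure, here is a minimal, purely illustrative sketch of such a file generated with Python: one polygon per seat, each carrying an id that can later be matched to the data in Power BI. The coordinates and the property name are made up for the example.

    import json

    # Minimal, hypothetical seat map: one small square polygon standing in for a seat.
    seat_map = {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {"id": "seat_001"},  # the key later linked to the model
                "geometry": {
                    "type": "Polygon",
                    "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
                },
            }
        ],
    }

    with open("seat_map.geojson", "w") as f:
        json.dump(seat_map, f, indent=2)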

Hint

At one point, I had to rotate the whole JSON file by 180 degrees. I was able to achieve this using the following command in mapshaper: “mapshaper -affine rotate=180”. Just in case you are struggling with this 😉

The id in your data is then connected to the polygons via the “Location” attribute of the custom map visual.

Conclusion

This was a first trial of (mis-)using custom map visuals: not showing “real” geographic data on the world map, but allocating areas on a completely different kind of map, like a seating plan.

Have you used custom map visuals for purposes other than real-world maps as well? Let me know in the comments!

This is my final report in action, showing vote results based on my seating map:

Hello world!

“was there” selfie

I just returned home after an intense week fully booked with speaker sessions about data. I attended SQLbits 2023 in Wales, which is basically the largest data platform conference.

Now I’ve got a backpack full of new ideas, new connections, and new knowledge. I was impressed by the whole data community actively sharing their knowledge through free videos, blogs, and sessions. In one session, called the “keynote to the community”, all attendees gathered in the big auditorium, and some key people of the community invited everyone to actively share their knowledge with others as well.

As I thought about this, I remembered the numerous times I was glad that someone wrote about a specific topic, explaining it and providing solutions. Especially in our very specialized area of data engineering, it is crucial to be able to access this knowledge.

Even if this blog may never be read by a hundred people, it could help one or another of them in their daily business. Maybe I just want to give something back for all the thousands of blog posts that have helped me on my journey with data platforms.

Have fun!