TMDL demo 1/3: RLS Generator

Disclaimer

This blog post is part of a three-part series based on my speaker session at the European Microsoft Fabric Community Conference 2025 (FabCon) in Vienna, which I had the pleasure of delivering together with my colleague Roger Unholz.

Our session, titled “TMDL Playoffs”, was a fast-paced showdown where we shared our favorite tips and tricks for working with TMDL. In this series, I’ll take a deeper dive into each topic we presented on stage and explore them in more detail. All source files referenced throughout the posts are available in my public GitHub repository.

Please note that the scripts are not tested for every scenario; they were created for demonstration purposes only and should never be applied to any running production environments.
Link: ivsch/TMDLPlayoffs.

Idea

The configuration of Row-Level Security (RLS) within Power BI semantic models can sometimes be straightforward, but at other times it can feel like opening Pandora’s box. In most customer scenarios I encounter, the RLS setup is fairly basic. Nevertheless, managing RLS across a large number of semantic models can be a real challenge, especially when ensuring that the correct settings are applied consistently.

To address this, I often maintain role definitions outside of the semantic model to provide a clear overview of all enabled RLS settings. That could be an Excel file or a small “security application”. This definition then serves as the input for a simple script generator that quickly and reliably applies the defined roles to a Power BI semantic model file.

TMDL RLS definition

To check existing RLS definitions in your Power BI semantic model, first enable the TMDL view in Power BI Desktop settings (currently available under Preview features).

On the right side of your semantic model, you’ll find all the objects available in the TMDL scripting language. If you drag and drop the Roles object into the script window, you’ll see the definition of the roles.

Outcome

Even if it’s not a complete solution for handling complex RLS definitions, the most common, basic scenarios can be covered with the Jupyter notebook apply_rls in the GitHub source folder.

In the very first cell, you can provide all settings for the roles you want to apply. At the moment, the script only works with string values; it can be extended/adjusted to support other data types.


RLS_DEFINITION = [
    {
        "role_name": "CustomerSegmentManagersEnterprise",
        "table_name": "DimCustomer",
        "rls_field_name": "Segment",
        "allowed_values": ["Consumer", "Enterprise"],
    },
    {
        "role_name": "CustomerSegmentManagersSMB",
        "table_name": "DimCustomer",
        "rls_field_name": "Segment",
        "allowed_values": ["SMB"],
    },
]
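The generation step itself can be sketched roughly as follows. This is a simplified stand-in for the notebook’s actual code, not the implementation itself; it repeats the configuration from above so the snippet is self-contained and, like the notebook, handles string values only.

```python
# Simplified sketch of a TMDL createOrReplace generator for RLS roles.
# Assumption: allowed_values are strings, as in the notebook's current version.

RLS_DEFINITION = [
    {
        "role_name": "CustomerSegmentManagersEnterprise",
        "table_name": "DimCustomer",
        "rls_field_name": "Segment",
        "allowed_values": ["Consumer", "Enterprise"],
    },
    {
        "role_name": "CustomerSegmentManagersSMB",
        "table_name": "DimCustomer",
        "rls_field_name": "Segment",
        "allowed_values": ["SMB"],
    },
]


def generate_rls_script(definitions: list) -> str:
    lines = ["createOrReplace", ""]
    for d in definitions:
        # Combine the allowed values into a single DAX OR expression
        dax = " || ".join(
            f'[{d["rls_field_name"]}] == "{v}"' for v in d["allowed_values"]
        )
        lines += [
            f"\trole {d['role_name']}",
            "\t\tmodelPermission: read",
            "",
            f"\t\ttablePermission {d['table_name']} = {dax}",
            "",
        ]
    return "\n".join(lines)


script = generate_rls_script(RLS_DEFINITION)
print(script)
```

The output of this sketch matches the createOrReplace script shown below, which is what makes the approach so convenient: the role definitions live in one plain data structure, and the TMDL text falls out of a few string operations.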

This configuration generates a script for two roles—CustomerSegmentManagersEnterprise and CustomerSegmentManagersSMB—that filters the DimCustomer[Segment] column to the specified values.

With those settings, the following output is generated and can be copied into existing semantic model files.

Generated createOrReplace TMDL script:

createOrReplace

	role CustomerSegmentManagersEnterprise
		modelPermission: read

		tablePermission DimCustomer = [Segment] == "Consumer" || [Segment] == "Enterprise"


	role CustomerSegmentManagersSMB
		modelPermission: read

		tablePermission DimCustomer = [Segment] == "SMB"

Extendability

With further extensions, this functionality could be applied to an application that centrally defines roles and their respective security settings. This would simplify the management of RLS settings across multiple semantic models. In a pro version, you could even imagine downloading the Power BI semantic model, editing the TMDL role definitions, and re-uploading it via the Fabric REST API.
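As a rough sketch of that “pro version” idea: the Fabric REST API offers an updateDefinition endpoint for semantic models that accepts base64-encoded definition parts. The helper below only builds the request body; the endpoint path, part path, and all identifiers in the commented-out call are assumptions that should be verified against the current API documentation.

```python
import base64

def build_update_definition_payload(parts: dict) -> dict:
    """Build the JSON body for a Fabric 'update semantic model definition' call.

    `parts` maps a definition path (e.g. 'definition/roles.tmdl') to its
    TMDL source text. Each part is sent as an InlineBase64 payload.
    """
    return {
        "definition": {
            "parts": [
                {
                    "path": path,
                    "payload": base64.b64encode(text.encode("utf-8")).decode("ascii"),
                    "payloadType": "InlineBase64",
                }
                for path, text in parts.items()
            ]
        }
    }


tmdl_text = "role Demo\n\tmodelPermission: read"
payload = build_update_definition_payload({"definition/roles.tmdl": tmdl_text})

# Hypothetical call (requires real workspace/model IDs and a bearer token):
# requests.post(
#     f"https://api.fabric.microsoft.com/v1/workspaces/{ws_id}"
#     f"/semanticModels/{model_id}/updateDefinition",
#     headers={"Authorization": f"Bearer {token}"},
#     json=payload,
# )
```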

Clash of Data Ingestion: Comparing CU Usage (and other factors) in Fabric

Disclaimer

This blog post contains information about CU usage in Microsoft Fabric. Since I am not a specialist in capacity usage and the Usage Metrics App is still evolving, please consider this information as indicative rather than definitive. Additionally, the choice of technology always depends on best practices, specific frameworks, and the customer’s unique situation.

Please also consider the limited amount of data: these tests are based on the WWI-Sample database and include only a few tables from the Sales area.

Idea

Following my previous blog post, I’d like to explore alternative methods for data ingestion using Microsoft’s World Wide Importers sample database. My goal is to compare the CU usage across different ingestion technologies in Microsoft Fabric. In addition, I aim to offer recommendations and ratings for each technology. This post will provide a quick overview of various ingestion approaches based on an Azure SQL Database, with the hope that it will help you with your own projects.

Results

I aim to not only assess the CU usage of the different technologies but also evaluate the overall duration of the ingestion process, the flexibility, and the intuitiveness of each respective technology.

CU usage

Let’s address the elephant in the room first. Phew, I must admit that I found it quite challenging to get accurate numbers for the CU usage. For example, with ingestion via the notebook, the service took some time just to start the session. I couldn’t find an easy way to separate the CU usage for session startup from the actual execution. Despite this, I did some retesting and believe I got at least close to the correct figures. Of course, CU usage can vary depending on how it’s implemented in your framework and your specific projects. For instance, if you start a new session for each notebook to load data, the CU usage overhead might be higher.

Looking at the metrics app, I found that Dataflow Gen2 seems to consume the most CUs, while notebooks perform significantly better than Data Factory Pipelines with copy activities—this disparity is what motivated me to write this blog post. This fact brings us to an important takeaway: If you’re dealing with many small tables, consider switching from Pipelines to Notebooks. Pipeline Copy Activities always round up to a full minute (as I explained in my previous blog post), which can lead to unnecessary CU consumption. Moving to Notebooks can help you optimize both performance and costs.
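Back-of-the-envelope, the overhead adds up quickly. The sketch below uses my assumption that each Copy activity is billed for at least one full minute (as observed in my previous post, not an official formula):

```python
import math

def billed_seconds(actual_seconds: float) -> int:
    # Assumed behaviour: each Copy activity's duration is rounded up
    # to the next full minute, with a one-minute minimum.
    return max(60, math.ceil(actual_seconds / 60) * 60)

# Hypothetical example: 100 small tables, each copied in ~5 seconds
actual_work = 100 * 5                 # 500 s of real copy time
billed = 100 * billed_seconds(5)      # 6000 s billed -> 12x overhead
```

Under that assumption, a pipeline copying many small tables pays for twelve times the work it actually performs, while a notebook session is billed for its real runtime.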

Database mirroring is a bit of a deal-breaker, considering that the CU usage includes the initial setup of the mirrored database. However, it’s also difficult to determine which activities are strictly related to the mirroring, as there were numerous other operations happening in the target lakehouse (where I loaded the data). I focused on the relevant timestamps and took all the lakehouse operations into account (WWI-Sample represents the mirrored database).

Nevertheless, here are the results for the CU usage:

Microsoft describes database mirroring as a low-cost technology, which is absolutely true—the initial mirroring process is very resource-efficient. However, using a mirrored database isn’t always an option in every environment. In my real-world scenario, which led me to this topic, mirroring wasn’t viable because I needed a complete capture of the entire day as Parquet files in my destination lakehouse for archival purposes. Achieving this efficiently requires full control over the ingestion process, which isn’t easily possible with mirroring.

Let’s have a look at the other evaluation factors.

Overall duration

In terms of overall duration, notebook execution was the fastest, completing the ingestion of all Sales tables from the WWI-Sample database in just 44 seconds. This was followed by Dataflows Gen2 at 49 seconds and Data Factory Pipelines at 88 seconds. The entire database mirroring process took about 5 minutes from setup to the first full synchronization, but once configured, it could be extremely fast for handling new data in near real-time.

Type                     Duration [s]
Notebook                 44
Dataflow Gen 2           49
Data Factory Pipeline    88
Mirrored DB              300*

* from setup to the first full synchronization

Flexibility

If we focus more on real-time projects—like mine originally was—both Dataflows and the Mirrored DB are immediately out of consideration. I’ve already mentioned the need for an “archive” function, but when it comes to Dataflows, the main drawback for me is the lack of programmability. I wasn’t able to easily ingest all tables within the Sales schema of the database. Yes, Power Query offers extensive transformation capabilities, but when it comes to pure data ingestion, the process feels too rigid for my needs.

That said, I still want to give some credit for how configurable each ingestion method is in adapting to specific requirements. This brings me to my (very personal) evaluation:

Type                     Flexibility
Notebook                 ★★★★★
Dataflow Gen 2           ★★
Data Factory Pipeline    ★★★★
Mirrored DB              ★★

You may see that I really like working with notebooks. I was amazed that I could ingest the data with just a few lines of code. The following snippet performs the actual ingestion of the tables using the SQL Server data dictionary; the lines before it merely set up the database connection and import libraries.

Imagine what you can do with all the custom libraries in Python!

# Query the data dictionary for all tables in the Sales schema
query = "(SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'Sales') as qry"  # Replace with your schema

# Load the table list from SQL Server into a Spark DataFrame via JDBC
df = spark.read.jdbc(url=jdbc_url, table=query, properties=connection_properties)

# Ingest each table (skipping archive tables) and write it as Parquet
for tbl in df.collect():
    if not tbl[1].endswith("_Archive"):
        df_read = spark.read.jdbc(
            url=jdbc_url, table=f"{tbl[0]}.{tbl[1]}", properties=connection_properties
        )
        df_read.write.mode("overwrite").parquet(f"Files/raw_ntb/{tbl[1]}")

Data Factory Pipelines offer a high degree of customization through loops, parameters, variables, and more. However, I see notebooks as the clear winner in this regard. As mentioned earlier, the mirroring process has very limited flexibility, and Dataflows remain quite static.

Intuitiveness

Let’s look at this from another perspective. What if your company doesn’t have any Python expertise? What if you want to enable employees to set up data ingestion processes on their own, without needing deep technical knowledge?

In this scenario, I see Dataflows as the clear winner. With its visual interface for transforming data, it provides the most user-friendly experience, making it the easiest ingestion method for non-technical users. Mirroring is also quite intuitive, but it’s more of a straightforward selection process—choosing which tables to ingest rather than configuring an entire workflow.

Data Factory Pipelines, on the other hand, require a deeper understanding of the technology, including how expressions and functions work within Azure Data Factory. Notebooks demand even more technical expertise, as a basic understanding of Python is essential. While they offer the most flexibility, they are not the ideal choice for teams without programming knowledge.

So, when ease of use is the priority, Dataflows stand out as the most accessible option.

Type                     Intuitiveness
Notebook
Dataflow Gen 2           ★★★★★
Data Factory Pipeline    ★★★
Mirrored DB              ★★★★

Conclusion

As always, choosing the right technology depends on the specific needs of your project and customer scenario. In my case, I’m leaning towards using notebooks within our framework for data ingestion. But before you jump in, ask yourself: Are your developers ready for this technology?

Finding the sweet spot between capacity usage, ease of use, and flexibility is like trying to balance a three-legged stool—lean too far in one direction, and you might end up on the floor. My advice? Start with some Proof-of-Concept cases to test the waters. That way, you’ll have a solid foundation to pick your favorite data ingestion method in Fabric—without too many surprises along the way!

Fabric Data Factory: Issues with CU usage in copy activities

Issue

At several customer sites, we migrate existing BI platforms from Azure services such as Azure Synapse, Azure Data Factory, or Azure Databricks to Fabric. One key step in the migration process is transferring Azure Data Factory pipelines to Fabric Data Factory.

In some cases, we work with a large number of small source tables (e.g., from an Azure SQL Database). After the migration, I reviewed the Fabric Capacity Metrics report and was surprised to see that a single execution of the daily load process consumed nearly 30% of the available capacity on an F8 instance. The majority of this usage was attributed to pipeline operations.

Given the size of the capacity, I initially believed that an F8 instance would be more than sufficient for the customer, considering the relatively small amount of data and the complexity of the calculations. So, why was the capacity usage so high?

Test Environment Setup

Next, I conducted an investigation on a Fabric pipeline with a Copy Data task that loads 12 tables from a test database into Parquet files. The Copy Data task is executed within a ForEach loop. The goal was to explore ways to optimize the CU (Capacity Unit) usage of Copy Data tasks.

What does Microsoft say?
According to the pricing page for Fabric pipelines, the following statement is provided for “Data Movement” tasks (Copy Data activity):
“Data Movement service for Copy activity runs. You are charged based on the Capacity Units consumed during the Copy activity execution duration.”

In the pricing breakdown for how “Data Movement” is charged, Microsoft states:
Metrics are based on the Copy activity run duration (in hours) and the intelligent optimization throughput resources used.

Source: Pricing for data pipelines – Microsoft Fabric | Microsoft Learn, 18.12.2024

But what exactly is “intelligent optimization”? According to Microsoft’s “Copy Activity Performance and Scalability Guide”, several factors are involved, such as parallel copy for partitioned sources and intelligent throughput optimization.

Source: Copy activity performance and scalability guide – Microsoft Fabric | Microsoft Learn

To investigate further, I conducted three tests with different settings, modifying the intelligent throughput optimization (ITO) option by comparing “Max” versus “Auto” and adjusting the batch count in the ForEach loop to 6. The results showed that the batch count significantly impacts the execution duration, while the ITO setting has little to no effect.


Now, let’s turn our attention to the Fabric Metrics App to examine the consumed CUs. What insights does it reveal about the resource usage?

All pipelines are charged the same. However, by examining the details more closely, we can see how many CUs are used by each individual activity. According to Microsoft’s pricing calculation, the duration of the operations is a key factor in determining the cost.

Source: https://learn.microsoft.com/en-us/fabric/data-factory/pricing-pipelines

This suggests that the duration should directly impact the CU calculation and costs. However, when we examine the individual operations, they all consume 360 CUs, regardless of the runtime.

This was quite unexpected.

Referring to a statement from a blog post, this is also what I assumed to be the basis for the calculation:

In my eyes:

  1. 1.5 CU per hour gives 0.0004166 CU per second.
  2. Say 30 s duration: 30 * 0.0004166 = 0.0125.
  3. Now how many intelligent optimization throughput resources were used? It was set to auto, so unclear. But even assuming a maximum of 256, we only get 256 * 0.0125 = 3.2 CU (s). Far from the listed 360!

Source: Solved: Minimum CU (s) billing per copy? Or am I just bad … – Microsoft Fabric Community

Let’s take a look at the real-life scenario at the customer mentioned at the start of this post. When we examine the correlation between the duration of the operation and the CUs consumed, we find that nearly all data movement operations are consuming 360 CUs!

In fact, 99% of the operations at the customer result in 360 CUs.

When I look at the duration, it’s clear that the operations with higher CU usage are generally the “long-running” ones as well, but there are only a few of them.

Here, we observe another interesting pattern: It appears that the CUs are calculated in 360-unit increments. This could potentially be linked to a time calculation in seconds, perhaps something like ((60 * 60) / 10)?
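One hypothesis consistent with these numbers (my own speculation, not a documented formula): applying the published rate of 1.5 CU per hour of run duration per second, multiplied by an assumed minimum of 4 intelligent-optimization throughput resources, with the duration rounded up to a full minute, lands on exactly 360 CU (s) per small copy.

```python
import math

# My speculation, not a documented Microsoft formula:
# 1.5 CU-hours per hour of run time = 1.5 CU(s) per second of run time,
# multiplied by an ASSUMED minimum of 4 ITO resources,
# with the duration rounded up to a full minute.
RATE_CU_S_PER_SECOND = 1.5
ASSUMED_MIN_ITO = 4

def estimated_cu_seconds(duration_s: float) -> float:
    billed = max(60, math.ceil(duration_s / 60) * 60)
    return billed * RATE_CU_S_PER_SECOND * ASSUMED_MIN_ITO
```

Under this guess, a 30-second copy is billed 60 s * 1.5 * 4 = 360 CU (s), and longer runs move in 360-unit steps, which would match the pattern visible in the metrics app.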

Conclusion

Based on the findings, it appears that Microsoft’s pricing for data copy activities within Fabric pipelines may not accurately reflect the true consumption based on task duration.

It seems that very small copy tasks are rounded up to at least one minute of usage, leading to an inflated cost. This rounding effect significantly impacts customers with a large number of small objects, as even though the data is minimal, the usage calculation results in high consumption of Capacity Units (CUs).

The implication is that optimizing individual tasks may not have as much impact on billing as expected, while reducing the number of tasks to be processed could have a more substantial effect on overall costs. This discrepancy in how CUs are calculated warrants further clarification from Microsoft, particularly for scenarios involving many small data movements.


Be cautious when working with Data Factory copy activities, especially during the migration of pipelines from Azure Data Factory to Fabric Data Factory. The way usage and costs are calculated differs significantly between the two platforms!

PS: This is a re-written post based on my contribution in Microsoft’s Data Factory forums: Is the pricing of Fabric pipeline data copy activi… – Microsoft Fabric Community