The new house on the block: Why you need a smoke alarm, not just a newspaper
Data Mozart (https://data-mozart.com/) – Tue, 17 Feb 2026

You’ve probably spent the last year hearing about Lakehouses and Warehouses until your ears are ringing. But today, we’re talking about another house in Microsoft Fabric – the Eventhouse.

I’ve recently had a chance to work with a large client on a Real-Time Intelligence project implementation, and was amazed by the power of Eventhouse!

If a Lakehouse is a massive library and a Warehouse is a high-end bank vault, the Eventhouse is a 24-hour Air Traffic Control Tower. In technical terms, an Eventhouse is a logical container tailored for streaming data. It’s built on the Kusto (KQL) engine. While the rest of Fabric is busy thinking about ‘data at rest’ (data that’s sitting in a file waiting to be read), the Eventhouse is obsessed with ‘data in motion’. It’s where your logs, your IoT sensor data, and your real-time clicks live. It doesn’t just store data – it catches it mid-air.

“The newspaper” analogy

You might be thinking, ‘Can’t I just put my sensor data in a Lakehouse?’ Well, sure you can. But there’s a massive catch: latency.

Think about it like this:

  • The Warehouse/Lakehouse approach is like reading the morning newspaper. It’s great! It’s detailed, it’s curated, and it tells you exactly what happened yesterday. But if your house is currently on fire, a newspaper delivered tomorrow morning isn’t going to help you much.
  • The Eventhouse approach is the live smoke alarm. It’s scanning the air every second. It doesn’t care about the history; it cares about what is happening right now.

If you have a million rows of data coming in every minute from a factory floor, you don’t want to wait for a Spark job to spin up, ‘shuffle’ the data, and write a Delta table. You want to query it the millisecond it hits the wire. That is why we need an Eventhouse.

The difference in architecture

Warehouses and Lakehouses are built for throughput (moving big chunks of data at once). The Eventhouse is built for concurrency and low latency (handling thousands of tiny ‘events’ coming in at the same time and letting you query them instantly).

The reason Eventhouse is so fast is the engine: KQL (Kusto Query Language). If SQL is a precise surgical scalpel, then KQL is a high-powered leaf blower. It is designed specifically to scan through billions of rows of text and logs in seconds.

In an Eventhouse, you aren’t worried about complex joins between 50 tables. You’re doing ‘Search’ and ‘Filter’ across massive streams. It’s built to index everything by default, so when you hit ‘Run,’ the answer is there before you’ve finished clicking the mouse.

So, the rule of thumb for your next project:

  1. If your data is coming in batches and you want to do standard reporting, go Warehouse.
  2. If you have massive raw unstructured data, such as PDFs, images, video, or audio files, and you want maximum flexibility, go Lakehouse.
  3. If your data is streaming, comes in high volumes, and you need to react in seconds, go Eventhouse.
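As a purely illustrative sketch (the function name and criteria are invented, not any Fabric API), the three rules of thumb above could be encoded as a tiny decision helper:

```python
def pick_fabric_store(streaming: bool, unstructured: bool, needs_seconds_latency: bool) -> str:
    """Toy helper encoding the article's three rules of thumb."""
    if streaming and needs_seconds_latency:
        return "Eventhouse"   # high-volume streams, react in seconds
    if unstructured:
        return "Lakehouse"    # raw PDFs, images, video, audio - maximum flexibility
    return "Warehouse"        # batch data, standard reporting

print(pick_fabric_store(streaming=True, unstructured=False, needs_seconds_latency=True))  # Eventhouse
```

Of course, real projects mix workloads, which is exactly why the operational vs. analytical distinction below matters.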

Additionally, the big differentiator between the Lakehouse/Warehouse and the Eventhouse approach could be summed up as: use the Lakehouse/Warehouse for analytical workloads, and use the Eventhouse for operational workloads.

It’s not about which one is ‘better’ – it’s about choosing the right tool for the speed of your business.

Eventhouse vs. KQL Database

This is a point of confusion for a lot of people because, when you look at the workspace, they look very similar. But there is a clear parent-child relationship here.

Think of it like this: The Eventhouse is the apartment building, and the KQL Database is the apartment.

The Eventhouse is the high-level container. It’s the architectural “host.” When you create an Eventhouse, you aren’t just creating a place to put data; you are also setting up the compute and management layer.

  • Shared resources: The Eventhouse manages the “capacity” (resources) for everything inside it.
  • Centralized management: You use the Eventhouse to manage things like data ingestion, networking, and security at scale across multiple KQL databases.
  • Think of it: The building provides the electricity, the water, the security guard at the front desk, and the overall structure. Without the building, the individual apartments have nowhere to exist.

The KQL Database is where the actual work happens. This is where your data is stored and where your queries run.

  • Storage and tables: This is where you create your tables, your functions, and your materialized views.
  • Data isolation: You might have one KQL Database for “Production Logs” and another for “Test Logs” inside the same Eventhouse.
  • Think of it: This is your actual living space. It has your furniture (tables) and your rules (functions). You can have many apartments in one building, but they all share the building’s foundation.

Eventhouse OneLake availability

You’ve probably heard about the OneCopy principle in Fabric, right? This means we create a single copy of the data in OneLake, and all the Fabric engines can read that same copy. This principle is based on the unified, open table format called Delta Parquet. However, when you store the data in Eventhouse, it’s not stored in Delta Parquet format. Which defeats the purpose of the OneCopy principle, right?

Well, not really. There is a special feature, called Eventhouse OneLake availability, that allows you to automatically create a logical copy of the Eventhouse data in Delta Parquet format, and then query this data using any other Fabric workload, including Direct Lake storage mode for Power BI semantic models. This feature can be enabled at both the entire KQL database level and the individual table level.

Key strengths

Let me now quickly introduce some of the key strengths of the Eventhouse in Microsoft Fabric. In this article, I’ll just briefly introduce them, and then I plan to cover them in more detail in the future.

Shortcuts

I’ve already written about shortcuts in Microsoft Fabric. As a reminder, a shortcut is a OneLake object that points to another storage location. Essentially, it allows you to create a virtual copy of the data without physically moving it from its original location.

Shortcuts in a KQL database conceptually work the same as in the Lakehouse, but there are also some specifics when it comes to their physical implementation. As already mentioned, I’ll cover KQL database shortcuts in detail in one of the next articles.

Eventhouse endpoint for Lakehouse/Warehouse

The Eventhouse endpoint is a capability in Microsoft Fabric that lets users query lakehouse or warehouse data with high-performance, real-time querying, using both KQL and SQL.

Auto-indexing

One of the biggest headaches in database history is indexing. In the old days, you had to guess which columns people would query and build indexes manually. If you missed one, your query would crawl.

In an Eventhouse, everything is indexed by default. Because it’s built on the Kusto engine, it indexes data as it arrives. You don’t ‘tune’ an Eventhouse, you just feed it.
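To make “indexed by default” concrete, here’s a minimal Python sketch (a toy model, not Kusto’s actual implementation) of building an inverted index at ingestion time, so a later search is a hash lookup rather than a full scan:

```python
from collections import defaultdict

index = defaultdict(set)   # token -> set of row ids
rows = []

def ingest(text: str) -> None:
    """Index every token the moment the row arrives - no manual tuning step."""
    row_id = len(rows)
    rows.append(text)
    for token in text.lower().split():
        index[token].add(row_id)

def search(token: str) -> list:
    """Lookup is a hash probe into the index, not a scan of every row."""
    return [rows[i] for i in sorted(index.get(token.lower(), set()))]

ingest("ERROR disk full on node 7")
ingest("INFO heartbeat ok")
print(search("error"))  # ['ERROR disk full on node 7']
```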

Partitioning

Then there’s partitioning. Usually, you have to decide: “Do I partition by year? By month? By department?” If you get it wrong, moving data is a nightmare.

The Eventhouse handles partitioning based on ingestion time automatically. It’s like having a closet that reorganizes itself every night so the clothes you wear most often are right in front.
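Conceptually (this is a simplified Python sketch, not the engine’s real mechanics), automatic ingestion-time partitioning just means every event is routed to a bucket derived from when it arrived, so the user never has to pick a partition key:

```python
from collections import defaultdict
from datetime import datetime, timezone

partitions = defaultdict(list)  # hour bucket -> events

def ingest(event: dict, arrived_at: datetime) -> None:
    # The partition key is derived from ingestion time - nobody has to choose it
    bucket = arrived_at.strftime("%Y-%m-%d-%H")
    partitions[bucket].append(event)

ingest({"sensor": "A", "temp": 21.5}, datetime(2026, 2, 17, 8, 5, tzinfo=timezone.utc))
ingest({"sensor": "B", "temp": 19.0}, datetime(2026, 2, 17, 8, 40, tzinfo=timezone.utc))
ingest({"sensor": "A", "temp": 22.1}, datetime(2026, 2, 17, 9, 2, tzinfo=timezone.utc))

print(sorted(partitions))  # ['2026-02-17-08', '2026-02-17-09']
```

Queries scoped to a recent time window then only touch the most recent buckets.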

And the compression? Eventhouse automatically compresses data as it arrives, keeping storage costs down without any tuning on your part.

Native time-series

Now, let’s talk about time series. Most databases treat time like just another number. But the Eventhouse understands time.

If you want to see a trend, find an anomaly, or predict where a sensor reading is going, you usually need a math degree and 200 lines of Python. In Eventhouse, we use KQL, which has “Time Series” baked into its DNA.

Standard SQL is like looking at a series of Polaroid photos. You have to flip through them one by one to see what changed. Eventhouse is like having a video editing suite. You can use one command like make-series to fill in missing data points, or series_decompose to strip away the “noise” and see the actual trend. It’s built to answer “What’s normal?” and “When did it stop being normal?” in milliseconds.
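The make-series / series_decompose idea can be mimicked in plain Python (a rough sketch of the concept, not the KQL functions themselves): first regularize a sparse series by filling the gaps, then pull a smooth trend out of the noise with a moving average:

```python
def make_series(points: dict, start: int, stop: int, default: float = 0.0) -> list:
    """Like KQL make-series: turn a sparse series into a regular one, filling gaps."""
    return [points.get(t, default) for t in range(start, stop)]

def moving_average(series: list, window: int = 3) -> list:
    """Crude stand-in for the trend component of series_decompose."""
    half = window // 2
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

sparse = {0: 10.0, 1: 12.0, 3: 11.0, 4: 50.0, 5: 12.0}  # t=2 missing, t=4 looks anomalous
series = make_series(sparse, 0, 6)
print(series)  # [10.0, 12.0, 0.0, 11.0, 50.0, 12.0]
trend = moving_average(series)
```

In KQL, both steps (and proper seasonal decomposition) are single built-in operators rather than hand-rolled loops.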

Complex JSON and XML parsing

Let’s wrap it up by talking about the messy stuff: nested JSON and XML. We’ve all seen those JSON files that look like a giant tangled ball of Christmas lights – arrays inside objects inside other arrays.

In a traditional warehouse, you’d spend three days writing a flattening script. The Eventhouse has a feature called Dynamic Columns. You just dump the JSON in. You don’t even have to define the schema first! It’s like a high-end wood chipper. You throw in a whole tree (a complex, nested JSON object), and the Eventhouse shreds it into perfectly organized wood chips (columns) that you can query immediately. You want to reach deep into a nested array? You just use a simple dot notation (like Log.Details.Sensor.Value). It’s that easy.
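Here’s a small Python sketch of the same idea (the Log.Details.Sensor.Value path is the article’s illustrative example, not a real schema): dump nested JSON in with no schema defined, then reach into it with a dot path:

```python
import json

def get_path(doc: dict, path: str):
    """Walk a nested dict with dot notation, like dynamic column access."""
    node = doc
    for key in path.split("."):
        node = node[key]
    return node

raw = '{"Log": {"Details": {"Sensor": {"Value": 42.5, "Unit": "C"}}}}'
doc = json.loads(raw)  # no schema defined first - just dump it in
print(get_path(doc, "Log.Details.Sensor.Value"))  # 42.5
```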

Stay tuned for more about Eventhouse and Real-Time Intelligence workloads in Microsoft Fabric!

Thanks for reading!

P.S. Special thanks to Minni Walia from Microsoft for providing valuable feedback for this post

Last Updated on February 17, 2026 by Nikola

Why every Analytics Engineer needs to understand data architecture
Data Mozart – Thu, 12 Feb 2026

Get the data architecture right, and everything else becomes easier.

I know it sounds simple, but in reality, little nuances in designing your data architecture may have costly implications. Hence, in this article, I wanted to provide a crash course on the architectures that shape your daily decisions – from relational databases to event-driven systems.

Before we start, I want you to remember this: your architecture determines whether your organization is like a well-planned city with efficient highways and clear zoning, or like a metropolis that grew without any planning and now has traffic jams everywhere.

I’ve seen this firsthand. A company that had grown rapidly through acquisitions had inherited data systems from each company they bought, and nobody had ever taken the time to think about how all of it fit together. Customer data lived in five different CRM systems, financial data was split between three different ERPs, and each system had its own definition of basic concepts like customer and revenue. Their “weekly” business review took two weeks to prepare. Six months later, after implementing a well-thought-out data architecture, they could generate the same review in under two hours.

The difference wasn’t the latest technology or huge budgets. It was simply having a thoughtful approach to how data should be organized and managed.

In this blog post, adapted from a chapter of the book Analytics Engineering with Microsoft Fabric and Power BI, which I’m writing together with Shabnam Watson for O’Reilly, I’ll walk you through the core data architecture types, their strengths, weaknesses, and where each one truly shines. Fasten your seatbelts!

[Figure: The Data Architecture Evolution – 1. Relational Database (1970s) · 2. Data Warehouse (1990s) · 3. Data Lake (~2010) · 4. Data Lakehouse (2020) · 5. Data Mesh (2020s) · 6. Event-Driven (modern)]

1. Relational Database – The Fine Old Wine

Relational databases date all the way back to the 1970s, when Edgar F. Codd proposed the relational model. At its core, a relational database is a highly organized, digital filing cabinet. Each table is a drawer dedicated to one thing: customers, orders, products. Each row is a single record, each column a specific attribute.

The relational part is where the power comes from. The database understands how tables are connected. It knows that Customer X in the Customers table is the same Customer X who placed an order in the Orders table. This structure is what allows us to ask complex questions using SQL.

When working with relational databases, you follow a strict rule called schema-on-write. Think of building a house: you must have a detailed blueprint before you can start laying the foundation. You define every room, every window, and every doorway upfront. The data must fit this blueprint perfectly when you save it. This upfront work ensures everything is consistent and the data is trustworthy.

The opposite approach, called schema-on-read, is like dumping all your building materials into a big pile. There’s no blueprint to start. You only decide how to structure it when you need to build something. Flexible? Absolutely. But it puts the burden of making sense of the chaos on whoever analyzes the data later.
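The two approaches can be contrasted in a few lines of Python (an illustrative sketch with a made-up customers schema, not any database’s actual enforcement code):

```python
SCHEMA = {"id": int, "name": str, "email": str}  # the "blueprint"

def write_with_schema(table: list, row: dict) -> None:
    """Schema-on-write: the row must fit the blueprint or the write is rejected."""
    if set(row) != set(SCHEMA) or any(not isinstance(row[k], t) for k, t in SCHEMA.items()):
        raise ValueError(f"row does not match schema: {row}")
    table.append(row)

def read_schema_on_read(pile: list) -> list:
    """Schema-on-read: anything lands in the pile; structure is imposed only now."""
    parsed = []
    for blob in pile:
        name, _, email = blob.partition(",")
        parsed.append({"name": name.strip(), "email": email.strip()})
    return parsed

customers = []
write_with_schema(customers, {"id": 1, "name": "Alice", "email": "a@example.com"})  # ok
try:
    write_with_schema(customers, {"id": "oops", "name": "Bob"})  # rejected upfront
except ValueError:
    pass

pile = ["Alice, a@example.com", "Bob, b@example.com"]  # raw dump, no blueprint
print(read_schema_on_read(pile)[0]["name"])  # Alice
```

Note where the burden sits: the first function pays the cost at write time, the second defers it to every reader.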

[Figure: Schema-on-Write – blueprint first (ID | Name | Email), then write data; vs. Schema-on-Read – dump everything (JSON, images, CSV logs, XML), shape it when you read]

2. Relational Data Warehouse – The Analyst’s Playground

Relational databases were (and still are) fantastic at running the daily operations of a business – processing orders, managing inventory, updating customer records. We call these operational (OLTP) systems, and they must be lightning-fast.

But this created a huge challenge – what I like to call the “Don’t touch the live system!” problem. Business leaders needed to analyze data. But running complex analytical queries on the same live database processing thousands of transactions a minute would grind everything to a halt. It’s like trying to do a deep inventory count in a busy supermarket during peak hours.

The solution? Create a separate playground for analysts. The relational data warehouse was born: a dedicated database built specifically for analysis, where you centralize copies of data from various operational systems.

[Figure: Operational sources (Sales DB, CRM, Inventory, ERP) flow through ETL into the Data Warehouse – the single source of truth – which feeds Sales, Finance, and Marketing data marts and their reports and dashboards]

Two Schools of Thought: Inmon vs. Kimball

There are two fundamental approaches to building a data warehouse. The top-down approach, introduced by Bill Inmon – “the father of the data warehouse” – starts with designing the overall, normalized data warehouse first, then creating department-specific data marts from it. It gives you consistent data representation and reduced duplication, but comes with high upfront costs and complexity.

The bottom-up approach, championed by Ralph Kimball, flips this around. You start by building individual data marts for specific departments using denormalized fact and dimension tables. Over time, these connect via conformed dimensions to form a unified view. It’s faster to get started, more flexible, and cheaper, but risks inconsistencies and data silos if not managed carefully.
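A hypothetical sketch of Kimball’s conformed-dimension idea in Python (all names and figures invented): two independently built marts each hold their own fact rows, but both key into the same shared customer dimension, which is what lets them combine later:

```python
# Conformed dimension - shared by every mart
dim_customer = {1: "Alice", 2: "Bob"}

# Each mart builds its own denormalized facts, keyed by the shared dimension
sales_facts = [{"customer_id": 1, "amount": 120.0}, {"customer_id": 2, "amount": 80.0}]
support_facts = [{"customer_id": 1, "tickets": 3}]

def unified_customer_view(cust_id: int) -> dict:
    """Cross-mart analysis works only because both marts conform to dim_customer."""
    return {
        "customer": dim_customer[cust_id],
        "revenue": sum(f["amount"] for f in sales_facts if f["customer_id"] == cust_id),
        "tickets": sum(f["tickets"] for f in support_facts if f["customer_id"] == cust_id),
    }

print(unified_customer_view(1))  # {'customer': 'Alice', 'revenue': 120.0, 'tickets': 3}
```

If the two marts had each invented their own customer keys, this join would be impossible – that is the silo risk Kimball’s approach has to manage.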

[Figure: Inmon top-down – “build the warehouse first”: sources feed a centralized, normalized warehouse (single source of truth), from which Sales, Finance, and Marketing marts are derived (consistent but expensive and rigid) – vs. Kimball bottom-up – “build the marts first”: sources feed departmental marts of facts and dimensions, unified over time via conformed dimensions such as Customer and Product (fast to start but risks silos)]

3. Data Lake – The Promise and the Swamp

Around 2010, a new concept emerged promising to solve all our problems (sound familiar?): the data lake. The sales pitch was alluring – unlike a structured data warehouse, a data lake was essentially a massive, cheap storage space. You don’t need a plan, just dump everything in there: structured data, log files, PDFs, social media feeds, even images and videos. Store everything now, figure out what to do with it later.

This is the schema-on-read approach in practice. And unlike relational data warehouses, which provide both storage and compute, a data lake is just storage – no bundled compute engine. It relies on object storage, which doesn’t require data to be structured in tabular form.

For a while, the hype was real. Then reality hit. Storing data in a lake was easy – getting it out in a useful way was incredibly difficult. Business users were told: “Hey, all the data is in the lake! Just open a Jupyter notebook and use some Python to analyze it.” Most business users didn’t have advanced coding skills. The crystal-clear lake quickly became a murky, unusable data swamp.

[Figure: The promise – a crystal-clear data lake; the reality – a murky data swamp of JSON, CSV, and PDF files]

But the data lake didn’t disappear. The industry realized the initial vision was flawed, but the core technology remained incredibly useful. Today, the data lake has found its true calling – not as a warehouse replacement, but as a staging and preparation area: the perfect place to land raw data before deciding what to clean, transform, and promote for reliable analysis.

4. Data Lakehouse — The Best of Both Worlds

When you combine a data warehouse and a data lake, what do you get? A data lakehouse. Databricks pioneered this term around 2020, and the concept has been gaining serious traction ever since.

I can almost hear you asking: “Wait, didn’t you just say data lakes failed spectacularly trying to satisfy exactly these requirements? Why would this work now?”

Fair question. There was a single change to the classic data lake approach, but it was big enough to shift the entire paradigm: adding a transactional storage layer on top of existing data lake storage. This layer, exemplified by Delta Lake, Apache Iceberg, and Apache Hudi, enables the data lake to work more like a traditional relational database management system, with ACID transactions, schema enforcement, and time travel.

[Figure: Data Lake (cheap object storage, all data formats, schema-on-read, no compute) + Transactional Layer (ACID transactions, schema enforcement, time travel) = Lakehouse (lake flexibility, warehouse reliability, all data formats, cost-efficient) – the single change that shifted the entire paradigm]

The lakehouse promotes a compelling idea: remove the need for a separate relational data warehouse and leverage only a data lake for your entire architecture. All data formats (structured, semi-structured, and unstructured) are stored in the lake, and all analysis happens directly from it. The transactional layer is the missing ingredient that makes this feasible.
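What the transactional layer adds can be sketched in miniature (a toy model for intuition, not Delta Lake’s actual protocol): an append-only commit log that enforces the schema, commits atomically, and keeps old versions around for time travel:

```python
class ToyTransactionalTable:
    """Toy model of a transactional layer sitting over a pile of files."""

    def __init__(self, schema):
        self.schema = schema          # set of required column names
        self.log = []                 # each commit is one log entry

    def commit(self, rows):
        # Schema enforcement + atomicity: reject the whole commit if any row is bad
        if any(set(r) != self.schema for r in rows):
            raise ValueError("schema mismatch - commit rejected")
        self.log.append(rows)

    def snapshot(self, version=None):
        # Time travel: replay the log up to any past version
        upto = len(self.log) if version is None else version
        return [r for commit in self.log[:upto] for r in commit]

t = ToyTransactionalTable(schema={"id", "value"})
t.commit([{"id": 1, "value": 10}])
t.commit([{"id": 2, "value": 20}])
print(len(t.snapshot(version=1)), len(t.snapshot()))  # 1 2
```

A plain data lake is just the files; the log on top is what turns “a pile of Parquet” into something a warehouse engine can trust.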

5. Data Mesh – Decentralizing Data Ownership

So data lakehouses solved the storage and analysis problem. Case closed, right? Not exactly. As companies grew, even a great centralized data platform created a new bottleneck.

Think of your central data team as the kitchen of a very popular restaurant. Marketing, Sales, Finance, and Logistics all place complex “orders” (data requests). The kitchen staff – your data engineers – are skilled but swamped. They don’t have deep, nuanced understanding of every “dish.” The marketing team asks for a customer segmentation, and the kitchen has to first ask: “What do you mean by an active customer?” The result? A long line of frustrated “customers” and a burned-out kitchen staff.

Data mesh asks a radical question: what if, instead of one central kitchen, we gave each department its own specialized kitchen station? And what if we made the domain experts – the people who truly know their own data – responsible for preparing high-quality data products for everyone else?

The four pillars of data mesh architecture

Data mesh rests on four key principles: domain-oriented ownership (the people closest to the data own it), data as a product (treated with the same care as any customer-facing product), a self-serve data platform (central team provides the infrastructure, domains build the products), and federated computational governance (global standards enforced through a council with domain representatives).

A word of caution: Data mesh isn’t a technology you buy and install. It’s a sociotechnical shift, as much about organizational structure and culture as it is about technology. It’s not for every company, especially smaller ones. Domain teams need genuine technical expertise and long-term commitment, and data sharing can easily become a political issue.

6. Event-Driven Architecture – The Gossipy Neighbor

Now let’s switch gears. Think of event-driven architecture as the gossipy neighbor approach to data – systems that react instantly to things happening, rather than constantly checking for updates. Instead of System B asking System A every five minutes “Hey, did anything happen yet?” (like checking your fridge hoping food has magically appeared), an event-driven system taps you on the shoulder the moment something important occurs.

A customer places an order? That’s an event. The system that creates it is the producer. The systems that listen and react are consumers. And the intermediary where events get posted is the event broker – think Apache Kafka, Azure Event Hubs, or Eventstream in Microsoft Fabric.

The beauty lies in the words loosely coupled. The Marketing team can spin up a new service that listens to “Customer Signed Up” events without requiring the Sales team to change a single line of code. If the welcome email service crashes, new customers still get signed up – the events just pile up in the broker, waiting for the service to recover.
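The loose coupling described above can be sketched with a toy in-memory broker (all names here are invented for illustration; a real system would use Kafka, Event Hubs, or Eventstream): the producer never references its consumers, and events for an offline consumer simply pile up until it recovers:

```python
from collections import defaultdict

class ToyBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> consumer callbacks
        self.backlog = defaultdict(list)       # events waiting for offline consumers

    def publish(self, topic: str, event: dict) -> None:
        # Producer doesn't know or care who is listening
        if self.subscribers[topic]:
            for consumer in self.subscribers[topic]:
                consumer(event)
        else:
            self.backlog[topic].append(event)  # pile up until someone recovers

    def subscribe(self, topic: str, consumer) -> None:
        self.subscribers[topic].append(consumer)
        for event in self.backlog.pop(topic, []):  # drain what accumulated
            consumer(event)

broker = ToyBroker()
broker.publish("customer_signed_up", {"id": 1})   # welcome-email service is down: event waits
welcomed = []
broker.subscribe("customer_signed_up", lambda e: welcomed.append(e))  # service recovers
print(welcomed)  # [{'id': 1}]
```

Notice that signing up customers (publishing) never failed while the consumer was down – exactly the decoupling the paragraph above describes.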

But this power comes with trade-offs. You now have a new piece of infrastructure to manage. Debugging gets harder because when something goes wrong, tracing a single event across multiple decoupled systems can be a serious challenge. And the broker doesn’t always guarantee the order of delivery: you might get an “Order Shipped” event before the “Order Paid” event.

When to use it: Real-time analytics (IoT, clickstream, fraud detection), microservices integration, and asynchronous workflows.

When NOT to use it: Simple CRUD apps, tightly coupled workflows requiring immediate guaranteed responses, and strictly transactional systems where multi-step processes must succeed or fail atomically.

The Cheat Sheet

There’s no magic bullet – each architecture has its place. Here’s the quick comparison to help you decide:

  • Relational database – operational (OLTP) workloads: fast, consistent daily transactions.
  • Data warehouse – centralized, curated analytics and standard reporting on batch data.
  • Data lake – cheap landing zone for raw data of any format; staging and preparation.
  • Data lakehouse – lake flexibility plus warehouse reliability, thanks to the transactional layer.
  • Data mesh – decentralized, domain-owned data products for large organizations with mature teams.
  • Event-driven – real-time reactions to high-volume streams of events.

The Key Takeaway

Understanding when to use what is the crucial skill for any analytics engineer. Every single day, you make decisions about how to structure data, where to store it, how to transform it, and how to make it accessible. These decisions might seem minor in the moment (Should I create this as a view or a table? Where should I put this transformation logic?), but they add up to create the foundation your entire analytics ecosystem sits on.

The data architecture landscape has evolved from normalized relational databases, through the “don’t touch the live system!” era of data warehouses, past the spectacular rise and fall (and redemption) of data lakes, into the lakehouse paradigm that gives us the best of both worlds. Modern approaches like data mesh push ownership to the people closest to the data, and event-driven architectures let systems react instantly rather than constantly polling for updates.

This article is adapted from the book “Analytics Engineering with Microsoft Fabric and Power BI”.

Want the full deep dive with hands-on examples? Check out the book for the complete picture – including data modeling, medallion design pattern, and real-world Fabric implementations.

Last Updated on February 12, 2026 by Nikola

Beyond the Lakehouse: First Thoughts on Fabric IQ
Data Mozart – Tue, 03 Feb 2026

If you’ve been working in enterprise data for a while, you’ve probably experienced the following scenario: the marketing team calls a customer anyone who’s ever downloaded a whitepaper from your company’s website. Sales defines a customer as someone with an active contract. And finance? They only count someone as a customer after the first payment clears. Same word, three different meanings, and chaos in every cross-functional dashboard. Welcome to the world of semantic drift – the silent killer of data initiatives that Microsoft Fabric IQ is designed to eliminate.

Announced at Microsoft Ignite 2025, Fabric IQ represents a fundamental shift in how Microsoft Fabric operates. While the rest of the platform has been laser-focused on where your data lives (lakehouses, warehouses, eventhouses), Fabric IQ is obsessed with what your data means.

Ontology at the core

At its core sits the Ontology item – think of it as the Rosetta Stone for your business. An ontology is a shared, machine-understandable vocabulary that defines the things in your environment (represented as entity types), their facts (represented as properties), and the ways they connect (represented as relationships). If you’re thinking “this sounds like a fancy data model,” you’re not far from the truth. However, there is one critical difference: unlike a data model that lives in a single Power BI semantic model or database schema, an ontology sits above all your physical data sources and provides a single source of truth that every tool, every analyst, and every AI agent can reference.

The following illustration shows a simplified overview of the Fabric IQ architecture:

[Figure: Simplified overview of the Fabric IQ architecture]

The power of Binding

The power of Fabric IQ happens through a concept called binding. When you create an ontology, you’re not just drawing boxes and arrows in a diagram – you’re rather connecting those abstract business concepts to real data living across your OneLake ecosystem. A single entity type like Customer can be bound to columns from your lakehouse tables, eventhouse streams, and even existing Power BI semantic models. The ontology handles the heavy lifting: it maps data types, defines identity keys, describes how columns map to properties, and establishes how those keys create relationships across multiple sources. This means when your marketing automation system, your CRM, and your billing platform all have slightly different customer records, the ontology can reconcile them into a single, coherent business entity. Updates in upstream sources require manual refresh, but once refreshed, every downstream consumer, whether it’s a Power BI report, a Fabric Data Agent, or an Operations Agent, sees the same unified view.
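As a purely illustrative Python sketch (the source names, column names, and fields are all invented, and this is not the Fabric IQ API), binding boils down to mapping each physical source’s columns onto one entity type’s properties and reconciling records on an identity key:

```python
# Hypothetical bindings: each source maps its own column names onto the
# ontology's Customer properties, with an identity key used for reconciliation
bindings = {
    "crm":     {"key": "cust_id", "columns": {"cust_id": "customer_id", "cust_name": "name"}},
    "billing": {"key": "account", "columns": {"account": "customer_id", "mrr": "monthly_revenue"}},
}

crm_rows = [{"cust_id": "C-1", "cust_name": "Contoso"}]
billing_rows = [{"account": "C-1", "mrr": 990.0}]

def bind(source: str, rows: list) -> dict:
    """Project one source's rows onto the shared Customer entity type."""
    colmap = bindings[source]["columns"]
    out = {}
    for row in rows:
        entity = {colmap[c]: v for c, v in row.items() if c in colmap}
        out[entity["customer_id"]] = entity
    return out

def reconcile(*views: dict) -> dict:
    """Merge per-source views into one coherent entity per identity key."""
    merged = {}
    for view in views:
        for key, entity in view.items():
            merged.setdefault(key, {}).update(entity)
    return merged

customers = reconcile(bind("crm", crm_rows), bind("billing", billing_rows))
print(customers["C-1"])  # {'customer_id': 'C-1', 'name': 'Contoso', 'monthly_revenue': 990.0}
```

Every downstream consumer then queries the merged entity, not the three mismatched source records.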

You might be wondering: “Can’t I just standardize my data upfront in a data warehouse and call it a day?” Well, sure you can. But, that only works if you control every single data source and can enforce perfect consistency at ingestion time. In the real world, you’re dealing with acquired companies running different systems, third-party APIs with their own schemas, and legacy databases that predate your entire career. The ontology approach says: “Let the data be messy at the source level, but enforce semantic consistency at the consumption layer.” It’s the difference between trying to make everyone speak the same language versus having a really good translator.

What makes Fabric IQ particularly powerful for Real-Time Intelligence workloads is its integration with Graph capabilities in Microsoft Fabric. While traditional analytics asks “what happened?” and “why did it happen?”, graph-enabled ontologies let you traverse complex relationships to answer questions like “Which stores have fewer ice cream sales when their freezer temperature rises above -18°C?” or “Find all shipments exposed to risky routes and the related cold chain breaches that resulted.” 

The ontology declares which things connect and why, while the Graph engine stores and computes those traversals with sub-second performance. This isn’t just about drawing pretty network diagrams. It’s more about enabling AI agents and analysts to follow cause-and-effect chains across domains that traditional SQL joins simply can’t handle efficiently.

The real-world applications of Fabric IQ are already emerging across industries. Some companies are leveraging ontology in Fabric IQ to unify transmission and distribution grid data, breaking down the silos created by traditional relational databases. By structuring relationships between assets, events, and operational domains, such as weather systems, outage management systems, GIS data, and telemetry data, they’re building a foundation for advanced forecasting and risk evaluation. 

In the retail sector, companies are using ontologies to link sales transactions, supply chain events, IoT sensor data, and customer behavior into unified business entities that Operations Agents can monitor in real-time. When a freezer temperature sensor crosses a threshold, the agent doesn’t just send an alert – it understands the downstream implications for inventory spoilage, sales impact, and customer satisfaction, because all those concepts are explicitly modeled in the ontology.

Despite its emerging popularity, Fabric IQ is not a silver bullet. If your organization struggles to agree on basic definitions now, an ontology won’t magically create consensus. What it will do is make those disagreements painfully obvious. The power comes when you commit to treating the ontology as a living, governed artifact with clear ownership, approval workflows, and version control. When that happens, you unlock benefits that cascade through your entire data ecosystem: consistent semantics across tools, fewer one-off models, less duplicated logic, and most importantly, AI agents that can move beyond answering questions to taking safe, auditable actions. 

Because when business rules and constraints live in the ontology rather than scattered across 47 different Power BI semantic models and Fabric notebooks, your AI systems finally have the context they need to not just be helpful, but to actually understand your business.

Build your first Ontology

Let’s quickly see Fabric IQ in action. In this section, we will build our first ontology. At the time of writing, there are two options for creating an ontology:

  • Generate Ontology from semantic model – automatically generates an ontology on top of the existing Power BI semantic model
  • Build directly from OneLake – manually create the ontology by binding properties directly from OneLake

In our example, we will build an Ontology directly from OneLake. Before we start, there are certain prerequisites – you must enable specific settings on your Fabric tenant. All these settings can be found under the Tenant settings tab of the Admin portal.

  • Enable Ontology item
  • User can create Graph
  • Allow XMLA endpoints and Analyze in Excel with on-premises semantic models – required only if you want to generate an ontology from a semantic model

The following settings are required only if you want to use ontology with a Fabric Data agent:

  • Users can create and share Data agent item types
  • Users can use Copilot and other features powered by Azure OpenAI
  • Data sent to Azure OpenAI can be processed outside your capacity’s geographic region, compliance boundary, or national cloud instance
  • Data sent to Azure OpenAI can be stored outside your capacity’s geographic region, compliance boundary, or national cloud instance

For this example, we will need some sample data. You can download all the necessary files from here.

Next, let’s make this data available in the lakehouse. Let’s upload the files to the Files section of the lakehouse, and then transform them to Delta tables in the Tables section, as shown below. We will do that for all the files except the FreezerTelemetry file.

Next, we will create a new Eventhouse in the same Fabric workspace. Once the eventhouse is created, click on the three dots next to the default KQL database, choose Get data, and then Local file, as illustrated below:

Provide the name for the table, and load the data from the FreezerTelemetry.csv file you previously downloaded. After confirming all the default settings, you should be able to see the FreezerTelemetry table in the KQL database, as shown in the following illustration:

Now that the data is ready, let’s start building the ontology. Click on the New item and choose the Ontology item, as displayed below:

We’ll use SalesOntology as the ontology name. Now, let’s create entity types and data bindings. Entity types are representations of your business objects. In our example, we have three business objects: Store, Product, and SalesEvent. So, let’s first create the Store entity type in our ontology.

Once you create the Store entity type, you will be navigated to the configuration pane, where you can configure bindings and add data to entity type, as displayed in the following illustration:

Connect to the lakehouse you previously created, and from the dropdown list, pick the dimstore table. The figure below displays the default, static binding type. The source column side lists column names in the source data, while the property name side lists their corresponding property names on the Store entity type within the ontology. Don’t change the default property names, which match the source column names.

In the next step, we will add the entity type key, which is the StoreId column, as shown below:

Follow the same steps to create entity types Product and SalesEvent. Product entity maps to the dimproducts table in the lakehouse, and should use the ProductId column as a key, whereas the SalesEvent entity maps to the factsales table and should have the SaleId column as a key. 

Now that we’ve created all the entities, let’s establish relationships to represent contextual connections in the data. The following illustration displays the process of creating a relationship between the Store and the SalesEvent entities:

In the Relationship configuration pane, fill in the required configuration details, as displayed below:


Next, let’s create the soldIn relationship by following the same steps as in the previous relationship creation workflow. The only difference is that the source entity type is the Product, and the source column is the ProductId. 
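The entity types, keys, and relationships configured so far can be sketched as plain data. This is only an illustration of the shapes involved; the dict layout is invented and is not Fabric’s internal format.

```python
# Plain-Python sketch of the ontology built in this walkthrough.
ontology = {
    "entity_types": {
        "Store":      {"key": "StoreId",   "source": "lakehouse.dimstore"},
        "Product":    {"key": "ProductId", "source": "lakehouse.dimproducts"},
        "SalesEvent": {"key": "SaleId",    "source": "lakehouse.factsales"},
    },
    "relationships": [
        # (name, source entity type, target entity type, joining column)
        ("has",    "Store",   "SalesEvent", "StoreId"),
        ("soldIn", "Product", "SalesEvent", "ProductId"),
    ],
}

def relationships_of(entity_type):
    """List relationship names that touch a given entity type."""
    return [name for name, src, tgt, _col in ontology["relationships"]
            if entity_type in (src, tgt)]

print(relationships_of("SalesEvent"))   # → ['has', 'soldIn']
```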

Once you complete this step, you should be able to see the SalesEvent entity with two relationships (has, soldIn) to the Store and Product entities, as shown below:

Next, create the Freezer entity type by following the same workflow, and add the FreezerId as its key. Similar to the previous cases, we need to configure the binding of the Freezer entity. Pick the freezer table from the lakehouse, and configure column bindings as displayed below:

In the following steps, we will enrich our ontology with time-related data. From the Entity type configuration pane, under the Bindings tab, choose the Add data to entity type. This time, we will pick our eventhouse as a source, and the FreezerTelemetry table as a source table. The following figure shows the binding settings for the Freezer entity:

The Freezer entity now has two bindings – static, to the Freezer lakehouse table, as well as the time-series, to the FreezerTelemetry eventhouse table. 

Finally, let’s establish a relationship between the Freezer and Store entities. The following figure provides all the necessary information for the relationship configuration:

When you bound data to your entity types in the previous steps, the ontology automatically generated instances of those entities that are tied to the source data. You can get all the details at a glance by choosing the Entity type overview option from the top ribbon. This will display various information about the selected entity, as shown below:

Connecting with Data agents

As the final step, we will create a Data agent and use the ontology to provide the answers grounded in the ontology’s definitions and bindings. First, let’s create a new data agent in the same workspace we previously used to create our lakehouse, eventhouse, and ontology. We’ll name the agent: SalesOntologyAgent.

Once the agent is created, we will add the ontology as a data source. Next, let’s feed our agent with additional instructions. Click on the Agent instructions, and then add “Support group by in GQL” at the bottom of the instruction window, as displayed below:

It’s time to explore our ontology using natural language questions. For starters, try these two prompts:

  1. For each store, show any freezers operated by that store that ever had a humidity lower than 46 percent
  2. What is the top product by revenue across all stores?
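To ground expectations for the first prompt, here is the computation it boils down to, sketched in plain Python over a few invented sample rows (the agent resolves the same question by traversing the Store–Freezer relationship and the telemetry binding):

```python
# Invented sample rows standing in for the bound freezer and telemetry tables.
freezers = [
    {"FreezerId": "F1", "StoreId": "S1"},
    {"FreezerId": "F2", "StoreId": "S1"},
    {"FreezerId": "F3", "StoreId": "S2"},
]
telemetry = [
    {"FreezerId": "F1", "humidity": 44.0},
    {"FreezerId": "F2", "humidity": 51.5},
    {"FreezerId": "F3", "humidity": 45.9},
]

def low_humidity_freezers_by_store(threshold=46.0):
    """For each store, list freezers that ever reported humidity below threshold."""
    breached = {t["FreezerId"] for t in telemetry if t["humidity"] < threshold}
    result = {}
    for f in freezers:
        if f["FreezerId"] in breached:
            result.setdefault(f["StoreId"], []).append(f["FreezerId"])
    return result

print(low_humidity_freezers_by_store())   # → {'S1': ['F1'], 'S2': ['F3']}
```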

The figure below displays that the response references entity types, such as Store and Freezer, and their relationships, not just raw tables.

From there, you can continue exploring the ontology with a data agent by trying out some prompts on your own.

A word of warning: for this demo, I’ve been using an F4 Fabric capacity, and it was quickly overwhelmed!

As you can see in the illustration below, a screenshot from the Capacity Metrics App, the Graph item (Sales Ontology) exhausted all resources and caused my capacity to be throttled.

Hence, if you plan to use Fabric IQ and Ontology, make sure that your Fabric capacity is powerful enough to support these workloads.

Summary

The future is here! Bringing features like ontologies and graphs into the realm of everyday analytical tasks represents a huge leap forward in terms of how we perceive data. However, before you start creating ontologies like crazy in your Fabric workspace, please bear in mind a couple of things:

  • The feature is still in preview and in very early stages of life, so be prepared for the various limitations and/or glitches
  • Any updates made to the ontology’s upstream data sources need to be manually refreshed before they are available in the ontology item
  • Check the list of the current limitations on Microsoft’s official documentation

Thanks for reading!

Last Updated on February 3, 2026 by Nikola

The post Beyond the Lakehouse: First Thoughts on Fabric IQ appeared first on Data Mozart.

]]>
https://data-mozart.com/beyond-the-lakehouse-first-thoughts-on-fabric-iq/feed/ 0
Microsoft Fabric – My Top 5 New Features from 2025! https://data-mozart.com/microsoft-fabric-my-top-5-new-features-from-2025/?utm_source=rss&utm_medium=rss&utm_campaign=microsoft-fabric-my-top-5-new-features-from-2025 https://data-mozart.com/microsoft-fabric-my-top-5-new-features-from-2025/#respond Wed, 24 Dec 2025 19:13:38 +0000 https://data-mozart.com/?p=9213 There were tons of Fabric announcements in 2025! In this post, I distill the noise and pick top 5 across the entire platform

The post Microsoft Fabric – My Top 5 New Features from 2025! appeared first on Data Mozart.

]]>
It’s the end of the year, and you can find all kinds of retrospectives everywhere. So, I thought: why not throw another one:)?

Microsoft Fabric just turned two a couple of weeks ago (at Ignite in November, to be more precise). As the product is still very much a “work in progress”, we have seen literally hundreds of new features in the last 365 days. Obviously, not all of them are equally important – some were simply fixing obvious issues in the existing workloads, or catching up either with competitors or with functionalities we had in older Microsoft data platform solutions, whereas others were targeting super niche use cases.

Therefore, in this article, I’ll try to distill what I consider the biggest announcements around Microsoft Fabric in 2025.

Before we jump into the Top 5, a few important disclaimers:

Disclaimer #1: The list is based on my personal opinion. Which means that the list may be strongly biased:)

Disclaimer #2: There are no Power BI features in the list. This is done intentionally. Although Power BI officially belongs to the Microsoft Fabric family, it’s a way more mature product that evolves at a whole different pace compared to the rest of the Fabric platform. I plan to publish my Power BI list as a separate one

Disclaimer #3: When picking my top 5, I was trying to evaluate not just the immediate impact and/or value that particular feature brings to Fabric workloads, but also to think about its potential impact in the future

With all that being said, here is the list. The top 5 are not sorted in any particular order of importance. It’s simply “best from the rest”, and I rate them equally within the list.

#1 OneLake Security

Since Fabric was announced as an “end-to-end analytics platform”, its security model was…let’s say, cumbersome. From a security standpoint, Fabric looked like a “Frankenstein,” with each part of the platform having its own security model. That certainly undermined the “unification mantra” that Fabric had proclaimed since day one – because how can we talk about unification if the security model is not unified?

We were crying out for a proper security model, and it was finally announced in 2025. Although still in public preview and with many limitations still in place, OneLake Security aims to overcome this fragmented security approach and, instead of pushing security rules to individual engines (Spark vs. SQL vs. KQL vs. Power BI), enforce them directly on the tables stored in OneLake.

The idea behind OneLake Security is fairly simple: security “lives” with the data, and no matter which Fabric engine is later used to process the data, security rules will be enforced for any downstream workload.
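The “security lives with the data” idea can be sketched in a few lines of plain Python: one rule defined on the table, checked identically no matter which engine asks. The rule layout, role names, and table names below are all invented for illustration.

```python
# One central rule set attached to the data, not to any single engine.
table_rules = {
    "factsales": {"allowed_roles": {"SalesAnalyst", "Admin"},
                  "column_mask": {"CustomerEmail"}},
}

def read_table(table, user_roles, columns, engine="spark"):
    """Enforce the same table rule regardless of the calling engine."""
    rule = table_rules.get(table, {})
    if rule and not (set(user_roles) & rule["allowed_roles"]):
        raise PermissionError(f"{engine}: access to {table} denied")
    # Masked columns are withheld for every engine alike.
    return [c for c in columns if c not in rule.get("column_mask", set())]

# Same outcome whether the call comes from Spark, T-SQL, or Power BI:
print(read_table("factsales", ["SalesAnalyst"], ["SaleId", "CustomerEmail"], "t-sql"))
# → ['SaleId']
```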

The year 2026 should be a real breakthrough for OneLake Security – it will probably go to GA, and we all sincerely hope that we will finally get a robust and unified security model across the entire Fabric platform.

You can find more details about the OneLake security access control model here.

#2 Fabric IQ

Out of nowhere, at Microsoft Ignite, we got another workload in Fabric! Fabric IQ was introduced as Microsoft’s attempt to enter the new semantic data layer arena, where traditional semantic layers are enhanced to support numerous AI workloads and machine reasoning over data stored across the organization. There are five items in the IQ workload:

  • Ontology
  • Graph
  • Data agent
  • Operations agent
  • Power BI semantic model

The key component of Fabric IQ is called Ontology. This is not a new term – it has existed for centuries in philosophy. But, let’s not talk about philosophy – this is a technical blog, right:)?

Palantir, a company that is considered a pioneer in this area, defines an ontology like this: “An ontology refers to the systematic mapping of data to meaningful semantic concepts.”

I like to think about Ontology as a conceptual model for the data. I already wrote about the conceptual model and what it represents.

Ontologies and IQ definitely deserve a separate article, so I won’t go into details here. Here is just a simple illustration of why ontology is important, especially given the rapidly growing involvement of AI agents:

In the simplest words: The Ontology turns “Column A equals value X” into “The Customer bought an expired Product”.
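That jump from raw values to a business statement can be shown in a tiny sketch. The row, lookup tables, and names below are all invented; the point is only the translation step an ontology makes possible.

```python
from datetime import date

# Invented raw fact row and lookup tables.
row = {"cust_id": 113, "prod_id": 55, "sale_date": date(2026, 2, 10)}
customers = {113: "Contoso Ltd."}
products = {55: {"name": "Yogurt 500g", "expiry": date(2026, 2, 1)}}

def to_semantic_statement(r):
    """Turn 'Column A equals value X' into a statement an agent can reason about."""
    prod = products[r["prod_id"]]
    expired = "an expired" if r["sale_date"] > prod["expiry"] else "a"
    return f'The Customer "{customers[r["cust_id"]]}" bought {expired} Product "{prod["name"]}"'

print(to_semantic_statement(row))
# → The Customer "Contoso Ltd." bought an expired Product "Yogurt 500g"
```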

I’m quite confident that IQ will become the “1st class” Fabric citizen in 2026, so we should all look closely at the development of this new workload.

Learn more about Fabric IQ here.

#3 Spark Autoscale Billing

I hate this name! If there were a prize for the most misleading feature name in Fabric, Spark Autoscale Billing would be on the podium for sure. But, name aside, I really like this feature.

In a nutshell, Spark Autoscale Billing enables a pay-as-you-go model for Spark workloads in Fabric. Yes, I know, it doesn’t sound exciting, but let me explain why this feature found its place in my top 5 list…First, let’s quickly illustrate how Spark Autoscale Billing (I hate this name, have I already told you that?!) works.

Imagine you have a couple of Spark jobs that are “killing” your Fabric capacity. They consume a lot of resources, so other operations might be jeopardized. Turn on Autoscale Billing, and these Spark jobs will be executed using serverless compute resources that are provisioned outside of your “regular” capacity, and you will pay the bill for these compute resources separately. So, your Fabric capacity resources are not consumed by these jobs, but you’ll be charged extra for this additional compute.
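A back-of-the-envelope sketch of that trade-off (the numbers are illustrative, not official pricing): without Autoscale Billing the heavy job eats into the shared capacity; with it, the job runs on separately billed serverless compute and the capacity keeps its headroom.

```python
CAPACITY_CUS = 4     # e.g. an F4 capacity (illustrative)
SPARK_JOB_CUS = 3    # CUs a hypothetical heavy Spark job would draw

def capacity_headroom(autoscale_on):
    """CUs left for other workloads on the shared capacity."""
    # With Autoscale Billing on, the job's CUs come from serverless
    # compute and are billed separately, not drawn from the capacity.
    used_on_capacity = 0 if autoscale_on else SPARK_JOB_CUS
    return CAPACITY_CUS - used_on_capacity

print(capacity_headroom(autoscale_on=False))  # → 1  (other workloads squeezed)
print(capacity_headroom(autoscale_on=True))   # → 4  (job offloaded, billed extra)
```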

Now, you are probably wondering: OK, this is nice, but why is it such a big deal? Well, today it’s Spark jobs only, but who knows – in the future, we might get the possibility to run other Fabric workloads using a classic serverless, pay-as-you-go model. Wouldn’t it be great if we could set any workload that consumes a lot of CUs to run in serverless mode? When this feature was announced at FabCon in Las Vegas, I shared my thoughts that it might be a first step in “outsourcing” heavy workloads from Fabric capacity, in order to protect regular capacity workloads.

More information about Autoscale Billing for Spark can be found here.

#4 Data Agents

Copilot is cool, but the (AI) party is somewhere else:) Yes, looking at demos, Copilot might seem like your perfect AI companion to assist you in developing various solutions in Microsoft Fabric. And, don’t get me wrong, it’s ok-ish for basic tasks, but it struggles as soon as you stretch it beyond its predefined limits.

And that’s where the Data agent comes into play. With the Data agent, you are in charge, and you can build your own conversational Q&A systems. While Copilot relies on its self-generated context, Data agents can be configured with additional, personalized context. This is done by providing both instructions and example queries, which empower the Data agent to respond in a more deterministic way to common user questions.
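A sketch of those two levers as plain data – free-form instructions plus example queries that ground common questions. The field names, the matching logic, and the GQL-style query string are all invented for illustration and are not the actual Fabric API.

```python
# Hypothetical Data agent configuration.
agent_config = {
    "instructions": "Answer using ontology entities; prefer GQL with GROUP BY.",
    "example_queries": {
        "top product by revenue":
            "MATCH (p:Product)-[:soldIn]->(s:SalesEvent) "
            "RETURN p.Name, SUM(s.Amount) ORDER BY SUM(s.Amount) DESC LIMIT 1",
    },
}

def grounding_for(question):
    """Return the example query matching a user question, if any,
    so common questions get a deterministic answer."""
    for pattern, query in agent_config["example_queries"].items():
        if pattern in question.lower():
            return query
    return None

print(grounding_for("What is the top product by revenue across all stores?") is not None)
# → True
```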

Last but not least, unlike Fabric copilots, whose scope is limited to Fabric, data agents are standalone items, which can be integrated with external systems like Copilot Studio, AI Foundry, IQ, Teams, and many more.

Learn more about Data agents here.

#5 Workspace Identity

One of the coolest new bits in Microsoft Fabric is Workspace Identity – basically a managed identity that’s automatically created and tied to a Fabric workspace. Think of it as a friendly service principal that Fabric manages for you so you don’t have to mess with credentials, secret keys, or certificates. Once you enable it in your workspace settings, this identity can authenticate against resources that support Microsoft Entra (Azure AD) without you writing a single password into your pipelines or configs. It’s a huge win for security and reduces operational headaches, especially when you’re working with firewall-restricted storage like Azure Data Lake Gen2 through trusted workspace access.

Behind the scenes, Fabric creates a matching app registration and service principal in Microsoft Entra ID for each workspace identity, and it automatically handles credential rotation and lifecycle management. Since it’s managed by Fabric, you don’t have to babysit secrets or worry about leaks, although you do still need to assign the right permissions to the identity for the resources you want to access. Only workspace admins can create and configure these identities, and if you delete the workspace or the identity itself, anything depending on it will break and can’t be restored – so handle it carefully!

Workspace Identity also plays nicely with trusted workspace access: services like OneLake shortcuts, pipelines, semantic models, and Dataflows Gen2 can use it to “talk” to secured data sources without storing credentials. That makes building secure, credential-free data pipelines much easier and keeps your environment tighter and less error-prone.

You can read more about Workspace identity here.

More cool features that didn’t make it into Top 5, but are still too good not to be mentioned…

As I mentioned, there were many other fantastic features added or improved, such as:

But they didn’t make it into my top 5 list, even though I might end up using some of them more frequently than the top 5 features. Again, for the top 5, I also considered the breadth of use cases and their potential impact on future Fabric workloads and the general platform direction.

So, my dear Fabricators, that was all for this year…I’m super excited to follow how Fabric will develop in 2026 and see which features will make it to the top 5 next year.

Thanks for reading!

What were your top 5 Fabric features in 2025? Let me know in the comments.

Last Updated on December 24, 2025 by Nikola

The post Microsoft Fabric – My Top 5 New Features from 2025! appeared first on Data Mozart.

]]>
https://data-mozart.com/microsoft-fabric-my-top-5-new-features-from-2025/feed/ 0
Facts & Figures – Data Mozart’s 6th Birthday… https://data-mozart.com/facts-figures-data-mozarts-6th-birthday/?utm_source=rss&utm_medium=rss&utm_campaign=facts-figures-data-mozarts-6th-birthday https://data-mozart.com/facts-figures-data-mozarts-6th-birthday/#comments Mon, 22 Dec 2025 10:22:02 +0000 https://data-mozart.com/?p=9202 It's Data Mozart's 6th birthday! Reflections on 2025 and let's make more music from the data in 2026!

The post Facts & Figures – Data Mozart’s 6th Birthday… appeared first on Data Mozart.

]]>
It’s already been six years…Time flies and that’s so true! On this same date five years ago, I did a short retrospective of year 1…Then, I did the same thing four years ago for year 2. In December 2022, I “covered” year 3. Then, two years ago, I described year 4. Finally, last year I shared year 5 experiences. Now, it’s time to reflect on everything that happened in the previous 365 days.

Blogging

In terms of blogging, my productivity was way below usual – I’ve published only 13 posts (including this one)! There have been a lot of hectic periods with plenty of regular work to do, so I could hardly find spare time to write. But, hey, it’s still 1 blog post per month, which is not that bad at all! It’s not that I didn’t have ideas or motivation to write (I have a serious backlog of ideas for blog posts) – I simply struggled to find time to write:(…

Or, to be more honest, I spent my writing time in other “places” (but, more on that later:))

I’ll repeat: I firmly believe that content is king! Therefore, I tried my best to produce quality content that would be both interesting and helpful to other people. And, trust me, there is no better feeling than when someone gives you feedback that your text/idea helped to solve their problem.

Or, even better – when someone tells you that your blog posts helped them pass the exam, such as DP-600 (Fabric Analytics Engineer) or DP-700 (Fabric Data Engineer)!

In the 1st year, data-mozart.com was visited by ~22.000 users, and had 36.000 unique sessions, while in year 2, there were ~73.000 users and ~105.000 sessions. Then, in year 3, there were ~103.000 users and ~ 151.000 unique sessions…In year 4, ~159.000 users and ~260.000 sessions. Year 5 was absolutely crazy, with ~215.000 users and ~361.000 sessions. Since the writing was less frequent in Year 6, the decrease in numbers was expected:

Translated into page views, it’s ca. 285.000 page views:

Similar to the previous year, the absolute winners were the blog post where I demystified the Parquet file format and the one in which I explained how to work with tables in Microsoft Fabric.

Substacking

The reason I’ve been blogging less at data-mozart.com is that I moved some of my writings to Substack. I started more than a year ago, and published almost 50 articles over time, building a community of 2.000+ subscribers along the way!

I’ll probably continue using Substack for most of my musings about data in 2026 as well…

Speaking

This was again a very productive year from the speaking perspective, especially considering that I intentionally gave up traveling to conferences in the first half of 2025, because my son had just started school, so I had to take him there and pick him up. I presented at 17 events during 2025, which means 1.5 presentations per month. Again, most of the sessions were delivered virtually (14) – but some were delivered in person as well (3)!

As a speaking highlight, I’d mention the Dynamics Community Summit North America in Orlando, where I got the best ever swag for not only being a speaker, but also a member of the Programming Committee. I also enjoyed presenting in person at other fantastic events: Data Platform Next Step in Billund, Power Conference Prague, to name a few!!!

And, I owe huge gratitude to all of the organizers, for giving me an opportunity to present, and all attendees for deciding to spend their most precious asset – TIME – with me! THANK YOU!

I’m looking forward to more in-person events in 2026. I already have two confirmed engagements – beginning of March, doing a full-day precon with my good friend Augustin Dokoza Bukvic, and presenting two additional sessions at Power BI Gebruikersdagen – this will be my first time in Utrecht, and I’m super excited about it! Also, two sessions at FabCon Atlanta in March, yaaay!

Books

Bam! This is a brand new section in this annual review! Last year, it was called “Writing a book – Fundamentals of Microsoft Fabric”…Now, it’s written! Ben Weissman and I did a thing, and our book has already sold more than 1.000 copies, which is amazing for a tech book!

But, you know what’s even crazier?! I’m writing another one, together with Shabnam Watson: Analytics Engineering with Microsoft Fabric and Power BI. It can be read as the early release at O’Reilly’s learning platform, where we add new raw chapters every month.

If someone asked me 18 months ago if I would ever write a book, I’d say: “Never!” Now, if someone asked me if I would write a third book, I’d say: “Maybe”:) I guess this wisdom comes with maturity:)

Also, writing two books is probably the main reason why my blogging suffered in the previous 12 months.

Huge, huge thanks to Michelle Smith, for providing all O’Reilly opportunities and supporting me along the way. Also, a big shout to our editors, Shira Evans and Corbin Collins, for their tremendous work on these projects.

Youtube-ing

I started the Data Mozart YouTube channel in the summer of 2021, as many people asked for videos to complement the blog posts. After 11 videos in 2021, 7 in 2022, 9 in 2023, and 21 in 2024, I published 13 videos in 2025! I plan to continue publishing new and fresh video content in 2026, especially around Microsoft Fabric and Power BI.

As of today, I’m close to 7.000 subscribers, which is huuuge for me!

I have some super cool and interesting plans for 2026 when it comes to YouTube-ing, so stay tuned:)

Courses

I was very active in 2025 as a Pluralsight author. After creating 3 Pluralsight courses in 2021, 5 more in 2022, 3 courses in 2023 (plus updating two older ones), and 5 courses in 2024, I’ve created 3 new courses in 2025, which makes 19 courses in total! Btw, as I’m writing this, I’m also working on two additional courses! What a year!

Huge thanks to Austi Whetten and Kaylee Udinski for believing in me, but also to all the other lovely, kind, and supportive people from Pluralsight. It’s really a privilege to be part of the Pluralsight family.

I’ve also published two courses for O’Reilly Media: Mastering PL-300 Microsoft Power BI Data Analyst, and a full Power BI – From Beginner to Advanced course. Again, huge thanks to Michelle Smith and Amelia Blevins for their support in these projects.

Joining iLink Digital

At the beginning of October, I joined iLink Digital as a Principal Architect. The first three months were nothing short of amazing, as I got the opportunity to work with many talented people on various enterprise-grade projects. I like this day-to-day technical work, and I’m very much looking forward to helping our clients modernize their data estate in 2026 by implementing Microsoft Fabric.

What next?

Well, 2025 was (again) a year to remember – mostly in a positive way!

I would be lying if I said it wasn’t exhausting at times, especially in the last months of the year – but I truly enjoy working with data, and enjoy even more sharing my knowledge and experience. There is no better reward than when someone tells you that your article or idea helped them solve a problem or pass an exam!

I hugely appreciate how much I’ve learned, and how much I’m still learning on a daily basis from different data community experts – that being said, I promise to continue making music from the data in 2026!

If you have come so far reading this article, then I’d really like to give you special kudos! Thanks for reading, happy holidays, enjoy the festive period in peace and health with your loved ones, and hope to see a lot of you in person in 2026!

Cheers!

Last Updated on December 22, 2025 by Nikola

The post Facts & Figures – Data Mozart’s 6th Birthday… appeared first on Data Mozart.

]]>
https://data-mozart.com/facts-figures-data-mozarts-6th-birthday/feed/ 2
Direct Lake Models: Are They OneLake or SQL? (And How to Check!) https://data-mozart.com/direct-lake-models-are-they-onelake-or-sql-and-how-to-check/?utm_source=rss&utm_medium=rss&utm_campaign=direct-lake-models-are-they-onelake-or-sql-and-how-to-check https://data-mozart.com/direct-lake-models-are-they-onelake-or-sql-and-how-to-check/#comments Mon, 20 Oct 2025 11:49:34 +0000 https://data-mozart.com/?p=9127 Learn how to quickly identify if your published Direct Lake model uses OneLake or SQL option

The post Direct Lake Models: Are They OneLake or SQL? (And How to Check!) appeared first on Data Mozart.

]]>
In my recent Microsoft Fabric training, I’ve been explaining the difference between the Direct Lake on OneLake and Direct Lake on SQL, as two flavors of Direct Lake semantic models. If you are not sure what I’m talking about, please start by reading this article. The purpose of this post is not to examine the differences between these two versions, but rather to clarify some nuances that might occur. One of the questions I got from participants in the training was:

“How do we KNOW if the Direct Lake semantic model is created as a Direct Lake on OneLake or Direct Lake on SQL model?”

Fair enough, so let’s dig deeper and understand how to determine the real flavor of your Direct Lake models…

#Option 1: TMDL View in Power BI Desktop

Honestly, the TMDL view in Power BI Desktop is one of the game-changing features! In this particular example, we will leverage the TMDL view to find out if our model is using the OneLake or SQL option.

From Power BI Desktop, I connected to the published semantic model named CMS_DL_OL and chose to edit the model. This model was created as a Direct Lake on OneLake model. Once you drag the model definition to the TMDL view, scroll all the way to the bottom and look for the expression property:

If the M expression refers to AzureStorage.DataLake, it’s a Direct Lake on OneLake model.

On the flip side, when you open the Direct Lake on SQL model, the expression will look different:

Instead of the Direct Lake and lakehouse name that we had in the previous case, now the expression refers to the DatabaseQuery, and includes connection details of the SQL Analytics Endpoint of the lakehouse.
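The rule of thumb from above can be encoded in a hypothetical helper: inspect the model’s shared M expression and classify the Direct Lake flavor. The helper itself and the simplified expression strings are illustrative; only the two markers (AzureStorage.DataLake and DatabaseQuery) come from the article.

```python
def direct_lake_flavor(shared_expression: str) -> str:
    """Classify a semantic model by the contents of its shared M expression."""
    if "AzureStorage.DataLake" in shared_expression:
        return "Direct Lake on OneLake"
    if "DatabaseQuery" in shared_expression:
        return "Direct Lake on SQL"
    return "Not a Direct Lake expression"

# Heavily simplified stand-ins for the real expressions:
onelake_expr = 'let Source = AzureStorage.DataLake("https://onelake...") in Source'
sql_expr = 'let Source = DatabaseQuery("<sql-endpoint-connection>") in Source'

print(direct_lake_flavor(onelake_expr))  # → Direct Lake on OneLake
print(direct_lake_flavor(sql_expr))      # → Direct Lake on SQL
```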

#Option 2: Tabular Editor

Tabular Editor is an amazing tool! I already wrote a couple of articles describing how to leverage Tabular Editor in various scenarios, either to complete tasks that are not natively supported by Power BI Desktop, or to complete certain supported tasks in a more efficient way compared to Power BI Desktop.

In this case, let’s examine how Tabular Editor may help us to identify if the Direct Lake model is OneLake or SQL. You can use both Tabular Editor 2 (a free version of the product) and Tabular Editor 3 (a paid version).

In the Shared Expressions node of the model, you’ll find the same information as previously in the TMDL view.

Hence, whenever the shared expression contains DirectLake, this means that the model is based on OneLake.

On the flip side, when the expression is a DatabaseQuery, as you may notice in the illustration below, the model is SQL-based:

What about Fabric Warehouse as a data source?

In the previous examples, we’ve seen the differences between semantic models created on top of the Fabric Lakehouse. A fair question would be: what happens if we use the Fabric Warehouse, since the warehouse doesn’t provide a SQL analytics endpoint as a separate item?

To cut a long story short: everything works exactly the same as with lakehouses. Although the SQL endpoint of the warehouse is not “visible” as a separate item (as it is with a lakehouse), it is built into the warehouse itself. So, the expression property follows the same logic as with lakehouses.

Thanks for reading!

Last Updated on October 20, 2025 by Nikola

The post Direct Lake Models: Are They OneLake or SQL? (And How to Check!) appeared first on Data Mozart.

]]>
https://data-mozart.com/direct-lake-models-are-they-onelake-or-sql-and-how-to-check/feed/ 1
From ‘Dataslows’ to Dataflows: The Gen2 Performance Revolution in Microsoft Fabric https://data-mozart.com/from-dataslows-to-dataflows-the-gen2-performance-revolution-in-microsoft-fabric/?utm_source=rss&utm_medium=rss&utm_campaign=from-dataslows-to-dataflows-the-gen2-performance-revolution-in-microsoft-fabric https://data-mozart.com/from-dataslows-to-dataflows-the-gen2-performance-revolution-in-microsoft-fabric/#comments Wed, 15 Oct 2025 08:28:46 +0000 https://data-mozart.com/?p=9094 Dataflows were (rightly?) considered "the slowest and least performant option" for ingesting data into Power BI/Microsoft Fabric. However, things are changing rapidly and the latest Dataflow enhancements changes how we play the game

The post From ‘Dataslows’ to Dataflows: The Gen2 Performance Revolution in Microsoft Fabric appeared first on Data Mozart.

In the ocean of announcements from the recent FabCon Europe in Vienna, one that may have gone under the radar was about the enhancements in performance and cost optimization for Dataflows Gen2.

Before we delve into explaining how these enhancements impact your current Dataflows setup, let’s take a step back and provide a brief overview of Dataflows. For those of you who are new to Microsoft Fabric – a Dataflow Gen2 is the no-code/low-code Fabric item used to extract, transform, and load the data (ETL).

A Dataflow Gen2 provides numerous benefits:

  • Leverage 100+ built-in connectors to extract the data from a myriad of data sources
  • Leverage a familiar GUI from Power Query to apply dozens of transformations to the data without writing a single line of code – a dream come true for many citizen developers
  • Store the output of data transformation as a delta table in OneLake, so that the transformed data can be used downstream by various Fabric engines (Spark, T-SQL, Power BI…)

However, simplicity usually comes with a cost. In the case of Dataflows, the cost was significantly higher CU consumption compared to code-first solutions, such as Fabric notebooks and/or T-SQL scripts. This was already well-explained and examined in two great blog posts written by my fellow MVPs, Gilbert Quevauvilliers (Fourmoo): Comparing Dataflow Gen2 vs Notebook on Costs and usability, and Stepan Resl: Copy Activity, Dataflows Gen2, and Notebooks vs. SharePoint Lists, so I won’t waste time discussing the past. Instead, let’s focus on what the present (and future) brings for Dataflows!

Changes to the pricing model

Let’s briefly examine what’s displayed in the illustration above. Previously, every second of a Dataflow Gen2 run was billed at 16 CUs (CU stands for Capacity Unit, representing a bundled set of resources – CPU, memory, and I/O – used in synergy to perform a specific operation). Depending on the Fabric capacity size, you get a certain number of capacity units – F2 provides 2 CUs, F4 provides 4 CUs, and so on.

Going back to our Dataflows scenario, let’s break this down by using a real-life example. Say you have a Dataflow that runs for 20 minutes (1200 seconds)…

  • Previously, this Dataflow run would have cost you 19.200 CUs: 1200 seconds * 16 CUs
  • Now, this Dataflow run will cost you 8.100 CUs: 600 seconds (first 10 minutes) * 12 CUs + 600 seconds (after first 10 minutes) * 1.5 CUs
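The math above can be sketched as a small Python helper (the function names are mine; the per-second rates are the ones quoted in this article):

```python
# Sketch of the old vs. new Dataflow Gen2 billing logic described above.
# Rates: old = flat 16 CUs/s; new = 12 CUs/s for the first 10 minutes,
# then 1.5 CUs/s for the remainder of the run.
def old_cost(duration_s: int) -> float:
    return duration_s * 16

def new_cost(duration_s: int) -> float:
    first = min(duration_s, 600)      # first 10 minutes at 12 CUs/s
    rest = max(duration_s - 600, 0)   # everything after that at 1.5 CUs/s
    return first * 12 + rest * 1.5

print(old_cost(1200))  # 19200
print(new_cost(1200))  # 8100.0
```

Plugging in longer durations makes the effect obvious – the flat part of the new rate is capped at 10 minutes, so the savings grow with every extra second of runtime.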

The longer your Dataflow needs to execute, the bigger the savings in CUs you potentially make.

This is amazing on its own, but there is still more to it. I mean, it’s nice to be charged less for the same amount of work, but what if we could turn these 1200 seconds into, let’s say, 800 seconds? That wouldn’t save us just CUs – it would also reduce the time-to-analysis, since the data would be processed faster. And that’s exactly what the next two enhancements are all about…

Modern Evaluator

The new preview feature, named Modern Evaluator, enables using the new query execution engine (running on .NET core version 8) for running Dataflows. As per the official Microsoft docs, Dataflows running the modern evaluator can provide the following benefits:

  • Faster Dataflow execution
  • More efficient processing
  • Scalability and reliability

The illustration above shows the performance differences between various Dataflow “flavors”. Don’t worry, we will challenge these numbers soon in a demo, and I’ll also show you how to enable these latest enhancements in your Fabric workloads.

Partitioned Compute

Previously, a Dataflow’s transformation logic was executed sequentially. Hence, depending on the logic’s complexity, certain operations could take a while to complete, while other operations in the Dataflow had to wait in the queue. With the Partitioned Compute feature, a Dataflow can now execute parts of the transformation logic in parallel, thus reducing the overall time to complete.

At this moment, there are certain limitations on when the partitioned compute will kick in. Namely, only ADLS Gen2, Fabric Lakehouse, Folder, and Azure Blob Storage connectors can leverage this new feature. Again, we’ll explore how it works later in this article.
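To build intuition for why this matters, here is a minimal, purely conceptual Python sketch (this is NOT how Fabric implements the feature internally) of sequential versus parallel partition processing:

```python
# Conceptual illustration only - not Fabric's internal implementation.
# It just shows that processing independent partitions in parallel
# produces the same result in potentially less wall-clock time.
from concurrent.futures import ThreadPoolExecutor

def transform(partition: list) -> int:
    # stand-in for a per-file/per-partition transformation step
    return sum(x * 2 for x in partition)

partitions = [[1, 2], [3, 4], [5, 6]]

# sequential: one partition after another
sequential = [transform(p) for p in partitions]

# parallel: partitions processed concurrently
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(transform, partitions))

print(sequential, parallel)  # [6, 14, 22] [6, 14, 22]
```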

3, 2, 1…Action!

Ok, it’s time to challenge the numbers provided by Microsoft and check if (and to what degree) there is a performance gain between various Dataflows types.

Here is our scenario: I’ve generated 50 CSV files that contain dummy data about orders. Each file contains approximately 575.000 records, so there are ca. 29 million records in total (approximately 2.5 GB of data). All the files are stored in a SharePoint folder, allowing for a fair comparison, as Dataflow Gen1 doesn’t support a OneLake lakehouse as a data source.

I plan to run two series of tests. First, I’ll include the Dataflow Gen1 in the comparison. In this scenario, I won’t be writing the data into OneLake using Dataflows Gen2 (yeah, I know, it defeats the purpose of the Dataflow Gen2), as I want to compare “apples to apples” and exclude the time needed for writing data into OneLake. I will test the following four scenarios, in each of which I perform the same basic operations to combine and load the data, applying simple transformations (renaming columns, etc.):

  1. Use Dataflow Gen1 (the old Power BI dataflow)
  2. Use Dataflow Gen2 without any additional optimization enhancements
  3. Use Dataflow Gen2 with only the Modern evaluator enabled
  4. Use Dataflow Gen2 with both the Modern evaluator and Partitioned compute enabled

In the second series, I’ll compare three flavors of Dataflow Gen2 only (points 2-4 from the list above), with writing the data to a lakehouse enabled.

Let’s get started!

Dataflow Gen1

The entire transformation process in the old Dataflow Gen1 is fairly basic – I simply combined all 50 files into a single query, split columns by delimiter, and renamed columns. So, nothing really advanced happens here:

The same set of operations/transformations has been applied to all three Dataflows Gen2.

Please keep in mind that with Dataflow Gen1 it’s not possible to output the data as a Delta table in OneLake. All transformations are persisted within the Dataflow itself, so when you need this data, for example, in the semantic model, you need to take into account the time and resources needed to load/refresh the data in the import mode semantic model. But, more on that later.

Dataflow Gen2 without enhancements

Let’s now do the same thing, but this time using the new Dataflow Gen2. In this first scenario, I haven’t applied any of these new performance optimization features.

Dataflow Gen2 with Modern Evaluator

Ok, the moment of truth – let’s now enable the Modern Evaluator for Dataflow Gen2. I’ll go to the Options, and then under the Scale tab, check the Allow use of the modern query evaluation engine box:

Everything else stays exactly the same as in the previous case.

Dataflow Gen2 with Modern Evaluator and Partitioned Compute

In the final example, I’ll enable both new optimization features in the Options of the Dataflow Gen2:

Now, let’s proceed to the testing and analyzing results. I will execute all four dataflows in sequence from the Fabric pipeline, so we can be sure that each of them runs in isolation from the others.

And, here are the results:

Partitioning obviously didn’t help much in this particular scenario – I will investigate how partitioning works in more detail in one of the following articles, with different scenarios in place. The Dataflow Gen2 with the Modern Evaluator enabled outperformed all the others by far, achieving 30% time savings compared to the old Dataflow Gen1 and ca. 20% compared to the regular Dataflow Gen2 without any optimizations! Don’t forget, these time savings also translate into CU savings, so the estimated CU cost for each of the tested solutions is the following:

  • Dataflow Gen1: 550 seconds * 12 CUs = 6.600 CUs
  • Dataflow Gen2 with no optimization: 520 seconds * 12 CUs = 6.240 CUs
  • Dataflow Gen2 with Modern Evaluator: 368 seconds * 12 CUs = 4.416 CUs
  • Dataflow Gen2 with Modern Evaluator and Partitioning: 474 seconds * 12 CUs = 5.688 CUs

However, I wanted to double-check and confirm that my calculation is accurate. Hence, I opened the Capacity Metrics App and took a look at the metrics captured by the App:

Although the overall trend accurately reflects the numbers displayed in the pipeline execution log, the exact number of CUs consumed according to the App is different:

  • Dataflow Gen1: 7.788 CUs
  • Dataflow Gen2 with no optimization: 5.684 CUs
  • Dataflow Gen2 with Modern Evaluator: 3.565 CUs
  • Dataflow Gen2 with Modern Evaluator and Partitioning: 4.732 CUs

So, according to the Capacity Metrics App, a Dataflow Gen2 with the Modern Evaluator enabled consumed less than 50% of the capacity compared to the Dataflow Gen1 in this particular scenario! I plan to create more test use cases in the following days/weeks and provide a more comprehensive series of tests and comparisons, which will also include the time to write the data into OneLake (using Dataflows Gen2) versus the time needed to refresh an import mode semantic model that uses the old Dataflow Gen1.

Conclusion

When compared to other (code-first) options, Dataflows were (rightly?) considered “the slowest and least performant option” for ingesting data into Power BI/Microsoft Fabric. However, things are changing rapidly in the Fabric world, and I love how the Fabric Data Integration team makes constant improvements to the product. Honestly, I’m curious to see how Dataflows Gen2’s performance and cost develop over time, so that we can consider leveraging Dataflows not only for low-code/no-code data ingestion and data transformation requirements, but also as a viable alternative to code-first solutions from the cost/performance point of view.

Thanks for reading!

Update: Thanks to my friend Sandeep Pawar from the Fabric CAT team, I’ve been pointed to a great video by Pat Mahoney (Sandeep’s colleague from Microsoft) that provides more details about configuring partitioning in Dataflows Gen2.

Last Updated on October 16, 2025 by Nikola


Power BI Pro Trick: Sort Visuals by Fields NOT on the Chart! https://data-mozart.com/power-bi-pro-trick-sort-visuals-by-fields-not-on-the-chart/?utm_source=rss&utm_medium=rss&utm_campaign=power-bi-pro-trick-sort-visuals-by-fields-not-on-the-chart https://data-mozart.com/power-bi-pro-trick-sort-visuals-by-fields-not-on-the-chart/#respond Thu, 21 Aug 2025 11:12:29 +0000 https://data-mozart.com/?p=9022 Specific client requirements sometimes may force us to think out of the box and find not-so-obvious solutions

The post Power BI Pro Trick: Sort Visuals by Fields NOT on the Chart! appeared first on Data Mozart.

Recently, I was dealing with a Power BI report where the client had a very specific requirement – to sort the data in the visual based on a particular field from the semantic model. The only “issue” was that this particular field wasn’t part of the visual. So, while figuring out how this can be accomplished (because, yes, everything can be accomplished when the client needs it, hehe), I decided to write it down and share it with everyone who might find it useful.

Setting the stage

Let me briefly explain the client’s requirement first. For this example, I’ll be using the sample Adventure Works database.

As you may notice, I’m displaying the total sales amount per color and education level. Now, regular sorting allows you to sort the data based on the fields that are part of the visual – Color or Sales Amount:

However, this will sort the data points based on the OVERALL sales amount value. But what if I need to sort the data based on the sales amount value for a specific education level, such as High School (the orange one in our visual)? This was the exact requirement in my project.

DAX to the rescue!

As usual, DAX is our Swiss Army knife for solving numerous Power BI challenges! Let’s create a measure that calculates the sales amount for the “High School” education only:

Sales High School = CALCULATE(
                            SUM(FactInternetSales[SalesAmount]),
                            DimCustomer[EnglishEducation] = "High School"
)

Now that we have this new measure in place, we can drag it to the Tooltips section of the visual:

Once the measure is in the Tooltips section, click on the three dots of the visual again…

Voila! We can now sort data based on the High School education sales! As you might have seen, the visual changed accordingly.

What about visuals that don’t support Tooltips?

Fair question! Not all visuals in Power BI support the Tooltips feature. A good example is a Table visual, which is commonly used in Power BI reports. So, how do we make our client happy in this case? Fear not, I’ll show you a workaround!

In the following illustration, you see our starting point:

No Tooltips section, and the data in the Table visual is sorted by Education. But, remember, the requirement is to sort it based on the total sales for the High School education level. In this case, we can’t trick Power BI, and we really need to include our DAX measure in the Table visual:

Here, we will apply brute force and reduce the column width to 0 pixels to make it invisible…

Oh, God, what is this now??!! This looks so ugly, and it doesn’t accomplish the goal. Again, fear not:)

Simply open the Format pane and turn the Text wrap option off for both Values and Column headers:

Now, go back to the Table visual and collapse the Sales High School column until it becomes invisible:

If you click on the three dots for this Table visual, you can now sort the data based on the “invisible” column:

Yeah, I know it’s not the most sophisticated way of accomplishing the goal – but, hey, my client was more than happy with the solution. And, in most cases, that’s what counts in the end – a happy end user!

Have you found any better solutions to requirements like this? Drop me your ideas in the comments.

Thanks for reading!

Watch on YouTube:

Last Updated on August 29, 2025 by Nikola


From Blank Page to Brilliant Content: Mastering AI-Powered Writing https://data-mozart.com/from-blank-page-to-brilliant-content-mastering-ai-powered-writing/?utm_source=rss&utm_medium=rss&utm_campaign=from-blank-page-to-brilliant-content-mastering-ai-powered-writing https://data-mozart.com/from-blank-page-to-brilliant-content-mastering-ai-powered-writing/#respond Wed, 02 Jul 2025 07:52:53 +0000 https://data-mozart.com/?p=8963 Let's be honest – writing is hard! But what if I told you that AI tools like ChatGPT could become your writing partner, helping you break through blocks and create better content faster?

The post From Blank Page to Brilliant Content: Mastering AI-Powered Writing appeared first on Data Mozart.

Let’s be honest – writing is hard! Whether you’re crafting a business proposal, working on a blog post, or trying to get your thoughts organized for a presentation, that blank page you are staring at can feel like your biggest enemy. But what if I told you that AI tools like ChatGPT could become your writing partner, helping you break through blocks and create better content faster?

I’m not talking about having AI write everything for you (that’s a recipe for generic, robotic content). Instead, I’m talking about using AI as a sophisticated writing assistant – think of it as having a really smart intern who’s great at brainstorming, organizing, and polishing your ideas.

Why Most People Get AI Writing Wrong?

I’ll tell you this straight away: most people approach AI writing tools completely wrong – and I’ve been guilty of it myself. They throw a simple question at ChatGPT, such as: “Write me a blog post about remote work”, and expect magic to happen. What they get instead is generic content that sounds like it was written by, well, a robot.

The secret? AI is most effective when you treat it as a collaborative partner, rather than a magic content generation machine. You need to have conversations with it, give it direction, and iterate on ideas together.

The Building Blocks of Better AI Prompts

Before we dive into the writing process, let’s talk about how to actually communicate with AI effectively. A good prompt has several key ingredients:

  • Role – Tell the AI who it should be.
    • “As an experienced marketing manager…”
    • “As a fitness coach with 10 years of experience…”
  • Task – Be specific about what you want.
    • “Write a report”
    • “Create an outline”
    • “Generate ideas for…”
  • Format – How should the output look?
    • “As bullet points”
    • “In a table”
    • “As a step-by-step guide…”
  • Tone – What’s the vibe?
    • “Professional”
    • “Casual and friendly”
    • “Enthusiastic”
    • “Academic…”
  • Audience – Who’s reading this?
    • “For busy executives”
    • “For complete beginners”
    • “For technical experts…”
  • Scope – Set boundaries.
    • “Keep it under 500 words”
    • “Focus on the top 3 strategies”
    • “Cover the basics only…”

Let me show you the difference:

Bad prompt: “Write about email marketing”

Good prompt: “As an experienced digital marketing consultant, write a beginner’s guide to email marketing. Format it with clear headings and bullet points. Use an encouraging, friendly tone for small business owners who are just getting started. Keep it focused on the 5 most important strategies they can implement this week.”

See the difference? The second prompt gives the AI a clear role, specific instructions, and sets expectations.
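To make the six ingredients concrete, here is a tiny illustrative Python helper (entirely my own construction, not any AI vendor’s API) that assembles them into a single prompt string:

```python
# Illustrative only: a small prompt builder that stitches together the six
# ingredients discussed above (role, task, format, tone, audience, scope).
def build_prompt(role, task, fmt, tone, audience, scope):
    return (
        f"As {role}, {task}. "
        f"Format it with {fmt}. Use a {tone} tone. "
        f"Write it for {audience}. {scope}."
    )

prompt = build_prompt(
    role="an experienced digital marketing consultant",
    task="write a beginner's guide to email marketing",
    fmt="clear headings and bullet points",
    tone="encouraging, friendly",
    audience="small business owners who are just getting started",
    scope="Keep it focused on the 5 most important strategies they can implement this week",
)
print(prompt)
```

Keeping the ingredients as named parameters makes it easy to swap one out (say, the tone) while holding everything else constant – exactly the kind of iteration the conversation approach below relies on.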

The Four-Stage Writing Framework

Here’s where things get interesting. Instead of trying to write everything in one run, break your writing process into four distinct stages. Think of it like building a body:

The image was generated by AI

Stage 1: The Skeleton (Getting Ideas)

This is where you gather all your raw material – the bones of your piece. The goal isn’t to write yet – it’s just to collect ideas.

One powerful technique is using “three modes of persuasion” (based on Aristotle’s ethos, pathos, and logos) to generate comprehensive ideas:

  • Head (Logic): What are the facts, data, and logical arguments?
  • Heart (Emotion): What are the emotional angles and human stories?
  • Behavior (Ethics): What are the practical actions and ethical considerations?

Let’s say you’re writing about switching to renewable energy for businesses. You might ask:

  • “What are the factual, data-driven reasons a business should consider renewable energy?”
  • “What are the emotional and motivational factors that drive businesses to go renewable?”
  • “What are the practical steps and ethical considerations for businesses making this switch?”

This approach ensures you’re not missing major angles and provides you with rich, multi-dimensional content to work with.

Stage 2: The Muscles (Creating Structure)

Now you take those scattered ideas and build them into a motivated outline – the muscles that give your content shape and strength.

This is where the “Repeat and Refine” technique becomes your best friend. Start broad, then get specific:

“Give me 10 section ideas for an article about remote team management”

Once you get those options, pick the most promising and dig deeper:

“Give me 10 variations of the section about ‘Building Trust in Remote Teams'”

Then get even more specific:

“Create a detailed outline for ‘Building Trust in Remote Teams’ that covers both daily practices and long-term strategies”

Pro tip: If you have existing content (like a recorded presentation or meeting notes), you can just paste your rough content and ask: “Create a clear, organized outline from this material, including timing estimates for a 45-minute presentation.”

Stage 3: Trim the Fat (Refining Your Outline)

This is the stage most people skip, although it’s crucial. Take your outline and ruthlessly edit it. Ask yourself:

  • Can each section be covered in 4-5 pages or 10 minutes of speaking?
  • Does each section solve a specific problem?
  • What should stay, what should go, and what should be moved to an appendix?

But, the key thing to keep in mind is: do this trimming yourself, not with AI! You understand your audience and goals better than any AI tool.

Stage 4: The Skin (Writing the Content)

Finally, we get to the actual writing. But even here, we don’t have to start from scratch.

Try the so-called “Tent-Pole Sentences” technique:

  1. Write out the key points you want to make in each section as simple sentences
  2. Ask AI to expand each sentence into a full paragraph
  3. Edit and refine those paragraphs to match your voice

For example, if you’re writing about productivity tips, your tent-pole sentences might be:

  • “Most people confuse being busy with being productive”
  • “The best productivity system is the one you’ll actually use”
  • “Focus blocks are more valuable than scattered work sessions”

Then ask: “Expand each of these sentences into a compelling paragraph for a blog post about productivity. Use a conversational, slightly humorous tone for working professionals.”

Advanced Techniques: AI as Your Writing Toolbox

Once you’ve got the basics under your belt, AI becomes incredibly useful for specific writing tasks:

  • Grammar and Style: “Clean up the grammar in this paragraph while keeping the casual tone”
  • Vocabulary: “Give me 10 alternatives to the word ‘important’ that would work in this context”
  • Tone Adjustment: “Rewrite this email to sound more professional but still friendly”
  • Format Conversion: “Turn this technical explanation into a simple analogy that a 12-year-old could understand”
  • Perspective Shifts: “Rewrite this section in first person and change it to future tense”

The Conversation Approach

Here’s what separates good AI writing from great AI writing: treating it like a conversation, not a one-shot request.

Instead of: “Write me a marketing plan”

Try this flow:

  1. “I need to create a marketing plan for a local bakery. What are the key sections I should include?”
  2. “Great! Now focus on the social media section. What platforms would work best for a local bakery?”
  3. “Instagram sounds perfect. Give me 10 content ideas for an Instagram strategy for this bakery”
  4. “I love the ‘behind-the-scenes baking process’ idea. Give me 5 variations of that concept”

See how much better that conversation produces targeted, useful content?

Creating Visual Elements

Don’t forget that AI can help with more than just text. You can create simple diagrams and flowcharts:

“Create a Mermaid diagram showing the customer journey for an online course, from discovery to completion, including decision points”

This gives you visual elements that can make your writing more engaging and easier to follow.

Real-World Example: Planning a Workshop

Let me walk you through how this might work in practice. Imagine you need to plan a workshop on time management:

Stage 1 (Ideas):

  • “What are the logical, research-backed time management strategies?”
  • “What emotional blocks do people face with time management?”
  • “What are the practical first steps someone can take today?”

Stage 2 (Structure):

  • “Create an outline for a 3-hour time management workshop based on these ideas”
  • “Add timing estimates and interactive elements to this outline”

Stage 3 (Refine):

  • You need to do this manually, trimming sections that don’t fit your audience

Stage 4 (Write):

  • “Write engaging opening remarks that acknowledge people’s frustration with time management advice”
  • “Create 3 practical exercises for the ‘energy management’ section”

What to Watch Out For?

A few important warnings to bear in mind:

  • Don’t trust AI blindly – Always fact-check and verify information, especially statistics or technical details.
  • Avoid over-reliance – Use AI to augment your thinking, not replace it. The best ideas still come from your unique perspective and experience.
  • Mind the voice – AI tends toward a generic, corporate tone. Always edit to match your authentic voice.
  • Check for accuracy – AI can confidently present wrong information. When in doubt, verify with reliable sources.

Putting It All Together

The key to success with AI writing tools isn’t learning fancy prompts or tricks – it’s developing a systematic approach that you can repeat. Begin with simple projects and gradually progress to more complex writing tasks.

Remember, AI is like having a really smart research assistant and brainstorming partner. It’s not going to replace your creativity or judgment, but it can dramatically speed up the parts of writing that used to bog you down: getting started, organizing ideas, and polishing drafts.

The goal isn’t to let AI do your writing for you. It’s to spend less time struggling with blank pages and more time crafting ideas that matter. When you approach it as a collaboration rather than a shortcut, that’s when the real magic happens.

Your Next Steps

Ready to try this approach? Here’s what I recommend:

  1. Pick a simple project – Start with something low-stakes, such as a blog post
  2. Work through the four stages – Don’t try to write everything at once
  3. Experiment with different AI tools – ChatGPT, Claude, Gemini – they each have strengths
  4. Keep a prompt library – Save the prompts that work well for you
  5. Practice the conversation approach – Treat each writing session as a back-and-forth dialogue

The future of writing isn’t about humans versus AI – it’s about humans working with AI to create better content faster. Once you figure it out, you’ll wonder how you ever wrote any other way.

Thanks for reading!

P.S. This article was created by AI, applying techniques covered in the article:)

Last Updated on July 2, 2025 by Nikola


Triple Five – How to Perform Common Data Transformations in Fabric! https://data-mozart.com/triple-five-how-to-perform-common-data-transformations-in-fabric/?utm_source=rss&utm_medium=rss&utm_campaign=triple-five-how-to-perform-common-data-transformations-in-fabric https://data-mozart.com/triple-five-how-to-perform-common-data-transformations-in-fabric/#comments Tue, 27 May 2025 07:43:35 +0000 https://data-mozart.com/?p=8908 Discover how to perform 5 essential data transformations in Microsoft Fabric using PySpark, T-SQL, and KQL. Whether you're coding in Python, SQL, or Kusto, this post shows you how to clean, shape, and master your data - your way!

The post Triple Five – How to Perform Common Data Transformations in Fabric! appeared first on Data Mozart.

Data transformations are the bread and butter of every analytics project. Let’s be honest – raw data is usually – well, raw – which in many cases means messy, invalid, and inconsistent. Before you can extract valuable insights, you need to transform the data into something clean, structured, and ready for analysis.

By this point, if we talk about Microsoft Fabric specifically, you are probably aware that the same task can be done in multiple different ways. I’ve already written about how to pick the most suitable analytical engine. But, even then, the possibilities are tremendous. By that, I mean that choosing the analytical engine doesn’t prevent you from mixing more than one programming language for data transformations.

In the lakehouse, for example, you can transform the data by using PySpark, but also Spark SQL, which is VERY similar to Microsoft’s dialect of SQL, Transact-SQL (T-SQL for short). In the warehouse, you can apply transformations using T-SQL, but Python is also an option by leveraging the pyodbc library. Finally, in the KQL database, you can run both KQL and T-SQL statements. As you may rightly assume, the lines are blurred, and sometimes the path is not 100% clear.

Therefore, in this article, I’ll explore five common data transformations and how to perform each one using three Fabric languages: PySpark, T-SQL, and KQL.

Extracting date parts

Let’s start by examining how to extract date parts from a date column. This is a very common requirement when you have only the date column and need to extend date-related information with additional attributes, such as year, month, etc.

Imagine that we have a source data that contains the OrderDate column only, and our task is to create the Year and Month columns out of it.

from pyspark.sql.functions import col, year, month

# Create Year and Month columns
transformed_df = df.withColumn("Year", year(col("OrderDate"))) \
                   .withColumn("Month", month(col("OrderDate")))

The same can be done using T-SQL…

--Extracting date parts
SELECT SalesOrderNumber
    , OrderDate
    , YEAR(OrderDate) as OrderYear
    , MONTH(OrderDate) as OrderMonth
    , DATENAME(WEEKDAY, OrderDate) as DayOfWeek
FROM dbo.orders

…and KQL:

//extracting date parts
Weather
| extend dateYear = datetime_part("year", StartTime)
         , dateMonth = datetime_part("month", StartTime)
         , dayOfWeek = dayofweek(StartTime)

The only difference compared to T-SQL is that the dayofweek KQL function will return the number of days since the preceding Sunday, as a timespan.
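For intuition, here is a pure-Python sketch (my own, not part of KQL) that reproduces the described dayofweek semantics, i.e. the number of days elapsed since the preceding Sunday:

```python
# Pure-Python equivalent of KQL's dayofweek() semantics as described above
# (days since the preceding Sunday). Python's weekday() returns
# Monday=0..Sunday=6, so we shift it to Sunday=0..Saturday=6.
from datetime import date

def days_since_sunday(d: date) -> int:
    return (d.weekday() + 1) % 7

print(days_since_sunday(date(2025, 5, 25)))  # a Sunday -> 0
print(days_since_sunday(date(2025, 5, 26)))  # a Monday -> 1
```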

Uppercase/lowercase transformation

Let’s learn how to apply uppercase/lowercase transformation in PySpark:

df_transformed = df.selectExpr(
    "SalesOrderNumber as order_ID",
    "upper(CustomerName) as customer_name_upper"
)
df_transformed.show()

Now, with T-SQL:

-- Alias, UPPER transformation
SELECT SalesOrderNumber as order_id
    , UPPER(CustomerName) as upper_cust_name
FROM dbo.orders

And, finally, let’s lowercase the state names by using KQL:

//select and rename columns
Weather
| project State
          , LowerState = tolower(State)

Conditional column

  • PySpark – in the following example, we are creating a conditional column price_range based on the value in the existing UnitPrice column. If the value is greater than 1000, we set price_range to “High”; if it’s greater than 500, price_range is “Medium”; for all other values, price_range returns “Low”. You might be thinking: wait, a UnitPrice value of, let’s say, 2000 could match both High and Medium. However, the expressions are evaluated in the order you defined them in the statement: first, it checks whether the value is greater than 1000, and if this evaluates to true, no further checks are performed.
from pyspark.sql.functions import col, when

df = df.withColumn(
    "price_range",
    when(col("UnitPrice") > 1000, "High")
    .when(col("UnitPrice") > 500, "Medium")
    .otherwise("Low")
)
df.select("SalesOrderNumber","OrderDate","CustomerName","Item","Quantity","UnitPrice", "price_range").show()
  • T-SQL
--Conditional column
SELECT SalesOrderNumber
    , OrderDate
    , CustomerName
    , Item
    , Quantity
    , UnitPrice
    , CASE 
        WHEN UnitPrice > 1000 THEN 'High'
        WHEN UnitPrice > 500 THEN 'Medium'
        ELSE 'Low'
    END AS PriceRange
FROM dbo.orders
  • KQL
//conditional column
Weather
| extend DamagePropertyClass = case (
    DamageProperty > 49999, "High"
    , DamageProperty > 9999, "Medium"
    , "Low"
)
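The first-match-wins evaluation order described above can also be illustrated with a plain Python sketch (illustrative only, mirroring the when()/CASE logic):

```python
# First-match-wins evaluation, mirroring the PySpark when()/otherwise() and
# SQL CASE logic above: a UnitPrice of 2000 matches the ">1000" branch first,
# so the ">500" branch is never evaluated for it.
def price_range(unit_price: float) -> str:
    if unit_price > 1000:
        return "High"
    if unit_price > 500:
        return "Medium"
    return "Low"

print([price_range(p) for p in [2000, 750, 100]])  # ['High', 'Medium', 'Low']
```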

Pivot transformation

Sometimes, we need to tweak the structure of the underlying data. Pivot transformation is a common example of transforming rows into columns. In this case, I’m creating a new dataframe df_pivot, which first groups the data by the order date, then pivots the values from the price_range column (remember those Highs, Mediums, and Lows from the previous example), and calculates the sum of the unit price.

df_pivot = df.groupBy("OrderDate").pivot("price_range").sum("UnitPrice")

df_pivot.show()

The T-SQL implementation requires slightly more complex code, as you may notice in the following snippet:

--PIVOT Transformation
SELECT 
    OrderDate 
    , [High] 
    , [Medium] 
    , [Low] 
FROM (
    SELECT 
        OrderDate
        , UnitPrice
        , CASE 
        WHEN UnitPrice > 1000 THEN 'High'
        WHEN UnitPrice > 500 THEN 'Medium'
        ELSE 'Low'
    END AS PriceRange
    FROM dbo.orders
) AS SourceTable
PIVOT (
    SUM(UnitPrice)
    FOR PriceRange IN ([High], [Medium], [Low])
) AS PivotTable;

Unlike PySpark and T-SQL, which both provide a built-in pivot function, there is no such function in KQL. However, there is a pivot plugin, and we can mimic PIVOT behavior by combining the summarize operator with the evaluate operator. In this example, we are summing direct injuries per year, month, and state, and then pivoting the states into columns.

//Pivot transformation
Weather
| summarize Total = sum(InjuriesDirect) by datetime_part("year", StartTime), datetime_part("month", StartTime), State
| evaluate pivot(State, sum(Total))
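Conceptually, all three pivots do the same thing: group rows by a key, spread the distinct values of one column into new columns, and aggregate what remains. A minimal plain-Python sketch of that mechanic (the sample rows and values are invented for illustration):

```python
rows = [
    {"OrderDate": "2024-01-01", "price_range": "High",   "UnitPrice": 1200.0},
    {"OrderDate": "2024-01-01", "price_range": "Low",    "UnitPrice": 100.0},
    {"OrderDate": "2024-01-02", "price_range": "Medium", "UnitPrice": 600.0},
    {"OrderDate": "2024-01-02", "price_range": "Medium", "UnitPrice": 700.0},
]

# Group by OrderDate, turn price_range values into columns, sum UnitPrice
pivoted = {}
for r in rows:
    day = pivoted.setdefault(r["OrderDate"], {})
    day[r["price_range"]] = day.get(r["price_range"], 0.0) + r["UnitPrice"]

print(pivoted)
# {'2024-01-01': {'High': 1200.0, 'Low': 100.0}, '2024-01-02': {'Medium': 1300.0}}
```

Dates with no rows for a given price_range simply have no key for it, which is why the SQL PIVOT fills those cells with NULL.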

Aggregation and grouping

Let’s wrap it up by examining how to perform aggregation and grouping operations when transforming the data. This is one of the most frequent requirements in analytical scenarios.

First, PySpark. I’m grouping the data by the Year and Month columns, and then performing the sum aggregate function on the UnitPrice column. In the last step, we can also rename the resulting column if needed, which I did here.

sales_by_yearMonth = orders_df.groupBy("Year", "Month").agg(
    {"UnitPrice": "sum"}
).withColumnRenamed("sum(UnitPrice)", "Unit_Price")

sales_by_yearMonth.show()

Now, T-SQL:

--Aggregate and group data
SELECT YEAR(OrderDate) as OrderYear
    , MONTH(OrderDate) as OrderMonth
    , count(*) as TotalOrders
    , SUM(UnitPrice) as TotalUnitPrice
FROM dbo.orders
GROUP BY YEAR(OrderDate)
    , MONTH(OrderDate)

And, finally, KQL:

//aggregate and group data
//Use summarize operator
Weather
| summarize TotalRecords = count() by State
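Under the hood, every one of these snippets follows the same split-apply-combine pattern: bucket rows by a key, then fold an aggregate over each bucket. A plain-Python sketch of the year/month grouping (the sample orders are invented for illustration):

```python
from collections import defaultdict
from datetime import date

orders = [
    (date(2024, 1, 5), 100.0),
    (date(2024, 1, 20), 250.0),
    (date(2024, 2, 3), 80.0),
]

# Key each order by (year, month), then accumulate per group
totals = defaultdict(float)
counts = defaultdict(int)
for order_date, unit_price in orders:
    key = (order_date.year, order_date.month)
    totals[key] += unit_price
    counts[key] += 1

for (year, month), total in sorted(totals.items()):
    print(year, month, counts[(year, month)], total)
```

The engines differ mainly in how they parallelize this loop, not in the logical shape of the operation.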

Conclusion

By understanding and leveraging the strengths of PySpark, T-SQL, and KQL, you can efficiently handle numerous data transformation tasks in Microsoft Fabric. Whether it’s filtering, aggregating, pivoting, cleaning, or enriching the data, choosing the right language for each scenario will enable faster and more resilient workflows, enhance productivity, and deliver more value from the raw data.

Thanks for reading!

Last Updated on May 27, 2025 by Nikola

The post Triple Five – How to Perform Common Data Transformations in Fabric! appeared first on Data Mozart.
