Unifying Real-Time Intelligence and Advanced Query Optimization in Microsoft Fabric

1. Introduction: The Shift Towards Integrated Data Platforms

Driven by overwhelming customer demand for simplicity, mounting regulatory pressure for open standards, and the need for greater internal engineering efficiency, the data and analytics industry is undergoing a fundamental shift away from siloed data engines toward highly integrated solutions. This evolution marks a departure from the “one size does not fit all” paradigm that led to a complex collection of specialized tools. Microsoft Fabric emerges as a direct response to this evolution—a comprehensive, AI-powered platform designed to unify every data source and analytics service into a single, cohesive experience.

This whitepaper provides a technical analysis of two key innovations within the Fabric ecosystem that exemplify this unified approach. First, we will explore its native Real-Time Intelligence (RTI) capabilities, an integrated suite of tools designed for ingesting, processing, and acting on streaming data with minimal latency. Second, we will examine the forward-looking concept of a Query Optimizer as a Service (QOaaS), a visionary approach to centralizing and standardizing query optimization across the platform’s diverse analytical engines. To fully appreciate these advancements, we must first understand the foundational architecture upon which they are built.

2. The Foundation: Microsoft Fabric’s Unified Architecture

To understand the advanced capabilities of Microsoft Fabric, one must first grasp its core architectural principles. The platform’s power stems from its unified data lake, lakehouse paradigm, and multi-engine design, which collectively eliminate the data silos and integration friction that have long characterized enterprise data landscapes. This foundation not only simplifies data management but also fosters a new level of collaboration between data engineers, data scientists, and business analysts.

2.1 OneLake: The “OneDrive for Data”

At the heart of Microsoft Fabric is OneLake, a single, unified, logical data lake for an entire organization. Like OneDrive, OneLake is automatically provisioned with every Fabric tenant, serving as the central repository for all analytics data and eliminating the overhead of creating and managing multiple data lakes for different business groups. Its core characteristics are designed for simplicity, governance, and openness.

  • A Single Data Lake: Every Fabric tenant has exactly one OneLake, removing the need to provision or manage additional resources. This single-source approach is fundamental to improving collaboration and simplifying the data estate.
  • Governed by Default: Within the tenant, data is organized into workspaces, which enable different parts of the organization to establish distributed ownership and apply specific access policies. This structure ensures that while the data lake is unified, governance can be delegated and managed according to business needs.
  • Built on Open Standards: OneLake is built on Azure Data Lake Storage (ADLS) Gen2. Critically, it automatically stores all tabular data—whether from a data warehouse or a lakehouse—in the open Delta Parquet format. This commitment to open standards prevents vendor lock-in and ensures broad compatibility.

Building on this model, Fabric introduces Domains as a higher-level organizational layer for logically grouping workspaces according to business-centric areas like “Sales,” “Marketing,” or “Finance.” This capability allows for true federated governance by delegating specific settings and controls to domain administrators, effectively enabling organizations to implement a “data mesh as a service” paradigm on OneLake.

A key enabler of this unified vision is Shortcuts. Shortcuts are references to data stored in other locations, whether within OneLake or in external storage like Amazon S3. They allow organizations to combine data from disparate sources into a virtual data product without requiring physical data movement or duplication, making files and folders appear as if they are stored locally within OneLake.
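Conceptually, a shortcut behaves like a symbolic link: a logical path inside OneLake resolves to a physical location that may live elsewhere. The following is a minimal, purely illustrative sketch of that resolution idea—all names and paths are hypothetical, and this is not the OneLake implementation.

```python
# Illustrative sketch of shortcut resolution: a local-looking OneLake path
# maps to an external storage URI without any data being copied.
# All names and paths are hypothetical.

class ShortcutResolver:
    def __init__(self):
        # Maps a OneLake-style folder prefix to an external storage URI.
        self._shortcuts = {}

    def add_shortcut(self, onelake_path, target_uri):
        self._shortcuts[onelake_path.rstrip("/")] = target_uri.rstrip("/")

    def resolve(self, path):
        # Return the physical location backing a logical OneLake path.
        for prefix, target in self._shortcuts.items():
            if path == prefix or path.startswith(prefix + "/"):
                return target + path[len(prefix):]
        return path  # Not a shortcut: the data lives in OneLake itself.

resolver = ShortcutResolver()
resolver.add_shortcut(
    "/SalesWorkspace/Lakehouse/Files/clickstream",
    "s3://partner-bucket/clickstream",
)
print(resolver.resolve("/SalesWorkspace/Lakehouse/Files/clickstream/2024/01.parquet"))
# → s3://partner-bucket/clickstream/2024/01.parquet
```

To a consuming engine, the resolved file is indistinguishable from one stored natively in OneLake, which is what makes the "virtual data product" possible.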

2.2 The Multi-Engine Lakehouse

Fabric’s architecture allows different analytical engines—including T-SQL, Apache Spark, and Analysis Services—to operate on a single copy of data stored in OneLake. Because all tabular data is persisted in the Delta Parquet format, a data engineer can use Spark to load data into a lakehouse, and a SQL developer can use T-SQL to load data into a data warehouse, with both contributing to the same underlying data lake.

This multi-engine, single-copy strategy marks a deliberate departure from traditional architectures: by enforcing a clean separation between compute and storage, it eliminates redundant storage costs, reduces data consistency issues, and lets teams choose the best-fit engine for each task without duplicating data or locking it into a single system.

Power BI directly benefits from this architecture through its revolutionary Direct Lake mode. This mode connects the Power BI Analysis Services engine directly to the Delta Parquet files in OneLake, combining the high performance of traditional data import with the real-time nature of direct query. This is achieved without duplicating the data into a separate Power BI semantic model, a powerful demonstration of Fabric’s unified promise. This foundational ability to decouple compute engines from a single copy of data in OneLake is not merely an architectural elegance; it is the critical enabler for complex, high-throughput workloads, most notably the real-time analytics ecosystem we will now explore.

3. Deep Dive: Real-Time Intelligence (RTI) in Microsoft Fabric

In today’s fast-paced business environment, the ability to analyze and act on data as it happens is no longer a luxury but a strategic necessity. Microsoft Fabric addresses this imperative with Real-Time Intelligence (RTI), a powerful and fully integrated ecosystem designed to handle every stage of the real-time data lifecycle—from ingestion and processing to analysis, visualization, and automated action—with minimal latency.

3.1 The RTI Component Ecosystem

The RTI experience in Fabric is delivered through a suite of interconnected components, each with a specific function in the end-to-end data flow.

Eventstream

This is the foundational component for real-time data ingestion. Eventstream acts as an event listener, capturing and processing data pushed from a variety of sources, including Azure Event Hubs, Apache Kafka, and Azure IoT Hub. It provides a no-code canvas for transforming and enriching data in motion—such as filtering, joining, and aggregating—before routing it to a destination within Fabric.
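The kind of in-flight transformation Eventstream's no-code canvas performs can be sketched in a few lines—here, a filter followed by a per-device aggregation. This is a conceptual analogy in plain Python, not the Eventstream API; the event shape and threshold are invented for illustration.

```python
# Sketch of an Eventstream-style transformation on data in motion:
# filter out low readings, then aggregate the rest per device.
# Event shape and threshold are hypothetical.
from collections import defaultdict

def transform(events, min_value):
    # Filter: drop readings below a threshold.
    kept = [e for e in events if e["value"] >= min_value]
    # Aggregate: average value per device.
    sums, counts = defaultdict(float), defaultdict(int)
    for e in kept:
        sums[e["device"]] += e["value"]
        counts[e["device"]] += 1
    return {d: sums[d] / counts[d] for d in sums}

events = [
    {"device": "sensor-1", "value": 21.5},
    {"device": "sensor-1", "value": 22.5},
    {"device": "sensor-2", "value": 3.0},   # filtered out
]
print(transform(events, min_value=10.0))  # → {'sensor-1': 22.0}
```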

Eventhouse

Eventhouse is a high-performance, scalable storage solution for event-driven data, functionally equivalent to Azure Data Explorer. It contains one or more KQL databases, which are highly optimized for time-series and log analytics, capable of querying billions of records in seconds.

Activator

Real-time intelligence is about taking action, not just observing. Activator is a rule-based engine that monitors incoming data streams for specific patterns, thresholds, or anomalies. When a predefined condition is met, Activator can trigger automated actions and alerts, integrating with tools like Power Automate to create sophisticated, event-driven workflows that respond to insights as they are discovered.
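The core pattern—a rule watches each incoming event and fires an action when a condition is met—can be sketched as follows. This is a toy illustration of the concept, not Activator itself; the rule factory and alert format are hypothetical.

```python
# Toy sketch of an Activator-style rule: watch a stream for a threshold
# condition and trigger an action when it is met. Names are hypothetical.

def make_rule(field, threshold, action):
    def rule(event):
        # Fire the action only when the watched field exceeds the threshold.
        if event.get(field, 0) > threshold:
            return action(event)
        return None
    return rule

alerts = []
rule = make_rule(
    "temperature",
    threshold=90,
    action=lambda e: alerts.append(f"ALERT: {e['device']} at {e['temperature']}"),
)

for event in [{"device": "freezer-3", "temperature": 40},
              {"device": "freezer-3", "temperature": 95}]:
    rule(event)

print(alerts)  # → ['ALERT: freezer-3 at 95']
```

In the real service, the action side of the rule would be a Power Automate flow, an email, or a Teams notification rather than an in-memory list.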

KQL Queryset

For deep exploration and analysis of real-time data, Fabric provides the KQL Queryset. This interface allows users to run queries using the powerful Kusto Query Language (KQL) against data stored in an Eventhouse. KQL is designed for high-speed pattern matching, anomaly detection, and aggregation, making it ideal for uncovering insights from massive volumes of event data.
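To convey the flavor of such a query, the sketch below mimics in plain Python what a KQL pipeline like `Events | where Level == "Error" | summarize count() by bin(Timestamp, 1h)` computes. This is purely an analogy—KQL itself runs inside the Eventhouse—and the event shape is invented.

```python
# Python analogy to a KQL aggregation:
#   Events | where Level == "Error" | summarize count() by bin(Timestamp, 1h)
# Timestamps here are plain epoch seconds, so 1h == 3600.
from collections import Counter

def errors_per_hour(events):
    return Counter(
        e["timestamp"] - e["timestamp"] % 3600   # bin to the hour
        for e in events
        if e["level"] == "Error"
    )

events = [
    {"timestamp": 1000, "level": "Error"},
    {"timestamp": 1500, "level": "Info"},
    {"timestamp": 2000, "level": "Error"},
    {"timestamp": 4000, "level": "Error"},
]
print(errors_per_hour(events))  # → Counter({0: 2, 3600: 1})
```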

3.2 Visualization and Centralized Management

Real-Time Dashboards provide live, interactive visualizations of streaming data with minimal latency. Functionally equivalent to Azure Data Explorer dashboards, these are built specifically for monitoring live data streams and do not use Power BI. They enable organizations to track rapidly changing KPIs and detect anomalies instantly, without first loading data into a semantic model.

While both RTI and Power BI can visualize real-time data, they are optimized for different use cases:

| Aspect | Real-Time Dashboard | Power BI Report |
| --- | --- | --- |
| Primary use case | Built specifically for monitoring live data streams; commonly used for operational monitoring, alerting, and tracking key performance indicators (KPIs) that change rapidly. | Excellent for historical analysis, trend discovery, and batch reporting. |
| Data latency | Near-instant updates optimized for low-latency event tracking. | Can introduce some latency depending on query complexity and whether it uses streaming datasets or DirectQuery mode. |
| Processing method | Visualizes data directly, without the need to load it into a semantic model. | Requires streaming datasets or DirectQuery mode. |

Finally, the Real-Time Hub serves as the centralized control panel for the entire RTI ecosystem. It provides a single interface for configuring, monitoring, and managing all RTI components, giving users complete visibility and control over their real-time data flows.

3.3 Ecosystem Integration and Use Cases

A critical feature bridging RTI with the rest of the Fabric platform is “OneLake availability” for Eventhouse. When enabled, this feature makes the KQL database tables discoverable and queryable from other Fabric engines via OneLake shortcuts, which avoids data movement. The data is made available in OneLake in the Delta Lake format, allowing users to create Power BI reports in Direct Lake mode, query the data from a Warehouse, or analyze it in a Lakehouse notebook—all without moving or duplicating the original data.

This integration unlocks powerful use cases, such as operational reporting offloading. Instead of running real-time inventory reports directly against a transactional system like an Azure SQL Database, an organization can use Eventstream with change data capture (CDC) to replicate data to an Eventhouse in milliseconds. The reports can then be run against the Eventhouse, providing up-to-the-second insights while significantly reducing the computational load on the critical source system. While RTI demonstrates the power of integrating specialized engines on OneLake, this multi-engine paradigm introduces a profound architectural challenge: how can query performance be optimized holistically when T-SQL, Spark, and KQL engines each have their own distinct optimizers? This challenge motivates the forward-looking vision for a unified Query Optimizer as a Service.
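The mechanics of this offloading pattern are simple: an ordered change feed from the transactional source is applied to a read-optimized replica, and reports query the replica. The toy sketch below shows the idea; the change-event shape is hypothetical and stands in for what Eventstream's CDC connector would deliver.

```python
# Toy sketch of CDC-style replication: apply an ordered change feed from a
# transactional source to a read-optimized replica so that reports hit the
# replica, not the source. The event shape is hypothetical.

def apply_changes(replica, change_feed):
    for change in change_feed:
        key = change["key"]
        if change["op"] in ("insert", "update"):
            replica[key] = change["row"]
        elif change["op"] == "delete":
            replica.pop(key, None)
    return replica

inventory = {}
feed = [
    {"op": "insert", "key": "sku-1", "row": {"sku": "sku-1", "qty": 10}},
    {"op": "update", "key": "sku-1", "row": {"sku": "sku-1", "qty": 7}},
    {"op": "insert", "key": "sku-2", "row": {"sku": "sku-2", "qty": 4}},
    {"op": "delete", "key": "sku-2"},
]
print(apply_changes(inventory, feed))  # → {'sku-1': {'sku': 'sku-1', 'qty': 7}}
```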

4. The Future of Query Processing: Query Optimizer as a Service (QOaaS)

While integrated platforms like Microsoft Fabric solve the problem of data silos, they introduce a new layer of complexity: ensuring optimal query performance across multiple, distinct analytical engines operating on the same data. The challenge, therefore, is how to unify query optimization to deliver the best performance, regardless of which engine is used. The concept of a Query Optimizer as a Service (QOaaS) emerges as a visionary solution to this challenge—a centralized service designed to rule them all.

Three architectural models—a Custom QO, a QO as a Library, and QOaaS—can be compared along seven dimensions: innovation speed, engineering efficiency, time-to-market for new engines, cross-engine optimization, QO scalability, workload observability, and workload optimization.

4.1 Evaluating Query Optimizer Architectures

Historically, query optimizers (QOs) have been built as bespoke components tightly coupled to a specific engine (Custom QO) or as shared libraries that can be reused (QO as a Library). The QOaaS model proposes a radical decoupling, where the optimizer operates as a standalone service that multiple engines can call. This approach offers compelling advantages for a unified ecosystem.
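The decoupling itself is straightforward to picture: engines stop embedding their own optimizer and instead call a shared service. The sketch below is purely conceptual—every interface and name is hypothetical—but it shows the architectural shape: one optimizer, many callers.

```python
# Conceptual sketch of the QOaaS decoupling: several engines call one
# shared optimizer service instead of each embedding its own.
# All interfaces are hypothetical.

class OptimizerService:
    def optimize(self, logical_plan, engine):
        # A real service would run a cost-based search; here we simply
        # record which engine asked, to show one optimizer serving many.
        return {"plan": logical_plan, "optimized_for": engine}

class Engine:
    def __init__(self, name, qo_service):
        self.name, self.qo = name, qo_service

    def run(self, logical_plan):
        # Delegate optimization to the central service.
        return self.qo.optimize(logical_plan, self.name)

qoaas = OptimizerService()
spark = Engine("spark", qoaas)
tsql = Engine("tsql", qoaas)
print(spark.run("join(sales, items)")["optimized_for"])  # → spark
print(tsql.run("join(sales, items)")["optimized_for"])   # → tsql
```

Because both engines hand their plans to the same service, that service can observe and optimize workloads that span engines, which no per-engine optimizer can do.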

Across these dimensions, a QOaaS architecture excels where the other models fall short, particularly in its ability to perform cross-engine and workload-level optimizations.

4.2 Prototyping QOaaS in Microsoft Fabric

To test the feasibility of this vision, an experimental prototype was developed within Microsoft Fabric. The goal was to use the mature, cost-based Unified Query Optimizer (UQO) from the Fabric Data Warehouse to optimize queries for the Fabric Spark engine, which has limited cost-based optimization (CBO) features, such as join re-ordering and broadcast-vs-shuffle join decisions. The key to enabling communication between the engines and the central optimizer was Substrait, an open-source, language-neutral specification for representing relational algebra.

Two prototype approaches were evaluated:

  1. QOaaS-v1: In this model, the UQO generated an optimized logical plan. This plan was then passed back to the Spark QO, which was responsible for the final physical implementation.
  2. QOaaS-v2: Here, the UQO produced an optimized logical plan that included physical implementation hints (e.g., recommending a broadcast join over a shuffle join). The Spark QO was modified to consider these hints when generating the final physical plan.
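The contrast between the two prototypes can be sketched as a plan that either carries only a logical join order (v1) or is additionally annotated with physical hints (v2) that the engine's own QO may honor. Everything below is a hypothetical simplification—the broadcast threshold, table statistics, and the "smallest first" heuristic are invented stand-ins for the UQO's actual cost-based search.

```python
# Hypothetical sketch contrasting the two prototypes: v1 returns only an
# optimized logical join order; v2 also annotates each join input with a
# physical hint (broadcast vs. shuffle). Threshold and stats are invented.

BROADCAST_LIMIT = 10_000_000  # bytes; an assumed tuning knob

def qoaas_v1(tables):
    # Logical optimization only: join the smallest tables first.
    return sorted(tables, key=lambda t: t["rows"])

def qoaas_v2(tables):
    ordered = qoaas_v1(tables)
    # Attach a physical hint per join input based on estimated size.
    return [
        {**t, "hint": "broadcast" if t["bytes"] < BROADCAST_LIMIT else "shuffle"}
        for t in ordered
    ]

tables = [
    {"name": "sales", "rows": 10_000_000, "bytes": 4_000_000_000},
    {"name": "region", "rows": 50, "bytes": 2_000},
]
print([t["name"] for t in qoaas_v1(tables)])
# → ['region', 'sales']
print([(t["name"], t["hint"]) for t in qoaas_v2(tables)])
# → [('region', 'broadcast'), ('sales', 'shuffle')]
```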

Experiments using the MSSales workload, characterized by join-heavy queries, revealed the significant potential of the QOaaS model. For 40-way joins, QOaaS-v1 improved performance by nearly 2x over the native Spark QO simply by producing a better join order. QOaaS-v2, with its additional physical implementation hints, improved performance further on most queries, a gain attributed to the UQO's more sophisticated cost-based optimization. However, tests on TPC-H Query 5 revealed a 1.5x slowdown with QOaaS-v2. This regression occurred because optimization decisions made by the UQO prevented the Spark QO from applying its Bloom filter optimization effectively, demonstrating the sub-optimality of layering one optimizer on top of another.
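Why a better join order alone can yield such gains is easy to see with a back-of-the-envelope cost model: take the total size of intermediate results as the cost proxy for a left-deep join chain. The cardinalities and the fixed selectivity below are invented purely for illustration.

```python
# Back-of-the-envelope illustration of why join order matters: cost proxy
# is the total size of intermediate results in a left-deep join chain.
# Cardinalities and the 1/1000 selectivity are invented.

def plan_cost(cardinalities):
    # Each join multiplies rows by the next table's cardinality and a
    # fixed selectivity of 1/1000 (integer arithmetic keeps this exact).
    rows, cost = cardinalities[0], 0
    for card in cardinalities[1:]:
        rows = rows * card // 1000
        cost += rows
    return cost

good_order = [100, 1_000, 10_000]   # smallest tables first
bad_order = [10_000, 1_000, 100]    # largest table first

print(plan_cost(good_order))  # → 1100
print(plan_cost(bad_order))   # → 11000
```

Under this toy model, starting with the large table makes every intermediate result large, inflating total cost by an order of magnitude even though both plans join the same three tables.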

4.3 Key Challenges and Architectural Vision

The prototype surfaced critical lessons. The “patchy” approach of stacking two optimizers proved suboptimal, as decisions made by one could inadvertently block optimizations in the other. A truly unified QO must consider all optimization opportunities jointly. The experiments also highlighted the critical need to automatically recalibrate the cost model for different engines and workloads, as a single set of hard-coded cost parameters is insufficient. The use of ML-powered infrastructure like MLOS for automatic tuning was identified as a key enabler for this.

Based on these findings, a more comprehensive QOaaS architecture was proposed. This future-state design includes novel components to enable continuous learning and improvement:

  • Query Insight Store: A repository to capture historical query plans, execution times, and actual cardinalities from across all engines.
  • External Tuner Plugin: An interface allowing advanced, ML-based processes to analyze the data in the Query Insight Store and feed learned optimizations back into the system.
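The feedback loop these two components form can be sketched simply: the store records estimated versus actual cardinalities across engines, and a tuner derives a correction the optimizer applies next time. The sketch below is a deliberately naive stand-in for ML-based tuning—every interface is hypothetical.

```python
# Sketch of the continuous-learning loop: a Query Insight Store records
# estimated vs. actual cardinalities, and an external tuner derives a
# correction factor for future estimates. All interfaces are hypothetical.

class QueryInsightStore:
    def __init__(self):
        self.records = []

    def record(self, plan_id, estimated_rows, actual_rows):
        self.records.append((plan_id, estimated_rows, actual_rows))

def learned_correction(store):
    # A trivially simple "external tuner": the average ratio of actual to
    # estimated rows, used to recalibrate future cardinality estimates.
    ratios = [actual / est for _, est, actual in store.records if est > 0]
    return sum(ratios) / len(ratios)

store = QueryInsightStore()
store.record("q1", estimated_rows=1_000, actual_rows=4_000)
store.record("q2", estimated_rows=500, actual_rows=1_000)
print(learned_correction(store))  # → 3.0
```

A real tuner would of course learn far richer models than a single scalar, but the flow is the same: observed execution feeds back into future optimization.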

While QOaaS remains an ambitious vision, it represents the logical evolution of query processing in a truly unified data platform.

5. Conclusion: A Unified Vision for Data and Analytics

Microsoft Fabric answers the industry’s demand for a cohesive platform that moves beyond the fragmentation of siloed data tools. By providing a single, logical data lake and enabling multiple analytical engines to operate on one copy of data, Fabric fundamentally simplifies the enterprise data landscape and fosters unprecedented collaboration.

The innovations explored in this paper—Real-Time Intelligence and the Query Optimizer as a Service concept—are powerful demonstrations of this unified approach in action. RTI provides an end-to-end, integrated solution for turning streaming data into actionable insights, while the QOaaS vision charts a path toward centralized, intelligent query optimization that can elevate the performance of every engine in the ecosystem. Together, these advancements showcase a future where real-time insights and cross-engine performance optimization are not separate, complex challenges but integral features of a single, intelligent data platform.
