Brillio Data Validation Suite used within an Azure-based Feature Store - Brillio
Tharun Mathew April 28, 2023

Brillio’s Data Validation Suite, built on Databricks and Great Expectations, is a rule-based solution that enables end-to-end data validation, monitoring, and reporting. Here are the key features of the solution:

  • Built on open-source components, so it can be implemented on any Spark-based platform with minimal changes.
  • Extended for the Azure platform using Azure Data Factory and Databricks.
  • Performs rule-based data validation on every table refresh (a minimal sketch follows this list).
  • Stores validation output in Delta tables and surfaces it in Power BI.
  • Triggers an automated email alert to data and platform owners on any deviation from set expectations.
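
To make the rule-based validation concrete, the minimal sketch below applies two Great Expectations checks to a Spark DataFrame inside a Databricks notebook (where `spark` is the session Databricks provides). The table and column names are hypothetical, and the legacy SparkDFDataset API shown here is only one of several ways Great Expectations can be wired to Spark:

```python
from great_expectations.dataset import SparkDFDataset

# Load the freshly refreshed table from the data lake (hypothetical name)
df = spark.read.table("feature_store.customer_features")

# Wrap the Spark DataFrame so expectations can run against it
ge_df = SparkDFDataset(df)

# Column-level rules; each call returns a result carrying a success flag
null_check = ge_df.expect_column_values_to_not_be_null("customer_id")
range_check = ge_df.expect_column_values_to_be_between(
    "tenure_months", min_value=0, max_value=600
)

print("null_check:", null_check.success)
print("range_check:", range_check.success)
```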


Figure 1: Solution Overview

As part of Brillio’s data validation solution, Azure Data Factory (ADF) is used to orchestrate Databricks notebooks. After the data sets within the Data Lake have been refreshed, the ADF triggers activate the validation notebooks. The validation notebooks read the rules and parameters from a SQL Server table or a config file within the data lake. At a column level, the rules define the benchmark against which the data must be validated. The following snippet illustrates a sample rule configuration. User-defined rules can easily be added or modified in these config files or tables.


Figure 2: Sample Rule Config Table
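
Since the rule configuration in Figure 2 renders as an image, the following is a hedged, text-only approximation of what such column-level rules might look like; the field names and values are illustrative, not the suite’s exact schema:

```python
# Illustrative rule configuration (hypothetical schema): each entry maps a
# table and column to a Great Expectations expectation plus its parameters.
rules = [
    {
        "table": "feature_store.customer_features",
        "column": "customer_id",
        "expectation": "expect_column_values_to_not_be_null",
        "kwargs": {},
    },
    {
        "table": "feature_store.customer_features",
        "column": "tenure_months",
        "expectation": "expect_column_values_to_be_between",
        "kwargs": {"min_value": 0, "max_value": 600},
    },
]
```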

The Databricks notebooks receive the tables to be checked and the rules to be validated as parameters. Upon completion of validation, the results are recorded in a SQL Server table. Users can build additional Power BI dashboards on top of these data sets, giving them a real-time view of the data’s accuracy.
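
As a rough sketch of how such a parameterized notebook could be structured, the snippet below reads the table name and rules from Databricks widgets (which ADF can populate at trigger time), dispatches each configured expectation by name, and appends the outcomes to a results table. The widget names, table names, and results target are assumptions rather than the suite’s actual code; writing to SQL Server, as the suite does, would typically go through a JDBC writer instead of the Delta target shown here. `spark` and `dbutils` are the globals Databricks provides in every notebook:

```python
import json
from great_expectations.dataset import SparkDFDataset

# Parameters supplied by the ADF pipeline run via notebook widgets
dbutils.widgets.text("table_name", "")
dbutils.widgets.text("rules_json", "[]")
table_name = dbutils.widgets.get("table_name")
rules = json.loads(dbutils.widgets.get("rules_json"))

ge_df = SparkDFDataset(spark.read.table(table_name))

results = []
for rule in rules:
    # Look up the expectation method by its configured name and apply it
    check = getattr(ge_df, rule["expectation"])(rule["column"], **rule["kwargs"])
    results.append((table_name, rule["column"], rule["expectation"], check.success))

# Persist outcomes for reporting (hypothetical Delta target; a SQL Server
# destination would use a JDBC write instead)
spark.createDataFrame(
    results, ["table_name", "column_name", "expectation", "success"]
).write.mode("append").saveAsTable("validation.results")
```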


Figure 3: Validation Result Snippet


Figure 4: Validation Result Output

As a result, the data validation utility gives users a comprehensive, real-time view of the accuracy of their data. Users are notified of any deviation from the parameters set in the rule table or configuration file, and can then perform a detailed analysis of the data to fix issues or decide whether to continue production runs.
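
The article does not spell out the alerting mechanism, but one simple way the email trigger could be sketched is shown below; the SMTP host, addresses, and results table are placeholders, and in practice the notification might equally be sent through an ADF activity or a Logic App:

```python
import smtplib
from email.message import EmailMessage

# Collect any failed checks from the results table (hypothetical name)
failures = spark.sql(
    "SELECT table_name, column_name, expectation "
    "FROM validation.results WHERE success = false"
).collect()

if failures:
    body = "\n".join(
        f"{r.table_name}.{r.column_name}: {r.expectation} failed" for r in failures
    )
    msg = EmailMessage()
    msg["Subject"] = "Data validation deviations detected"
    msg["From"] = "noreply@example.com"        # placeholder sender
    msg["To"] = "data-owners@example.com"      # placeholder recipients
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.com") as server:  # placeholder SMTP host
        server.send_message(msg)
```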

About the Author

 

Tharun Mathew

Tharun Mathew is a highly skilled Senior Data Architect at Brillio EU and a global leader with strong expertise in building large-scale data lakes and lakehouses on Azure, AWS, Databricks, and Spark. He is also experienced in building enterprise feature stores and ML engineering products on Databricks and Spark.
