Tharun Mathew is a highly skilled Senior Data Architect at Brillio EU with strong expertise in building large-scale data lakes and lakehouses on Azure, AWS, Databricks, and Spark. He is also experienced in building enterprise feature stores and ML engineering products on Databricks and Spark.
28th April, 2023
Brillio’s Data Validation Suite is a rule-based solution built on Databricks and Great Expectations that enables end-to-end data validation, monitoring, and reporting. The key features of the solution are summarized below:
Figure 1: Solution Overview
As part of Brillio’s data validation solution, Azure Data Factory (ADF) is used to orchestrate Databricks notebooks. After the datasets within the Data Lake have been refreshed, ADF triggers fire and start the validation notebooks. The validation notebooks read rules and parameters either from a SQL Server table or from a config file stored in the data lake. At the column level, the rules define the benchmarks that must be validated. The following snippet illustrates a sample validation configuration; user-defined rules can easily be added or modified in these config files or tables.
Figure 2: Sample Rule Config Table
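To make the idea concrete, here is a minimal sketch of what such a column-level rule configuration could look like, along with a loader for it. The table names, column names, and field layout are illustrative assumptions, not Brillio’s actual schema; only the rule names mirror Great Expectations expectation types.

```python
import json

# Hypothetical rule config: one entry per column-level benchmark.
# (Field names and values are assumptions for illustration.)
SAMPLE_RULE_CONFIG = """
[
  {"table": "sales.orders", "column": "order_id",
   "rule": "expect_column_values_to_not_be_null", "params": {}},
  {"table": "sales.orders", "column": "order_amount",
   "rule": "expect_column_values_to_be_between",
   "params": {"min_value": 0, "max_value": 100000}},
  {"table": "sales.orders", "column": "country_code",
   "rule": "expect_column_values_to_be_in_set",
   "params": {"value_set": ["DE", "FR", "UK", "NL"]}}
]
"""

def load_rules(config_text: str) -> list[dict]:
    """Parse the JSON rule config into a list of rule dicts."""
    rules = json.loads(config_text)
    # Sanity check: every rule must name a table, column, rule type, and params.
    for rule in rules:
        assert {"table", "column", "rule", "params"} <= rule.keys()
    return rules

rules = load_rules(SAMPLE_RULE_CONFIG)
```

Keeping the rules as plain data like this is what makes the suite extensible: adding a new benchmark is a config edit, not a code change.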
The tables to be checked and the rules to be validated are passed to the Databricks notebooks as parameters. Upon completion of validation, the results are recorded in a SQL Server table. Users can build additional Power BI dashboards on top of these result sets, giving them a real-time view of the data’s accuracy.
Figure 3: Validation Result Snippet
Figure 4: Validation Result Output
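The validation loop described above can be sketched in a few lines. This is a simplified, self-contained stand-in for the Great Expectations checks the notebooks run (the real suite delegates to Great Expectations on Spark); the rule names mirror expectation types, and the tables, columns, and result-row fields are illustrative assumptions.

```python
from datetime import datetime, timezone

# Plain-Python stand-ins for a couple of Great Expectations expectation types.
CHECKS = {
    "expect_column_values_to_not_be_null":
        lambda values, params: all(v is not None for v in values),
    "expect_column_values_to_be_between":
        lambda values, params: all(
            params["min_value"] <= v <= params["max_value"] for v in values),
}

def run_validation(rows: list[dict], rules: list[dict]) -> list[dict]:
    """Apply each rule to its column and return one result row per rule,
    shaped like the records written back to the SQL Server results table."""
    results = []
    for rule in rules:
        values = [row.get(rule["column"]) for row in rows]
        passed = CHECKS[rule["rule"]](values, rule["params"])
        results.append({
            "table": rule["table"],
            "column": rule["column"],
            "rule": rule["rule"],
            "status": "PASS" if passed else "FAIL",
            "run_ts": datetime.now(timezone.utc).isoformat(),
        })
    return results

# Hypothetical input: the second row violates the range rule.
orders = [{"order_id": 1, "order_amount": 250},
          {"order_id": 2, "order_amount": -10}]
rules = [
    {"table": "sales.orders", "column": "order_id",
     "rule": "expect_column_values_to_not_be_null", "params": {}},
    {"table": "sales.orders", "column": "order_amount",
     "rule": "expect_column_values_to_be_between",
     "params": {"min_value": 0, "max_value": 100000}},
]
results = run_validation(orders, rules)
# results[0]["status"] == "PASS"; results[1]["status"] == "FAIL"
```

One result row per rule, with a pass/fail status and timestamp, is exactly the shape a Power BI dashboard can aggregate for a real-time accuracy view.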
As a result, the data validation utility provides users with a comprehensive, real-time view of the accuracy of their data. Users are notified of any deviation from the parameters set in the rule table or configuration file; they can then perform a deep dive into the data, including detailed analysis, to fix issues or decide whether to continue production runs.