Data Validation Using Databricks and Great Expectations for Feature Stores - Brillio
                           Tharun Mathew April 28, 2023


The rapid growth of large-scale feature stores makes them susceptible to creeping data correctness issues that can go unnoticed during testing. Furthermore, current data quality measures, such as cleansing and monitoring, are not tailored to catch feature-specific or business-related validation issues that arise over time or as a result of changes in the source data. This applies in particular to external data sources, such as credit bureau or financial data, which may change in ways the transformation logic for the feature calculation does not anticipate.

Automated data validation allows the monitoring of data assets across data stores and feature stores. Within feature stores, data validation frameworks can:

  1. Perform rule-based automated validation, ranging from simple averages to complex multi-column checks, to ensure that data stays within expected boundaries. It is also likely that some use cases will require custom validations. Validations can run with every feature store refresh.
  2. Easily maintain the data pipeline by adding/removing/updating rules at the table or column level.
  3. Ensure the visibility of validation results, which is important for users to make informed decisions. Depending on the type of feature store, data scientists may use features to build or score models in production. Therefore, it is crucial to provide users with tools such as dashboards and enable them to view real-time data validation results and be confident in the data. Furthermore, administrators can be notified if any data validation results are out of bounds so that they can take the necessary steps to correct and prevent future errors in production data scoring.
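The rule-based checks described in point 1 can be sketched as a small validation runner. Everything below is illustrative: the rule functions, feature names, and thresholds are hypothetical, and a production feature store would express such rules as Great Expectations suites rather than hand-rolled functions, but the pattern is the same.

```python
# Minimal sketch of a rule-based validation runner (hypothetical rules and
# feature names; a real deployment would use Great Expectations suites).
from statistics import mean

def check_mean_between(rows, column, lo, hi):
    """Simple average check: the column mean must fall within [lo, hi]."""
    values = [r[column] for r in rows if r[column] is not None]
    ok = lo <= mean(values) <= hi
    return {"rule": f"mean({column}) in [{lo}, {hi}]", "success": ok}

def check_multi_column(rows, predicate, description):
    """Complex multi-column check: every row must satisfy the predicate."""
    ok = all(predicate(r) for r in rows)
    return {"rule": description, "success": ok}

def validate(rows, rules):
    """Run all rules on a refresh; failures can trigger admin notifications."""
    return [rule(rows) for rule in rules]

features = [
    {"balance": 1200.0, "credit_limit": 5000.0},
    {"balance": 300.0, "credit_limit": 1000.0},
]
results = validate(features, [
    lambda r: check_mean_between(r, "balance", 0, 10_000),
    lambda r: check_multi_column(
        r, lambda row: row["balance"] <= row["credit_limit"],
        "balance must not exceed credit_limit"),
])
print(all(res["success"] for res in results))  # → True
```

Because each rule returns a structured result rather than raising, adding or removing rules at the table or column level is just editing the list passed to the runner.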

Databricks and Great Expectations for Data Validation

Great Expectations is a powerful, shared, open standard for data quality that can be extended as a data validation utility for feature stores. It lets you assert expectations on data as it is loaded and transformed, validating it for data issues at the same time. Additionally, Great Expectations generates documentation and reports from these expectations, which can be surfaced to users as dashboards and notifications.
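In Great Expectations terms, each rule is an "expectation" asserted against the data, and every validation run yields structured results that can feed documentation and dashboards. The sketch below imitates that pattern in plain Python: the function names echo real Great Expectations expectation names but are local stand-ins, and the column and values are hypothetical.

```python
# Plain-Python imitation of the expectation-and-report pattern popularized by
# Great Expectations (expectation names echo the library; data is illustrative).
import json

def expect_column_values_to_not_be_null(rows, column):
    unexpected = [r for r in rows if r.get(column) is None]
    return {"expectation": "expect_column_values_to_not_be_null",
            "column": column, "success": not unexpected,
            "unexpected_count": len(unexpected)}

def expect_column_values_to_be_between(rows, column, lo, hi):
    unexpected = [r for r in rows if r.get(column) is not None
                  and not (lo <= r[column] <= hi)]
    return {"expectation": "expect_column_values_to_be_between",
            "column": column, "success": not unexpected,
            "unexpected_count": len(unexpected)}

def build_report(results):
    """A run's results double as documentation for dashboards and alerts."""
    return {"success": all(r["success"] for r in results),
            "statistics": {"evaluated": len(results),
                           "failed": sum(not r["success"] for r in results)},
            "results": results}

rows = [{"fico_score": 720}, {"fico_score": 650}, {"fico_score": None}]
report = build_report([
    expect_column_values_to_not_be_null(rows, "fico_score"),
    expect_column_values_to_be_between(rows, "fico_score", 300, 850),
])
print(json.dumps(report["statistics"]))  # one null row fails the not-null check
```

The report structure is what makes the visibility requirement above practical: the same JSON that fails a pipeline run can render in a dashboard or trigger an administrator notification.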

Delta Lake is an open-source storage layer that brings reliability to the data lake: it provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake, is fully compatible with Apache Spark APIs, and adds capabilities such as time travel, scalability, and support for PySpark libraries, making it well suited to building large feature stores.

Additionally, Managed MLflow is built on top of MLflow, an open-source platform developed by Databricks to help manage the complete machine learning lifecycle with enterprise reliability, security, and scale. This makes Databricks a platform with a complete set of tools and utilities to build, deploy and manage ML models.

About the Author


Tharun Mathew

Tharun Mathew is a highly skilled Senior Data Architect at Brillio EU and a global leader with strong expertise in building large-scale data lakes and lakehouses on Azure, AWS, Databricks, and Spark. He is also experienced in building enterprise feature stores and ML engineering products on Databricks and Spark.
