28th April, 2023
Tharun Mathew is a highly skilled Senior Data Architect at Brillio EU and a global leader with strong expertise in building large-scale data lakes and lakehouses on Azure, AWS, Databricks, and Spark. Tharun is also experienced in building enterprise feature stores and ML engineering products on Databricks and Spark.
Introduction
As large-scale feature stores grow rapidly, they become susceptible to creeping data correctness issues that can go unnoticed during testing. Furthermore, current data quality measures such as cleansing and monitoring are not tailored to identify feature-specific or business-related data validation issues that arise over time or as a result of changes in the source data. This applies in particular to external data sources such as credit bureau or financial data, which can change in ways that the transformation logic for the feature calculation did not anticipate.
Automated data validation allows the monitoring of data assets across data stores and feature stores. Within feature stores, data validation frameworks can:
Databricks and Great_Expectations for Data Validation
Great_expectations is a powerful, shared, and open standard for data quality that can be extended as a data validation utility for feature stores. It lets you assert expectations on data as it is loaded and transformed, so the data is validated for issues at the same time. Additionally, great_expectations creates documentation and reports from these expectations, which can be extended into dashboards and notifications for users.
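To make the idea of asserting expectations on feature data concrete, here is a minimal sketch in plain Python. It is not the great_expectations API: the two check functions borrow the names of well-known great_expectations expectations, but they are hand-written stand-ins, and the `ExpectationResult` class, the `validate` helper, the `credit_score` column, and the 300 to 850 range are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class ExpectationResult:
    """Outcome of one expectation (hypothetical structure, not the library's)."""
    expectation: str
    column: str
    success: bool
    unexpected_count: int


def expect_column_values_to_not_be_null(rows: List[Row], column: str) -> ExpectationResult:
    # Collect rows where the column is missing or None.
    unexpected = [r for r in rows if r.get(column) is None]
    return ExpectationResult(
        "expect_column_values_to_not_be_null", column, not unexpected, len(unexpected)
    )


def expect_column_values_to_be_between(
    rows: List[Row], column: str, min_value: float, max_value: float
) -> ExpectationResult:
    # Nulls are skipped here; the not-null expectation covers them separately.
    unexpected = [
        r for r in rows
        if r.get(column) is not None and not (min_value <= r[column] <= max_value)
    ]
    return ExpectationResult(
        "expect_column_values_to_be_between", column, not unexpected, len(unexpected)
    )


def validate(rows: List[Row], checks: List[Callable[[List[Row]], ExpectationResult]]) -> Dict[str, Any]:
    # Run every expectation and summarise; the batch passes only if all checks pass.
    results = [check(rows) for check in checks]
    return {"success": all(r.success for r in results), "results": results}


# Illustrative feature rows (made-up values for the example).
feature_rows = [
    {"customer_id": 1, "credit_score": 710},
    {"customer_id": 2, "credit_score": None},
    {"customer_id": 3, "credit_score": 1200},
]

report = validate(
    feature_rows,
    [
        lambda rows: expect_column_values_to_not_be_null(rows, "credit_score"),
        lambda rows: expect_column_values_to_be_between(rows, "credit_score", 300, 850),
    ],
)

for result in report["results"]:
    print(result.expectation, result.column, result.success, result.unexpected_count)
```

The summary returned by `validate` is the kind of structured result that a report, dashboard, or notification could be built from. In a real pipeline the library's own expectation suites and reporting would replace these stand-ins.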
Delta Lake is an open-source storage layer that brings reliability to the data lake by providing ACID transactions and scalable metadata handling and by unifying streaming and batch data processing. Delta Lake runs on top of your existing data lake, is fully compatible with Apache Spark APIs, and brings capabilities such as time travel, scalability, and support for PySpark libraries, making it well suited to building large feature stores.
Additionally, Databricks offers Managed MLflow, which is built on top of MLflow, the open-source platform developed by Databricks for managing the complete machine learning lifecycle, and adds enterprise reliability, security, and scale. This makes Databricks a platform with a complete set of tools and utilities to build, deploy, and manage ML models.