Unlocking valuable insights from a geospatial enterprise data lake
About the Customer
One of Europe’s largest debt management institutions with operations across mainland UK, DACH, and Nordics.
The regions acquired by the Credit Management Firm worked independently, using isolated data and disparate reporting systems. There was an immediate need for stronger data governance to regulate regional compliances.
Aspects that needed attention:
Increasing total cost of ownership in the existing on-premise model
Limiting capability to generate pioneering insights
Isolating and disparate systems functioning in silos
Lack of advanced analytics for data monetization
Organizations are often interested in analytics but fail in execution because they lack the knowledge of what initiatives to prioritize and are unaware of the best fit technologies to invest in. Brillio helps companies understand the path forward to optimally leverage data and help organizations serve their clients better. By using modern infrastructure (such as public cloud platforms) and advanced analytics we can help uncover tangible revenue and cost benefits.
Challenges faced by the client include:
Loss in productivity
Data science at snail’s pace
Failed data endpoint implementations
Booming data ecosystem
Brillio, as part of architecture consulting approach, proposed an Azure based data lake powered by Databricks to build data models for reporting and feature stores for data science requirements. The key components of the solution included:
SSIS based data egress from on-premises data sources to Azure data lake. The SSIS solution also performed GDPR compliant anonymization of PII data before it is pushed to the data lake.
Azure data lake served as the central data repository, serving both the product and user needs. The primary products served were data models for reporting and feature stores used for data science model builds and consumption. Additionally, the data lake was used to build additional reports and data science models
Azure Databricks was the core data transformation and processing component.
Databricks notebooks were used to perform data quality and governance transformation, which included DQ checks, audits, and data lineage audits.
The data science feature stores were built on terabyte scale data sets spanning 200M+ datasets spanning 15,000+ columns. The retro capability in feature stores enabled historical data correction based on changes in current data. This enabled data scientists to build accurate models based on changing data scenarios. Additionally, automated sample selection data sets were created on a daily basis, which data scientists used to select samples for analytics model development.
Dimensional model transformation used for canned reporting needs across the organization were transformed and persisted onto Azure SQL DWH. This was then consumed by Qlik Sense as part of daily, weekly, and monthly report generation.
Azure data factory was used to orchestrate data integration and transformation across the data platform. This includes orchestration of Databricks notebooks across platforms. The metadata driven approach enabled ADF jobs immediate restart ability and fault tolerance needed to perform multi-day batch processing.
Centralized data store enabled users across reporting and data science to have a single source of truth for model and report building.
Time taken to obtain variables from feature stores for consumers was reduced from 4 weeks to 4 hours. Prior to implementation of the solution, customer dataavailable for analysis was limited to recent 30-day timespan, while the current feature store provided data scientists with consumer data from Day 0.
Sample selection became seamless compared to prior issues with being unable to join different sources together directly, meaning filters couldn’t be added to queries causing long run-times and unnecessary large data extracts. The current solution provided users with six aggregated tables combining source data together. Users can join tables together directly so queries can be filtered at source, meaning faster run-times where only required data is extracted.
The platform enabled users to take capability of Azure’s compute power and capability to build ML models. Earlier models were using regression-based techniques,Azure enabled users to work on ML based techniques.
End-to-end data lineage and cataloguing capabilities ensures organization wide data governance.