How To Get Started Managing Data Quality With SQL and Scale

By Tom Baeyens, CTO & Co-Founder

Silent data quality issues are the biggest problem facing data teams today. Without systems or processes in place to monitor and detect bad data before it has a downstream impact, these teams are flying blind.

Why data management?

In the last three years I’ve transitioned from being a software engineer to a data engineer. I fell into the area of data management when Maarten Masschelein, my fellow co-founder at Soda, and I started working together to solve the problem of silent, undetected data issues. In software engineering, writing unit tests and monitoring applications in production is a given, but in data it’s quite different. Most organizations are aware they should test, but they have no strategy in place and don’t know how to start addressing the problem. That leaves their systems exposed and can result in serious downstream issues for the data products they are building.

Define good data quality

We started with Soda SQL, made available in February 2021. It’s our first open source data testing, monitoring and profiling tool for data-intensive environments. It works with your existing data engineering workflows to create a quick and easy way to define what good quality data means to your business. This enables data engineers to define tests and protect against the silent data issues that go undetected in datasets, data lakes, and data warehouses.

Open source to the rescue

Soda SQL is an open source tool with a simple command-line interface (CLI) and a Python library for testing your data through metric collection. It takes YAML configuration files as input and prepares SQL queries that run against tables in a database to compute a wide range of metrics and evaluate tests on them. That makes it easy to find invalid, missing, or unexpected data. Because Soda SQL leverages (you guessed it) SQL, the data can stay where it is and existing compute engines can do the work.
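
To make the idea concrete, here is a minimal sketch of the general pattern: a declarative configuration is turned into a SQL query, the query computes metrics inside the database, and tests are evaluated on those metrics. This is not Soda SQL's actual API or file format. The `SCAN_CONFIG` dictionary (standing in for a YAML file), the `build_metric_query` and `run_scan` helpers, and the `customers` table are all hypothetical names chosen for illustration, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical scan configuration. In a real tool this would live in a YAML
# file; a plain dict keeps the sketch free of third-party dependencies.
SCAN_CONFIG = {
    "table": "customers",
    "column": "email",
    "tests": {
        "row_count": lambda value: value > 0,       # table must not be empty
        "missing_count": lambda value: value == 0,  # no NULL emails allowed
    },
}


def build_metric_query(table, column):
    """Build one SQL query that computes all metrics in a single pass.

    Identifiers are interpolated directly, so they must come from trusted
    configuration, never from user input.
    """
    return (
        "SELECT COUNT(*) AS row_count, "
        f"COALESCE(SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END), 0) "
        "AS missing_count "
        f"FROM {table}"
    )


def run_scan(conn, config):
    """Run the metric query, then evaluate each configured test on its metric."""
    query = build_metric_query(config["table"], config["column"])
    row = conn.execute(query).fetchone()
    metrics = {"row_count": row[0], "missing_count": row[1]}
    results = {name: check(metrics[name]) for name, check in config["tests"].items()}
    return metrics, results


# Demo data: three customers, one of them without an email address.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com")],
)

metrics, results = run_scan(conn, SCAN_CONFIG)
print(metrics)  # {'row_count': 3, 'missing_count': 1}
print(results)  # {'row_count': True, 'missing_count': False}
```

Only the aggregated metrics leave the database, which is why the data can stay where it is: the compute engine does the counting, and the testing layer only compares a handful of numbers against expectations.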

Bring everyone closer to the data

We have just released Soda Cloud, a web application where Soda SQL metrics and test results can be monitored over time. Soda Cloud creates transparency between engineers and the rest of the data team. With this collaboration, data teams get ahead of silent data issues. Soda Cloud extends Soda SQL, and the two work together seamlessly.

Soda’s data observability and collaboration platform keeps data fit for purpose, verifiable and trustworthy. We bring everyone closer to trusted data.

