How To Get Started Managing Data Quality With SQL and Scale

By Tom Baeyens, CTO & Co-Founder

Unsplash Contributor: Ricardo Gomez Angel

Silent data quality issues are the biggest problem facing data teams today. Many teams are flying blind, with no systems or processes in place to monitor and detect bad data before it has a downstream impact.

Why data management?

With more and more products being built using data as the core input, it’s never been more important to test and monitor the quality of the data being used. That is why we set about building a data observability platform that enables organizations to discover, prioritize and resolve data issues.

Define good data quality

Open source to the rescue

If tests fail, Soda SQL allows you to stop the pipeline and prevent bad data from causing damage. As metrics are computed, diagnostic information is captured as well to help with the analysis if a data issue is detected. Steps can then be taken to prioritize and collaboratively resolve issues as one data team. Soda SQL can be used manually on its own or integrated with a data orchestration tool to schedule scans and automate actions based on scan results.

You can check out the 5-minute tutorial on how to get started, but here’s a quick example:

1. Simple metrics and tests can be configured in scan YAML configuration files.

2. Based on these configuration files, Soda SQL scans your data each time new data arrives.
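The two steps above can be sketched in Python. This is not Soda SQL's implementation or its configuration syntax: the `SCAN_CONFIG` dictionary only stands in for a scan YAML file, and the table, the metric names, `compute_metrics`, `run_scan`, and `build_demo_connection` are hypothetical names chosen for illustration. The sketch uses an in-memory SQLite table, computes each metric with one SQL query, and evaluates each test against the resulting measurements.

```python
import sqlite3

# Hypothetical stand-in for a scan configuration file: which table to scan
# and which tests to evaluate against the computed metrics.
SCAN_CONFIG = {
    "table": "orders",
    "column": "customer_id",
    "tests": {
        "table is not empty": lambda m: m["row_count"] > 0,
        "no missing customer ids": lambda m: m["missing_count"] == 0,
    },
}

# Each metric maps to one SQL aggregate. The identifiers are filled in from
# the trusted config above, never from user input.
METRIC_SQL = {
    "row_count": "SELECT COUNT(*) FROM {table}",
    "missing_count": "SELECT COUNT(*) FROM {table} WHERE {column} IS NULL",
}


def compute_metrics(conn, config):
    """Run one aggregate query per metric and return {metric_name: value}."""
    measurements = {}
    for name, template in METRIC_SQL.items():
        sql = template.format(table=config["table"], column=config["column"])
        measurements[name] = conn.execute(sql).fetchone()[0]
    return measurements


def run_scan(conn, config):
    """Compute metrics, evaluate every test, return (measurements, failed_tests)."""
    measurements = compute_metrics(conn, config)
    failed = [name for name, check in config["tests"].items()
              if not check(measurements)]
    return measurements, failed


def build_demo_connection():
    """Create a small demo table in which one row has a missing customer id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer_id TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, "a"), (2, None), (3, "c")])
    return conn


conn = build_demo_connection()
measurements, failed = run_scan(conn, SCAN_CONFIG)
print(measurements)
print(failed)
```

In a pipeline, a non-empty `failed` list would be the signal to stop before bad data moves downstream, for example by raising an exception that the orchestration tool treats as a failed task.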

Bring everyone closer to the data

First of all, Soda Cloud extends Soda SQL with a metrics database so that measurements and test results can be visualized over time. This enables monitoring of changes over time and anomaly detection on all of the metrics.
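To make the idea of measurements over time concrete, here is a minimal Python sketch. The article does not describe how Soda Cloud stores measurements or detects anomalies, so everything here is an assumption: the `MetricStore` class is hypothetical, and a plain z-score against past measurements stands in for whatever detection method the platform actually uses.

```python
from collections import defaultdict
from statistics import mean, stdev


class MetricStore:
    """Hypothetical in-memory metrics database: one time-ordered series per metric."""

    def __init__(self):
        self._series = defaultdict(list)

    def record(self, metric, value):
        self._series[metric].append(float(value))

    def history(self, metric):
        return list(self._series[metric])


def is_anomaly(history, new_value, threshold=3.0, min_points=5):
    """Flag new_value if it lies more than `threshold` standard deviations
    from the mean of the previous measurements (a plain z-score check)."""
    if len(history) < min_points:
        return False  # not enough history to judge
    spread = stdev(history)
    if spread == 0:
        return new_value != history[0]  # flat series: any change is unusual
    return abs(new_value - mean(history)) / spread > threshold


# Record a week of hypothetical daily row counts, then check new measurements.
store = MetricStore()
for daily_row_count in [1000, 1010, 990, 1005, 995, 1002]:
    store.record("row_count", daily_row_count)

print(is_anomaly(store.history("row_count"), 1003))  # close to the usual range
print(is_anomaly(store.history("row_count"), 200))   # a sudden drop
```

Keeping the history per metric is what makes this possible: a single scan can only say whether a test passed, while a series of scans can show that a value has drifted away from what is normal for that dataset.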

These visualizations and data profiles already create transparency between different people in the larger data team. Everyone on the data team gets to see what data is actually present and what tests are performed.

But Soda Cloud goes one step further. It enables non-technical people to build and maintain their own monitors in a simple UI with a 3-step wizard. This is important because it removes the bottleneck that keeps the domain knowledge of Subject Matter Experts out of monitoring. If they don’t need to involve data engineers to get their domain logic tested, a lot more of that domain knowledge will be used to define what good data looks like. As a result, a lot more bad data will be caught, preventing various kinds of damage.

Soda Cloud addresses the problem of discovering silent data issues by giving data teams a central platform to track and score the health of data across core quality dimensions.

Data and analytics engineers are equipped with a way to test data every time it is transformed, to ensure data pipelines are reliable. Via Soda SQL, data production can be stopped and bad data quarantined. Soda Cloud visualizes the health of data sets and acts as a communication hub for data issues.

Data consumers and producers can now easily align on what’s important, what’s expected, and what to measure so that data remains fit for purpose. We’ve also built integrations with email and Slack to ensure the right people are alerted at the right time to diagnose, prioritize and resolve data issues.

We’re on a mission to bring everyone closer to the data, as we believe that data quality is a team sport. Everyone who has a stake in the data (and we think that’s everyone in the business nowadays) needs to understand it, trust it, and stay on top of it.

My main responsibility at Soda is to ensure data engineers love using our products and to help them solve real problems quickly. We do that with a combination of a cloud platform and a set of open source developer tools that give data teams the configurability they need to create end-to-end observability.

Good quality data is for everyone. Access Soda SQL on GitHub and Soda Starter, our free trial, on Soda.io (extended to June 30, 2021). Our Slack Community and Docs contain best practices and helpful resources.

Get ahead of the silent data issues. Good luck!

Soda’s data observability and collaboration platform keeps data fit for purpose, verifiable and trustworthy. We bring everyone closer to trusted data. soda.io