1. What is data quality?
This depends on the sector you work in. For some, it’s the accuracy and consistency of address information: avoiding multiple versions of the truth for postcodes, city names and so on. For others, it’s about numeric accuracy and business-critical KPIs being 100% correct.
In all cases one thing is true: Data quality is crucial. But it doesn’t have to be hard to get it right.
2. Why is data quality important?
In short: high-quality data ensures that your business understands exactly how well it is (or isn’t) performing, and it should also give you clear insight into why that might be.
A fascinating article I once read gave a really good example of data that was presumed to be high quality and useful, but was in fact completely misleading.
I encourage you to give it a read, and always ask yourself what’s missing.
3. How to improve your data quality in 7 steps
Step 1: Assess & measure your current data quality
This can be a daunting task, depending on the size of your organisation. Two ways to health check your data quality are:
- Conduct interviews with core stakeholders / data owners / data consumers. How many data issues have they had this year? Are there any areas that they feel are poorly represented in data?
- Keep a log of all data-related incidents as they occur, and review frequently.
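The incident log above can be as simple as a small structured record per issue, reviewed regularly to spot which datasets hurt the most. A minimal sketch in Python (the incidents and dataset names here are made-up examples):

```python
from dataclasses import dataclass
from datetime import date
from collections import Counter

@dataclass
class Incident:
    # One data-quality incident: when it happened, where, and what went wrong.
    logged_on: date
    dataset: str
    description: str

incidents = [
    Incident(date(2024, 1, 10), "customers", "Duplicate postcodes after import"),
    Incident(date(2024, 2, 3), "sales", "Revenue KPI wrong due to rounding in staging"),
    Incident(date(2024, 2, 20), "customers", "Missing city names for new signups"),
]

# Review step: which datasets generate the most incidents?
by_dataset = Counter(i.dataset for i in incidents)
print(by_dataset.most_common())  # [('customers', 2), ('sales', 1)]
```

Counting incidents per dataset is a crude but effective way to prioritise which areas to health check first.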
Step 2: Define a benchmark for “good quality” data
The best people to talk to about the expectations around data are your data consumers. Think of questions like:
- What are their pain points?
- Are there any recurring problems with data where they have to manually intervene?
- Have there been any cases where data is completely missing?
- Are they manipulating their data and if so, why?
- What would their world look like if their data was perfect, or as close to perfect as possible? This is your benchmark. What steps can you take to get there?
Step 3: Clean your data at the lowest possible level
First and foremost, you should aim to build a data lake, or any general “dumping ground” of data, in which data remains unchanged. This is so that you can always refer back to the original source when troubleshooting.
The next tier, which some call staging, should closely match your raw schema in terms of tables and columns; however, at this point it should contain cleansed data. What counts as cleansed will differ depending on your business requirements, but typically includes:
- Column renames
- Filtering out incorrect or unwanted rows
- Leaving out unwanted columns
A further layer (not covered in this article) can then be built on top of the cleansed data. Usually this would be a neatly-modelled reporting layer.
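The raw-to-staging step above can be sketched in a few lines of Python. The column names, rename map and validity rules here are illustrative assumptions, not a prescribed schema:

```python
# Raw layer: data exactly as it landed, never modified.
raw_orders = [
    {"ord_id": 1, "cust_nm": "Acme Ltd",  "amt": 120.0, "test_flag": "N", "legacy_code": "X1"},
    {"ord_id": 2, "cust_nm": "",          "amt": -5.0,  "test_flag": "Y", "legacy_code": "X2"},
    {"ord_id": 3, "cust_nm": "Binary Co", "amt": 80.0,  "test_flag": "N", "legacy_code": "X3"},
]

# Column renames; anything not listed here is an unwanted column and is dropped.
RENAMES = {"ord_id": "order_id", "cust_nm": "customer_name", "amt": "amount"}
KEEP = set(RENAMES)

def to_staging(row):
    # Rename the columns we keep, leave out the rest (e.g. legacy_code).
    return {RENAMES[k]: v for k, v in row.items() if k in KEEP}

def is_valid(row):
    # Filter out test records and rows that fail basic sanity checks.
    return row["test_flag"] == "N" and row["amt"] >= 0 and row["cust_nm"]

staging_orders = [to_staging(r) for r in raw_orders if is_valid(r)]
print(staging_orders)  # rows 1 and 3 survive; row 2 is filtered out
```

Because `raw_orders` is never touched, you can always trace a staging row back to its original source when troubleshooting.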
Step 4: Normalise common data with a defined set of values
A common mistake, especially with text-based data, is giving users the ability to input values into free-format fields on a form or website. Unless the field is truly free format (such as a field for comments), the user should always be made to select from predefined values.
With cloud service providers offering rich features, it’s easier than ever to create a form with drop-down lists, or even build your own front-end applications to help drive good data quality.
For example, if you work with housing data, your front-end application could hook into an address-finding API to retrieve consistent address information. If the values aren’t available through an API, you could present your own lists, sourced from your own internal data.
Step 5: Secure your data and implement a data governance strategy
Frequent reviews of data quality may be required for your business. If so, you should consider appointing a data steward to take care of this process, ensuring it’s done on a strict schedule and to a defined standard.
Many large organisations work in data silos, split off from one another. It is far better to have a single shared platform and use schemas/privileges to grant data access based on a person’s role. This way your data can be consistent, and shared wherever one source of truth is relevant. When doing this, care must be taken to implement the correct row-level security and column masking.
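To illustrate the row-level-security idea, here is a toy sketch in Python: each role is granted a set of regions and only sees matching rows. In practice you would implement this with your database’s native features (e.g. row access policies), and the roles and regions below are invented for the example:

```python
# Hypothetical role-to-region grants; in a real warehouse these would be
# expressed as row access policies or secure views, not application code.
ROLE_REGIONS = {
    "sales_uk": {"UK"},
    "sales_be": {"BE"},
    "analyst_all": {"UK", "BE"},
}

sales = [
    {"order_id": 1, "region": "UK", "amount": 100},
    {"order_id": 2, "region": "BE", "amount": 250},
]

def visible_rows(role, rows):
    # An unknown role gets an empty grant set, i.e. sees nothing.
    allowed = ROLE_REGIONS.get(role, set())
    return [r for r in rows if r["region"] in allowed]

print(len(visible_rows("sales_be", sales)))     # 1
print(len(visible_rows("analyst_all", sales)))  # 2
print(len(visible_rows("intern", sales)))       # 0
```

The key design choice is deny-by-default: anyone without an explicit grant sees no rows at all.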
Step 6: Build a data culture with focus on training and enablement
It’s important that people feel comfortable with your data platform. Engage your staff with training opportunities and challenges / quizzes. Many data platforms offer weekly or monthly challenges, for example Snowflake’s Frosty Fridays.
Try to apply the same concepts to your own in-house data - build an internal training course which tells a story and enables a person with business as well as technical knowledge.
Don’t build a hierarchy in which one person gatekeeps all things data, and is the only person who has any say in data changes. Instead, foster an environment where everyone feels they can be heard - so that there can be open communication on all sides, and feedback is encouraged.
Collaborate, communicate, innovate.
Step 7: Prevent future errors
Even with a data steward in place, data quality can deteriorate over time if left unchecked. Therefore, automated tests and reporting should be implemented, especially against production data.
There are many utilities available that can make this much easier. For example, when using a tool such as dbt, you can set up its built-in generic tests to check for:
- Column uniqueness (unique)
- Broken foreign-key relationships between tables (relationships)
- Values belonging to a defined list (accepted_values)
- Values not being null (not_null)
dbt can also facilitate custom tests, and lets you configure per test whether a failure should merely raise a warning or stop the job (test severity).
I hope this article has inspired you to make whatever changes are necessary to improve the quality of your data. If you think Biztory can help, get in touch!