Data quality is a complex set of overlapping concerns. It is easy to spend a lot of time "improving data" without achieving much. So what can be done to focus the work of improving data quality?
First, look at the sources of the data. If there is more than one source, compare and contrast the datasets for omissions and inaccuracies.
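As a rough illustration of that comparison, assuming both sources can be keyed by a shared identifier (the field name `record_id` and the sample rows below are hypothetical), a simple set comparison flags records present in one source but missing from the other:

```python
# Sketch: compare two data sources by a shared key to spot omissions.
# The field name "record_id" and the sample records are invented for illustration.

source_a = [
    {"record_id": 1, "value": 20.5},
    {"record_id": 2, "value": 21.0},
    {"record_id": 3, "value": 19.8},
]
source_b = [
    {"record_id": 1, "value": 20.5},
    {"record_id": 3, "value": 19.9},  # note: value also differs from source A
]

keys_a = {row["record_id"] for row in source_a}
keys_b = {row["record_id"] for row in source_b}

missing_from_b = keys_a - keys_b   # omissions in source B
missing_from_a = keys_b - keys_a   # omissions in source A

print(f"In A but not B: {sorted(missing_from_b)}")
print(f"In B but not A: {sorted(missing_from_a)}")
```

The same idea extends to comparing field values for the records both sources share, which surfaces inaccuracies as well as omissions.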
Second, don't start without a plan: you need to measure the data quality BEFORE you start, so that you can track improvements.
What can be measured on any dataset to give us a view into its data quality?
We use the 4Cs approach, developed internally and used over a number of years, and based on strong mathematical foundations. We need to measure:
Completeness
Correctness
Consistency
Coherence
What do we mean?
Completeness - If we know the actual or estimated size of the dataset, we can report how much of it we hold: its completeness. For open datasets we can add time intervals and say, for example, that we have 100% of the data for last week's sensor logs.
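A minimal sketch of that calculation, assuming we know how many records arrived and how many we expected for the interval (both counts below are made up):

```python
# Sketch: completeness as records received / records expected for a time window.
# The expected count would come from sensor specs or historical volumes;
# the numbers here are illustrative only.

records_received = 9_870    # rows actually loaded for last week
records_expected = 10_080   # e.g. 1 reading per minute x 7 days x 1 sensor

completeness = records_received / records_expected
print(f"Completeness for last week's sensor logs: {completeness:.1%}")
```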
Correctness - We can apply a check to each data element that tells us whether it is correct. This can be a yes/no binary decision or a confidence interval that we define, e.g. 96% of the data we have for crop imaging includes GPS locations within the area of interest (e.g. a field on a farm).
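As an illustration of a per-element correctness check, the sketch below tests whether each image's GPS fix falls inside a bounding box for the field; the coordinates, file names and field boundaries are invented for the example:

```python
# Sketch: per-record correctness check - is the GPS fix inside the field's
# bounding box? All coordinates and records are invented for illustration.

FIELD_BOUNDS = {"min_lat": 52.10, "max_lat": 52.12, "min_lon": -0.45, "max_lon": -0.42}

crop_images = [
    {"file": "img_001.jpg", "lat": 52.115, "lon": -0.435},
    {"file": "img_002.jpg", "lat": 52.111, "lon": -0.441},
    {"file": "img_003.jpg", "lat": 51.980, "lon": -0.600},  # outside the field
]

def in_field(record):
    return (FIELD_BOUNDS["min_lat"] <= record["lat"] <= FIELD_BOUNDS["max_lat"]
            and FIELD_BOUNDS["min_lon"] <= record["lon"] <= FIELD_BOUNDS["max_lon"])

correct = sum(1 for r in crop_images if in_field(r))
print(f"Correctness: {correct}/{len(crop_images)} = {correct / len(crop_images):.0%}")
```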
Consistency - We can look at the data we have and apply internal validations, e.g. 145 of the 149 addresses we hold for offices in the UK have a valid postcode, a street name, a town and a number.
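A sketch of that kind of internal validation is below. The postcode pattern is a simplified approximation of the UK format (not the full official rules), and the sample addresses are invented:

```python
# Sketch: internal consistency check on office address records.
# The postcode regex is a simplified approximation of the UK format.
import re

POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")  # e.g. "SW1A 1AA"

offices = [
    {"number": "10", "street": "High Street",  "town": "Cambridge", "postcode": "CB2 1TN"},
    {"number": "4",  "street": "Mill Lane",    "town": "Leeds",     "postcode": "LS1"},      # invalid postcode
    {"number": "",   "street": "Station Road", "town": "Bristol",   "postcode": "BS1 4ST"},  # missing number
]

def is_consistent(addr):
    return (bool(addr["number"]) and bool(addr["street"]) and bool(addr["town"])
            and bool(POSTCODE.match(addr["postcode"])))

ok = sum(1 for a in offices if is_consistent(a))
print(f"Consistency: {ok} of {len(offices)} addresses pass all checks")
```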
Coherence - The datasets must tell the same story. For example, looking at hard drive temperatures reported by units in a datacentre, we may get temperatures reported in different units, making the dataset logically inconsistent. Further coherence concerns arise when devices report their temperature as hotter than the sun, because the low-level data from the drive has not been correctly processed or the drive's firmware has a bug.
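One way to sketch such a coherence check is to normalise every reading to a common unit and then flag values that are physically implausible. The readings, unit labels and the plausibility threshold below are all illustrative assumptions, not drawn from a real monitoring system:

```python
# Sketch: coherence checks on drive temperature readings. Unit labels are
# normalised to Celsius before comparison, and implausible values are flagged.
# Readings and the threshold are illustrative assumptions.

readings = [
    {"drive": "sda", "value": 38.0,   "unit": "C"},
    {"drive": "sdb", "value": 100.4,  "unit": "F"},  # same story as sda once converted
    {"drive": "sdc", "value": 6500.0, "unit": "C"},  # implausible: likely a processing or firmware bug
]

def to_celsius(value, unit):
    return (value - 32.0) * 5.0 / 9.0 if unit == "F" else value

MAX_PLAUSIBLE_C = 100.0  # no real drive temperature should exceed this

incoherent = []
for r in readings:
    celsius = to_celsius(r["value"], r["unit"])
    if celsius > MAX_PLAUSIBLE_C:
        incoherent.append((r["drive"], celsius))

print(f"Implausible readings: {incoherent}")
```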