Clean Data is Good Data

By Elliot King

The cliché is as old as computing itself--garbage in, garbage out. And that cliché is as true now as ever, if not more so. Unfortunately, with information flowing into companies from so many sources, including the Web and third-party providers, mistakes should not just be expected; they are basically inevitable. Garbage data is going to get into your data systems.

It is tempting to close our eyes to bad data and just pretend it doesn't matter, but that would be a major mistake. Virtually any operation driven by faulty data is suspect. Trends you uncover could be wrong. Your customer contact efforts could be inappropriate or misdirected.

Data cleansing, the systematic effort to remediate bad data, is no trivial task. First, so many different kinds of errors can exist. Mistakes can occur in single-source systems as well as multiple-source systems. Errors and inconsistencies can be introduced both at the metadata level--the data schema or, in the case of the Web, the information wrapper may be flawed--and at the granular level, where the information itself is just not right.

Even at the level of the information itself, so much can go awry. There can be missing values and misspellings. Information can be entered into the wrong field--a street name in the city field, perhaps. Attributes that should be linked aren't--let's say a city without a ZIP code. Records could contradict each other or be associated incorrectly. For example, "John Smith" may actually work in payroll and not in human resources.
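
To make those error types concrete, here is a minimal Python sketch that flags each of them in a handful of records. The sample records, the field names and the find_problems helper are all invented for illustration, and the "street address in the city field" test is a rough heuristic, not a rule taken from any particular cleansing tool.

```python
import re

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"name": "John Smith", "street": "12 Main St", "city": "Boston",
     "zip": "02108", "dept": "payroll"},
    {"name": "Jane Doe", "street": "", "city": "45 Oak Ave",
     "zip": "", "dept": "sales"},
    {"name": "John Smith", "street": "12 Main St", "city": "Boston",
     "zip": "02108", "dept": "human resources"},
]


def find_problems(records):
    """Return (record index, problem type, field) tuples for suspect values."""
    problems = []
    first_dept = {}  # first department seen for each name
    for i, rec in enumerate(records):
        # Missing values: any field that is empty or only whitespace.
        for field, value in rec.items():
            if not value.strip():
                problems.append((i, "missing value", field))
        # Wrong field: a city that starts with a house number looks like a street.
        if re.match(r"^\d+\s", rec["city"]):
            problems.append((i, "wrong field", "city"))
        # Unlinked attributes: a city is present but its ZIP code is not.
        if rec["city"].strip() and not rec["zip"].strip():
            problems.append((i, "unlinked attribute", "zip"))
        # Contradiction: the same name appears with a different department.
        name = rec["name"]
        if name in first_dept and first_dept[name] != rec["dept"]:
            problems.append((i, "contradiction", "dept"))
        first_dept.setdefault(name, rec["dept"])
    return problems


problems = find_problems(records)
for index, kind, field in problems:
    print(f"record {index}: {kind} in '{field}'")
```

Matching people on name alone is far too crude for real data, since two different John Smiths would be flagged as a contradiction; it is used here only to keep the sketch short.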

Not surprisingly, data cleansing has more steps than doing the laundry. The first step is to analyze the data and find the real mistakes--the omissions, the contradictions and the errors. The next step is to re-engineer and validate the new metadata and rules to address those errors. Then the data has to be transformed, expunging the problems. At that point, the data can be reloaded into the database. And in case that doesn't sound all that daunting, each of the steps generally has a slew of subprocesses too.
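
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the customers table and the analyze, transform and reload helpers are hypothetical, the cleansing rules are deliberately simplistic, and an in-memory SQLite database stands in for whatever database you actually use.

```python
import sqlite3

# Hypothetical raw rows: (name, city, zip). Invented for illustration.
raw_rows = [
    ("  john smith ", "boston", "02108"),
    ("Jane Doe", "Chicago", ""),
    ("  john smith ", "boston", "02108"),
]


def analyze(rows):
    """Step 1: count the problems before changing anything."""
    report = {"missing_zip": 0, "untrimmed": 0, "duplicates": 0}
    seen = set()
    for row in rows:
        name, city, zip_code = row
        if not zip_code:
            report["missing_zip"] += 1
        if name != name.strip() or city != city.strip():
            report["untrimmed"] += 1
        if row in seen:
            report["duplicates"] += 1
        seen.add(row)
    return report


def normalize(row):
    """Step 2: the rule set. Trim, title-case, and mark an empty ZIP as unknown."""
    name, city, zip_code = row
    return (name.strip().title(), city.strip().title(), zip_code or None)


def transform(rows):
    """Step 3: apply the rules and drop duplicates, keeping first occurrences."""
    cleaned, seen = [], set()
    for row in map(normalize, rows):
        if row not in seen:
            seen.add(row)
            cleaned.append(row)
    return cleaned


def reload(rows, conn):
    """Step 4: load the cleansed rows back into the database."""
    conn.execute("CREATE TABLE customers (name TEXT, city TEXT, zip TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


report = analyze(raw_rows)
clean_rows = transform(raw_rows)
conn = sqlite3.connect(":memory:")
reload(clean_rows, conn)
loaded = conn.execute(
    "SELECT name, city, zip FROM customers ORDER BY rowid"
).fetchall()
print(report)
print(loaded)
```

Running analyze again on the cleansed rows is a cheap way to validate the rules: the untrimmed and duplicate counts should drop to zero, while the missing ZIP code still needs a real source of truth to fill in.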

Of course, none of this has to be done manually. Many different tools have been introduced into the market. Some are more generalized, while others specialize in fixing a specific kind of problem, such as errors in names and addresses.

At the end of the day, "garbage in, garbage out" sounds really harsh. Maybe you should look at it this way--clean data is good data. But just like your clothes, if you use data it will get dirty, so the cleansing actually never ends.

