Characterizing the Quality of Address Data

By David Loshin

My company is currently working on a couple of projects associated with address quality and location master data. We are reviewing a lot of the existing documentation that has been collected from a number of different operational systems, as well as reviewing the business processes to see where location data is either created, modified, or read.

And there are many references to operations or transformations performed on addresses, mostly with the intent of improving the quality of the address.

Curiously, there are a number of different terms used to refer to these different transformations: validation, verification, standardization, cleansing, correction. I am sure there are others. But what do all these things mean? And why are these different terms used if they mean the same thing?

The first step in exploring the answer to this question is reflecting back on the nature of deliverable addresses. When an item is sent to an addressed location, there are some core concepts that need to be right:

1) The item must be directed to a specific recipient party (either an individual or an organization).
2) The address must be a deliverable address.
3) The intended recipient must be associated with the deliverable address.

In addition, there are certain incentives provided to senders when the addresses are completely aligned with the Postal Standard, adding one more concept:

4) The delivery address must conform to the USPS standard.

These directives provide us with some material with which to work for differentiating the different terms used for postal data quality. More next week...

Sometimes Data Quality is the Law

By Elliot King

We have all read the statistics about the real costs that poor data quality represents. And intuitively, we know that bad data is, well, bad. But, in many cases, bad data is more than just bad for business. Increasingly, good data is required by law.

In 2001, the U.S. Congress added two lines to its major appropriations act that required that the Office of Management and Budget "provide policy and procedural guidance to Federal agencies for ensuring and maximizing the quality, objectivity, utility, and integrity of information (including statistical information) disseminated by Federal agencies." These two lines have come to be known as the Data Quality Act (DQA) or the Information Quality Act.

While the DQA applied to federal agencies, it also put a marker in the ground. Regulatory agencies could demand that data given to them meet criteria set by the Congress. Government data had to be accurate, objective and have integrity as a matter of law.

The same notion has spread to corporate America and other specific industries. In one of the most high-profile examples, the Dodd-Frank Wall Street Reform and Consumer Protection Act, passed in the wake of the financial meltdown of 2008, established the Office of Financial Research with a mandate to improve the quality of financial data accessible to regulators.

One of the primary challenges for financial institutions is to ensure that the information they supply to regulators is consistent across all their divisions. Companies will also have to be able to track data flows and usage and develop chains of custody that can be audited. The net result should be that all users of the data will see consistent information. And that may not be easy when data is siloed in different operating entities, which is often the case in those too-big-to-fail financial behemoths.

The regulatory pressure for data quality is being felt elsewhere as well. The American Health Information Management Association, which promotes the technological advancement of health information management systems, has noted that for electronic health records (EHR) to have the positive impact on overall health care that their proponents anticipate, data quality can no longer be a reactive process based on auditing but must be proactively focused on data capture. Standards to ensure that result will be built into the requirements for EHRs.

Many folks in IT chafe at government regulations. However, the Federal government often has been a leader in IT innovation. Think about the Internet. Federal IT initiatives have also been inept at times--think about the FBI case management system or the Air Traffic Control system. Still, while burdensome, the regulatory demand for higher quality data could easily have a positive ripple effect that will spread widely.


Clean Data is Good Data

By Elliot King

The cliché is as old as computing itself--garbage in, garbage out. And that cliché is as true now as ever, if not more so. Unfortunately, with information flowing into companies from so many sources including the Web and third-party providers, mistakes should not just be expected; they are basically inevitable. Garbage data is going to get in your data systems.

We may want to close our eyes to bad data and just pretend it doesn't matter, but that would be a major mistake. Virtually any operation driven by faulty data is suspect. Trends you may uncover could be wrong. Your customer contact efforts could be inappropriate or misdirected.

Data cleansing, the systematic effort to remediate bad data, is no trivial task. First, so many different kinds of errors can exist. Mistakes can occur in single source systems as well as multiple source systems. Errors and inconsistencies can be introduced either at the metadata level--the data schema or, in the case of the Web, the information wrapper may be flawed--or at the granular level, where the information itself is just not right.

Just at the level of the information itself, so much can go awry. There can be missing values and misspellings. Information can be entered into the wrong field--a street name in the city field perhaps. Attributes that should be linked aren't--let's say a city without a ZIP code. Records could contradict each other or be associated incorrectly. For example, "John Smith" may actually work in payroll and not in human resources.
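
To make a few of these error types concrete, here is a minimal Python sketch of record-level checks. The field names (name, street, city, state, zip), the short list of street suffixes, and the find_problems helper are all assumptions invented for this illustration; a real cleansing tool would rely on far richer reference data.

```python
import re

# Hypothetical reference values, chosen only for this illustration.
STREET_SUFFIXES = {"st", "street", "ave", "avenue", "rd", "road",
                   "blvd", "dr", "drive", "ln", "lane"}
REQUIRED_FIELDS = ("name", "street", "city", "state", "zip")


def find_problems(record):
    """Return a list of human-readable problems found in one contact record."""
    problems = []

    # Missing values: any required field that is absent or blank.
    for field in REQUIRED_FIELDS:
        if not (record.get(field) or "").strip():
            problems.append(f"missing value: {field}")

    # Wrong field: a city value that ends in a street suffix is suspicious.
    city = (record.get("city") or "").strip()
    city_words = re.findall(r"[a-z]+", city.lower())
    if city_words and city_words[-1] in STREET_SUFFIXES:
        problems.append("city field looks like a street name")

    # Unlinked attributes: a city that is not accompanied by a ZIP code.
    if city and not (record.get("zip") or "").strip():
        problems.append("city present without a ZIP code")

    return problems


suspect = {"name": "John Smith", "street": "", "city": "12 Main Street",
           "state": "MD", "zip": ""}
for problem in find_problems(suspect):
    print(problem)
```

The suffix test is only a heuristic; it flags records for review rather than proving that a value sits in the wrong field.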

Not surprisingly, data cleansing has more steps than doing the laundry. The first step is to analyze the data and find the real mistakes--the omissions, the contradictions and the errors. The next step is to re-engineer and validate the new metadata and rules to address those errors. The data then has to be transformed, expunging the problems. At that point, the data can be reloaded into the database. And in case that doesn't sound all that daunting, each of the steps generally has a slew of subprocesses too.

Of course, none of this has to be done manually. Many different tools have been introduced into the market. Some are more generalized while others specialize in fixing a specific problem such as names and addresses.

At the end of the day, "garbage in, garbage out" sounds really harsh. Maybe you should look at it this way--clean data is good data. But just like your clothes, if you use data it will get dirty, so the cleansing actually never ends.


By David Loshin

One nice thing about addresses, especially in the United States, is that they have well-defined standards. In previous blog series, I have looked at the process of address standardization and correction, so I won't belabor that point. However, many people confuse the differences among a valid address, a precise representation of an address, and an accurate address.
 
First of all, recall that the main reason for standardizing addresses is based on "delivery" - supporting the general expectation that posted items reach their desired destination. Validation that addresses meet the standard certainly can improve the delivery processing, especially when it comes to sorting the items to be delivered. That is the reason that the Post Office offers discounts in the costs of mailings when the addresses are validated in standard form.

Luckily, there is a lot of leeway when it comes to meeting this expectation. If you provide a street number, street address, city, and state, there is a good chance your item will be delivered even if it is missing a ZIP code.

On the other hand, the address is not necessarily valid in terms of meeting the USPS standard, since it is an incomplete address. Similarly, what was put into "address line 1" and "address line 2" might have been the reverse of what is specified by the standard (does "suite 2100" go before the street address or not?), but the carrier will still be able to put the letter into the right slot once she gets to the building. In other words, the address does not need to be valid in order to be delivered.

Nor does it really have to be that precise. We might define precision to mean that the address has enough information to direct the carrier to the specific box or slot associated with the entity referenced in the address. An address with a name, a street, and a suite number is more precise than one with just a name and a street. Yet if your local carrier has some knowledge about his route, then items that have less than precise addresses will still get to their destination.

The last term may be the most important one, though: accuracy. If the address associated with the named entity is not correct, the chances of the item being delivered are much lower. And one problem is that these characterizations are not mutually exclusive: an address can be valid (i.e. it is in standard form and there is an actual delivery point associated with it) but not accurate (if the named individual is not associated with the delivery point).
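
The distinction among the three terms can be sketched in a few lines of Python. Everything here is hypothetical: the Address class, the tiny OCCUPANTS table standing in for postal reference data, and the three predicate functions are assumptions made for illustration, not how any postal tool actually works.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Address:
    recipient: str
    street: str
    city: str
    state: str
    zip_code: Optional[str] = None
    unit: Optional[str] = None


# Hypothetical reference data: delivery point -> parties known to be there.
# Key layout: (street, unit, city, state, ZIP), all upper case.
OCCUPANTS = {
    ("100 MAIN ST", "STE 2100", "SPRINGFIELD", "IL", "62701"): {"ACME CORP"},
    ("100 MAIN ST", "STE 300", "SPRINGFIELD", "IL", "62701"): {"JANE DOE"},
}


def _same_building(a, point):
    street, _unit, city, state, _zip = point
    return (a.street.upper(), a.city.upper(), a.state.upper()) == (street, city, state)


def is_valid(a):
    """Complete, standard-form address that matches a known delivery point."""
    key = (a.street.upper(), (a.unit or "").upper(), a.city.upper(),
           a.state.upper(), a.zip_code or "")
    return key in OCCUPANTS


def is_precise(a):
    """Names a specific unit whenever the building is known to have units."""
    units = {p[1] for p in OCCUPANTS if _same_building(a, p) and p[1]}
    return not units or (a.unit or "").upper() in units


def is_accurate(a):
    """The named recipient really is associated with the location given."""
    for point, occupants in OCCUPANTS.items():
        if _same_building(a, point) and a.recipient.upper() in occupants:
            if not a.unit or a.unit.upper() == point[1]:
                return True
    return False


full = Address("Acme Corp", "100 Main St", "Springfield", "IL", "62701", "Ste 2100")
partial = Address("Acme Corp", "100 Main St", "Springfield", "IL")
wrong_party = Address("John Smith", "100 Main St", "Springfield", "IL", "62701", "Ste 2100")
for a in (full, partial, wrong_party):
    print(a.recipient, is_valid(a), is_precise(a), is_accurate(a))
```

In this sketch the second address is accurate but neither valid nor precise, while the third is valid and precise but not accurate, which is the combination that most threatens delivery.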

And while the Post Office maintains some information and tools to help synchronize validity and accuracy, sometimes there are kinks in the process, as we will see next time.

Standardizing Your Approach to Monitoring the Quality of Data

By David Loshin

In my last post, I suggested three techniques for maturing your organizational approach to data quality management. The first recommendation was defining processes for evaluating errors when they are identified. These types of processes actually involve a few key techniques:

1) An approach to specifying data validity rules that can be used to determine whether a data instance or record has an error. This is more of a discipline that can be guided by formal representations of business or data rules. Often metadata management tools and data profiling tools have repositories for capturing defined rules, leading to our next technique...

2) A method for applying those rules to data. This often will take advantage of the operational aspects of a data profiling or monitoring tool to validate a data instance against a set of rules. It may also incorporate parsing and standardization rules to identify known error patterns.

3) A means for reporting errors to a data analyst or steward. Some data analysis and profiling tools can be configured to automatically notify a data steward when a data validity rule is violated. In other situations, the results of applying the validation rule can be accumulated in a repository and a front-end reporting tool is used to provide visualization and notification of errors.

4) An inventory of actions to take when specific errors occur. As your team becomes more knowledgeable about the types of errors that can occur, you will also become accustomed to the methods employed for analysis and remediation.

In time, the repeated use of tools and the corresponding actions for remediation can be evolved into standardized methods, which can be documented, published, and used as the basis for training data quality analysts.
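
The four techniques above can be sketched together in a short Python example. The rule names, the record fields, and the ACTIONS inventory are hypothetical and chosen only for illustration; a profiling or monitoring tool would normally provide its own rule repository and notification mechanism.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the record passes


# 1) Specify validity rules (hypothetical examples).
RULES = [
    Rule("zip_present", lambda r: bool(r.get("zip"))),
    Rule("zip_format", lambda r: not r.get("zip")
         or re.fullmatch(r"\d{5}(-\d{4})?", r["zip"]) is not None),
    Rule("state_two_letters",
         lambda r: re.fullmatch(r"[A-Z]{2}", r.get("state") or "") is not None),
]

# 4) Inventory of actions to take when a specific rule is violated.
ACTIONS = {
    "zip_present": "look up the ZIP code from street, city, and state",
    "zip_format": "re-key the ZIP code from the source document",
    "state_two_letters": "map the state name to its two-letter abbreviation",
}


def apply_rules(records, rules):
    """2) Apply every rule to every record; collect (record index, rule name)."""
    return [(i, rule.name)
            for i, rec in enumerate(records)
            for rule in rules
            if not rule.check(rec)]


def report(violations, actions):
    """3) Turn violations into lines a data steward could act on."""
    return [f"record {i}: {name} -> {actions.get(name, 'escalate to data steward')}"
            for i, name in violations]


records = [
    {"state": "MD", "zip": "21201"},
    {"state": "Maryland", "zip": "2120"},
    {"state": "NY"},
]
for line in report(apply_rules(records, RULES), ACTIONS):
    print(line)
```

Keeping the rules and the action inventory as plain data, separate from the code that applies them, is one way to let the documented methods grow as the team learns about new error types.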


Garbage In ...

By Elliot King

We all know that dirty data is not really dirty; it is just incorrect. Data cleansing consists of correcting mistakes in the data.

Mistakes make their way into contact data in several different ways. It may just be wrong or incomplete; it may not be updated; and it may be duplicated if small variations are entered into the contact information each time a customer gets in touch with your organization. I know that my name shows up in at least half a dozen variations in more than one company database.
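
As a rough illustration of how small name variations can be surfaced, the Python sketch below normalizes names and groups the ones that collapse to the same form. The normalize_name and candidate_duplicates helpers and the honorific list are assumptions made for this example, and a shared name only flags candidates for review; it does not prove that two records describe the same person.

```python
import re
from collections import defaultdict

# Hypothetical list of honorifics to ignore when comparing names.
HONORIFICS = {"mr", "mrs", "ms", "dr"}


def normalize_name(raw):
    """Lower-case, reorder "Last, First", and drop punctuation and honorifics."""
    name = raw.strip().lower()
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    tokens = re.findall(r"[a-z]+", name)
    return " ".join(t for t in tokens if t not in HONORIFICS)


def candidate_duplicates(names):
    """Group raw names by normalized form; keep groups with several variants."""
    groups = defaultdict(list)
    for raw in names:
        groups[normalize_name(raw)].append(raw)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


names = ["Elliot King", "KING, Elliot", "Dr. Elliot  King", "Ellen King"]
print(candidate_duplicates(names))
```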

There are several strategies for ensuring that a high percentage of your customer contact data is correct (some errors will inevitably creep in) but one of the most important steps you can take is right at the very beginning. Before you even start collecting data, you should ask yourself how much information do you actually need to capture about each customer, and what field or fields define a unique record?

Do you really need to capture somebody's fax number? Do you need the honorifics like Mr. or Dr.? (Honorifics were on a form I recently had to complete to buy an airline ticket. In fact, they were a required field). Are there other pieces of data that can be eliminated from your contact record?

And while it is obvious that the name field should not be used to determine a unique record, what should be? With Web-based forms, for example, many people enter incorrect email addresses to avoid getting spam.

The fact is that the more information required on a contact information form, the more mistakes it will have. It is much more efficient to collect data correctly at the beginning of the process than to locate and fix incorrect data later.


