Get to the Root of Data Quality Problems

By Elliot King

Safeguarding data quality is an ongoing process. Sadly, no matter how diligent companies are, data quality problems will arise. But why is that?

  The root causes of data quality problems are deeply intertwined with daily business operations coupled with the simple truth that businesses are not static. Things change. Time decay is perhaps the most obvious root cause of data quality issues. People move; they die; they get divorced; they no longer have a need for your specific product, and so on. Updating data records to reflect those changes in a timely fashion is difficult since you may not even be aware that the change has occurred. Over time, data that was once right becomes wrong.

Time decay is a serious ongoing problem. But more dramatic quality issues often emerge when organizations grow, change the application infrastructure, or must respond to external demands. Too frequently, for example, when a company purchases another company, there is not enough time to fully merge the information infrastructure. Instead, the IT organization looks for a workaround or stopgap measure to ensure the consolidation does not impede ongoing operations. These workarounds bring with them both known and unknown risks to data quality.

A workaround is often the solution of choice in a variety of other situations as well. A company may choose to eliminate certain applications or face a new regulatory requirement. Branch office or remote operation personnel may feel that the central IT staff cannot respond to their needs fast enough. In response to the pressure to "keep things online," an organization may opt for a workaround.

A third root cause of data quality issues is that the company simply has not established a sufficient data quality program. Since not every piece of data can be verified, the data quality system itself may be flawed. Alternatively, data entry screens may be poorly designed and generate too many errors. And too few companies have standard data dictionaries.

As long as business is a fluid process, data quality issues will always arise. The key is to recognize their source and have programs in place to minimize the number of problems and reduce their impact on business.


Postal Standards and Address Quality - Take 1

By David Loshin

The USPS Postal Standard (Publication 28) provides at least some of the specifications we need for address quality. For example,

 "The Postal Service defines a complete address as one that has all the address elements necessary to allow an exact match with the current Postal Service ZIP+4 and City State files to obtain the finest level of ZIP+4 and delivery point codes for the delivery address."
The next paragraph provides some additional details:

 "A standardized address is one that is fully spelled out, abbreviated by using the Postal Service standard abbreviations (shown in this publication) or as shown in the current Postal Service ZIP+4 file."
A large part of the remainder of the document guides what is valid and what is not valid, as well as the postal standard abbreviations (as mentioned in the definition of standardized). So an address must be complete, which by definition implies that it can be matched with current Postal Service ZIP+4 and City State files.

This match is to obtain the ZIP+4, so the implication is that verification means that a complete address matches the USPS files and has the correct ZIP+4. The address components must be consistent with the postal standard in terms of valid and invalid values. For example, a street address cannot have a number that is outside the range of recognized numbers (that is, if the USPS file says that Main Street goes from 1-104, an address with 109 Main St is invalid). So validation means that the street address is consistent with what is documented by the USPS files. Standardization is also defined by the above reference: it is spelled out, and uses the USPS standard abbreviations.

In turn, the process for address quality would be to:

1) Ensure the address is complete.
2) Ensure that the address values are valid by checking them against the USPS files.
3) Verify the address's ZIP+4 by matching against the USPS files.
4) Standardize the address according to the USPS standardized abbreviations.
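The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `REFERENCE` table is a hypothetical stand-in for the USPS ZIP+4 and City State files (its street, number range, and ZIP+4 are invented), the field names are made up, all field values are assumed to be strings, and `SUFFIXES` holds just a handful of suffix abbreviations of the kind Publication 28 lists.

```python
# Hypothetical stand-in for the USPS ZIP+4 and City State files; the street,
# house-number range, and ZIP+4 below are invented for illustration.
REFERENCE = {
    ("MAIN ST", "SPRINGFIELD", "IL"): {"low": 1, "high": 104, "zip4": "62701-1234"},
}

# A handful of street-suffix abbreviations of the kind Publication 28 lists.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "BOULEVARD": "BLVD"}

REQUIRED_FIELDS = ("number", "street", "city", "state")


def standardize_street(street):
    """Upper-case the street line and abbreviate a trailing suffix word."""
    words = street.upper().replace(".", "").split()
    if words and words[-1] in SUFFIXES:
        words[-1] = SUFFIXES[words[-1]]
    return " ".join(words)


def check_address(addr):
    """Return (standardized address or None, list of findings)."""
    # 1) Completeness: every required element must be present.
    missing = [f for f in REQUIRED_FIELDS if not str(addr.get(f) or "").strip()]
    if missing:
        return None, ["incomplete: missing " + ", ".join(missing)]

    # 2) Validity: the street must exist and the number must be in range.
    key = (
        standardize_street(addr["street"]),
        addr["city"].strip().upper(),
        addr["state"].strip().upper(),
    )
    entry = REFERENCE.get(key)
    if entry is None:
        return None, ["invalid: street not found in reference data"]
    number_text = str(addr["number"]).strip()
    if not number_text.isdecimal():
        return None, ["invalid: house number is not numeric"]
    number = int(number_text)
    if not entry["low"] <= number <= entry["high"]:
        return None, [f"invalid: number {number} outside {entry['low']}-{entry['high']}"]

    # 3) Verification: compare the supplied ZIP+4 with the reference value.
    findings = []
    if addr.get("zip") != entry["zip4"]:
        findings.append("ZIP+4 set from reference data")

    # 4) Standardization: emit the address in its normalized form.
    street, city, state = key
    standardized = {
        "number": str(number),
        "street": street,
        "city": city,
        "state": state,
        "zip": entry["zip4"],
    }
    return standardized, findings
```

With this toy reference data, "12 Main Street" comes back as "12 MAIN ST" with the ZIP+4 filled in, while "109 Main St" is rejected because 109 falls outside the 1-104 range. Note that the sketch has to normalize the street line before it can look anything up, so in practice standardization and validation tend to be interleaved rather than strictly sequential.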

Senate Green Lights Postal Reform - But Is It Enough?

Postal Reform legislation passed the Senate this month, but according to the Postal Service (and who should know better?) S. 1789 falls disappointingly short of restoring the USPS to financial viability.

For the past two years, the Postmaster General and the Board of Governors of the USPS have worked diligently preparing a comprehensive five-year plan to profitability that would enable revenue generation and achieve cost reductions of $20 billion by 2015 - restoring the Postal Service to long-term growth. The USPS is currently losing $25 million a day and has a debt of more than $13 billion.

Following the two days of sessions it took to vote on all the amendments in the bill, PMG Patrick Donahoe stated: "Based on our initial analysis of the legislation passed today, losses would continue in both the short and long term. If this bill were to become law, the Postal Service would be back before the Congress within a few years requesting additional legislative reform."

So where do we go from here? To the House of Representatives with an entirely new bill - H.R. 2309 - and with their own ideas for Postal Reform.

In the meantime? Take a look here at S.1789 to see exactly what the Senators voted for, and against, and then review the USPS Plan to Profitability - 5 Year Business Plan ... It's a good read. After that ... well, we'll keep you posted.


Translating Expectations into Quality Directives

By David Loshin

In my last post, I raised the question about the variety of terms used in describing address quality, and I introduced a set of core concepts that needed to be correct to provide the best benefits for accurate parcel delivery. Let's look at these more carefully:

1) The item must be directed to a specific recipient party (either an individual or an organization).
2) The address must be a deliverable address.
3) The intended recipient must be associated with the deliverable address.
4) The delivery address must conform to the USPS standard.
Together these concepts have implications for address quality, and we can start with the first 3 concepts. The first concept implies a direct connection between entities: the sender and the recipient.

The corresponding business rule is relatively subtle - it suggests that the recipient must be identifiable to the sender. Concept #2 is a bit more direct: the address must be a deliverable address. This means that the address must carry enough information to enable a carrier to locate the address as a prelude to delivery. Concept #3 establishes a direct dependence between the recipient and the addressed location, implying awareness of that connection.

Together we can infer more discrete assertions:

• The address must be accurately mappable to a real location.
• The address must contain enough information to ensure delivery.
• The recipient must be a recognized entity.
• The recipient must be connected to the address.
In the next few posts we will figure out what these assertions really mean in terms of transforming a provided address into a complete, validated, and standardized address.
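As a rough illustration, the four assertions above can be written as checks against reference data. Everything named here is an assumption made for the sketch: the `Delivery` record, its field names, and the three in-memory sets stand in for whatever postal files and customer master data a real system would consult.

```python
from dataclasses import dataclass

# Hypothetical reference data; a real system would consult postal files and a
# customer master rather than in-memory sets.
KNOWN_LOCATIONS = {("12 MAIN ST", "SPRINGFIELD", "IL")}
KNOWN_PARTIES = {"JANE DOE"}
PARTY_AT_LOCATION = {("JANE DOE", ("12 MAIN ST", "SPRINGFIELD", "IL"))}


@dataclass(frozen=True)
class Delivery:
    recipient: str
    street_line: str
    city: str
    state: str

    @property
    def location(self):
        return (self.street_line, self.city, self.state)


def failed_assertions(delivery):
    """Return the assertions a delivery record fails, in the listed order."""
    failures = []
    if delivery.location not in KNOWN_LOCATIONS:
        failures.append("address does not map to a real location")
    if not all(part.strip() for part in delivery.location):
        failures.append("address lacks information needed for delivery")
    if delivery.recipient not in KNOWN_PARTIES:
        failures.append("recipient is not a recognized entity")
    if (delivery.recipient, delivery.location) not in PARTY_AT_LOCATION:
        failures.append("recipient is not connected to the address")
    return failures
```

Keeping the four checks separate means a record can fail one assertion without failing the others: a recognized recipient at a real address still fails the last check if nothing links the two.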


Characterizing the Quality of Address Data

By David Loshin

My company is currently working on a couple of projects associated with address quality and location master data. We are reviewing a lot of the existing documentation that has been collected from a number of different operational systems, as well as reviewing the business processes to see where location data is either created, modified, or read.

And there are many references to operations or transformations performed on addresses, mostly with the intent of improving the quality of the address.

Curiously, there are a number of different terms used to refer to these different transformations: validation, verification, standardization, cleansing, correction. I am sure there are others. But what do all these things mean? And why are these different terms used if they mean the same thing?

The first step in exploring the answer to this question is reflecting back on the nature of deliverable addresses. When an item is sent to an addressed location, there are some core concepts that need to be right:

1) The item must be directed to a specific recipient party (either an individual or an organization).
2) The address must be a deliverable address.
3) The intended recipient must be associated with the deliverable address.
In addition, there are certain incentives provided to senders when the addresses are completely aligned with the Postal Standard, adding one more concept:

4) The delivery address must conform to the USPS standard.
These directives provide us with some material with which to work for differentiating the different terms used for postal data quality. More next week...

Sometimes Data Quality is the Law

By Elliot King

We have all read the statistics about the real costs that poor data quality represents. And intuitively, we know that bad data is, well, bad. But, in many cases, bad data is more than just bad for business. Increasingly, good data is required by law.

In 2001, the U.S. Congress added two lines to its major appropriations act that required that the Office of Management and Budget "provide policy and procedural guidance to Federal agencies for ensuring and maximizing the quality, objectivity, utility, and integrity of information (including statistical information) disseminated by Federal agencies." These two lines have come to be known as the Data Quality Act (DQA) or the Information Quality Act.

While the DQA applied to federal agencies, it also put a marker in the ground. Regulatory agencies could demand that data given to them meet criteria set by the Congress. Government data had to be accurate, objective and have integrity as a matter of law.

The same notion has spread to corporate America and other specific industries. In one of the most high profile examples, the Dodd-Frank Wall Street Reform and Consumer Protection Act passed in the wake of the financial meltdown of 2008 established the Office of Financial Research with a mandate to improve the quality of financial data accessible to regulators.

One of the primary challenges for financial institutions is to ensure that the information they supply to regulators is consistent across all their divisions. Companies are also going to have to be able to track data flows and usage and develop chains of custody that can be audited. The net result should be that all users of the data will see consistent information. And that may not be easy when data is siloed in different operating entities, which is often the case in those too-big-to-fail financial behemoths.

The regulatory pressure for data quality is being felt elsewhere as well. The American Health Information Management Association, which promotes the technological advancement of health information management systems, has noted that for electronic health records (EHR) to have the positive impact on overall health care that their proponents anticipate, data quality can no longer be a reactive process based on auditing but must be proactively focused on data capture. Standards to ensure that result will be built into the requirements for EHRs.

Many folks in IT chafe at government regulations. However, the Federal government often has been a leader in IT innovation. Think about the Internet. Admittedly, Federal IT initiatives have also been inept--think about the FBI case management system or the Air Traffic Control system. While burdensome, the regulatory demand for higher quality data could easily have a positive ripple effect that will spread widely.


Clean Data is Good Data

By Elliot King

The cliché is as old as computing itself--garbage in, garbage out. And that cliché is as true now as ever, if not more so. Unfortunately, with information flowing into companies from so many sources including the Web and third-party providers, mistakes should not just be expected; they are basically inevitable. Garbage data is going to get in your data systems.

We want to close our eyes to bad data and just pretend it doesn't matter; but that would be a major mistake. Virtually any operation driven by faulty data is suspect. Trends you may uncover could be wrong. Your customer contact efforts could be inappropriate or misdirected.

Data cleansing, the systematic effort to remediate bad data, is no trivial task. First, so many different kinds of errors can exist. Mistakes can occur in single source systems as well as multiple source systems. Errors and inconsistencies can be introduced both at the metadata level--the data schema or the information wrapper, in the case of the Web, may be flawed--and at the granular level, where the information itself is just not right.

At the level of the information itself, so much can go awry. There can be missing values and misspellings. Information can be entered into the wrong field--a street name in the city field perhaps. Attributes that should be linked aren't--let's say a city without a ZIP code. Records could contradict each other or be associated incorrectly. For example, "John Smith" may actually work in payroll and not in human resources.
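A minimal sketch of how a few of these record-level errors might be flagged automatically. The field names and the `STREET_LIKE` heuristic are assumptions made for illustration; misspellings and contradictions between records need reference data or record matching and are not attempted here.

```python
import re

# Heuristic (an assumption, not a rule from the post): a "city" that starts
# with a house number or ends with a street suffix probably holds a street.
STREET_LIKE = re.compile(r"^\d+\s|\b(street|avenue|road|st|ave|rd)\.?$", re.IGNORECASE)


def find_problems(record):
    """Return a list of suspected errors in one contact record (a dict of strings)."""
    problems = []
    # Missing values in fields that should always be filled in.
    for field in ("name", "street", "city"):
        if not record.get(field, "").strip():
            problems.append(f"missing value: {field}")
    # Information entered into the wrong field.
    city = record.get("city", "").strip()
    if STREET_LIKE.search(city):
        problems.append("city field looks like a street address")
    # Attributes that should be linked but are not.
    if city and not record.get("zip", "").strip():
        problems.append("city present without a ZIP code")
    return problems
```

The pattern only matches a suffix at the end of the value, so a city such as "St. Louis" is not flagged, but any heuristic of this kind will produce some false positives and misses.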

Not surprisingly, data cleansing has more steps than doing the laundry. The first step is to analyze the data and find the real mistakes--the omissions, the contradictions and the errors. The next step is to re-engineer and validate the new metadata and rules to address those errors. The data then has to be transformed, expunging the problems. At that point, the data can be reloaded into the database. And in case that doesn't sound all that daunting, each of the steps generally has a slew of subprocesses too.

Of course, none of this has to be done manually. Many different tools have been introduced into the market. Some are more generalized while others specialize in fixing a specific problem such as names and addresses.

At the end of the day, "garbage in, garbage out" sounds really harsh. Maybe you should look at it this way--clean data is good data. But just like your clothes, if you use data it will get dirty, so the cleansing actually never ends.


By David Loshin

In my last post, I shared a story about how rampant address validation actually can transform accurate (if not 100% standardized) addresses into inaccurate ones. My client actually noted that with some of the tools they have seen, addresses that have been submitted to the product and standardized are then re-submitted to the product, which then reports their own corrected address as being invalid! Huh?
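One way to catch this behavior is to test whether a standardizer is idempotent, that is, whether feeding its own output back in leaves the address unchanged. The sketch below uses two toy standardizers written for this example (neither represents any real product): one is stable, and one has a rule conflict that rewrites its own output on the second pass.

```python
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}


def standardize(address):
    """Toy standardizer: upper-case, drop periods, abbreviate suffix words."""
    words = address.upper().replace(".", "").split()
    return " ".join(SUFFIXES.get(word, word) for word in words)


def buggy_standardize(address):
    """Toy standardizer with conflicting rules: it abbreviates STREET to ST,
    but also expands any ST token it is given to SAINT."""
    words = address.upper().replace(".", "").split()
    out = []
    for word in words:
        if word == "ST":
            out.append("SAINT")
        elif word == "STREET":
            out.append("ST")
        else:
            out.append(word)
    return " ".join(out)


def is_idempotent(tool, address):
    """True if re-submitting the tool's own output leaves it unchanged."""
    once = tool(address)
    return tool(once) == once
```

Here `is_idempotent(standardize, "12 Main Street")` holds, while the buggy version turns "12 MAIN ST" into "12 MAIN SAINT" on the second pass. A check like this is cheap to run over a sample of production addresses before a tool is placed at more than one point in the processing stream.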

Here are a few reflections:

1) Siloed application development has led to multiple iterations of address correction, often using different tools or, if using the same tool, different sets of rules. Either way, the result is likely to be inconsistency.

2) Address standardization tools have to be maintained - the rules may change, locations may change, the standards may evolve.

3) The criticality of precision and standardization/validation has to be assessed in the context of how the address (or location) is being used. For example, for delivery purposes accuracy should trump "validity" if a validated address is no longer correct.

4) Find a single tool solution for address standardization and validation instead of using multiple products.

5) Have a single team responsible for managing the solicitation of requirements, definition of rules, and provision of address validation and correction services.
One last thought: push address correction and validation to the earliest point in the business process. If your application solicits an address from the customer, validate it right away and have the corrected address verified directly by the customer. And trust the customer to give you accurate address data - they probably have some experience in ensuring they get their mailings!


By Joseph Vertido

For many, the concepts of data integration and data quality are separate and have no commonality. But in reality, when you combine them - they create a partnership that excels. Where data quality leaves off, data integration begins, and vice versa. A new product - Contact Zone - fuses these two concepts together into one revolutionary solution for where data integration and data quality converge.

Data integration tools simplify data migration and data warehousing procedures - both of which are concerned with the issue of data management, i.e. keeping data organized. Data quality, on the other hand, is concerned primarily with an understanding of the nature and validity of the contents of the actual data, i.e. keeping data clean. Maintaining an organized database is not the same as keeping it clean - they are two different approaches to handling data - but can they be combined, and should they be?

The short answer is yes.

In essence, data integration allows for the migration of data from a given source to a given destination. Typically, users take advantage of data integration to accomplish data warehousing initiatives - allowing for easy migration and manipulation of data, which ultimately leads to maximizing the efficiency of business intelligence and analytics.

However, Gartner states that "only 30 percent of business intelligence and data warehousing implementations fully succeed." Why? The top two reasons for failure are budget constraints and data quality. So, although the architectural constraints of building a data warehouse can be addressed by utilizing data integration tools, it still leaves the problem of poor data quality - something that most data integration tools handle with mediocrity at best.

That's where Contact Zone comes into play. It's a data integration tool optimized for data quality, allowing you to kill two birds with one stone.

Contact Zone connects to virtually any source, overcoming an obstacle our clients frequently encounter when implementing data quality: there is such a variety of database formats and platforms today that the types of environments and combinations can be overwhelming.

Whether you have an IBM DB2 database or PostgreSQL, Contact Zone allows for data integration across almost any database format, while making sure that all data is clean, correct, standardized, and valid.

By David Loshin

There are all sorts of tools associated with address standardization, cleansing, and validation. As an example, the USPS has a certification program for software vendors, referred to as CASS (Coding Accuracy Support System)™ certification. According to their website,

CASS enables the Postal Service™ to evaluate the accuracy of address matching software programs in the following areas:

(1) five-digit coding
(2) ZIP + 4/ delivery point (DP) coding
(3) carrier route coding
(4) DPV®
(5) DSF2®
(6) LACSLink®
(7) eLOT®
(8) RDI™ products

CASS allows vendors/mailers the opportunity to test their address-matching software packages and, after achieving a certain percentage of compliance, to be certified by the Postal Service. CASS does not measure the accuracy of ZIP + 4 delivery point, five-digit ZIP, or carrier route codes in a mailer's existing files. CASS enables mailers to measure and diagnose internally written, commercially-available, address-matching software packages. The effectiveness of service bureaus' matching software can also be measured.

There are many vendors selling CASS-certified tools and services. Organizations use CASS-certified tools for address standardization, correction, and validation. End of story, right?

Wrong. Some organizations use many CASS-certified tools for address standardization, correction, and validation, at different places along the processing stream. The addresses are standardized, cleansed, and validated (or not) multiple times. The addresses are changed from their original format, manipulated, and then shoved back into the databases, without considering the actual process dependencies or expectations.

And then you end up with a scenario like this: a process accepts customer applications, including their self-provided addresses, and sends hard copies of acknowledgements to those addresses, yet the process also includes an elaborate mechanism for managing returned mail. That did not make sense to me: if the organization sends out acknowledgements to the address the customer provided, wouldn't they trust that the customer provided an accurate address?

In fact, the issue was self-created: the provided address passed through at least three different iterations (with different products!) of standardization, correction, and validation, and was transformed from a deliverable ("accurate") address to an invalid one.

So even though the intent was appropriate, the execution of the process got in the way of the results. So I'll throw out two questions: First, is address standardization and validation a tool or a process? And second, at what point and how frequently in the business process flow should address standardization and validation take place?

