Recently in Analyzing Data Quality Category

Where Do You Fit In?

By Elliot King

Too often, those of us with our noses to the grindstone have no time to look up. We are so busy putting out fires, monitoring and maintaining what we have, or trying to launch new initiatives that we never look around to see how other organizations are dealing with similar issues.

This may be particularly true in the data quality world. Data quality is often seen as an internal problem and it is often addressed differently in different settings, both organizationally and technically. Indeed, even the terminology is not consistent across industries.

So a recent study conducted by the International Association for Information and Data Quality (IAIDQ) working in conjunction with the Information Quality Program at the University of Arkansas, Little Rock (UALR-IQ) reveals some very interesting trends. The survey of 270 data quality professionals identified the top challenges faced by data quality professionals.

Heading the list is a lack of accountability and responsibility for data quality, followed by too many data and information silos to manage, a lack of awareness and discussion of the size and impact of data quality problems and a lack of understanding of what data quality means. These challenges are fundamental and each was tabbed by more than 50 percent of the respondents.

Considering the basic nature of the challenges, perhaps it should be no surprise that 66 percent of the respondents believed that the effectiveness of the data quality efforts in their organization was only OK (some goals were met) or poor (few goals were met). Ironically, 70 percent claimed that their organizations recognized data and information as important strategic assets and managed them with that in mind.

So what is driving companies to improve their data quality efforts? According to the survey, the top driver is just a general desire to improve the quality of data, which was cited by 68 percent of the respondents. Other important motivations to improve data quality were the desire to improve business intelligence, and compliance and legal considerations.


Using Data Quality Tools for Classification

By David Loshin

Hierarchical classification schemes are great for scanning through unstructured text to identify critical pieces of information that can be mapped to an organized analytical profile. To enable this scanning capability, you will need two pieces of technology.

The first involves a text analysis methodology for scanning text and determining which character strings and phrases are meaningful and which ones are largely noise.

The second capability maps identified terms and phrases within existing known hierarchies and performs the classification.

Both of these techniques would work perfectly as long as the input data is always correct and complete - quite an assumption. That is why we need to augment these approaches with data quality techniques, largely in the area of data validation and data standardization/correction. For example, I am particularly guilty of character transposition when I type, and am as likely to tweet about my "Frod F-150" as I would about my "Ford F-150." In this example, the inexact spelling would lead to a failure to classify my automobile preference.

However, using data quality tools, we can create a knowledge base of standard transformations that map common error schemes to their most appropriate matches. Creating a transformation rule mapping "Frod F-150" to "Ford F-150" would suggest the likely intent, supplementing the classification process.
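As a rough illustration of such a knowledge base, here is a minimal Python sketch. The names `TRANSFORMATIONS`, `STANDARD_TERMS`, and `standardize` are hypothetical, and the fuzzy fallback using the standard library's `difflib` is an added assumption rather than anything a particular data quality tool is known to do. It shows an explicit rule for a known error, with approximate matching as a safety net for errors that have no rule yet.

```python
import difflib

# Hypothetical knowledge base: known error strings mapped to standard values.
TRANSFORMATIONS = {
    "frod f-150": "Ford F-150",
}

# Hypothetical list of standard values the classifier already understands.
STANDARD_TERMS = ["Ford F-150", "Chevy Silverado"]


def standardize(term, cutoff=0.85):
    """Return the standard form of a term, or None if nothing plausible matches."""
    key = " ".join(term.lower().split())
    # 1. An explicit transformation rule wins.
    if key in TRANSFORMATIONS:
        return TRANSFORMATIONS[key]
    # 2. The term may already be a standard value (ignoring case).
    by_lower = {standard.lower(): standard for standard in STANDARD_TERMS}
    if key in by_lower:
        return by_lower[key]
    # 3. Assumed fallback: approximate string matching for unseen typos.
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None


print(standardize("Frod F-150"))
```

The explicit rules keep the behavior predictable for common errors, while the cutoff on the fuzzy step is a judgment call: set it too low and unrelated terms start to match.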

In other words, integrating our text analytics tools with more traditional data quality methodology will not only (yet again) reduce inconsistency and confusion, it will also enhance the precision of analytical results and enable more robust customer profiling - a necessity for customer centricity.

By Allison Moon
Data Quality Analyst

In today's e-commerce environment, Web forms or online shopping carts serve to capture valuable contact data, but many times this data can contain inconsistencies and missing or incorrect information.


Fortunately, Melissa Data offers a solution. With a partially entered address line and ZIP Code™, our new auto-complete feature in Address Object can retrieve all possible address entries from which a user can select. It's simple to use, and can be easily integrated into your existing solution with just a few steps. Address Object is Melissa Data's address verification solution (available as a multiplatform API or Web service).

Ensuring accuracy before bad data enters your CRM systems will spare your company lost revenue and time, inefficiencies and waste.

Uses and Benefits of Implementing Auto-Completion

So how can you take advantage of this new feature? For those who need to save as much time and as many keystrokes as possible, the auto-complete feature is a pretty awesome tool. Having the ability to retrieve a list of suggestions based upon the street number, and even just the first couple of letters of the street name, saves time typing out an entire address.

The auto-complete functionality can also help find the correct suites and valid ranges for a building. In a call center setting, auto-complete can allow you to see whether the customer on the phone has forgotten to mention their apartment information, before the call has ended.

Or perhaps you're on the phone with a customer and quickly scribble down their mailing address. But now, when you look back at your notes, it's hard to read except for the first few numbers and characters (and who hasn't done this before?). With auto-complete, you can plug in the information (that you can decipher from your notes) and determine which address you meant to write down by looking at the suggestions returned.

Auto-completion is flexible enough to accommodate varying needs and design requirements, reduce the time spent finding addresses, and prevent issues by returning valid addresses given an incomplete one.
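To make the idea concrete, here is a minimal Python sketch of prefix-based suggestion. It is not Address Object's API: the function `suggest` and the `SAMPLE_ADDRESSES` records are made up for illustration, and a real lookup would run against verified postal data rather than an in-memory list.

```python
# Made-up sample records (street address, ZIP Code); not real data.
SAMPLE_ADDRESSES = [
    ("123 Main St", "00001"),
    ("123 Maple Ave Apt 1", "00001"),
    ("123 Maple Ave Apt 2", "00001"),
    ("125 Main St", "00001"),
    ("123 Main St", "00002"),
]


def suggest(partial_address, zip_code, records=SAMPLE_ADDRESSES):
    """Return addresses in the given ZIP Code that start with the partial entry."""
    # Normalize case and collapse repeated spaces before comparing.
    prefix = " ".join(partial_address.lower().split())
    return [
        street
        for street, record_zip in records
        if record_zip == zip_code and street.lower().startswith(prefix)
    ]


# A street number plus the first two letters of the street name.
print(suggest("123 Ma", "00001"))
```

Typing one more letter, as in `suggest("123 Map", "00001")`, narrows the list to the two apartment entries, which is the kind of prompt that reminds a call center agent to ask for the unit number.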

-- Allison Moon is Melissa Data's data quality analyst and software engineer.

To download a free trial of Address Object, please go to:
http://www.melissadata.com/free-trials/address-object-address-verifier.htm

Classifying Data Quality Problems

By Elliot King

Data quality is generally most fruitfully defined in the context of its use. Is the data good enough to allow the process with which it is associated to run efficiently and effectively? For example, is the mailing list you are using for a direct solicitation accurate enough that you can achieve your goals and not generate any unwanted and unanticipated negative consequences?

And while that definition may be good enough in a practical sense for specific issues, it really isn't good enough to diagnose the sources of data quality problems generally. Constructing a general framework for data quality problems can be a useful guide in better identifying and resolving specific issues.

One of the earliest efforts to better understand the nature of data quality problems calls for classifying problems into three general categories--operational, conceptual and organizational. Operational data quality issues are those that are generated through problems with data capture and transmission. Inaccurate data is collected. Data may be missing. Or data may be corrupted through some process, for example.

Conceptual data quality problems occur when data is not well defined or it is inappropriate for its intended use. One of the most famous examples of a conceptual data quality problem (though it is not often thought of in this way) was brought to light in the movie Moneyball.

The basic thrust of the movie was not that the information old-time baseball scouts used to evaluate players was wrong per se; it was that they were collecting the wrong data to identify productive players. Batting average, for example, is less useful in determining a player's value than on-base percentage. A pressing new conceptual data problem is the attempt to use electronic patient records to judge medical treatment outcomes.

When operational and conceptual data problems persist over time despite repeated attempts to fix them, organizational data quality problems are usually the culprit. In these cases, wrong, missing and invalid data is not really the problem, but the symptom. Something has to be fixed in the organizational structure or culture.

The point is this--data can be wrong for many reasons and it can't fundamentally be fixed without a general understanding of the error's cause.

Understanding Hierarchies

By David Loshin

Defining standards for group classification helps in reducing confusion due to inconsistencies across generated reports and analyses. In the automobile classification example we have been using for the past few posts, we might pick the NHTSA values (mini passenger cars, light passenger cars, compact passenger cars, medium passenger cars, heavy passenger cars, sport utility vehicles, pickup trucks, and vans) as the standard.

Yet, as more organizations look to merge data sets and feeds from different sources, some challenges remain, particularly with the use of unstructured text (such as that presented via Twitter or Facebook). People cannot be expected to always conform to your organization's data standards, and often use colloquial terms or their own words to describe ideas that would map to your own dimensional values.

For example, if you wanted to filter out the individuals who prefer to drive "pickup trucks" (one of our standard values), it is not enough to scan for that phrase. Many individuals will refer to their pickup truck using different terms, such as a make and model ("Ford F-150," "Chevy Silverado") or a different name ("light truck") or a nickname ("baby monster"), but these terms have to be linked to the overall classification term.

This is an example of a simple hierarchy, in which one concept ("automobiles") is divided into a collection of smaller classes (the NHTSA classifications). Each of those classes in turn contains other phrases and terms. Within each of those included collections, there may be other inclusive categorization, such as by make and then model.

With a well-defined hierarchy for classification, unstructured text can be scanned for matches with values that live within the hierarchy, and that enables the standardized classification. To round out the example, a Twitter tweet exclaiming the author's love of "driving his Ford F-150" can be scanned, with the model name extracted, located within the make and model hierarchy for pickup trucks, thereby allowing us to register his/her automobile driving preference!
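A minimal Python sketch of that scan might look like the following. The structure and the names `HIERARCHY`, `ALIASES`, and `classify` are assumptions made for illustration; the sample terms are the ones used in this post, and a production system would rely on real text analysis rather than plain substring matching.

```python
# Hypothetical fragment of the hierarchy: class -> make -> models.
HIERARCHY = {
    "pickup trucks": {
        "Ford": ["F-150"],
        "Chevy": ["Silverado"],
    },
}

# Other names and nicknames linked directly to a standard class.
ALIASES = {
    "light truck": "pickup trucks",
    "baby monster": "pickup trucks",
}


def classify(text):
    """Return (class, make, model) for the first match in the text, else None."""
    lowered = text.lower()
    # Most specific first: a make and model found inside the hierarchy.
    for vehicle_class, makes in HIERARCHY.items():
        for make, models in makes.items():
            for model in models:
                if f"{make} {model}".lower() in lowered:
                    return (vehicle_class, make, model)
    # Then colloquial names that map straight to a class.
    for alias, vehicle_class in ALIASES.items():
        if alias in lowered:
            return (vehicle_class, None, None)
    return None


print(classify("Love driving my Ford F-150!"))
```

Checking make and model before aliases keeps the most detailed match when a tweet contains both.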

Your Data Quality Scorecard: The 7 Cs of Data Quality

Do you know how accurate your company's data is? How much bad data is costing your organization? How you can get a single, complete view of your customers? Data quality today is a bottom-line issue that businesses must address to stay competitive. To help business managers and executives, marketing professionals, and other non-tech personnel understand what data quality is, and why it's important - we outlined 7 data quality principles in a convenient and easy-to-follow format.

Standardizing Classifications

By David Loshin

In the most recent post, we posed a straightforward problem: if we have a reporting or analytical objective that depends on using a dimension for classification, what happens when two different value domains are presumed to map to the same conceptual domain?

More concretely, the example we used was mapping individuals to their car purchase preferences, but different applications used different car classifications that did not share the same number of values and the value sets did not directly map in a one-to-one manner. The potential result is confusion in interpreting the results, especially if this classification is just one variable used for creating a customer profile.

One way to address this is to put a standards policy for classification dimensions in effect by selecting a single set of concepts, mapping those conceptual values to a standard single set of values, and then insisting that any application that uses that conceptual domain always use the standard.
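One small piece of such a policy, checking that an application only uses values from the standard, can be sketched in a few lines of Python. This assumes the NHTSA values have been chosen as the standard; the names `NHTSA_STANDARD` and `nonstandard_values` are hypothetical.

```python
# The eight NHTSA classifications, assumed here to be the chosen standard.
NHTSA_STANDARD = frozenset({
    "mini passenger cars",
    "light passenger cars",
    "compact passenger cars",
    "medium passenger cars",
    "heavy passenger cars",
    "sport utility vehicles",
    "pickup trucks",
    "vans",
})


def nonstandard_values(values):
    """Return the sorted, de-duplicated values that are not in the standard."""
    return sorted({
        value
        for value in values
        if " ".join(value.lower().split()) not in NHTSA_STANDARD
    })


# Values as they might arrive from an application that uses another scheme.
print(nonstandard_values(["pickup trucks", "Small", "Vans", "Luxury", "Small"]))
```

A report like this only flags the deviations; deciding how each non-standard value should map to the standard is still a governance decision.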

This sounds simple, but it actually may entail some effort, since no one person may be aware of all the places that any specific classification domain is used.

This task goes beyond a "data management" activity and essentially becomes a "data governance" one involving a broad solicitation across the community of data consumers to determine the classification dimensions used and the enumerations of values employed within each dimension.

At the same time, the analyst spearheading this effort must have a plan for capturing the classification data, harmonizing values across variant lists, selecting a standard, communicating the standard, and then ensuring that the standard is put into practice.

Establishing good practices and processes for domain harmonization and standardization is an important topic to be considered in upcoming posts, but next time we will look at a growing challenge for classification domains: aligning data from unstructured text with the standard classification dimensions.

The People You Should Care About Most

By Elliot King

This goal should be a no-brainer. When a customer interacts with your organization, your point-of-contact personnel should have accurate information about your products and services and about the person, especially in the case of a repeat customer. When front-line personnel provide incorrect or incomplete information, or don't have access to information they should have, the customer experience suffers.

Ironically, the impact of poor data quality on customer satisfaction is often overlooked. According to a survey of members of the Association of Business Process Management Professionals (ABPMP), of the 45 percent who reported that they are working on improving CRM processes, only 38 percent have evaluated the impact that poor-quality data has on the effectiveness of these processes. That statistic is a little frightening.

Poor data quality coupled with the inability to deliver the right data to the customer at the right time damages overall customer satisfaction in a variety of ways. Consider this scenario. Many retailers run multiple promotions at the same time. Each promotion has different rules and restrictions and runs for a different time period.

When customers go to pay, if the final price does not reflect the discounts they anticipated, they will not be very happy. The flip side of the coin is true as well. If a sales process does not include a discount where one is due, an opportunity to build goodwill or perhaps close a deal will be lost.

But misunderstandings and the failure to provide the appropriate product are the tip of the iceberg. If point-of-contact personnel do not have confidence in the information they receive, the process for which the data is needed will be slowed and their productivity diminished.

And there is the annoyance factor. Think about your interactions with your telephone or cable provider. You call with a problem. The computer confirms the number you are calling about. If you are lucky and you finally find a human to talk to, how many times do they have to reconfirm all your information--telephone number, address, and so on--especially when you are passed from one person to another? Okay, part of the problem is most likely the lack of overall system integration, but part of the problem is a fear of faulty information.

Poor data quality is not a theoretical issue. It can hurt you in the place that may hurt most--your relationship with your customers.


Groupings and Hierarchies

By David Loshin

In our last set of posts, we looked at householding - inferring relationships for grouping individuals together based on shared characteristics. In this series, we look at how we manage the quality of the data representing those shared characteristics. First let's look at an example: organizing individuals based on their preferences for types of cars. There are a number of different classifications of cars, mostly focusing on car size, and these can be used for grouping individuals by preference.

And that is the problem: there are a number of different classifications of cars, and without a defined standard, there's bound to be confusion. Here are three examples (I got them from a page at Wikipedia):


• The Highway Loss Data Institute (HLDI) classifies cars into five groups: Sports, Luxury, Large, Midsize, and Small.
• The National Highway Traffic Safety Administration (NHTSA) has eight classifications (based on curb weight of the car): mini passenger cars, light passenger cars, compact passenger cars, medium passenger cars, heavy passenger cars, sport utility vehicles, pickup trucks, and vans.
• The EPA has a car classification based on interior and cargo space: Two-seaters, minicompacts, subcompact, compact, mid-size, large, small station wagons, mid-size station wagons, large station wagons.

While one application might assign a demographic classification based on the HLDI groupings, another application might use the NHTSA classification, but aspects of those classifications don't match: the set of small HLDI cars might include the NHTSA sets of mini passenger cars, light passenger cars, and compact passenger cars.
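The mismatch is easy to see once the mapping is written down. The Python sketch below encodes only the relationship suggested above (three NHTSA classes that might fall under the HLDI "Small" group); the names `NHTSA_TO_HLDI`, `invert`, and `is_one_to_one` are hypothetical, and the mapping itself is an illustration rather than an official crosswalk.

```python
from collections import defaultdict

# Illustrative mapping only: three NHTSA classes that might all be "Small" in HLDI terms.
NHTSA_TO_HLDI = {
    "mini passenger cars": "Small",
    "light passenger cars": "Small",
    "compact passenger cars": "Small",
}


def invert(mapping):
    """Group source values by the target value they map to."""
    inverse = defaultdict(list)
    for source, target in mapping.items():
        inverse[target].append(source)
    return dict(inverse)


def is_one_to_one(mapping):
    """True only if no two source values share the same target value."""
    return len(set(mapping.values())) == len(mapping)


print(is_one_to_one(NHTSA_TO_HLDI))
print(invert(NHTSA_TO_HLDI)["Small"])
```

Going from NHTSA to HLDI loses detail, and going the other way is ambiguous: a record that says only "Small" cannot be assigned to a single NHTSA class without more information.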

The absence of a standard within the enterprise for choices of classification may seem irrelevant within siloed functions, but as more business processes are monitored across multiple functions, variant dimensions for classification and analysis will create confusion somewhere down the line.

Mistakes Are All Around Us

By Elliot King

Mistakes happen. No matter how effective your data quality program is; no matter how well trained your personnel are; no matter how aware you are of the high cost of low data quality, data errors will creep into your databases. The reason is simple. Before information winds up in a database, it passes through a series of steps involving both human interaction and computation, from data acquisition to archival storage. There are so many opportunities for things to go astray that, inevitably, from time to time they will.

Perhaps the most common source of data errors and one of the most difficult to correct is data entry itself. In many cases, humans--typically data entry clerks, customer service representatives or end users via the Web or some other mechanism--initially enter data into the system.

They enter data incorrectly for many reasons. Data entry personnel or customer service representatives may be required to work too fast. The entry screen may be poorly designed. Or, in some cases, data may be incorrectly entered intentionally. For example, in several cases across the country, criminal lab personnel have been found to have entered data into their systems for tests that they did not perform. And a recent study of the use of electronic medical records found that medical personnel may be falsely claiming to have performed specific examinations.

Incorrect data entry, accidental or not, is only one part of the problem. Sometimes the data itself is just wrong. Measurements have been taken incorrectly. Sampling techniques are wrong and so on. As data volumes increase, controlling the quality of the initial data becomes more difficult. Moreover, frequently, data analysis does not require, or cannot use, all the data collected. The techniques used to distill or summarize data may be faulty.

Finally, in many cases, data does not come from a single source. With data flowing into systems from several different directions, data integration is a challenge. In practice, databases are always evolving and as more data is incorporated into different repositories, inconsistencies must always be resolved.

There is some hope. If you pay attention, the amount of incorrect data in your system can be reduced, but probably cannot be totally eliminated. The sources of errors are just too pervasive.