Data Quality Incident Management

By David Loshin

The previous step in our transition from uncontrolled reactivity to being proactively engaged in managing data quality involved defining processes for identifying and evaluating data errors using standardized methods. Providing well-defined processes to data stewards and data quality analysts helps reduce any confusion around the appropriate steps to take when those commonly-occurring data failures are discovered in process.

But in many cases there is still an issue of coordination. While standardizing the approaches to monitoring for data validity helps reduce the effort and complexity of analyzing and remediating issues, the same error may still impact multiple data consumers; if each data consumer reports the issue to a different data steward, you have many stewards investigating the same problem. So this is where our second suggestion comes in: instituting methods for coordinating those evaluations.

This is an area in which the data management world can learn lessons from our friends in desktop or network support, who rely on incident management systems for the reporting, tracking, and management of issues. Data consumers impacted by a data error can report the flaw in the incident management and tracking system, which can assign a unique identifier to the logged issue and then route it to a specific data steward.

However, carefully structuring the ways that errors are described when they are reported provides hierarchies and organization that facilitate the assignment of issues to those stewards with the greatest corresponding experience.

In other words, issues can be grouped to reduce the amount of replicated effort. In turn, an incident tracking system for data quality issues also provides entry points for tracking the status of the issues - whether the root cause has been identified, if a correction has been performed, or if further evaluation is being performed.
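
As a rough sketch of the idea, the following Python fragment logs reported data errors, assigns each new issue a unique identifier, groups duplicate reports under a single issue, and routes the issue to a steward by category. The class names, category labels, status values, and steward names are all hypothetical; a real incident management system would offer far more than this.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Issue:
    issue_id: int
    category: str            # hypothetical label such as "address"
    description: str
    steward: str
    status: str = "open"     # later: "root cause identified", "corrected", ...
    reporters: list = field(default_factory=list)


class IncidentTracker:
    def __init__(self, steward_by_category, default_steward):
        self._steward_by_category = steward_by_category
        self._default_steward = default_steward
        self._issues = {}             # (category, description) -> Issue
        self._next_id = count(1)

    def report(self, reporter, category, description):
        """Log a report; reuse the existing issue if the same error is already known."""
        key = (category.strip().lower(), description.strip().lower())
        issue = self._issues.get(key)
        if issue is None:
            steward = self._steward_by_category.get(key[0], self._default_steward)
            issue = Issue(next(self._next_id), key[0], description.strip(), steward)
            self._issues[key] = issue
        issue.reporters.append(reporter)
        return issue


tracker = IncidentTracker({"address": "steward_a"}, default_steward="steward_b")
first = tracker.report("billing", "Address", "Missing postal code")
second = tracker.report("shipping", "address", "missing postal code ")
```

Here the two data consumers end up attached to the same issue, so only one steward investigates. Matching on an exact (category, description) pair is a deliberate simplification; grouping free-text descriptions in practice would need a more structured reporting form.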

In the next post: Using data quality rules as a proactive measure.


Reducing Data Quality Risks

By Elliot King

As Donald Rumsfeld, the former secretary of defense, once famously said, "there are known unknowns and there are unknown unknowns." In other words, some things we know we don't know, and consequently we can do the research to learn what we need to know. But other times, we don't even know what we don't know. Unknown unknowns present real risks, as Rumsfeld sadly learned.

Data quality can fall into both camps. Sometimes companies understand that the quality of their data varies and they have to assess its quality regularly. But in many cases, an organization is completely unaware of data problems. So how can you mitigate the risks of developing unknown data quality issues? The best solution is prevention.

The most visible cause of data corruption is poor data entry. If there are no rules defining how data is entered into your information systems, inconsistencies will inevitably fester.

For example, should name records be required to include Mr., Ms., Mrs., Miss, Dr. and so on? Who is a Ms., who is a Mrs., and who is a Miss? Should the records include those titles at all? Can the United States be entered in an address record as U.S., USA, U.S.A. or the United States? Those choices need to be defined and those definitions need to be enforced. The choices should also be rational. If you capture more information than you really need, you become more vulnerable to data quality issues.
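
To make the point about defining and enforcing choices concrete, here is a minimal Python sketch that maps the country variants mentioned above to one standard form and flags anything it does not recognize. The choice of "United States" as the standard and the function name are assumptions made for illustration only.

```python
# Hypothetical house rule: the country is always stored as "United States".
COUNTRY_VARIANTS = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "the united states": "United States",
}


def standardize_country(raw):
    """Return the standard form, or None so the entry can be rejected or reviewed."""
    key = raw.replace(".", "").strip().lower()
    return COUNTRY_VARIANTS.get(key)
```

Returning None for an unrecognized value, rather than storing it as typed, is what turns the definition into an enforced rule at the point of entry.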

Data decay is also a serious driver of data corruption. People move and they change their names. According to some estimates, name and address data can decay at the rate of two percent a month or 25 percent a year. That kind of decay occurs whether you track it or not.

Other common sources of data corruption are data migration--when information from one system or application is moved to another; data mergers--when data is combined from different sources into a master file; and data consolidation--when companies attempt to eliminate redundant data.

Companies that are alert to the threats those kinds of processes pose to data quality can put safeguards into place to mitigate the problems that might arise. But those who are blind to the risks, risk being blindsided.


Standardizing Your Approach to Monitoring the Quality of Data

By David Loshin

In my last post, I suggested three techniques for maturing your organizational approach to data quality management. The first recommendation was defining processes for evaluating errors when they are identified. These types of processes actually involve a few key techniques:

1) An approach to specifying data validity rules that can be used to determine whether a data instance or record has an error. This is more of a discipline that can be guided by formal representations of business or data rules. Often metadata management tools and data profiling tools have repositories for capturing defined rules, leading to our next technique...

2) A method for applying those rules to data. This often will take advantage of the operational aspects of a data profiling or monitoring tool to validate a data instance against a set of rules. It may also incorporate parsing and standardization rules to identify known error patterns.

3) A means for reporting errors to a data analyst or steward. Some data analysis and profiling tools can be configured to automatically notify a data steward when a data validity rule is violated. In other situations, the results of applying the validation rule can be accumulated in a repository and a front-end reporting tool is used to provide visualization and notification of errors.

4) An inventory of actions to take when specific errors occur. As your team becomes more knowledgeable about the types of errors that can occur, you will also become accustomed to the methods employed for analysis and remediation.
In time, the repeated use of tools and the corresponding actions for remediation can be evolved into standardized methods, which can be documented, published, and used as the basis for training data quality analysts.
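
The four techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the two rules, and the remediation actions are hypothetical, and a data profiling or monitoring tool would normally supply the rule repository and the notification mechanism.

```python
def has_customer_id(record):
    return bool(record.get("customer_id"))


def state_is_two_letters(record):
    state = record.get("state") or ""
    return len(state) == 2 and state.isalpha()


# 1) Validity rules: a name paired with a check that returns True for a valid record.
RULES = [
    ("customer_id present", has_customer_id),
    ("state is two letters", state_is_two_letters),
]

# 4) Inventory of actions to take when a specific rule is violated.
ACTIONS = {
    "customer_id present": "return record to the source system owner",
    "state is two letters": "apply the state standardization table",
}


def validate(records):
    """2) Apply every rule to every record and collect the violations."""
    violations = []
    for index, record in enumerate(records):
        for name, is_valid in RULES:
            if not is_valid(record):
                violations.append((index, name, ACTIONS[name]))
    return violations


records = [
    {"customer_id": "C001", "state": "MD"},
    {"customer_id": "", "state": "Maryland"},
]

# 3) Report the errors; printing stands in for notifying a data steward.
for index, rule, action in validate(records):
    print(f"record {index}: failed '{rule}' -> {action}")
```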


Reactivity vs. Proactivity

By David Loshin

In the past few months, we have looked at technical approaches to data quality and the use of data quality tools to parse, standardize, and cleanse data. In this next series of posts, it is time to look at harnessing the power of these tools and techniques to support a data quality management program. Most organizations are relatively immature when it comes to addressing data quality issues. Some typical behaviors in an immature organization include:

   · Few or no well-defined processes for evaluating the severity or root causes of data issues
   · Little or no coordination among those investigating data errors
   · Evaluating the same issues multiple times
   · Correcting the same errors multiple times

These are all manifestations of a more insidious problem: knee-jerk reactivity, which presumes that addressing the symptoms solves the problem. But in reality, applying these bandages to open wounds is merely a temporary fix. This suggests that incremental maturation of data quality processes involves transitioning from a reactive environment to one that operates within the context of a series of policies and controls.

The manifestations of immaturity listed here are some fertile areas for improvement, namely:

   · Defining processes for evaluating data errors when they are identified
   · Instituting methods for coordinating those evaluations
   · Applying corrections once, and only once.

As a byproduct of coordinating evaluation, your team will be less inclined to evaluate the same issues multiple times! In my next set of posts, we will look at ideas for each of these suggestions.

Four Pillars of Data Quality Improvement

By Elliot King

Almost all data quality management programs have four key elements that serve as the foundations for success--data profiling, data improvement, integration and data augmentation. Put in other words, data quality programs must determine what is broken; fix what can be fixed; consolidate what can be consolidated and enhance what needs to be enhanced. Sounds easy, right? If only.

Data profiling is the process for identifying what records are "broken." It consists of comparing your actual data to what you think you should have. Since data flows into organizations via so many routes, errors are inevitable. But if you never look for them, you won't know the data is flawed until something unexpected--usually unexpectedly bad--happens.
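
A very small Python sketch of that comparison, assuming a single column with a known set of expected values (the column contents and the function name are made up for illustration):

```python
from collections import Counter


def profile_column(values, expected):
    """Summarize one column: size, distinct values, and values outside the expected set."""
    counts = Counter(values)
    unexpected = {value: n for value, n in counts.items() if value not in expected}
    return {"total": len(values), "distinct": len(counts), "unexpected": unexpected}


statuses = ["active", "active", "closed", "ACTIVE", None, "actve"]
report = profile_column(statuses, expected={"active", "closed"})
```

The "unexpected" entry is the list of things you did not think you had: a casing variant, a missing value, and a misspelling in this example.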

Once you know what's wrong, you can set about fixing it. But like a house that is in disrepair, you don't have to do everything at once. You may not want to correct some errors at all if they do not have a significant impact. Some mistakes in the data may be so fundamental that you simply cannot risk using it at all. Sometimes, a record may be incomplete but adding a placeholder--a standard substitute value--may be enough. And all the other errors you find, well, you will probably want to fix them.

The next two elements of data quality improvement programs go beyond finding and fixing what can be fixed. Many organizations have a boatload of redundant data--a single customer's name and address may be stored in numerous different databases. Those records should be consolidated. The more places data is stored, the greater the odds that inconsistencies will be introduced and inconsistencies inevitably lead to errors.

Finally, the data you have may not be sufficient to address your business needs. Good data quality improvement programs take steps to augment the existing corporate information. The more good information you have, the more value you can develop from it.


Modeling Issues and Entity Inheritance

By David Loshin

In our last set of posts, we looked at matching and record linkage and how approximate matching could be used to improve the organization's view of "customer centricity." Data quality tools such as parsing, standardization, and business-rule based record linkage and similarity scoring can help in assessing the similarity between two records. The result of the similarity analysis is a score that can be used to advise about the likelihood of two records referring to the same real-life individual or organization.

One last thought: this approach is largely a "data-centric" activity. What I mean is that it looks at and compares two records regardless of where those records came from. They might have come from the same data set (as part of a duplicate analysis) or from different data sets (for consolidation or general linkage).

But it does not take into consideration whether one data set models "customer" data and another models "employee" data. While you may link a customer record with an employee record based on a similarity analysis of a set of corresponding data attributes, the contexts are slightly different.

A match across the two data sets is a bit of a hybrid: we have matched the same individual, but one who is playing different roles. That introduces a different kind of question: are the identifying attributes associated with the "customer" or the individual acting in the role of "customer"? The same question applies for individual vs. employee.

And finally, are there attributes of the roles that each individual plays that can be used for unique identification within the role context? The answers to these questions become important when matching and linkage are integrated as part and parcel of a business application (such as the consolidation of data being imported into a business intelligence framework).
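
One way to picture the distinction is to keep the individual's identifying attributes separate from the attributes of each role. The following Python sketch assumes a simple model in which an individual carries a list of roles; the class names, fields, and sample values are hypothetical and are not taken from any particular data model.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerRole:
    customer_number: str    # identifies the individual only within the customer context


@dataclass
class EmployeeRole:
    employee_id: str        # identifies the individual only within the employee context


@dataclass
class Individual:
    """Attributes that identify the person regardless of the role being played."""
    name: str
    birth_date: str
    roles: list = field(default_factory=list)


def role_of(individual, role_type):
    """Return the individual's role of the given type, or None if there is none."""
    return next((r for r in individual.roles if isinstance(r, role_type)), None)


person = Individual("Pat Example", "1970-01-01")
person.roles.append(CustomerRole("CUST-1001"))
person.roles.append(EmployeeRole("EMP-042"))
```

With this separation, linking a customer record to an employee record means matching on the Individual attributes, while the role attributes remain available for identification within each context.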


It Takes a Team

By Elliot King

As the cliché has it, data is an organization's most valuable asset. But the question is--who guards those corporate jewels? Is it the IT staff that is charged with making sure the information infrastructure supports the business correctly? Is it the database developers and administrators who are the front-line data professionals? Is it the business users who need accurate data to make sure tasks are executed as anticipated? Or is it the executive staff, which is in the best position to have a bird's-eye view of the entire operation?

In practice, safeguarding data quality requires an interdisciplinary team approach, with different players coming from different parts of the organization. As with most teams, you need a team leader or program manager. This person is charged with supervising the entire data quality improvement program, recommending what resources are needed and where those resources should be invested.

In addition to the program manager, most data quality initiatives require a project leader, a person responsible for addressing specific data quality issues at hand. Each project team has at least three specific roles that need to be filled with representatives from the IT and business staffs.

The IT professionals must have the technical ability to fix what might be broken and the business personnel must serve as the subject matter experts, understanding the characteristics the data must have to get the job done. There should also be a data steward to set the policies, procedures and standards that improve data quality.

Finally, one last critical role must be filled--executive sponsorship. Those of you who are sports fans may have noticed that some teams are good year after year while others aren't. The difference is in the ownership (think of the Los Angeles Dodgers for a case study in good and bad ownership). A data quality improvement team cannot succeed without a strong commitment from the top.

Approximate Matching

By David Loshin

Actually, my first name is not David - that is really my middle name, but it is the given name my parents used when talking to me. This has actually led to a lot of confusion over the years, especially when confronted with a form asking for my "first name" and my "last name." For official forms (like my driver's license) I use my real first name as my "first name," but for non-official forms I often just use David. The result is that there is inconsistency in my own representation in records across different data systems.

If we were to rely solely on an exact data element-to-data element match of values to determine record duplication, the variation in use of my first or middle name would prevent two records from linking. In turn, you can extrapolate and see that any variations across systems of what should be the same values will prevent an exact match, leading to inadvertent duplication.

Fortunately, we can again rely on data quality techniques. We have our stand-bys of parsing and standardization, which can be enhanced through the use of transformation rules to map abbreviations, acronyms, and common misspellings to their standard representations - an example might be mapping "INC" and "INC." and "Inc" and "inc" and "inc." and "incorp" and "incorp." and "incorporated" all to a standard form of "Inc."
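
A transformation rule of this kind can be sketched as a simple lookup. The Python fragment below assumes the suffix is the last token of the name and that stripping a trailing period and lower-casing is enough to find the rule; the function name and the sample company name are made up.

```python
# Variants from the example above, keyed in lower case without the trailing period.
SUFFIX_RULES = {"inc": "Inc.", "incorp": "Inc.", "incorporated": "Inc."}


def standardize_suffix(name):
    """Map a trailing corporate suffix to its standard representation."""
    tokens = name.split()
    if not tokens:
        return name
    key = tokens[-1].rstrip(".").lower()
    if key in SUFFIX_RULES:
        tokens[-1] = SUFFIX_RULES[key]
    return " ".join(tokens)
```

For instance, "Acme INC", "Acme incorp." and "Acme Incorporated" all come back as "Acme Inc.", while a name with no known suffix is returned unchanged.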

We can add to this another tool: approximate matching. This matching technique allows for two values to be compared with a numeric score that indicates the degree to which the values are similar. An example might compare my last name "Loshin" with the word "lotion" and suggest that while the two values are not strict alphabetic matches, they do match phonetically.

There are a number of techniques used for approximate matching of values, such as comparing the set of characters, the number of transposed, inserted, or omitted letters, different kinds of forward and backward phonetic scoring, as well as other more complex algorithms.
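
As one illustration of these techniques, the Python sketch below computes the classic edit distance (the number of inserted, omitted, or substituted letters needed to turn one value into the other) and scales it into a similarity score. It is only one of the options listed above: a transposition costs two edits in this version, and phonetic scoring is not shown. The scaling by the longer length is an assumption chosen for simplicity.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching letter
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to a score between 0.0 and 1.0 (identical), ignoring case."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

On letters alone, "Loshin" and "lotion" are three edits apart and score 0.5, which shows why a purely character-based score is usually combined with the other techniques rather than used by itself.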

In turn, we can apply this approximate matching to the entire set of corresponding identifying attributes and weight each score based on the differentiation factor associated with each attribute. For example, a combination of first name and last name might provide greater differentiation than a birth date, since there is a relatively limited number of dates on which an individual can be born (maximum 366 per year).

By applying a weighted approximate match to pairs of records, we can finesse the occurrence of variations in the data element values that might prevent direct matching from working. More on this topic in future posts.
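
A minimal sketch of a weighted approximate match in Python follows. It uses the standard library's difflib.SequenceMatcher as a stand-in for whatever similarity function a data quality tool provides, and the attribute names, weights, and sample records are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical weights: the name attributes together count for more than the birth date.
WEIGHTS = {"first_name": 0.3, "last_name": 0.5, "birth_date": 0.2}


def field_similarity(a, b):
    """Score two values between 0.0 and 1.0; a missing value contributes nothing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(left, right, weights=WEIGHTS):
    """Weighted average of the per-attribute similarity scores."""
    total = sum(weights.values())
    score = sum(
        weight * field_similarity(left.get(name, ""), right.get(name, ""))
        for name, weight in weights.items()
    )
    return score / total


a = {"first_name": "John", "last_name": "Smith", "birth_date": "1975-03-14"}
b = {"first_name": "Jon", "last_name": "Smith", "birth_date": "1975-03-14"}
c = {"first_name": "Mary", "last_name": "Jones", "birth_date": "1982-11-02"}
```

In this example record_similarity(a, b) comes out well above record_similarity(a, c), even though a and b would fail an exact match on first name. Where to set the threshold between "match" and "no match" is a separate decision that the sketch does not address.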


Assessment is the Critical First Step

By Elliot King

W. Edwards Deming taught us long ago about the virtuous cycle of continual quality improvement--plan for change; execute the change; study the results and then take action to improve the process. But Deming's PDSA (plan, do, study, act) cycle is a generic approach. The cycle has to be modified and customized to address targeted areas for quality improvement.

The key steps in the virtuous cycle for data quality improvement are assessment, measurement, integration, improvement and management. Each process is important but assessment is the critical first step.

Data quality assessment is a multi-pronged exercise and the key is to start at the end. What business tasks and processes can be hurt by inaccurate, invalid and incomplete data? And in what ways will poor quality data increase costs, reduce revenues, hurt efficiencies or otherwise inflict pain on the organization? This exercise helps to identify the data sources that should be examined.

After you have determined where to look, you can profile your data to uncover anomalies and data flaws and then bring those flaws to the attention of the data users. In some cases, data anomalies may be harmless and have little impact on actual business activities. In that case, no remedial action is warranted. But when poor data quality does interfere with business operations then further action is needed.

The last piece of the assessment puzzle is to correlate the identified data issues with performance through a defined set of data quality business rules such as completeness, accuracy, and consistency. The rules provide a framework within which data quality can be measured.
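
To show what measuring against such rules might look like, here is a short Python sketch that scores a data set for completeness of one field and for consistency between two fields. The field names, the sample order records, and the specific consistency check are assumptions for illustration; the dates are ISO-formatted strings so that a plain string comparison orders them correctly.

```python
def completeness(records, field_name):
    """Share of records in which the field is populated."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field_name) not in (None, ""))
    return filled / len(records)


def consistency(records, check):
    """Share of records that satisfy a cross-field check."""
    if not records:
        return 0.0
    return sum(1 for r in records if check(r)) / len(records)


orders = [
    {"order_date": "2024-01-05", "ship_date": "2024-01-07", "email": "a@example.com"},
    {"order_date": "2024-01-09", "ship_date": "2024-01-08", "email": ""},
    {"order_date": "2024-02-01", "ship_date": "2024-02-03", "email": None},
    {"order_date": "2024-02-10", "ship_date": "2024-02-10", "email": "b@example.com"},
]

email_completeness = completeness(orders, "email")
date_consistency = consistency(orders, lambda r: r["ship_date"] >= r["order_date"])
```

Half of these records have an email address and three of the four have a ship date on or after the order date, so the scores are 0.5 and 0.75. Scores like these give a baseline that can be tracked as the data is improved.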

The rule of thumb with assessment is relatively easy. First determine where poor quality data will have the most impact within your organization. Then figure out if it has to be fixed.

The Challenge of Identifying Information

By David Loshin

In my last post, I introduced the question of determining which characteristics are used to uniquely differentiate between any pair of records within a data set. The same question is relevant when attempting to match a pair of records as well, once they are determined to represent the same entity. I like to call these "identifying attributes," and the values contained therein I call "identifying information."

Let's look at an example for customer data integration: what data element values do I compare when trying to link two records together? Let's start with the obvious ones, namely (ha ha) first and last names. Of course, we all know that there are certain names that are relatively common - just ask my friend John Smith, with whom I worked at one of my earlier jobs.

But even if you have an uncommon name, you might be surprised. For example, if you type in my name ("David Loshin") at Google, you will find entries for me, but you will also find entries for a dentist in Seattle and a professor.

Apparently, first and last names are not enough identifying information for distinction. Perhaps there is another attribute we can use? You probably know that I have written some books (see http://dataqualitybook.com), so maybe that is an additional attribute to be used. But if you go to Amazon and do a search for "David Loshin," you will find me, but it turns out the professor has also written a book.

Even an uncommon name such as mine still finds multiple hits, and while attempting to add more identifying information can reduce that number of hits, a poorly selected set of attributes may still not provide the right amount of distinction. It may take a number of iterations to review a proposed set of identifying attributes, determine their completeness, density, and accuracy before settling on a core set of identifying characteristics to be used for comparison.
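
One iteration of that review can be sketched in Python: for a proposed set of identifying attributes, measure how often all of the attributes are populated and how many records the combination singles out. The sample records, attribute names, and the two measures reported are hypothetical simplifications; accuracy, in particular, cannot be checked this way.

```python
from collections import Counter


def evaluate_attributes(records, attributes):
    """Report how complete a candidate attribute set is and how well it separates records."""
    keys = [tuple(r.get(a) for a in attributes) for r in records]
    complete = [k for k in keys if all(v not in (None, "") for v in k)]
    counts = Counter(complete)
    unique = sum(1 for k in complete if counts[k] == 1)
    return {
        "completeness": len(complete) / len(records),
        "uniquely_identified": unique / len(records),
    }


people = [
    {"name": "John Smith", "city": "Springfield", "birth_year": 1970},
    {"name": "John Smith", "city": "Riverton", "birth_year": 1970},
    {"name": "John Smith", "city": "Riverton", "birth_year": 1985},
    {"name": "Ann Lee", "city": None, "birth_year": 1990},
]

by_name = evaluate_attributes(people, ["name"])
by_all = evaluate_attributes(people, ["name", "city", "birth_year"])
```

Here the name alone is fully populated but distinguishes only one record in four, while the three attributes together distinguish every record in which they are all present. Comparing candidate sets this way is one input into settling on the core set.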

One more thing to think about, though. Once you get to the point where you are pretty confident that those attributes are enough for differentiation, there is one last monkey wrench in the works: even if you had the absolute set of identifying attributes, there is no guarantee that the values themselves are exact matches!
