February 2012 Archives

Achieving "Proactivity?"

By David Loshin

Standardizing the approaches and methods used for reviewing data errors, performing root cause analysis, and designing and applying corrective or remedial measures all help ratchet an organization's data quality maturity up a notch or two. This is particularly effective when fixing the processes that allow data errors to be introduced in the first place eliminates the errors altogether.

In cases where the root cause cannot feasibly be addressed, we still have another standardized approach: defining data validity rules that can be incorporated into probe points in the processes to monitor compliance with expectations and alert a data steward as early as possible when invalid data is recognized.
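As a minimal sketch of what such a probe point might look like, the Python below passes records through unchanged while checking each one against a list of validity rules and calling an alert hook on every violation. The names `ValidityRule`, `probe`, and `alert`, along with the two sample rules and the record fields, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

Record = dict  # assumed record shape: a plain dict of field name -> value


@dataclass(frozen=True)
class ValidityRule:
    name: str
    is_valid: Callable[[Record], bool]  # True means the record meets expectations


def probe(records: Iterable[Record],
          rules: list[ValidityRule],
          alert: Callable[[str, Record], None]) -> Iterator[Record]:
    """Pass records downstream unchanged, alerting on each rule violation."""
    for record in records:
        for rule in rules:
            if not rule.is_valid(record):
                alert(rule.name, record)
        yield record


# Illustrative rules only; real rules would come from the business expectations.
RULES = [
    ValidityRule("customer_id is present", lambda r: bool(r.get("customer_id"))),
    ValidityRule("order_total is non-negative", lambda r: r.get("order_total", 0) >= 0),
]

# Stand-in for notifying a data steward: collect alerts in a list.
alerts: list[tuple[str, Record]] = []
incoming = [
    {"customer_id": "C-100", "order_total": 25.0},
    {"customer_id": "", "order_total": -5.0},
]
passed = list(probe(incoming, RULES, lambda name, rec: alerts.append((name, rec))))
```

In this sketch the probe only observes and notifies; whether invalid records should also be held back is a separate policy decision.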

This certainly reduces the "reactive culture" I discussed in one of the previous posts. Governing the data stewardship activities by combining automated inspection tools such as data profiling, automated data correction and cleansing tools, and incident management also reduces replicated analysis efforts as well as repetitive fixes applied at different places and times. In fact, many organizations consider this level of maturity to be proactive data quality management, because you are anticipating the need to address issues that you already know about.

However, I might take a slightly contrarian view on this: to truly be proactive you'd have to go beyond anticipating what you know. In this light, we might say that instituting controls supporting inspection, monitoring, and notifications is less about being proactive and more about being reactive much earlier in the process.

To really be proactive, perhaps it might be more worthwhile to attempt to anticipate the types of errors that you don't already know about. Instead of only using profiling tools to look for existing patterns and errors, you might use these analytical tools to understand the methods and channels through which any type of potential error could occur and attempt to control the introduction of flawed data before it ever leads to any material impact!

Data Quality Incident Management

By David Loshin

The previous step in our transition from uncontrolled reactivity to being proactively engaged in managing data quality involved defining processes for identifying and evaluating data errors using standardized methods. Providing well-defined processes to data stewards and data quality analysts helps reduce any confusion around the appropriate steps to take when those commonly occurring data failures are discovered in process.

But in many cases there is still an issue of coordination. While standardizing the approaches to monitoring for data validity helps reduce the effort and complexity of analyzing and remediating issues, the same error may still impact multiple data consumers; if each data consumer reports the issue to a different data steward, you have many stewards investigating the same problem. So this is where our second suggestion comes in: instituting methods for coordinating those evaluations.

This is an area in which the data management world can learn lessons from our friends in desktop or network support, who rely on incident management systems for the reporting, tracking, and management of issues. Data consumers impacted by a data error can report the flaw in the incident management and tracking system, which can assign a unique identifier to the logged issue and then route it to a specific data steward.

Moreover, carefully structuring the ways that errors are described when they are reported provides hierarchies and organization that facilitate assignment of issues to those stewards with the greatest corresponding experience.

In other words, issues can be grouped to reduce the amount of replicated effort. In turn, an incident tracking system for data quality issues also provides entry points for tracking the status of the issues: whether the root cause has been identified, whether a correction has been performed, or whether further evaluation is under way.
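To make the idea concrete, here is a small Python sketch of an incident tracker, written under stated assumptions rather than modeled on any particular product. Reports are described by a structured (data set, error type) pair, matching reports are grouped under a single incident with one identifier, and each incident is routed to a steward and carries a status. The class names, the status values, and the rule of routing by data set are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count


class Status(Enum):
    REPORTED = "reported"
    ROOT_CAUSE_IDENTIFIED = "root cause identified"
    CORRECTED = "corrected"
    UNDER_EVALUATION = "under further evaluation"


@dataclass
class Incident:
    incident_id: int
    category: tuple[str, str]  # (data set, error type)
    steward: str
    status: Status = Status.REPORTED
    reporters: list[str] = field(default_factory=list)


class IncidentTracker:
    def __init__(self, stewards_by_dataset: dict[str, str], default_steward: str):
        self._stewards = stewards_by_dataset
        self._default = default_steward
        self._ids = count(1)  # source of unique incident identifiers
        self._open: dict[tuple[str, str], Incident] = {}

    def report(self, reporter: str, dataset: str, error_type: str) -> Incident:
        """Log a report; identical (dataset, error_type) reports share one incident."""
        key = (dataset, error_type)
        incident = self._open.get(key)
        if incident is None:
            # Assumed routing rule: the steward who owns the data set, else a default.
            steward = self._stewards.get(dataset, self._default)
            incident = Incident(next(self._ids), key, steward)
            self._open[key] = incident
        incident.reporters.append(reporter)
        return incident


tracker = IncidentTracker({"customer": "alice"}, default_steward="bob")
first = tracker.report("billing", "customer", "missing postal code")
second = tracker.report("marketing", "customer", "missing postal code")
third = tracker.report("finance", "orders", "negative total")
first.status = Status.ROOT_CAUSE_IDENTIFIED
```

Here the billing and marketing reports land on the same incident, so only one steward investigates, and both consumers can see its status change.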

In the next post: Using data quality rules as a proactive measure.


Reducing Data Quality Risks

By Elliot King

As Donald Rumsfeld, the former secretary of defense, once famously said, "there are known unknowns and there are unknown unknowns." In other words, some things we know we don't know, and consequently we can do the research to learn what we need to know. But other times, we don't even know what we don't know. Unknown unknowns present real risks, as Rumsfeld sadly learned.

Data quality can fall into both camps. Sometimes companies understand that the quality of their data varies and they have to assess its quality regularly. But in many cases, an organization is completely unaware of data problems. So how can you mitigate the risks of developing unknown data quality issues? The best solution is prevention.

The most visible cause of data corruption is poor data entry. If there are no rules defining how data is entered into your information systems, inconsistencies will inevitably fester.

For example, should name records be required to include Mr., Ms., Mrs., Miss, Dr. and so on? Who is a Ms., who is a Mrs., and who is a Miss? Should the records include those titles at all? Can the United States be entered in an address record as U.S., USA, U.S.A. or the United States? Those choices need to be defined and those definitions need to be enforced. The choices should also be rational. If you capture more information than you really need, you become more vulnerable to data quality issues.
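One way to define and enforce such choices at the point of entry is sketched below in Python. The canonical country form, the list of accepted variants, and the allowed titles are assumptions made for illustration; the actual choices belong to whoever owns the data definitions.

```python
# Hypothetical definitions; the actual choices are a business decision.
COUNTRY_CANONICAL = "United States"
COUNTRY_VARIANTS = {"us", "usa", "united states", "the united states"}
ALLOWED_TITLES = {"Mr.", "Ms.", "Mrs.", "Miss", "Dr."}


def _normalize(value: str) -> str:
    # Drop periods, collapse whitespace, and lowercase so "U.S.A." matches "usa".
    return " ".join(value.replace(".", "").split()).lower()


def standardize_country(value: str) -> str:
    """Map any accepted variant to the canonical form; reject anything else."""
    if _normalize(value) in COUNTRY_VARIANTS:
        return COUNTRY_CANONICAL
    raise ValueError(f"unrecognized country value: {value!r}")


def validate_title(value: str) -> str:
    """Return the title unchanged if it is on the allowed list; otherwise reject it."""
    if value in ALLOWED_TITLES:
        return value
    raise ValueError(f"title not allowed: {value!r}")
```

With these definitions, `standardize_country("U.S.A.")` and `standardize_country("the United States")` both return `"United States"`, while a misspelling such as `"Untied States"` is rejected at entry instead of being stored.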

Data decay is also a serious driver of data corruption. People move and they change their names. According to some estimates, name and address data can decay at the rate of two percent a month or 25 percent a year. That kind of decay occurs whether you track it or not.
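As a rough check on how a monthly rate relates to a yearly one, the short Python calculation below converts two percent a month into an annual figure in two ways: simple addition, and compounding on the records that are still accurate. The variable names are illustrative, and the two percent input is the estimate quoted above, not a measured value.

```python
MONTHLY_DECAY = 0.02  # the "two percent a month" estimate quoted above

# Simple view: twelve months of two percent each.
simple_annual = MONTHLY_DECAY * 12

# Compounded view: each month, two percent of the still-accurate records decay.
compound_annual = 1 - (1 - MONTHLY_DECAY) ** 12

print(f"simple: {simple_annual:.1%}, compounded: {compound_annual:.1%}")
```

The simple view gives 24 percent and the compounded view about 21.5 percent, so either way roughly a fifth to a quarter of the records go stale within a year.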

Other common sources of data corruption are data migration--when information from one system or application is moved to another; data mergers--when data is combined from different sources into a master file; and data consolidation--when companies attempt to eliminate redundant data.

Companies that are alert to the threats those kinds of processes pose to data quality can put safeguards into place to mitigate the problems that might arise. But those who are blind to the risks, risk being blindsided.


Standardizing Your Approach to Monitoring the Quality of Data

By David Loshin

In my last post, I suggested three techniques for maturing your organizational approach to data quality management. The first recommendation was defining processes for evaluating errors when they are identified. These types of processes actually involve a few key techniques:

1) An approach to specifying data validity rules that can be used to determine whether a data instance or record has an error. This is more of a discipline that can be guided by formal representations of business or data rules. Often metadata management tools and data profiling tools have repositories for capturing defined rules, leading to our next technique...

2) A method for applying those rules to data. This often will take advantage of the operational aspects of a data profiling or monitoring tool to validate a data instance against a set of rules. It may also incorporate parsing and standardization rules to identify known error patterns.

3) A means for reporting errors to a data analyst or steward. Some data analysis and profiling tools can be configured to automatically notify a data steward when a data validity rule is violated. In other situations, the results of applying the validation rule can be accumulated in a repository and a front-end reporting tool is used to provide visualization and notification of errors.

4) An inventory of actions to take when specific errors occur. As your team becomes more knowledgeable about the types of errors that can occur, you will also become accustomed to the methods employed for analysis and remediation.
In time, the repeated use of tools and the corresponding actions for remediation can be evolved into standardized methods, which can be documented, published, and used as the basis for training data quality analysts.
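The four techniques above can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not a description of any specific profiling tool: the `Rule` class, the two sample rules, the record fields, and the action texts are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable

Record = dict  # assumed record shape: a plain dict of field name -> value


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    check: Callable[[Record], bool]  # True means the record is valid


# 1) Rule specifications (illustrative only)
RULES = [
    Rule("R1", "email contains '@'", lambda r: "@" in r.get("email", "")),
    Rule("R2", "age is between 0 and 120", lambda r: 0 <= r.get("age", -1) <= 120),
]

# 4) Inventory of actions to take, keyed by the rule that was violated
ACTIONS = {
    "R1": "Return record to the source system owner for a corrected email",
    "R2": "Check the intake form for a date-of-birth parsing problem",
}


def validate(records: list[Record], rules: list[Rule]) -> list[tuple[int, str]]:
    """2) Apply each rule to each record; collect (record index, rule id) violations."""
    return [(i, rule.rule_id)
            for i, record in enumerate(records)
            for rule in rules
            if not rule.check(record)]


def report(violations: list[tuple[int, str]]) -> list[str]:
    """3) Summarize violations per rule alongside the documented action."""
    counts = Counter(rule_id for _, rule_id in violations)
    return [f"{rule_id}: {n} violation(s); action: {ACTIONS[rule_id]}"
            for rule_id, n in sorted(counts.items())]


records = [
    {"email": "pat@example.com", "age": 34},
    {"email": "no-at-sign", "age": 34},
    {"email": "lee@example.com", "age": 430},
]
violations = validate(records, RULES)
summary = report(violations)
```

Keeping the rules and the action inventory as plain data is a deliberate choice in this sketch: both can then be documented, published, and reviewed without reading the validation code.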


Reactivity vs. Proactivity

By David Loshin

In the past few months, we have looked at technical approaches to data quality and the use of data quality tools to parse, standardize, and cleanse data. In this next series of posts, it is time to look at harnessing the power of these tools and techniques to support a data quality management program. Most organizations are relatively immature when it comes to addressing data quality issues. Some typical behaviors in an immature organization include:

   · Few or no well-defined processes for evaluating the severity or root causes of data issues
   · Little or no coordination among those investigating data errors
   · Evaluating the same issues multiple times
   · Correcting the same errors multiple times

These are all manifestations of a more insidious problem: knee-jerk reactivity, which presumes that addressing the symptoms solves the problem. But in reality, applying these bandages to open wounds is merely a temporary fix. This suggests that incremental maturation of data quality processes involves transitioning from a reactive environment to one that operates within the context of a series of policies and controls.

The manifestations of immaturity listed here are some fertile areas for improvement, namely:

   · Defining processes for evaluating data errors when they are identified
   · Instituting methods for coordinating those evaluations
   · Applying corrections once, and only once.

As a byproduct of coordinating evaluation, your team will be less inclined to evaluate the same issues multiple times! In my next set of posts, we will look at ideas for each of these suggestions.
