Recently in Data Profiling Category

By Joseph Vertido

For many, data integration and data quality are separate concepts with nothing in common. In reality, when you combine them, they create a partnership that excels. Where data quality leaves off, data integration begins, and vice versa. A new product, Contact Zone, fuses these two concepts into one revolutionary solution at the point where data integration and data quality converge.

Data integration tools simplify data migration and data warehousing procedures, both of which are concerned with data management, i.e. keeping data organized. Data quality, on the other hand, is concerned primarily with understanding the nature and validity of the actual data, i.e. keeping data clean. Maintaining an organized database is not the same as keeping it clean; they are two different approaches to handling data. They can be combined, but should they be?

The short answer is yes.

In essence, data integration allows for the migration of data from a given source to a given destination. Typically, users take advantage of data integration to accomplish data warehousing initiatives - allowing for easy migration and manipulation of data, which ultimately leads to maximizing the efficiency of business intelligence and analytics.

However, Gartner states that "only 30 percent of business intelligence and data warehousing implementations fully succeed." Why? The top two reasons for failure are budget constraints and data quality. So, although the architectural constraints of building a data warehouse can be addressed by utilizing data integration tools, it still leaves the problem of poor data quality - something that most data integration tools handle with mediocrity at best.

That's where Contact Zone comes into play. It's a data integration tool optimized for data quality, allowing you to kill two birds with one stone.

Contact Zone connects to virtually any source, overcoming an obstacle our clients frequently encounter when implementing data quality: there is such a variety of database formats and platforms today that the types of environments and combinations can be overwhelming.

Whether you have an IBM DB2 database or PostgreSQL, Contact Zone allows for data integration with almost any database format, while making sure that all data is clean, correct, standardized, and valid.

By David Loshin

There are all sorts of tools associated with address standardization, cleansing, and validation. As an example, the USPS has a certification program for software vendors, referred to as CASS (Coding Accuracy Support System)™ certification. According to their website,

CASS enables the Postal Service™ to evaluate the accuracy of address matching software programs in the following areas:

(1) five-digit coding
(2) ZIP + 4/ delivery point (DP) coding
(3) carrier route coding
(4) DPV®
(5) DSF2®
(6) LACSLink®
(7) eLOT®
(8) RDI™ products

CASS allows vendors/mailers the opportunity to test their address-matching software packages and, after achieving a certain percentage of compliance, to be certified by the Postal Service. CASS does not measure the accuracy of ZIP + 4 delivery point, five-digit ZIP, or carrier route codes in a mailer's existing files. CASS enables mailers to measure and diagnose internally written, commercially-available, address-matching software packages. The effectiveness of service bureaus' matching software can also be measured.

There are many vendors selling CASS-certified tools and services. Organizations use CASS-certified tools for address standardization, correction, and validation. End of story, right?

Wrong. Some organizations use many CASS-certified tools for address standardization, correction, and validation, at different places along the processing stream. The addresses are standardized, cleansed, and validated (or not) multiple times. The addresses are changed from their original format, manipulated, and then shoved back into the databases, without considering the actual process dependencies or expectations.

And then you end up with a scenario like this: a process accepts customer applications, including their self-provided addresses, and sends hard copies of acknowledgements to those addresses, yet the process also includes an elaborate mechanism for managing returned mail. That did not make sense to me: if the organization sends out acknowledgements to the address the customer provided, wouldn't they trust that the customer provided an accurate address?

In fact, the issue was self-created: the provided address passed through at least three different iterations (with different products!) of standardization, correction, and validation, and was transformed from a deliverable ("accurate") address to an invalid one.

So even though the intent was appropriate, the execution of the process got in the way of the results. So I'll throw out two questions: First, is address standardization and validation a tool or a process? And second, at what point and how frequently in the business process flow should address standardization and validation take place?
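To make the failure mode concrete, here is a minimal sketch of how an address can drift when it passes through several tools that each apply their own rules. The three "tools" below are hypothetical toy functions invented for illustration; they are not how any CASS-certified product behaves, and the sample address is made up.

```python
import re


# Hypothetical toy "tools"; real address products are far more sophisticated.
def tool_a(address: str) -> str:
    """Uppercase the address and abbreviate street-type words."""
    out = address.upper()
    out = re.sub(r"\bSTREET\b", "ST", out)
    out = re.sub(r"\bAVENUE\b", "AVE", out)
    return out


def tool_b(address: str) -> str:
    """Expand 'ST' to 'SAINT', which is only right for place names."""
    return re.sub(r"\bST\b", "SAINT", address)


def tool_c(address: str) -> str:
    """Truncate to a 20-character field, as a fixed-width system might."""
    return address[:20]


def run_pipeline(address, tools):
    """Apply each tool in order and keep every intermediate result."""
    history = [address]
    for tool in tools:
        history.append(tool(history[-1]))
    return history


original = "12 Saint Marks Street Apt 4"
history = run_pipeline(original, [tool_a, tool_b, tool_c])
for step, value in enumerate(history):
    print(step, repr(value))
```

Each step looks reasonable in isolation, but the combined result has lost the apartment number and turned the street type into a second "SAINT". Keeping the intermediate history, as `run_pipeline` does, is one way to see which pass introduced the damage.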


Achieving "Proactivity?"

By David Loshin

Standardizing the approaches and methods used for reviewing data errors, performing root cause analysis, and designing and applying corrective or remedial measures all help ratchet an organization's data quality maturity up a notch or two. This is particularly effective when fixing the processes that allow data errors to be introduced in the first place eliminates the errors altogether.

In the cases where the root cause is not feasibly addressed, we still have another standardized approach: defining data validity rules that can be incorporated into probe points in the processes to monitor compliance with expectations and alert a data steward as early as possible when invalid data is recognized.

This certainly reduces the "reactive culture" I discussed in one of the previous posts, and governing the data stewardship activities by combining automated inspection tools such as data profiling, automated data correction and cleansing tools, and incident management reduces replicated analysis efforts as well as repetitive fixes applied at different places and times. In fact, many organizations consider this level of maturity as being proactive in data quality management because you are anticipating the need to address issues that you already know about.

However, I might take a little bit of a contrarian view on this: to truly be proactive you'd have to go beyond anticipating what you know. In this light, we might say that instituting controls supporting inspection, monitoring, and notifications is less about being proactive and more about being reactive much earlier in the process.

To really be proactive, perhaps it might be more worthwhile to attempt to anticipate the types of errors that you don't already know. Instead of only using profiling tools to look for existing patterns and errors, you might use these analytical tools to understand the methods and channels through which any types of potential errors could occur and attempt to control the introduction of flawed data before it ever leads to any material impact!

Standardizing Your Approach to Monitoring the Quality of Data

By David Loshin

In my last post, I suggested three techniques for maturing your organizational approach to data quality management. The first recommendation was defining processes for evaluating errors when they are identified. These types of processes actually involve a few key techniques:

1) An approach to specifying data validity rules that can be used to determine whether a data instance or record has an error. This is more of a discipline that can be guided by formal representations of business or data rules. Often metadata management tools and data profiling tools have repositories for capturing defined rules, leading to our next technique...

2) A method for applying those rules to data. This often will take advantage of the operational aspects of a data profiling or monitoring tool to validate a data instance against a set of rules. It may also incorporate parsing and standardization rules to identify known error patterns.

3) A means for reporting errors to a data analyst or steward. Some data analysis and profiling tools can be configured to automatically notify a data steward when a data validity rule is violated. In other situations, the results of applying the validation rule can be accumulated in a repository and a front-end reporting tool is used to provide visualization and notification of errors.

4) An inventory of actions to take when specific errors occur. As your team becomes more knowledgeable about the types of errors that can occur, you will also become accustomed to the methods employed for analysis and remediation.

In time, the repeated use of tools and the corresponding actions for remediation can be evolved into standardized methods, which can be documented, published, and used as the basis for training data quality analysts.
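The four techniques above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the `Rule` class, the two sample rules, the field names `zip` and `email`, and the remediation text are all hypothetical, and a real data profiling or monitoring tool would supply its own rule repository and notification mechanism.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    is_valid: Callable[[dict], bool]  # True when the record passes
    action: str                       # remediation step from the action inventory


# 1) Specify validity rules (hypothetical examples).
RULES = [
    Rule("zip_is_5_digits",
         lambda r: re.fullmatch(r"[0-9]{5}", str(r.get("zip", ""))) is not None,
         "send record to address correction"),
    Rule("email_present",
         lambda r: bool(r.get("email")),
         "ask the customer for an email address"),
]


# 2) Apply the rules to each record and collect (index, rule) violations.
def validate(records, rules=RULES):
    return [(index, rule) for index, record in enumerate(records)
            for rule in rules if not rule.is_valid(record)]


# 3) Report violations; a printed line stands in for a steward notification.
def report(violations):
    counts = Counter(rule.name for _, rule in violations)
    for index, rule in violations:
        # 4) Show the documented action for this kind of error.
        print(f"record {index}: {rule.name} failed -> {rule.action}")
    return counts


records = [
    {"zip": "92688", "email": "a@example.com"},
    {"zip": "9268", "email": "b@example.com"},
    {"zip": "ABCDE", "email": ""},
]
summary = report(validate(records))
```

Keeping the remediation action next to the rule definition is one simple way to build the inventory of actions as the rules themselves accumulate.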


The Role of Data Profiling in Data Quality Assessment

By Elliot King

After "sustainability," perhaps the biggest buzzword flying around many corners of the corporate world these days is assessment. It seems people can't breathe without somebody wanting to assess the quality of the air, the efficiency of their lungs, and, of course, the outcome of the breath.

But just because something is a buzzword doesn't mean it is a bad thing. So how do you go about assessing the quality of your data? With a tip of the hat to the idea that you have to know your starting point before you can map a path to a finishing line, the first step in many data quality programs is data profiling.

Without getting too technical, the data profiling process generates and collects descriptive statistics describing data such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, and variation as well as aggregate statistics such as sum and count. These statistics are analyzed to reveal and validate data patterns and formats, uncover duplicate data from different data sources, identify missing data and confirm that data values are valid.
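As a rough illustration of what such statistics look like for a single column, here is a minimal sketch using only the Python standard library. The function name `profile_column` and the sample `ages` values are made up for this example, and the sketch assumes the column holds at least two non-missing numeric values; a real profiling tool computes far more than this.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Return basic descriptive statistics for one numeric column.

    Missing values are represented as None. Assumes at least two
    non-missing numeric values, since stdev needs two data points.
    """
    present = [v for v in values if v is not None]
    frequency = Counter(present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(frequency),
        "min": min(present),
        "max": max(present),
        "sum": sum(present),
        "mean": statistics.mean(present),
        "mode": statistics.mode(present),
        "stdev": statistics.stdev(present),
        "most_common": frequency.most_common(3),
    }


# Hypothetical sample: one missing value and one suspicious outlier (230).
ages = [34, 29, None, 41, 29, 230, 38]
profile = profile_column(ages)
for name, value in profile.items():
    print(f"{name}: {value}")
```

Even in this tiny sample, the profile surfaces a missing value, a repeated value, and a maximum that is implausible for an age, which are the kinds of findings a data quality assessment would follow up on.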

Data profiling describes data in a way in which the data's strengths and weaknesses become apparent and the accuracy and completeness of the data can be determined. Based on that assessment, remedial data quality improvement programs can be launched.

Moreover, during the last couple of years, data profiling has gotten a lot of attention for the role it can play in master data management programs designed to ensure the consistency of key non-transactional reference data used across the enterprise.

In the long run, data profiling can be used both tactically and strategically. Tactically, it can serve as an integral part of data improvement programs. Strategically, it can help managers determine the appropriateness of different data source systems under consideration for deployment in a particular project.
