Recently in Data Quality Assessment Category

Better Marketing Starts with Better Data


Improve Data Quality for More Accurate Analysis with Alteryx and Melissa


Organizations are under more pressure than ever to obtain accurate contact data for their customers. When your consumer base ranges from Los Angeles to Tokyo, that can be challenging. Poor data quality has a critical impact on both the financial stability and the operations of a business. Verifying and maintaining vast quantities of accurate contact data is often inefficient and falls short of the mark. According to IBM, the yearly cost of poor data quality is estimated at $3.1 trillion in the U.S. alone.


Melissa's Global Address Verification and Predictive Analysis for Alteryx are the tools your business needs to grow. Download this whitepaper to find out how to achieve marketing success, while reducing the cost of doing business overall.


Learn how to:

  • Better understand and utilize your big data for marketing success
  • Build better relationships with customers through clean data
  • Target the customers most likely to buy
  • Cut down on undeliverable mail and save on costs


Download free whitepaper now:


Data Quality Dimensions Can Raise Practical Concerns

By Elliot King

As everybody knows, data quality is usually measured along seven dimensions--the four Cs of completeness, coverage, consistency, and conformity plus timeliness, accuracy and duplication. And the general method to judge data quality is to establish a standard for each of these dimensions and measure how much of the data meets these standards.

For example, how many records are complete; that is, how many of your records contain all of the essential information that the standard you established requires them to hold? Or how much of your data is accurate; that is, do the values in the records actually reflect something in the real world?

As Malcolm Chisholm pointed out in a series of posts not long ago, conceptualizing data quality as a set of dimensions may be misleading, or at least not that useful. The argument is both philosophical and practical, and while philosophers can debate the relationship of an abstraction to the real world, the practical concerns about the dimensions of data quality raise interesting questions.

The real issue is this--as they are currently conceptualized, are data quality dimensions too abstract; do they actually reveal something real, meaningful and useful about the data itself? And does measuring data according to those standards--i.e. establishing their quality-- lead to useful directions to improve business processes?

For example, the International Association for Information and Data Quality defines timeliness as "a characteristic of information quality measuring the degree to which data is available when knowledge workers or processes require it."

Obviously, the sense of timeliness in that definition reflects more on the ability to get at data when it is needed than on any quality of the data itself. However, timeliness of the data also could reflect on how up to date the data is.

Do records contain the most current information? Timeliness in that sense could also be subsumed under the idea of accuracy: if the information is not up to date, perhaps it is simply inaccurate. Looked at through another lens, however, even if the data is not timely--that is, not up to date--perhaps the record is not inaccurate per se, but merely incomplete.

Clearly, the assessment of quality according to individual dimensions is a tricky business. The dimensions can overlap, and when used without caution they can lead to more confusion than clarity.

More About Data Quality Assessment

By David Loshin

In our last series of blog entries, I shared some thoughts about data quality assessment and the use of data profiling techniques for analyzing how column value distribution and population corresponded to expectations for data quality. Reviewing the frequency distribution allowed an analyst to draw conclusions about column value completeness, the validity of data values, and compliance with defined constraints on a column-by-column basis.

However, data quality measurement and assessment goes beyond validation of column values, and some of the more interesting data quality standards and policies apply across a set of data attributes within the same record, across sets of values mapped between columns, or relationships of values that cross data set or table boundaries.

Data profiling tools can be used to assess these types of data quality standards in two ways. One approach is more of an undirected discovery of potential dependencies that are inherent in the data, while the other seeks to apply defined validity rules and identify violations. The first approach relies on some algorithmic complexity that I would like to address in a future blog series, and instead in the upcoming set of posts we will focus on the second approach.

To frame the discussion, let's agree on a simple concept regarding a data quality rule and its use for validation, and we will focus specifically on those rules applied to a data instance, such as a record in a database or a row in a table. A data instance quality rule defines an assertion about each data instance that must be true if the rule is observed. If the assertion evaluates to be not true, the record or table row is in violation of the rule.

For example, a data quality rule might specify that the END_DATE field must be later in time than the BEGIN_DATE field, and that means that for each record, verifying observance of the rule means comparing the two date fields and making sure that the END_DATE field is later in time than the BEGIN_DATE field.
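A minimal sketch of such a record-level check (the field names follow the example above; the sample records are invented for illustration):

```python
from datetime import date

# Hypothetical record-level rule: END_DATE must be later than BEGIN_DATE.
def end_after_begin(record):
    """Return True if the record observes the rule."""
    return record["END_DATE"] > record["BEGIN_DATE"]

records = [
    {"ID": 1, "BEGIN_DATE": date(2020, 1, 1), "END_DATE": date(2020, 6, 30)},
    {"ID": 2, "BEGIN_DATE": date(2021, 5, 1), "END_DATE": date(2021, 2, 1)},  # violation
]

# Collect the IDs of records whose assertion evaluates to not true.
violations = [r["ID"] for r in records if not end_after_begin(r)]
print(violations)  # [2]
```

Each rule is simply a predicate evaluated per record; any record for which the predicate is false is flagged as a violation.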

This all seems pretty obvious, of course, and we can use data profiling tools to both capture and apply the validation of the rules to provide an assessment of observance. In the next set of posts we will focus on the definition and application of cross-column and cross-table data quality rules.


Data Quality Assessment: Value Domain Compliance

By David Loshin

To continue the review of techniques for using column value analysis for assessing data quality, we can build on a concept I brought up in my last post about format and pattern analysis and the reasonableness of data values, namely whether the set of values that appear in the column complies with the set of allowable data values altogether.

Many business applications rely on well-defined reference data sets, especially for contact information and product data. These reference data sets are often managed as master data, with the values enumerated in a shared space. For example, a conceptual data domain for the states of the United States can be represented using an enumerated list of 2-character codes as provided by the United States Postal Service (USPS).

That list establishes a set of valid values, which can be used for verification for any dataset column that is supposed to use that format to represent states of the United States.

A good data profiling tool can be configured to perform yet another column analysis that verifies that each value that appears in the column coincides with one of those in the enumerated master reference set. After the values have been scanned and their number of occurrences tallied, the set of unique values can be traversed and each value compared against the reference set.

Any values that appear in the column that do not appear in the reference set can be culled out as potential issues to be reviewed with the business subject matter expert.
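As a rough sketch of that comparison, assuming a small illustrative subset of the USPS codes rather than the full reference list:

```python
from collections import Counter

# Illustrative subset of the USPS 2-character state codes (not the full list).
VALID_STATE_CODES = {"CA", "NY", "TX", "MD", "WA"}

# Hypothetical column values, including a lowercase entry and an unknown code.
column_values = ["CA", "NY", "ca", "TX", "XX", "NY", "MD"]

# Tally occurrences, then compare each unique value against the reference set.
tallies = Counter(column_values)
potential_issues = sorted(v for v in tallies if v not in VALID_STATE_CODES)
print(potential_issues)  # ['XX', 'ca'] -- flagged for SME review
```

Note that "ca" is flagged even though it names a real state: domain compliance is exact-match against the enumerated values, which is why casing and formatting issues surface here.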

In this blog series, we have looked at a number of ways that column value scanning and frequency analysis can be used as part of an objective review of potential data issues.

In a future series, we will look more closely at why these types of issues occur as well as methods for logging the issues with enough context and explanation to share with the business users and solicit their input for determination of severity and prioritization for remediation.


Data Quality Assessment: Value and Pattern Frequency

By David Loshin

Once we have started our data quality assessment process by performing column value analysis, we can reach out beyond the scope of the types of null value analysis we discussed in the previous blog post. Since our column analysis effectively tallies the number of each value that appears in the column, we can use this frequency distribution of values to identify additional potential data flaws by considering a number of different aspects of value frequency (as well as lexicographic ordering), including:

  • Range Analysis, which orders the values to determine whether they are constrained within a well-defined range.

  • Cardinality Analysis, which analyzes the number of distinct values that appear within the column to help determine if the values that actually appear are reasonable for what the users expect to see.

  • Uniqueness, which indicates if each of the values assigned to the attribute is used once and only once within the column, helping the analyst to determine if the field is (or can be used as) a key.

  • Value Distribution, which presents an ordering of the relative frequency (count and percentage) of the assignment of distinct values. Reviewing this enumeration and distribution alerts the analyst to any outlier values, either ones that appear more than expected, or invalid values that appear few times and are the result of finger flubs or other data entry errors.
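A minimal sketch of how several of these analyses fall out of a single frequency distribution (the column values are invented for illustration):

```python
from collections import Counter

# Hypothetical column of order quantities.
column = [1, 2, 2, 3, 3, 3, 99]

freq = Counter(column)

value_range = (min(freq), max(freq))             # range analysis
cardinality = len(freq)                          # number of distinct values
is_unique = all(n == 1 for n in freq.values())   # candidate-key test

# Value distribution: relative frequency of each distinct value.
total = sum(freq.values())
distribution = {v: freq[v] / total for v in sorted(freq)}

print(value_range, cardinality, is_unique)  # (1, 99) 4 False
```

Here the outlier 99 shows up immediately in the range, and the distribution makes low-frequency values easy to spot for review.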
In addition, good data profiling tools can abstract the value strings by mapping the different character types such as alphabetic, digits, or special characters to a reduced representation of different patterns. For example, a telephone number like "(301) 754-6350" can be represented as "(DDD) DDD-DDDD." Once the abstract patterns are created, they can also be subjected to frequency analysis, enabling assessments such as:

  • Format and/or pattern analysis, which involves inspecting representational alphanumeric patterns and formats of values and reviewing the frequency of each to determine if the value patterns are reasonable and correct.

  • Expected frequency analysis, which reviews those columns whose values are expected to reflect certain frequency distributions and validates compliance with the expected patterns.
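The pattern abstraction described above can be sketched as a simple character-class mapping (one possible implementation, not the behavior of any particular profiling tool):

```python
def abstract_pattern(value):
    """Map digits to 'D' and alphabetic characters to 'A'; keep punctuation."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

print(abstract_pattern("(301) 754-6350"))  # (DDD) DDD-DDDD
```

Running every value in a column through this mapping and tallying the resulting patterns quickly reveals whether, say, most phone numbers share one format while a handful deviate.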
Recall again that the identification of potential issues can only be verified as a problem by review with a business process subject matter expert.

Data Quality Assessment: Sparsity and Nullness

By David Loshin

The first set of data quality assessment techniques that use column value frequency analysis focuses on the relationship of the population of values to the business processes that consume the data. The intent is to understand how the relative population of the column is associated with defined (or implicit) business rules, and then isolate and validate those rules.

Accepted rules can be integrated into operational systems as controls for validation when the data is created or acquired, thereby reducing the potential downstream negative impacts of data flaws.

Sparseness analysis identifies columns that are infrequently populated. Null analysis is used to identify potential default null values that are not database system nulls. These values crop up all the time, and I am sure you have seen them.

Some examples include: "None," "N/A," "X," "XX," "9999999," etc. These default null values are important to identify for a number of reasons. The existence of default nulls like those examples indicates a flaw in the system that forces the provision of a value at the time of creation even when no value can be provided.
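A minimal sketch of default-null detection and a sparseness tally, using a hypothetical token list drawn from the examples above:

```python
# Hypothetical list of default-null tokens drawn from the examples above.
DEFAULT_NULLS = {"None", "N/A", "X", "XX", "9999999"}

# Hypothetical phone-number column mixing real values, default nulls,
# and a true system null.
column = ["555-1234", "N/A", None, "N/A", "9999999", "555-9876"]

default_null_count = sum(1 for v in column if v in DEFAULT_NULLS)
system_null_count = sum(1 for v in column if v is None)

# Sparseness: the fraction of rows carrying an apparently valid value.
populated = len(column) - default_null_count - system_null_count
sparseness_ratio = populated / len(column)

print(default_null_count, system_null_count, sparseness_ratio)
```

Separating default nulls from system nulls matters because only the former point to a forced-entry flaw upstream; the combined count feeds the sparseness figure.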

We have seen systems that insist on the user entering data into required fields even when there are no values that could be correct, such as a "home phone number" for a person who only uses a mobile phone, or a "first name" field for a corporate customer.

In these cases, the user is forced to provide some value, so determining the existence of this forced data entry helps the analysts isolate where in the information flow that dependency exists so that it can be resolved more effectively. Identifying default nulls is also beneficial when one can determine any consistency or patterns in their use.

It is one thing to see a multitude of default null values, but very different to find that "N/A" is used consistently, even when the data originates from different sources and different people. In addition, considering the different types of null patterns helps in the analysis of sparseness to see how frequently the column has a valid value at all. Regarding sparseness, there is an implicit presumption that sparsely populated columns are unused or are devolving into unused data elements.

The absence of data can indicate a data quality issue, especially as more organizations seek to increase their data collection and retention. However, recall that the objective analysis only indicates potential issues. These must be put in context by reviewing the potential issues with business subject matter experts.

Ask First, Fix Later

By Elliot King

Like the Boston Red Sox breaking their fans' hearts, almost inevitably (stress on the almost) you will discover that some percentage of your data is wrong. The realization that you have data quality problems may come about for a few reasons: 1) you've looked under the hood of your data systems by conducting a data assessment, or 2) a data audit revealed that the data you have is not what you think you have.

Or a problem may have percolated to the surface. Perhaps a direct mail campaign failed to yield the anticipated results or customer service representatives find themselves with incorrect information during critical interactions. So what do you do then?

With most rude awakenings, people want to act right away. After all, the data is broken, so let's get it fixed. With data quality, however, the impulse to act immediately may be a mistake. Indeed, the first question to ask is, does it really matter? The sad fact is that we live in a world of inaccurate and incomplete data.

Data sets will never be perfect. Inaccurate data may have little or no impact on ongoing processes and the investment required to remediate the data may be more than the return better data will provide. Identifying the impact of the data quality is essential. Have the problems resulted in lost revenue? Has customer service been compromised? Have the issues driven up costs? And so on.

Once the impact of the problem has been isolated, the next step is to better understand the nature and scope of the problem. What are the processes through which incorrect or poor data is entering the system? As most data professionals know, often data problems have more ways into your system than a freeway has on-ramps. Can the sources of incorrect data even be fixed? If they can, how much investment will be required and how much improvement can be expected? Finally, what will be the expected return on investment?

Though it seems a little counter-intuitive and perhaps even a little uncomfortable, the first step after data quality issues are discovered is to think. You may not want to act at all.

Data Quality Assessment: Column Value Analysis

By David Loshin

In recent blog series, I have shared some thoughts about methods used for data quality and data correction/cleansing. This month, I'd like to share some thoughts about data quality assessment, and the techniques that analysts use to review potential anomalies that present themselves.

The place to start, though, is not with the assessment task per se, but with the context in which the data quality analyst will find him/herself when asked to identify potential data quality flaws. The challenge is in the interpretation of the goal: an objective assessment is intended to identify data errors and flaws, but when the task is handed off to a technical data practitioner outside of the context of business needs, the review can become more of a fishing expedition than a true analysis.

What I mean here is that an undirected approach to data quality assessment is likely to expose numerous potential issues, and without some scoping as to which potential issues are or are not relevant to specific business processes, a lot of time may be spent on wild goose chases to fix issues that are not really problems.

With that caveat, though, we can start to look at some data quality assessment methods, starting with one particular aspect of data profiling: column value analysis. The idea is that reviewing all of the values in a specific column along with their corresponding frequencies will expose situations in which values vary from what they should be. Most column analysis centers on value frequency. In essence, the technical approach is to scan all the values in a column, tally their frequencies, and then present the tallies to the analyst, ordered by frequency or lexicographically.
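A minimal sketch of that scan-and-tally approach (the column values are invented for illustration):

```python
from collections import Counter

# Hypothetical column of city names, including one likely typo.
column = ["Boston", "Boston", "Bostn", "Chicago", "Chicago", "Chicago"]

# Scan the column and tally the frequency of each distinct value.
tallies = Counter(column)

# Present the same tallies both ways for the analyst to review.
by_frequency = tallies.most_common()   # highest count first
by_value = sorted(tallies.items())     # lexicographic order

print(by_frequency)  # [('Chicago', 3), ('Boston', 2), ('Bostn', 1)]
print(by_value)      # [('Bostn', 1), ('Boston', 2), ('Chicago', 3)]
```

In the frequency ordering, the low-count "Bostn" stands out as a likely data entry error; in the lexicographic ordering, it lands right next to "Boston," which is what makes the near-duplicate visible.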

These two orderings enable the lion's share of the analysis, yet many people don't realize that the analysis itself must be driven by the practitioner within the context of the expectation. Over the next three postings in this series, we will look at three different ways to assess quality through reviewing the enumeration of column values and their relative frequency.

Customer Centricity and Birds of a Feather

By David Loshin

Why do we care to establish physical locations for individuals? One reason should be patently obvious: in every interaction between a staff member from your company and a customer, both parties are always physically located somewhere, and many business performance indicators are pinned to a location dimension, such as "sales," "customer complaints," or "product distribution" by region.

Location is meaningful when it comes to analyzing customer behavior. The old adage "birds of a feather flock together" can be adapted to customer centricity. Simply put, individuals tend to congregate in areas populated by others with similar characteristics and interests. People interested in boating and fishing are more likely to live near the water than the mountains.

Wealthy people live in wealthy neighborhoods. In fact, geo-demographic segmentation has been around for a long time, in which the primary demographic characteristics of individuals living in different geographic regions are analyzed and then used to provide descriptive segmentation in relation to where each individual lives.

The value of geo-demographic segmentation is that once you can link a location to the individual, you have a starting point for marketing strategies especially along the traditional media channels such as television, radio, print advertising, or even highway billboards. For example, a luxury car manufacturer might want to situate a highway billboard advertisement close to the exit nearest the wealthy neighborhoods.

Contact information is traditionally used for determining location, but there are two factors that are poised to upend this reliable apple cart. The first, which we started to look at last time, is the encroaching inaccurate use of previously reliable location data sources such as area codes.

As individuals transition away from landline telephones to mobile or virtual phones (whose "area" codes are irrelevant and untrustworthy for location specification), the usability of that data diminishes as well.

The second factor is a transition in the way that people communicate among themselves. In today's connected world, people are as likely (if not more so!) to interact virtually via the internet than by telephone.

Email, instant messages, Facebook, Twitter, LinkedIn, Foursquare--these are just a few examples of the channels within which individuals with similar interests hang together. But what does this imply when it comes to location analysis and customer centricity?


Get it Right the First Time

By Elliot King

People generally think of data quality as a remedial exercise. During the ongoing course of business, for a variety of reasons, companies find themselves with incorrect data. The goal of a data quality program is to identify the incorrect data and fix it.

And while data errors inevitably do occur, an essential element of a data quality program is putting technology and processes in place that will ensure as much as possible that the data captured initially is correct. It stands to reason that the higher the quality of data at the front end, the less extensive the remediation will have to be later on.

Precise and comprehensive business rules can play an important role in protecting data quality both as data is captured and as data is used. Broadly speaking, in building a database, data quality business rules can be classified into four categories--rules that describe how a business object is identified; rules that describe the specific attributes of a business object; rules that control the various relationships among business objects; and rules that define the validity of the data. A business object can be thought of as a collection of data points that form a complete unit of information--a customer record for example.

Each of those broad categories contains different possibilities that must be defined. For example, each business object must have a unique identifier. That identifier can be a newly generated number, such as a purchase order number or a customer identification number, or it can consist of a number of data points in a record, such as name and telephone number. The key is for the identifier to be unique.

The relationships between business objects must be set. For example, a professional baseball player can be associated with only one team at a time, but a team can be associated with many players. Valid values for data have to be determined. Is the "year" value in a date two digits or four digits? Y2K is an excellent example of how significant that rule can be.
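A minimal sketch of how identification and validity rules from the categories above might be applied (field names and records are invented for illustration):

```python
# Hypothetical player records, one with a duplicate id and one with a 2-digit year.
players = [
    {"player_id": 101, "name": "A. Jones", "team": "Red Sox", "year": 2024},
    {"player_id": 102, "name": "B. Smith", "team": "Yankees", "year": 98},
    {"player_id": 101, "name": "C. Lee", "team": "Cubs", "year": 2024},
]

# Identification rule: player_id must be unique across the table.
seen, duplicate_ids = set(), []
for p in players:
    if p["player_id"] in seen:
        duplicate_ids.append(p["player_id"])
    seen.add(p["player_id"])

# Validity rule: the year value must be four digits.
bad_years = [p["player_id"] for p in players if len(str(p["year"])) != 4]

print(duplicate_ids, bad_years)  # [101] [102]
```

Applying such rules at capture time, rather than in later remediation, is exactly the front-end protection the post describes.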

Carefully constructing the business rules for your data will ensure that you know the information you have and assist you in applying it correctly. Unfortunately, too often the development of business rules is a black-box operation implemented by software. When people do not know the business rules defining their data, mistakes happen.