December 2011 Archives

What is a Data Steward and Do You Need One?

By Elliot King

The metaphor of "ownership" has become popular in organizations and their IT shops. Companies have "application owners" and projects that are "owned" by this or that group. So that raises the question, who "owns" your data?

The right answer is that nobody "owns" the data. Data is a resource that must be shared across an organization. Data flows from the point of creation--perhaps capturing contact information on a website or importing a third-party mailing list--through staging, consumption, storage and archiving. At each step of the way, a different functional group within an organization has to be able to use the data in different ways.

To ensure that data meets the standards needed by each stakeholder in the data lifecycle, companies have to implement enterprise-wide data management policies and procedures. A typical policy might say that all contact information must conform to a specific format. Don't assume that is already the case in your organization. Unmonitored, your sales department, service organization and billing department could easily capture names differently. Indeed, in larger corporations, different sales organizations might have different formats for names and addresses.

Data stewards both develop those policies and create mechanisms to ensure that the policies are enforced. In turn, the data steward should be accountable for enterprise data quality and serve as the advocate for data quality initiatives.

Data stewardship is neither an easy job nor an easy job to fill. The foundational technical skill is a deep understanding of specific business functions, the data associated with those functions and the processes that rely on the data.

Those technical skills have to be coupled with a strong set of interpersonal skills as, by definition, data stewardship requires interacting with a wide range of stakeholders (often including other data stewards). Finally, regardless of the formal position they hold, data stewards need to be able to establish their authority as the role sometimes calls for stepping on other people's toes.

Stewardship is quite different from ownership. But if your organization has data, it probably needs a data steward.

By David Loshin

What I have found to be the most interesting byproduct of record linkage is the ability to infer explicit facts about individuals that are obfuscated as a result of distribution of data. As an example, consider these records, taken from different data sets:

A:
David
Loshin
301-754-6350
1163 Kersey Rd
Silver Spring
MD
20902

B:
Knowledge Integrity, Inc
1163 Kersey Rd
Silver Spring
MD
20902

C:
H David
Lotion
1163 Kersey Rd
Silver Spring
MD
20902

D:
Knowledge Integrity, Inc.
301
7546350
7546351
MD
20902

We could establish a relationship between record A and records B and C because they share the same street address. We could establish a relationship between record B and record D because the company names are the same.

Therefore, by transitivity, we can infer a relationship between "David Loshin" and the company "Knowledge Integrity, Inc" (A links to B, B links to D, therefore A links to D). However, none of these records alone explicitly shows the relationship between "David Loshin" and "Knowledge Integrity, Inc" - that is inferred knowledge.
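To make the transitive step concrete, here is a minimal Python sketch that groups the four records above. It assumes two linking rules (same street address, same company name) and a simple normalization that ignores case and punctuation so that "Inc" and "Inc." compare equal. The field names and the union-find helper are illustrative choices, not part of any particular record linkage product.

```python
from collections import defaultdict

# Records A-D from the example, reduced to the fields used for linking.
records = {
    "A": {"name": "David Loshin", "street": "1163 Kersey Rd"},
    "B": {"company": "Knowledge Integrity, Inc", "street": "1163 Kersey Rd"},
    "C": {"name": "H David Lotion", "street": "1163 Kersey Rd"},
    "D": {"company": "Knowledge Integrity, Inc."},
}

def normalize(value):
    """Lowercase and drop punctuation (an assumed, deliberately simple rule)."""
    return "".join(ch for ch in value.lower() if ch.isalnum() or ch == " ").strip()

# Union-find: every record starts in its own group.
parent = {rid: rid for rid in records}

def find(rid):
    """Return the representative record of rid's group."""
    while parent[rid] != rid:
        parent[rid] = parent[parent[rid]]  # path halving
        rid = parent[rid]
    return rid

def union(a, b):
    """Merge the groups containing a and b."""
    parent[find(a)] = find(b)

# Direct links: two records that share a street or a company name.
for field in ("street", "company"):
    first_seen = {}
    for rid, rec in records.items():
        if field not in rec:
            continue
        key = normalize(rec[field])
        if key in first_seen:
            union(rid, first_seen[key])
        else:
            first_seen[key] = rid

# Transitive closure falls out of the grouping.
clusters = defaultdict(list)
for rid in records:
    clusters[find(rid)].append(rid)
linked_groups = sorted(sorted(group) for group in clusters.values())
print(linked_groups)
```

Record D never shares a field with record A directly; it ends up in the same group only because both link to record B.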

You can probably see the opportunity here - basically, by merging a number of data sets together, you can enrich all the records as a byproduct of exposed transitive relationships.

This gives us one more valuable type of enhancement that record linkage provides. And it is particularly valuable, since the exposure of embedded knowledge can in turn contribute to our other enhancement techniques for cleansing, enrichment, and merge/purge.

The Ethics of Data Quality

By Elliot King


Technical people often don't seem too interested in ethical issues related to their work. Discussions of right and wrong are often "squishy." Too frequently, such discussions have no clear answers, and the answer can change from one context to another. In contrast, technical people like to deal with facts. They like clear outcomes--it worked or it didn't work--without any value judgments attached.

Perhaps the most profound ethical discussion associated with a significant technical advancement was the one scientists engaged in when they developed the atomic bomb. Was it right to contribute to the building of the most destructive weapon in human history--a weapon that could destroy the earth? The argument that the atomic bomb was the inevitable result of technical advances is just not compelling or satisfactory.

While the questions are certainly not as momentous as those raised by the atomic bomb or debated in bioethics, data quality professionals, like it or not, face ethical questions every day. These questions revolve around privacy, data integrity, security, retention, access and so on.

Take the issue of privacy, for example. As we know, in industries ranging from health care to finance, there is a slew of legal standards that companies have to meet. But beyond that, companies must decide exactly what data they collect about their customers or clients, and why. Should your company routinely collect and store social security numbers, for example? If so, why? The question is not just one of legal liability but of the ethics of putting your customers at unnecessary risk.

Similar sorts of ethical questions can be raised around data retention policies. Once again there are legal restrictions--in certain fields, records must be retained for legally determined periods of time--but there are also questions of right and wrong. Beyond your legal obligations, how much of your data should be retained for what period of time and why?

Incorporating the idea of ethics into your data management decision-making processes will help make your decisions more deliberative. Facing ethical concerns forces people to confront not only what they are required to do but what they should do as well.

Record Linkage and Data Enhancement

By David Loshin

In my last two posts we looked at the distribution of information about entities and the use of record linkage to find corresponding data records in different data sets that can be linked together. Record linkage can be used for a number of processes that we bundle under the concept of "data enhancement," which we'll use to describe any methods for improving the value and usefulness of information. In this post, we'll look at three different types of enhancement:

· Data cleansing - The first type of enhancement is relatively straightforward: our idea is to link records together for the purposes of cleansing the data, or making it more suitable for use. Often, one data set may have a more trustworthy representation of an entity, or we may have more than one data set, each potentially containing overlapping data elements such as birth date, address, telephone number. By linking two different records, you can compare the corresponding values, find those that are of better quality (e.g. more complete or more current values) and update the "delinquent" record with the higher quality values.

· Enrichment - Existing records for entities (such as people or products) can be matched against other data sets with additional reference information. For example, you might want to match your customer data with a credit bureau's data and enrich your own data set with each individual's credit ratings.

· Merge/Purge - Duplicate records entered into one data set often plague the business in attempting to actively manage customer accounts. Applying the record linkage methodology to the records in a single data set helps find multiple records that refer to the same individual. These records can be presented to a data analyst, who reviews them, determines the surviving record, and updates it with the highest quality values.

There are many variations on these themes. For example, merge/purge can be used for combining customer data sets after a corporate acquisition; enrichment can be used to institute a taxonomic hierarchy for customer classification and segmentation. Loosening the matching rules for merge/purge can help with a process called "householding," which attempts to identify individuals with some shared characteristics (such as "living in the same house").
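As a rough illustration of how a surviving record might be assembled once records have been linked, here is a small Python sketch. It assumes one simple rule: for each field, take the non-empty value from the most recently updated record. The field names, the `updated` date, and the sample values are all made up for the example, and in practice a data analyst may well override a rule like this.

```python
from datetime import date

def merge_linked(records, fields):
    """Build a surviving record from records already linked to one entity.

    Assumed rule: for each field, the non-empty value from the most
    recently updated record wins; a field nobody has stays None.
    """
    survivor = {}
    for field in fields:
        candidates = [rec for rec in records if rec.get(field)]
        if candidates:
            freshest = max(candidates, key=lambda rec: rec["updated"])
            survivor[field] = freshest[field]
        else:
            survivor[field] = None
    return survivor

# Two hypothetical records that linkage has matched to the same customer.
crm = {"name": "John Smith", "phone": "", "address": "123 Main St",
       "updated": date(2010, 3, 1)}
billing = {"name": "John Smith", "phone": "555-0100", "address": "456 Oak Ave",
           "updated": date(2011, 11, 15)}

survivor = merge_linked([crm, billing], ["name", "phone", "address", "email"])
print(survivor)
```

The same function covers cleansing (the newer address replaces the older one) and a simple form of merge/purge (two records collapse into one).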

Who Should Lead Your Data Quality Initiatives

By Elliot King

Data and data quality issues touch virtually every part of an organization. Poor data hurts organizational efficiency. It can have a measurable impact on the bottom line. And it can diminish employee morale when employees cannot access the information they need to succeed at their tasks, and the data they do retrieve turns out to be wrong.

So the easy answer to who should lead data quality initiatives is that the CIO should be in charge. In most cases, the CIO is in charge of all enterprise IT issues. So to suggest that data quality programs should be supervised by the CIO is not saying much, other than that data quality is an important enterprise IT issue (which it is but, sadly enough, it has to be said).

For practical purposes, data quality initiatives should be the domain of an enterprise data quality team consisting of business leaders, staff and IT personnel. Data quality issues do not exist in a vacuum. They can have a concrete impact on real operations, and only the people involved in those operations can truly understand their severity. The fact is that IT staffs are not at the point where data is actually used and often are not present when data is created either.

On the other hand, the IT organization should have the expertise to locate the problem data within the overall information infrastructure and the tools to correct what is wrong. Generally managed by IT, the enterprise data management team should represent all the stakeholders in data quality including the marketing and sales organizations, finance, operations and product development.

While cross-functional teams like this are difficult to manage and sustain, done right, the payoff can be significant. Projects can be launched with wider corporate support, and institutional knowledge about data quality can be developed. In short, poor data quality can be seen as everybody's problem and, as they say, admitting to having a problem is the first step in fixing it.

The Dupe Detective Returns: Matching 101

By Joseph Vertido

Once again, it's your Dupe Detective here, helping you win the fight against record duplicates. This time, we're going to have a quick lesson on how to have a better eye and a keener understanding of how to identify dupe problems and associate similar matching records.

Things aren't always what they seem - or as in this case, records might not always be unique, even though they look like it. Let's take a look at an example. We've got two sets of data: one is a subset of our master database containing existing customer information, and another contains new customers to be added.

[Figure: Master Data and Incoming Data]

Take a close look at the data and see if you can find anything wrong in this picture.

It doesn't take long before we realize that our two incoming records are peculiarly similar to some of the records in our master database. And it doesn't take very long to see that automating the process of associating these records - a task commonly known as Record Linkage - might not be very straightforward.

So what are some of the problems we face in this example?

I. Non-Matching Customer IDs

It's common practice to use some form of a unique string (such as a customer ID) to identify and associate records. But what happens when we've got existing customers signing up for new accounts? We get duplicate customer data with different customer IDs. Hence, we've just opened our data to duplicates.

II. Non-Matching Data

If records can't be associated based solely on customer IDs, then we can most probably make comparisons based on the other information in our data, right? This is true, but only for cases where we've got exact matching information - which won't always be the case. Who's to say that "John Smith at 123 Main St" isn't the same person as "J Smith at 123 Mein St"? Minor discrepancies are common, especially when we've got people manually typing in their information.

Without the necessary precautions and preventive measures, our master database might eventually become almost unmanageable. But now that we've come to understand the problem, let's talk about solutions!

In a world where data can be very deceiving, we can make use of Similarity Computations to track down these unwanted dupes. Even with non-matching customer IDs and differences in data, similarity computations through various algorithms allow us to associate probable and possible duplicate records, with each pairing represented by a percentage match - a job perfectly fit for our Fuzzy Matching Component in SQL Server Integration Services (SSIS).

Take a look at the similarity computation results for these records when processed through the Fuzzy Matching Component.

[Figure: Data Matching Percentage]

Can you deduce what the pattern is?


As the compared record becomes less and less similar to the source, the Match Percentage between the two records goes down correspondingly. Although the records are not precisely matching, we now have the ability to associate records based on a percentage score of likelihood assigned by a Fuzzy Matching Algorithm in our component.

Let's take a look at the results of our similarity computation in our original records:

[Figure: Data Matching Percentage - percent similarity based on the Levenshtein algorithm]

We've now successfully established a relationship between these possibly matching records through a percent score. Of course, there isn't really a single algorithm that will accommodate all types of data in all types of situations, which is why the Fuzzy Matching Component gives you several algorithms to choose from for similarity computation.
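If you want a feel for how a Levenshtein-based percentage behaves, here is a minimal Python sketch. It counts the single-character edits needed to turn one string into another and scales that count by the length of the longer string. That scaling is a common convention and an assumption on my part; it is not necessarily the formula the Fuzzy Matching Component uses, and the function names are made up for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]

def similarity_percent(a, b):
    """Assumed scoring: 100 * (1 - distance / length of longer string)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)

# The address and name pairs from the "Non-Matching Data" example.
print(similarity_percent("123 Main St", "123 Mein St"))
print(similarity_percent("John Smith", "J Smith"))
```

The two addresses differ by one character out of eleven, so they score just under 91 percent. "J Smith" needs three inserted characters to become "John Smith", so the names score noticeably lower, which is one reason a single algorithm rarely fits every field.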

The key to mastering your database is acknowledging the problem, and knowing the solutions. In this case, let's not get fooled by duplicate records in disguise. Don't get overrun, stay one step ahead, and catch them first, before they eventually catch up to you.


What is Record Linkage?

By David Loshin

In my last entry, I talked about the fact that many distributed pieces of data about a single individual can be combined together to form a deep profile about that individual. But how are different data records from disparate data sets combined to formulate insightful profiles?

The answer lies in the ability to collect the different pieces of data that belong to a single individual and then glom them together. For example, let's presume the existence of a record in one data set that has a person's address, a record in another data set that has that person's telephone number, a third record that has that person's registration number for a toaster, another with the person's car year, make, and model, etc.

As long as you can find all the records that are associated with each person and connect them together, you could collect all the interesting information together and create a single representative profile. That profile is then suitable for use in list generation, but is also used for more comprehensive analytics such as segmentation, clustering analysis, and classification.

The way these records are connected together is through a process called "record linkage." This process searches through one or more data sets looking for records that refer to the same unique entity based on identifying characteristics that can be used to distinguish one entity from all others, such as names, addresses, or telephone numbers.

When two records are found to share the same pieces of identifying information, you might assume that those records can be linked together. It sounds simple, but unfortunately, there are a number of challenges with linking records across more than one data set, such as:

· The records from the different data sets don't share the same identifying attributes (one might have phone number but the other one does not).

· The values in one data set use a different structure or format than the data in another data set (such as using hyphens for social security numbers in one data set but not in the other).

· The values in one data set are slightly different than the ones in the other data set (such as using nicknames instead of given names).

· One data set has the values broken out into separate data elements while the other does not (such as titles and name suffixes).

Luckily, there are numerous software products that are designed to address these discrepancies, which can simplify the record linkage process. If you recall some of my previous posts, you may begin to see how parsing and standardization start to fit in. These tools will parse and standardize the values prior to attempting to compare for the purposes of linkage, and that alleviates some of the challenges I noted.
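To show what "parse and standardize before comparing" can look like, here is a small Python sketch that addresses three of the challenges listed above: hyphenated versus unhyphenated social security numbers, nicknames versus given names, and titles or suffixes embedded in a name. The lookup tables are tiny, made-up samples, and the function names are hypothetical; commercial tools rely on far larger reference data.

```python
import re

# Tiny illustrative lookup tables (assumptions, not a real reference set).
NICKNAMES = {"bill": "william", "bob": "robert", "dave": "david", "liz": "elizabeth"}
TITLES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jr", "sr", "ii", "iii"}

def standardize_ssn(value):
    """Keep digits only, so hyphenated and plain forms compare equal."""
    return re.sub(r"\D", "", value)

def standardize_name(value):
    """Lowercase, strip punctuation, drop titles and suffixes,
    and map known nicknames to given names."""
    tokens = [token.strip(".,").lower() for token in value.split()]
    tokens = [t for t in tokens if t and t not in TITLES and t not in SUFFIXES]
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

def same_person(rec_a, rec_b):
    """Compare two records on standardized name and identifier."""
    return (standardize_name(rec_a["name"]) == standardize_name(rec_b["name"])
            and standardize_ssn(rec_a["ssn"]) == standardize_ssn(rec_b["ssn"]))

# Dummy values for illustration only.
left = {"name": "Mr. Dave Smith, Jr.", "ssn": "000-12-3456"}
right = {"name": "David Smith", "ssn": "000123456"}
print(same_person(left, right))
```

A raw string comparison would treat these two records as different people; after standardization the name and identifier line up exactly.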


How to Perform ETL - Fast and Easy

Imagine a more simplified approach to data integration - one that doesn't require the use of different connectors, etc. It's all possible with expressor Software's latest product release - version 3.5 of its desktop ETL platform - which will feature Melissa Data and Salesforce.com integration.


The platform will leverage the power of Melissa Data's WebSmart services for postal address verification, phone verification, email validation and name parsing - from within the expressor data flow application. The Salesforce.com integration will allow users to read and write CRM data to Salesforce.com from their on-premises business systems.

What is expressor 3.5?

It's a metadata-driven ETL tool that is easy to download and install and that offers a simplified approach to data integration, ranging from small ETL tasks to complex data integration projects. The tool provides a user interface with a Microsoft Office look and feel, and a drag and drop configuration that lets you process your data with speed and power.

But its biggest lure lies in its connectivity. Unlike other ETL products, expressor's tool does not require the use of different connectors. Instead, the expressor platform contains as few connections as possible - but each is configurable and includes all the parameters needed to get your data in and out of almost any data source.

Learn more about expressor 3.5 and Melissa Data's integration in a special co-hosted webinar, Dec. 15 at 2 pm EST.

Data Quality ROI

By Elliot King

Developing metrics to determine the return on investment is both a boon and a bane for IT professionals. A credible return on investment projection is invaluable for guiding the deployment of technology resources. And an after-the-fact calculation of actual ROI is essential for continual improvement. Did you meet your project goals within budget, and most of all, did you realize the benefits you anticipated? Calculating the ROI of an investment should tell you all that.

But ROI is also the bane of many IT professionals' existence. Many of the benefits technology produces are intangible. Assigning a monetary value to those benefits can seem arbitrary at best and fictional at worst. Moreover, calculating ROI is hard work and not the kind of work many technical people like to do (and the financial people often just don't seem to understand the challenges of collecting the necessary metrics).

Fortunately, data quality professionals can build ROI models that are credible and reusable. In the simplest iteration, the process of calculating an expected return on investment consists of five steps. First, select a target application--data quality programs do not necessarily have to be corporate-wide. What data must be used to execute function X? Next, determine the quality of the existing data. Third, determine what would have to be done to raise the quality to a specified level and how much that would cost. Fourth, anticipate what the benefits of improving the data quality would be. Finally, measure the actual benefits realized.

Direct marketing is one of the most straightforward areas in which to determine ROI for data quality. What data do you need to execute your direct marketing campaign? Typically names, addresses, email addresses and so on. Then investigate how accurate your contact database is and what you would have to do to improve it to a desired level. Next, calculate the anticipated benefits of the improved data--how many more orders would you receive, and how much would the cost of returns be reduced? After you complete the marketing campaign, analyze whether your projections were accurate.
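The direct marketing arithmetic can be sketched in a few lines of Python. Every number below is hypothetical and chosen only to make the calculation easy to follow, and the model itself (benefit equals avoided return handling plus margin on extra orders) is a deliberate simplification of the steps described above.

```python
def roi_percent(total_benefit, total_cost):
    """Return on investment as a percentage of the amount invested."""
    return 100.0 * (total_benefit - total_cost) / total_cost

def projected_campaign_roi(pieces_mailed, bad_rate_before, bad_rate_after,
                           cost_per_return, response_rate, margin_per_order,
                           cleanup_cost):
    """Projected ROI of cleaning a contact list before one mailing.

    Assumed model: every address fixed avoids one returned piece and
    responds at the campaign's usual response rate.
    """
    newly_delivered = pieces_mailed * (bad_rate_before - bad_rate_after)
    return_savings = newly_delivered * cost_per_return
    extra_margin = newly_delivered * response_rate * margin_per_order
    return roi_percent(return_savings + extra_margin, cleanup_cost)

# Hypothetical campaign: 100,000 pieces, undeliverable rate cut from 8% to 3%.
projection = projected_campaign_roi(
    pieces_mailed=100_000,
    bad_rate_before=0.08,
    bad_rate_after=0.03,
    cost_per_return=0.50,   # handling cost per returned piece
    response_rate=0.02,     # share of delivered pieces that produce an order
    margin_per_order=40.0,
    cleanup_cost=5_000.0,   # cost of the data quality work
)
print(round(projection, 1))
```

With these made-up inputs, about 5,000 additional pieces reach their recipients, which saves roughly 2,500 in returns and adds roughly 4,000 in margin against a 5,000 cleanup cost. After the campaign, the same `roi_percent` function can be fed the measured benefit to check the projection.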

While improved data quality can lead to soft returns such as improved decision-making and better operational efficiency, in many cases tangible metrics are available to determine at least a minimum return on investment.

