Recently in Data Enhancement Category

A Guide to Better Survivorship - A Melissa Data Approach

| No Comments | No TrackBacks
By Joseph Vertido

The importance of survivorship - or as others may refer to as the Golden Record - is quite often overlooked. It is the final step in the record matching and consolidation process which ultimately allows us to create a single accurate and complete version of a record. In this article, we will take a look at how Melissa Data uniquely differentiates itself in approaching the concept of survivorship compared to some of the more conventional practices.

The process of selecting surviving records means selecting the best possible candidate as its representation. However, best in the perspective of survivorship can really mean a lot of things. It can be affected by the structure of data, where the data is gathered from, how data comes in, what kind of data is stored, and sometimes by the nature of business rules. Thus techniques can be applied in order to accommodate certain types of variations when performing survivorship. We find that there are three very commonly used techniques in determining the surviving record:

I. Most Recent

Date stamped records can be ordered from most recent to less recent. The most recent record can be considered eligible as the survivor.

II. Most Frequent

Matching records containing the same information are also an indication for correctness. Repeating records indicate that the information is persistent and therefore reliable.

III. Most Complete

Field completeness is also a factor of consideration. Records with more values populated for each available field are also viable candidates for survivorship.


Although these techniques are commonly applied in survivorship schemas, its correctness may not be as reliable in many circumstances. Because these techniques apply to almost any type of data, the basis in which a surviving record is created conforms only to "generic" rules. This is where Melissa Data is able to set itself apart from "generic" survivorship. By leveraging reference data, we can steer a way to generating better and more effective schemas for survivorship.

The incorporation of reference data in survivorship changes how rules come into play. Using the Most Recent, Most Frequent or Most Complete logic really has more of an aesthetic basis for selection. Ideally, the selection of the surviving record should be based off an actual understanding of our data.

And this is where reference data comes into play. What it boils down to at the very end is simply being able to consolidate the best quality data. Thus by incorporating reference data, we gain an understanding of the actual contents of data, and create better decisions for survivorship. Let's take a look at some instances on how reference data and data quality affect decisions for survivorship.

I. Address Quality

Separating good data from bad data should take precedence in making decisions for survivorship.

Address Quality Sample

In the case of addresses, giving priority to good addresses makes for a better decision in the survivorship schema.

II. Record Quality

It could also be argued that good data may exist in a single group of matching records. In cases like these, we can assess the overall quality of data by taking into consideration other pieces of information that affect the weight of overall data quality. Take for example the following data:

Record Quality Sample

In this case, the ideal approach is to evaluate multiple elements for each record in the group. Since the second record contains a valid phone number, it can be given more weight or more importance than the third record despite it being more complete.

Whether we're working with contact data, product data or any other form of data, in summary, the methodologies and logic used for record survivorship become dependent primarily on data quality. And however we choose to define data quality, it is imperative that we keep only the best pieces of data if we are to have the most accurate and correct information. In the case of Contact Data however, Melissa Data changes the perspective as to how the quality of data is defined, therefore breaking the norm of typical survivorship schemas.


Structural Differences and Data Matching

| No Comments | No TrackBacks
By David Loshin

Data matching is easy when the values are exact, but there are different types of variation that complicate matters. Let's start at the foundation: structural differences in the ways that two data sets represent the same concepts. For example, early application systems used data files that were relatively "wide," capturing a lot of information in each record, but with a lot of duplication.

More modern systems use a relational structure that segregates unique attributes associated with each data concept - attributes about an individual are stored in one data table, and those records are linked to other tables containing telephone numbers, street addresses, and other contact data.

Transaction records refer back to the individual records, which reduces the duplication in the transaction log tables.

The differences are largely in the representation - the older system might have a field for a name, a field for an address, perhaps a field for a telephone number, and the newer system might break up the name field into a first name, middle name, and last name, the address into fields for street, city, state, and ZIP code, and a telephone number into fields for area code and exchange/line number.

These structural differences become a barrier when performing records searches and matching. The record structures are incompatible: different number of fields, different field names, and different precision in what is stored.

This is the first opportunity to consider standardization: if structural differences affect the ability to compare a record in one data set to records in another data set, then applying some standards to normalize the data across the data sets will remove that barrier. More on structural standardization in my next post.

By David Loshin

One of the most frequently-performed activities associated with customer data is searching - given a customer's name (and perhaps some other information), looking that customer's records up in databases. And this leads to an enduring challenge for data quality management, which supports finding the right data through record matching, especially when you don't have all the data values, or if the values are incorrect.

When applications allow free-formed text to be inserted into data elements with ill-defined semantics, there is the risk that the values stored may not completely observe the expected data quality rules.

As an example, many customer service representatives may expect that if a customer calls the company, there will be a record in the customer database for that customer. If for some reason, though, the customer's name is not entered exactly the same way as presented during a lookup, there is a chance that the record won't be found. This happens a lot with me, since I go by my middle name, "David," and often people will shorten that to "Dave" when entering data, so when I give my name as "David" the search fails when there is no exact match.

The same scenario takes place when the customer herself does not recall the data used to create the electronic persona - in fact, how many times have you created a new online account when you couldn't remember your user id? Also, it is important to recognize that although we think in terms of interactive lookups of individual data, a huge amount of record matching is performed as bulk operations, such as mail merges, merging data during corporate acquisitions, eligibility validation, claims processing, and many other examples.

It is relatively easy to find a record when you have all the right data. As long as the values used for search criteria are available and exactly match the ones used in the database, the application will find the record. The big differentiator, though, is the ability to find those records even when some of the values are missing, or vary somewhat from the system of record. In the next few postings we'll dive a bit deeper into the types of variations and then some approaches used to address those variations.

By Joseph Vertido

For many, the concepts of data integration and data quality are separate and have no commonality. But in reality, when you combine them - they create a partnership that excels. Where data quality leaves off, data integration begins, and vice versa. A new product - Contact Zone - fuses these two concepts together into one revolutionary solution for where data integration and data quality converge.

Data integration tools simplify data migration and data warehousing procedures - both of which are concerned with the issue of data management, i.e. keeping data organized. Data quality, on the other hand, is concerned primarily with an understanding of the nature, and validity of the contents of the actual data, i.e. keeping data clean. Maintaining an organized database is not the same as keeping it clean - they are two different approaches to handling data - but they can be combined, or should they?

The short answer is yes.

In essence, data integration allows for the migration of data from a given source to a given destination. Typically, users take advantage of data integration to accomplish data warehousing initiatives - allowing for easy migration and manipulation of data, which ultimately leads to maximizing the efficiency of business intelligence and analytics.

However, Gartner states that "only 30 percent of business intelligence and data warehousing implementations fully succeed." Why? The top two reasons for failure are budget constraints and data quality. So, although the architectural constraints of building a data warehouse can be addressed by utilizing data integration tools, it still leaves the problem of poor data quality - something that most data integration tools handle with mediocrity at best.

That's where Contact Zone comes into play. It's a data integration tool optimized for data quality, allowing you to shoot two birds with one stone.

Contact Zone connects to virtually any source, overcoming an obstacle our clients frequently encounter when implementing data quality, namely there is such a variety of database format and platforms today that the types of environments and combinations can be overwhelming.

Whether you have an IBM DB2 database or PostgreSQL, leveraging Contact Zone allows for data integration for almost any form of database format, while making sure that all data is clean, correct, standardized, and valid.

By David Loshin

What I have found to be the most interesting byproduct of record linkage is the ability to infer explicit facts about individuals that are obfuscated as a result of distribution of data. As an example, consider these records, taken from different data sets:

A:
David
Loshin
301-754-6350
1163 Kersey Rd
Silver Spring
MD
20902

B:
Knowledge Integrity, Inc
1163 Kersey Rd
Silver Spring
MD
20902

C:
H David
Lotion
1163 Kersey Rd
Silver Spring
MD
20902

D:
Knowledge Integrity, Inc.
301
7546350
7546351
MD
20902

We could establish a relationship between record A and records B and C because they share the same street address. We could establish a relationship between record B and record D because the company names are the same.

Therefore, by transitivity, we can infer a relationship between "David Loshin" and the company "Knowledge Integrity, Inc" (A links to B, B links to D, therefore A links to D). However, none of these records alone explicitly shows the relationship between "David Loshin" and "Knowledge Integrity, Inc" - that is inferred knowledge.

You can probably see the opportunity here - basically, by merging a number of data sets together, you can enrich all the records as a byproduct of exposed transitive relationships.

This provides us with one more valuable type of enhancements that record linkage provides. And this is particularly valuable, since the exposure of embedded knowledge can in turn contribute to our other enhancement techniques for cleansing, enrichment, and merge/purge.

Record Linkage and Data Enhancement

| No Comments | No TrackBacks
By David Loshin

In my last two posts we looked at the distribution of information about entities and the use of record linkage to find corresponding data records in different data sets that can be linked together. Record linkage can be used for a number of processes that we bundle under the concept of "data enhancement," which we'll use to describe any methods for
improving the value and usefulness of information. In this post, we'll look at three different types of enhancement:

· Data cleansing - The first type of enhancement is relatively straightforward: our idea is to link records together for the purposes of cleansing the data, or making it more suitable for use. Often, one data set may have a more trustworthy representation of an entity, or we may have more than one data set, each potentially containing overlapping data elements such as birth date, address, telephone number. By linking two different records, you can compare the corresponding values, find those that are of better quality (e.g. more complete or more current values) and update the "delinquent" record with the higher quality values.

· Enrichment - Existing records for entities (such as people or products) can be matched against other data sets with additional reference information. For example, you might want to match your customer data with a credit bureau's data and enrich your own data set with each individual's credit ratings.

· Merge/Purge - Duplicate records entered into one data set often plague the business in attempting to actively manage customer accounts. Applying the record linkage methodology to the records in a single data set helps find multiple records that refer to the same individual. These records can be presented to a data analyst to review and determine the surviving record and updating the record with the highest quality values.
There are many variations on these themes. For example, merge/purge can be used for combining customer data sets after a corporate acquisition; enrichment can be used to institute a taxonomic hierarchy for customer classification and segmentation. Loosening the matching rules for merge/purge can help with a process called "householding," which attempts to identify individuals with some shared characteristics (such as "living in the same house").

Review: Contact Verification Component

| No Comments | No TrackBacks
By Brian Erlich
SQL Server Magazine

Melissa Data provides mailing lists, mailing and data management software, and other products that can help companies improve the quality of their data. It now offers the Data Quality Components for SSIS, a suite of components that integrate into SQL Server Integration Services (SSIS). I recently had a chance to try the company's Contact Verification Component, which you use with custom data (also provided by Melissa Data). With the component, you can validate and correct contact information, parse and reformat the data in intelligent ways, and enhance the information by providing additional data and removing questionable data.

Melissa Data recommends installing the component on a 64-bit Windows machine (Windows XP SP2 or later) running the 64-bit version of SQL Server 2008 R2 and Microsoft .NET Framework 3.5. Installation is pretty easy. It took me about 10 minutes on my Windows 7 machine, which has the 64-bit version of SQL Server 2008 R2 Enterprise Edition installed.

The first step is to install Setup.exe, which starts the installation wizard. The wizard installs the task and the data files that it relies on. Once installed, you must add the Data Flow task to the Toolbox in SSIS. If you have never done this before, don't worry--it's not hard and you only need to do it once. To add the task, open an SSIS project in Business Intelligence Development Studio (BIDS) and right-click anywhere in the Toolbox. From the drop-down menu, select Choose Items and a dialog box will open with multiple tabs along the top. On the SSIS Data Flow Items tab, select the Melissa Data Contact Verification check box. To finish the process, click OK. The Melissa Data Contact Verification task is now available for use.

To test the new task, I built a sample package using an OLE DB source. The first thing I tested was name parsing. On the Name tab, I selected FullName from the drop-down list in the Input Name section to indicate the field (i.e., database column) containing the names to parse. I then specified the names of the fields in which to place the output. The task not only parsed the full names into the five fields (prefix, first name, middle name, last name, and suffix) but also enhanced the data by adding two new fields containing the person's gender and a salutation. You can even separate out a second name from the input name. An example of this would be turning the name John and Jane Doe into John Doe and Jane Doe.

Address verification works in a similar way to name parsing. On the Address tab, which Figure 1 shows, I selected the address1, address2, city, state, and zip fields in my database as the input and assigned field names for each possible output. The task did the rest. It parsed the suite and apartment numbers into their own fields, added the nine-digit ZIP code, added latitude and longitude coordinates, and even told me which addresses were private post-office boxes, like those rented from The UPS Store.

Data Quality Components for SSIS
Figure 1: Verifying addresses in the Contact Verification Component

The Contact Verification task also verifies and enriches email addresses and phone numbers. A phone number can yield a lot of information, which you can use to enhance your contact information. For example, you can use the task to add the type of phone number (landline, cellular, or Voice over IP--VoIP), country, time zone, and a lot more.

When addresses and phone numbers are verified, the component provides result codes, which add an additional layer of information to each line of data. You can use these codes to define data as valid or invalid based on your business rules. For example, you could use the result codes to remove addresses that contain vulgarities or questionable names such as Donald Duck.

If you need to validate and enhance your company's contact information, the Contact Verification component makes it almost effortless. The component is easy to install. More important, it's easy to use, thanks to the integration with SSIS.

Contact Verification Component PROS:
Easy to install and use; complete control over the output fields; identified questionable addresses and names; added useful data to the contact database.

CONS: Tendency to hold on to some of the metadata during testing but was easily resolved by removing and re-adding the task to the design plan.

RATING: 5 out of 5

PRICE: Licenses based on the number of records; can purchase the Contact Verification Component separately or as part of the Data Quality Components for SSIS suite.

RECOMMENDATION: I would recommend this component to anyone who needs to verify or enhance contact information. Its integration with SSIS makes it easy to use.

CONTACT: Melissa Data • 800-635-4772 or 949-858-3000 • www.melissadata.com 

FREE TRIAL: www.melissadata.com/ssis10


Integrating Analytical Results with Operational Activities

| No Comments | No TrackBacks
By David Loshin

We have looked at using enhancement for operational purposes, as well as analytical purposes, but there are ways that we can merge the two into a hybrid: using enhanced data for analytics, whose results are incorporated into operational activities using the same types of enhancements.

Actually I was a bit sneaky in my last few blog entries, because I already started to plant some ideas in advance of this post. I mentioned the use of enhancement for person names, perhaps in a customer support capacity, in order to verify identity.

At the same time, I referred to analytics that use customer data enhanced data for customer profiling and geo/demo/psychographic segmentation. These two processes can be combined to provide even more effective recommendations to improve customer support or even drive additional sales.

Consider this scenario: call origination data (such as telephone number) is provided when callers reach out to an inbound call center. Location data associated with the originating telephone number is used as a key to look up geographic/demographic data, which is used to enhance the inbound caller's record with segmentation data nominally associated with the individual.

As the call center representative walks through specific scripts provided to help the caller, the enhanced profile information is used to adjust offers in real time based on previously calculated statistics.

As a more direct example, complaints about dropped mobile calls might lead into a script to recommend upgrading equipment. Based on the caller's enhanced location and geographic profile, historical measures of accepted offers coupled with connectivity statistics in the given area can be mined.

The results can then be fed into the call center application, which provides specific suggestions for the call center rep to offer a particular type of phone that is best suited for maintaining connections in the customer's location, at a promotional price that the customer is likely to accept.

With a little bit of thought, we can come up with many types of these hybrid scenarios - ones where data enhancement is used for both analytical and operational purposes. In the upcoming months I anticipate looking at a number of different aspects of data enhancement, data standards, and data cleansing, all with a focus towards improved business functions.

Data Enhancement for Analytical Purposes

| No Comments | No TrackBacks
By David Loshin

Last time we looked at two example operational uses of data enhancement. But the value is not limited to insertion in specific operational workflows, because enhancement is often done to provide additional detail for reporting and analysis purposes.

And in these cases, enhancement goes beyond data standardization and correction; instead, the enhancement process can add more information by linking one data set to another. The appended data can augment an analytical process to include extra information in generated report and interactive visualizations.

As an example, recall that in a previous post, I talked about collecting ZIP code values at a point of sale. A retail company can take sales data that includes this geographic data element and then enhance the data with demographic profiles provided by the US Census Bureau to look for correlation between purchasing patterns and documented demographics about the specific locations (including sex, age, race, Hispanic or Latino origin, household relationship, household type, group quarters population, housing occupancy, and housing tenure).

Geographic data enhancement also adds value for analysis. Given a pair of addresses, an enhancement process can evaluate different types of distances (direct distance and driving distance are two examples) between those two points. This can be useful in a number of analytical applications, such as site location planning, which compares properties based on a variety of criteria (possibly including the median driving distance for local customers for a bank branch, or average driving time for delivering pizza to frequent customers).

There are many data aggregators who can supply demographic and behavior data that can enhance your customer data sets. And you can use your own company's data for enhancement as well, such as your own product sales by region used to develop your own customer segmentation data.

Standardizing names and addresses is the first step, and linking those records to the reference data collections allows direct linkage based on specific criteria, ranging from gross-level linkage (say, at the county level) down to specific enhancement at the individual level (such as the names of the magazine to which a customer subscribes).

These qualitative enhancements augment the business intelligence and analytics processes to help companies make more sales, increase revenues, and improve profitability.

Data Enhancement for Operational Purposes

| No Comments | No TrackBacks
By David Loshin

In my last post, I introduced the concept of data enhancement as a collection of methods for adding information to a data set to increase its utility, and suggested that there are a number of scenarios where enhancement adds business value. In this post, we'll look at two examples that show ways that data enhancement can be incorporated into
operational scenarios.

We can start with a very common use of enhancement: postal standardization and address correction. A delivery address describes a specific location to which an item can be delivered. In the United States, it is usually composed of a street name, a street number, a city name, a state identifier, and a postal code. When executing a sales transaction, you would probably want to make sure that you have a valid delivery address to ensure that the purchased products can be sent to the customer. So although there is a wide variety of ways that people could assemble a delivery address, the address data can be submitted to an address validation and correction enhancement process to ensure proper delivery.

Another common example involves individual's names, which can appear in data records in different ways: first name followed by last name, last name with a comma, followed by first name, with or without titles such as "Mr." or "Professor," different generational suffixes. In a recent conversation with some colleagues at the US Census Bureau, they shared with me that they have over 1000 different patterns for ways that names can appear in data.

Again, a data standardization and enhancement process can parse out the key components of a person name, fill in the blanks (if necessary) through lookups in master data tables, and reorganize those components into a format so that a customer's identity can be established for verification purposes when providing customer service at an inbound call center.