Recently in Record Linkage Category

Record Matching Made Easy with MatchUp Web Service


MatchUp®, Melissa's solution to identify and eliminate duplicate records, is now available as a web service for batch processes, fulfilling one of the most frequent requests from our customers: accurate database matching without maintaining and linking to libraries, or hosting the necessary data files locally.


Now you can integrate MatchUp into any part of your network that can communicate with our secure servers, using standard protocols and formats such as REST, SOAP, XML, and JSON.


Select a predefined matching strategy, map the input table columns needed to identify matches to the corresponding request elements, and submit the records for processing. Duplicate rows can be identified by a combination of NAME, ADDRESS, COMPANY, PHONE and/or EMAIL.
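To make the mapping concrete, here is a minimal Python sketch of a batch request. The endpoint URL, element names, and strategy name are illustrative assumptions, not Melissa's documented schema; consult the MatchUp Web Service reference for the actual request structure.

```python
import requests

# NOTE: the endpoint and field names below are placeholders for illustration only.
payload = {
    "CustomerID": "YOUR_LICENSE_KEY",      # license obtained from the sales team
    "MatchingStrategy": "NameAndAddress",  # one of the predefined strategies
    "Records": [
        # Map your table's input columns to the request elements.
        {"RecordID": "1", "FullName": "John Hansen",
         "AddressLine1": "1824 Polk Ave.", "City": "Memphis",
         "State": "TN", "Zip": "38177"},
        {"RecordID": "2", "FullName": "Jon Hansen",
         "AddressLine1": "1824 Polk Avenue", "City": "Memphis",
         "State": "TN", "Zip": "38177"},
    ],
}

response = requests.post("https://matchup.example.com/v1/doBatch",
                         json=payload, timeout=60)
results = response.json()
```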


Our curated list of matching strategies removes the complexity of configuring rules, while still applying our fast and versatile fuzzy matching algorithms and extensive datatype-specific knowledge base, ensuring that tough-to-identify duplicates will be flagged by MatchUp.


The output response returned by the service can be used to update a database or create a unique marketing list: evaluate each record's result codes, group identifier, and group count, and use the record's unique identifier to link back to the original database record.
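A hedged sketch of that post-processing step follows; the response field names (Records, RecordID, GroupID, GroupCount) are assumptions for illustration, not the documented schema.

```python
# Sample response shaped like the request sketch above; field names are illustrative.
results = {
    "Records": [
        {"RecordID": "1", "GroupID": "G1", "GroupCount": 2, "Results": "..."},
        {"RecordID": "2", "GroupID": "G1", "GroupCount": 2, "Results": "..."},
        {"RecordID": "3", "GroupID": "G2", "GroupCount": 1, "Results": "..."},
    ]
}

survivors, duplicates = {}, []
for rec in results["Records"]:
    gid = rec["GroupID"]
    if gid not in survivors:
        survivors[gid] = rec["RecordID"]    # keep one record per match group
    else:
        duplicates.append(rec["RecordID"])  # link back to the source row and purge

print(duplicates)  # -> ['2']: flag this row in the original database
```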


Since Melissa's servers do the processing, there are no key files - the temporary sorting files - to manage, freeing up valuable hardware resources on your local server.


Customers can access the MatchUp Web Service by obtaining a valid license from our sales team and selecting the endpoint compatible with their development platform and required request structures here.

A 6-Minute MatchUp for SQL Server Tutorial


In this short demo, learn how to eliminate duplicates and merge multiple records into a single, accurate view of your customer (also known as the Golden Record) through a process called survivorship, using Melissa Data's advanced matching tool, MatchUp for SQL Server.

Watch our video to learn more!


Centricity and Connections: Clearing the Air

By David Loshin

There are opportunities to adjust your customer centricity strategy based on understanding the grouping relationships that bind individuals together (either tightly or loosely). In the last post, we looked at some examples in which linking customer records into groups was straightforward because the values being compared and weighted for similarity were exact matches. When the values are not exact, some level of doubt enters the decision about including a record in a group.

Let's revisit our example from my last post by adding in a new record for evaluation:

John Hansen, 1824 Polk Ave., Memphis TN 38177
Emily S. Hansen, 1824 Polk Ave., Memphis, TN 38177
Emily Stoddard, 1824 Polk Avenue, Memphis, TN
We had already decided that John and Emily shared a household, but all of a sudden we have a third record with a name that shares some similarity with one of the existing names, and an almost exact street address match (note that the third record is missing a ZIP code).

We could speculate that "Emily Stoddard" changed her name after she married "John Hansen," or that she updated her address when she moved from her bachelorette pad to their newlywed home. But without exact knowledge of the facts, it is only speculation, and one must exercise some care when relying on speculation for business decisions.

If a few small differences pose a challenge to linkage, what would you make of dozens, or even hundreds, of variations for names, locations, or other data values?

As a case in point: in a hallway conversation at the recent Data Governance Conference, a colleague mentioned that one of his customers' databases had over one hundred variations of a certain big-box retailer's name! The conclusion to draw is that a key part of the record linkage process involves some traditional data quality tactics, namely appending a standardized version of the data to help your linkage algorithms score record similarity as a prelude to establishing connectivity.
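As a sketch of that tactic, the snippet below appends a standardized company name before matching. The variant table and company name are hypothetical; in practice a datatype-specific knowledge base supplies the mappings.

```python
import re

# Hypothetical canonical-name table; a real knowledge base would be far larger.
CANONICAL = {
    "ACME": "ACME Stores Inc",
    "ACME STORES": "ACME Stores Inc",
    "ACME STORES INC": "ACME Stores Inc",
    "ACME STORES INCORPORATED": "ACME Stores Inc",
}

def standardize_company(raw):
    # Uppercase, strip punctuation, collapse whitespace, then look up.
    key = re.sub(r"[^\w\s]", "", raw).upper()
    key = re.sub(r"\s+", " ", key).strip()
    return CANONICAL.get(key, raw)

# Wildly different source spellings converge on one standard value,
# which the linkage algorithm can then score as an exact match.
print(standardize_company("Acme Stores, Inc."))         # -> ACME Stores Inc
print(standardize_company("ACME STORES INCORPORATED"))  # -> ACME Stores Inc
```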

Customer Centricity and Connections: Establishing the Link

By David Loshin

In my last post, we began to look at the value proposition for organizing individual customers into logical groups. We started with a group that generally arises naturally, namely the traditional residential household.

We talked about householding in a previous blog posting, but it is worth reviewing the basic approaches used for determining that a group of individuals share a household. The general approach is to analyze a collection of data records and examine sets of identifying attributes for degrees of similarity in naming and residence locations. Many situations are relatively straightforward, such as this example:

John Hansen, 1824 Polk Ave., Memphis TN 38177
Emily S. Hansen, 1824 Polk Ave., Memphis, TN 38177

In this example, two individuals share both a last name and a location address, and although the data evidence does not guarantee the truth of the inference, it might be reasonable to suggest that because there is a link between the family name and the residence location, these two individuals are members of the same household. The algorithm, then, is to link records into a collection of similar records based on similarity of the surname and residence characteristics.
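A minimal Python sketch of that algorithm, assuming the name and address fields have already been parsed and standardized:

```python
from collections import defaultdict

def household_key(record):
    # Link on surname plus standardized residence location.
    return (record["last_name"].upper(), record["address"].upper(), record["zip"])

records = [
    {"first_name": "John", "last_name": "Hansen",
     "address": "1824 Polk Ave.", "zip": "38177"},
    {"first_name": "Emily S.", "last_name": "Hansen",
     "address": "1824 Polk Ave.", "zip": "38177"},
]

households = defaultdict(list)
for rec in records:
    households[household_key(rec)].append(rec)

# Both records land in the same group: one household with two members.
for key, members in households.items():
    print(key, [m["first_name"] for m in members])
```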

However, the concept of grouping is not limited to conventional groups, since there are many artificial groups formed as a result of shared interests or similarities in profile criteria. For example, people interested in certain sports car models often organize "fan clubs," new mothers often organize toddler play groups, and sports team fans are often rabid about their franchise alliances.

In turn, your company might want to create marketing campaigns that target sets of individuals grouped together by demographic or psychographic attributes. In these cases, you would adjust your algorithms to link records based on similarity of the values in other sets of data attributes.

Establishing the link goes beyond looking at the data that already exists in your data set. Rather, you may need to append additional data acquired from alternate sources.

And, interestingly enough, you will need to connect the acquired data to your existing data, and that requires yet another record linkage effort. Apparently, understanding customer collectives is pretty dependent on record linkage. And while linking records is straightforward when all the data values line up nicely, as you might suspect, there are some curious intricacies of linkage in the presence of data with questionable quality.


Content Standards for Data Matching and Record Linkage

By David Loshin

As I suggested in my last post, applying parsing and standardization to normalize data value structure will reduce complexity for exact matching. But what happens if there are errors in the values themselves?

Fortunately, the same methods of parsing and standardization can be used for the content itself. This can address the types of issues I noted in the first post of this series, in which someone entering data about me would have used a nickname such as "Dave" instead of "David."

By introducing a set of rules for pattern recognition, we can organize a number of transformations from an unacceptable value into one that is more acceptable or standardized. Mapping abbreviations and acronyms to fully spelled out words, eliminating punctuation, even reordering letters in words to attempt to correct misspellings - all of these can be accomplished by parsing the values, looking for patterns that the value matches, and then applying a transformation or standardization rule.

In essence, we can create a two-phased standardization process that first attempts to correct the content and then attempts to normalize the structure. Applying these same rules to all data sets results in a standard representation of all the records, which reduces the effort in trying to perform the exact matching.
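A compact sketch of that two-phase process, with a deliberately tiny and illustrative abbreviation table:

```python
ABBREVIATIONS = {"AVE": "AVENUE", "ST": "STREET", "RD": "ROAD"}  # illustrative

def correct_content(token):
    # Phase 1: fix the content: strip punctuation, uppercase, expand abbreviations.
    token = token.strip(".,").upper()
    return ABBREVIATIONS.get(token, token)

def standardize(value):
    # Phase 2: normalize the structure: one space between corrected tokens.
    return " ".join(correct_content(t) for t in value.split())

# Two differently keyed source values converge on one representation,
# so a plain exact match now links them.
print(standardize("1824 Polk Ave."))     # -> 1824 POLK AVENUE
print(standardize("1824  polk AVENUE"))  # -> 1824 POLK AVENUE
```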

Yet this process may still allow variance to remain, and for that we have some other algorithms that I will touch upon in upcoming posts.


By David Loshin

In my last few posts, I discussed how structural differences impact the ability to search and match records across different data sets. Fortunately, most data quality tool suites use integrated parsing and standardization algorithms to map structures together.

As long as there is some standard representation, we should be able to come up with a set of rules that can help to rearrange the words in a data value to match that standard.

As an example, we can look at person names (for simplicity, let's focus on name formats common to the United States). The general convention is that people have three names - a first name, a middle name, and a surname. Yet even limiting our scope to just these components (that is, we are ignoring titles, generationals, and other prefixes and suffixes), there is a wide range of variance for representing the name. Here are some examples, using my own name:

• Howard David Loshin
• Howard D Loshin
• Howard D. Loshin
• David Loshin
• Howard Loshin
• H David Loshin
• H. David Loshin
• H D Loshin
• H. D. Loshin
• Loshin, Howard D
• Loshin, Howard D.
• Loshin, H David
• Loshin, H. David
• Loshin, H D
• Loshin, H. D.

There are different versions depending on whether you use abbreviations or full names, whether you include punctuation, and on the order of the terms. A good parsing engine can be configured with the different patterns and will be able to identify each piece of a name string.

The next piece is standardization: taking the pieces and rearranging them into a desired order. An example might be taking a string of the form "last_name, first_name, initial" and transforming it into the form "first_name, initial, last_name" as a standardized or normalized representation. Using a normalized representation simplifies the comparison process for data matching and record linkage.
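To illustrate both pieces, here is a small Python sketch that recognizes a few of the patterns above and emits one normalized form; a production parsing engine would be configured with far more patterns than these.

```python
import re

# A few illustrative name patterns; real engines carry many more.
PATTERNS = [
    # "Loshin, Howard D." / "Loshin, H. David" / "Loshin, Howard"
    re.compile(r"^(?P<last>[A-Za-z]+),\s+(?P<first>[A-Za-z]+\.?)\s*(?P<middle>[A-Za-z]+\.?)?$"),
    # "Howard D. Loshin" / "H. David Loshin"
    re.compile(r"^(?P<first>[A-Za-z]+\.?)\s+(?P<middle>[A-Za-z]+\.?)\s+(?P<last>[A-Za-z]+)$"),
    # "David Loshin"
    re.compile(r"^(?P<first>[A-Za-z]+)\s+(?P<last>[A-Za-z]+)$"),
]

def parse_name(raw):
    for pattern in PATTERNS:
        match = pattern.match(raw.strip())
        if match:
            return match.groupdict()
    return None

def normalize(raw):
    # Rearrange the parsed pieces into "first middle last", dropping periods.
    parts = parse_name(raw)
    if parts is None:
        return raw
    pieces = [parts.get("first"), parts.get("middle"), parts.get("last")]
    return " ".join(p.strip(".") for p in pieces if p)

print(normalize("Loshin, H. David"))  # -> H David Loshin
print(normalize("H. David Loshin"))   # -> H David Loshin
```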


By David Loshin

One of the most frequently performed activities associated with customer data is searching: given a customer's name (and perhaps some other information), looking that customer's records up in databases. This leads to an enduring challenge for data quality management, which supports finding the right data through record matching, especially when you don't have all the data values or when the values are incorrect.

When applications allow free-formed text to be inserted into data elements with ill-defined semantics, there is the risk that the values stored may not completely observe the expected data quality rules.

As an example, many customer service representatives expect that if a customer calls the company, there will be a record in the customer database for that customer. If for some reason, though, the customer's name was not entered exactly the way it is presented during a lookup, there is a chance that the record won't be found. This happens to me a lot: since I go by my middle name, "David," people often shorten it to "Dave" when entering data, so when I give my name as "David" the search fails because there is no exact match.
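The failure mode is easy to reproduce; here is a tiny sketch with illustrative data:

```python
# The record was keyed with the nickname "Dave" at data entry time.
customers = {("Dave", "Loshin"): {"account": 1001}}

def find_customer(first, last):
    # Exact-match lookup: any variation in the stored value causes a miss.
    return customers.get((first, last))

print(find_customer("David", "Loshin"))  # -> None, even though the customer exists
```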

The same scenario takes place when the customer herself does not recall the data used to create the electronic persona - in fact, how many times have you created a new online account when you couldn't remember your user id? Also, it is important to recognize that although we think in terms of interactive lookups of individual data, a huge amount of record matching is performed as bulk operations, such as mail merges, merging data during corporate acquisitions, eligibility validation, claims processing, and many other examples.

It is relatively easy to find a record when you have all the right data. As long as the values used for search criteria are available and exactly match the ones used in the database, the application will find the record. The big differentiator, though, is the ability to find those records even when some of the values are missing, or vary somewhat from the system of record. In the next few postings we'll dive a bit deeper into the types of variations and then some approaches used to address those variations.

Modeling Issues and Entity Inheritance

By David Loshin

In our last set of posts, we looked at matching and record linkage and how approximate matching could be used to improve the organization's view of "customer centricity." Data quality techniques such as parsing, standardization, and business-rule-based record linkage and similarity scoring can help in assessing the similarity between two records. The result of the similarity analysis is a score that can be used to advise about the likelihood of two records referring to the same real-life individual or organization.

One last thought: this approach is largely a "data-centric" activity. What I mean is that it looks at and compares two records regardless of where those records came from. They might have come from the same data set (as part of a duplicate analysis) or from different data sets (for consolidation or general linkage).

But it does not take into consideration whether one data set models "customer" data and another models "employee" data. While you may link a customer record with an employee record based on a similarity analysis of a set of corresponding data attributes, the contexts are slightly different.

A match across the two data sets is a bit of a hybrid: we have matched the individual, but an individual playing different roles. That introduces a different kind of question: are the identifying attributes associated with the "customer," or with the individual acting in the role of "customer"? The same question applies for the individual vs. the employee.

And finally, are there attributes of the roles that each individual plays that can be used for unique identification within the role context? The answers to these questions become important when matching and linkage are integrated as part and parcel of a business application (such as the consolidation of data being imported into a business intelligence framework).


Approximate Matching

By David Loshin

Actually, my first name is not David; that is really my middle name, but it is the given name my parents used when talking to me. This has led to a lot of confusion over the years, especially when confronted with a form asking for my "first name" and my "last name." For official forms (like my driver's license) I use my real first name as my "first name," but for non-official forms I often just use David. The result is that there is inconsistency in my own representation in records across different data systems.

If we were to rely solely on an exact data element-to-data element match of values to determine record duplication, the variation in use of my first or middle name would prevent two records from linking. In turn, you can extrapolate and see that any variations across systems of what should be the same values will prevent an exact match, leading to inadvertent duplication.

Fortunately, we can again rely on data quality techniques. We have our stand-bys of parsing and standardization, which can be enhanced through the use of transformation rules to map abbreviations, acronyms, and common misspellings to their standard representations - an example might be mapping "INC" and "INC." and "Inc" and "inc" and "inc." and "incorp" and "incorp." and "incorporated" all to a standard form of "Inc."

We can add to this another tool: approximate matching. This matching technique allows for two values to be compared with a numeric score that indicates the degree to which the values are similar. An example might compare my last name "Loshin" with the word "lotion" and suggest that while the two values are not strict alphabetic matches, they do match phonetically.

There are a number of techniques used for approximate matching of values, such as comparing the set of characters, the number of transposed, inserted, or omitted letters, different kinds of forward and backward phonetic scoring, as well as other more complex algorithms.
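As a minimal sketch of one such technique, here is the classic edit-distance computation (counting inserted, deleted, or substituted letters) normalized into a similarity score; phonetic schemes such as Soundex or Metaphone would score sound rather than spelling.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # Normalize edit distance into a 0..1 score (1.0 = identical).
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(similarity("Loshin", "lotion"))  # -> 0.5: a partial match rather than a zero
```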

In turn, we can apply this approximate matching to the entire set of corresponding identifying attributes and weight each score based on the differentiation factor associated with each attribute. For example, a combination of first name and last name might provide greater differentiation than a birth date, since there is a relatively limited number of dates on which an individual can be born (maximum 366 per year).
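Continuing the sketch, a weighted aggregate over corresponding fields might look like this; the weights and sample records are illustrative assumptions, and similarity() comes from the previous snippet.

```python
# Illustrative weights reflecting differentiation power: names outweigh birth date.
WEIGHTS = {"first_name": 0.35, "last_name": 0.40, "birth_date": 0.25}

def record_score(rec_a, rec_b):
    # Weighted average of per-field approximate-match scores.
    total = sum(WEIGHTS.values())
    return sum(w * similarity(rec_a.get(f, ""), rec_b.get(f, ""))
               for f, w in WEIGHTS.items()) / total

r1 = {"first_name": "David", "last_name": "Loshin", "birth_date": "1963-03-14"}
r2 = {"first_name": "Dave",  "last_name": "Loshin", "birth_date": "1963-03-14"}
print(round(record_score(r1, r2), 2))  # -> 0.86: likely the same person
```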

By applying a weighted approximate match to pairs of records, we can finesse the occurrence of variations in the data element values that might prevent direct matching from working. More on this topic in future posts.


The Challenge of Identifying Information

By David Loshin

In my last post, I introduced the question of determining which characteristics are used to uniquely differentiate between any pair of records within a data set. The same question is relevant when attempting to determine whether a pair of records represents the same entity. I like to call these "identifying attributes," and the values contained therein I call "identifying information."

Let's look at an example for customer data integration: what data element values do I compare when trying to link two records together? Let's start with the obvious ones, namely (ha ha) first and last names. Of course, we all know that there are certain names that are relatively common - just ask my friend John Smith, with whom I worked at one of my earlier jobs.

But even if you have an uncommon name, you might be surprised. For example, if you type in my name ("David Loshin") at Google, you will find entries for me, but you will also find entries for a dentist in Seattle and a professor.

Apparently, first and last names are not enough identifying information for distinction. Perhaps there is another attribute we can use? You probably know that I have written some books (see http://dataqualitybook.com), so maybe that is an additional attribute to be used. But if you go to Amazon and search for "David Loshin," you will find me, but it turns out the professor has also written a book.

Even an uncommon name such as mine still finds multiple hits, and while adding more identifying information can reduce that number of hits, a poorly selected set of attributes may still not provide the right amount of distinction. It may take a number of iterations to review a proposed set of identifying attributes and assess their completeness, density, and accuracy before settling on a core set of identifying characteristics to be used for comparison.
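A rough profiling sketch of that iteration, with hypothetical records (the two David Loshins echo the search-engine example above):

```python
from itertools import combinations

# Illustrative records only: names alone cannot tell the two Loshins apart.
sample_records = [
    {"first_name": "David", "last_name": "Loshin", "city": "New York"},
    {"first_name": "David", "last_name": "Loshin", "city": "Seattle"},
    {"first_name": "John",  "last_name": "Smith",  "city": "Memphis"},
]

def uniqueness(records, attrs):
    # Fraction of records left distinct by this attribute combination;
    # 1.0 means the combination fully differentiates the data set.
    return len({tuple(r.get(a) for a in attrs) for r in records}) / len(records)

for size in (2, 3):
    for attrs in combinations(["first_name", "last_name", "city"], size):
        print(attrs, round(uniqueness(sample_records, attrs), 2))
# ('first_name', 'last_name') scores 0.67; adding city lifts it to 1.0.
```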

One more thing to think about, though. Once you get to the point where you are pretty confident that those attributes are enough for differentiation, there is one last monkey wrench in the works: even if you had the absolute set of identifying attributes, there is no guarantee that the values themselves are exact matches!
