Recently in Fuzzy Matching Category

Record Matching Made Easy with MatchUp Web Service

MatchUp®, Melissa's solution to identify and eliminate duplicate records, is now available as a web service for batch processes, fulfilling one of the most frequent requests from our customers: accurate database matching without maintaining and linking to libraries, or shelling out to the necessary locally-hosted data files.

Now you can integrate MatchUp into any part of your network that can communicate with our secure servers using common formats and protocols such as XML, JSON, REST or SOAP.


Select a predefined matching strategy, map the table input columns necessary to identify matches to the respective request elements, and submit the records for processing. Duplicate rows can be identified by a combination of NAME, ADDRESS, COMPANY, PHONE and/or EMAIL.


Our select list of matching strategies removes the complexity of configuring rules, while still applying our fast and versatile fuzzy matching algorithms and extensive datatype-specific knowledge base, ensuring the tough-to-identify duplicates will be flagged by MatchUp. 

The response returned by the service can be used to update a database or create a unique marketing list by evaluating each record's result codes, group identifier and group count, and using the record's unique identifier to link back to the original database record.
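As a rough illustration of that last step, the sketch below builds a de-duplicated list from a response. The field names `record_id`, `group_id` and `group_count` are hypothetical stand-ins chosen for this example, not the service's actual response schema, and the rule of keeping the first record in each group is an assumption; a real integration would evaluate the result codes the service returns.

```python
def build_unique_list(response_records):
    """Keep the first record seen in each duplicate group.

    Each item is a dict with hypothetical keys:
    record_id (your own unique identifier), group_id, group_count.
    """
    survivors = {}
    for rec in response_records:
        # setdefault only stores the first record seen for each group_id
        survivors.setdefault(rec["group_id"], rec)
    return [rec["record_id"] for rec in survivors.values()]


# Made-up response: records 101 and 103 fall into the same group
response = [
    {"record_id": 101, "group_id": 1, "group_count": 2},
    {"record_id": 102, "group_id": 2, "group_count": 1},
    {"record_id": 103, "group_id": 1, "group_count": 2},
]

unique_ids = build_unique_list(response)
print(unique_ids)  # [101, 102]
```

The returned identifiers can then be used to select the surviving rows from the original table.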


Since Melissa's servers do the processing, there are no key files - the temporary sorting files - to manage, freeing up valuable hardware resources on your local server.


Customers can access the MatchUp Web Service by obtaining a valid license from our sales team and selecting the endpoint compatible with their development platform and the necessary request structures here.

A 6-Minute MatchUp for SQL Server Tutorial

In this short demo, learn how to eliminate duplicates and merge multiple records into a single, accurate view of your customer - also known as the Golden Record - through a process known as survivorship using Melissa Data's advanced matching tool, MatchUp for SQL Server.

Watch our video to learn more!

Structural Differences and Data Matching

By David Loshin

Data matching is easy when the values are exact, but there are different types of variation that complicate matters. Let's start at the foundation: structural differences in the ways that two data sets represent the same concepts. For example, early application systems used data files that were relatively "wide," capturing a lot of information in each record, but with a lot of duplication.

More modern systems use a relational structure that segregates unique attributes associated with each data concept - attributes about an individual are stored in one data table, and those records are linked to other tables containing telephone numbers, street addresses, and other contact data.

Transaction records refer back to the individual records, which reduces the duplication in the transaction log tables.

The differences are largely in the representation. The older system might have a field for a name, a field for an address, and perhaps a field for a telephone number. The newer system might break up the name field into a first name, middle name, and last name; the address into fields for street, city, state, and ZIP code; and the telephone number into fields for area code and exchange/line number.

These structural differences become a barrier when performing record searches and matching. The record structures are incompatible: different numbers of fields, different field names, and different precision in what is stored.

This is the first opportunity to consider standardization: if structural differences affect the ability to compare a record in one data set to records in another data set, then applying some standards to normalize the data across the data sets will remove that barrier. More on structural standardization in my next post.
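To make the idea concrete, here is a minimal sketch in Python that maps both layouts onto one common structure before comparison. The field names and the sample values are invented for illustration, and the normalization rules (upper-casing, dropping commas, keeping only the digits of a phone number) are simplifying assumptions rather than a complete standard.

```python
def normalize_wide(record):
    """Older 'wide' layout: one field each for name, address and phone."""
    return {
        "name": " ".join(record["name"].upper().split()),
        "address": " ".join(record["address"].upper().replace(",", " ").split()),
        "phone": "".join(ch for ch in record["phone"] if ch.isdigit()),
    }


def normalize_split(record):
    """Newer layout: name, address and phone broken into separate parts."""
    name_parts = [record.get("first_name", ""),
                  record.get("middle_name", ""),
                  record.get("last_name", "")]
    address_parts = [record["street"], record["city"],
                     record["state"], record["zip"]]
    phone = record["area_code"] + record["line_number"]
    return {
        "name": " ".join(p.upper() for p in name_parts if p),
        "address": " ".join(" ".join(address_parts).upper().split()),
        "phone": "".join(ch for ch in phone if ch.isdigit()),
    }


# Made-up sample data describing the same person in both layouts
old = {"name": "John  Q Public",
       "address": "12 Main St, Springfield, IL 62701",
       "phone": "(217) 555-0100"}
new = {"first_name": "John", "middle_name": "Q", "last_name": "Public",
       "street": "12 Main St", "city": "Springfield", "state": "IL",
       "zip": "62701", "area_code": "217", "line_number": "555-0100"}

print(normalize_wide(old) == normalize_split(new))  # True
```

Once both records share the same fields and formatting, a field-by-field comparison becomes possible.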

Modeling Issues and Entity Inheritance

By David Loshin

In our last set of posts, we looked at matching and record linkage and how approximate matching could be used to improve the organization's view of "customer centricity." Data quality tools such as parsing, standardization, and business-rule based record linkage and similarity scoring can help in assessing the similarity between two records. The result of the similarity analysis is a score that can be used to advise about the likelihood of two records referring to the same real-life individual or organization.
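A similarity score of this kind can be sketched as a weighted average of per-field comparisons. The example below uses Python's standard `difflib.SequenceMatcher` as a stand-in comparison function; the field names and the weights are assumptions made for illustration, not a recommended configuration.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each field contributes to the overall score
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}


def field_similarity(a, b):
    """Similarity of two strings in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec1, rec2, weights=WEIGHTS):
    """Weighted average of the field similarities, in [0, 1]."""
    total = sum(weights.values())
    score = sum(w * field_similarity(rec1[f], rec2[f])
                for f, w in weights.items())
    return score / total


a = {"name": "John Q Public", "address": "12 Main St", "phone": "2175550100"}
b = {"name": "Jon Public", "address": "12 Main Street", "phone": "2175550100"}

print(round(record_similarity(a, b), 2))
```

The score only advises on the likelihood of a match; where to draw the line between a match and a non-match remains a business decision.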

One last thought: this approach is largely a "data-centric" activity. What I mean is that it looks at and compares two records regardless of where those records came from. They might have come from the same data set (as part of a duplicate analysis) or from different data sets (for consolidation or general linkage).

But it does not take into consideration whether one data set models "customer" data and another models "employee" data. While you may link a customer record with an employee record based on a similarity analysis of a set of corresponding data attributes, the contexts are slightly different.

A match across the two data sets is a bit of a hybrid: we have matched the same individual, but playing different roles. That introduces a different kind of question: are the identifying attributes associated with the "customer" or with the individual acting in the role of "customer"? The same question applies for individual vs. employee.

And finally, are there attributes of the roles that each individual plays that can be used for unique identification within the role context? The answers to these questions become important when matching and linkage are integrated as part and parcel of a business application (such as the consolidation of data being imported into a business intelligence framework).

Entities and their Characteristics

By David Loshin

How can you tell if two records refer to the same person (or company, or other type of organization)? In our recent posts, we have looked at how data quality techniques such as parsing and standardization help in normalizing the data values within different records so that the records can be compared. But what is being compared? That is the topic of this next set of entries.

A simplistic view might suggest that when looking at two records, comparing the corresponding values is the best way to start. For example, we might compare the corresponding names, telephone numbers, street addresses - stuff that usually appears in records representing customers, residences, patients, etc.

But the simple concept belies a much more complex question about the attributes used to describe the individual as well as to differentiate pairs of individuals. Much of this issue revolves around the approaches taken for determining what characteristics are being managed within a representative record, the motivation for including those characteristics, and, importantly, whether those data elements are used solely for "attribution" (additional description of the entity involved) or for "distinction" (to help in unique identification).

More to the point: what are the core data elements necessary for determining the uniqueness of a record? We often take for granted the fact that our relational models presume one and only one record per entity, and that there might be business impacts should more than one entry exist for each individual.

Yet individual "entities" may exist in multiple data sets, even in different contexts. Some characteristics are part and parcel of each entity, while others describe the entity playing a particular role. Our upcoming posts are intended to consider some of these issues when assessing similarity for record linkage and matching.

By David Loshin

What I have found to be the most interesting byproduct of record linkage is the ability to infer explicit facts about individuals that are obfuscated as a result of distribution of data. As an example, consider these records, taken from different data sets:

Record A:
David Loshin
1163 Kersey Rd
Silver Spring

Record B:
Knowledge Integrity, Inc
1163 Kersey Rd
Silver Spring

Record C:
H David
1163 Kersey Rd
Silver Spring

Record D:
Knowledge Integrity, Inc.

We could establish a relationship between record A and records B and C because they share the same street address. We could establish a relationship between record B and record D because the company names are the same.

Therefore, by transitivity, we can infer a relationship between "David Loshin" and the company "Knowledge Integrity, Inc" (A links to B, B links to D, therefore A links to D). However, none of these records alone explicitly shows the relationship between "David Loshin" and "Knowledge Integrity, Inc" - that is inferred knowledge.

You can probably see the opportunity here - basically, by merging a number of data sets together, you can enrich all the records as a byproduct of exposed transitive relationships.
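The transitive inference described above can be sketched with a small union-find structure in Python. The record contents follow the example in this post, with record A's name taken from the discussion of the inferred relationship; the field names and the normalization rule (lower-casing and dropping punctuation so that "Inc" and "Inc." compare equal) are assumptions made for illustration.

```python
import re
from collections import defaultdict

records = {
    "A": {"name": "David Loshin", "address": "1163 Kersey Rd"},
    "B": {"company": "Knowledge Integrity, Inc", "address": "1163 Kersey Rd"},
    "C": {"name": "H David", "address": "1163 Kersey Rd"},
    "D": {"company": "Knowledge Integrity, Inc."},
}


def norm(value):
    """Lower-case and strip punctuation so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", value.lower()).strip()


parent = {rid: rid for rid in records}


def find(x):
    """Return the representative of x's cluster (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(x, y):
    """Merge the clusters containing x and y."""
    parent[find(x)] = find(y)


# Link any two records that share a normalized address or company name
for field in ("address", "company"):
    first_seen = {}
    for rid, rec in records.items():
        if field in rec:
            key = norm(rec[field])
            if key in first_seen:
                union(rid, first_seen[key])
            else:
                first_seen[key] = rid

clusters = defaultdict(list)
for rid in records:
    clusters[find(rid)].append(rid)

print(sorted(clusters.values()))  # [['A', 'B', 'C', 'D']]
print(find("A") == find("D"))     # True: A and D are linked only through B
```

No single comparison connects A to D; the link appears only after the direct relationships are merged.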

This gives us one more valuable type of enhancement that record linkage provides. And it is particularly valuable, since the exposure of embedded knowledge can in turn contribute to our other enhancement techniques for cleansing, enrichment, and merge/purge.

Are You A Dupe Detective?

By Joseph Vertido

The process of finding approximate matching records in your data to get rid of duplicates is precisely that - fuzzy. It raises as many questions as answers. Am I using a good matching algorithm? Am I matching on the right fields? Is it a true match or a false one?

The problem begins when inconsistent data enters from multiple sources. The meticulous process of finding these similar records and comparing them to see if they are actually the same is a daunting challenge. But with the release of the Melissa Data Fuzzy Matching Component for SQL Server Integration Services (SSIS), you now have a tool that will make this all elementary. With this component, you become a Sherlock Holmes - easily cracking the case of the data doppelgangers.

Finding the Culprits

Duplicate records are identified through a percentage score. Compared records are given a match score ranging from 0% (non-matching) to 100% (exact match). So what about records that score in between?

By leveraging the ETL capabilities of SSIS, the Fuzzy Matching Component allows you to send the results to three different output destinations: Match, Non-Match, and Possible Match. Based on how strict or loose you set your thresholds to be, records will be redirected to the output tables accordingly - making the job of keeping the bad records out much easier.
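The routing logic can be pictured with a few lines of Python. This is only a sketch of the idea, not the component's implementation, and the two threshold values are arbitrary examples.

```python
def route(score, match_threshold=90.0, possible_threshold=70.0):
    """Assign a match score (0-100) to one of the three outputs.

    The default thresholds are illustrative values, not product defaults.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score >= match_threshold:
        return "Match"
    if score >= possible_threshold:
        return "Possible Match"
    return "Non-Match"


for score in (100.0, 82.5, 40.0):
    print(score, route(score))
# 100.0 Match
# 82.5 Possible Match
# 40.0 Non-Match
```

Raising the thresholds makes matching stricter and sends more records to review; lowering them does the opposite.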

More Brains are Better Than One

So how exactly does the Fuzzy Matching Component determine match percentages? Similarity computation is done through built-in fuzzy matching algorithms. But why settle for just one algorithm for similarity computation when you can have 16?

Available algorithms include common ones like Jaro and Levenshtein, as well as more advanced algorithms such as Smith-Waterman-Gotoh and PhonetEx.

Why so many, you might ask? Each algorithm has its own strengths and weaknesses.
Some are more accurate on one kind of data, such as people's names, while others are more effective on another, such as company names. But with the wide array of algorithms to choose from, you have the flexibility to choose the logic that works for your data.
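To give a feel for how one of these algorithms turns two strings into a percentage, here is a plain Python sketch of Levenshtein edit distance. Converting the distance to a score by dividing by the longer string's length is a common convention assumed here; it is not necessarily how the component computes its percentages.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(a, b):
    """Scale edit distance to a 0-100 score (an assumed convention)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))                   # 3
print(round(similarity_percent("kitten", "sitting"), 1))  # 57.1
```

Edit distance handles typos well but knows nothing about sound-alike spellings, which is one reason a phonetic algorithm can be the better choice for some fields.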

The Seven Features of the Fuzzy Matching Component

The SSIS component includes these features:
  1. Match based on several columns in your data
  2. Get actual Match Score Percentages
  3. Multiple fuzzy matching algorithms
  4. Automatic filtering of matching, non-matching, and possible matching records
  5. Built-in pre-cleansing through search and replace patterns
  6. Data Driven Model for easy migration to production
  7. Matching Metadata for data driven decision making
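Pre-cleansing through search and replace patterns (feature 5) can be illustrated with Python regular expressions. The patterns below are examples invented for this sketch, not rules shipped with the component.

```python
import re

# Hypothetical search-and-replace rules, applied in order before matching
PATTERNS = [
    (re.compile(r"[.,]"), ""),                                  # drop periods and commas
    (re.compile(r"\bincorporated\b", re.IGNORECASE), "Inc"),    # abbreviate
    (re.compile(r"\s+"), " "),                                  # collapse whitespace
]


def pre_cleanse(value):
    """Apply each pattern in turn and trim the result."""
    for pattern, replacement in PATTERNS:
        value = pattern.sub(replacement, value)
    return value.strip()


print(pre_cleanse("Knowledge  Integrity, Incorporated."))  # Knowledge Integrity Inc
print(pre_cleanse("Knowledge Integrity, Inc."))            # Knowledge Integrity Inc
```

Removing this kind of noise first means the fuzzy algorithms spend their tolerance on real differences rather than on punctuation.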

It's Elementary, My Dear Watson

In what seems to be an impossible task of finding and blocking dupe records, Melissa Data aims to help you put the puzzle pieces together and help solve the mysteries of fuzzy matching. With the Melissa Data Fuzzy Matching Component for SSIS, you're now one step closer to making your data problem-free.