By Joseph Vertido
Things aren't always what they seem - or as in this case, records might not always be unique, even though they look like it. Let's take a look at an example. We've got two sets of data: one is a subset of our master database containing existing customer information, and another contains new customers to be added.

Take a close look at the data and see if you can find anything wrong in this picture.
It doesn't take long before we realize that our two incoming records are peculiarly similar to some of the records in our master database. And it doesn't take very long to see that automating the process of associating these records might not be very straightforward - a task commonly known as Record Linkage.
So what are some of the problems we face in this example?
I. Non-Matching Customer IDs
It's common practice to use some form of a unique string (such as a customer ID) to identify and associate records. But what happens when we've got existing customers signing up for new accounts? We get duplicate customer data with different customer IDs. Hence, we've just opened our data to duplicates.
II. Non-Matching Data
If records can't be associated based solely on customer IDs, then we can most probably make comparisons based on the other information in our data right? This is true, but only for cases where we've got exact matching information - which won't always be the case. Who's to say that "John Smith at 123 Main St" isn't the same person as "J Smith at 123 Mein St"? Minor discrepancies are common, especially when we've got people manually typing in their information.
Without the necessary precautions and preventive measures, our master database might eventually become almost unmanageable. But now that we've come to understand the problem, let's talk about solutions!
In a world where data can be very deceiving, we can make use of Similarity Computations to track down these unwanted dupes. Even with non-matching customer IDs and differences in data, similarity computations through various algorithms allows us to associate probable and possible duplicate records represented by a percentage match - a job perfectly fit for our Fuzzy Matching Component in SQL Server Integration Services (SSIS).
Take a look at the similarity computation results for these records when processed through the Fuzzy Matching Component.

Can you deduce what the pattern is?
As the compared record becomes less and less similar to the source, the Match
Percentage between the two records correspondingly goes down as well. Although
the records are not precisely matching, we now have the ability to associate
records, based off a percentage score of likeliness assigned by a Fuzzy Matching
Algorithm in our component.
Let's take a look at the results of our similarity computation in our original records:

Percent Similarity based off the Levenshtein Algorithm
We've now successfully established a relationship between these possibly matching records through a percent score. Of course, there isn't really a single given algorithm that will accommodate to all types of data in all types of situations, which is why the Fuzzy Matching Component gives you several algorithms to choose from for similarity computation.
The key to mastering your database is acknowledging the problem, and knowing the solutions. In this case, let's not get fooled by duplicate records in disguise. Don't get overrun, stay one step ahead, and catch them first, before they eventually catch up to you.
Once again, it's your Dupe Detective here, helping you win the fight against record duplicates. This time, we're going to have a quick lesson on how to have a better eye and a keener understanding of how to identify dupe problems and associate similar matching records.
Things aren't always what they seem - or as in this case, records might not always be unique, even though they look like it. Let's take a look at an example. We've got two sets of data: one is a subset of our master database containing existing customer information, and another contains new customers to be added.

Take a close look at the data and see if you can find anything wrong in this picture.
It doesn't take long before we realize that our two incoming records are peculiarly similar to some of the records in our master database. And it doesn't take very long to see that automating the process of associating these records might not be very straightforward - a task commonly known as Record Linkage.
So what are some of the problems we face in this example?
I. Non-Matching Customer IDs
It's common practice to use some form of a unique string (such as a customer ID) to identify and associate records. But what happens when we've got existing customers signing up for new accounts? We get duplicate customer data with different customer IDs. Hence, we've just opened our data to duplicates.
II. Non-Matching Data
If records can't be associated based solely on customer IDs, then we can most probably make comparisons based on the other information in our data right? This is true, but only for cases where we've got exact matching information - which won't always be the case. Who's to say that "John Smith at 123 Main St" isn't the same person as "J Smith at 123 Mein St"? Minor discrepancies are common, especially when we've got people manually typing in their information.
Without the necessary precautions and preventive measures, our master database might eventually become almost unmanageable. But now that we've come to understand the problem, let's talk about solutions!
In a world where data can be very deceiving, we can make use of Similarity Computations to track down these unwanted dupes. Even with non-matching customer IDs and differences in data, similarity computations through various algorithms allows us to associate probable and possible duplicate records represented by a percentage match - a job perfectly fit for our Fuzzy Matching Component in SQL Server Integration Services (SSIS).
Take a look at the similarity computation results for these records when processed through the Fuzzy Matching Component.

Can you deduce what the pattern is?
Let's take a look at the results of our similarity computation in our original records:

Percent Similarity based off the Levenshtein Algorithm
We've now successfully established a relationship between these possibly matching records through a percent score. Of course, there isn't really a single given algorithm that will accommodate to all types of data in all types of situations, which is why the Fuzzy Matching Component gives you several algorithms to choose from for similarity computation.
The key to mastering your database is acknowledging the problem, and knowing the solutions. In this case, let's not get fooled by duplicate records in disguise. Don't get overrun, stay one step ahead, and catch them first, before they eventually catch up to you.




Leave a comment