Matchcode Caveats - How to Solve Them


By Tim Sidor, Data Quality Analyst

"The more advanced I make my matchcode, the more duplicates I'll identify."

Many new MatchUp users make this assumption. Whether or not it holds in a given case, acting on it often leads to false dupes, no dupes, or a process that seems to run forever.


Adding more columns of conditions can be looked at as 'just adding more ways to return more duplicates.' These additional criteria may or may not result in accurate groups, as you may have actually loosened your intended criteria. On the flip side, adding matchcode components may result in fewer duplicates, as you may have tightened your rules too much. Applying fuzzy algorithms without thorough testing will slow the process down, but may not return a significant number of additional matches (diminishing returns of accuracy/speed vs complexity/inefficiency).
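A toy model can make the loosening and tightening effects concrete. The sketch below is not MatchUp's API. It assumes a matchcode can be modeled as a list of columns, where each column is a set of fields that must all be equal, and two records match if any one column is satisfied. The records, field names, and helper functions are hypothetical, and only exact comparison is modeled.

```python
from itertools import combinations

# Hypothetical records; the field names are illustrative only.
records = [
    {"id": 1, "zip": "92688", "last": "Smith", "first": "John", "street": "12 Oak St"},
    {"id": 2, "zip": "92688", "last": "Smith", "first": "Jon", "street": "12 Oak St"},
    {"id": 3, "zip": "92688", "last": "Smith", "first": "Mary", "street": "40 Elm Ave"},
]

def matches(a, b, matchcode):
    """Two records match if ANY column has ALL of its components equal."""
    return any(all(a[f] == b[f] for f in column) for column in matchcode)

def matched_pairs(matchcode):
    """Return the set of (id, id) pairs that the matchcode treats as duplicates."""
    return {
        (a["id"], b["id"])
        for a, b in combinations(records, 2)
        if matches(a, b, matchcode)
    }

# One column: zip + last name + street must all agree.
base = [{"zip", "last", "street"}]
# Adding a component to the column tightens the rule.
tighter = [{"zip", "last", "street", "first"}]
# Adding a second column loosens the rule: either column may match.
looser = [{"zip", "last", "street"}, {"zip", "last"}]

for name, matchcode in [("base", base), ("tighter", tighter), ("looser", looser)]:
    print(name, sorted(matched_pairs(matchcode)))
```

In this toy data, the extra component loses the John/Jon pair, while the extra column pulls in Mary Smith at a different street, which is probably a false dupe.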

"What can I do?"

When learning to use MatchUp, we always suggest starting with the basics - a simple default matchcode that we distribute, and a small data set. This allows you to quickly run the process and analyze how the matchcode performed against the data. Then make small changes: tweak the matchcode and repeat the process, or run a slightly altered data set with a few variations in format or data values. Eventually, you will work toward your end goal of incorporating your business rules into the matching strategy (the matchcode) and running it against your production data.


By following any of the above disciplined paths, you will more quickly arrive at your goal and with a better understanding of how to create the best matchcode for your environment. No diagonal shortcuts!

"OK, I already went straight to 'Production Data and a Custom Matchcode,' what do I do?"

First, evaluate the Result Codes and Dupe Group output properties. In addition to telling you the output disposition of a record (unique, group winner, duplicate, etc.), the Result Codes will tell you which matchcode combination (which column of checkmarks in the matchcode) caused the record to match in a particular Dupe Group. If you find that a particular column never finds a match, or never finds a match that another column hasn't already found, you should consider removing it. This may also prompt you to remove duplicated component types from the matchcode, which may have been added with alternate settings. After re-evaluating the remaining components and confirming they still represent a valid strategy, you may find that your process returns more accurate results AND runs much more quickly.
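The column check described above can be sketched as a small tally. This is a minimal illustration, not the actual Result Code format: it assumes you have already worked out, for each matched pair, the set of matchcode columns that matched it (for example by testing columns one at a time). The `pair_columns` structure, the record labels, and the column numbers are all hypothetical.

```python
def column_contributions(pair_columns, all_columns):
    """For each column, count total matches and matches no other column found."""
    stats = {c: {"total": 0, "unique": 0} for c in all_columns}
    for cols in pair_columns.values():
        for c in cols:
            stats[c]["total"] += 1
        if len(cols) == 1:
            # Only one column matched this pair, so that column is essential here.
            (only,) = cols
            stats[only]["unique"] += 1
    return stats

def removal_candidates(stats):
    """Columns that never matched, or never matched a pair on their own."""
    return sorted(c for c, s in stats.items() if s["unique"] == 0)

# Hypothetical input: matched pair -> set of column numbers that matched it.
pair_columns = {
    ("A", "B"): {1},
    ("C", "D"): {1, 2},
    ("E", "F"): {1, 2},
    ("G", "H"): {3},
}

stats = column_contributions(pair_columns, all_columns=[1, 2, 3, 4])
candidates = removal_candidates(stats)
print(stats)
print("Consider removing columns:", candidates)
```

Here column 4 never matches and column 2 only finds pairs that column 1 already found. Remove one candidate at a time and re-run, since two columns that overlap only with each other would both show up as candidates.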

"Can my process run faster?"

Yes. MatchUp uses an advanced clustering method to find duplicates, and overly advanced matchcodes prevent efficient clustering, which slows processing down. For example, we had a customer move a matchcode component with a fuzzy setting from the second position to below another component that used an exact setting (and was used in all columns). Their processing time dropped from 47 hours to under 4 - just by making this simple change. Expanding on the diminishing returns concept - if an exact matchcode returns 20,000 duplicates from a 1,000,000-record set, is changing all components to a fuzzy algorithm in order to return 20,003 duplicates worth a process that takes four times as long to run?
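To put rough numbers on that question, the arithmetic can be worked through directly. The duplicate counts and the 4x multiplier are the illustrative figures from the paragraph above, not measurements. The one-hour baseline runtime is an assumption made only so the cost can be expressed in hours.

```python
# Illustrative figures from the example above (not measured results).
exact_dupes = 20_000
fuzzy_dupes = 20_003
runtime_multiplier = 4

# Assumed baseline runtime for the exact matchcode, in hours.
baseline_hours = 1.0

extra_dupes = fuzzy_dupes - exact_dupes
relative_gain = extra_dupes / exact_dupes
extra_hours = baseline_hours * (runtime_multiplier - 1)
hours_per_extra_dupe = extra_hours / extra_dupes

print(f"Extra duplicates found: {extra_dupes}")
print(f"Relative gain in duplicates: {relative_gain:.3%}")
print(f"Extra hours per additional duplicate: {hours_per_extra_dupe:.2f}")
```

Under these assumptions, quadrupling the runtime buys an increase of 0.015% in duplicates found. Whether that is worthwhile depends on what a missed duplicate costs your business.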

"What about that Result Code that tells me a specific combination returned a false dupe?" or "Why did these records not match under my rules?"

For details on how a matchcode relates to your data, click here for easy guidance to understanding your matchcode rules, and remember, test thoroughly!

For more info, go to:

