Content Standards for Data Matching and Record Linkage

By David Loshin

As I suggested in my last post, applying parsing and standardization to normalize data value structure will reduce complexity for exact matching. But what happens if there are errors in the values themselves?

Fortunately, the same methods of parsing and standardization can be applied to the content itself. This addresses the kind of issue I noted in the first post of this series, in which someone entering data about me might have used a nickname such as "Dave" instead of "David."

By introducing a set of rules for pattern recognition, we can define transformations that turn an unacceptable value into one that is acceptable or standardized. Mapping abbreviations and acronyms to fully spelled-out words, eliminating punctuation, even reordering letters in words to attempt to correct misspellings: all of these can be accomplished by parsing the values, looking for patterns that the value matches, and then applying a transformation or standardization rule.
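To make the idea concrete, here is a minimal sketch in Python of rule-based content standardization. The lookup tables and the function name `standardize_content` are hypothetical and hold only a few illustrative entries; a real rule set would be much larger and more carefully curated. Lowercasing is used here only so the lookups are case-insensitive.

```python
import re

# Hypothetical rule tables, for illustration only.
NICKNAMES = {"dave": "david", "bob": "robert", "bill": "william"}
ABBREVIATIONS = {"st": "street", "ave": "avenue", "corp": "corporation"}


def standardize_content(value: str) -> str:
    """Parse a value into tokens and apply content rules to each token."""
    # Rule 1: eliminate punctuation (anything that is not a word
    # character or whitespace) and lowercase for case-insensitive lookup.
    cleaned = re.sub(r"[^\w\s]", "", value).lower()

    standardized = []
    for token in cleaned.split():
        # Rule 2: map a known nickname to its formal name.
        token = NICKNAMES.get(token, token)
        # Rule 3: map a known abbreviation to its spelled-out form.
        token = ABBREVIATIONS.get(token, token)
        standardized.append(token)
    return " ".join(standardized)


print(standardize_content("Dave Loshin"))
print(standardize_content("123 Main St."))
```

A plain dictionary lookup will not catch misspellings; correcting those would need an additional rule, such as comparing each token against a list of known words.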

In essence, we can create a two-phase standardization process that first attempts to correct the content and then attempts to normalize the structure. Applying the same rules to all data sets yields a standard representation of every record, which reduces the effort needed to perform exact matching.
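The two phases can be sketched as a small pipeline. This is an illustration under simplifying assumptions, not a production matcher: the `NICKNAMES` table, the function names, and the "Last, First" reordering rule are all hypothetical choices made for the example.

```python
import re

# Hypothetical nickname table, for illustration only.
NICKNAMES = {"dave": "david"}


def correct_content(value: str) -> str:
    """Phase 1: fix the content, leaving order and punctuation alone."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return NICKNAMES.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, value)


def normalize_structure(value: str) -> str:
    """Phase 2: put the value into one canonical layout."""
    # Assumed rule: a comma means "Last, First", so reorder the parts.
    if "," in value:
        last, first = value.split(",", 1)
        value = f"{first} {last}"
    # Lowercase and collapse runs of whitespace.
    return " ".join(value.lower().split())


def standardize(value: str) -> str:
    """Run both phases in order: content first, then structure."""
    return normalize_structure(correct_content(value))


def exact_match(a: str, b: str) -> bool:
    """Exact comparison is enough once both values are standardized."""
    return standardize(a) == standardize(b)


print(exact_match("Loshin, Dave", "David  Loshin"))
```

Because both records pass through the same rules, the final comparison is a plain string equality test.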

Yet this process may still allow variance to remain, and for that we have some other algorithms that I will touch upon in upcoming posts.

