Normalizing Structure Using Data Standardization for Improved Matching

By David Loshin

In my last few posts, I discussed how structural differences impact the ability to search and match records across different data sets. Fortunately, most data quality tool suites use integrated parsing and standardization algorithms to map differently structured values to a common form.

As long as there is some standard representation, we should be able to come up with a set of rules that can help to rearrange the words in a data value to match that standard.

As an example, we can look at person names (for simplicity, let's focus on name formats common to the United States). The general convention is that people have three names: a first name, a middle name, and a surname. Yet even if we limit our scope to just these components (that is, ignoring titles, generationals, and other prefixes and suffixes), there is wide variation in how a name can be represented. Here are some examples, using my own name:

• Howard David Loshin
• Howard D Loshin
• Howard D. Loshin
• David Loshin
• Howard Loshin
• H David Loshin
• H. David Loshin
• H D Loshin
• H. D. Loshin
• Loshin, Howard D
• Loshin, Howard D.
• Loshin, H David
• Loshin, H. David
• Loshin, H D
• Loshin, H. D.

The versions differ in whether they use full names or abbreviations, whether they include punctuation, and how the terms are ordered. A good parsing engine can be configured with these patterns so that it can identify each piece of a name string.
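To make the idea of pattern-driven parsing concrete, here is a minimal sketch in Python. It is not how any particular data quality tool works. The two regular expressions, the function name parse_name, and the component labels are all assumptions chosen for illustration, and they cover only the comma-first and surname-last layouts shown in the list above.

```python
import re

# Hypothetical token: a full word or an initial, with an optional trailing period.
TOKEN = r"[A-Za-z]+\.?"

# Two illustrative patterns: "Last, First [Middle]" and "First [Middle] Last".
LAST_FIRST = re.compile(
    rf"^(?P<last>[A-Za-z]+),\s*(?P<first>{TOKEN})(?:\s+(?P<middle>{TOKEN}))?$"
)
FIRST_LAST = re.compile(
    rf"^(?P<first>{TOKEN})(?:\s+(?P<middle>{TOKEN}))?\s+(?P<last>[A-Za-z]+)$"
)


def parse_name(raw):
    """Return a dict of name components, or None if no pattern matches."""
    text = " ".join(raw.split())  # collapse repeated whitespace
    for pattern in (LAST_FIRST, FIRST_LAST):
        match = pattern.match(text)
        if match:
            # Strip the trailing period from initials such as "H." or "D."
            return {
                part: (value.rstrip(".") if value else None)
                for part, value in match.groupdict().items()
            }
    return None


print(parse_name("Loshin, H. David"))
print(parse_name("Howard D. Loshin"))
```

Note that a two-word string such as "David Loshin" is labeled as first name plus surname, even though "David" is a middle name in this case. A pattern alone cannot tell the difference, which is one reason real parsing engines rely on richer rules and reference data.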

The next step is standardization: taking the parsed pieces and rearranging them into a desired order. For example, we might take a string of the form "last_name, first_name, initial" and transform it into the form "first_name, initial, last_name" as a standardized, or normalized, representation. Using a normalized representation simplifies the comparison process for data matching and record linkage.
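The rearrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name standardize is hypothetical, the target layout (first name, middle initial, surname, separated by spaces and without punctuation) is one possible choice of standard, and the input is assumed to follow one of the simple layouts from the list of examples.

```python
def standardize(name):
    """Rearrange a name string into 'First M Last' (an assumed target format)."""
    # Drop periods and collapse whitespace so "H. D." and "H D" look alike.
    text = " ".join(name.replace(".", " ").split())

    if "," in text:
        # "Last, First Middle": move the surname to the end.
        last, _, rest = text.partition(",")
        tokens = rest.split() + [last.strip()]
    else:
        tokens = text.split()

    if len(tokens) < 2:
        return text  # not enough pieces to rearrange

    first, *middle, last = tokens
    # Reduce any middle name to its initial.
    initials = [m[0].upper() for m in middle]
    return " ".join([first, *initials, last])


for raw in ["Loshin, Howard D.", "Howard David Loshin", "Howard D Loshin"]:
    print(standardize(raw))  # each prints: Howard D Loshin
```

Once the three variants collapse to the same string, a simple equality test is enough to match them. Variants that abbreviate the first name, such as "H. D. Loshin", normalize to "H D Loshin" instead, so they would still need approximate comparison rules on top of the standardized form.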

