By David Loshin
In my last few posts, I discussed how structural
differences impact the ability to search and match records across different
data sets. Fortunately, most data quality tool suites use integrated parsing
and standardization algorithms to map structures together.
As long as there is some standard representation, we should be able to come up with a set of rules that can help to rearrange the words in a data value to match that standard.
As an example, we can look at person names (for simplicity, let's focus on
name formats common to the United States). The general convention is that
people have three names: a first name, a middle name, and a surname. Yet
even limiting our scope to just these components (that is, ignoring
titles, generationals, and other prefixes and suffixes), there is wide
variation in how a name can be represented. Here are some examples, using
my own name:
• Howard David Loshin
• Howard D Loshin
• Howard D. Loshin
• David Loshin
• Howard Loshin
• H David Loshin
• H. David Loshin
• H D Loshin
• H. D. Loshin
• Loshin, Howard D
• Loshin, Howard D.
• Loshin, H David
• Loshin, H. David
• Loshin, H D
• Loshin, H. D.
These versions differ in whether they use abbreviations or full names, in their punctuation, and in the order of the terms. A good parsing engine can be configured with the different patterns and will be able to identify each piece of a name string.
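To make the idea of a configurable pattern set concrete, here is a minimal sketch in Python. It is not how any particular data quality tool works; the pattern table, the `parse_name` function, and the component labels are all hypothetical names chosen for illustration, and the patterns cover only the simple formats listed above.

```python
import re

# Hypothetical building blocks: a given-name token may be a full name or an
# initial with an optional trailing period; a surname may contain ' or -.
TOKEN = r"[A-Za-z]+\.?"
SURNAME = r"[A-Za-z'-]+"

# Hypothetical pattern table. Each entry pairs a label with a regular
# expression whose named groups identify the pieces of the name string.
# Patterns are tried in order, so more specific ones come first.
NAME_PATTERNS = [
    ("last, first middle",
     re.compile(rf"^(?P<last>{SURNAME}),\s*(?P<first>{TOKEN})\s+(?P<middle>{TOKEN})$")),
    ("first middle last",
     re.compile(rf"^(?P<first>{TOKEN})\s+(?P<middle>{TOKEN})\s+(?P<last>{SURNAME})$")),
    ("first last",
     re.compile(rf"^(?P<first>{TOKEN})\s+(?P<last>{SURNAME})$")),
]


def parse_name(raw):
    """Return a dict of name pieces, or None if no configured pattern matches."""
    text = " ".join(raw.split())  # collapse runs of whitespace
    for label, pattern in NAME_PATTERNS:
        match = pattern.match(text)
        if match:
            # Drop trailing periods so "D." and "D" yield the same piece.
            parts = {key: value.rstrip(".") for key, value in match.groupdict().items()}
            parts.setdefault("middle", "")
            parts["pattern"] = label
            return parts
    return None


print(parse_name("Loshin, Howard D."))
print(parse_name("H. David Loshin"))
print(parse_name("David Loshin"))
```

Note that the sketch labels pieces by position only: for "David Loshin" it reports "David" as the first piece, because nothing in the string itself says that it is a middle name. Supporting additional formats would mean adding entries to the pattern table rather than changing the parsing function.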
The next piece is standardization: taking the pieces and rearranging them into a desired order. An example might be taking a string of the form "last_name, first_name, initial" and transforming it into the form "first_name, initial, last_name" as a standardized, or normalized, representation. Using a normalized representation simplifies the comparison process for data matching and record linkage.
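The rearrangement step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `standardize` function is hypothetical, it assumes the name has already been split into its pieces, and it chooses to drop periods and reduce any middle name to its initial so that the output follows the "first_name, initial, last_name" order.

```python
def standardize(last_name, first_name, middle=""):
    """Rearrange already-parsed pieces into 'first_name initial last_name' order."""

    def clean(token):
        # Trim whitespace and any trailing period ("D." becomes "D").
        return token.strip().rstrip(".")

    pieces = [clean(first_name)]
    middle_clean = clean(middle)
    if middle_clean:
        # Assumption for this sketch: keep only the initial of the middle name.
        pieces.append(middle_clean[0])
    pieces.append(clean(last_name))
    return " ".join(pieces)


# Pieces taken from "Loshin, Howard D." and from "Howard David Loshin"
# both standardize to the same string, which makes comparison simple.
print(standardize("Loshin", "Howard", "D."))     # Howard D Loshin
print(standardize("Loshin", "Howard", "David"))  # Howard D Loshin
print(standardize("Loshin", "David"))            # David Loshin
```

Once two records are reduced to the same normalized string, an exact comparison is enough to match them. Variants that abbreviate the first name (for example "H D Loshin") still produce a different string, so they would need additional matching rules beyond this sketch.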




