Designing More Effective Data Cleansing Rules

| No Comments | No TrackBacks
By David Loshin

In my last post, we looked at a simple data transformation and cleansing rule that was to be used to standardize a representation of a street type. We found that an uncontrolled application of the rule made changes where we didn't really want a change to happen.

There are two reasons why applying the rule led to an undesired result. The first issue is context: even though we want to map the string "St." to "STREET" when it appears at the end of an address, that same string "St." appearing earlier in the street name and before a proper noun is more likely to be an abbreviation for "Saint," not "Street." The obvious way to fix that is to introduce a new rule that maps "St." to the word "SAINT."

And that introduces the second issue: order of execution. Let's say we did have that second rule, so our rule set now looks like this:

• St is transformed into SAINT
• St. is transformed into SAINT
• STR is transformed into STREET
• ST is transformed into STREET
• St. is transformed into STREET
• St. is transformed into STREET
• Str is transformed into STREET
• Str. is transformed into STREET

Applying that rule set to our string "St. Charles St" yields "SAINT CHARLES SAINT" which is also clearly incorrect. The reason is that we have conflicting rules, in which the same input maps to two different outputs.

In that case, the first rules applied are the ones that win, since by the time we hit the other rules with the same inputs, the input string has already been transformed into the output version, so the rule will no longer apply. Reordering the rules won't work either, since the same thing will happen (just with the other output values). To address this challenge, we need an approach that accommodates both context and order of execution.

No TrackBacks

TrackBack URL: http://blog.melissadata.com/mt-tb.cgi/189

Leave a comment