By David Loshin
Having worked as a data quality tool software developer, rules developer, and consultant, I am relatively familiar with some of the idiosyncrasies associated with building an effective business rule set for data standardization and, particularly, data cleansing. At first blush, the process seems relatively straightforward: I have a data value in a character string that I believe to be incorrect, and I want to use the automated transformative capability of a business rule to turn that string into a correct one.
Here is a simple example: For address correction, we'd like to expand the abbreviations for street types ("road," "street," "avenue," etc.). For the street type "STREET," we might have rules such as:
• STR is transformed into STREET
• ST is transformed into STREET
• St. is transformed into STREET
• Str is transformed into STREET
• Str. is transformed into STREET
And so on. The approach would be to integrate these rules into a data cleansing rules engine and then pass the strings to be corrected through the engine. To continue the example (and assuming we also include a rule that upper-cases all letters), the string "1250 Main Str." might be transformed into "1250 MAIN STREET" and returned to the calling routine. Seems simple, no?
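A minimal sketch of such an engine in Python might look like the following. The names `STREET_RULES` and `cleanse` are hypothetical, and I am assuming that upper-casing the input and ignoring a trailing period is enough to collapse the listed variants (STR, Str, Str., ST, St.) into two rule keys; a real rules engine would likely be more elaborate.

```python
# Hypothetical rule table: abbreviation variants for the street type "STREET".
# Upper-casing the input first means "Str", "STR" and "Str." share one key.
STREET_RULES = {
    "STR": "STREET",
    "ST": "STREET",
}


def cleanse(address: str) -> str:
    """Upper-case the address, then expand every token that matches a rule."""
    cleaned = []
    for token in address.upper().split():
        key = token.rstrip(".")  # treat "STR." and "STR" as the same abbreviation
        # Replace the token if a rule matches; otherwise keep it unchanged.
        cleaned.append(STREET_RULES.get(key, token))
    return " ".join(cleaned)


print(cleanse("1250 Main Str."))  # 1250 MAIN STREET
```

Note that the rule is applied to every token in the string, with no notion of where in the address the token sits.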
Of course it is. And simplistic as well, since the same transformation would also be applied to a street name such as "St. Charles St," which would be changed into "STREET CHARLES STREET" under that same set of rules. Because the rule is so basic, there are no controls over how, where, and when it is applied. We'd need more rules and a bit more control to transform the data effectively and correct it properly. More in my next post...
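To make the lack of control concrete, here is a small sketch that reports where in a string the street-type rule would fire. The names `STREET_VARIANTS` and `rule_hits` are hypothetical, and the matching (upper-case the input, ignore a trailing period) is my simplifying assumption about how the rules above would be applied.

```python
# Hypothetical set of abbreviation keys that the "STREET" rule matches
# once the input has been upper-cased and trailing periods are ignored.
STREET_VARIANTS = {"STR", "ST"}


def rule_hits(address: str) -> list[tuple[int, str]]:
    """Return (position, token) pairs for every token the rule would rewrite."""
    tokens = address.upper().split()
    return [
        (position, token)
        for position, token in enumerate(tokens)
        if token.rstrip(".") in STREET_VARIANTS
    ]


print(rule_hits("1250 Main Str."))  # [(2, 'STR.')]
print(rule_hits("St. Charles St"))  # [(0, 'ST.'), (2, 'ST')]
```

In the second string the rule fires at position 0 as well as at the end, and nothing in the rule itself distinguishes the "St." that is part of the street name from the "St" that is the street type.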




