By David Loshin
While we have been talking in the last few posts about checking whether a data value observes the standard (and is therefore a valid value), the real challenge in standardization is (1) determining that a value does not meet the standard and then (2) taking the right actions to modify it so that it does meet the standard.
That process, strangely enough, is called "standardization." It extends tokenization and parsing to recognize both valid tokens and common patterns for invalid ones, and that is where its power lies. Here is the basic idea: when you recognize a token value as a known error, you can define a business rule that maps it to a corrected version.
The example I have used over the recent blog posts is a simple address standard:
· The number must be a positive integer number
· The name must have one and only one word
· The street type must be one of the following: RD, ST, AV, PL, or CT
And deriving these additional expectations:
· The address string must have three components to it (format)
· The first component has to only have characters that are digits 0-9 (syntax)
· The first character of the first component cannot be a '0' (syntax)
· The third component must be of length 2 (format)
· The third component has to have one of the valid street types (content)
The next step would be to consider the variations from the expected values. A good example might look at the third token, namely the street type, and presume the types of errors that could happen and how they'd be corrected, such as:
| Possible errors | Standard |
| Rd, Road, Raod, rd | RD |
| Street, STR | ST |
| Avenue, AVE, avenue, abenue, avenoo | AV |
| Place, PLC | PL |
| CRT, Court, court | CT |
In this example, we see some variant abbreviations, fully spelled-out words, a finger flub (the typist hit the b key instead of the v in "abenue"; I do this all the time), and a transposition ("Raod" instead of "Road", which I also do all the time).
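The table above translates fairly directly into a lookup rule. The following is a minimal sketch, not a definitive implementation: the names `STREET_TYPE_RULES`, `VARIANT_TO_STANDARD`, and `standardize_street_type` are made up for illustration, and folding case before the lookup is an assumption (the table lists "Rd" and "rd" as separate entries).

```python
# Known error variants for each standard street type, copied from the table above.
STREET_TYPE_RULES = {
    "RD": ["Rd", "Road", "Raod", "rd"],
    "ST": ["Street", "STR"],
    "AV": ["Avenue", "AVE", "avenue", "abenue", "avenoo"],
    "PL": ["Place", "PLC"],
    "CT": ["CRT", "Court", "court"],
}

# Invert the table into variant -> standard. Keys are upper-cased so that
# "Road", "road", and "ROAD" all hit the same rule (an assumption).
VARIANT_TO_STANDARD = {
    variant.upper(): standard
    for standard, variants in STREET_TYPE_RULES.items()
    for variant in variants
}
# A value that is already standard maps to itself.
VARIANT_TO_STANDARD.update({standard: standard for standard in STREET_TYPE_RULES})


def standardize_street_type(token):
    """Return the standard street type, or None if the token is not a known variant."""
    return VARIANT_TO_STANDARD.get(token.strip().upper())
```

Returning `None` for an unrecognized token keeps unknown errors visible for review instead of guessing at a correction.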
Different types of formats and patterns can be subjected to different kinds of rules. The first token has to be an integer, but perhaps an OCR reader misread what it scanned as a character instead of a number, so we might see O instead of 0, A instead of 4, S instead of 8, ) instead of 9, etc. That means that part of the standardization process looks for non-digits and then applies rules that traverse the string and convert each character according to the defined mappings (A becomes 4, for example).
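As a rough sketch of that character-by-character rule, assuming only the four mappings mentioned above: the names `OCR_DIGIT_FIXES` and `standardize_house_number` are hypothetical, and treating lowercase letters the same as uppercase is an added assumption.

```python
# Character mappings taken from the examples in the text; a real rule set
# would be larger and tuned to the OCR engine in use.
OCR_DIGIT_FIXES = {"O": "0", "A": "4", "S": "8", ")": "9"}


def standardize_house_number(token):
    """Apply the OCR mappings, then re-check the syntax rules for the number.

    Returns the corrected digit string, or None if the result still
    violates the standard (non-digits remain, or it starts with '0').
    """
    # Traverse the string and convert each character that has a mapping.
    corrected = "".join(OCR_DIGIT_FIXES.get(ch.upper(), ch) for ch in token)
    if corrected.isascii() and corrected.isdigit() and not corrected.startswith("0"):
        return corrected
    return None
```

Note that the corrected value is validated again: mapping "O12" to "012" fixes the characters but still breaks the rule that the first character cannot be a '0', so the function reports it as unresolved.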
For the second token, the challenge arises when the address contains more than three words, which means the street name spans more than one word. One set of rules might take all tokens between the first and the last and concatenate them into a single word.
Another approach is to scan the tokens and pluck out the one that most closely matches one of the street types and move that to the end.
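Both approaches can be sketched in a few lines. This is an illustration under stated assumptions: the `STREET_WORDS` vocabulary, the function names, and the 0.7 similarity cutoff are all invented here, and "most closely matches" is interpreted as the highest `difflib.SequenceMatcher` ratio, which is only one of several reasonable similarity measures.

```python
import difflib

# Hypothetical vocabulary of street-type spellings to compare tokens against.
STREET_WORDS = ["RD", "ROAD", "ST", "STREET", "AV", "AVE", "AVENUE",
                "PL", "PLACE", "CT", "COURT"]


def concatenate_name(tokens):
    """Approach 1: join everything between the first and last token into one word."""
    if len(tokens) <= 3:
        return list(tokens)
    return [tokens[0], "".join(tokens[1:-1]), tokens[-1]]


def move_street_type_to_end(tokens, cutoff=0.7):
    """Approach 2: move the token that best matches a street word to the end."""
    best_index, best_score = None, 0.0
    # Skip index 0, which is expected to hold the house number.
    for i, tok in enumerate(tokens[1:], start=1):
        for word in STREET_WORDS:
            score = difflib.SequenceMatcher(None, tok.upper(), word).ratio()
            if score > best_score:
                best_index, best_score = i, score
    # Leave the tokens alone if nothing is similar enough to a street word.
    if best_index is None or best_score < cutoff:
        return list(tokens)
    reordered = list(tokens)
    reordered.append(reordered.pop(best_index))
    return reordered
```

The two rules compose: moving the street type to the end first and then concatenating the middle tokens yields the three-component shape the standard expects, after which the street-type token can be mapped to its standard code.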
So these are the basic ideas for standardization: defining the formats and patterns, determining the tokenization rules, parsing the data and recognizing valid and invalid tokens, defining rules for mapping invalid tokens to valid ones, and potentially rearranging tokens into the corrected version. In reality, there are many more challenges, opportunities, and subtleties, but at least this series of notes gives a high-level view of the general process.
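To tie the steps together, the detection half of the process (deciding that a value does not meet the standard) might look like the sketch below. It encodes the derived format, syntax, and content expectations for the simple address standard. The function name `check_address` and the choice to tokenize on whitespace are assumptions made for illustration.

```python
VALID_STREET_TYPES = {"RD", "ST", "AV", "PL", "CT"}


def check_address(address):
    """Return the list of violated expectations; an empty list means the value is valid."""
    tokens = address.split()  # assumed tokenization rule: split on whitespace
    if len(tokens) != 3:
        # Without three components the per-token checks cannot be applied.
        return ["format: address must have three components"]

    number, _name, street_type = tokens
    violations = []
    if not (number.isascii() and number.isdigit()):
        violations.append("syntax: first component must contain only digits 0-9")
    if number.startswith("0"):
        violations.append("syntax: first component must not start with '0'")
    if len(street_type) != 2:
        violations.append("format: third component must be of length 2")
    if street_type not in VALID_STREET_TYPES:
        violations.append("content: third component must be a valid street type")
    return violations
```

Each reported violation points to the kind of correction rule to try next: a syntax failure on the number suggests character mappings, a content failure on the street type suggests a variant lookup, and a format failure on the component count suggests rearranging or concatenating tokens.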



