
Postal Standards and Address Quality - Take 1

By David Loshin

The USPS Postal Standard (Publication 28) provides at least some of the specifications we need for address quality. For example,

 "The Postal Service defines a complete address as one that has all the address elements necessary to allow an exact match with the current Postal Service ZIP+4 and City State files to obtain the finest level of ZIP+4 and delivery point codes for the delivery address."
The next paragraph provides some additional details:

 "A standardized address is one that is fully spelled out, abbreviated by using the Postal Service standard abbreviations (shown in this publication) or as shown in the current Postal Service ZIP+4 file."
Much of the remainder of the document describes what is valid and what is not, and lists the postal standard abbreviations (as mentioned in the definition of standardized). So an address must be complete, which by definition implies that it can be matched with current Postal Service ZIP+4 and City State files.

This match is to obtain the ZIP+4, so the implication is that verification means that a complete address matches the USPS files and has the correct ZIP+4. The address components must be consistent with the postal standard in terms of valid and invalid values. For example, a street address cannot have a number that is outside the range of recognized numbers (that is, if the USPS file says that Main Street goes from 1-104, an address with 109 Main St is invalid). So validation means that the street address is consistent with what is documented by the USPS files. Standardization is also defined by the above reference: it is spelled out, and uses the USPS standard abbreviations.

In turn, the process for address quality would be to:

1) Ensure the address is complete.
2) Ensure that the address values are valid by checking them against the USPS files.
3) Verify the address's ZIP+4 by matching against the USPS files.
4) Standardize the address according to the USPS standardized abbreviations.
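
As a rough illustration, the four steps above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the STREET_RANGES and SUFFIXES tables are invented stand-ins for the USPS files (the Main Street range of 1-104 comes from the example above, while the ZIP+4 value is made up), and check_address is a hypothetical helper, not part of any USPS tool.

```python
# Hypothetical stand-in for the USPS reference files. The street, range, and
# ZIP+4 values are invented for illustration only.
STREET_RANGES = {("MAIN", "ST"): (1, 104, "12345-6789")}

# Small illustrative subset of suffix abbreviations.
SUFFIXES = {"STREET": "ST", "ST": "ST", "ROAD": "RD", "RD": "RD"}


def check_address(number, name, suffix, city, state):
    """Return (standardized_address, problems) after the four steps."""
    # 1) Completeness: every element must be present.
    elements = {"number": number, "name": name, "suffix": suffix,
                "city": city, "state": state}
    missing = [key for key, value in elements.items() if not value]
    if missing:
        return None, ["incomplete: missing " + ", ".join(missing)]

    # 2) Validity: the street must exist and the number must be in range.
    key = (name.upper(), SUFFIXES.get(suffix.upper()))
    if key not in STREET_RANGES:
        return None, ["invalid: street not found in reference data"]
    low, high, zip4 = STREET_RANGES[key]
    if number not in range(low, high + 1):
        return None, [f"invalid: {number} is outside {low}-{high}"]

    # 3) Verification: take the ZIP+4 from the reference data.
    # 4) Standardization: emit standard abbreviations in a fixed layout.
    standardized = (f"{number} {key[0]} {key[1]}, "
                    f"{city.upper()} {state.upper()} {zip4}")
    return standardized, []


print(check_address(100, "Main", "Street", "Anytown", "NY"))
print(check_address(109, "Main", "St", "Anytown", "NY"))
```

The second call fails at step 2, since 109 falls outside the recognized range for the street.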

By David Loshin

In my last post, I shared a story about how rampant address validation actually can transform accurate (if not 100% standardized) addresses into inaccurate ones. My client actually noted that with some of the tools they have seen, addresses that have been submitted to the product and standardized are then re-submitted to the product, which then reports their own corrected address as being invalid! Huh?

Here are a few reflections:

1) Siloed application development has led to multiple iterations of address correction, often using different tools, or if using the same tool, using different sets of rules, which is probably not the best scenario since it will lead to inconsistencies.

2) Address standardization tools have to be maintained - the rules may change, locations may change, the standards may evolve.

3) The criticality of precision and standardization/validation has to be assessed in the context of how the address (or location) is being used. For example, for delivery purposes accuracy should trump "validity" if a validated address is no longer correct.

4) Find a single tool solution for address standardization and validation instead of using multiple products.

5) Have a single team responsible for managing the solicitation of requirements, definition of rules, and provision of address validation and correction services.

One last thought: push address correction and validation to the earliest point in the business process. If your application solicits an address from the customer, validate it right away and have the corrected address verified directly by the customer. And trust the customer to give you accurate address data - they probably have some experience in ensuring they get their mailings!
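
One way to catch the resubmission problem described at the top of this post is to check that a standardizer is idempotent: running it on its own output should change nothing. The sketch below is illustrative only. Both standardizers are toy functions invented for this example, and the "ST" to "SAINT" rule is a hypothetical bug, not the behavior of any particular product.

```python
def find_unstable(standardize, samples):
    """Return (raw, once, twice) for samples that change on a second pass."""
    failures = []
    for raw in samples:
        once = standardize(raw)
        twice = standardize(once)
        if once != twice:
            failures.append((raw, once, twice))
    return failures


def toy_standardize(address):
    # Toy rule: uppercase everything and abbreviate STREET to ST.
    words = address.upper().split()
    return " ".join("ST" if word == "STREET" else word for word in words)


def buggy_standardize(address):
    # Hypothetical bug: ST is expanded to SAINT before STREET is abbreviated,
    # so the tool's own output is rewritten again on the next pass.
    words = address.upper().split()
    words = ["SAINT" if word == "ST" else word for word in words]
    words = ["ST" if word == "STREET" else word for word in words]
    return " ".join(words)


print(find_unstable(toy_standardize, ["12 Main Street"]))
print(find_unstable(buggy_standardize, ["12 Main Street"]))
```

The same check works across products: feed the output of one tool into the next and flag every address that changes.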


By David Loshin

There are all sorts of tools associated with address standardization, cleansing, and validation. As an example, the USPS has a certification program for software vendors, referred to as CASS (Coding Accuracy Support System)™ certification. According to their website,

CASS enables the Postal Service™ to evaluate the accuracy of address matching software programs in the following areas:

(1) five-digit coding
(2) ZIP + 4/ delivery point (DP) coding
(3) carrier route coding
(4) DPV®
(5) DSF2®
(6) LACSLink®
(7) eLOT®
(8) RDI™ products

CASS allows vendors/mailers the opportunity to test their address-matching software packages and, after achieving a certain percentage of compliance, to be certified by the Postal Service. CASS does not measure the accuracy of ZIP + 4 delivery point, five-digit ZIP, or carrier route codes in a mailer's existing files. CASS enables mailers to measure and diagnose internally written, commercially-available, address-matching software packages. The effectiveness of service bureaus' matching software can also be measured.

There are many vendors selling CASS-certified tools and services. Organizations use CASS-certified tools for address standardization, correction, and validation. End of story, right?

Wrong. Some organizations use many CASS-certified tools for address standardization, correction, and validation, at different places along the processing stream. The addresses are standardized, cleansed, and validated (or not) multiple times. The addresses are changed from their original format, manipulated, and then shoved back into the databases, without considering the actual process dependencies or expectations.

And then you end up with a scenario like this: a process accepts customer applications, including their self-provided addresses, and sends hard copies of acknowledgements to those addresses, yet the same process includes an elaborate mechanism for managing returned mail. That did not make sense to me: if the organization sends out acknowledgements to the address the customer provided, wouldn't they trust that the customer provided an accurate address?

In fact, the issue was self-created: the provided address passed through at least three different iterations (with different products!) of standardization, correction, and validation, and was transformed from a deliverable ("accurate") address to an invalid one.

So even though the intent was appropriate, the execution of the process got in the way of the results. So I'll throw out two questions: First, is address standardization and validation a tool or a process? And second, at what point and how frequently in the business process flow should address standardization and validation take place?


By David Loshin

One nice thing about addresses, especially in the United States, is that they have well-defined standards. In previous blog series, I have looked at the process of address standardization and correction, so I won't belabor that point. However, many people confuse the differences among a valid address, a precise representation of an address, and an accurate address.
 
First of all, recall that the main reason for standardizing addresses is based on "delivery" - supporting the general expectation that posted items reach their desired destination. Validation that addresses meet the standard certainly can improve the delivery processing, especially when it comes to sorting the items to be delivered. That is the reason that the Post Office offers discounts in the costs of mailings when the addresses are validated in standard form.

Luckily, there is a lot of leeway when it comes to meeting this expectation. If you provide a street number, street address, city, and state, there is a good chance your item will be delivered even if it is missing a ZIP code.

On the other hand, the address is not necessarily valid in terms of it meeting the USPS standard, since it is an incomplete address. Similarly, what was put into "address line 1" and "address line 2" might have been the reverse of what is specified by the standard (does "suite 2100" go before the street address or not?), but the carrier will still be able to put the letter into the right slot once she gets to the building. In other words, the address does not need to be valid in order to be delivered.

Nor does it really have to be that precise. We might define precision to mean that the address has enough information to direct the carrier to the specific box or slot associated with the entity referenced in the address. An address with a name, a street, and a suite number is more precise than one with just a name and a street. Yet if your local carrier has some knowledge about his route, then items that have less than precise addresses will still get to their destination.

The last term may be the most important one, though: accuracy. If the address associated with the named entity is not correct, the chances of the item being delivered are much lower. And one problem is that these characterizations are not mutually exclusive: an address can be valid (i.e. it is in standard form and there is an actual delivery point associated with it) but not accurate (if the named individual is not associated with the delivery point).

And while the Post Office maintains some information and tools to help synchronize validity and accuracy, sometimes there are kinks in the process, as we will see next time.

Synergizing the Process for Locations and Addresses

By David Loshin

I am currently working with a number of clients who are dealing with particularly thorny issues relating to location. While the business drivers are relatively diverse, there are some similarities across all scenarios, especially in the ways that location is managed from an enterprise perspective. Therefore, in this set of blog entries, we'll look at different business value drivers, typical usage scenarios, and some ideas for melding process with application for a synergistic lift.

Location is a critical concept in many industries, yet the importance of a standardized approach to managing location is often unnoticed. For example, for some client scenarios, the business driver is reducing risk. Insurance companies like to see their customer base (and therefore, their exposure to certain types of hazards such as floods or earthquakes) spread across different geographic regions. Financial institutions may be subject to different laws (with different penalties for violations) at different geographic jurisdictional levels for the protection of private information.

Healthcare providers need to ensure that protected health information is not inadvertently exposed by being sent to the wrong address.

In other scenarios, the business drivers are financial, such as focusing on customer acquisition, retention or cost management. In some cases, marketing budgets are allocated to local print and media advertising to grow the customer base. In other cases, the goal is reducing transport costs by optimizing the supply chain, which depends on the distances between delivery points.

Either way, the underlying desire is precision and correctness in geolocation.

But note that precision and correctness are very separate ideas, and that difference underlies some of the main problems related to address and location management. Next post: Accuracy vs. Precision.

Data Quality Incident Management

By David Loshin

The previous step in our transition from uncontrolled reactivity to being proactively engaged in managing data quality involved defining processes for identifying and evaluating data errors using standardized methods. Providing well-defined processes to data stewards and data quality analysts helps reduce any confusion around the appropriate steps to take when those commonly-occurring data failures are discovered in process.

But in many cases there is still an issue of coordination. While standardizing the approaches to monitoring for data validity helps reduce the effort and complexity of analyzing and remediating issues, there is still the situation that the same error may impact multiple data consumers; if each data consumer reports the issue to one of the data stewards, you have many stewards investigating the same problem. So this is where our second suggestion comes in: instituting methods for coordinating those evaluations.

This is an area in which the data management world can learn lessons from our friends in desktop or network support, who rely on incident management systems for the reporting, tracking, and management of issues. Data consumers impacted by a data error can report the flaw in the incident management and tracking system, which can assign a unique identifier to the logged issue and then route it to a specific data steward.

However, carefully structuring the ways that errors are described when reported provides hierarchies and organization that facilitate the assignment of issues to those stewards with the greatest corresponding experience.

In other words, issues can be grouped to reduce the amount of replicated effort. In turn, an incident tracking system for data quality issues also provides entry points for tracking the status of the issues - whether the root cause has been identified, if a correction has been performed, or if further evaluation is being performed.
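
To make the grouping idea concrete, here is a minimal sketch of an incident tracker that files repeated reports of the same error under one identifier and routes it to a steward. Everything in it is an assumption made for illustration: the IncidentTracker class, the (dataset, field, category) error signature, the steward names, and the status strings are all hypothetical.

```python
from itertools import count


class IncidentTracker:
    """Groups reports of the same data error under one tracked incident."""

    def __init__(self, stewards):
        self._stewards = stewards      # error category -> steward name
        self._next_id = count(1)
        self._incidents = {}           # error signature -> incident record

    def report(self, dataset, field, category, reporter):
        """Log a report and return the id of the incident it belongs to."""
        signature = (dataset, field, category)
        incident = self._incidents.get(signature)
        if incident is None:
            # First report of this error: open an incident and route it.
            incident = {
                "id": next(self._next_id),
                "steward": self._stewards.get(category, "unassigned"),
                "reporters": [],
                "status": "under evaluation",
            }
            self._incidents[signature] = incident
        incident["reporters"].append(reporter)
        return incident["id"]

    def find(self, incident_id):
        """Return the incident record with the given id."""
        for incident in self._incidents.values():
            if incident["id"] == incident_id:
                return incident
        raise KeyError(incident_id)

    def set_status(self, incident_id, status):
        self.find(incident_id)["status"] = status


tracker = IncidentTracker({"address": "steward_a"})
first = tracker.report("customers", "zip", "address", "billing")
second = tracker.report("customers", "zip", "address", "marketing")
tracker.set_status(first, "root cause identified")
print(first == second, tracker.find(first))
```

Both consumers end up attached to the same incident, so only one steward investigates it.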

In the next post: Using data quality rules as a proactive measure.


Standardizing Your Approach to Monitoring the Quality of Data

By David Loshin

In my last post, I suggested three techniques for maturing your organizational approach to data quality management. The first recommendation was defining processes for evaluating errors when they are identified. These types of processes actually involve a few key techniques:

1) An approach to specifying data validity rules that can be used to determine whether a data instance or record has an error. This is more of a discipline that can be guided by formal representations of business or data rules. Often metadata management tools and data profiling tools have repositories for capturing defined rules, leading to our next technique...

2) A method for applying those rules to data. This often will take advantage of the operational aspects of a data profiling or monitoring tool to validate a data instance against a set of rules. It may also incorporate parsing and standardization rules to identify known error patterns.

3) A means for reporting errors to a data analyst or steward. Some data analysis and profiling tools can be configured to automatically notify a data steward when a data validity rule is violated. In other situations, the results of applying the validation rule can be accumulated in a repository and a front-end reporting tool is used to provide visualization and notification of errors.

4) An inventory of actions to take when specific errors occur. As your team becomes more knowledgeable about the types of errors that can occur, you will also become accustomed to the methods employed for analysis and remediation.

In time, the repeated use of tools and the corresponding actions for remediation can be evolved into standardized methods, which can be documented, published, and used as the basis for training data quality analysts.
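
The four techniques above can be sketched together in a small amount of Python: named rules, a function that applies them, a report, and an inventory of actions. This is a toy illustration rather than the behavior of any profiling tool. The rule names, the record fields, and the suggested actions are all invented for the example.

```python
import re

# 1) Data validity rules, each a named test on a record (hypothetical rules).
RULES = {
    "zip_is_five_digits":
        lambda record: re.fullmatch(r"[0-9]{5}", record.get("zip", "")) is not None,
    "state_present":
        lambda record: bool(record.get("state")),
}

# 4) Inventory of actions to take when a specific rule is violated.
ACTIONS = {
    "zip_is_five_digits": "look up the ZIP code from the street, city, and state",
}


def apply_rules(records, rules):
    """2) Apply every rule to every record; return (record index, rule name)."""
    return [(index, name)
            for index, record in enumerate(records)
            for name, rule in rules.items()
            if not rule(record)]


def report(violations, actions):
    """3) Turn violations into messages for a data steward."""
    return [f"record {index}: failed {name}; "
            f"action: {actions.get(name, 'escalate to a data steward')}"
            for index, name in violations]


records = [{"zip": "12345", "state": "NY"}, {"zip": "1234", "state": ""}]
for line in report(apply_rules(records, RULES), ACTIONS):
    print(line)
```

Keeping rules and actions in plain lookup tables makes it easy to add a new rule once the team has agreed on how to remediate it.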


Business Rules for Standardization: Bringing it all Together

By David Loshin

While we have been talking in the last few posts about checking whether a data value observes the standard (and is therefore a valid value), the real challenge in standardization is in (1) determining that a value does not meet the standard and then (2) taking the right actions to modify it so that it does meet the standard.

That process, strangely enough, is called "standardization," and it extends the tokenization and parsing to recognize both valid tokens and common patterns for invalid ones, and that is where the power of standardization lies. Here is the basic idea: when you recognize a token value to be a known error, you can define a business rule to map it to a corrected version.

The example I have used over the recent blog posts is a simple address standard:

· The number must be a positive integer number

· The name must have one and only one word

· The street type must be one of the following: RD, ST, AV, PL, or CT

And deriving these additional expectations:

· The address string must have three components to it (format)

· The first component has to only have characters that are digits 0-9 (syntax)

· The first character of the first component cannot be a '0' (syntax)

· The third component must be of length 2 (format)

· The third component has to have one of the valid street types (content)

The next step would be to consider the variations from the expected values. A good example might look at the third token, namely the street type, and presume the types of errors that could happen and how they'd be corrected, such as:

Possible errors                          Standard
Rd, Road, Raod, rd                       RD
Street, STR                              ST
Avenue, AVE, avenue, abenue, avenoo      AV
Place, PLC                               PL
CRT, Court, court                        CT

In this example, we see some variant abbreviations, fully-spelled out words, a finger flub (the typist hit the b key instead of the v in "abenue" - I do this all the time), and a transposition ("Raod" instead of "Road", I also do this all the time).
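
The mapping in the table above translates directly into a lookup. Here is a minimal sketch, assuming case-insensitive matching. The function name standardize_street_type is made up for this example, and the variants are exactly the ones listed in the table.

```python
# Known error variants for each standard street type, taken from the table.
VARIANTS = {
    "RD": ["Rd", "Road", "Raod", "rd"],
    "ST": ["Street", "STR"],
    "AV": ["Avenue", "AVE", "avenue", "abenue", "avenoo"],
    "PL": ["Place", "PLC"],
    "CT": ["CRT", "Court", "court"],
}

# Invert the table into variant -> standard, compared in upper case.
LOOKUP = {variant.upper(): standard
          for standard, variants in VARIANTS.items()
          for variant in variants}
# A value that is already standard maps to itself.
LOOKUP.update({standard: standard for standard in VARIANTS})


def standardize_street_type(token):
    """Return the standard street type, or None if the token is unknown."""
    return LOOKUP.get(token.strip().upper())


print(standardize_street_type("Raod"))    # RD
print(standardize_street_type("avenoo"))  # AV
print(standardize_street_type("Blvd"))    # None
```

Returning None for an unrecognized token leaves it to the caller to decide whether to reject the address or send it to a steward for review.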

Different types of formats and patterns can be subjected to different kinds of rules. The first token has to be an integer, but perhaps some OCR reader mis-translated what it scanned into a character instead of a number, so we might see O instead of 0, A instead of 4, S instead of 8, ) instead of 9, etc. That means that part of the standardization process looks for non-digits and then applies rules that traverse the string and convert characters according to the defined mappings (A becomes 4, for example).
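
A character-by-character conversion like this is a natural fit for Python's str.translate. The sketch below uses only the four mappings named above. The repair_number helper is hypothetical, and returning None for tokens that still contain non-digits is a design choice made for this example.

```python
import re

# Mappings from the text: O -> 0, A -> 4, S -> 8, ) -> 9.
OCR_FIXES = str.maketrans({"O": "0", "A": "4", "S": "8", ")": "9"})


def repair_number(token):
    """Convert known OCR confusions to digits; None if non-digits remain."""
    if re.fullmatch(r"[0-9]+", token):
        return token                    # already all digits, nothing to fix
    repaired = token.translate(OCR_FIXES)
    return repaired if re.fullmatch(r"[0-9]+", repaired) else None


print(repair_number("1O4"))  # 104
print(repair_number("A2"))   # 42
print(repair_number("1X4"))  # None
```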

For the second token, the challenge arises when the address string contains more than three words, which means the name spans more than one word. One set of rules might take all tokens between the first and the last and concatenate them together into a single word.

Another approach is to scan the tokens and pluck out the one that most closely matches one of the street types and move that to the end.
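
Both approaches are sketched below on already-split token lists. This is a sketch under assumptions: the function names are invented, and "most closely matches" is interpreted here as the highest difflib.SequenceMatcher similarity ratio against the five street types, which is only one of several reasonable choices.

```python
from difflib import SequenceMatcher

STREET_TYPES = ("RD", "ST", "AV", "PL", "CT")


def merge_name_tokens(tokens):
    """Approach 1: join everything between the first and last token."""
    if len(tokens) < 3:
        return list(tokens)
    return [tokens[0], "".join(tokens[1:-1]), tokens[-1]]


def closeness(token):
    """Best similarity ratio between the token and any street type."""
    return max(SequenceMatcher(None, token.upper(), street_type).ratio()
               for street_type in STREET_TYPES)


def move_street_type_to_end(tokens):
    """Approach 2: move the token most like a street type to the end."""
    if len(tokens) < 3:
        return list(tokens)
    number, rest = tokens[0], list(tokens[1:])
    # The first token is the number, so only the remaining tokens compete.
    best = max(range(len(rest)), key=lambda i: closeness(rest[i]))
    street_type = rest.pop(best)
    return [number] + rest + [street_type]


print(merge_name_tokens(["12", "Old", "Mill", "RD"]))
print(move_street_type_to_end(["12", "ST", "Elm"]))
```

The two can be combined: move the street type to the end first, then merge whatever remains in the middle into a single name.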

So these are the basic ideas for standardization: defining the formats and patterns, determining the tokenization rules, parsing the data and recognizing valid and invalid tokens, defining rules for mapping invalid tokens to valid ones, and potentially rearranging tokens into the corrected version. In reality, there are many more challenges, opportunities, and subtleties, but at least this series of notes gives a high level view of the general process.


Tokenization and Parsing

By David Loshin

As we have discussed in the previous posts, the data values stored within data elements carry specific meaning within the context of the business uses of the modeled concepts, so to be able to standardize an address, the first step is identifying those chunks of information that are embedded in the values.

This means breaking out each of the chunks of a data value that carry the meaning, and in the standardization biz, each of those chunks is called a token. A token is representative of all of the character strings used for a particular purpose. In our example, we have three tokens - the number, name, and type.

Token categories can be further refined based on the value domain, such as our street type, with its listed valid values. This distinction and recognition process starts by parsing the tokens and then rearranging the strings that mapped to those tokens through a process called standardization. The process of parsing is intended to achieve two goals - to validate the correctness of the string or to identify what parts of the string need to be corrected and standardized.

We rely on metadata to guide parsing, and parsing tools use format and syntax patterns as part of the analysis.

We would define a set of data element types and patterns that correspond to each token type and the parsing algorithm matches data against the patterns and maps them to the expected tokens in the string. These tokens are then analyzed against the patterns to determine their element types.

Comparing data fields that are expected to have a pattern, such as our initial numeric token or the third street type token, enables a measurement of conformance to defined structure patterns. This can be applied in many other scenarios as well, such as telephone numbers, person names, product code numbers, etc.

Once the tokens are segregated and reviewed, as long as all tokens are valid and are in the right place, the string is valid. In the next post, we will consider what to do if the string is not valid.
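
Here is a small sketch of pattern-driven parsing for the three-token example. The regular expressions, the element type labels, and the function names are assumptions made for illustration. Real parsing tools use much richer pattern sets.

```python
import re

# Element types and their patterns, tried in order (hypothetical metadata).
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"[1-9][0-9]*")),
    ("STREET_TYPE", re.compile(r"RD|ST|AV|PL|CT")),
    ("NAME", re.compile(r"[A-Za-z]+")),
]

# The order in which element types must appear in a valid string.
EXPECTED = ["NUMBER", "NAME", "STREET_TYPE"]


def classify(token):
    """Return the first element type whose pattern matches the whole token."""
    for element_type, pattern in TOKEN_PATTERNS:
        if pattern.fullmatch(token):
            return element_type
    return "UNKNOWN"


def parse(address):
    """Split on whitespace and label each token with its element type."""
    return [(token, classify(token)) for token in address.split()]


def is_valid(address):
    """Valid only if every token has the expected type in the right place."""
    return [element_type for _, element_type in parse(address)] == EXPECTED


print(parse("12 Elm ST"), is_valid("12 Elm ST"))
print(parse("12 Elm AVE"), is_valid("12 Elm AVE"))
```

One limitation of classifying by pattern alone: a street whose name happens to be "CT" or "ST" is labeled as a street type, so a fuller parser would also take token position into account.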


Formats, Syntax, and Content

By David Loshin

One great thing about having a standard representation for data is that it becomes easy to see whether any value does or does not meet the standard. Let's use a simple example: we can say that a street address has to have three parts - a number, a name, and a "street type." We can further specify our example standard with these constraints:

· The number must be a positive integer number
· The name must have one and only one word
· The street type must be one of the following: RD, ST, AV, PL, or CT

OK, I know that there are streets with names that span more than one word, and I know there are a lot more types of streets, but this experiment is to demonstrate how we can use the standard to determine if an address is valid or not by comparing it against the defined format, syntax, and content characteristics, such as:

· The address string must have three components to it (format)
· The first component has to only have characters that are digits 0-9 (syntax)
· The first character of the first component cannot be a '0' (syntax)
· The third component must be of length 2 (format)
· The third component has to have one of the valid street types (content)

In other words, we are refining the rules for validity into ones that we can test. If the first component of the address has any characters other than digits, it is not a valid address, and if the last component of the address is "AVE" the string is not valid, since the length of that component is 3, not 2.
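
Each of the five characteristics can be written as a small test. The sketch below is one way to do it in Python. The validate function and its message wording are made up for this example, and it reports every failed rule rather than stopping at the first.

```python
VALID_STREET_TYPES = {"RD", "ST", "AV", "PL", "CT"}


def validate(address):
    """Return the list of rules the address fails (empty list means valid)."""
    parts = address.split()
    if len(parts) != 3:
        # Without three components, the remaining rules cannot be applied.
        return ["format: address must have three components"]

    number, _name, street_type = parts
    failures = []
    if not all(character in "0123456789" for character in number):
        failures.append("syntax: number may only contain the digits 0-9")
    if number.startswith("0"):
        failures.append("syntax: number cannot start with 0")
    if len(street_type) != 2:
        failures.append("format: street type must be of length 2")
    if street_type not in VALID_STREET_TYPES:
        failures.append("content: street type is not one of the valid values")
    return failures


print(validate("12 Elm ST"))   # []
print(validate("12 Elm AVE"))  # fails the length rule and the content rule
print(validate("012 Elm ST"))  # fails the leading-zero rule
```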
