Recently in Address Standardization Category

Content Standards for Data Matching and Record Linkage

By David Loshin

As I suggested in my last post, applying parsing and standardization to normalize data value structure will reduce complexity for exact matching. But what happens if there are errors in the values themselves?

Fortunately, the same methods of parsing and standardization can be used for the content itself. This can address the types of issues I noted in the first post of this series, in which someone entering data about me would have used a nickname such as "Dave" instead of "David."

By introducing a set of rules for pattern recognition, we can organize a number of transformations from an unacceptable value into one that is more acceptable or standardized. Mapping abbreviations and acronyms to fully spelled out words, eliminating punctuation, even reordering letters in words to attempt to correct misspellings - all of these can be accomplished by parsing the values, looking for patterns that the value matches, and then applying a transformation or standardization rule.
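
As a rough illustration of such rules, here is a minimal Python sketch. The rule tables and the function name `standardize_content` are hypothetical and invented for this example; a real data quality tool would ship far larger, curated tables and more sophisticated pattern matching. Only punctuation removal and token mapping are shown; misspelling correction is left out.

```python
import re

# Hypothetical rule tables; real ones would be much larger and curated.
NICKNAMES = {"dave": "david", "bill": "william", "bob": "robert"}
STREET_ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def standardize_content(value, rules):
    """Normalize case, strip punctuation, then map tokens via a rule table."""
    # Rule: eliminate punctuation (anything that is not a word character or space).
    cleaned = re.sub(r"[^\w\s]", "", value.lower())
    # Rule: replace each token that matches a known entry with its standard form.
    return " ".join(rules.get(token, token) for token in cleaned.split())


print(standardize_content("Dave Loshin", NICKNAMES))              # david loshin
print(standardize_content("123 Main St.", STREET_ABBREVIATIONS))  # 123 main street
```

Passing the rule table as an argument keeps name rules and address rules apart, so a nickname mapping is never applied to a street name by accident.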

In essence, we can create a two-phased standardization process that first attempts to correct the content and then attempts to normalize the structure. Applying these same rules to all data sets results in a standard representation of all the records, which reduces the effort in trying to perform the exact matching.

Yet this process may still allow variance to remain, and for that we have some other algorithms that I will touch upon in upcoming posts.


By David Loshin

In my last few posts, I discussed how structural differences impact the ability to search and match records across different data sets. Fortunately, most data quality tool suites use integrated parsing and standardization algorithms to map structures together.

As long as there is some standard representation, we should be able to come up with a set of rules that can help to rearrange the words in a data value to match that standard.

As an example, we can look at person names (for simplicity, let's focus on name formats common to the United States). The general convention is that people have three names - a first name, a middle name, and a surname. Yet even limiting our scope to just these components (that is, we are ignoring titles, generationals, and other prefixes and suffixes), there is a wide range of variance for representing the name. Here are some examples, using my own name:

• Howard David Loshin
• Howard D Loshin
• Howard D. Loshin
• David Loshin
• Howard Loshin
• H David Loshin
• H. David Loshin
• H D Loshin
• H. D. Loshin
• Loshin, Howard D
• Loshin, Howard D.
• Loshin, H David
• Loshin, H. David
• Loshin, H D
• Loshin, H. D.

There are different versions depending on whether you use abbreviations or full names, punctuation, and the order of the terms. A good parsing engine can be configured with the different patterns and will be able to identify each piece of a name string.

The next piece is standardization: taking the pieces and rearranging them into a desired order. The example might be taking a string of the form "last_name, first_name, initial" and transforming that into the form "first_name, initial, last_name" as a standardized or normalized representation. Using a normalized representation will simplify the comparison process for data matching and record linkage.
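
To make the parse-then-rearrange idea concrete, here is a minimal Python sketch that handles the name variants listed above. The function names `parse_name` and `standardize_name` are made up for illustration, and the two patterns it recognizes ("last, first middle" and "first middle last") are an assumption; a commercial parsing engine would be configured with many more.

```python
def parse_name(raw):
    """Identify first, middle, and last components of a name string."""
    text = raw.replace(".", " ").strip()  # periods after initials carry no information
    if "," in text:
        # Pattern: "last, first [middle]"
        last, rest = (part.strip() for part in text.split(",", 1))
        given = rest.split()
    else:
        # Pattern: "first [middle] last"
        tokens = text.split()
        last = tokens[-1] if tokens else ""
        given = tokens[:-1]
    return {
        "first": given[0] if given else "",
        "middle": " ".join(given[1:]),
        "last": last,
    }


def standardize_name(raw):
    """Rearrange the parsed pieces into "first middle last" order."""
    parts = parse_name(raw)
    ordered = (parts["first"], parts["middle"], parts["last"])
    return " ".join(piece for piece in ordered if piece)


print(standardize_name("Loshin, Howard D."))  # Howard D Loshin
print(standardize_name("H. D. Loshin"))       # H D Loshin
```

Note that structure alone cannot resolve everything: "David Loshin" parses as first name plus surname, even though "David" is a middle name in the list above. That kind of variance is a content problem rather than a structural one.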


Structural Differences and Data Matching

By David Loshin

Data matching is easy when the values are exact, but there are different types of variation that complicate matters. Let's start at the foundation: structural differences in the ways that two data sets represent the same concepts. For example, early application systems used data files that were relatively "wide," capturing a lot of information in each record, but with a lot of duplication.

More modern systems use a relational structure that segregates unique attributes associated with each data concept - attributes about an individual are stored in one data table, and those records are linked to other tables containing telephone numbers, street addresses, and other contact data.

Transaction records refer back to the individual records, which reduces the duplication in the transaction log tables.

The differences are largely in the representation - the older system might have a field for a name, a field for an address, perhaps a field for a telephone number, and the newer system might break up the name field into a first name, middle name, and last name, the address into fields for street, city, state, and ZIP code, and a telephone number into fields for area code and exchange/line number.
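
A small Python sketch can show what bridging the two layouts might look like. Everything here is an assumption made for illustration: the field names, the function `split_legacy_record`, the sample record, and above all the premise that the legacy address is written as "street, city, ST ZIP". Real legacy data is rarely that regular.

```python
import re

# Assumed legacy layout: "street, city, ST 12345" or "street, city, ST 12345-6789".
ADDRESS_PATTERN = re.compile(
    r"^(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})\s+(?P<zip_code>\d{5}(?:-\d{4})?)$"
)


def split_legacy_record(record):
    """Map a wide {name, address, phone} record onto finer-grained fields."""
    name_tokens = record["name"].split()
    digits = re.sub(r"\D", "", record["phone"])  # keep digits only
    match = ADDRESS_PATTERN.match(record["address"].strip())
    # If the address does not fit the assumed layout, leave its fields empty.
    address_fields = match.groupdict() if match else {
        "street": None, "city": None, "state": None, "zip_code": None
    }
    return {
        "first_name": name_tokens[0] if name_tokens else "",
        "middle_name": " ".join(name_tokens[1:-1]),
        "last_name": name_tokens[-1] if len(name_tokens) > 1 else "",
        "area_code": digits[:3],
        "line_number": digits[3:],
        **address_fields,
    }


legacy = {
    "name": "Howard D Loshin",
    "address": "123 Main St, Springfield, IL 62701",  # fictitious address
    "phone": "(217) 555-0123",                        # fictitious number
}
print(split_legacy_record(legacy))
```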

These structural differences become a barrier when performing records searches and matching. The record structures are incompatible: different number of fields, different field names, and different precision in what is stored.

This is the first opportunity to consider standardization: if structural differences affect the ability to compare a record in one data set to records in another data set, then applying some standards to normalize the data across the data sets will remove that barrier. More on structural standardization in my next post.

In a Global Economy, a Global Solution is Vital

By Patrick Bayne
Data Quality Tools Software Engineer

Heidelberglaan 8
3584 CS Utrecht


If you were given this address, how would you know it was valid? Is it formatted correctly? How long will it take you to verify? For years businesses have understood a need for address validating solutions, because clean, accurate data is essential. Without accurate and consistent data, customers are lost due to missed mailings, synchronization across servers is difficult, and reports become inaccurate - all adding to costs and missed opportunities for your business.

In an ever-emerging, ever-increasing global economy, there is a strong need for a global address solution that is not only accurate but cost-effective. Melissa Data, a long time value leader in address cleansing solutions, is proud to announce the launch of its Global Address Verification Web Service. Now it's easier than ever to cleanse, validate, and standardize your international data - allowing you to make confident business decisions, execute accurate mailings and shipping, plus maintain contact with your customers.

The Global Address Verification Web Service easily integrates into your systems. Because of the nature of a Web service, any platform that can communicate with the Web is open to use the service. And the Web service supports a variety of popular protocols - SOAP, XML, JSON, REST and Forms. The multiplatform, multi-Web protocol openness of the service allows simple connection to any point of entry, or batch system.

Enhanced Global Address Verification

What if you only had the street address and not the actual country name? Global Address Verification Web Service's unrestrictive address cleansing capabilities will append the name of the country to ensure deliverability.

The solution automatically puts components into the correct address field, making applications that process location data more accurate - bringing more value to your data. The process is also resilient to erroneous, non-address data. The solution can translate addresses between many languages and can format addresses to a country's local postal standards.

Result Codes for the Highest Verification Level

While it's easy to set up the Web Service and make calls to it, you will also know exactly what changed in the submitted address through results codes. There are three types of codes to detail the data quality: AE (Error) codes signify missing or invalid information; AC (Change) codes indicate address pieces that have been changed; and new AV (Verification) codes inform you as to how strongly the address was verified according to the reference data we have for a particular country.

Quantifying the quality of a match is easy through the results codes. An AV followed by a 2 designates that the record was matched to the highest level possible according to the reference data available. An AV followed by a 1 denotes that the address is partially verified, but not to the highest level possible. The second digit, which ranges from 1 to 5, indicates the level to which the address was verified. Think of it as a sliding scale.

For countries like France, Finland and Germany, where there is data up to the delivery point, you will know that there was a full verification on an address when there is a result code of AV24 or AV25. A country such as American Samoa, where reference data is up to locality, would be fully verified with a results code of AV22. Any missing or inaccurate information would change the results to be partially verified. The user-friendly results codes will help you make more informed decisions about your data.
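
As a sketch of how a client might read these codes, the following Python snippet splits an AV code into its two digits. It relies only on the description given in this post (first digit 1 or 2, second digit 1 to 5); the function name `interpret_av_code` and the status labels are invented here, and the service's own documentation remains the authority on what each code means.

```python
import re

# Shape of an AV code as described in the post: "AV", then 1 or 2, then 1 to 5.
AV_PATTERN = re.compile(r"^AV([12])([1-5])$")

# Labels chosen for this example, not taken from the service documentation.
STATUS = {"1": "partially verified", "2": "fully verified"}


def interpret_av_code(code):
    """Return (status, level) for an AV code, or None for any other code."""
    match = AV_PATTERN.match(code.strip().upper())
    if match is None:
        return None
    status_digit, level_digit = match.groups()
    return STATUS[status_digit], int(level_digit)


print(interpret_av_code("AV24"))  # ('fully verified', 4)
print(interpret_av_code("AV12"))  # ('partially verified', 2)
```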

The Global Address Verification Web Service will open new doors to cleanse and validate your international data. The service is operating system and programming language neutral, allowing you to use it wherever you desire.

All data is maintained by Melissa Data, reassuring you that it is up-to-date and relieving you of the stress of any maintenance. You will have confidence that your data is accurate and be able to make informed, effective business decisions based on your data, increasing your productivity. So the next time you see, "Heidelberglaan 8, 3584 CS Utrecht," you will know how to quickly and assuredly verify that it is a valid address.

--- Patrick Bayne is a data quality tools software engineer at Melissa Data.



Address Quality - Take 2

By David Loshin

We have dealt with some of our core address quality concepts, but not this one:
The intended recipient must be associated with the deliverable address.

The problem here is no longer address quality but rather address correctness.
The address may be complete, all the elements may be valid, the ZIP+4 is the right one, and all values conform to standardized abbreviations ... and still be incorrect if the recipient is not associated with the address!

This is the bigger challenge with address data quality, since address correctness or accuracy is a factor of real-world events that are not necessarily synchronized with your databases. Some level of control is again served by the Postal Service through the NCOA (National Change of Address) data set that is licensed to tools providers.

Checking against the NCOA data set will notify you if an entity linked to a location has self-reported a change of address, and this accommodates a large portion of the address correction issues. However, the share of people who move each year is substantial: I recall reading a Census Bureau press release about their 2006-2007 statistics noting that 14% of the population moved over the year.

Not all changes propagate to the NCOA file at the right time, and it may take a while before all consumers of that data actually synch up with the NCOA data set. Even if you do a quarterly review, if we trust that 14% statistic, then there is a pretty good chance that by the end of the quarter you will still have a 3-4% inaccuracy rate for mapping entities to locations.
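
The 3-4% figure can be checked with a quick back-of-the-envelope calculation, sketched here in Python. It assumes that moves are spread evenly across the year and that each record is equally likely to be affected, which is a simplification; the function name `stale_fraction` is made up for this example.

```python
ANNUAL_MOVE_RATE = 0.14  # the Census Bureau figure recalled above


def stale_fraction(annual_rate, fraction_of_year):
    """Expected share of records that go stale over part of a year."""
    return 1 - (1 - annual_rate) ** fraction_of_year


# Simple proportional estimate: a quarter of the annual rate.
quarterly_linear = ANNUAL_MOVE_RATE / 4
# Compounded estimate under the same even-spread assumption.
quarterly_compounded = stale_fraction(ANNUAL_MOVE_RATE, 0.25)

print(f"linear:     {quarterly_linear:.1%}")
print(f"compounded: {quarterly_compounded:.1%}")
```

Both estimates fall between 3% and 4%, which is where the range quoted above comes from.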

And there are other considerations that are not incorporated into this calculation. For example:

• Individuals change jobs and therefore change business addresses
• Third-party data vendors incorrectly link individuals to locations
• Miskeyed data
• Purposely incorrect data
• Propagation of legacy addresses overwriting updated addresses

This is a small sample of challenges. But what it means is that there are many aspects of assessing and assuring the quality and correctness of addresses, and it may be worth reviewing the ways that your organization verifies, validates, standardizes, and corrects location data!

Postal Standards and Address Quality - Take 1

By David Loshin

The USPS Postal Standard (Publication 28) provides at least some of the specifications we need for address quality. For example,

 "The Postal Service defines a complete address as one that has all the address elements necessary to allow an exact match with the current Postal Service ZIP+4 and City State files to obtain the finest level of ZIP+4 and delivery point codes for the delivery address."

The next paragraph provides some additional details:

 "A standardized address is one that is fully spelled out, abbreviated by using the Postal Service standard abbreviations (shown in this publication) or as shown in the current Postal Service ZIP+4 file."

A large part of the remainder of the document guides what is valid and what is not valid, as well as the postal standard abbreviations (as mentioned in the definition of standardized). So an address must be complete, which by definition implies that it can be matched with current Postal Service ZIP+4 and City State files.

This match is to obtain the ZIP+4, so the implication is that verification means that a complete address matches the USPS files and has the correct ZIP+4. The address components must be consistent with the postal standard in terms of valid and invalid values. For example, a street address cannot have a number that is outside the range of recognized numbers (that is, if the USPS file says that Main Street goes from 1-104, an address with 109 Main St is invalid). So validation means that the street address is consistent with what is documented by the USPS files. Standardization is also defined by the above reference: it is spelled out, and uses the USPS standard abbreviations.

In turn, the process for address quality would be to:

1) Ensure the address is complete.
2) Ensure that the address values are valid by checking them against the USPS files.
3) Verify the address's ZIP+4 by matching against the USPS files.
4) Standardize the address according to the USPS standardized abbreviations.
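
The four steps above can be sketched as a small Python pipeline. This is only an illustration under loud assumptions: the `REFERENCE` table is a hypothetical stand-in for the USPS files (one street with a single number range and a single ZIP+4 add-on, reusing the Main Street 1-104 example), the `Address` fields and function names are invented, and the street name is standardized early so that the lookup key is consistent. Real reference data is far richer than this.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Address:
    number: Optional[int]
    street: str
    city: str
    state: str
    zip5: str
    plus4: str = ""


# Hypothetical stand-in for the USPS reference files.
REFERENCE = {("MAIN ST", "12345"): {"low": 1, "high": 104, "plus4": "6789"}}
# Two suffix abbreviations from Publication 28; the real list is much longer.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE"}
REQUIRED_FIELDS = ("number", "street", "city", "state", "zip5")


def standardize_street(street):
    """Upper-case, drop periods, and apply the standard suffix abbreviations."""
    words = street.upper().replace(".", "").split()
    return " ".join(SUFFIXES.get(word, word) for word in words)


def check_address(address):
    """Return (status, address) after running the four steps in order."""
    # 1) Completeness: every required element must be present.
    if any(not getattr(address, name) for name in REQUIRED_FIELDS):
        return "incomplete", address
    street = standardize_street(address.street)
    entry = REFERENCE.get((street, address.zip5))
    # 2) Validity: the street must exist and the number must fall in its range.
    if entry is None or not entry["low"] <= address.number <= entry["high"]:
        return "invalid", address
    # 3) Verify the ZIP+4 against the reference and 4) emit the standardized form.
    return "ok", replace(address, street=street, plus4=entry["plus4"])


print(check_address(Address(50, "Main Street", "Anytown", "NY", "12345")))
print(check_address(Address(109, "Main St.", "Anytown", "NY", "12345"))[0])  # invalid
```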

By David Loshin

In my last post, I shared a story about how rampant address validation actually can transform accurate (if not 100% standardized) addresses into inaccurate ones. My client actually noted that with some of the tools they have seen, addresses that have been submitted to the product and standardized are then re-submitted to the product, which then reports their own corrected address as being invalid! Huh?

Here are a few reflections:

1) Siloed application development has led to multiple iterations of address correction, often using different tools, or if using the same tool, using different sets of rules, which is probably not the best scenario since it will lead to inconsistencies.

2) Address standardization tools have to be maintained - the rules may change, locations may change, the standards may evolve.

3) The criticality of precision and standardization/validation has to be assessed in the context of how the address (or location) is being used. For example, for delivery purposes accuracy should trump "validity" if a validated address is no longer correct.

4) Find a single tool solution for address standardization and validation instead of using multiple products.

5) Have a single team responsible for managing the solicitation of requirements, definition of rules, and provision of address validation and correction services.

One last thought: push address correction and validation to the earliest point in the business process. If your application solicits an address from the customer, validate it right away and have the corrected address verified directly by the customer. And trust the customer to give you accurate address data - they probably have some experience in ensuring they get their mailings!
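
The re-submission problem described at the start of this post suggests a simple sanity check: standardizing an already standardized address should change nothing. Here is a minimal Python sketch of that check. The `standardize` function is a toy written for this example (three abbreviations and punctuation removal), not any vendor's algorithm; in practice you would wrap whatever tool or service you actually use.

```python
import re

# A few standard abbreviations; a toy rule set for illustration only.
ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "SUITE": "STE"}


def standardize(address):
    """Toy standardizer: upper-case, drop periods and commas, abbreviate."""
    tokens = re.sub(r"[.,]", " ", address.upper()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def is_stable(address):
    """True if standardizing an already standardized address changes nothing."""
    once = standardize(address)
    return standardize(once) == once


print(standardize("100 Main Street, Suite 2100"))  # 100 MAIN ST STE 2100
print(is_stable("100 Main Street, Suite 2100"))    # True
```

Running a check like this over a sample of addresses, across every tool in the processing stream, is one way to catch rule sets that disagree with each other before they start rewriting good data.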


By David Loshin

There are all sorts of tools associated with address standardization, cleansing, and validation. As an example, the USPS has a certification program for software vendors, referred to as CASS (Coding Accuracy Support System)™ certification. According to their website,

CASS enables the Postal Service™ to evaluate the accuracy of address matching software programs in the following areas:

(1) five-digit coding
(2) ZIP + 4/ delivery point (DP) coding
(3) carrier route coding
(4) DPV®
(5) DSF2®
(6) LACSLink®
(7) eLOT®
(8) RDI™ products

CASS allows vendors/mailers the opportunity to test their address-matching software packages and, after achieving a certain percentage of compliance, to be certified by the Postal Service. CASS does not measure the accuracy of ZIP + 4 delivery point, five-digit ZIP, or carrier route codes in a mailer's existing files. CASS enables mailers to measure and diagnose internally written, commercially-available, address-matching software packages. The effectiveness of service bureaus' matching software can also be measured.

There are many vendors selling CASS-certified tools and services. Organizations use CASS-certified tools for address standardization, correction, and validation. End of story, right?

Wrong. Some organizations use many CASS-certified tools for address standardization, correction, and validation, at different places along the processing stream. The addresses are standardized, cleansed, and validated (or not) multiple times. The addresses are changed from their original format, manipulated, and then shoved back into the databases, without considering the actual process dependencies or expectations.

And then you end up with a scenario like this: a process accepts customer applications, including their self-provided addresses, and sends hard copies of acknowledgements to those addresses, yet the process includes an elaborate mechanism for managing returned mail. That did not make sense to me: if the organization sends out acknowledgements to the address the customer provided, wouldn't they trust that the customer provided an accurate address?

In fact, the issue was self-created: the provided address passed through at least three different iterations (with different products!) of standardization, correction, and validation, and was transformed from a deliverable ("accurate") address to an invalid one.

So even though the intent was appropriate, the execution of the process got in the way of the results. So I'll throw out two questions: First, is address standardization and validation a tool or a process? And second, at what point and how frequently in the business process flow should address standardization and validation take place?


By David Loshin

One nice thing about addresses, especially in the United States, is that they have well-defined standards. In previous blog series, I have looked at the process of address standardization and correction, so I won't belabor that point. However, many people confuse the differences among a valid address, a precise representation of an address, and an accurate address.
 
First of all, recall that the main reason for standardizing addresses is based on "delivery" - supporting the general expectation that posted items reach their desired destination. Validation that addresses meet the standard certainly can improve the delivery processing, especially when it comes to sorting the items to be delivered. That is the reason that the Post Office offers discounts in the costs of mailings when the addresses are validated in standard form.

Luckily, there is a lot of leeway when it comes to meeting this expectation. If you provide a street number, street address, city, and state, there is a good chance your item will be delivered even if it is missing a ZIP code.

On the other hand, the address is not necessarily valid in terms of it meeting the USPS standard, since it is an incomplete address. Similarly, what was put into "address line 1" and "address line 2" might have been the reverse of what is specified by the standard (does "suite 2100" go before the street address or not?), but the carrier will still be able to put the letter into the right slot once she gets to the building. In other words, the address does not need to be valid in order to be delivered.

Nor does it really have to be that precise. We might define precision to mean that the address has enough information to direct the carrier to the specific box or slot associated with the entity referenced in the address. An address with a name, a street, and a suite number is more precise than one with just a name and a street. Yet if your local carrier has some knowledge about his route, then items that have less than precise addresses will still get to their destination.

The last term may be the most important one, though: accuracy. If the address associated with the named entity is not correct, the chances of the item being delivered are much lower. And one problem is that these characterizations are not mutually exclusive: an address can be valid (i.e. it is in standard form and there is an actual delivery point associated with it) but not accurate (if the named individual is not associated with the delivery point).
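
The distinction between valid and accurate can be expressed as two separate checks against two different kinds of reference data. The Python sketch below uses entirely fictitious data and invented names (`DELIVERY_POINTS`, `OCCUPANTS`, `is_valid`, `is_accurate`) purely to show that the two questions are independent.

```python
ADDRESS = "100 MAIN ST STE 2100, ANYTOWN NY 12345"  # fictitious

# Validity asks: is this a known delivery point in standard form?
DELIVERY_POINTS = {ADDRESS}
# Accuracy asks: is the named entity actually associated with that point?
OCCUPANTS = {ADDRESS: {"PAT EXAMPLE"}}


def is_valid(address):
    return address in DELIVERY_POINTS


def is_accurate(name, address):
    return name in OCCUPANTS.get(address, set())


# Valid but not accurate: the delivery point exists, the person is not there.
print(is_valid(ADDRESS), is_accurate("JORDAN SAMPLE", ADDRESS))  # True False
```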

And while the Post Office maintains some information and tools to help synchronize validity and accuracy, sometimes there are kinks in the process, as we will see next time.

Synergizing the Process for Locations and Addresses

By David Loshin

I am currently working with a number of clients who are dealing with particularly thorny issues relating to location. While the business drivers are relatively diverse, there are some similarities across all scenarios, especially in the ways that location is managed from an enterprise perspective. Therefore, in this set of blog entries, we'll look at different business value drivers, typical usage scenarios, and some ideas for melding process with application for a synergistic lift.

Location is a critical concept in many industries, yet the importance of a standardized approach to managing location is often unnoticed. For example, for some client scenarios, the business driver is reducing risk. Insurance companies like to see their customer base (and therefore, their exposure to certain types of hazards such as floods or earthquakes) spread across different geographic regions. Financial institutions may be subject to different laws (with different penalties for violations) at different geographic jurisdictional levels for the protection of private information.

Healthcare providers need to ensure that protected health information is not inadvertently exposed by being sent to the wrong address.

In other scenarios, the business drivers are financial, such as focusing on customer acquisition, retention or cost management. In some cases, marketing budgets are allocated to local print and media advertising to grow the customer base. In other cases, the goal is to reduce transport costs by optimizing the supply chain, which depends on the distances between delivery points.

Either way, the underlying desire is precision and correctness in geolocation.

But note that precision and correctness are very separate ideas, and that difference underlies some of the main problems related to address and location management. Next post: Accuracy vs. Precision.