November 2011 Archives

Distributed Data and Distributed Information

By David Loshin

You might not realize how broad your electronic footprint really is. Do you have any idea how many data sets contain information about any specific individual? These days, any interaction you have with any organization is likely to be documented electronically. And if you are curious enough to read the fine print of the "privacy" policies, you might not be surprised to find that many of the organizations managing information about you are sharing that information with others.

Actually, this is not a new phenomenon; data aggregator companies, which love to collect data and turn it into salable products, have been doing this for many years. The easiest example to share is that of the mailing list company with a reference database that can be segmented across numerous geographic dimensions (in increasing precision, such as state, county, town, ZIP code, ZIP+4, street name, etc.) as well as demographic dimensions such as number of cars owned, favorite leisure activities, or household income.

And any time you fill out some form or respond to some survey or another, more information about you is captured. Remember that registration card you filled out for the toaster you bought? The survey you filled out to get that free subscription? Didn't you subscribe to some magazine about fishing and other outdoors activities? How about that contest you entered at the county fair?

Actually, you are not the only source of information about you. Did you buy a house? Home sales are reported to the state and the data is made available, including address, sales price, and often the amount of your mortgage. Wedding announcements, birth announcements, and obituaries log life cycle events.

Every single one of these artifacts captures more than just some information about an individual - it also captures the time and place where that information was captured, sometimes with high precision (such as the time of an online order) and sometimes with less (such as the day the contest entry was collected from the box).

There are many distributed sources of information about customers, and each individual piece of collected data holds a little bit of value. But when these distributed pieces of data are merged together, they can be used to reconstruct an incredibly insightful profile of the customer. How does this work? More in the next set of posts.


You Can't Identify the Problem Without a Scorecard

By Elliot King

If you can't measure a process, you can't improve upon it. That's one of the ironclad rules of quality initiatives. W. Edwards Deming's revolutionary insight was that if you can measure a process or outcome continually, you have created the opportunity to improve it continually.

But measurement is only a starting point for quality improvement. Those measurements have to be compared against specific standards, and then their impact on business operations has to be assessed. Only at that point can managers determine whether investing in remediation is appropriate or cost-effective.

In many data quality improvement programs, data scorecards are the tools of choice for data managers to place raw statistics into meaningful contexts. From that point, different constituencies can identify problem areas and determine what actions should be taken.

A baseball analogy helps explain the idea of a data scorecard. Broken down into its granular bits, a baseball game consists of strikes and balls, hits and outs.

But in most contexts (although not all), knowing that there were 210 pitches, 13 hits and 8 runs scored has little value. That data has to be combined and grouped correctly to have meaning. And different stakeholders--the players, the manager, the front office, the opposing teams--may want that data in different formats.

For data scorecards, the process of measuring and aggregating raw statistics into useful combinations is called a hierarchical rollup. It could look like this. At the most granular level, data is measured against the metadata associated with the database. Are fields the right length? Is there missing data? These statistics are of interest to database managers.

The next step is to assess whether the data meets data quality standards such as accuracy and completeness. This is of concern to data quality managers. Does the data conform to business rules? Business analysts need to know that. And what is the impact of data quality on process outcomes? At that point, management can determine what actions are prudent.
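The rollup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the three field checks, and the way checks are grouped into "completeness" and "conformance" scores are all hypothetical, not part of any particular scorecard product.

```python
# Hypothetical sample records; field names are assumptions for illustration.
records = [
    {"name": "Ann Lee", "zip": "21201", "email": "ann@example.com"},
    {"name": "", "zip": "2120", "email": "bob@example.com"},
    {"name": "Cy Ng", "zip": "10001", "email": ""},
]


def field_checks(rec):
    """Level 1: compare each field against simple metadata expectations."""
    return {
        "name_present": bool(rec["name"]),
        "zip_length_ok": len(rec["zip"]) == 5,
        "email_present": bool(rec["email"]),
    }


def pass_rate(flags):
    """Fraction of True values in a sequence of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def rollup(recs):
    """Roll raw check results up into dimension scores and an overall score."""
    checks = [field_checks(r) for r in recs]
    # Level 1: pass rate per check (of interest to database managers).
    level1 = {name: pass_rate(c[name] for c in checks) for name in checks[0]}
    # Level 2: group checks into quality dimensions (hypothetical grouping).
    level2 = {
        "completeness": (level1["name_present"] + level1["email_present"]) / 2,
        "conformance": level1["zip_length_ok"],
    }
    # Level 3: a single headline number for management.
    level3 = {"overall": sum(level2.values()) / len(level2)}
    return level1, level2, level3


level1, level2, level3 = rollup(records)
print(level1, level2, level3)
```

Each level keeps the numbers it was built from, so a reader who starts at the headline score can still drill down to the individual checks.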

In baseball, a glance at a scorecard reveals the outcome of the game and who played poorly as well as offering the chance to drill down deeper to the underlying raw data if need be. A data scorecard plays the same role in data quality improvement.


Business Rules for Standardization: Bringing it all Together

By David Loshin

While we have been talking in the last few posts about checking whether a data value observes the standard (and is therefore a valid value), the real challenge in standardization lies in (1) determining that a value does not meet the standard and then (2) taking the right actions to modify it so that it does meet the standard.

That process, strangely enough, is called "standardization," and it extends the tokenization and parsing to recognize both valid tokens and common patterns for invalid ones, and that is where the power of standardization lies. Here is the basic idea: when you recognize a token value to be a known error, you can define a business rule to map it to a corrected version.

The example I have used over the recent blog posts is a simple address standard:

· The number must be a positive integer number

· The name must have one and only one word

· The street type must be one of the following: RD, ST, AV, PL, or CT

And deriving these additional expectations:

· The address string must have three components to it (format)

· The first component has to only have characters that are digits 0-9 (syntax)

· The first character of the first component cannot be a '0' (syntax)

· The third component must be of length 2 (format)

· The third component has to have one of the valid street types (content)
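As a rough sketch, the format, syntax, and content expectations listed above could be checked like this in Python. The function name `check_address` and the wording of the messages are assumptions made for illustration; splitting on whitespace is also an assumed tokenization rule.

```python
import re

# Valid street types, taken from the standard above.
VALID_STREET_TYPES = {"RD", "ST", "AV", "PL", "CT"}


def check_address(address):
    """Return a list of rule violations; an empty list means the string is valid."""
    parts = address.split()
    if len(parts) != 3:
        # Without three components the remaining checks cannot be applied.
        return ["format: expected three components"]

    problems = []
    number, _name, street_type = parts
    if not re.fullmatch(r"[0-9]+", number):
        problems.append("syntax: number must contain only digits 0-9")
    elif number[0] == "0":
        problems.append("syntax: number must not start with 0")
    if len(street_type) != 2:
        problems.append("format: street type must have length 2")
    if street_type not in VALID_STREET_TYPES:
        problems.append("content: street type is not a valid value")
    return problems


print(check_address("123 Main ST"))
print(check_address("0123 Main Street"))
```

Returning every violation, rather than stopping at the first, makes it easier to decide later which correction rules should be applied to a given string.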

The next step would be to consider the variations from the expected values. A good example might look at the third token, namely the street type, and presume the types of errors that could happen and how they'd be corrected, such as:

Possible errors                          Standard
Rd, Road, Raod, rd                       RD
Street, STR                              ST
Avenue, AVE, avenue, abenue, avenoo      AV
Place, PLC                               PL
CRT, Court, court                        CT

In this example, we see some variant abbreviations, fully spelled-out words, a finger flub (the typist hit the b key instead of the v in "abenue" - I do this all the time), and a transposition ("Raod" instead of "Road" - I also do this all the time).
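A mapping like this translates naturally into a lookup table. The sketch below uses the variants from the example; ignoring upper and lower case when matching is an assumption, and the names `STREET_TYPE_VARIANTS` and `standardize_street_type` are made up for illustration.

```python
# Known error values for each standard street type, from the example above.
STREET_TYPE_VARIANTS = {
    "RD": ["Rd", "Road", "Raod", "rd"],
    "ST": ["Street", "STR"],
    "AV": ["Avenue", "AVE", "avenue", "abenue", "avenoo"],
    "PL": ["Place", "PLC"],
    "CT": ["CRT", "Court", "court"],
}

# Invert the table so each known variant points at its standard form.
# Assumption: matching ignores case, so keys are stored in upper case.
VARIANT_TO_STANDARD = {
    variant.upper(): standard
    for standard, variants in STREET_TYPE_VARIANTS.items()
    for variant in variants
}


def standardize_street_type(token):
    """Map a street type token to its standard form, or None if no rule applies."""
    key = token.strip().upper()
    if key in STREET_TYPE_VARIANTS:
        return key  # already a standard value
    return VARIANT_TO_STANDARD.get(key)


for raw in ["abenue", "Raod", "ST", "Blvd"]:
    print(raw, "->", standardize_street_type(raw))
```

Returning `None` for an unrecognized token, instead of guessing, leaves room for a reviewer to add a new rule when a new kind of error shows up.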

Different types of formats and patterns can be subjected to different kinds of rules. The first token has to be an integer, but perhaps some OCR reader mis-translated what it scanned into a letter or symbol instead of a digit, so we might see O instead of 0, A instead of 4, S instead of 8, ) instead of 9, etc. That means that part of the standardization process looks for non-digits and then applies rules that traverse the string and convert characters according to the defined mappings (A becomes 4, for example).
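A minimal sketch of such a character-mapping rule follows. The four mappings are the ones mentioned in the text; the function name `fix_number_token` and the choice to give up (return `None`) when unmapped characters remain are assumptions.

```python
import re

# Character substitutions taken from the examples in the text.
OCR_FIXES = str.maketrans({"O": "0", "A": "4", "S": "8", ")": "9"})

DIGITS_ONLY = re.compile(r"[0-9]+")


def fix_number_token(token):
    """Return a corrected all-digit token, or None if it cannot be repaired."""
    if DIGITS_ONLY.fullmatch(token):
        return token  # nothing to fix
    # Traverse the string and convert each character with a known mapping.
    fixed = token.translate(OCR_FIXES)
    return fixed if DIGITS_ONLY.fullmatch(fixed) else None


for raw in ["123", "1O4", "A2)", "12X"]:
    print(raw, "->", fix_number_token(raw))
```

A real rule set would probably also recheck the corrected value against the rest of the standard, for example the rule that the number cannot start with 0.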

For the second token, the challenge arises when the address string contains more than three words. One set of rules might take all tokens between the first and the last and concatenate them into a single word.

Another approach is to scan the tokens and pluck out the one that most closely matches one of the street types and move that to the end.
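Both approaches can be sketched briefly. Note the simplification in the second function: it looks for an exact (case-insensitive) street type rather than the "closest" match, and it skips the first token on the assumption that it is the number. The function names are hypothetical.

```python
VALID_STREET_TYPES = {"RD", "ST", "AV", "PL", "CT"}


def merge_middle_tokens(tokens):
    """Approach 1: join everything between the first and last token into one name."""
    if len(tokens) <= 3:
        return list(tokens)
    return [tokens[0], "".join(tokens[1:-1]), tokens[-1]]


def move_street_type_last(tokens):
    """Approach 2: find a street type token and move it to the end.

    Simplification: exact match only, starting after the first token.
    """
    tokens = list(tokens)
    for i in range(1, len(tokens)):
        if tokens[i].upper() in VALID_STREET_TYPES:
            return tokens[:i] + tokens[i + 1:] + [tokens[i]]
    return tokens  # no street type found; leave the order alone


print(merge_middle_tokens(["123", "Old", "Mill", "RD"]))
print(move_street_type_last(["123", "RD", "Mill"]))
```

The two rules can be combined: move the street type to the end first, then merge whatever remains in the middle into a single name.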

So these are the basic ideas for standardization: defining the formats and patterns, determining the tokenization rules, parsing the data to recognize valid and invalid tokens, defining rules for mapping invalid tokens to valid ones, and potentially rearranging tokens into the corrected version. In reality, there are many more challenges, opportunities, and subtleties, but at least this series of notes gives a high-level view of the general process.


Context is Key to Measuring Data Quality

By Elliot King

Beauty is in the eye of the beholder, but that is not the case when it comes to data quality or, at least, it is not the whole story. Data quality can be measured along several different dimensions. But in the final analysis, data quality depends on the context within which the data is used.

Perhaps the most obvious criterion by which to measure data quality is accuracy. Does the customer in your customer database actually reside at the associated address? Is the email address actually correct? It's not hard to imagine a customer record filled with inaccurate data.

The issue of completeness is related to the issue of accuracy. All the information you have may be accurate but you may not have all the information you need, particularly the information you need to be able to link records efficiently. If all you have is a customer's name, obviously that will not be good enough to serve as the foundation for a direct marketing campaign. (The flip side of the completeness equation is important as well. Capturing a lot of superfluous information can be just as problematic as missing information.)

Data can also be measured according to its consistency. For example, are customer accounts activated and deactivated appropriately? It doesn't make much business sense to send a subscription solicitation to somebody who already subscribes to a magazine. But it happens.

Other significant criteria by which the quality of data can be assessed are timeliness and auditability. Does the data enable people to generate reports according to their deadlines? Do your customer service representatives have the most up-to-date pictures of your customers' latest interactions with your organization? Finally, can the data be traced back to the transactions that generated it?

There are other dimensions along which data quality can be assessed. Are records duplicated? Are records captured according to the specified rules?

But the components of data quality are just that--components. Data quality itself is holistic. It allows the processes in which it is used to function efficiently and cost effectively or it doesn't.


Tokenization and Parsing

By David Loshin

As we have discussed in the previous posts, the data values stored within data elements carry specific meaning within the context of the business uses of the modeled concepts, so to be able to standardize an address, the first step is identifying those chunks of information that are embedded in the values.

This means breaking out each of the chunks of a data value that carry the meaning, and in the standardization biz, each of those chunks is called a token. A token is representative of all of the character strings used for a particular purpose. In our example, we have three tokens - the number, name, and type.

Token categories can be further refined based on the value domain, such as our street type, with its listed valid values. This distinction and recognition process starts by parsing the tokens and then rearranging the strings that mapped to those tokens through a process called standardization. Parsing is intended to achieve one of two goals - to validate the correctness of the string, or to identify which parts of the string need to be corrected and standardized.

We rely on metadata to guide parsing, and parsing tools use format and syntax patterns as part of the analysis.

We would define a set of data element types and patterns that correspond to each token type, and the parsing algorithm matches data against the patterns and maps them to the expected tokens in the string. These tokens are then analyzed against the patterns to determine their element types.
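Here is one way this pattern matching might look for the three-token address example, as a sketch only. The regular expressions, the whitespace tokenization, and the names `TOKEN_PATTERNS`, `classify`, and `parse` are assumptions chosen for illustration rather than the behavior of any specific parsing tool.

```python
import re

# One pattern per element type. Order matters: "type" is tested before
# "name" so that a valid street type is not classified as a name.
TOKEN_PATTERNS = {
    "number": re.compile(r"[1-9][0-9]*"),
    "type": re.compile(r"RD|ST|AV|PL|CT"),
    "name": re.compile(r"[A-Za-z]+"),
}

EXPECTED_ORDER = ["number", "name", "type"]


def classify(token):
    """Return the first element type whose pattern matches the whole token."""
    for element_type, pattern in TOKEN_PATTERNS.items():
        if pattern.fullmatch(token):
            return element_type
    return "unknown"


def parse(address):
    """Split on whitespace and tag each token with its element type."""
    return [(token, classify(token)) for token in address.split()]


def is_valid(address):
    """Valid when the tokens are number, name, type, in that order."""
    return [kind for _, kind in parse(address)] == EXPECTED_ORDER


print(parse("123 Main ST"))
print(parse("0123 Main Street"))
print(is_valid("123 Main ST"), is_valid("0123 Main Street"))
```

In the second example, "Street" is tagged as a name because it matches only the name pattern; it is the order check that reveals the string does not end with a valid street type.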

Comparing data fields that are expected to have a pattern, such as our initial numeric token or the third street type token, enables a measurement of conformance to defined structure patterns. This can be applied in many other scenarios as well, such as telephone numbers, person names, product code numbers, etc.
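Measuring conformance can be as simple as counting how many values match the pattern. The sketch below does this for the initial numeric token; the sample values and the function name `conformance` are hypothetical.

```python
import re

# Pattern for the initial numeric token: digits only, no leading zero.
NUMBER_PATTERN = re.compile(r"[1-9][0-9]*")


def conformance(values, pattern):
    """Fraction of values that match the pattern in full (0.0 for no values)."""
    values = list(values)
    if not values:
        return 0.0
    matches = sum(1 for value in values if pattern.fullmatch(value))
    return matches / len(values)


# Hypothetical column of house numbers pulled from address records.
house_numbers = ["12", "007", "4B", "350"]
print(conformance(house_numbers, NUMBER_PATTERN))  # 0.5
```

The same function works for any field with a defined structure, such as telephone numbers or product codes, by passing a different pattern.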

Once the tokens are segregated and reviewed, as long as all tokens are valid and are in the right place, the string is valid. In the next post, we will consider what to do if the string is not valid.

