Modeling Issues and Entity Inheritance

By David Loshin

In our last set of posts, we looked at matching and record linkage and how approximate matching could be used to improve the organization's view of "customer centricity." Data quality tools such as parsing, standardization, and business-rule based record linkage and similarity scoring can help in assessing the similarity between two records. The result of the similarity analysis is a score that can be used to advise about the likelihood of two records referring to the same real-life individual or organization.

One last thought: this approach is largely a "data-centric" activity. What I mean is that it looks at and compares two records regardless of where those records came from. They might have come from the same data set (as part of a duplicate analysis) or from different data sets (for consolidation or general linkage).

But it does not take into consideration whether one data set models "customer" data and another models "employee" data. While you may link a customer record with an employee record based on a similarity analysis of a set of corresponding data attributes, the contexts are slightly different.

A match across the two data sets is a bit of a hybrid: we have matched the same individual, but that individual is playing different roles. That introduces a different kind of question: are the identifying attributes associated with the "customer," or with the individual acting in the role of "customer"? The same question applies to individual vs. employee.

And finally, are there attributes of the roles that each individual plays that can be used for unique identification within the role context? The answers to these questions become important when matching and linkage are integrated as part and parcel of a business application (such as the consolidation of data being imported into a business intelligence framework).


It Takes a Team

By Elliot King

As the cliché has it, data is an organization's most valuable asset. But the question is--who guards those corporate jewels? Is it the IT staff that is charged with making sure the information infrastructure supports the business correctly? Is it the database developers and administrators who are the front-line data professionals? Is it the business users who need accurate data to make sure tasks are executed as anticipated? Or is it the executive staff, which is in the best position to have a bird's-eye view of the entire operation?

In practice, safeguarding data quality requires an interdisciplinary team approach, with different players coming from different parts of the organization. As with most teams, you need a team leader or program manager. This person is charged with supervising the entire data quality improvement program, recommending what resources are needed and where those resources should be invested.

In addition to the program manager, most data quality initiatives require a project leader, a person responsible for addressing specific data quality issues at hand. Each project team has at least three specific roles that need to be filled with representatives from the IT and business staffs.

The IT professionals must have the technical ability to fix what might be broken, and the business personnel must serve as the subject matter experts who understand the characteristics the data must have to get the job done. Finally, there should be a data steward to set the policies, procedures and standards that improve data quality.

One last critical role must be filled--executive sponsorship. Those of you who are sports fans may have noticed that some teams are good year after year while others aren't. The difference is in the ownership (think of the Los Angeles Dodgers as a case study in good and bad ownership). A data quality improvement team cannot succeed without a strong commitment from the top.

Approximate Matching

By David Loshin

Actually, my first name is not David - that is really my middle name, but it is the given name my parents used when talking to me. This has led to a lot of confusion over the years, especially when I am confronted with a form asking for my "first name" and my "last name." For official forms (like my driver's license) I use my real first name as my "first name," but for non-official forms I often just use David. The result is that there is inconsistency in my own representation in records across different data systems.

If we were to rely solely on an exact data element-to-data element match of values to determine record duplication, the variation in use of my first or middle name would prevent two records from linking. In turn, you can extrapolate and see that any variations across systems of what should be the same values will prevent an exact match, leading to inadvertent duplication.

Fortunately, we can again rely on data quality techniques. We have our stand-bys of parsing and standardization, which can be enhanced through the use of transformation rules to map abbreviations, acronyms, and common misspellings to their standard representations - an example might be mapping "INC" and "INC." and "Inc" and "inc" and "inc." and "incorp" and "incorp." and "incorporated" all to a standard form of "Inc."
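As a rough illustration of such a transformation rule, here is a minimal Python sketch. The lookup table and the function names (STANDARD_FORMS, standardize_token, standardize_name) are assumptions made up for this example, and a real standardization tool would carry a far larger rule set.

```python
# Hypothetical rule table: each known variant (lowercased, with any trailing
# period removed) maps to the standard representation.
STANDARD_FORMS = {
    "inc": "Inc.",
    "incorp": "Inc.",
    "incorporated": "Inc.",
}


def standardize_token(token: str) -> str:
    """Return the standard form of a token, or the token itself if no rule applies."""
    key = token.strip().rstrip(".").lower()
    return STANDARD_FORMS.get(key, token)


def standardize_name(name: str) -> str:
    """Apply the token rules to every whitespace-separated token in a name."""
    return " ".join(standardize_token(token) for token in name.split())


if __name__ == "__main__":
    for variant in ["INC", "INC.", "Inc", "inc", "inc.", "incorp", "incorp.", "incorporated"]:
        print(variant, "->", standardize_token(variant))
```

Stripping the trailing period and lowercasing before the lookup keeps the table small: three entries cover all eight variants listed above.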

We can add to this another tool: approximate matching. This matching technique allows for two values to be compared with a numeric score that indicates the degree to which the values are similar. An example might compare my last name "Loshin" with the word "lotion" and suggest that while the two values are not strict alphabetic matches, they do match phonetically.

There are a number of techniques used for approximate matching of values, such as comparing the set of characters, the number of transposed, inserted, or omitted letters, different kinds of forward and backward phonetic scoring, as well as other more complex algorithms.
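To make one of these techniques concrete, here is a small Python sketch of edit distance (counting inserted, omitted, and substituted letters) turned into a similarity score between 0 and 1. This is only one of many possible scoring methods, it is not a phonetic comparison, and the function names are my own for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-letter insertions,
    deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a letter from a
                current[j - 1] + 1,      # insert a letter into a
                previous[j - 1] + cost,  # substitute (or keep) a letter
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive score in [0, 1]; 1.0 means the values are identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(similarity("Loshin", "lotion"))
```

Plain Levenshtein distance counts a transposed pair of letters as two edits; variants that treat a transposition as a single edit exist but are not shown here.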

In turn, we can apply this approximate matching to the entire set of corresponding identifying attributes and weight each score based on the differentiation factor associated with each attribute. For example, a combination of first name and last name might provide greater differentiation than a birth date, since there is a relatively limited number of dates on which an individual can be born (maximum 366 per year).

By applying a weighted approximate match to pairs of records, we can finesse the occurrence of variations in the data element values that might prevent direct matching from working. More on this topic in future posts.
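Here is a minimal sketch of what a weighted approximate match over a pair of records could look like in Python. The attribute names and the weights are invented for illustration (they only reflect the idea that names differentiate more than a birth date), and difflib's SequenceMatcher stands in for whatever similarity function a real tool would use.

```python
from difflib import SequenceMatcher

# Hypothetical weights: attributes that differentiate better get more weight.
WEIGHTS = {"first_name": 0.3, "last_name": 0.4, "birth_date": 0.1, "phone": 0.2}


def value_similarity(a: str, b: str) -> float:
    """Approximate similarity of two values, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-attribute similarities.

    Attributes missing from either record are skipped, and the score is
    normalized by the weight actually used.
    """
    total = 0.0
    weight_used = 0.0
    for attribute, weight in weights.items():
        v1, v2 = r1.get(attribute), r2.get(attribute)
        if not v1 or not v2:
            continue
        total += weight * value_similarity(v1, v2)
        weight_used += weight
    return total / weight_used if weight_used else 0.0


if __name__ == "__main__":
    a = {"first_name": "David", "last_name": "Loshin", "phone": "301-754-6350"}
    b = {"first_name": "H David", "last_name": "Lotion"}
    print(round(record_similarity(a, b), 3))
```

The resulting score would then be compared against a threshold (also a tuning choice) to decide whether the pair is a likely match, a likely non-match, or a candidate for manual review.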


Assessment is the Critical First Step

By Elliot King

W. Edwards Deming taught us long ago about the virtuous cycle of continual quality improvement--plan for change; execute the change; study the results; and then take action to improve the process. But Deming's PDSA (plan, do, study, act) cycle is a generic approach. The cycle has to be modified and customized to address targeted areas for quality improvement.

The key steps in the virtuous cycle for data quality improvement are assessment, measurement, integration, improvement and management. Each process is important but assessment is the critical first step.

Data quality assessment is a multi-pronged exercise and the key is to start at the end. What business tasks and processes can be hurt by inaccurate, invalid and incomplete data? And in what ways will poor quality data increase costs, reduce revenues, hurt efficiencies or otherwise inflict pain on the organization? This exercise helps to identify the data sources that should be examined.

After you have determined where to look, you can profile your data to uncover anomalies and data flaws and then bring those flaws to the attention of the data users. In some cases, data anomalies may be harmless and have little impact on actual business activities. In that case, no remedial action is warranted. But when poor data quality does interfere with business operations then further action is needed.

The last piece of the assessment puzzle is to correlate the identified data issues with performance through a defined set of data quality business rules such as completeness, accuracy, and consistency. The rules provide a framework within which data quality can be measured.
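As a simple illustration of turning such rules into measurements, the Python sketch below scores a data set against one completeness rule and one validity rule. The field names, the sample rows, and the five-digit ZIP code pattern are assumptions chosen for the example, not a prescribed rule set.

```python
import re

# Assumed validity rule for this example: a ZIP code is exactly five digits.
ZIP_PATTERN = re.compile(r"\d{5}")


def completeness(records: list, field: str) -> float:
    """Fraction of records in which the field is present and non-blank."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if str(r.get(field) or "").strip())
    return filled / len(records)


def zip_validity(records: list, field: str = "zip") -> float:
    """Fraction of records whose ZIP code value matches the assumed pattern."""
    if not records:
        return 0.0
    valid = sum(1 for r in records if ZIP_PATTERN.fullmatch(str(r.get(field) or "")))
    return valid / len(records)


if __name__ == "__main__":
    # Made-up sample rows for illustration only.
    sample = [
        {"name": "A. Customer", "zip": "20902"},
        {"name": "", "zip": "2090"},
        {"name": "B. Customer", "zip": None},
        {"name": "C. Customer", "zip": "20902"},
    ]
    print(completeness(sample, "name"), zip_validity(sample))
```

Scores like these can be tracked over time and compared against the level of quality that the affected business process actually needs.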

The rule of thumb with assessment is relatively easy. First determine where poor quality data will have the most impact within your organization. Then figure out if it has to be fixed.

The Challenge of Identifying Information

By David Loshin

In my last post, I introduced the question of determining which characteristics are used to uniquely differentiate between any pair of records within a data set. The same question is relevant when attempting to match a pair of records as well, once they are determined to represent the same entity. I like to call these "identifying attributes," and the values contained therein I call "identifying information."

Let's look at an example for customer data integration: what data element values do I compare when trying to link two records together? Let's start with the obvious ones, namely (ha ha) first and last names. Of course, we all know that there are certain names that are relatively common - just ask my friend John Smith, with whom I worked at one of my earlier jobs.

But even if you have an uncommon name, you might be surprised. For example, if you type in my name ("David Loshin") at Google, you will find entries for me, but you will also find entries for a dentist in Seattle and a professor.

Apparently, first and last names are not enough identifying information for distinction. Perhaps there is another attribute we can use? You probably know that I have written some books (see http://dataqualitybook.com), so maybe that is an additional attribute to be used. But if you go to Amazon and do a search for "David Loshin," you will find me, but it turns out the professor has also written a book.

Even an uncommon name such as mine still finds multiple hits, and while attempting to add more identifying information can reduce that number of hits, a poorly selected set of attributes may still not provide the right amount of distinction. It may take a number of iterations to review a proposed set of identifying attributes, determine their completeness, density, and accuracy before settling on a core set of identifying characteristics to be used for comparison.
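One way to review a proposed set of identifying attributes is to measure how many records it actually tells apart. The Python sketch below does that for a tiny, made-up sample loosely modeled on the example above; the attribute names and values are illustrative assumptions only.

```python
from collections import Counter


def uniqueness(records: list, attributes: list) -> float:
    """Fraction of records whose combination of attribute values is shared
    with no other record in the data set."""
    if not records:
        return 0.0
    keys = [tuple(r.get(a) for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    people = [
        {"first": "David", "last": "Loshin", "occupation": "consultant", "author": True},
        {"first": "David", "last": "Loshin", "occupation": "dentist", "author": False},
        {"first": "David", "last": "Loshin", "occupation": "professor", "author": True},
        {"first": "John", "last": "Smith", "occupation": "consultant", "author": False},
    ]
    for candidate in (["first", "last"],
                      ["first", "last", "author"],
                      ["first", "last", "occupation"]):
        print(candidate, uniqueness(people, candidate))
```

On this sample, adding the "author" attribute helps only partly, while adding "occupation" separates every record, which mirrors the iterative review described above.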

One more thing to think about, though. Once you get to the point where you are pretty confident that those attributes are enough for differentiation, there is one last monkey wrench in the works: even if you had the absolute set of identifying attributes, there is no guarantee that the values themselves are exact matches!

Entities and their Characteristics

By David Loshin

How can you tell if two records refer to the same person (or company, or other type of organization)? In our recent posts, we have looked at how data quality techniques such as parsing and standardization help in normalizing the data values within different records so that the records can be compared. But what is being compared? That is the topic of this next set of entries.

A simplistic view might suggest that when looking at two records, comparing the corresponding values is the best way to start. For example, we might compare the corresponding names, telephone numbers, street addresses - stuff that usually appears in records representing customers, residences, patients, etc.

But the simple concept belies a much more complex question about the attributes used to describe the individual as well as differentiate pairs of individuals. Much of this issue revolves around the approaches taken for determining what characteristics are being managed within a representative record, the motivation for including those characteristics, and, importantly, whether those data elements are used solely for "attribution" (additional description of the entity involved) or for "distinction" (to help in unique identification).

More to the point: what are the core data elements necessary for determining the uniqueness of a record? We often take for granted the fact that our relational models presume one and only one record per entity, and that there might be business impacts should more than one entry exist for each individual.

Yet individual "entities" may exist in multiple data sets, even in different contexts. Some characteristics are part and parcel of each entity, while others describe the entity playing a particular role. Our upcoming posts are intended to consider some of these issues when assessing similarity for record linkage and matching.

What is a Data Steward and Do You Need One?

By Elliot King

The metaphor of "ownership" has become popular in organizations and their IT shops. Companies have "application owners" and projects that are "owned" by this or that group. So that raises the question, who "owns" your data?

The right answer is that nobody "owns" the data. Data is a resource that must be shared across an organization. Data flows from the point of creation--perhaps capturing contact information on a website or importing a third-party mailing list--through staging, consumption, storage and archiving. At each step of the way, a different functional group within an organization has to be able to use the data in different ways.

To ensure that data meets the standards needed by each stakeholder in the data lifecycle, companies have to implement enterprise-wide data management policies and procedures. A typical policy might say that all contact information must conform to a specific format. Don't assume that to be the case in your organization. Unmonitored, your sales department, service organization and billing department could easily capture names differently. Indeed, in larger corporations, different sales organizations might have different formats for names and addresses.

Data stewards both develop those policies and create mechanisms to ensure that the policies are enforced. On the flip side, the data steward should be accountable for enterprise data quality and the advocate for data quality initiatives.

Data stewardship is neither an easy job nor an easy job to fill. The foundational technical skill is a deep understanding of specific business functions, the data associated with those functions and the processes that rely on the data.

Those technical skills have to be coupled with a strong set of interpersonal skills as, by definition, data stewardship requires interacting with a wide range of stakeholders (often including other data stewards). Finally, regardless of the formal position they hold, data stewards need to be able to establish their authority as the role sometimes calls for stepping on other people's toes.

Stewardship is quite different from ownership. But if your organization has data, it probably needs a data steward.

By David Loshin

What I have found to be the most interesting byproduct of record linkage is the ability to infer explicit facts about individuals that are obfuscated as a result of distribution of data. As an example, consider these records, taken from different data sets:

A:
David
Loshin
301-754-6350
1163 Kersey Rd
Silver Spring
MD
20902

B:
Knowledge Integrity, Inc
1163 Kersey Rd
Silver Spring
MD
20902

C:
H David
Lotion
1163 Kersey Rd
Silver Spring
MD
20902

D:
Knowledge Integrity, Inc.
301
7546350
7546351
MD
20902

We could establish a relationship between record A and records B and C because they share the same street address. We could establish a relationship between record B and record D because the company names are the same.

Therefore, by transitivity, we can infer a relationship between "David Loshin" and the company "Knowledge Integrity, Inc" (A links to B, B links to D, therefore A links to D). However, none of these records alone explicitly shows the relationship between "David Loshin" and "Knowledge Integrity, Inc" - that is inferred knowledge.

You can probably see the opportunity here - basically, by merging a number of data sets together, you can enrich all the records as a byproduct of exposed transitive relationships.
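The transitive inference can be sketched in a few lines of Python. The example below uses simplified versions of records A through D (first and last names combined, only the linking fields kept) and a small union-find structure to group records that are linked directly or through intermediate records. The matching rule (exact match on street address or on a normalized company name) is an assumption for illustration; a real linkage process would use approximate matching.

```python
from itertools import combinations

# Simplified versions of records A-D, keeping only the fields used for linking.
RECORDS = {
    "A": {"name": "David Loshin", "street": "1163 Kersey Rd"},
    "B": {"company": "Knowledge Integrity, Inc", "street": "1163 Kersey Rd"},
    "C": {"name": "H David Lotion", "street": "1163 Kersey Rd"},
    "D": {"company": "Knowledge Integrity, Inc."},
}


def normalize(value: str) -> str:
    """Lowercase and drop a trailing period so "Inc" and "Inc." compare equal."""
    return value.lower().strip().rstrip(".")


def directly_linked(r1: dict, r2: dict) -> bool:
    """Assumed rule: two records link if they share a street address or a company name."""
    return any(
        attr in r1 and attr in r2 and normalize(r1[attr]) == normalize(r2[attr])
        for attr in ("street", "company")
    )


def link_clusters(records: dict) -> list:
    """Group record keys into clusters connected by direct or transitive links."""
    parent = {key: key for key in records}

    def find(key):
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    for k1, k2 in combinations(records, 2):
        if directly_linked(records[k1], records[k2]):
            parent[find(k1)] = find(k2)

    clusters = {}
    for key in records:
        clusters.setdefault(find(key), set()).add(key)
    return list(clusters.values())


if __name__ == "__main__":
    print(link_clusters(RECORDS))
```

Records A and D share no field value at all, yet they end up in the same cluster because B connects them; that shared cluster is the inferred knowledge.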

This gives us one more valuable type of enhancement that record linkage provides. And this is particularly valuable, since the exposure of embedded knowledge can in turn contribute to our other enhancement techniques for cleansing, enrichment, and merge/purge.

The Ethics of Data Quality

By Elliot King

Technical people often don't seem too interested in ethical issues related to their work. Discussions of right and wrong are often "squishy." Too frequently, they have no clear answers and the answer can change from one context to another. In contrast, technical people like to deal with facts. They like clear outcomes--it worked or it didn't work--without any value judgments attached.

Perhaps the most profound ethical discussion associated with a significant technical advancement was the one scientists engaged in when they developed the atomic bomb. Was it right to contribute to the building of the most destructive weapon in human history--a weapon that could destroy the earth? The argument that the atomic bomb was the inevitable result of technical advances is just not compelling or satisfactory.

While certainly not as momentous as the debates about the atomic bomb or those in bioethics, like it or not, data quality professionals face ethical questions every day. These questions revolve around privacy, data integrity, security, retention, access and so on.

Take the issue of privacy, for example. As we know, in industries ranging from health care to finance, there is a slew of legal standards that companies have to meet. But beyond that, companies must decide exactly what data they collect about their customers or clients, and why. Should your company routinely collect and store social security numbers, for example? If so, why? The question is not just one of legal liability but of the ethics of putting your customers at unnecessary risk.

Similar sorts of ethical questions can be raised around data retention policies. Once again there are legal restrictions--in certain fields, records must be retained for legally determined periods of time--but there are also questions of right and wrong. Beyond your legal obligations, how much of your data should be retained for what period of time and why?

Incorporating the idea of ethics into your data management decision-making processes will help make your decisions more deliberative. Facing ethical concerns forces people to confront not only what they are required to do but what they should do as well.

Record Linkage and Data Enhancement

By David Loshin

In my last two posts we looked at the distribution of information about entities and the use of record linkage to find corresponding data records in different data sets that can be linked together. Record linkage can be used for a number of processes that we bundle under the concept of "data enhancement," which we'll use to describe any methods for improving the value and usefulness of information. In this post, we'll look at three different types of enhancement:

· Data cleansing - The first type of enhancement is relatively straightforward: our idea is to link records together for the purposes of cleansing the data, or making it more suitable for use. Often, one data set may have a more trustworthy representation of an entity, or we may have more than one data set, each potentially containing overlapping data elements such as birth date, address, telephone number. By linking two different records, you can compare the corresponding values, find those that are of better quality (e.g. more complete or more current values) and update the "delinquent" record with the higher quality values.

· Enrichment - Existing records for entities (such as people or products) can be matched against other data sets with additional reference information. For example, you might want to match your customer data with a credit bureau's data and enrich your own data set with each individual's credit ratings.

· Merge/Purge - Duplicate records entered into one data set often plague the business in attempting to actively manage customer accounts. Applying the record linkage methodology to the records in a single data set helps find multiple records that refer to the same individual. These records can be presented to a data analyst, who reviews them, determines the surviving record, and updates that record with the highest quality values.

There are many variations on these themes. For example, merge/purge can be used for combining customer data sets after a corporate acquisition; enrichment can be used to institute a taxonomic hierarchy for customer classification and segmentation. Loosening the matching rules for merge/purge can help with a process called "householding," which attempts to identify individuals with some shared characteristics (such as "living in the same house").
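To illustrate the "surviving record" idea behind cleansing and merge/purge, here is a minimal Python sketch. It assumes each linked record carries an "updated" date in ISO format and treats "highest quality" as "the most recent non-empty value"; both the field names and that survivorship rule are assumptions for this example, and a data analyst might well apply different rules.

```python
def build_survivor(duplicates: list, attributes: list) -> dict:
    """Build one surviving record from a group of linked duplicates.

    For each attribute, take the first non-empty value found when scanning
    the records from most recently updated to least recently updated.
    """
    newest_first = sorted(duplicates, key=lambda r: r["updated"], reverse=True)
    survivor = {}
    for attribute in attributes:
        for record in newest_first:
            value = record.get(attribute)
            if value:
                survivor[attribute] = value
                break
    return survivor


if __name__ == "__main__":
    # Made-up duplicate group for illustration only.
    group = [
        {"name": "David Loshin", "phone": "", "zip": "20902", "updated": "2010-03-01"},
        {"name": "D. Loshin", "phone": "301-754-6350", "zip": "", "updated": "2009-06-15"},
    ]
    print(build_survivor(group, ["name", "phone", "zip"]))
```

The newer record wins for the name and ZIP code, while the phone number is filled in from the older record because the newer one has none, so the survivor is more complete than either source record.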
