Recently in Data Management Category

Where Do You Fit In?

By Elliot King

Too often, those of us with our noses to the grindstone have no time to look up. We are so busy putting out fires, monitoring and maintaining what we have, or trying to launch new initiatives that we never look around to see how other organizations are dealing with similar issues.

This may be particularly true in the data quality world. Data quality is often seen as an internal problem and it is often addressed differently in different settings, both organizationally and technically. Indeed, even the terminology is not consistent across industries.

So a recent study conducted by the International Association for Information and Data Quality (IAIDQ) working in conjunction with the Information Quality Program at the University of Arkansas, Little Rock (UALR-IQ) reveals some very interesting trends. The survey of 270 data quality professionals identified the top challenges faced by data quality professionals.

Heading the list is a lack of accountability and responsibility for data quality, followed by too many data and information silos to manage, a lack of awareness and discussion of the size and impact of data quality problems, and a lack of understanding of what data quality means. These challenges are fundamental, and each was cited by more than 50 percent of the respondents.

Considering the basic nature of the challenges, perhaps it should be no surprise that 66 percent of the respondents believed that the effectiveness of the data quality efforts in their organization was only OK (some goals were met) or poor (few goals were met). Ironically, 70 percent claimed that their organizations recognized that data and information were important strategic assets and managed them with that in mind.

So what is driving companies to improve their data quality efforts? According to the survey, the top driver is just a general desire to improve the quality of data, which was cited by 68 percent of the respondents. Other important motivations to improve data quality were the desire to improve business intelligence, and compliance and legal considerations.


Using Data Quality Tools for Classification

By David Loshin

Hierarchical classification schemes are great for scanning through unstructured text to identify critical pieces of information that can be mapped to an organized analytical profile. To enable this scanning capability, you will need two pieces of technology.

The first involves a text analysis methodology for scanning text and determining which character strings and phrases are meaningful and which ones are largely noise.

The second capability maps identified terms and phrases within existing known hierarchies and performs the classification.

Both of these techniques would work perfectly as long as the input data is always correct and complete - quite an assumption. That is why we need to augment these approaches with data quality techniques, largely in the area of data validation and data standardization/correction. For example, I am particularly guilty of character transposition when I type, and am as likely to tweet about my "Frod F-150" as I would about my "Ford F-150." In this example, the inexact spelling would lead to a failure to classify my automobile preference.

However, using data quality tools, we can create a knowledge base of standard transformations that map common error schemes to their most appropriate matches. Creating a transformation rule mapping "Frod F-150" to "Ford F-150" would suggest the likely intent, supplementing the classification process.
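As a rough illustration of such a knowledge base, here is a minimal Python sketch. The rule table, the list of standard terms, and the `standardize` function are all hypothetical names invented for this example, and the fuzzy fallback uses the standard library's `difflib` rather than any particular data quality product.

```python
import difflib

# Hypothetical knowledge base: known error patterns mapped to standard terms.
TRANSFORMATIONS = {
    "frod f-150": "Ford F-150",
}

# Hypothetical list of standard terms, used as a fuzzy-match fallback.
STANDARD_TERMS = ["Ford F-150", "Chevy Silverado"]


def standardize(term, cutoff=0.8):
    """Return the likely standard form of a term, or None if nothing is close."""
    key = term.strip().lower()
    # First, apply any explicit transformation rule.
    if key in TRANSFORMATIONS:
        return TRANSFORMATIONS[key]
    # Otherwise, fall back to approximate string matching against the standards.
    lowered = {t.lower(): t for t in STANDARD_TERMS}
    matches = difflib.get_close_matches(key, list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


print(standardize("Frod F-150"))   # resolved by the explicit rule
print(standardize("Ford F-15O"))   # letter O typed for zero, resolved by fuzzy match
```

The explicit rules capture errors we have already seen, while the similarity cutoff (an assumed value of 0.8 here) decides how aggressively unseen misspellings are corrected.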

In other words, integrating our text analytics tools with more traditional data quality methodology will not only (yet again) reduce inconsistency and confusion, it will also enhance the precision for analytical results and enable more robust customer profiling - a necessity for customer centricity.

Understanding Hierarchies

By David Loshin

Defining standards for group classification helps in reducing confusion due to inconsistencies across generated reports and analyses. In the automobile classification example we have been using for the past few posts, we might pick the NHTSA values (mini passenger cars, light passenger cars, compact passenger cars, medium passenger cars, heavy passenger cars, sport utility vehicles, pickup trucks, and vans) as the standard.

Yet, as more organizations look to merge data sets and feeds from different sources, some challenges remain, particularly with the use of unstructured text (such as that presented via Twitter or Facebook). People cannot be expected to always conform to your organization's data standards, and often use colloquial terms or their own words to describe ideas that would map to your own dimensional values.

For example, if you wanted to filter out the individuals who prefer to drive "pickup trucks" (one of our standard values), it is not enough to scan for that phrase. Many individuals will refer to their pickup truck using different terms, such as a make and model ("Ford F-150," "Chevy Silverado") or a different name ("light truck") or a nickname ("baby monster"), but these terms have to be linked to the overall classification term.

This is an example of a simple hierarchy, in which one concept ("automobiles") is divided into a collection of smaller classes (the NHTSA classifications). Each of those classes in turn contains other phrases and terms. Within each of those included collections, there may be other inclusive categorization, such as by make and then model.

With a well-defined hierarchy for classification, unstructured text can be scanned for matches with values that live within the hierarchy, and that enables the standardized classification. To round out the example, a tweet exclaiming the author's love of "driving his Ford F-150" can be scanned, the model name extracted and then located within the make and model hierarchy for pickup trucks, thereby allowing us to register his/her automobile driving preference!
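To make the scan concrete, here is a small Python sketch under simplifying assumptions: the hierarchy fragment below is hypothetical (including which models sit under which class), and matching is plain substring search on "make model" phrases rather than real text analysis.

```python
# Hypothetical fragment of a class -> make -> model hierarchy.
HIERARCHY = {
    "pickup trucks": {
        "Ford": ["F-150"],
        "Chevy": ["Silverado"],
    },
    "compact passenger cars": {
        "Honda": ["Civic"],
    },
}


def build_index(hierarchy):
    """Flatten the hierarchy into a lookup from 'make model' phrase to class."""
    index = {}
    for vehicle_class, makes in hierarchy.items():
        for make, models in makes.items():
            for model in models:
                index[f"{make} {model}".lower()] = vehicle_class
    return index


def classify(text, index):
    """Return the set of classes whose make-and-model phrases appear in the text."""
    lowered = text.lower()
    return {cls for phrase, cls in index.items() if phrase in lowered}


index = build_index(HIERARCHY)
print(classify("I love driving my Ford F-150!", index))
```

Flattening the hierarchy once keeps each scan to a simple pass over the known phrases; nicknames or alternate names would be added as further entries that point at the same class.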

Standardizing Classifications

By David Loshin

In the most recent post, we posed a straightforward problem: if we have a reporting or analytical objective that depends on using a dimension for classification, what happens when two different value domains are presumed to map to the same conceptual domain?

More concretely, the example we used was mapping individuals to their car purchase preferences, but different applications used different car classifications that did not share the same number of values and the value sets did not directly map in a one-to-one manner. The potential result is confusion in interpreting the results, especially if this classification is just one variable used for creating a customer profile.

One way to address this is to put a standards policy for classification dimensions in effect by selecting a single set of concepts, mapping those conceptual values to a standard single set of values, and then insisting that any application that uses that conceptual domain always use the standard.
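A minimal Python sketch of what such a policy might look like in practice follows. The application names (`sales_app`, `service_app`), their local values, and the helper functions are hypothetical; only the standard values themselves come from the NHTSA list used in this series.

```python
# Standard values chosen for the conceptual domain (a subset of the NHTSA list).
STANDARD_VALUES = {
    "compact passenger cars",
    "sport utility vehicles",
    "pickup trucks",
    "vans",
}

# Hypothetical application-specific value domains mapped onto the standard.
# Note that the mappings are many-to-one rather than one-to-one.
APP_MAPPINGS = {
    "sales_app": {
        "SUV": "sport utility vehicles",
        "Truck": "pickup trucks",
        "Minivan": "vans",
    },
    "service_app": {
        "compact": "compact passenger cars",
        "crossover": "sport utility vehicles",
        "4x4": "sport utility vehicles",
    },
}


def invalid_targets(mappings, standard):
    """List (application, value, target) entries whose target is not a standard value."""
    return [
        (app, value, target)
        for app, table in mappings.items()
        for value, target in table.items()
        if target not in standard
    ]


def to_standard(app, value):
    """Translate an application's local value to the standard, or None if unmapped."""
    return APP_MAPPINGS.get(app, {}).get(value)


print(to_standard("service_app", "crossover"))
print(invalid_targets(APP_MAPPINGS, STANDARD_VALUES))
```

The `invalid_targets` check is one way to enforce the "always use the standard" rule: any mapping that points outside the agreed value set is surfaced for review rather than silently accepted.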

This sounds simple, but it actually may entail some effort, since no one person may be aware of all the places that any specific classification domain is used.

This task goes beyond a "data management" activity and essentially becomes a "data governance" one involving a broad solicitation across the community of data consumers to determine the classification dimensions used and the enumerations of values employed within each dimension.

At the same time, the analyst spearheading this effort must have a plan for capturing the classification data, harmonizing values across variant lists, selecting a standard, communicating the standard, and then ensuring that the standard is put into practice.

Establishing good practices and processes for domain harmonization and standardization is an important topic to be considered in upcoming posts, but next time we will look at a growing challenge for classification domains: aligning data from unstructured text with the standard classification dimensions.

Managing Customer Connectivity

By David Loshin

At the end of our last entry, we had come to the conclusion that standardization of potentially variant data values was a key activator for evaluating record similarity when looking to group customer records together based on any set of characteristic attributes. From an operational standpoint, this activity is supported using data quality tools that can parse and standardize data.

But the process must go beyond the purchase and use of the tools. For any customer centricity program in which connectivity is relevant, there are going to be multiple dimensions of connectivity employed in business decisions. We can immediately fall back on my original example of the "household" grouping, and depending on the objectives for customer outreach and experience, other groups will be overlaid with each other.

Here is a clear example that builds on my post from a few weeks back. We originally suggested that the household was relevant for mobile telephone companies looking to expand residential customer commitment through increased product sales and service contracts within the household, since one decision-maker might be responsible for adding new lines and services.

That same mobile telephone company might also look at their business-to-business relationships and look to expand their footprint among business customers, suggesting a new grouping of customers based on their employer.

Overlaying the households and the corporate customers would provide a picture of companies' existing brand predispositions among the employees. Identifying the key corporate decision makers and offering combined business and residential account discounts might be a good way to exploit knowledge of overlapping connected groups.
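As a sketch of the overlay, assume we already have two mappings from individuals to groups, one for households and one for employers. Everything in the Python below (the people, group identifiers, and function names) is made up for illustration.

```python
from collections import defaultdict

# Hypothetical group mappings for the same set of individuals.
HOUSEHOLD_OF = {"alice": "H1", "bob": "H1", "carol": "H2", "dave": "H3"}
EMPLOYER_OF = {"alice": "Acme", "carol": "Acme", "dave": "Globex"}

# Hypothetical set of households that already hold a residential account.
RESIDENTIAL_CUSTOMERS = {"H1", "H3"}


def households_by_employer(household_of, employer_of):
    """Overlay the two groupings: which households does each employer touch?"""
    overlay = defaultdict(set)
    for person, employer in employer_of.items():
        household = household_of.get(person)
        if household is not None:
            overlay[employer].add(household)
    return dict(overlay)


def residential_share(overlay, customers):
    """Fraction of each employer's households that are already residential customers."""
    return {
        employer: len(households & customers) / len(households)
        for employer, households in overlay.items()
    }


overlay = households_by_employer(HOUSEHOLD_OF, EMPLOYER_OF)
print(residential_share(overlay, RESIDENTIAL_CUSTOMERS))
```

An employer whose households are already mostly residential customers might be a natural candidate for a combined business and residential offer.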

The result is that the analysis not only depends on good quality data, it assumes that good processes are in place for managing the hierarchy data that maps individuals into groups - an example of what could be called metadata quality. Keeping hierarchies of concepts, data attributes, and mappings among individuals based on those hierarchical attributes (and of course, similarity scoring for linkage!) is a valuable skill, one that we will revisit in an upcoming series of posts.

What Can Health Care Teach Us About Data Quality?

By Elliot King

Data quality issues are more acute in health care than in perhaps any other industry sector. According to a seminal study by the Institute of Medicine (IOM), preventable medical errors are responsible for nearly 100,000 deaths annually, making them the sixth leading cause of death in the United States. These errors cost $98 billion annually.

Of course, not all of these errors reflect problems in data, but many do. And sometimes, the data errors seem astonishing and the outcomes devastating. According to the IOM, as many as 40 wrong-site, wrong-side, or wrong-patient procedures occur every week. For example, a surgeon will amputate a person's right leg instead of the left or remove the gall bladder instead of a kidney. This is a "data error" of the most grievous kind.

Archaic paper record keeping has long been cited as a source of cost and a cause of medical errors. With paper records, if a patient is treated by more than one doctor, each doctor may not have read what the other doctors are doing. And that is not good.

But the aggressive move to electronic health records (EHR) has its own risks as well. Mistakes in patient data can follow the patient from doctor to doctor. Moreover, some programs allow health care providers to auto-fill fields, making it appear that they performed a more thorough examination, let's say, than they actually did.

The issue of the quality of the data used in EHRs has been thrown into the spotlight by efforts to reuse EHR information for clinical research. Within the medical community, most practitioners agree that clinical data is not recorded as carefully as research data. So researchers studying the potential of using EHRs for research measure quality by the characteristics with which most data quality professionals are familiar--completeness, correctness, timeliness, and so on. They assess those characteristics using multiple methods including comparisons to established standards, data element agreement, data source agreement and others.
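Two of these measures are easy to sketch in Python. The record layout, field names, and sample values below are hypothetical and are not drawn from any of the studies mentioned; the functions simply show one plausible way to compute completeness and data source agreement.

```python
def completeness(records, field):
    """Fraction of records with a non-empty value for the given field."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def source_agreement(source_a, source_b, field):
    """Fraction of patients present in both sources whose field values match."""
    shared = source_a.keys() & source_b.keys()
    if not shared:
        return 0.0
    agree = sum(
        1 for pid in shared
        if source_a[pid].get(field) == source_b[pid].get(field)
    )
    return agree / len(shared)


# Hypothetical EHR extract and a second source keyed by patient ID.
ehr = {
    "p1": {"blood_type": "A+"},
    "p2": {"blood_type": ""},
    "p3": {"blood_type": "O-"},
    "p4": {"blood_type": "B+"},
}
registry = {
    "p1": {"blood_type": "A+"},
    "p3": {"blood_type": "O+"},
}

print(completeness(list(ehr.values()), "blood_type"))
print(source_agreement(ehr, registry, "blood_type"))
```

Note that agreement between two sources says nothing about which one is correct; it only flags where a closer look is needed.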

Unfortunately, the early results of data quality assessment for EHRs are not very encouraging. Some studies have indicated that the introduction of EHRs does not lead to higher quality data being gathered, but just larger quantities of bad data. Findings like that have triggered a spirited debate in the medical community, with some arguing that the experience with EHRs demonstrates the first law of informatics--that data should only be used for the purpose for which it was collected.

-- For more info on data quality and the healthcare industry, download our free whitepaper on "Data Quality Is Good Medicine for Healthcare Providers."

Customer Centricity and Connections: Establishing the Link

By David Loshin

In my last post, we began to look at the value proposition for grouping individual customers into logical groupings. We began by looking at a grouping that generally appears naturally, namely the traditional residential household.

We talked about householding in a previous blog posting, but it is worth reviewing the basic approaches used for determining that a group of individuals share a household. The general approach is to analyze a collection of data records and examine sets of identifying attributes for degrees of similarity in naming and residence locations. Many situations are relatively straightforward, such as this example:

John Hansen, 1824 Polk Ave., Memphis TN 38177
Emily S. Hansen, 1824 Polk Ave., Memphis, TN 38177

In this example, two individuals share both a last name and a location address, and although the data evidence does not guarantee truth of the inference, it might be reasonable to suggest that because there is a link between the family name and the residence location, these two individuals are members of the same household. The algorithm, then, is to link records into a collection of similar records based on similarity of the surname and residence characteristics.
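The linking step can be sketched in a few lines of Python. This is a deliberately simple version that assumes exact matching after normalization (lowercasing and stripping punctuation); the function names are invented for the example, and a production tool would parse addresses and score similarity rather than demand equality.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(cleaned.split())


def household_key(name, address):
    """Build a linkage key from the surname (last name token) and the address."""
    surname = normalize(name).split()[-1]
    return (surname, normalize(address))


def group_households(records):
    """Group (name, address) records that share a surname and residence."""
    groups = defaultdict(list)
    for name, address in records:
        groups[household_key(name, address)].append(name)
    return dict(groups)


records = [
    ("John Hansen", "1824 Polk Ave., Memphis TN 38177"),
    ("Emily S. Hansen", "1824 Polk Ave., Memphis, TN 38177"),
]
print(group_households(records))
```

Normalization is what lets the two address strings above link even though one has a comma after "Memphis" and the other does not.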

However, the concept of grouping is not limited to conventional groups, since there are many artificial groups formed as a result of shared interests or similarities in profile criteria. For example, people interested in certain sports car models often organize "fan clubs," new mothers often organize toddler play groups, and sports team fans are often rabid about their franchise allegiances.

In turn, your company might want to create marketing campaigns that target sets of individuals grouped together by demographic or psychographic attributes. In these cases, you would adjust your algorithms to link records based on similarity of the values in other sets of data attributes.

Establishing the link goes beyond looking at the data that already exists in your data set. Rather, you may need to append additional data acquired from alternate sources.

And, interestingly enough, you will need to connect the acquired data to your existing data, and that requires yet another record linkage effort. Apparently, understanding customer collectives is pretty dependent on record linkage. And while linking records is straightforward when all the data values line up nicely, as you might suspect, there are some curious intricacies of linkage in the presence of data with questionable quality.


Melding Aspects of Real-Life and Virtual Contact and Location

By David Loshin

In the most recent posts, we have been exploring the emerging opportunity for developing demographic profiles for customers based on their virtual locations. More to the point, if we are using real-life (yet two-dimensional) geographies to help in developing customer profiling and segmentation models, how much more interesting would those profiles be when expanded to include behavior characteristics associated with many more dimensions?

In real life (as my father used to say), you can only occupy a single chair with a single bottom. But there are many virtual spaces that can be occupied by one individual simultaneously, providing multiple dimensions for behavior analysis. I can have a presence on any number of social networks, play different online games, post comments at different venues, and tweet about all of these, almost all at the same time.

Not only that, but recall that all transactions take place with the parties in a real location, and that goes for online activity too - many of our actions are still pegged to some documented location, usually by IP address, which can be resolved geographically based on Internet topology maps. We can link real-world individuals existing in real-life space to online activities, and we can link online activities to real-world locations and real-world people.

By melding the characteristics of individuals associated with the different virtual spaces with those characteristics associated with physical contact mechanisms and locations, you begin to develop different segmentation models that can intersect with real-world locations in different ways. Perhaps the improved resolution, precision, and hopefully quality of these expanded models will make up for the diminished precision of traditional approaches to geographic localization as they grow increasingly anachronistic.


Understanding Data Quality Services

Knowledge Base, Knowledge Discovery, Domain Management, and Third-Party Reference Data Sets



PASS Virtual Chapter Meeting: Thursday, Jan. 31, 2013 at 9 am PST, 12 pm EST.



With the release of Data Quality Services (DQS), Microsoft takes a new approach to data quality and data cleansing by treating them from a knowledge-driven standpoint. In this presentation, Joseph Vertido from Melissa Data will discuss the key concepts behind knowledge-driven data quality, walk through implementing a Data Quality Project, and demonstrate how to build and improve your Knowledge Base through Domain Management and Knowledge Discovery.

What sets DQS apart is its ability to provide access to third-party reference data sets through the Azure Marketplace. This access to shared knowledge empowers the business user to efficiently cleanse complicated, domain-specific information such as addresses. During this session, examples will be presented on how to access reference data service (RDS) providers and integrate them from the DQS Client.


What is Meant by "Flocking Together" Virtually?

By David Loshin

In my last two posts, we have been reviewing the concepts of contact methods that have in the past been used for identifying a customer's location, and the ramifications of an increasing trend in which the contact mechanism is less reliable for establishing a location.

In particular, the traditional use of telephone numbers to isolate a customer's location and subsequent use of geo-demographic profiles associated with locations is less trustworthy, while at the same time, your customers are relying more on virtual channels of communication.

If the traditional approaches for location-based demographics are becoming less reliable, are there other methods for "location-based" segmentation using virtual contact methods? In other words, do birds of a feather still flock together online?

In my opinion, the answer is unequivocally yes. A fundamental driver for online communities is shared interests, and many communities have evolved either focusing on specific behavioral characteristics, likes, and dislikes (an example might be an online community designed around knitting) or around sharing information about common ideas.

Even the terse 140-character format of an SMS-based framework like Twitter allows using hashtags to define the boundaries around areas of interest. Online games also provide fertile ground for analyzing customer characteristics and behaviors.
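As a small illustration of how hashtags can mark out those boundaries, here is a Python sketch that groups authors by the tags they use. The posts, author names, and function name are hypothetical, and the regular expression is a simplification of how hashtags are actually tokenized.

```python
import re
from collections import defaultdict

# Simplified hashtag pattern: a '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def group_by_hashtag(posts):
    """Map each lowercased hashtag to the set of authors who used it."""
    groups = defaultdict(set)
    for author, text in posts:
        for tag in HASHTAG.findall(text):
            groups[tag.lower()].add(author)
    return dict(groups)


# Hypothetical sample posts as (author, text) pairs.
posts = [
    ("ann", "Finished a new scarf #knitting"),
    ("ben", "Yarn haul today! #Knitting #crafts"),
    ("cy", "Big win tonight #gaming"),
]
print(group_by_hashtag(posts))
```

Each resulting set of authors is a candidate "virtual neighborhood" that could be profiled much like a geographic one.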

So even as the credibility of traditional contact mechanisms for customer behavior segmentation and analysis diminishes, there is an emerging opportunity for evaluating customer behaviors that are virtually "geo-demographic."

More next time...