Recently in Data Integration Category

Understanding Data Quality Services

Knowledge Base, Knowledge Discovery, Domain Management,
and Third-Party Reference Data Sets



PASS Virtual Chapter Meeting: Thursday, Jan. 31, 2013 at 9 am PST, 12 pm EST.

REGISTER NOW!


With the release of Data Quality Services (DQS), Microsoft has rethought its approach to data quality and data cleansing by starting from a knowledge-driven standpoint. In this presentation, Joseph Vertido from Melissa Data will discuss the key concepts behind Knowledge Driven Data Quality, walk through implementing a Data Quality Project, and demonstrate how to build and improve your Knowledge Base through Domain Management and Knowledge Discovery.

What sets DQS apart is its ability to provide access to Third-Party Reference Data Sets through the Azure Marketplace. This access to shared knowledge empowers the business user to efficiently cleanse complicated, domain-specific information such as addresses. During this session, examples will show how to access Reference Data Services (RDS) providers and integrate them from the DQS Client.


A Guide to Better Survivorship - A Melissa Data Approach

By Joseph Vertido

The importance of survivorship - or, as others may refer to it, the Golden Record - is quite often overlooked. It is the final step in the record matching and consolidation process, the step that ultimately allows us to create a single accurate and complete version of a record. In this article, we will take a look at how Melissa Data differentiates itself in approaching the concept of survivorship compared to some of the more conventional practices.

The process of selecting surviving records means selecting the best possible candidate to represent a group of matching records. However, "best" from the perspective of survivorship can mean a lot of things. It can be affected by the structure of the data, where the data is gathered from, how the data comes in, what kind of data is stored, and sometimes by the nature of business rules. Different techniques can therefore be applied to accommodate these variations when performing survivorship. We find that there are three very commonly used techniques in determining the surviving record:

I. Most Recent

Date-stamped records can be ordered from most recent to least recent. The most recent record can be considered eligible as the survivor.

II. Most Frequent

Matching records containing the same information are also an indication of correctness. Repeating records indicate that the information is persistent and therefore reliable.

III. Most Complete

Field completeness is also a factor of consideration. Records with more values populated for each available field are also viable candidates for survivorship.
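
The three generic techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names, and function names are hypothetical and are not taken from any Melissa Data product.

```python
from collections import Counter
from datetime import date

# Hypothetical group of matching records; field names are illustrative only.
records = [
    {"name": "Jon Smith", "phone": "", "email": "", "updated": date(2012, 11, 3)},
    {"name": "John Smith", "phone": "555-0100", "email": "", "updated": date(2011, 6, 1)},
    {"name": "John Smith", "phone": "555-0100", "email": "js@example.com", "updated": date(2010, 2, 14)},
]

def most_recent(group):
    """Pick the record with the latest date stamp."""
    return max(group, key=lambda r: r["updated"])

def most_frequent(group, field):
    """Pick the first record carrying the most common non-empty value of a field.

    Assumes at least one record has a non-empty value for the field.
    """
    counts = Counter(r[field] for r in group if r[field])
    value, _ = counts.most_common(1)[0]
    return next(r for r in group if r[field] == value)

def most_complete(group):
    """Pick the record with the most populated fields."""
    return max(group, key=lambda r: sum(1 for v in r.values() if v))
```

On this sample group each technique selects a different survivor, which shows how much the outcome depends on the rule chosen rather than on the content of the data.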


Although these techniques are commonly applied in survivorship schemas, their results may not be reliable in many circumstances. Because these techniques apply to almost any type of data, the basis on which a surviving record is created conforms only to "generic" rules. This is where Melissa Data is able to set itself apart from "generic" survivorship. By leveraging reference data, we can steer toward better and more effective schemas for survivorship.

The incorporation of reference data in survivorship changes how rules come into play. Using the Most Recent, Most Frequent or Most Complete logic really has more of an aesthetic basis for selection. Ideally, the selection of the surviving record should be based off an actual understanding of our data.

And this is where reference data comes into play. What it boils down to in the end is simply being able to consolidate the best quality data. By incorporating reference data, we gain an understanding of the actual contents of the data and make better decisions for survivorship. Let's take a look at some instances of how reference data and data quality affect decisions for survivorship.

I. Address Quality

Separating good data from bad data should take precedence in making decisions for survivorship.

Address Quality Sample

In the case of addresses, giving priority to good addresses makes for a better decision in the survivorship schema.

II. Record Quality

It could also be argued that more than one record with good data may exist in a single group of matching records. In cases like these, we can assess the overall quality of each record by taking into consideration other pieces of information that affect the weight of overall data quality. Take for example the following data:

Record Quality Sample

In this case, the ideal approach is to evaluate multiple elements for each record in the group. Since the second record contains a valid phone number, it can be given more weight or more importance than the third record, even though the third record is more complete.
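
A minimal sketch of this kind of quality-weighted selection follows. The sample group, the `*_valid` flags (standing in for the outcome of checks against reference data), and the weights are all assumptions made for illustration; they do not reproduce the sample data referenced above.

```python
# Hypothetical group; the *_valid flags stand in for the outcome of checks
# against reference data (assumptions, not real API output).
group = [
    {"id": 1, "address": "1 Main", "address_valid": False,
     "phone": "", "phone_valid": False, "email": ""},
    {"id": 2, "address": "12 Main St", "address_valid": True,
     "phone": "555-0100", "phone_valid": True, "email": ""},
    {"id": 3, "address": "12 Main St", "address_valid": True,
     "phone": "555-01", "phone_valid": False, "email": "a@example.com"},
]

WEIGHTS = {"address_valid": 2, "phone_valid": 1}  # illustrative weights

def quality_score(record):
    """Sum the weights of every quality check the record passes."""
    return sum(w for flag, w in WEIGHTS.items() if record[flag])

def completeness(record):
    """Count populated content fields."""
    return sum(1 for f in ("address", "phone", "email") if record[f])

def pick_survivor(records):
    """Rank by quality first; completeness only breaks ties."""
    return max(records, key=lambda r: (quality_score(r), completeness(r)))
```

Here record 2 survives: it passes both quality checks, so it outranks record 3 even though record 3 has more fields populated.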

In summary, whether we're working with contact data, product data or any other form of data, the methodologies and logic used for record survivorship depend primarily on data quality. And however we choose to define data quality, it is imperative that we keep only the best pieces of data if we are to have the most accurate and correct information. In the case of contact data, however, Melissa Data changes the perspective on how the quality of data is defined, thereby breaking the norm of typical survivorship schemas.


Making Sense Out of Missing Data

By David Loshin

I have spent the past few blog posts considering different aspects of null values and missing data. As I mentioned last time, it is easy to test for incompleteness, especially when system nulls are allowed. And even in older systems, the number of ways that missing or null data is represented is finite, making it easy to describe rules for flagging incomplete records.

The challenge is determining how to address the missing values, and unfortunately there are no magic bullets to infer a value when there is no information provided. On the other hand, one might consider some different ideas for determining whether a data element's value may be null, and if not, how to find a reasonable or valid value for it.

For example, linking data between different data sets can enable some degree of inference. If I can link a record in one data set that is missing a value with a similar record in a different data set whose data elements are complete, as long as certain rules are observed (such as timeliness and consistency rules), we could make the presumption that the missing value can be completed by copying from the linked record.
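
As a rough sketch of that idea, the function below copies missing values from a linked record only when a timeliness rule and a consistency rule both hold. The field names, the one-year age limit, and the function name are illustrative assumptions rather than a prescribed method.

```python
from datetime import date

def fill_missing(target, source, fields, max_age_days=365, today=None):
    """Copy missing fields from a linked record, subject to two rules."""
    today = today or date.today()
    # Timeliness rule: ignore linked records that are too old.
    if (today - source["updated"]).days > max_age_days:
        return dict(target)
    # Consistency rule: fields populated in both records must agree.
    for f in fields:
        if target.get(f) and source.get(f) and target[f] != source[f]:
            return dict(target)
    filled = dict(target)
    for f in fields:
        if not filled.get(f) and source.get(f):
            filled[f] = source[f]
    return filled

incomplete = {"name": "Ann Lee", "city": "Utrecht", "zip": ""}
linked = {"name": "Ann Lee", "city": "Utrecht", "zip": "3584 CS",
          "updated": date(2013, 1, 1)}
completed = fill_missing(incomplete, linked, ["city", "zip"],
                         today=date(2013, 6, 1))
```

The copy is a presumption, not a fact, so in practice it may be worth flagging filled values so they can be reviewed or overridden later.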

Alternatively, we could adjust the business processes: either identify situations in which a value is treated as mandatory when it really doesn't need to be, or examine ways to engineer aspects of a workflow to ensure that the missing data is collected before transactions are gated to their subsequent stages.

These are just a few ideas, but the sheer fact that data incompleteness remains a problem these days is a testament to the fact that the issue is not given enough attention. But with the growth in the reliance on greater volumes of data being streamed at higher velocities than ever before, the problems of missing and incomplete data sets are only going to become more acute, so perhaps now is a good time to start considering the negative impacts of missing data within your own environments!

Structural Differences and Data Matching

By David Loshin

Data matching is easy when the values are exact, but there are different types of variation that complicate matters. Let's start at the foundation: structural differences in the ways that two data sets represent the same concepts. For example, early application systems used data files that were relatively "wide," capturing a lot of information in each record, but with a lot of duplication.

More modern systems use a relational structure that segregates unique attributes associated with each data concept - attributes about an individual are stored in one data table, and those records are linked to other tables containing telephone numbers, street addresses, and other contact data.

Transaction records refer back to the individual records, which reduces the duplication in the transaction log tables.

The differences are largely in the representation - the older system might have a field for a name, a field for an address, perhaps a field for a telephone number, and the newer system might break up the name field into a first name, middle name, and last name, the address into fields for street, city, state, and ZIP code, and a telephone number into fields for area code and exchange/line number.

These structural differences become a barrier when performing record searches and matching. The record structures are incompatible: different numbers of fields, different field names, and different precision in what is stored.

This is the first opportunity to consider standardization: if structural differences affect the ability to compare a record in one data set to records in another data set, then applying some standards to normalize the data across the data sets will remove that barrier. More on structural standardization in my next post.
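
To make the structural gap concrete, here is a small sketch that maps a "wide" legacy record with single name and phone fields onto the finer-grained fields described above. The field names are hypothetical and the splitting rules are deliberately naive; address parsing is left out because it generally needs reference data.

```python
import re

def split_name(full_name):
    """Naively split a full name into first, middle, and last parts."""
    parts = full_name.split()
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])
    return {"first_name": first, "middle_name": middle, "last_name": last}

def split_phone(phone):
    """Split a 10-digit phone number into area code and local number."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) != 10:
        # Unknown layout: keep the digits but do not guess an area code.
        return {"area_code": "", "local_number": digits}
    return {"area_code": digits[:3], "local_number": digits[3:]}

def normalize_wide_record(rec):
    """Map a legacy record onto the structured field layout."""
    out = split_name(rec.get("name", ""))
    out.update(split_phone(rec.get("phone", "")))
    return out

legacy = {"name": "Mary Ann Jones", "phone": "(301) 555-0142"}
structured = normalize_wide_record(legacy)
```

Once both data sets share the same field layout, records can be compared field by field instead of as incompatible structures.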

By David Loshin

One of the most frequently-performed activities associated with customer data is searching - given a customer's name (and perhaps some other information), looking that customer's records up in databases. And this leads to an enduring challenge for data quality management, which supports finding the right data through record matching, especially when you don't have all the data values, or if the values are incorrect.

When applications allow free-formed text to be inserted into data elements with ill-defined semantics, there is the risk that the values stored may not completely observe the expected data quality rules.

As an example, many customer service representatives may expect that if a customer calls the company, there will be a record in the customer database for that customer. If for some reason, though, the customer's name is not entered exactly the same way as presented during a lookup, there is a chance that the record won't be found. This happens a lot with me, since I go by my middle name, "David," and often people will shorten that to "Dave" when entering data, so when I give my name as "David" the search fails when there is no exact match.

The same scenario takes place when the customer herself does not recall the data used to create the electronic persona - in fact, how many times have you created a new online account when you couldn't remember your user id? Also, it is important to recognize that although we think in terms of interactive lookups of individual data, a huge amount of record matching is performed as bulk operations, such as mail merges, merging data during corporate acquisitions, eligibility validation, claims processing, and many other examples.

It is relatively easy to find a record when you have all the right data. As long as the values used for search criteria are available and exactly match the ones used in the database, the application will find the record. The big differentiator, though, is the ability to find those records even when some of the values are missing, or vary somewhat from the system of record. In the next few postings we'll dive a bit deeper into the types of variations and then some approaches used to address those variations.
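
As a simple illustration of matching that tolerates variation, the sketch below expands a few nicknames and then scores string similarity with Python's standard `difflib.SequenceMatcher`. The nickname table, the 0.85 threshold, and the function names are assumptions chosen for the example; production matching tools use far richer techniques.

```python
from difflib import SequenceMatcher

# Illustrative nickname table; a real one would be much larger.
NICKNAMES = {"dave": "david", "bill": "william", "liz": "elizabeth"}

def canonical(name):
    """Lowercase the name and expand known nicknames token by token."""
    tokens = name.lower().split()
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the canonical forms."""
    return SequenceMatcher(None, canonical(a), canonical(b)).ratio()

def find_matches(query, names, threshold=0.85):
    """Return candidate names at or above the threshold, best first."""
    scored = [(similarity(query, n), n) for n in names]
    return [n for score, n in sorted(scored, reverse=True) if score >= threshold]

candidates = ["Dave Loshin", "Davod Loshin", "Elliot King"]
matches = find_matches("David Loshin", candidates)
```

With this approach a search for "David Loshin" still finds the record stored as "Dave Loshin", and also surfaces the misspelled "Davod Loshin" as a lower-ranked candidate.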

In a Global Economy, a Global Solution is Vital

By Patrick Bayne
Data Quality Tools Software Engineer

Heidelberglaan 8
3584 CS Utrecht


If you were given this address, how would you know it was valid? Is it formatted correctly? How long will it take you to verify? For years businesses have understood the need for address validation solutions, because clean, accurate data is essential. Without accurate and consistent data, customers are lost due to missed mailings, synchronization across servers is difficult, and reports become inaccurate - all adding to costs and missed opportunities for your business.

In an ever-emerging, ever-increasing global economy, there is a strong need for a global address solution that is not only accurate but cost-effective. Melissa Data, a long time value leader in address cleansing solutions, is proud to announce the launch of its Global Address Verification Web Service. Now it's easier than ever to cleanse, validate, and standardize your international data - allowing you to make confident business decisions, execute accurate mailings and shipping, plus maintain contact with your customers.

The Global Address Verification Web Service easily integrates into your systems. Because of the nature of a Web service, any platform that can communicate with the Web is open to use the service. And the Web service supports a variety of popular protocols - SOAP, XML, JSON, REST and Forms. The multiplatform, multi-Web protocol openness of the service allows simple connection to any point of entry, or batch system.

Enhanced Global Address Verification

What if you only had the street address and not the actual country name? Global Address Verification Web Service's unrestrictive address cleansing capabilities will append the name of the country to ensure deliverability.

The solution automatically puts components into the correct address field, making applications that process location data more accurate - bringing more value to your data. The process is also resilient to erroneous, non-address data. The solution can translate addresses between many languages and can format addresses to a country's local postal standards.

Result Codes for the Highest Verification Level

While it's easy to set up the Web Service and make calls to it, you will also know exactly what changed in the submitted address through results codes. There are three types of codes to detail the data quality: AE (Error) codes signify missing or invalid information; AC (Change) codes indicate address pieces that have been changed; and new AV (Verification) codes inform you as to how strongly the address was verified according to the reference data we have for a particular country.

Quantifying the quality of a match is easy through the results codes. An AV followed by a 2 designates that the record was matched to the highest level possible according to the reference data available. An AV followed by a 1 denotes that the address is partially verified, but not to the highest level possible. The second digit, from 1 to 5, indicates the level to which the address was verified. Think of it as a sliding scale.

For countries like France, Finland and Germany, where there is data up to the delivery point, you will know that there was a full verification on an address when there is a result code of AV24 or AV25. A country such as American Samoa, where reference data is up to locality, would be fully verified with a results code of AV22. Any missing or inaccurate information would change the results to be partially verified. The user-friendly results codes will help you make more informed decisions about your data.
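
The AV code layout described above (a first digit of 2 for full or 1 for partial verification, then a level from 1 to 5) can be decoded with a short helper. This sketch is based only on that description; the function names are hypothetical, and the assumption that codes arrive as a list of strings is not taken from the service's documentation.

```python
import re

# "AV", then 1 (partial) or 2 (full), then a verification level of 1-5.
AV_PATTERN = re.compile(r"^AV([12])([1-5])$")

def parse_av_code(code):
    """Split an AV result code into verification status and level."""
    m = AV_PATTERN.match(code)
    if m is None:
        return None  # not an AV code (for example an AE or AC code)
    status = "full" if m.group(1) == "2" else "partial"
    return {"status": status, "level": int(m.group(2))}

def is_fully_verified(codes):
    """True if any code in the list reports full verification."""
    parsed = [parse_av_code(c) for c in codes]
    return any(p is not None and p["status"] == "full" for p in parsed)
```

For example, `parse_av_code("AV24")` reports full verification at level 4, while `parse_av_code("AV12")` reports partial verification at level 2.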

The Global Address Verification Web Service will open new doors to cleanse and validate your international data. The service is operating system and programming language neutral, allowing you to use it wherever you desire.

All data is maintained by Melissa Data, reassuring you that it is up-to-date and relieving you of the stress of any maintenance. You will have confidence that your data is accurate and be able to make informed, effective business decisions based on your data, increasing your productivity. So the next time you see, "Heidelberglaan 8, 3584 CS Utrecht," you will know how to quickly and assuredly verify that it is a valid address.

--- Patrick Bayne is a data quality tools software engineer at Melissa Data.



Translating Expectations into Quality Directives

By David Loshin

In my last post, I raised the question about the variety of terms used in describing address quality, and I introduced a set of core concepts that needed to be correct to provide the best benefits for accurate parcel delivery. Let's look at these more carefully:

1) The item must be directed to a specific recipient party (either an individual or an organization).
2) The address must be a deliverable address.
3) The intended recipient must be associated with the deliverable address.
4) The delivery address must conform to the USPS standard.

Together these concepts have implications for address quality, and we can start with the first 3 concepts. The first concept implies a direct connection between entities: the sender and the recipient.

The corresponding business rule is relatively subtle - it suggests that the recipient must be identifiable to the sender. Concept #2 is a bit more direct: the address must be a deliverable address. This means that the address must carry enough information to enable a carrier to locate the address as a prelude to delivery. Concept #3 establishes a direct dependence between the recipient and the addressed location, implying awareness of that connection.

Together we can infer more discrete assertions:

• The address must be accurately mappable to a real location.
• The address must contain enough information to ensure delivery.
• The recipient must be a recognized entity.
• The recipient must be connected to the address.

In the next few posts we will figure out what these assertions really mean in terms of transforming a provided address into a complete, validated, and standardized address.
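
One way to picture the four assertions is as simple rule checks against reference data. In the sketch below, the in-memory sets are hypothetical stand-ins for postal and customer reference sources, and the field names and function name are assumptions for illustration.

```python
# Hypothetical reference data; a real system would query postal and
# customer reference sources instead of in-memory sets.
KNOWN_LOCATIONS = {"12 MAIN ST, SPRINGFIELD, IL 62701"}
KNOWN_RECIPIENTS = {"ACME CORP"}
RECIPIENT_ADDRESSES = {("ACME CORP", "12 MAIN ST, SPRINGFIELD, IL 62701")}
REQUIRED_FIELDS = ("street", "city", "state", "zip")

def check_delivery(recipient, address):
    """Evaluate the four assertions for one recipient and address."""
    complete = all(address.get(f) for f in REQUIRED_FIELDS)
    key = "{street}, {city}, {state} {zip}".format(
        **{f: address.get(f, "") for f in REQUIRED_FIELDS}
    ).upper()
    name = recipient.upper()
    return {
        "mappable": key in KNOWN_LOCATIONS,
        "complete": complete,
        "recipient_recognized": name in KNOWN_RECIPIENTS,
        "recipient_at_address": (name, key) in RECIPIENT_ADDRESSES,
    }

result = check_delivery(
    "Acme Corp",
    {"street": "12 Main St", "city": "Springfield", "state": "IL", "zip": "62701"},
)
```

Reporting each assertion separately, rather than a single pass or fail, shows which part of the address or recipient information needs attention.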


Sometimes Data Quality is the Law

By Elliot King

We have all read the statistics about the real costs that poor data quality represents. And intuitively, we know that bad data is, well, bad. But, in many cases, bad data is more than just bad for business. Increasingly, good data is required by law.

In 2001, the U.S. Congress added two lines to its major appropriations act that required that the Office of Management and Budget "provide policy and procedural guidance to Federal agencies for ensuring and maximizing the quality, objectivity, utility, and integrity of information (including statistical information) disseminated by Federal agencies." These two lines have come to be known as the Data Quality Act (DQA) or the Information Quality Act.

While the DQA applied to federal agencies, it also put a marker in the ground. Regulatory agencies could demand that data given to them meet criteria set by the Congress. Government data had to be accurate, objective and have integrity as a matter of law.

The same notion has spread to corporate America and other specific industries. In one of the most high-profile examples, the Dodd-Frank Wall Street Reform and Consumer Protection Act, passed in the wake of the financial meltdown of 2008, established the Office of Financial Research with a mandate to improve the quality of financial data accessible to regulators.

One of the primary challenges for financial institutions is to ensure that the information they supply to regulators is consistent across all their divisions. Companies are also going to need to track data flows and usage and develop chains of custody that can be audited. The net result should be that all users of the data will see consistent information. And that may not be easy when data is siloed in different operating entities, which is often the case in those too-big-to-fail financial behemoths.

The regulatory pressure for data quality is being felt elsewhere as well. The American Health Information Management Association, which promotes the technological advancement of health information management systems, has noted that for electronic health records (EHR) to have the positive impact on overall health care that their proponents anticipate, data quality can no longer be a reactive process based on auditing but must be proactively focused on data capture. Standards to ensure that result will be built into the requirements for EHRs.

Many folks in IT chafe at government regulations. However, the Federal government often has been a leader in IT innovation. Think about the Internet. Federal IT initiatives have also been inept--think about the FBI case management system or the Air Traffic Control system. While burdensome, the regulatory demand for higher quality data could easily have a positive ripple effect that will spread widely.


Clean Data is Good Data

By Elliot King

The cliché is as old as computing itself--garbage in, garbage out. And that cliché is as true now as ever, if not more so. Unfortunately, with information flowing into companies from so many sources including the Web and third-party providers, mistakes should not just be expected; they are basically inevitable. Garbage data is going to get in your data systems.

We may want to close our eyes to bad data and just pretend it doesn't matter, but that would be a major mistake. Virtually any operation driven by faulty data is suspect. Trends you may uncover could be wrong. Your customer contact efforts could be inappropriate or misdirected.

Data cleansing, the systematic effort to remediate bad data, is no trivial task. First, so many different kinds of errors can exist. Mistakes can occur in single source systems as well as multiple source systems. Errors and inconsistencies can be introduced either at the metadata level--the data schema or the information wrapper, in the case of the Web, may be flawed--or at the granular level, where the information itself is just not right.

Even within the information itself, so much can go awry. There can be missing values and misspellings. Information can be entered into the wrong field--a street name in the city field perhaps. Attributes that should be linked aren't--let's say a city without a ZIP code. Records could contradict each other or be associated incorrectly. For example, "John Smith" may actually work in payroll and not in human resources.

Not surprisingly, data cleansing has more steps than doing the laundry. The first step is to analyze the data and find the real mistakes--the omissions, the contradictions and the errors. The next step is to re-engineer and validate the new metadata and rules to address those errors. The data has to be transformed, expunging the problems. At that point, the data can be reloaded into the database. And in case that doesn't sound all that daunting, each of the steps generally has a slew of sub processes too.
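
The analyze, define rules, and transform steps can be sketched in miniature as follows. The field names, the correction table, and the function names are illustrative assumptions, and the final reload into a database is left out.

```python
def analyze(records, required=("name", "city", "zip")):
    """Step 1: report missing required values as (index, field, issue)."""
    issues = []
    for i, rec in enumerate(records):
        for field in required:
            if not str(rec.get(field, "")).strip():
                issues.append((i, field, "missing"))
    return issues

# Step 2: rules derived from the analysis (illustrative corrections only).
CITY_FIXES = {"utrect": "Utrecht"}

def transform(records):
    """Step 3: apply the rules and return cleaned copies of the records."""
    cleaned = []
    for rec in records:
        fixed = {k: str(v).strip() for k, v in rec.items()}
        city = fixed.get("city", "")
        fixed["city"] = CITY_FIXES.get(city.lower(), city)
        cleaned.append(fixed)
    return cleaned

raw = [{"name": " John Smith ", "city": "Utrect", "zip": ""}]
found = analyze(raw)
clean = transform(raw)
```

Note that the transform step here fixes only what the rules cover: the misspelled city is corrected, but the missing ZIP code reported by the analysis still needs another source of data.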

Of course, none of this has to be done manually. Many different tools have been introduced into the market. Some are more generalized while others specialize in fixing a specific problem such as names and addresses.

At the end of the day, "garbage in, garbage out" sounds really harsh. Maybe you should look at it this way--clean data is good data. But just like your clothes, if you use data it will get dirty, so the cleansing actually never ends.


By Joseph Vertido

For many, the concepts of data integration and data quality are separate and have nothing in common. But in reality, when you combine them, they create a partnership that excels. Where data quality leaves off, data integration begins, and vice versa. A new product - Contact Zone - fuses these two concepts into one solution at the point where data integration and data quality converge.

Data integration tools simplify data migration and data warehousing procedures - both of which are concerned with the issue of data management, i.e. keeping data organized. Data quality, on the other hand, is concerned primarily with understanding the nature and validity of the contents of the actual data, i.e. keeping data clean. Maintaining an organized database is not the same as keeping it clean - they are two different approaches to handling data - but can they be combined, and should they be?

The short answer is yes.

In essence, data integration allows for the migration of data from a given source to a given destination. Typically, users take advantage of data integration to accomplish data warehousing initiatives - allowing for easy migration and manipulation of data, which ultimately leads to maximizing the efficiency of business intelligence and analytics.

However, Gartner states that "only 30 percent of business intelligence and data warehousing implementations fully succeed." Why? The top two reasons for failure are budget constraints and data quality. So, although the architectural constraints of building a data warehouse can be addressed by utilizing data integration tools, it still leaves the problem of poor data quality - something that most data integration tools handle with mediocrity at best.

That's where Contact Zone comes into play. It's a data integration tool optimized for data quality, allowing you to kill two birds with one stone.

Contact Zone connects to virtually any source, overcoming an obstacle our clients frequently encounter when implementing data quality: there is such a variety of database formats and platforms today that the types of environments and combinations can be overwhelming.

Whether you have an IBM DB2 database or PostgreSQL, Contact Zone supports data integration for almost any database format, while making sure that all data is clean, correct, standardized, and valid.