Recently in Data Integration Category

Rancho Santa Margarita, Calif. - January 14, 2014 - Melissa Data, a leading provider of global contact data quality and integration solutions, today announced its strategic alliance with Blu Sky to solve growing data management challenges in healthcare markets. Melissa Data offers a comprehensive platform for data integration and data quality, and Blu Sky provides data capture technologies optimized for EpicCare software deployments used to administer mid-size and large medical groups, hospitals, and integrated healthcare organizations. By partnering with Melissa Data and its extensive suite of established data quality solutions, healthcare providers gain a comprehensive single source for superior data management and compliance.

"Integrated data quality is essential to advancing universal healthcare options, yet the complexities of healthcare data management are evident in today's headlines," said Gary Van Roekel, COO, Melissa Data. "As government initiatives catalyze change in the market, our alliance with Blu Sky provides a significant technical and competitive advantage for healthcare CTOs - offering a comprehensive, proven resource for data quality, integration, and capture. Improved and integrated patient data quality will not only help providers reduce the cost of care, but also facilitate better diagnosis and treatment options for patients."

With this alliance, Melissa Data provides global data quality solutions that verify, standardize, consolidate, and enhance U.S. and international contact data, in combination with the comprehensive Data Integration Suite in Contact Zone - enabling cleansed and enhanced patient data to be transformed and shared securely within a healthcare network. Blu Sky adds subject matter experts to the equation, with deep expertise in the EpicCare software used extensively in healthcare networks to facilitate a "one patient, one record" approach. As a result, patient data capture, storage, and management are assured of compliance with a growing range of healthcare requirements, including CASS certification of address results and HIPAA privacy and security policies.

"Mobile healthcare, connected pharmacy applications, and electronic medical records represent tangible advancements in healthcare accessibility," said Rick O'Connor, President, Blu Sky. "The same advances increase complexity of data management in the context of HIPAA confidentiality and other industry standards. With a single source to address compliance network-wide, providers are poised for healthcare innovations based on secure, high quality patient information."

For more information about the healthcare data management alliance between Melissa Data and Blu Sky, contact Annie Shannahan at 360-527-9111, or call 1-800-MELISSA (635-4772).

Performance Scalability

By David Loshin

In my last post I noted that there is a growing need for continuous entity identification and identity resolution as part of the information architecture for most businesses, and that the need for these tools is only growing in proportion to the types and volumes of data that are absorbed from different sources and analyzed.

While I have discussed the methods used for parsing, standardization, and matching in past blog series, one thing I alluded to a few posts back was the need for increased performance of these methods as data volumes grow.

Let's think about this for a second. Assume we have 1,000 records, each with a set of data attributes that are selected to be compared for similarity and matching. In the worst case, if we were looking to determine duplicates in that data set, we would need to compare each record against the remaining records. That means doing 999 comparisons 1,000 times, for a total of 999,000 comparisons.

Now assume that we have 1,000,000 records. Again, in the worst case we compare each record against all the others, and that means 999,999 comparisons performed 1,000,000 times, for a total of 999,999,000,000 potential comparisons. So if we scale up the number of records by a factor of 1,000, the number of total comparisons increases by a factor of 1,000,000!

Of course, our algorithms are going to be smart enough to figure out ways to reduce the computational complexity, but you get the idea - the number of comparisons grows quadratically. And even with algorithmic optimizations, the need for computational performance remains, especially when you realize that 1,000,000 records is no longer considered a large number of records - more often we look at data sets with tens or hundreds of millions of records, if not billions.
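As a back-of-the-envelope sketch, the worst-case counts above can be computed directly, along with the effect of one common optimization - "blocking," where records are first grouped by some key (a ZIP code, say) and only compared within each group. The block sizes here are purely illustrative:

```python
def worst_case_comparisons(n):
    """Each record compared against every other record: n * (n - 1)."""
    return n * (n - 1)

def blocked_comparisons(block_sizes):
    """Comparisons when records are grouped into blocks (e.g. by ZIP code)
    and only compared within each block."""
    return sum(b * (b - 1) for b in block_sizes)

print(worst_case_comparisons(1_000))      # 999000
print(worst_case_comparisons(1_000_000))  # 999999000000

# Splitting 1,000,000 records into 1,000 blocks of 1,000 records each
# cuts the worst case by a factor of 1,000.
print(blocked_comparisons([1_000] * 1_000))  # 999000000
```

Even after a thousandfold reduction, nearly a billion comparisons remain - which is why the raw computational performance discussed below still matters.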

In the best scenario, performance scales with the size of the input. New technologies enable the use of high performance platforms, through hardware appliances, software that exploits massive parallelism and data distribution, and innovative methods for data layouts and exchanges.

In my early projects on large-scale entity recognition and master data management, we designed algorithms that would operate in parallel on a network of workstations. Today, these methods have been absorbed into the operational fabric, in which software layers adapt in an elastic manner to existing computing resources.

Either way, the demand is real, and the need for performance will only grow more acute as more data with greater variety and diversity is subjected to analysis. You can't always just throw more hardware at a problem - you need to understand its complexity and adapt the solutions accordingly. In future blog series, we will look at some of these issues and ways that new tools can be adopted to address the growing performance need.

Understanding Data Quality Services

Knowledge Base, Knowledge Discovery, Domain Management,
and Third-Party Reference Data Sets

PASS Virtual Chapter Meeting: Thursday, Jan. 31, 2013 at 9 am PST, 12 pm EST.


With the release of Data Quality Services (DQS), Microsoft advances its data quality and data cleansing solutions by approaching them from a knowledge-driven standpoint. In this presentation, Joseph Vertido from Melissa Data will discuss the key concepts behind knowledge-driven data quality, walk through implementing a Data Quality Project, and demonstrate how to build and improve your Knowledge Base through Domain Management and Knowledge Discovery.

What sets DQS apart is its ability to provide access to third-party reference data sets through the Azure Marketplace. This access to shared knowledge empowers the business user to efficiently cleanse complicated and domain-specific information such as addresses. During this session, examples will be presented on how to access reference data set providers and integrate them from the DQS Client.


A Guide to Better Survivorship - A Melissa Data Approach

By Joseph Vertido

The importance of survivorship - or, as some refer to it, the Golden Record - is quite often overlooked. It is the final step in the record matching and consolidation process, which ultimately allows us to create a single accurate and complete version of a record. In this article, we will take a look at how Melissa Data differentiates itself in its approach to survivorship compared to some of the more conventional practices.

Selecting a surviving record means selecting the best possible candidate to represent the group. However, "best" from the perspective of survivorship can mean many things. It can be affected by the structure of the data, where the data is gathered from, how the data comes in, what kind of data is stored, and sometimes by the nature of business rules. Thus, different techniques can be applied to accommodate certain types of variation when performing survivorship. We find three very commonly used techniques for determining the surviving record:

I. Most Recent

Date-stamped records can be ordered from most recent to least recent. The most recent record can then be considered the survivor.

II. Most Frequent

Matching records containing the same information are also an indication of correctness. Repeated values suggest that the information is persistent and therefore reliable.

III. Most Complete

Field completeness is also a factor to consider. Records with more values populated across the available fields are also viable candidates for survivorship.
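As a minimal sketch (the record layout and field names are hypothetical), the three generic techniques might look like this:

```python
from collections import Counter
from datetime import date

# A hypothetical group of matched records.
records = [
    {"name": "J. Smith",   "phone": None,       "updated": date(2013, 1, 5)},
    {"name": "John Smith", "phone": "555-0101", "updated": date(2012, 6, 1)},
    {"name": "John Smith", "phone": None,       "updated": date(2012, 9, 9)},
]

# I. Most Recent: the newest record survives.
most_recent = max(records, key=lambda r: r["updated"])

# II. Most Frequent: the value repeated across records is treated as reliable.
most_frequent_name = Counter(r["name"] for r in records).most_common(1)[0][0]

# III. Most Complete: the record with the most populated fields survives.
most_complete = max(records, key=lambda r: sum(v is not None for v in r.values()))

print(most_recent["name"], most_frequent_name, most_complete["phone"])
```

Notice that the three techniques can disagree on the survivor - which is exactly the weakness of "generic" rules described next.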

Although these techniques are commonly applied in survivorship schemas, their correctness may not be reliable in many circumstances. Because these techniques apply to almost any type of data, the basis on which a surviving record is created conforms only to "generic" rules. This is where Melissa Data is able to set itself apart from "generic" survivorship. By leveraging reference data, we can pave the way to better and more effective survivorship schemas.

The incorporation of reference data changes how survivorship rules come into play. Using Most Recent, Most Frequent, or Most Complete logic is really a selection based on surface characteristics. Ideally, the selection of the surviving record should be based on an actual understanding of our data.

And this is where reference data comes into play. What it boils down to, in the end, is being able to consolidate the best quality data. By incorporating reference data, we gain an understanding of the actual contents of the data and can make better decisions for survivorship. Let's take a look at some instances of how reference data and data quality affect decisions for survivorship.

I. Address Quality

Separating good data from bad data should take precedence in making decisions for survivorship.

Address Quality Sample

In the case of addresses, giving priority to good addresses makes for a better decision in the survivorship schema.

II. Record Quality

It could also be argued that good data may exist in a single group of matching records. In cases like these, we can assess the overall quality of data by taking into consideration other pieces of information that affect the weight of overall data quality. Take for example the following data:

Record Quality Sample

In this case, the ideal approach is to evaluate multiple elements for each record in the group. Since the second record contains a valid phone number, it can be given more weight than the third record, even though the third is more complete.
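One way to express this idea (the fields, weights, and validation flags below are hypothetical) is to score each record by the verified quality of its elements rather than by raw completeness:

```python
def quality_score(record, weights):
    """Sum the weights of the fields that validated against reference data."""
    return sum(weights[field] for field, valid in record.items() if valid)

# Hypothetical validation results for a group of matched records
# (True means the element was verified against reference data).
group = [
    {"address": True,  "phone": False, "email": False},
    {"address": True,  "phone": True,  "email": False},
    {"address": True,  "phone": False, "email": True},
]
weights = {"address": 3, "phone": 2, "email": 1}

# The second record survives: a valid phone outweighs a valid email.
survivor = max(group, key=lambda r: quality_score(r, weights))
```

The weights encode the business's view of which verified elements matter most; tuning them is where the reference-data-driven understanding of the data comes in.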

In summary, whether we're working with contact data, product data, or any other form of data, the methodologies and logic used for record survivorship depend primarily on data quality. However we choose to define data quality, it is imperative that we keep only the best pieces of data if we are to have the most accurate and correct information. In the case of contact data, though, Melissa Data changes the perspective on how the quality of data is defined, breaking the norm of typical survivorship schemas.

Making Sense Out of Missing Data

By David Loshin

I have spent the past few blog posts considering different aspects of null values and missing data. As I mentioned last time, it is easy to test for incompleteness, especially when system nulls are allowed. And even in older systems, the variable ways that missing or null data is represented are finite, making it easy to describe rules for flagging incomplete records.
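A sketch of such rules might look like this (the placeholder token list is an assumption that would be tuned per source system):

```python
# Strings that legacy systems commonly use in place of a real value.
MISSING_TOKENS = {"", "n/a", "na", "null", "none", "unknown", "-", "?"}

def is_missing(value):
    """Flag a field as incomplete: a system null or a known placeholder."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_TOKENS

record = {"name": "Loshin", "phone": "N/A", "city": None, "zip": "21201"}
incomplete = [field for field, value in record.items() if is_missing(value)]
print(incomplete)  # ['phone', 'city']
```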

The challenge is determining how to address the missing values, and unfortunately there are no magic bullets to infer a value when there is no information provided. On the other hand, one might consider some different ideas for determining whether a data element's value may be null, and if not, how to find a reasonable or valid value for it.

For example, linking data between different data sets can enable some degree of inference. If I can link a record in one data set that is missing a value with a similar record in a different data set whose data elements are complete, as long as certain rules are observed (such as timeliness and consistency rules), we could make the presumption that the missing value can be completed by copying from the linked record.
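A sketch of that linkage-based inference, where the consistency rule (here, a simple timeliness check) stands in for whatever rules actually apply:

```python
def fill_from_linked(record, linked, rule):
    """Copy missing values from a linked record in another data set,
    but only when the consistency rule allows it."""
    filled = dict(record)
    for field, value in record.items():
        if value is None and rule(record, linked, field):
            filled[field] = linked.get(field)
    return filled

# Hypothetical rule: only trust the linked source if it is at least as fresh.
fresher = lambda rec, link, field: link["updated"] >= rec["updated"]

crm = {"name": "D. Loshin", "phone": None, "updated": 2012}
erp = {"name": "David Loshin", "phone": "410-555-0100", "updated": 2013}
merged = fill_from_linked(crm, erp, fresher)
print(merged["phone"])  # 410-555-0100
```

Note that only the missing field is filled in; existing values (like the name) are left alone, since the rule only presumes completion, not correction.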

Alternatively, we could adjust the business processes: either identify situations in which a value is marked mandatory when it really doesn't need to be, or engineer aspects of a workflow to ensure that the missing data is collected before transactions are gated to their subsequent stages.

These are just a few ideas, but the fact that data incompleteness remains a problem these days is a testament to how little attention the issue is given. With the growing reliance on greater volumes of data streamed at higher velocities than ever before, the problems of missing and incomplete data sets are only going to become more acute, so perhaps now is a good time to start considering the negative impacts of missing data within your own environments!

Structural Differences and Data Matching

By David Loshin

Data matching is easy when the values are exact, but there are different types of variation that complicate matters. Let's start at the foundation: structural differences in the ways that two data sets represent the same concepts. For example, early application systems used data files that were relatively "wide," capturing a lot of information in each record, but with a lot of duplication.

More modern systems use a relational structure that segregates unique attributes associated with each data concept - attributes about an individual are stored in one data table, and those records are linked to other tables containing telephone numbers, street addresses, and other contact data.

Transaction records refer back to the individual records, which reduces the duplication in the transaction log tables.

The differences are largely in the representation. The older system might have a field for a name, a field for an address, and perhaps a field for a telephone number, while the newer system might break the name field into first, middle, and last name; the address into fields for street, city, state, and ZIP code; and the telephone number into fields for area code and exchange/line number.

These structural differences become a barrier when performing record searches and matching. The record structures are incompatible: a different number of fields, different field names, and different precision in what is stored.

This is the first opportunity to consider standardization: if structural differences affect the ability to compare a record in one data set to records in another data set, then applying some standards to normalize the data across the data sets will remove that barrier. More on structural standardization in my next post.
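As a simplified sketch of such normalization (the parsing rules here are far cruder than a real standardization engine's, and the field names are illustrative):

```python
def normalize_wide_record(record):
    """Split an older 'wide' record's free-form fields into the component
    fields a modern relational schema expects."""
    first, _, last = record["name"].partition(" ")
    street, city, state_zip = [part.strip() for part in record["address"].split(",")]
    state, zip_code = state_zip.split()
    area_code, exchange_line = record["phone"].split("-", 1)
    return {
        "first_name": first, "last_name": last,
        "street": street, "city": city, "state": state, "zip": zip_code,
        "area_code": area_code, "exchange_line": exchange_line,
    }

wide = {"name": "David Loshin",
        "address": "123 Main St, Baltimore, MD 21201",
        "phone": "410-555-0100"}
modern = normalize_wide_record(wide)
print(modern["city"], modern["zip"])  # Baltimore 21201
```

Once both data sets are parsed into the same component fields, record-to-record comparison becomes a field-by-field operation rather than a guessing game.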

By David Loshin

One of the most frequently-performed activities associated with customer data is searching - given a customer's name (and perhaps some other information), looking that customer's records up in databases. And this leads to an enduring challenge for data quality management, which supports finding the right data through record matching, especially when you don't have all the data values, or if the values are incorrect.

When applications allow free-formed text to be inserted into data elements with ill-defined semantics, there is the risk that the values stored may not completely observe the expected data quality rules.

As an example, many customer service representatives may expect that if a customer calls the company, there will be a record in the customer database for that customer. If for some reason, though, the customer's name is not entered exactly the same way as presented during a lookup, there is a chance that the record won't be found. This happens a lot with me, since I go by my middle name, "David," and often people will shorten that to "Dave" when entering data, so when I give my name as "David" the search fails when there is no exact match.
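A toy sketch of nickname-aware matching (real matching engines use far larger reference tables; this mapping is illustrative only):

```python
# Illustrative nickname table mapping variants to a canonical form.
NICKNAMES = {"dave": "david", "bill": "william", "bob": "robert"}

def canonical(name):
    """Normalize case, whitespace, and known nickname variants."""
    n = name.strip().lower()
    return NICKNAMES.get(n, n)

def names_match(a, b):
    """Exact match after normalization."""
    return canonical(a) == canonical(b)

print(names_match("David", "Dave"))   # True
print(names_match("David", "Daniel")) # False
```

With this kind of normalization, a lookup for "David" still finds the record entered as "Dave."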

The same scenario takes place when the customer herself does not recall the data used to create the electronic persona - in fact, how many times have you created a new online account when you couldn't remember your user id? Also, it is important to recognize that although we think in terms of interactive lookups of individual data, a huge amount of record matching is performed as bulk operations, such as mail merges, merging data during corporate acquisitions, eligibility validation, claims processing, and many other examples.

It is relatively easy to find a record when you have all the right data. As long as the values used for search criteria are available and exactly match the ones used in the database, the application will find the record. The big differentiator, though, is the ability to find those records even when some of the values are missing, or vary somewhat from the system of record. In the next few postings we'll dive a bit deeper into the types of variations and then some approaches used to address those variations.

In a Global Economy, a Global Solution is Vital

By Patrick Bayne
Data Quality Tools Software Engineer

Heidelberglaan 8
3584 CS Utrecht

If you were given this address, how would you know it was valid? Is it formatted correctly? How long would it take you to verify? For years, businesses have understood the need for address validation solutions, because clean, accurate data is essential. Without accurate and consistent data, customers are lost due to missed mailings, synchronization across servers becomes difficult, and reports become inaccurate - all adding to costs and missed opportunities for your business.

In an ever-expanding global economy, there is a strong need for a global address solution that is not only accurate but cost-effective. Melissa Data, a longtime value leader in address cleansing solutions, is proud to announce the launch of its Global Address Verification Web Service. Now it's easier than ever to cleanse, validate, and standardize your international data - allowing you to make confident business decisions, execute accurate mailings and shipping, and maintain contact with your customers.

The Global Address Verification Web Service integrates easily into your systems. Because it is a Web service, any platform that can communicate over the Web can use it, and it supports a variety of popular protocols - SOAP, XML, JSON, REST, and Forms. This multiplatform, multi-protocol openness allows simple connection to any point of entry or batch system.
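As an illustrative sketch only - the endpoint URL and parameter names below are assumptions, not the documented API, so consult the current service documentation before use - a REST-style JSON request might be assembled like this:

```python
import urllib.parse

# Hypothetical endpoint; check the service documentation for the real one.
BASE = "https://address.melissadata.net/v3/WEB/GlobalAddress/doGlobalAddress"

def build_request(address, country, license_key):
    """Assemble the query string for a REST/JSON call (parameter names
    are assumptions for illustration)."""
    query = urllib.parse.urlencode({
        "id": license_key,
        "a1": address,
        "ctry": country,
        "format": "json",
    })
    return f"{BASE}?{query}"

url = build_request("Heidelberglaan 8, 3584 CS Utrecht", "Netherlands", "YOUR_KEY")
# The request would then be sent with urllib.request.urlopen(url)
# and the JSON response decoded with json.load().
```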

Enhanced Global Address Verification

What if you only had the street address and not the country name? The Global Address Verification Web Service's flexible address cleansing capabilities will append the name of the country to ensure deliverability.

The solution automatically puts components into the correct address field, making applications that process location data more accurate - bringing more value to your data. The process is also resilient to erroneous, non-address data. The solution can translate addresses between many languages and can format addresses to a country's local postal standards.

Result Codes for the Highest Verification Level

While it's easy to set up the Web Service and make calls to it, you will also know exactly what changed in the submitted address through result codes. There are three types of codes detailing data quality: AE (Error) codes signify missing or invalid information; AC (Change) codes indicate address pieces that have been changed; and new AV (Verification) codes tell you how strongly the address was verified against the reference data we have for a particular country.

Quantifying the quality of a match is easy with result codes. An AV followed by a 2 designates that the record was matched to the highest level possible according to the reference data available. An AV followed by a 1 denotes that the address is partially verified, but not to the highest level possible. The final digit, from 1 to 5, indicates the level to which the address was verified. Think of it as a sliding scale.

For countries like France, Finland, and Germany, where reference data extends to the delivery point, a result code of AV24 or AV25 tells you the address was fully verified. A country such as American Samoa, where reference data extends to the locality level, would be fully verified with a result code of AV22. Any missing or inaccurate information would change the result to partially verified. These user-friendly result codes will help you make more informed decisions about your data.
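Following the scheme just described, a client could decode the codes like this (a sketch; only the AV/AC/AE pattern from the text is assumed):

```python
def interpret_result_codes(codes):
    """Decode comma-separated result codes: AE = error, AC = change,
    AV1x = partially verified, AV2x = fully verified, x = level 1-5."""
    out = []
    for code in (c.strip() for c in codes.split(",")):
        if code.startswith("AV"):
            status = "fully verified" if code[2] == "2" else "partially verified"
            out.append(f"{code}: {status} to level {code[3]}")
        elif code.startswith("AE"):
            out.append(f"{code}: missing or invalid information")
        elif code.startswith("AC"):
            out.append(f"{code}: address component changed")
    return out

print(interpret_result_codes("AV25,AC01"))
```

For a German delivery-point match, "AV25" would decode as fully verified to level 5; an American Samoa locality match, "AV22", as fully verified to level 2.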

The Global Address Verification Web Service will open new doors to cleanse and validate your international data. The service is operating system and programming language neutral, allowing you to use it wherever you desire.

All data is maintained by Melissa Data, reassuring you that it is up-to-date and relieving you of the stress of any maintenance. You will have confidence that your data is accurate and be able to make informed, effective business decisions based on your data, increasing your productivity. So the next time you see, "Heidelberglaan 8, 3584 CS Utrecht," you will know how to quickly and assuredly verify that it is a valid address.

--- Patrick Bayne is a data quality tools software engineer at Melissa Data.

For a free trial of the Global Address Verification Web Service, click here.

Translating Expectations into Quality Directives

By David Loshin

In my last post, I raised the question about the variety of terms used in describing address quality, and I introduced a set of core concepts that needed to be correct to provide the best benefits for accurate parcel delivery. Let's look at these more carefully:

1) The item must be directed to a specific recipient party (either an individual or an organization).
2) The address must be a deliverable address.
3) The intended recipient must be associated with the deliverable address.
4) The delivery address must conform to the USPS standard.

Together these concepts have implications for address quality, and we can start with the first three. The first concept implies a direct connection between entities: the sender and the recipient.

The corresponding business rule is relatively subtle - it suggests that the recipient must be identifiable to the sender. Concept #2 is a bit more direct: the address must be a deliverable address. This means that the address must carry enough information to enable a carrier to locate the address as a prelude to delivery. Concept #3 establishes a direct dependence between the recipient and the addressed location, implying awareness of that connection.

Together we can infer more discrete assertions:

• The address must be accurately mappable to a real location.
• The address must contain enough information to ensure delivery.
• The recipient must be a recognized entity.
• The recipient must be connected to the address.

In the next few posts we will figure out what these assertions really mean in terms of transforming a provided address into a complete, validated, and standardized address.
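As a minimal sketch, the four assertions could be expressed as one explicit predicate over a delivery record (the field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Delivery:
    recipient: str              # the specific party the item is directed to
    address: str
    mappable: bool              # address resolves to a real location
    complete: bool              # enough information to ensure delivery
    recipient_recognized: bool  # recipient is a recognized entity
    recipient_at_address: bool  # recipient is connected to the address

def satisfies_assertions(d: Delivery) -> bool:
    """All four assertions must hold for a quality delivery address."""
    return (d.mappable and d.complete
            and d.recipient_recognized and d.recipient_at_address)

item = Delivery("David Loshin", "123 Main St, Baltimore, MD 21201",
                mappable=True, complete=True,
                recipient_recognized=True, recipient_at_address=True)
print(satisfies_assertions(item))  # True
```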

Sometimes Data Quality is the Law

By Elliot King

We have all read the statistics about the real costs that poor data quality represents. And intuitively, we know that bad data is, well, bad. But, in many cases, bad data is more than just bad for business. Increasingly, good data is required by law.

In 2001, the U.S. Congress added two lines to its major appropriations act that required that the Office of Management and Budget "provide policy and procedural guidance to Federal agencies for ensuring and maximizing the quality, objectivity, utility, and integrity of information (including statistical information) disseminated by Federal agencies." These two lines have come to be known as the Data Quality Act (DQA) or the Information Quality Act.

While the DQA applied to federal agencies, it also put a marker in the ground. Regulatory agencies could demand that data given to them meet criteria set by the Congress. Government data had to be accurate, objective and have integrity as a matter of law.

The same notion has spread to corporate America and other specific industries. In one of the most high profile examples, the Dodd-Frank Wall Street Reform and Consumer Protection Act passed in the wake of the financial meltdown of 2008 established the Office of Financial Research with a mandate to improve the quality of financial data accessible to regulators.

One of the primary challenges for financial institutions is to ensure that the information they supply to regulators is consistent across all their divisions. Companies will also have to be able to track data flows and usage and develop chains of custody that can be audited. The net result should be that all users of the data see consistent information. And that may not be easy when data is siloed in different operating entities, which is often the case in those too-big-to-fail financial behemoths.

The regulatory pressure for data quality is being felt elsewhere as well. The American Health Information Management Association, which promotes the technological advancement of health information management systems, has noted that for electronic health records (EHR) to have the positive impact on overall health care that their proponents anticipate, data quality can no longer be a reactive process based on auditing but must be proactively focused on data capture. Standards to ensure that result will be built into the requirements for EHRs.

Many folks in IT chafe at government regulations. However, the Federal government has often been a leader in IT innovation. Think about the Internet. Federal IT initiatives have also been inept--think about the FBI case management system or the Air Traffic Control system. While burdensome, the regulatory demand for higher quality data could easily have a positive ripple effect that spreads widely.