What is Record Linkage?

By David Loshin

In my last entry, I talked about how many distributed pieces of data about a single individual can be combined to form a deep profile of that individual. But how are data records from disparate data sets combined to formulate insightful profiles?

The answer lies in the ability to collect the different pieces of data that belong to a single individual and then glom them together. For example, let's presume the existence of a record in one data set that has a person's address, a record in another data set that has that person's telephone number, a third record that has that person's registration number for a toaster, another with the person's car year, make, and model, etc.

As long as you can find all the records that are associated with each person and connect them, you can collect all the interesting information and create a single representative profile. That profile is suitable for list generation, and it can also be used for more comprehensive analytics such as segmentation, clustering analysis, and classification.

These records are connected through a process called "record linkage." This process searches through one or more data sets looking for records that refer to the same unique entity, based on identifying characteristics that can be used to distinguish one entity from all others, such as names, addresses, or telephone numbers.
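To make the idea concrete, here is a minimal sketch of linkage in its simplest form: grouping records by one identifying attribute and merging each group into a profile. The sample data, the `link_records` function, and the choice of `name` as the key are all hypothetical, and the sketch assumes the identifying values match exactly.

```python
from collections import defaultdict

# Hypothetical sample data: three data sets describing the same people.
addresses = [
    {"name": "Pat Smith", "address": "12 Elm St"},
    {"name": "Lee Jones", "address": "9 Oak Ave"},
]
phones = [
    {"name": "Pat Smith", "phone": "555-0100"},
]
vehicles = [
    {"name": "Pat Smith", "car": "2009 Honda Civic"},
    {"name": "Lee Jones", "car": "2011 Ford Focus"},
]

def link_records(*data_sets, key="name"):
    """Group records from all data sets by one identifying attribute
    and merge each group into a single profile."""
    profiles = defaultdict(dict)
    for data_set in data_sets:
        for record in data_set:
            if key not in record:
                continue  # a record without the identifier cannot be linked
            profiles[record[key]].update(record)
    return dict(profiles)

profiles = link_records(addresses, phones, vehicles)
```

Here `profiles["Pat Smith"]` ends up holding the address, phone number, and car from all three data sets, while the profile for Lee Jones has no phone number because no phone record carries that name.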

When two records are found to share the same pieces of identifying information, you might assume that those records can be linked together. It sounds simple, but unfortunately, there are a number of challenges with linking records across more than one data set, such as:

· The records from the different data sets don't share the same identifying attributes (one might have a phone number but the other one does not).

· The values in one data set use a different structure or format than the data in another data set (such as using hyphens for social security numbers in one data set but not in the other).

· The values in one data set are slightly different from the ones in the other data set (such as using nicknames instead of given names).

· One data set has the values broken out into separate data elements while the other does not (such as titles and name suffixes).
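A short sketch shows how these challenges defeat a naive comparison. The two records, their field names, and the `exact_match` helper are hypothetical; the identifier is a made-up value, not a real social security number.

```python
# Hypothetical records for the same person held in two data sets.
record_a = {"name": "Robert Smith Jr.", "ssn": "000-12-3456"}
record_b = {"first": "Bob", "last": "Smith", "suffix": "Jr.", "ssn": "000123456"}

def exact_match(a, b, field):
    """Naive linkage test: link only if both records hold the identical value."""
    return field in a and field in b and a[field] == b[field]

# The formats differ (hyphens in one data set but not in the other).
ssn_match = exact_match(record_a, record_b, "ssn")

# record_b has no single "name" field; its name is broken out into parts.
name_match = exact_match(record_a, record_b, "name")
```

Both comparisons come back `False` even though the two records describe the same person, so an exact-match linkage would leave them unconnected.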

Luckily, there are numerous software products designed to address these discrepancies, which can simplify the record linkage process. If you recall some of my previous posts, you may begin to see how parsing and standardization fit in. These tools parse and standardize the values before comparing them for linkage, and that alleviates some of the challenges I noted.
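As a rough illustration of what parsing and standardizing before comparison can look like, here is a minimal sketch. It is not how any particular product works: the lookup tables, `standardize_id`, and `parse_name` are hypothetical, the tables are tiny, and the name parser assumes simple "title given surname suffix" ordering.

```python
import re

# Hypothetical lookup tables; real tools ship with far larger ones.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}
SUFFIXES = {"jr", "sr", "ii", "iii"}
TITLES = {"mr", "mrs", "ms", "dr"}

def standardize_id(value):
    """Standardize an identifier by keeping only its digits."""
    return re.sub(r"\D", "", value)

def parse_name(full_name):
    """Parse a free-form name into title, given name, surname, and suffix."""
    tokens = [t.strip(".,").lower() for t in full_name.split()]
    tokens = [t for t in tokens if t]
    title = tokens.pop(0) if tokens and tokens[0] in TITLES else ""
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else ""
    given = tokens[0] if tokens else ""
    surname = tokens[-1] if len(tokens) > 1 else ""
    given = NICKNAMES.get(given, given)  # map a nickname to its given name
    return {"title": title, "given": given, "surname": surname, "suffix": suffix}

def name_key(parsed):
    """Build the tuple of name parts used for comparison (title is ignored)."""
    return (parsed["given"], parsed["surname"], parsed["suffix"])

a = parse_name("Dr. Robert Smith Jr.")
b = parse_name("Bob Smith, Jr")
same_name = name_key(a) == name_key(b)
same_id = standardize_id("000-12-3456") == standardize_id("000123456")
```

After standardization, both `same_name` and `same_id` are `True`: the nickname, the punctuation, the broken-out suffix, and the hyphens no longer get in the way of the comparison.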

