By David Loshin
How can you tell if two records refer to the same person (or company, or other type of organization)? In our recent posts, we have looked at how data quality techniques such as parsing and standardization help in normalizing the data values within different records so that the records can be compared. But what is being compared? That is the topic of this next set of entries.
A simplistic view might suggest that when looking at two records, comparing the corresponding values is the best way to start. For example, we might compare the corresponding names, telephone numbers, and street addresses - the kind of data that usually appears in records representing customers, residences, patients, and so on.
But the simple concept belies a much more complex question about the attributes used to describe an individual as well as to differentiate pairs of individuals. Much of this issue revolves around how we determine which characteristics are managed within a representative record, the motivation for including those characteristics, and, importantly, whether those data elements are used solely for "attribution" (additional description of the entity involved) or for "distinction" (to help in unique identification).
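To make the attribution-versus-distinction split concrete, here is a minimal Python sketch of a field-by-field comparison. The field names, the sample records, and the choice of which fields count as distinguishing are all assumptions made for illustration, and the similarity measure (the standard library's `difflib.SequenceMatcher`) is just a convenient stand-in, not a recommended matching algorithm.

```python
from difflib import SequenceMatcher

# Hypothetical split of data elements (an assumption for illustration):
# "distinction" fields help tell individuals apart,
# "attribution" fields merely describe the entity.
DISTINCTION_FIELDS = ("name", "phone", "street_address")
ATTRIBUTION_FIELDS = ("preferred_store",)


def field_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio for two already-standardized values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_records(rec1: dict, rec2: dict) -> dict:
    """Compare corresponding values, but only for the distinguishing fields."""
    return {
        field: field_similarity(rec1.get(field, ""), rec2.get(field, ""))
        for field in DISTINCTION_FIELDS
    }


# Two made-up records that may or may not refer to the same person.
rec_a = {
    "name": "Jonathan Smith",
    "phone": "555-0100",
    "street_address": "12 Main Street",
    "preferred_store": "Downtown",
}
rec_b = {
    "name": "Jon Smith",
    "phone": "555-0100",
    "street_address": "12 Main St",
    "preferred_store": "Uptown",
}

scores = compare_records(rec_a, rec_b)
for field, score in scores.items():
    print(f"{field}: {score:.2f}")
```

Note that the differing `preferred_store` values never enter the comparison: in this sketch that element is treated as attribution only, so it describes each record without counting for or against a match.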
More to the point: what are the core data elements necessary for determining the uniqueness of a record? We often take for granted that our relational models presume one and only one record per entity, and that there might be business impacts should more than one entry exist for an individual.
Yet individual "entities" may exist in multiple data sets, even in different contexts. Some characteristics are part and parcel of each entity, while others describe the entity playing a particular role. Our upcoming posts are intended to consider some of these issues when assessing similarity for record linkage and matching.