Recently in Data Quality Components for SSIS Category

Check out this video featuring a step-by-step/start-to-finish approach to cleanse and enrich address data using the new SQL Server 2012 Data Quality Services (DQS), and Melissa Data's Datamarket Address Check service. From the video, you will learn how to create a comprehensive Data Quality project that leverages Melissa Data's deep experience in address verification to cleanse, verify and standardize postal address information in SQL tables and Excel spreadsheets.

Click Here to watch video.

Are You A Dupe Detective? Part 1

| No Comments | No TrackBacks
By Joseph Vertido

The process of finding approximate matching records in your data to get rid of duplicates is precisely that - fuzzy. It raises as many questions as answers. Am I using a good matching algorithm? Am I matching on the right fields? Is it a true match or a false one?

The problem begins when inconsistent data enters from multiple sources. The meticulous process of finding these similar records and comparing to see if they are actually the same is a daunting challenge. But with the release of the Melissa Data Fuzzy Matching Component for SQL Server Integration Services (SSIS), you

now have a tool that will make this all elementary. With this component, you become a Sherlock Holmes - easily cracking the case of the data doppelgangers.

Finding the Culprits

Matching duplicate records are identified through a percent score. Compared records will be given a match score ranging from 0% (non-matching) to 100% (exact match). So what about records that score in between?

By leveraging the ETL capabilities of SSIS, the Fuzzy Matching Component allows you to send the results through three different output destinations: Match, Non-Match; and Possible Match. Based on how strict or loose you set our thresholds to be, records will be redirected to the output tables accordingly - making the job of keeping the bad records out much easier.

More Brains are Better Than One

So how exactly does the Fuzzy Matching Component determine match percentages? Similarity computation is done through built-in fuzzy matching algorithms. But why settle with just one algorithm for similarity computation when you can have 16!

Available algorithms include common algorithms like the Jaro and Levenstein, but also includes other more advanced algorithms such as Smith-Waterman-Gotoh and PhonetEx.

Why so many you might ask? Each algorithm has its own strengths and weaknesses.
Some algorithms are more accurate when it comes to company and peoples' names, while some are more effective when it comes to company names. But with the wide array of algorithms to choose from, you have the flexibility to choose the logic that works for your data.

Are You A Dupe Detective? Part 1

The Seven Features of the Fuzzy Matching Component

The SSIS component includes these features:
  1. Match based on several columns in your data
  2. Get actual Match Score Percentages
  3. Multiple fuzzy matching algorithms
  4. Automatic filtering of matching, non-matching, and possible matching records
  5. Built-in pre-cleansing through search and replace patterns
  6. Data Driven Model for easy migration to production
  7. Matching Metadata for data driven decision making

It's Elementary, My Dear Watson

In what seems to be an impossible task of finding and blocking dupe records, Melissa Data aims to help you put the puzzle pieces together and help solve the mysteries of

fuzzy matching. With the Melissa Data Fuzzy Matching Component for SSIS, you're now one step closer to making your data problem-free.


Review: Contact Verification Component

| No Comments | No TrackBacks
By Brian Erlich
SQL Server Magazine

Melissa Data provides mailing lists, mailing and data management software, and other products that can help companies improve the quality of their data. It now offers the Data Quality Components for SSIS, a suite of components that integrate into SQL Server Integration Services (SSIS). I recently had a chance to try the company's Contact Verification Component, which you use with custom data (also provided by Melissa Data). With the component, you can validate and correct contact information, parse and reformat the data in intelligent ways, and enhance the information by providing additional data and removing questionable data.

Melissa Data recommends installing the component on a 64-bit Windows machine (Windows XP SP2 or later) running the 64-bit version of SQL Server 2008 R2 and Microsoft .NET Framework 3.5. Installation is pretty easy. It took me about 10 minutes on my Windows 7 machine, which has the 64-bit version of SQL Server 2008 R2 Enterprise Edition installed.

The first step is to install Setup.exe, which starts the installation wizard. The wizard installs the task and the data files that it relies on. Once installed, you must add the Data Flow task to the Toolbox in SSIS. If you have never done this before, don't worry--it's not hard and you only need to do it once. To add the task, open an SSIS project in Business Intelligence Development Studio (BIDS) and right-click anywhere in the Toolbox. From the drop-down menu, select Choose Items and a dialog box will open with multiple tabs along the top. On the SSIS Data Flow Items tab, select the Melissa Data Contact Verification check box. To finish the process, click OK. The Melissa Data Contact Verification task is now available for use.

To test the new task, I built a sample package using an OLE DB source. The first thing I tested was name parsing. On the Name tab, I selected FullName from the drop-down list in the Input Name section to indicate the field (i.e., database column) containing the names to parse. I then specified the names of the fields in which to place the output. The task not only parsed the full names into the five fields (prefix, first name, middle name, last name, and suffix) but also enhanced the data by adding two new fields containing the person's gender and a salutation. You can even separate out a second name from the input name. An example of this would be turning the name John and Jane Doe into John Doe and Jane Doe.

Address verification works in a similar way to name parsing. On the Address tab, which Figure 1 shows, I selected the address1, address2, city, state, and zip fields in my database as the input and assigned field names for each possible output. The task did the rest. It parsed the suite and apartment numbers into their own fields, added the nine-digit ZIP code, added latitude and longitude coordinates, and even told me which addresses were private post-office boxes, like those rented from The UPS Store.

Data Quality Components for SSIS
Figure 1: Verifying addresses in the Contact Verification Component

The Contact Verification task also verifies and enriches email addresses and phone numbers. A phone number can yield a lot of information, which you can use to enhance your contact information. For example, you can use the task to add the type of phone number (landline, cellular, or Voice over IP--VoIP), country, time zone, and a lot more.

When addresses and phone numbers are verified, the component provides result codes, which add an additional layer of information to each line of data. You can use these codes to define data as valid or invalid based on your business rules. For example, you could use the result codes to remove addresses that contain vulgarities or questionable names such as Donald Duck.

If you need to validate and enhance your company's contact information, the Contact Verification component makes it almost effortless. The component is easy to install. More important, it's easy to use, thanks to the integration with SSIS.

Contact Verification Component PROS:
Easy to install and use; complete control over the output fields; identified questionable addresses and names; added useful data to the contact database.

CONS: Tendency to hold on to some of the metadata during testing but was easily resolved by removing and re-adding the task to the design plan.

RATING: 5 out of 5

PRICE: Licenses based on the number of records; can purchase the Contact Verification Component separately or as part of the Data Quality Components for SSIS suite.

RECOMMENDATION: I would recommend this component to anyone who needs to verify or enhance contact information. Its integration with SSIS makes it easy to use.

CONTACT: Melissa Data • 800-635-4772 or 949-858-3000 • www.melissadata.com 

FREE TRIAL: www.melissadata.com/ssis10


Authors