Recently in ETL Category

Get in the Contact Zone

| No Comments | No TrackBacks

If you're looking for that perfect all-in-one solution that combines the power of easy integration and Melissa's full spectrum of data quality solutions, then you need to get in the Zone - The Contact Zone®.

Contact Zone employs all the customer data management tools you'll need to help provide consistent, trusted, accurate data across the enterprise - all in one single platform for effortless integration. 

Plus, it's powered by Pentaho® Data Integration (PDI), which gives Contact Zone a simple, graphical user interface, dynamic templates, administrative features, and so much more. Collect data from any source, cleanse and transform it, and gain immediate insight for meaningful use.

How to Perform ETL - Fast and Easy

| No Comments | No TrackBacks
Imagine a more simplified approach to data integration - one that doesn't require the use of different connectors, etc. It's all possible with expressor Software's latest product release - version 3.5 of its desktop ETL platform - which will feature Melissa Data and integration.

The platform will leverage the power of Melissa Data's WebSmart services for postal address verification, phone verification, email validation and name parsing - from within the expressor data flow application. The integration will allow users to read and write CRM data to from their on-premises business systems.

What is expressor 3.5?

It's an easy-to download and install, metadata-driven ETL tool that offers a simplified approach to data integration, ranging from small ETL tasks to complex data integration projects. The tool provides a user interface with a Microsoft Office look and feel, and a drag and drop configuration that lets you process your data with speed and power.

But its biggest lure lies in its connectivity. Unlike other ETL products, expressor's tool does not require the use of different connectors. Instead, the expressor platform contains as few connections as possible - but each is configurable and includes all the parameters needed to get your data in and out of almost any data source.

Learn more about expressor 3.5 and Melissa Data's integration in a special co-hosted webinar, Dec. 15 at 2 pm EST. To register for the event, click here.

Are You A Dupe Detective?

| No Comments | No TrackBacks
By Joseph Vertido

The process of finding approximate matching records in your data to get rid of duplicates is precisely that - fuzzy. It raises as many questions as answers. Am I using a good matching algorithm? Am I matching on the right fields? Is it a true match or a false one?

The problem begins when inconsistent data enters from multiple sources. The meticulous process of finding these similar records and comparing to see if they are actually the same is a daunting challenge. But with the release of the Melissa Data Fuzzy Matching Component for SQL Server Integration Services (SSIS), you

now have a tool that will make this all elementary. With this component, you become a Sherlock Holmes - easily cracking the case of the data doppelgangers.

Finding the Culprits

Matching duplicate records are identified through a percent score. Compared records will be given a match score ranging from 0% (non-matching) to 100% (exact match). So what about records that score in between?

By leveraging the ETL capabilities of SSIS, the Fuzzy Matching Component allows you to send the results through three different output destinations: Match, Non-Match; and Possible Match. Based on how strict or loose you set our thresholds to be, records will be redirected to the output tables accordingly - making the job of keeping the bad records out much easier.

More Brains are Better Than One

So how exactly does the Fuzzy Matching Component determine match percentages? Similarity computation is done through built-in fuzzy matching algorithms. But why settle with just one algorithm for similarity computation when you can have 16!

Available algorithms include common algorithms like the Jaro and Levenstein, but also includes other more advanced algorithms such as Smith-Waterman-Gotoh and PhonetEx.

Why so many you might ask? Each algorithm has its own strengths and weaknesses.
Some algorithms are more accurate when it comes to company and peoples' names, while some are more effective when it comes to company names. But with the wide array of algorithms to choose from, you have the flexibility to choose the logic that works for your data.

The Seven Features of the Fuzzy Matching Component

The SSIS component includes these features:
  1. Match based on several columns in your data
  2. Get actual Match Score Percentages
  3. Multiple fuzzy matching algorithms
  4. Automatic filtering of matching, non-matching, and possible matching records
  5. Built-in pre-cleansing through search and replace patterns
  6. Data Driven Model for easy migration to production
  7. Matching Metadata for data driven decision making

It's Elementary, My Dear Watson

In what seems to be an impossible task of finding and blocking dupe records, Melissa Data aims to help you put the puzzle pieces together and help solve the mysteries of

fuzzy matching. With the Melissa Data Fuzzy Matching Component for SSIS, you're now one step closer to making your data problem-free.

Review: Contact Verification Component

| No Comments | No TrackBacks
By Brian Erlich
SQL Server Magazine

Melissa Data provides mailing lists, mailing and data management software, and other products that can help companies improve the quality of their data. It now offers the Data Quality Components for SSIS, a suite of components that integrate into SQL Server Integration Services (SSIS). I recently had a chance to try the company's Contact Verification Component, which you use with custom data (also provided by Melissa Data). With the component, you can validate and correct contact information, parse and reformat the data in intelligent ways, and enhance the information by providing additional data and removing questionable data.

Melissa Data recommends installing the component on a 64-bit Windows machine (Windows XP SP2 or later) running the 64-bit version of SQL Server 2008 R2 and Microsoft .NET Framework 3.5. Installation is pretty easy. It took me about 10 minutes on my Windows 7 machine, which has the 64-bit version of SQL Server 2008 R2 Enterprise Edition installed.

The first step is to install Setup.exe, which starts the installation wizard. The wizard installs the task and the data files that it relies on. Once installed, you must add the Data Flow task to the Toolbox in SSIS. If you have never done this before, don't worry--it's not hard and you only need to do it once. To add the task, open an SSIS project in Business Intelligence Development Studio (BIDS) and right-click anywhere in the Toolbox. From the drop-down menu, select Choose Items and a dialog box will open with multiple tabs along the top. On the SSIS Data Flow Items tab, select the Melissa Data Contact Verification check box. To finish the process, click OK. The Melissa Data Contact Verification task is now available for use.

To test the new task, I built a sample package using an OLE DB source. The first thing I tested was name parsing. On the Name tab, I selected FullName from the drop-down list in the Input Name section to indicate the field (i.e., database column) containing the names to parse. I then specified the names of the fields in which to place the output. The task not only parsed the full names into the five fields (prefix, first name, middle name, last name, and suffix) but also enhanced the data by adding two new fields containing the person's gender and a salutation. You can even separate out a second name from the input name. An example of this would be turning the name John and Jane Doe into John Doe and Jane Doe.

Address verification works in a similar way to name parsing. On the Address tab, which Figure 1 shows, I selected the address1, address2, city, state, and zip fields in my database as the input and assigned field names for each possible output. The task did the rest. It parsed the suite and apartment numbers into their own fields, added the nine-digit ZIP code, added latitude and longitude coordinates, and even told me which addresses were private post-office boxes, like those rented from The UPS Store.

Data Quality Components for SSIS
Figure 1: Verifying addresses in the Contact Verification Component

The Contact Verification task also verifies and enriches email addresses and phone numbers. A phone number can yield a lot of information, which you can use to enhance your contact information. For example, you can use the task to add the type of phone number (landline, cellular, or Voice over IP--VoIP), country, time zone, and a lot more.

When addresses and phone numbers are verified, the component provides result codes, which add an additional layer of information to each line of data. You can use these codes to define data as valid or invalid based on your business rules. For example, you could use the result codes to remove addresses that contain vulgarities or questionable names such as Donald Duck.

If you need to validate and enhance your company's contact information, the Contact Verification component makes it almost effortless. The component is easy to install. More important, it's easy to use, thanks to the integration with SSIS.

Contact Verification Component PROS:
Easy to install and use; complete control over the output fields; identified questionable addresses and names; added useful data to the contact database.

CONS: Tendency to hold on to some of the metadata during testing but was easily resolved by removing and re-adding the task to the design plan.

RATING: 5 out of 5

PRICE: Licenses based on the number of records; can purchase the Contact Verification Component separately or as part of the Data Quality Components for SSIS suite.

RECOMMENDATION: I would recommend this component to anyone who needs to verify or enhance contact information. Its integration with SSIS makes it easy to use.

CONTACT: Melissa Data • 800-635-4772 or 949-858-3000 •