Recently in data profiling Category

Discover Data Quality Issues Before they Arise

By Taky Djarou, Data Quality Analyst

Melissa has released its new data Profiler API. The Profiler Object offers a unique approach to profiling your data, combining years of contact data quality experience, the power of many Melissa Objects, and data source tables to help you dig deeper into your data and return hundreds of properties about the input table, columns and individual values.

For example, many existing profilers let the user set a RegEx to capture an email pattern. The Melissa Profiler offers that function and also checks each email's syntax and domain, flags whether the address is disposable, has a spammy reputation, or is invalid, and returns counts that reflect all of the above.

Data validation is also performed on city, state/province, ZIP and postal code fields to report any discrepancies in your data. Even if you accidentally put a phone number in a name field, Melissa's Profiler can detect and report it.

The Profiler Object returns counts of duplicate records using four different matching criteria (Exact, Address Only, Household, and Contact). Using the power of our flagship deduplication solution MatchUp, it reports the number of unique records, the number of duplicates, and the size of the largest duplicate group for each of the four matching criteria.

Melissa's Profiler also provides value-specific iterators (pattern, word, data, date, Soundex, etc.) that allow the user to loop through any column in ascending or descending order to retrieve those values and their respective counts.

The date iterator, for example, lets the user see the busiest or slowest time of day, day of the month, or day of the week, using a timestamp field that records when each record was created.
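The kind of summary a date iterator makes possible can be sketched in a few lines of plain Python. This is only an illustration and does not use the Profiler API; the `created_at` values, the timestamp format, and the `profile_timestamps` helper are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical record-creation timestamps; in practice these would be
# read from the timestamp column of the table being profiled.
created_at = [
    "2015-03-02 09:15:00",
    "2015-03-02 14:40:00",
    "2015-03-03 09:05:00",
    "2015-03-09 09:30:00",
    "2015-03-13 17:20:00",
]

def profile_timestamps(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Count records by hour of day, day of month, and day of week."""
    parsed = [datetime.strptime(v, fmt) for v in values]
    return {
        "hour": Counter(d.hour for d in parsed),
        "day_of_month": Counter(d.day for d in parsed),
        "weekday": Counter(WEEKDAYS[d.weekday()] for d in parsed),
    }

counts = profile_timestamps(created_at)
# most_common(1) gives the busiest bucket; the slowest one is the last
# entry of most_common().
busiest_weekday, busiest_count = counts["weekday"].most_common(1)[0]
busiest_hour, _ = counts["hour"].most_common(1)[0]
```

Sorting each counter in ascending or descending order gives the same busiest-to-slowest view the paragraph above describes.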

To demo the Melissa Profiler, please visit our website or call 1-800-MELISSA (635-4772), and one of our Sales Representatives will set you up with a free trial.

How to Do It All with Melissa

With Melissa, you can do it all - see for yourself with the brand new Solutions Catalog. This catalog showcases products to transform your people data (names, addresses, emails, phone numbers) into accurate, actionable insight. Our products are in the Cloud or available via easy plugins and APIs. We provide solutions to power Know Your Customer initiatives, improve mail deliverability and response, drive sales, clean and match data, and boost ROI.


Specific solutions include:

  • Cleaning, matching & enriching data
  • Creating a 360 degree profile of every customer
  • Finding more customers like your best ones with lookalike profiling
  • Integrating data from any source, at any time

Other highlights include: global address autocompletion; mobile phone verification; real-time email address ping; a new customer management platform; as well as info on a wealth of data append and mailing list services.


Download the catalog now:


Introducing the New and Improved Global Email Web Service

How It Can Protect and Increase Your Email Reputation Score

By Oscar Li, Data Quality Sales Engineer/Channel Manager for Global Email

Melissa Data recently introduced several improvements and new features to its Global Email Web Service - an all-in-one real-time email mailbox validation and correction service. Here's a quick list of our latest improvements:

  • Improved fuzzy matching of domain corrections
  • Updated our TLD database with the newest ICANN info
  • Increased control over the freshness of data returned
  • Better unknown status detection capability

In terms of the service's new features, Global Email now offers two validation modes: Express and Premium.

Express can be used in time-sensitive situations and returns a response in one or two seconds. Premium performs a real-time validation of the email address and can take up to 12 seconds to return a response.

If you want to reduce the time taken in Premium mode, we offer an advanced email option that trades some freshness of the data returned for increased speed.

Using the Global Email Service to Protect & Increase Your Email Reputation Score

What does email reputation mean? Email campaigns are a big part of a marketer's work, so it is important to watch your email reputation with a tool such as senderscore. Once you are blacklisted, your email deliverability will suffer, and you will have trouble sending out emails in the future because most mail servers subscribe to a list of known spammers.

Integrators should consult with our data quality experts in order to understand what they need to look out for to avoid a bad campaign, and why certain emails should be flagged/inspected. Even if you avoid spam traps, sending out too many invalid emails will cause the mail server to flag you as a potential spammer. Email marketing campaign servers with a low email reputation score will typically experience aggressive filtering.

On the other hand, senders that maintain a high reputation score see less intrusive filtering, applied only to individual emails and email campaigns rather than to entire IP addresses. It is therefore prudent not to let other users influence your email reputation.

For example, if you are on a shared server, other companies or users could be sending out their own campaigns without filtering emails through our service. It would be a waste to invest time and money in controlling your email campaigns, only to have another user blast emails without mailbox validation and damage the reputation of the entire IP.

How do I improve my email reputation score?

If the reputation scores of your existing email campaign server IPs are already dismal, it might be beneficial to run email campaigns on a new or more reputable IP to see a better return on investment. This would start your email reputation on a clean slate. This may not be feasible for everybody, so your team will need to discuss it internally.

However, using our service on an existing IP should raise the reputation score for that specific IP. On IPs with a pre-existing high-volume campaign history, the scores will be slow to change. This is why it is prudent to invest in an email validation system early on, and why Global Email is a valuable tool to use.

If you would like to learn more, please visit our website!

Flagship SSIS Developer Suite Now Enables Data Assessment and Continuous Monitoring Over Time; Webinar Adds Detail for SSIS Experts

Rancho Santa Margarita, CALIF - March 17, 2015 - Melissa Data, a leading provider of contact data quality and address management solutions, today announced its new Profiler tool added to the company's flagship developer suite, Data Quality Components for SQL Server Integration Services (SSIS). Profiler completes the data quality circle by enabling users to analyze data records before they enter the data warehouse and continuously monitor the level of data quality over time. Developers and database administrators (DBAs) benefit by identifying data quality issues for immediate attention, and by monitoring ongoing conformance to established data governance and business rules.

Register here to attend a Live Product Demo on Wednesday, March 18 from 11:00 am to 11:30 am PDT. This session will explore the ways you can use Profiler to identify problems in your data.

"Profiler is a smart, sharp tool that readily integrates into established business processes to improve overall and ongoing data quality. Users can discover database weaknesses such as duplicates or badly fielded data - and manage these issues before records enter the master data system," said Bud Walker, director of data quality solutions, Melissa Data. "Profiler also enforces established data governance and business rules on incoming records at point-of-entry, essential for systems that support multiple methods of access. Continuous data monitoring means the process comes full circle, and data standardization is maintained even after records are merged into the data warehouse."

Profiler leverages sophisticated parsing technology to identify, extract, and understand data, and offers users three levels of data analysis: general formatting determines if data such as names, emails and postal codes are input as expected; content analysis applies reference data to determine consistency of expected content; and field analysis determines the presence of duplicates.

Profiler brings data quality analysis to data contained in individual columns and incorporates every available general profiling count on the market today; sophisticated matching capabilities output both fuzzy and exact match counts. Regular expressions (regexes) and error thresholds can be customized for full-fledged monitoring. In addition to being available as a tool within Melissa Data's Data Quality Components for SSIS, Profiler is also available as an API that can be integrated into custom applications or OEM solutions.

Request a free trial of Data Quality Components for SSIS or the Profiler API.
Call 1-800-MELISSA (635-4772) for more information.

Data Profiling: Pushing Metadata Boundaries

By Joseph Vertido
Data Quality Analyst/MVP Channel Manager

Two truths about data: Data is always changing. Data will always have problems. The two truths become one reality--bad data. Elusive by nature, bad data manifests itself in ways we wouldn't consider and conceals itself where we least expect it. Compromised data integrity can be saved with a comprehensive understanding of the structure and contents of data. Enter Data Profiling.

Throw off the mantle of complacency and take an aggressive approach to data quality, leaving no opening for data contamination. How? Profiling.

More truths: Profiling is knowledge. Knowledge is understanding. That understanding extends to discovering what the problems are and what needs to be done to fix them.

Armed with Metadata

Metadata is data about your data. Analyzing the metadata gathered through Profiling exposes possible issues in your data's structure and contents, giving you the information--knowledge and understanding--needed to implement Data Quality Regimens.

Here are a few of the main types of Generic Profiling Metadata and the purpose of each:

  • Column Structure - Maximum/Minimum Lengths and Inferred Data Type - These types of metadata provide information on proper table formatting for a target database. It is considered problematic, for example, when an incoming table has values which exceed the maximum allowed length.

  • Missing Information - NULLs and Blanks - Missing data can be synonymous with bad data. This applies, for example, where an Address Line, which in most cases is considered a required element, is Blank or Null.

  • Duplication - Unique and Distinct Counts - These counts indicate the presence of duplicate records. Duplication is commonly considered problematic, and de-duplication is a standard practice in Data Quality. Ideally, there should only be a single golden record representation for each entity in the data.

Other equally important types of Generic Profiling Metadata include Statistics, for trend data; Patterns (RegEx), for identifying deviations from formatting rules; Ranges (Date, Time, String and Numbers); Spaces (Leading/Trailing Spaces and Max Spaces between Words); Casing and Character Sets (Upper/Lower Casing and Foreign, Alphanumeric, Non-UTF-8); and Frequencies, for an overview of the distribution of records for report generation on demographics and more.
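To make a few of these generic metadata types concrete, here is a minimal sketch in plain Python. It is not the Melissa Data Profiler; the `profile_column` helper, its output keys, and the sample `address_line` values are assumptions for illustration. "Distinct" here means the number of different values, and "unique" the number of values that appear exactly once.

```python
from collections import Counter

def profile_column(values):
    """Return basic profiling metadata for one column of str-or-None values."""
    nulls = sum(1 for v in values if v is None)
    blanks = sum(1 for v in values if v is not None and v.strip() == "")
    populated = [v for v in values if v is not None and v.strip() != ""]
    lengths = [len(v) for v in populated]
    counts = Counter(populated)
    return {
        "count": len(values),
        "nulls": nulls,
        "blanks": blanks,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "distinct": len(counts),
        "unique": sum(1 for n in counts.values() if n == 1),
        "leading_or_trailing_spaces": sum(1 for v in populated if v != v.strip()),
    }

# Hypothetical Address Line column with a NULL, a blank, padding, and a repeat.
address_line = ["123 Main St", None, "", "  PO Box 12 ", "123 Main St"]
profile = profile_column(address_line)
```

Comparing `max_length` against the target column's declared width is one way to catch the over-length problem described under Column Structure.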

Metadata Revolution & New Face of Profiling

Right now the most powerful profiling tool for gathering Metadata is the Melissa Data Profiler Component for SSIS, which is used at the Data Flow level, allowing you to profile any data type that SSIS can connect with, unlike the stock Microsoft Profiling Component, which is only for SQL Server databases.

More importantly, the Melissa Data Profiler offers over 100 types of Metadata, including all the Generic Profiling Metadata mentioned here.

Melissa Data's innovative Profiler Component gathers Data Driven Metadata, which goes beyond the standard set of profiling categories. Drawing on our extensive knowledge of Contact Data, it gathers information that is not simply based on rules, norms, and proper formatting. Rather, it provides metadata with the aid of a back-end knowledge base. We can gather unique types of metadata such as postal code, State and Postal Code Mismatch, Invalid Country, Email Metadata, Phone and Names.

Take Control

The secret to possessing good data goes back to a simple truth: understanding and knowledge of your data through profiling. The release of Melissa Data's Profiler for SSIS allows you to take control of your data through use of knowledge base driven metadata. The truth shall set you free!

For more information on our profiling solutions, please visit our website

Validation of Data Rules

By David Loshin

Over the past few blog posts, we have looked at the ability to define data quality rules asserting consistency constraints between two or more data attributes within a single data instance, as well as cross-table consistency constraints to ensure referential integrity. Data profiling tools provide the ability to both capture these kinds of rules within a rule repository and then apply those rules against data sets as a method for validation.

As a preparatory step focusing the profiler for an assessment, the cross-column rules to be applied to each record are organized so that, as the table (or file) is scanned, the data attributes within each individual record that are the subject of a rule are extracted and submitted for assessment. If the record complies with all the rules, it is presumed to be valid. If the record fails any of the rules, it is reported as a violation and tagged with all of the rules that were not observed.
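The scan described above can be sketched as follows. This is a simplified illustration and not the behavior of any particular profiling tool; the rule names, the record fields, and the `validate` helper are hypothetical.

```python
# Hypothetical rule repository: rule name -> assertion over a single record.
RULES = {
    "state_required_for_us": lambda r: r["country"] != "US" or bool(r["state"]),
    "total_matches_quantity": lambda r: r["quantity"] * r["unit_price"] == r["total"],
}

def validate(records, rules):
    """Scan records once; tag each violating record with every rule it fails."""
    valid, violations = [], []
    for record in records:
        failed = [name for name, check in rules.items() if not check(record)]
        if failed:
            violations.append((record, failed))
        else:
            valid.append(record)
    return valid, violations

# Amounts are integer cents to avoid floating-point comparison issues.
records = [
    {"id": 1, "country": "US", "state": "CA", "quantity": 2, "unit_price": 500, "total": 1000},
    {"id": 2, "country": "US", "state": "", "quantity": 3, "unit_price": 200, "total": 500},
    {"id": 3, "country": "CA", "state": "", "quantity": 1, "unit_price": 700, "total": 700},
]
valid, violations = validate(records, RULES)
```

Keeping the full list of failed rules per record, rather than stopping at the first failure, is what lets the later report show every rule that was not observed.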

Likewise for the cross-table rules, the profiler will need to identify the dependent data attributes taken from the corresponding tables that need to be scanned for validation of referential integrity. Those column data sets can be subjected to a set intersection algorithm to determine if any values exist in the referring set that do not exist in the target (i.e., "referred-to") data set.

Any items in the referring set that do not link to an existing master entity are called out as potential violations.
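As a rough sketch of the referential integrity check, the comparison can be expressed as a set difference (referring values minus target values), which yields exactly the values that fall outside the intersection of the two sets. The column names and the `find_orphans` helper are hypothetical.

```python
def find_orphans(referring_values, target_values):
    """Return referring values that do not exist in the target (referred-to) set."""
    return set(referring_values) - set(target_values)

# Hypothetical key columns pulled from two tables.
customer_ids = [101, 102, 103]                   # master table primary key
order_customer_ids = [101, 103, 104, 104, 999]   # foreign key in a referring table

orphans = find_orphans(order_customer_ids, customer_ids)
```

Each value in `orphans` is only a potential violation; mapping it back to the rows that carry it is a separate lookup against the referring table.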

After the assessment step is completed, a formal report can be created and delivered to the data stewards delineating the records that failed any data quality rules. The data stewards can use this report to prioritize potential issues and then for root cause analysis and remediation.

More About Data Quality Assessment

By David Loshin

In our last series of blog entries, I shared some thoughts about data quality assessment and the use of data profiling techniques for analyzing how column value distribution and population corresponded to expectations for data quality. Reviewing the frequency distribution allowed an analyst to draw conclusions about column value completeness, the validity of data values, and compliance with defined constraints on a column-by-column basis.

However, data quality measurement and assessment goes beyond validation of column values, and some of the more interesting data quality standards and policies apply across a set of data attributes within the same record, across sets of values mapped between columns, or relationships of values that cross data set or table boundaries.

Data profiling tools can be used to assess these types of data quality standards in two ways. One approach is more of an undirected discovery of potential dependencies that are inherent in the data, while the other seeks to apply defined validity rules and identify violations. The first approach relies on some algorithmic complexity that I would like to address in a future blog series, and instead in the upcoming set of posts we will focus on the second approach.

To frame the discussion, let's agree on a simple concept regarding a data quality rule and its use for validation, and we will focus specifically on those rules applied to a data instance, such as a record in a database or a row in a table. A data instance quality rule defines an assertion about each data instance that must be true if the rule is observed. If the assertion evaluates to be not true, the record or table row is in violation of the rule.

For example, a data quality rule might specify that the END_DATE field must be later in time than the BEGIN_DATE field. Verifying observance of the rule then means comparing the two date fields in each record and making sure that END_DATE is later than BEGIN_DATE.
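A minimal sketch of that rule in Python might look like the following. The field names END_DATE and BEGIN_DATE come from the example above; the ID field, the sample records, and the `end_after_begin` helper are assumptions for illustration.

```python
from datetime import date

def end_after_begin(record):
    """Assertion: END_DATE must be strictly later than BEGIN_DATE."""
    return record["END_DATE"] > record["BEGIN_DATE"]

records = [
    {"ID": 1, "BEGIN_DATE": date(2015, 1, 1), "END_DATE": date(2015, 6, 30)},
    {"ID": 2, "BEGIN_DATE": date(2015, 5, 1), "END_DATE": date(2015, 4, 1)},
    {"ID": 3, "BEGIN_DATE": date(2015, 2, 1), "END_DATE": date(2015, 2, 1)},
]

# A record violates the rule whenever the assertion evaluates to not true.
violating_ids = [r["ID"] for r in records if not end_after_begin(r)]
```

Note that record 3 is a violation under this reading, because equal dates are not "later"; whether equality should be allowed is a business decision.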

This all seems pretty obvious, of course, and we can use data profiling tools to both capture and apply the validation of the rules to provide an assessment of observance. In the next set of posts we will focus on the definition and application of cross-column and cross-table data quality rules.


Data Quality Assessment: Value Domain Compliance

By David Loshin

To continue the review of techniques for using column value analysis for assessing data quality, we can build on a concept I brought up in my last post about format and pattern analysis and the reasonableness of data values, namely whether the set of values that appear in the column complies with the set of allowable data values altogether.

Many business applications rely on well-defined reference data sets, especially for contact information and product data. These reference data sets are often managed as master data, with the values enumerated in a shared space. For example, a conceptual data domain for the states of the United States can be represented using an enumerated list of 2-character codes as provided by the United States Postal Service (USPS).

That list establishes a set of valid values, which can be used for verification for any dataset column that is supposed to use that format to represent states of the United States.

A good data profiling tool can be configured to perform yet another column analysis that verifies that each value that appears in the column coincides with one of those in the enumerated master reference set. After the values have been scanned and their number of occurrences tallied, the set of unique values can be traversed and each value compared against the reference set.

Any values that appear in the column that do not appear in the reference set can be culled out as potential issues to be reviewed with the business subject matter expert.
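The tally-then-compare approach can be sketched in a few lines. The reference set below is deliberately partial and only for illustration (the real domain is the full USPS list), and the `out_of_domain` helper and sample column are hypothetical.

```python
from collections import Counter

# Partial, illustrative reference set; NOT the complete USPS list.
VALID_STATES = {"CA", "MD", "NY", "TX"}

def out_of_domain(values, reference):
    """Tally occurrences, then return the unique values missing from the reference set."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if value not in reference}

state_column = ["CA", "MD", "Calif", "NY", "CA", "XX", "Calif"]
issues = out_of_domain(state_column, VALID_STATES)
```

Because the comparison runs over the unique values rather than over every row, its cost depends on the column's cardinality, not on the table size.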

In this blog series, we have looked at a number of ways that column value scanning and frequency analysis can be used as part of an objective review of potential data issues.

In a future series, we will look more closely at why these types of issues occur as well as methods for logging the issues with enough context and explanation to share with the business users and solicit their input for determination of severity and prioritization for remediation.


Data Quality Assessment: Value and Pattern Frequency

By David Loshin

Once we have started our data quality assessment process by performing column value analysis, we can reach out beyond the scope of the types of null value analysis we discussed in the previous blog post. Since our column analysis effectively tallies the number of each value that appears in the column, we can use this frequency distribution of values to identify additional potential data flaws by considering a number of different aspects of value frequency (as well as lexicographic ordering), including:

  • Range Analysis, which looks at the values, and allows the analyst to consider whether they can be ordered so as to determine whether the values are constrained within a well-defined range.

  • Cardinality Analysis, which analyzes the number of distinct values that appear within the column to help determine if the values that actually appear are reasonable for what the users expect to see.

  • Uniqueness, which indicates if each of the values assigned to the attribute is used once and only once within the column, helping the analyst to determine if the field is (or can be used as) a key.

  • Value Distribution, which presents an ordering of the relative frequency (count and percentage) of the assignment of distinct values. Reviewing this enumeration and distribution alerts the analyst to any outlier values, either ones that appear more than expected, or invalid values that appear few times and are the result of finger flubs or other data entry errors.
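The four analyses above can all be derived from a single tally of the column. The following is a minimal sketch, assuming a non-empty column of mutually comparable values; the `frequency_profile` helper and the sample `ages` column are hypothetical.

```python
from collections import Counter

def frequency_profile(values):
    """Derive range, cardinality, uniqueness, and value distribution from one tally."""
    counts = Counter(values)
    total = len(values)
    return {
        "min": min(values),                      # range analysis
        "max": max(values),
        "cardinality": len(counts),              # number of distinct values
        "is_unique": all(n == 1 for n in counts.values()),  # candidate key?
        # (value, count, percentage), most frequent first
        "distribution": [
            (value, n, round(100.0 * n / total, 1))
            for value, n in counts.most_common()
        ],
    }

# Hypothetical age column containing one likely data entry error (340).
ages = [34, 41, 34, 29, 34, 41, 340]
profile = frequency_profile(ages)
```

Here the outlier shows up twice: as an implausible maximum in the range and as a rare value at the tail of the distribution.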

In addition, good data profiling tools can abstract the value strings by mapping the different character types, such as alphabetic characters, digits, or special characters, to a reduced representation of different patterns. For example, a telephone number like "(301) 754-6350" can be represented as "(DDD) DDD-DDDD." Once the abstract patterns are created, they can also be subjected to frequency analysis, allowing assessments such as:

  • Format and/or pattern analysis, which involves inspecting representational alphanumeric patterns and formats of values and reviewing the frequency of each to determine if the value patterns are reasonable and correct.

  • Expected frequency analysis, which reviews those columns whose values are expected to reflect certain frequency distribution patterns and validates compliance with the expected patterns.

Recall again that a potential issue can only be verified as a problem by review with a business process subject matter expert.
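The pattern abstraction described above can be sketched as follows. The "D" for digits matches the example in the text; using "A" for alphabetic characters and leaving every other character unchanged are assumptions for this illustration, as are the sample phone values.

```python
from collections import Counter

def abstract_pattern(value):
    """Map digits to 'D' and letters to 'A'; keep other characters as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

phones = ["(301) 754-6350", "(800) 635-4772", "301-754-6350", "301.754.6350 x12"]

# Frequency analysis over the abstracted patterns rather than the raw values.
pattern_counts = Counter(abstract_pattern(p) for p in phones)
```

Rare patterns in `pattern_counts` are the ones worth reviewing first, since they deviate from the dominant format.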

Data Quality Assessment: Column Value Analysis

By David Loshin

In recent blog series, I have shared some thoughts about methods used for data quality and data correction/cleansing. This month, I'd like to share some thoughts about data quality assessment, and the techniques that analysts use to review potential anomalies that present themselves.

The place to start, though, is not with the assessment task per se, but with the context in which the data quality analyst will find him/herself when asked to identify potential data quality flaws. The challenge is in interpretation of the goal: an objective assessment is intended to identify data errors and flaws, but when the task is handed off to a technical data practitioner outside of the context of business needs, the review can be more of a fishing expedition than a true analysis.

What I mean here is that an undirected approach to data quality assessment is likely to expose numerous potential issues, and without some content scoping as to which potential issues are or are not relevant to specific business processes, a lot of time may be spent on wild goose chases to fix issues that are not really problems.

With that caveat, though, we can start to look at some data quality assessment methods, starting with one particular aspect of data profiling: column value analysis. The idea is that reviewing all of the values in a specific column along with their corresponding frequencies will expose situations in which values vary from what they should be. Most column analysis centers on value frequency. In essence, the technical approach for column analysis is to scan all the values in a column and add up their frequencies, then present the frequencies to the analyst, ordered by frequency or in lexicographic order.
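As a minimal sketch of this scan-and-tally approach, the following Python tallies a column and returns both orderings. The `column_frequencies` helper and the sample city values are hypothetical.

```python
from collections import Counter

def column_frequencies(values):
    """Tally each value, then return the tally in both orderings."""
    counts = Counter(values)
    # Most frequent first; ties broken by value so the result is deterministic.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Plain lexicographic order on the value itself.
    lexicographic = sorted(counts.items())
    return by_frequency, lexicographic

cities = ["Boston", "boston", "Austin", "Boston", "Bostn", "Austin", "Boston"]
by_frequency, lexicographic = column_frequencies(cities)
```

The frequency ordering pushes rare values such as "Bostn" to the bottom, while the lexicographic ordering places near-duplicates like "Bostn" and "Boston" next to each other.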

These two orderings enable the lion's share of the analysis, yet many people don't realize that the analysis itself must be driven by the practitioner within the context of the expectation. Over the next three postings in this series, we will look at three different ways to assess quality through reviewing the enumeration of column values and their relative frequency.