Wiki – Data Cleansing

Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.

After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores.

Data cleansing differs from data validation in that validation almost invariably means bad data is rejected at entry time, rather than being cleaned in batches after the fact.
The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities.
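
For illustration only, here is a minimal sketch in Python of that second idea – correcting values against a known list of entities. The state list and the fuzzy-match cutoff are assumptions invented for the example:

```python
import difflib

# Known list of valid entities (reference data) - example values only
VALID_STATES = ["Illinois", "Indiana", "Iowa", "Michigan", "Wisconsin"]

def cleanse_state(raw_value, cutoff=0.8):
    """Return the closest valid state name, or None if nothing clears the cutoff."""
    value = raw_value.strip().title()          # normalize whitespace and casing
    if value in VALID_STATES:                  # already clean
        return value
    matches = difflib.get_close_matches(value, VALID_STATES, n=1, cutoff=cutoff)
    return matches[0] if matches else None     # None = route to manual review

print(cleanse_state("  illinios "))   # typo corrected -> "Illinois"
print(cleanse_state("Windsor"))       # no safe match -> None
```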


PC Magazine’s Definition

The condition of data in a database. Clean data are error free or have very few errors. Dirty data have errors, including incorrect spelling and punctuation of names and addresses, redundant data in several records or simply erroneous data (not the correct amounts, names, etc.).
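
To make “dirty data” concrete, here is a small illustrative sketch (the sample records and the crude match key are invented for the example) that flags the kind of redundant name-and-address records the definition describes:

```python
import re

# Invented sample records showing the kinds of errors described above
records = [
    {"id": 1, "name": "John Smith",  "address": "123 Main St."},
    {"id": 2, "name": "John  Smith", "address": "123 Main Street"},  # redundant
    {"id": 3, "name": "Mary Jones",  "address": "45 Oak Ave"},
]

def match_key(record):
    """Crude match key: lowercase, standardize suffixes, collapse whitespace."""
    key = f"{record['name']} {record['address']}".lower()
    key = key.replace("street", "st").replace("st.", "st")
    return re.sub(r"\s+", " ", key)

seen = {}
for rec in records:
    key = match_key(rec)
    if key in seen:
        print(f"Possible duplicate: id {rec['id']} matches id {seen[key]}")
    else:
        seen[key] = rec["id"]
```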

Red Gate – Data Cleanser for SSIS

1.) Red Gate Forum:  Click Here.

2.) Free “Beta Tool” For Data Cleansing:  Click Here.

3.) Products Page:  Click Here.

4.) Adding Data Cleanser To Your SQL Toolbox:  Click Here.


Data Warehouse Studio

Here is another request I received and would like to share with my readers:

Data Warehouse Studio complements your current ETL technology and helps to automate the management and code generation processes as you build your Warehouse and/or Mart. Data Warehouse Studio customers are accelerating their implementations by as much as 3–10 times vs. conventional methods.

The software collects specifications from various contributors – architects, analysts, and modelers – taking into consideration that a large number of specs can be reused across different feeds and projects. By providing each type of contributor a custom entry screen, we eliminate redundant manual activities that are present in conventional methodologies. Once the specs have been entered, with the push of a button DWS generates ETL and SQL code in the tool of your choice (Informatica, DataStage, Talend, etc.).
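
To give a feel for what spec-driven code generation means in general – this is a hypothetical illustration, not DWS’s actual spec format or output – here is how a mapping spec entered once might be rendered into SQL:

```python
# Hypothetical illustration of spec-driven code generation (not DWS's actual
# format or output): a mapping spec entered once is rendered into SQL.
spec = {
    "target": "dw.customer_dim",
    "source": "stg.customers",
    "columns": {                      # target column -> source expression
        "customer_name": "UPPER(TRIM(cust_nm))",
        "customer_city": "city",
    },
}

def generate_sql(spec):
    """Render a simple INSERT..SELECT from the mapping spec."""
    cols = ", ".join(spec["columns"])
    exprs = ",\n       ".join(spec["columns"].values())
    return (
        f"INSERT INTO {spec['target']} ({cols})\n"
        f"SELECT {exprs}\n"
        f"FROM {spec['source']};"
    )

print(generate_sql(spec))
```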

Here are some of the benefits:

For Architects:

> We ensure consistency of implementation across the entire project and adherence to architectural requirements

For Developers:

> The automation of mundane tasks
> The generation of staging jobs
> Automated surrogate key generation (see the sketch after these lists)
> Automated Referential Integrity validation (also sketched below)
> Automated History Management
> Automated Error Handling
> Impact Analysis

For Analysts:

> We ensure generated code fully corresponds to the business logic
> Fast turnaround time
> Facilitation of data mapping activities

For Managers:

> Shorter project timelines
> Lower-cost deployments
> Documentation generation
> Inventory of Feeds
> Hardware and Software Utilization Metrics
> Data Lineage Reporting
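
To make two of the developer benefits above concrete, here is a hypothetical sketch of the kind of SQL such tools generate for surrogate keys and referential integrity. This is illustrative only – not code produced by Data Warehouse Studio – and every table, column, and sequence name is invented:

```python
# Hypothetical sketches of two tasks such tools automate (illustrative SQL
# only; all table, column, and sequence names are invented).

# Surrogate key generation: assign warehouse keys independent of source keys.
surrogate_key_sql = """
INSERT INTO dw.customer_dim (customer_sk, customer_nk, customer_name)
SELECT NEXT VALUE FOR dw.customer_sk_seq,  -- surrogate key from a sequence
       s.customer_id,                      -- natural key from the source
       s.customer_name
FROM stg.customers s
LEFT JOIN dw.customer_dim d ON d.customer_nk = s.customer_id
WHERE d.customer_nk IS NULL;               -- only rows not already keyed
"""

# Referential integrity validation: route fact rows with unknown dimension
# keys to an error table instead of failing (or silently losing) the load.
ri_check_sql = """
INSERT INTO etl.error_rows (source_row_id, error_reason)
SELECT f.row_id, 'unknown customer_id'
FROM stg.sales f
LEFT JOIN dw.customer_dim d ON d.customer_nk = f.customer_id
WHERE d.customer_nk IS NULL;
"""

print(surrogate_key_sql)
print(ri_check_sql)
```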

Please email Scott Davidson below if you would like to receive a two-week trial license and/or a demo.

Scott Davidson
Gamma Systems

For a real-life customer success story, please check out this link:

Is Your Data Getting Better or Worse?

For those of us who have been in the industry over the past 25 years, we have seen a number of advancements in data hygiene applications and techniques. Regardless of whether you are old or new to the industry – what data hygiene software best suits your current needs today?

(LOL: Even the term “data hygiene” is old school… today we use more robust terms like Data Management, Data Quality, Enterprise Data Quality, Master Data Management, Data Governance, etc.)

Dots On A Map Provide Unique Insights Into Data Quality

This was a presentation I originally prepared back in 2005, but it is probably even more applicable in 2009 given the impact a GIS tool can have on visualizing data quality – customer addresses on a map! The next time you conduct a customer “data” assessment, try this! You can also see a high-level data profile I prepared for this trade area of specific customers.
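
If you want to try the map idea yourself, here is a minimal sketch using Python and matplotlib; the coordinates are invented, and in practice they would come from geocoding your customer addresses:

```python
import matplotlib.pyplot as plt

# Invented sample: geocoded customer addresses (longitude, latitude) plus a
# flag from an address-quality check. Real data would come from your geocoder.
good = [(-87.63, 41.88), (-87.65, 41.90), (-87.62, 41.85)]
bad  = [(-87.90, 41.70), (-88.10, 42.10)]   # failed validation / out of trade area

plt.scatter(*zip(*good), c="green", label="Validated address")
plt.scatter(*zip(*bad), c="red", marker="x", label="Failed validation")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Customer addresses by data-quality status")
plt.legend()
plt.show()
```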

What Different Routines Do You Consider Important When “Data Profiling” In Order To Reveal The Quality Of Information In A Data Source?

There are several different types of data quality tools in the marketplace today that essentially do one important thing – cleanse, validate, correct, and enhance your data.

In order to better understand what the “quality expectation” is for YOUR CLIENT, a baseline (or scorecard) must be established for each source system. Data profiling is an ideal way to reveal the results and share them with others in order to make an informed decision and rank your findings.
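
As one example of the kind of routine I have in mind, here is a short profiling sketch in Python using pandas (the sample data is invented) that captures a per-column baseline – completeness, cardinality, and sample values – which you can roll up into a scorecard:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Baseline profile per column: completeness, cardinality, sample values."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "pct_null": round(s.isna().mean() * 100, 1),   # completeness
            "distinct": s.nunique(),                       # cardinality
            "sample": s.dropna().unique()[:3].tolist(),    # quick eyeball check
        })
    return pd.DataFrame(rows)

# Invented sample source data
df = pd.DataFrame({"state": ["IL", "il", None, "XX"],
                   "zip": ["60601", "606", None, "60614"]})
print(profile(df))
```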

Address Quality Extends Beyond CASS and NCOA

Several consultants have asked me over the last 6–8 months why their respective firms don’t just build their own address management solution (coded from scratch!) and purchase/license the CASS, NCOA, etc. content directly from the USPS.

My answer in short is “let me tell you some reasons why not”.

1.) First and most important: many of the well-known original suppliers of postal coding solutions in place today had to go through (and still go through) rigorous certification processes to ensure their software and subsequent updates (versions) continue to comply with USPS guidelines. (Note: Many of these well-known suppliers have been around since the early 1980s, when I began my career at Metromail. Now, that’s old. LOL)

2.) These same vendors have had years of “bug” reports from their respective client bases (2,000+ clients in some cases), which best positions them to minimize risk for each “new license” of their products and services sold.

3.) The “people” behind the design, creation, and ongoing development of these existing postal products and services have 15+ years (minimum) of experience in the industry, rather than a newly formed team with little or no knowledge of this process.

4.) The barrier to entry for a new “postal coding” engine in 2009 is hard to envision overcoming (with the exception of a new add-on service you may want to bolt on to an existing postal coding engine). But, that’s my opinion.

In summary, my advice is to create some kind of exception process, or a client-specific data governance process (or standard), on top of an existing vendor solution that already has an established relationship with the USPS.
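
Here is a sketch of what such an exception process might look like in Python; the vendor_validate function is a stand-in for whatever certified API your vendor provides, and its name and return shape are assumptions:

```python
# Hypothetical exception process layered on a vendor engine. The
# vendor_validate() call is a stand-in - its name and return shape are
# assumptions, not a real vendor API.

def vendor_validate(address):
    """Placeholder for a call to a USPS-certified vendor engine."""
    raise NotImplementedError("replace with your vendor's API call")

def process_address(address, exception_queue):
    try:
        result = vendor_validate(address)
    except Exception as exc:                  # vendor call failed outright
        exception_queue.append({"address": address, "reason": str(exc)})
        return None
    if not result.get("deliverable", False):  # client-specific governance rule
        exception_queue.append({"address": address, "reason": "undeliverable"})
        return None
    return result                             # standardized, certified address
```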

Here is a good example of one software supplier who exemplifies several of my points above:

GrayHair Software, Inc. goes beyond CASS/NCOA as major sources to power their address quality (best practice) offering, which includes other alliances like the UAA Clearinghouse.

Let me explain further:

Here is a brief excerpt from an article published last year by GrayHair Software, Inc. Note: One of the executives at GrayHair is a past work associate of mine – Raymond Chin, Vice President of Product Management & Development. (See point #3 above.)

Hold that thought, and read about how providers are enhancing their traditional “postal” offerings to expand beyond traditional USPS CASS and NCOA content!

Publication: Business Wire
Date: Wednesday, April 9, 2008

GrayHair Software, Inc. and UAA Clearinghouse today introduced the most comprehensive set of offerings for managing Address Quality and reducing Undeliverable-As-Addressed (UAA) mail. By using source data from the USPS® next-generation Intelligent Mail® Barcode with change-of-address data from publishing and telecommunications organizations, best-of-breed solutions are now available for suppressing and/or redirecting addresses.

This will improve responses and reduce the cost of business mailings, thus enhancing the return-on-investment of direct mail programs, and making a significant contribution to the bottom line.

The article goes on… you can read the rest by going to:

(End of article excerpt.)

In summary, consultants: “do your research”… find existing companies like GrayHair Software to support your basic client needs (with confidence) plus any other unique requirements.

Additional note:

Address quality best practices today provide more “value” than just saving postage ($$$) and improving the deliverability of a piece of mail… as in days past.

The benefit of good address quality (best practices) is also a big plus for customer data integration (CDI) initiatives… resulting in improved customer match/merge/link/search scenarios, especially in customer (MDM) hubs where clients today are centralizing disparate customer data sources across the enterprise into a single view of a customer.
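
Here is a minimal sketch of why standardized addresses help match/merge; the records and the crude normalization are invented, and a certified address engine would do the standardization step properly:

```python
# Minimal sketch of how standardized addresses improve match/merge in a
# customer hub: records from disparate sources link on a normalized key.
# The normalization here is deliberately crude and for illustration only.

def address_key(name, address):
    return " ".join(f"{name} {address}".lower().replace(".", "").split())

sources = [
    ("CRM",     "Ann Lee",  "10 Elm St."),
    ("Billing", "Ann  Lee", "10 Elm St"),
    ("Web",     "Bob Cole", "9 Pine Ave"),
]

hub = {}
for system, name, addr in sources:
    hub.setdefault(address_key(name, addr), []).append(system)

for key, systems in hub.items():
    print(f"{key!r}: linked from {systems}")   # Ann Lee links CRM + Billing
```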

To my fellow consultants… and postal software vendors… you are welcome to add your own comments or share with us your unique “product” differentiators.

Enterprise Data Quality Blog

Here is the link to the most recent article on my Enterprise Data Quality blog:

Premier-International’s EPACTL Tool (Applaud)

Premier-International is based in Chicago and offers software and consulting services:

What is Applaud?

Applaud is the only “EPACTL” tool – a single software product with integrated tools to extract, profile, analyze, cleanse, transform, and load data.

EPACTL is a new breed of software that provides integrated tools to accomplish all requirements of data quality and data migration/consolidation projects.
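
For a feel of what “integrated” means here, the following is an illustrative sketch only – not Applaud’s API – showing the extract-through-load flow as plain Python functions:

```python
# Illustrative only - not Applaud's API. The point is the integrated flow:
# extract -> profile/analyze -> cleanse -> transform -> load in one place
# (profile and analyze are folded into one step in this sketch).

def extract(source):
    return list(source)                           # pull raw records

def profile(records):
    missing = sum(1 for r in records if not r.get("name"))
    print(f"profile: {len(records)} records, {missing} missing names")
    return records

def cleanse(records):
    return [r for r in records if r.get("name")]  # drop rows failing a rule

def transform(records):
    return [{**r, "name": r["name"].title()} for r in records]

def load(records, target):
    target.extend(records)                        # write to the target store

warehouse = []
raw = [{"name": "ann lee"}, {"name": None}]
load(transform(cleanse(profile(extract(raw)))), warehouse)
print(warehouse)                                  # [{'name': 'Ann Lee'}]
```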

After reviewing the website, here are some of the key service offerings I would like to share, taken directly from their website to avoid misrepresentation:

1.) Data Migration and Data Conversion – Migrating data from legacy systems to a new replacement system.

2.) Data Consolidation – Consolidating data from multiple instances of the same system or multiple disparate systems.

3.) Data Cleansing – Cleansing data and supporting data quality initiatives.

4.) Data Quality Audits – Performing data quality audits.

5.) Data Integration – Constructing interfaces between on-going systems.

6.) Data Management for IT – Building customized data management solutions.

7.) Data Management for Employee Benefits – Delivering customized data management solutions for employee benefit consultants and actuaries.

8.) Rapid Application Development – Using Applaud’s RAD tools to deliver dynamic system solutions fast.

If you want to learn more about Applaud and Premier International, visit…

If there are any readers out there who have knowledge about Premier-International or Applaud, please feel free to comment.