Data Profiling and Best Practices

33 downloads 24 Views 197KB Size Report
“[Data profiling] employs analytic methods for looking at data for the purpose ... A good data profiling [system] can process very large amounts of data, and with ...

Data Profiling:

A Practitioner‟s approach using Dataflux “[Data profiling] employs analytic methods for looking at data for the purpose of developing a thorough understanding of the content, structure, and quality of the data. A good data profiling [system] can process very large amounts of data, and with the skills of the analyst, uncover all sorts of issues that need to be addressed.” - Data Quality: The Accuracy Dimension, Jack E. Olson

Author: Satya P Vedula

dbSpeak DBs peak when we speak

What is Data Profiling? “Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data.” – Wikipedia What-if Data Profiling is not done? The Data Warehousing Institute (TDWI) estimates that poor quality customer data costs U.S. businesses a staggering $611 billion a year in postage, printing, and staff overhead.1 “Oh, I hate it when it does that. Just enter „1-2-3‟ for the address and all nines for the phone and keep going to the next screen.” – A sales associate Poor data quality leads to high tech failures. Here are few as quoted by TDWI,  A large bank discovered that 62 percent of its home equity loans were being calculated incorrectly, with the principal getting larger each month  A telecommunications firm lost $8 million a month because data entry errors incorrectly coded accounts, preventing bills from being sent out. Why Data Profiling? Data profiling is the first step to ensure a successful data intensive project.  It‟s the most practical way to learn about database quality and its contents and to effectively track data quality  to achieve the goal of reliable data by making it a single truth of data  to assess the risk involved in integrating data for new applications  to assess if metadata accurately describes the actual values in the source database  to understand challenges early in data intensive project,  for enterprise view of all data such as, Master Data Management, or Data governance  to determine data cleansing and build related routines How Data Profiling is done?   

1

Structured discovery: Perform a structure analysis of all tables by examining data complete columns. Various techniques that could be used include metadata validation, pattern matching, and basic statistics. Content discovery: Once structured analysis is complete apply matching technologies like standardization, frequency counts & outliers detection, business rules validation , Employ statistical techniques Relationship discovery: As today‟s data warehouses contain massive amounts of data covering various subject matters; understanding the data relationships (for example customer in claims against customer in policy) and relating them results in better business decisions

TDWI estimate based on cost-savings cited by survey respondents and others who have cleaned up name and address data, combined with Dunn & Bradstreet counts of U.S. businesses by number of employees

Data profiling in Practice: Using a Tool Gartner Report, June 2007 shows there are only two leading 3rd party players in market for data profiling, DataFlux and Trillium Software. This paper shows how to leverage DataFlux as a tool for Data Profiling activities. Data profiling in practice has three distinctive phases,  Initial Profiling: Perform the initial data profiling and do the data assessment  Integration & Automate: Integrate the profiling and automate the profiling process to pro-actively monitor changing data  Report the results: Pass on the profiling results to the Business users, data architects, and developers to act on For better quality control, data profiling needs to perform following audits 2, Audit Type Domain checking Range checking Cross-field verification Address format verification Name standardization Reference field consolidation Format consolidation Referential integrity Basic statistics, frequencies, ranges, and outliers Duplicate identification Uniqueness and missing value validation Key identification Data rule compliance

2

Source from DataFlux

Example In a gender field, the value should be M or F. For age, the value should be less than 125 and greater than 0. If a customer orders an upgrade, then make sure that the customer already owns the product. If “Street” is the designation for street, then make sure no other designations are used. If “Robert” is the standard name for Robert, then make sure that Bob, Robt and Rob are not used. If “GM” stands for “General Motors,” then make sure it does not stand for “General Mills” elsewhere. Make sure that date information is stored as yyyymmdd in each applicable field. If an order shows that a customer bought product XYZ, then make sure that there actually is a product XYZ. If an organization has products that cost between 1000 and 10000 dollars, you can run a report for product prices that are not in this range. You can also view product information, such as SKU codes, to find out if the SKU groupings are correct and in line with the expected frequencies. If an inactive flag is used to identify customers that are no longer covered by health benefits, then make sure all duplicate records are also marked inactive. If UPC or SKU codes are supposed to be unique, and then make sure they are not being reused. If there is a defined primary key/foreign key relationship across tables, then validate it by looking for records that do not have a parent. If closed credit accounts must have a balance of 0, then make sure there are no records where the closed account flag is true and the account balance total is greater than 0.

Best Practices in using Dataflux for Profiling & Testing Dataflux has many capabilities, including data testing, automation using Data Integration Server in addition to the Data Profiling. Best practices listed here in using Dataflux are not just limited to Data Profiling, but to the entire data testing capabilities of DataFlux. Preparation  Gather source and target system documentation  Compiling logons and passwords into existing metadata repository Additional tools, setup & connectivity  Create necessary DSN using appropriate odbc drivers  For DB2 use DB2 Wired Protocol instead of native drivers  Prepare macros for source, target DSNS using connection strings into architect.cfg  Use macros for input and output files.  Identify and categorize sources and targets by the way a tool connects, Relational, file connection etc. As required identify and save the file layouts  Use a versioning system to store architect job, source/target file layouts, copybooks, xml layouts, and sample data (in files)  The connection names on Windows and UNIX must correspond for Data Integration Server and Remote Job to work properly. Team Training, Roles & Responsibilities  Prepare reports/deliverables from the profiled data  Familiarity with other tools like Microsoft Excel to dynamically parse and reanalyze data  Identify the lead to participate in issues, scope, and progress  Identify data analyst to look for data anomalies  Identify business analyst to look for business rules violation Decide the approach  Start profiling by Subject area or Physical structure  Identify how architect jobs will be automated on Integration Server Extract, Load & Transform  Create extract programs/sql to fetch data from source and target systems  Code Page compatibility: When source and Target databases use different code pages certain characters (CHAR, VARCHAR, and TEXT data types) are translated differently. Check with ETL tool used (eg: Informatica PowerCenter) for code page compatibility issues. ( Known problem exists converting certain characters from Sybase, Mainframe DB2 and certain flat files having accent characters to Teradata)

 

Check for Unicode support, where needed. Dataflux doesn‟t support VSAM tables on Mainframe with shipped ODBC drivers. Check if Data Direct Shadow drivers are needed.

Dataflux doesn‟t support COMP-3 clauses (packed decimal) in Cobol copybooks well  Dataflux has open issues handling xml files larger than 2 GB. Make sure these limits are not reached.  Push the “key creation” logic, and other formatting to the database being queried (into sql)  DB2 allows a time of 24:00:00; however this is not valid in most of the other Relational Databases including Teradata and Oracle. Make necessary format changes to handle this time.  ETL tool, Informatica ignores/truncates milliseconds during transformation process. This might result in invalid results while checking  Identify fields that need to be split/separated into various parts for analysis Sampling the data  Dataflux can determine the layout of an input flat file. A good sample gives better file layout. A sampling of 100-500 rows might be sufficient for this purpose.  Large free form text fields may not be benefited by data profiling. However, first few characters (20 – 50 characters) contain cryptic codes. Make sure, they are analyzed separately.  Split or merge fields containing telephone numbers, addresses, and names prior to load (i.e. push to sql)  When a table has a combination of columns as its key, its desirable to combine them into a single column prior to load (as part of sql)  Once the architect job is executed, plan for its input and output filesto be moved into a different location, to free up server for space. Analysis While data is better understood by Subject Matter Experts (SMEs), following statistical approach may benefit a better understanding by data analysts  Use the profiled data for sanity check of data, by comparing with prior measurements / metrics collected.  Check if Records / Counts / Null Count match  Check if the high/low values are appropriate in Min and Max fields.  For a table level matching between Source and Target databases, a Sum on money columns should be matched  Identify the most used values in min and max values. These are good candidates for enabling compression  Use USPS and Geocode database for validating addresses, phone numbers etc  Use chop tables for logically creating sub-string elements from input data  Use Grammar is a set of rules that represent expected patterns of words in a given context.  Use phonetic library for analysis (Phonetics) during the process of generating match codes. Eg: matching “SCHMIDT” and “SCHMITT”  Use Reg-Ex libraries for normalization, standardization, and other input string pre-processing activities. 