Data Profiling: Attribute level Checks

 
     
  By Colin Ross
 
   
     
  Before starting any data initiative you need to understand your data. Data Profiling is about building that understanding, using a set of simple and quick checks.



Here we’re going to focus on attribute-level checks, which tend to form the bulk of any Data Profiling exercise. In part this is because most data profiling is performed at the level of a single dataset or table, but also because this attribute-level analysis is fairly simple and can uncover some pretty significant data issues very quickly.

Before going any further, we ought to say that an attribute is simply a single column in your data, perhaps a date of birth, a postcode, or a transaction date.

Anyway, there are a number of checks which you should look to apply to any attribute of interest to your project.

• How much data do you have, specifically, how many rows in your dataset?
• What is the data type, length (and precision of numbers if relevant)
• What are the Min and Max values? Are these reasonable? For example, are there any dates of birth, or transactions, in the future?
• How many Nulls are these? If there are very few Nulls maybe this is an issue. On the other hand, if there are lots of Nulls, is there an implied meaning which you need to be aware of? For example, does a Null gender imply a corporate rather than a personal client?
• Are there any missing values (empty strings or unexpected zeros)?
• How many unique values does the attribute have (its cardinality)?
• How does the cardinality compare to the total number of rows in your dataset? A relatively low cardinality would suggest codes and status values, while a very high cardinality (near identical to the row count) would suggest a possible key.
• For those attributes with a low cardinality, document all values present in the dataset with a frequency distribution (often termed a List of Values, or LoV). Are there any clear patterns or popular values?
• Are there any outliers or values present only a handful of times?
• Is there any value which particularly stands out? Is there any chance that this is the default value offered by the application or that this is a “magic number” used by the business?
• Are there any codes which seem very similar to popular codes? For example, ‘00’ versus ‘OO’.
• Are there any codes where you can infer meaning, e.g. “Male”, and “Female”? Are there codes present which seem inconsistent with such meaning, e.g. “Y” and “N”? How can these be explained? With old systems there is always a risk that the meaning of codes has changed. Or perhaps the data is being sourced from more than once source (ecommerce and telesales) and the codes may be inconsistent.
• Do the rest of the codes look consistent? Are there as many as you might expect? For example US states or for European countries?
• Is the same meaning represented by multiple codes? For example “UK”, “United Kingdom”, “Britain”.
• Does there appear to be overloading of an attribute’s meaning? E.g. if a flag is set does this attribute mean something else? Classic cases may be the use of forename for company name.
• If a set of attributes seem to belong together, are there other such sets, and do they have the same consistent structure? Maybe there are two sets of bank details, or two customers on the dataset. Do they all have the same attributes, with the same properties?
• If there are multiple sets of related attributes, is it clear which attributes relate to each other? E.g. Bank_account_1, sort_code_1…
• Are any attributes being used for other purposes? Memo fields with bits of other data such as account details or contact information? Address fields with directions or elements of the entity’s name?

In my experience even checks as simple as these will turn up issues in any dataset. For very little investment in time or resources, I can guarantee Data Profiling will uncover some interesting issues in your data.

 
   
  Article Source: http://interpret.zar.vg   
     
  About The Author
Citrus Technology provide Data Profiling and Data Quality tools to help you understand your data; to find patterns, issues and opportunities. Visit our website for a free white paper on Data Profiling and a free trial of our software.
 
     
 
More Articles about: Software
 
 
 
  • Another Method On How To Unlock Wii Homebrew Without Spending Much
  • From Excel 2003 to Excel 2007 (Part One)
  • Important Factors To Note About Dispatch Software
  • Making Markets Less Competetive With Good Software
  • XSitePro Takes The Complexity Out Of Building Your Website
  • MP3 to MP4 Converter Find the Most Reliable Audio Conversion Software Online
  • Hints And Tips On Using Manufacturing Software
  • The Many Advantages In Using The Journal App
  • Another Online Option To Unlock Wii Homebrew Without Spending Much
  • What is Day Trading
  • Trade Software Forex Success
  • Trusted Way To Unlock Wii Homebrew Without Spending Much
  • How To Use Live Chat Software To Market Your Website
  • 5 Benefits Of Business Voip
  • CRM Software And A Surge In New Business
  • Importance of Design in Online Project Management Software
  • World Wide Web Sources On How To Unlock Wii Homebrew
  • Another Way To Unlock Wii Homebrew Without Spending Much
  • Freedom From Free Microsoft Points
  • 4 Business Benefits Of Live Chat Software
  • 5 Ways Business Voip Can Enhance Your Business Life
  • Tips Of It Careers
  • 3 Simple Ways to Read PDF on Kindle 1 and Kindle Touch
  • What Is Supply Chain Management (SCM)?
  • How to Improve Game Performance
  •  
     
         
         
        © 2012 interpret.zar.vg