Data Profiling: Attribute level Checks

 
     
  By Colin Ross
 
   
     
  Before starting any data initiative you need to understand your data. Data Profiling is about building that understanding, using a set of simple and quick checks.



Here we’re going to focus on attribute-level checks, which tend to form the bulk of any Data Profiling exercise. In part this is because most data profiling is performed at the level of a single dataset or table, but also because this attribute-level analysis is fairly simple and can uncover some pretty significant data issues very quickly.

Before going any further, we ought to say that an attribute is simply a single column in your data, perhaps a date of birth, a postcode, or a transaction date.

Anyway, there are a number of checks which you should look to apply to any attribute of interest to your project.

• How much data do you have, specifically, how many rows in your dataset?
• What is the data type, length (and precision of numbers if relevant)
• What are the Min and Max values? Are these reasonable? For example, are there any dates of birth, or transactions, in the future?
• How many Nulls are these? If there are very few Nulls maybe this is an issue. On the other hand, if there are lots of Nulls, is there an implied meaning which you need to be aware of? For example, does a Null gender imply a corporate rather than a personal client?
• Are there any missing values (empty strings or unexpected zeros)?
• How many unique values does the attribute have (its cardinality)?
• How does the cardinality compare to the total number of rows in your dataset? A relatively low cardinality would suggest codes and status values, while a very high cardinality (near identical to the row count) would suggest a possible key.
• For those attributes with a low cardinality, document all values present in the dataset with a frequency distribution (often termed a List of Values, or LoV). Are there any clear patterns or popular values?
• Are there any outliers or values present only a handful of times?
• Is there any value which particularly stands out? Is there any chance that this is the default value offered by the application or that this is a “magic number” used by the business?
• Are there any codes which seem very similar to popular codes? For example, ‘00’ versus ‘OO’.
• Are there any codes where you can infer meaning, e.g. “Male”, and “Female”? Are there codes present which seem inconsistent with such meaning, e.g. “Y” and “N”? How can these be explained? With old systems there is always a risk that the meaning of codes has changed. Or perhaps the data is being sourced from more than once source (ecommerce and telesales) and the codes may be inconsistent.
• Do the rest of the codes look consistent? Are there as many as you might expect? For example US states or for European countries?
• Is the same meaning represented by multiple codes? For example “UK”, “United Kingdom”, “Britain”.
• Does there appear to be overloading of an attribute’s meaning? E.g. if a flag is set does this attribute mean something else? Classic cases may be the use of forename for company name.
• If a set of attributes seem to belong together, are there other such sets, and do they have the same consistent structure? Maybe there are two sets of bank details, or two customers on the dataset. Do they all have the same attributes, with the same properties?
• If there are multiple sets of related attributes, is it clear which attributes relate to each other? E.g. Bank_account_1, sort_code_1…
• Are any attributes being used for other purposes? Memo fields with bits of other data such as account details or contact information? Address fields with directions or elements of the entity’s name?

In my experience even checks as simple as these will turn up issues in any dataset. For very little investment in time or resources, I can guarantee Data Profiling will uncover some interesting issues in your data.

 
   
  Article Source: http://interpret.zar.vg   
     
  About The Author
Citrus Technology provide Data Profiling and Data Quality tools to help you understand your data; to find patterns, issues and opportunities. Visit our website for a free white paper on Data Profiling and a free trial of our software.
 
     
 
More Articles about: Software
 
 
 
  • The Benefits of Using Online Timesheet Software
  • Choose Fee based Services When Asking How To Unlock Wii
  • Price Comparison Software From Zebra Web Design
  • How Advances In Writing Software Have Made Dictation Easier
  • New Complete Methods On How To Unlock Wii This Year
  • Downloading And Printing Documents From The Internet Joint Business And Consumer Responsibility
  • Selecting The Best 2012 Method On How To Unlock Wii
  • Project Management Software And Project Scheduling
  • New Complete Instructions On How To Unlock Wii Homebrew This 2012
  • E guides On How To Unlock Wii Homebrew This Year
  • Hunting For Trusted Electronic Guides That Will Teach How To Unlock Wii Homebrew
  • Internet Ways On How To Unlock Wii Homebrew Today
  • Improve Your Trade Techniques With A Trading Software
  • Buying IT Helpdesk Software : All You Need To Know In A Nutshell
  • Another Method On How To Unlock Wii Homebrew Without Spending Much
  • From Excel 2003 to Excel 2007 (Part One)
  • Important Factors To Note About Dispatch Software
  • Making Markets Less Competetive With Good Software
  • XSitePro Takes The Complexity Out Of Building Your Website
  • MP3 to MP4 Converter Find the Most Reliable Audio Conversion Software Online
  • Hints And Tips On Using Manufacturing Software
  • The Many Advantages In Using The Journal App
  • Another Online Option To Unlock Wii Homebrew Without Spending Much
  • What is Day Trading
  • Trade Software Forex Success
  •  
     
         
         
        © 2012 interpret.zar.vg