Data Profiling: Attribute level Checks |
||||
| By Colin Ross |
||||
| Before starting any data initiative you need to understand
your data. Data Profiling is about building that
understanding, using a set of simple and quick checks. Here we’re going to focus on attribute-level checks, which tend to form the bulk of any Data Profiling exercise. In part this is because most data profiling is performed at the level of a single dataset or table, but also because this attribute-level analysis is fairly simple and can uncover some pretty significant data issues very quickly. Before going any further, we ought to say that an attribute is simply a single column in your data, perhaps a date of birth, a postcode, or a transaction date. Anyway, there are a number of checks which you should look to apply to any attribute of interest to your project. • How much data do you have, specifically, how many rows in your dataset? • What is the data type, length (and precision of numbers if relevant) • What are the Min and Max values? Are these reasonable? For example, are there any dates of birth, or transactions, in the future? • How many Nulls are these? If there are very few Nulls maybe this is an issue. On the other hand, if there are lots of Nulls, is there an implied meaning which you need to be aware of? For example, does a Null gender imply a corporate rather than a personal client? • Are there any missing values (empty strings or unexpected zeros)? • How many unique values does the attribute have (its cardinality)? • How does the cardinality compare to the total number of rows in your dataset? A relatively low cardinality would suggest codes and status values, while a very high cardinality (near identical to the row count) would suggest a possible key. • For those attributes with a low cardinality, document all values present in the dataset with a frequency distribution (often termed a List of Values, or LoV). Are there any clear patterns or popular values? • Are there any outliers or values present only a handful of times? • Is there any value which particularly stands out? Is there any chance that this is the default value offered by the application or that this is a “magic number” used by the business? • Are there any codes which seem very similar to popular codes? For example, ‘00’ versus ‘OO’. • Are there any codes where you can infer meaning, e.g. “Male”, and “Female”? Are there codes present which seem inconsistent with such meaning, e.g. “Y” and “N”? How can these be explained? With old systems there is always a risk that the meaning of codes has changed. Or perhaps the data is being sourced from more than once source (ecommerce and telesales) and the codes may be inconsistent. • Do the rest of the codes look consistent? Are there as many as you might expect? For example US states or for European countries? • Is the same meaning represented by multiple codes? For example “UK”, “United Kingdom”, “Britain”. • Does there appear to be overloading of an attribute’s meaning? E.g. if a flag is set does this attribute mean something else? Classic cases may be the use of forename for company name. • If a set of attributes seem to belong together, are there other such sets, and do they have the same consistent structure? Maybe there are two sets of bank details, or two customers on the dataset. Do they all have the same attributes, with the same properties? • If there are multiple sets of related attributes, is it clear which attributes relate to each other? E.g. Bank_account_1, sort_code_1… • Are any attributes being used for other purposes? Memo fields with bits of other data such as account details or contact information? Address fields with directions or elements of the entity’s name? In my experience even checks as simple as these will turn up issues in any dataset. For very little investment in time or resources, I can guarantee Data Profiling will uncover some interesting issues in your data. |
||||
| Article Source: http://interpret.zar.vg | ||||
| About The Author Citrus Technology provide Data Profiling and Data Quality tools to help you understand your data; to find patterns, issues and opportunities. Visit our website for a free white paper on Data Profiling and a free trial of our software. |
||||
|
||||
| © 2012 interpret.zar.vg |