Mining in profiling for data warehouse

by / Tuesday, 01 September 2015 / Published in Billet

Mining basic statistics

Attribute profiling produces large volumes of information. As a minimum it can be used as a source of reference information during data mapping, quality assessment, and cleansing. It can also be used to systematically update data catalogues. However, the main purpose is to gain knowledge of various unexpected data abnormalities. We must know what to look for in the attribute profiles.

Arguably the most important piece of information in the basic statistics summary is count of NULL values. Some attributes have Null values for all attributes and thus can be ignored. Others may have a few Null values indicating that they are required and the missing values are erroneous. Yet others may have large portion of Null values indicating that they are either optional or largely missing and cannot be used as a good source of information.

Further, some attributes have strange minimum or maximum values suggesting defaults. Count of distinct values is also useful. A single value in all records indicates useless attribute, while a combination of a single value with a number of Null’s suggests the attribute is just a flag.

Mining frequency charts

Frequency charts are the main source of useful metadata. The most basic step is to cross-reference frequency charts with valid values listed in data catalogues or data lookup tables. Values missing in the catalogue can point to the fact that the catalogue is incomplete or obsolete, though they also may identify erroneous values. Values missing in profile (or appearing with unexpectedly low frequency) may indicate the data collection process that misses certain information. Of course data mapping that relies on the data catalogue and assumes that all values are collected will lead to bad data conversion.

Another technique is to look for any values with unusually high frequency. These are likely to be default substitutions for missing values.

Mining distribution charts (value clustering)

Distribution charts are more difficult to analyze because they contain more information, but they still can lead to important findings.

One clear pattern to be looked for is value clustering, which occurs when distribution of attribute values falls into 2 or more clusters. Value clustering is very common when single data field is used to store different logical attributes, such as weekly vs. annual pay rate.

Mining distribution charts (other distribution patterns)

Another common pattern is presence of multiple peaks and troughs. A pick in values may have special meaning. This could be indicative of a large acquisition in that timeframe. This in its turn may suggest a need to look for more data about this event in other parts of the database.

Some distributions have long tails. The actual values will concentrate in a small subset of the range suggested by the minimum and maximum values.


Others articles

State transition model 








state-transition timeline profiling Analyzing Mining Mining Profiling relationship profiling

Subject profiling Profiling


Get Free Email Updates!

Signup now and receive an email once We publish new content.

We will never give away, trade or sell your email address. You can unsubscribe at any time.

Please follow and like us:
David-Marc Petit

David-Marc Petit

Président at DWBI Expert
David-Marc PETIT est le président de DWBI Expert Inc. Il cumule plus de 20 ans d’expérience dans des entreprises de toutes tailles et tous secteurs, sur trois continents, en tant qu’expert en intelligence d’affaires. Il a fait de sa mission la démocratisation de l’intelligence d’affaires (Business Intelligence) pour optimiser les revenus et la performance de ses clients.
David-Marc Petit
David-Marc Petit

Latest posts by David-Marc Petit (see all)

Leave a Reply


Enjoy this website? Please spread the word :)