Mining in profiling for data warehouse
Mining basic statistics
Attribute profiling produces large volumes of information. As a minimum it can be used as a source of reference information during data mapping, quality assessment, and cleansing. It can also be used to systematically update data catalogues. However, the main purpose is to gain knowledge of various unexpected data abnormalities. We must know what to look for in the attribute profiles.
Arguably the most important piece of information in the basic statistics summary is count of NULL values. Some attributes have Null values for all attributes and thus can be ignored. Others may have a few Null values indicating that they are required and the missing values are erroneous. Yet others may have large portion of Null values indicating that they are either optional or largely missing and cannot be used as a good source of information.
Further, some attributes have strange minimum or maximum values suggesting defaults. Count of distinct values is also useful. A single value in all records indicates useless attribute, while a combination of a single value with a number of Null’s suggests the attribute is just a flag.
Mining frequency charts
Frequency charts are the main source of useful metadata. The most basic step is to cross-reference frequency charts with valid values listed in data catalogues or data lookup tables. Values missing in the catalogue can point to the fact that the catalogue is incomplete or obsolete, though they also may identify erroneous values. Values missing in profile (or appearing with unexpectedly low frequency) may indicate the data collection process that misses certain information. Of course data mapping that relies on the data catalogue and assumes that all values are collected will lead to bad data conversion.
Another technique is to look for any values with unusually high frequency. These are likely to be default substitutions for missing values.
Mining distribution charts (value clustering)
Distribution charts are more difficult to analyze because they contain more information, but they still can lead to important findings.
One clear pattern to be looked for is value clustering, which occurs when distribution of attribute values falls into 2 or more clusters. Value clustering is very common when single data field is used to store different logical attributes, such as weekly vs. annual pay rate.
Mining distribution charts (other distribution patterns)
Another common pattern is presence of multiple peaks and troughs. A pick in values may have special meaning. This could be indicative of a large acquisition in that timeframe. This in its turn may suggest a need to look for more data about this event in other parts of the database.
Some distributions have long tails. The actual values will concentrate in a small subset of the range suggested by the minimum and maximum values.
- 4State-transition model profiling examines life...
- Timeline profiling looks for patterns in histor...
- Analyzing profiling results Data profiles provi...
- Mining basic statistics Attribute profiling pro...
- Attribute profiling examines values of individu...