Wednesday, December 8, 2010

Horizontal Format Data Mining with Extended Bitmaps

Abstract
Analysing the data warehouses to foresee the patterns of the transactions often needs high computational power and memory space due to the huge set of past history of the data transactions. Apriori algorithm is a mostly learned and implemented algorithm that mines the data warehouses to find the associations. Frequent item set mining with vertical data format has been proposed as an improvement over the basic Apriori algorithm.
In this paper we are proposing an algorithm as an alternative to Apriori algorithm, which will use bitmap indices in conjunction with a horizontal format data set converted to a vertical format data structure to mine frequent itemsets leveraging efficiencies of bitmap based operations and vertical format data orientation.

Categories and Subject Descriptors
[Data Mining] Association Rule, Apriori, Bitmap Indices.
[Data Analysis] Data warehousing, Data Analysis.
General Terms - Data Analysis and Mining
Keywords - Data mining, Association Rule, Apriori, Vertical format mining, Bitmap Indices



Here we are proposing an algorithm "Horizontal Format Data Mining with Extended Bitmaps," for the Association Rule Mining. First we will have a look into the association rule mining and the roots of our algorithm. What is association rule mining? Finding interesting relationships between the variables is defined as Association Rule Mining. Association rule mining is often explained by market basket analysis, where the customers' purchase details are analyzed to find interesting relationships between the items. Here we find the variable sets which appear together. This is defined as Frequent Itemsets, and it is an interest of research due to its expensiveness.

Apriori Algorithm is a fundamental algorithm for the association rule mining. This mines the frequent patterns that are presented in a horizontal format, where the items are listed against their respective transaction. Apriori algorithm abides to the apriori property - any subset of the frequent itemsets is frequent. Each pass should go through the whole data set in Apriori algorithm. Hence it is not an optimized algorithm. Many improvements are suggested to the Apriori Algorithm.

Transaction data mostly occur in horizontal format, where vertical format is an alternative way of looking into it. Here the transactions are listed against the respective items. Since data may not appear in this format, we may need to re-organize the data into the vertical format, before mining them for the associations. Many effective algorithms are built on top of the vertical format data mining.

Now let's look at the next interesting terminology of our algorithm - Bitmaps. Bitmaps are used to store individual bits compactly. It's 0's and 1's where 1's depict the existence. Major advantage of using bitmaps is the possibility of effectively exploiting the bit-level parallelism. We have seen the vertical data formats and the bitmaps. Now we have a question. Is it possible to grab the benefits from both the vertical format representation and the bitmap operations to find frequent itemsets in a distributed environment?

Here we propose the algorithm 'Horizontal Format Data Mining with Extended Bitmaps'. The algorithm takes the data set organized in horizontal format. With one pass of the data set, we construct a bit map based data structure. The bit map structure will be in the vertical format. This structure facilitates an efficient mining.

First we take the transaction id of T100. (T100, {I1, I2, I5}). We will mark the items that appear in T100 in the ordered array. At the same time we link the associated items to the ordered array. Hence I2 will be linked to I1 in the master array, while I5 will be linked to I2. I5 will also be linked to I2 in the ordered array. Linking I1 to I2 or I2 to I5 in the ordered array is avoided to prevent redundancy. I1, I2, and I5 are marked 1 to represent their existence, hence constructing their bitmaps.

Refer to the Slides for a simple explanation on the algorithm "Horizontal Format Data Mining with Extended Bitmaps" itself.

T - Average size of transaction (Transactions).
I - Average size of maximal potentially-large itemsets (Itemsets).
D - Number of Transactions (Datasets).


Image: http://en.wikipedia.org/wiki/File:Storeisle.png

No comments:

Post a Comment

You are welcome to provide your opinions in the comments. Spam comments and comments with random links will be deleted.