
Biclustering algorithms for Data mining in high-dimensional data

4.3 Efficient methods for bicluster mining

4.3.1 A novel way to mine closed patterns

In this section we propose a new method for mining closed patterns (i.e. frequent closed itemsets and biclusters) in data matrices with up to three values: -1, 0 and 1. The method extends the special binary case and is therefore applicable to both data types. The benefit of this kind of general approach has been demonstrated in [75] using gene expression data. The key benefit of the generalized method is the ability to distinguish between up- and down-regulated genes and thus to discover previously hidden closed patterns [75].

The proposed method consists of two procedures and one function to discover all frequent closed patterns:

• FCPMain is the main procedure (Algorithm 3). First the procedure takes the three input parameters (input data matrix and minimum support thresholds) before encoding the input matrix (A) into a smaller data structure (B) by taking only non-zero matrix values as follows:

$$B = (b_i), \quad \text{where } b_i = \{\, j : A(i, j) \neq 0 \,\} \tag{4.1}$$

Note that the transformation in Eq. 4.1 corresponds to the classical representation of transaction databases in frequent itemset mining problems, where $b_i$ represents the $i$th transaction.

The procedure then calls the recursive miner procedure FCPMiner independently for each row of B. Note that the parameter vector missingRows stores the indices of those rows that are not examined in the current call because they have been checked before. This is important because closedness is checked against these indices by the IsClosed function.

• FCPMiner is the heart of the method: it recursively builds up the frequent closed patterns (Algorithm 5). It takes the consecutive rows one by one and records only those column indices that show the same changing tendency (the same or exactly the opposite). The closedness of the candidate pattern is then checked before the method calls itself with the updated parameters. Finally, the newly discovered patterns are added to the output set of frequent closed patterns.

• IsClosed is a simple function that checks whether adding a new row index to the candidate pattern would result in a closed pattern (Algorithm 4). It does so by checking whether there is a row in missingRows that contains the same column indices with the same changing tendency as the pattern under examination. If no such row can be found, the pattern is closed.
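The encoding of Eq. 4.1 and the "same or exactly opposite" comparison used by FCPMiner can be illustrated with a minimal Python sketch. The function names `encode_rows` and `split_by_tendency` are hypothetical, indices are 0-based (the text uses 1-based indices), and the matrix is assumed to be a list of lists containing only -1, 0 and 1.

```python
from typing import List, Set, Tuple

Matrix = List[List[int]]  # assumption: entries are -1, 0 or 1


def encode_rows(A: Matrix) -> List[Set[int]]:
    """Eq. 4.1: for each row keep the set of columns holding a non-zero value."""
    return [{j for j, value in enumerate(row) if value != 0} for row in A]


def split_by_tendency(A: Matrix, ref_row: int, new_row: int,
                      cols: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Split cols into those where new_row agrees with ref_row and those
    where it is exactly opposite.

    A product of +1 means both entries are non-zero with the same sign,
    -1 means non-zero with opposite signs, 0 means at least one is zero.
    """
    same = {j for j in cols if A[ref_row][j] * A[new_row][j] == 1}
    opposite = {j for j in cols if A[ref_row][j] * A[new_row][j] == -1}
    return same, opposite
```

Columns where either entry is zero fall into neither set, which is why the encoded structure B only needs to store the non-zero positions.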

Algorithm 3 FCPMain: Main procedure for mining closed patterns
Require: A: input discrete matrix
         minrows: minimum number of rows in a frequent closed pattern
         mincols: minimum number of columns in a frequent closed pattern
Ensure: Y: list of all closed frequent patterns
global A, Y = {}, minrows, mincols, B
MissingRows = {}
Transform A (n × m) into data structure B
for every row R_i ∈ B where i = 1 ... (n − minrows) do
    if i > 1 then
        MissingRows = MissingRows ∪ {i − 1}
    end if
    if |R_i| ≥ mincols then
        if (i == 1) or IsClosed(MissingRows, R_i, i) then
            FCPMiner(MissingRows, {i}, R_i)
        end if
    end if
end for
return Y

Algorithm 4 IsClosed function
Require: missingRows: indices of previously examined rows (omitted)
         actualCols: column indices currently under examination
         actualRow: index of the row currently under examination
Ensure: boolean: is this candidate frequent pattern closed?
global A
for every index i in missingRows do
    if (A_{i,j} · A_{k,j} = 1 ∀j ∈ actualCols) or (A_{i,j} · A_{k,j} = −1 ∀j ∈ actualCols), where k = actualRow then
        return false
    end if
end for
return true
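A minimal Python sketch of the closedness test is given here as an illustration. It follows the convention of the prose description of IsClosed: the candidate is reported as closed exactly when no previously examined row shows the same, or exactly the opposite, tendency on all columns under examination. The name `is_closed`, the 0-based indexing and the list-of-lists matrix are assumptions of this sketch.

```python
from typing import Iterable, List, Set

Matrix = List[List[int]]  # assumption: entries are -1, 0 or 1


def is_closed(A: Matrix, missing_rows: Iterable[int],
              cols: Set[int], row: int) -> bool:
    """Return True if no row in missing_rows matches `row` on every column
    in `cols`, either with the same sign everywhere or with the opposite
    sign everywhere."""
    for i in missing_rows:
        products = {A[i][j] * A[row][j] for j in cols}
        if products == {1} or products == {-1}:
            # A previously examined row could extend the pattern,
            # so the candidate is not closed.
            return False
    return True
```

Collecting the products into a set makes the "for all j" condition explicit: the set equals {1} or {-1} only if every column gives the same non-zero product.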

Algorithm 5 FCPMiner procedure
Require: missingRows: indices of previously examined rows (omitted)
         candidateRows: set of row indices in a candidate closed frequent pattern
         actualCols: column indices currently under examination
global A, Y, minrows, mincols, B
for every row index i in {row indices of B} \ candidateRows do
    actIndices = actualCols ∩ B_i
    change_1 = {j ∈ actIndices : A_{i,j} · A_{candidateRows(1),j} = 1}
    change_{−1} = {j ∈ actIndices : A_{i,j} · A_{candidateRows(1),j} = −1}
    if (|actualCols| == |change_1|) or (|actualCols| == |change_{−1}|) then
        candidateRows = candidateRows ∪ {i}
    else
        if |change_1| ≥ mincols then
            if IsClosed(missingRows, change_1, i) then
                FCPMiner(missingRows, candidateRows ∪ {i}, change_1)
            end if
        end if
        if |change_{−1}| ≥ mincols then
            if IsClosed(missingRows, change_{−1}, i) then
                FCPMiner(missingRows, candidateRows ∪ {i}, change_{−1})
            end if
        end if
        missingRows = missingRows ∪ {i}
    end if
end for
if |candidateRows| ≥ minrows then
    Y = Y ∪ {(candidateRows, actualCols)}
end if
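The three routines can be put together in a compact Python sketch. This is one possible reading of Algorithms 3 to 5, not the original implementation, and it rests on several assumptions: indices are 0-based, `missing` is copied on each call so that updates stay local to that call, every row is tried as a starting row (starts that cannot reach `min_rows` are filtered by the final size check), and `is_closed` returns True when no previously examined row matches, as in the prose description. All function and parameter names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Matrix = List[List[int]]                         # entries assumed to be -1, 0 or 1
Pattern = Tuple[FrozenSet[int], FrozenSet[int]]  # (row indices, column indices)


def is_closed(A: Matrix, missing: Iterable[int], cols: Set[int], row: int) -> bool:
    """True if no previously examined row matches `row` on all of `cols`."""
    for i in missing:
        products = {A[i][j] * A[row][j] for j in cols}
        if products == {1} or products == {-1}:
            return False
    return True


def fcp_miner(A: Matrix, B: List[Set[int]], missing: Iterable[int],
              rows: List[int], cols: Set[int],
              min_rows: int, min_cols: int, found: Set[Pattern]) -> None:
    missing = set(missing)  # local copies: updates stay within this call
    rows = list(rows)
    ref = rows[0]           # plays the role of candidateRows(1)
    for i in range(len(A)):
        if i in rows:
            continue
        act = cols & B[i]
        same = {j for j in act if A[i][j] * A[ref][j] == 1}
        opposite = {j for j in act if A[i][j] * A[ref][j] == -1}
        if len(same) == len(cols) or len(opposite) == len(cols):
            rows.append(i)  # row i matches on every column: extend the pattern
            continue
        for sub in (same, opposite):
            if len(sub) >= min_cols and is_closed(A, missing, sub, i):
                fcp_miner(A, B, missing, rows + [i], sub,
                          min_rows, min_cols, found)
        missing.add(i)      # row i has now been examined in this call
    if len(rows) >= min_rows:
        found.add((frozenset(rows), frozenset(cols)))


def fcp_main(A: Matrix, min_rows: int, min_cols: int) -> Set[Pattern]:
    B = [{j for j, v in enumerate(row) if v != 0} for row in A]  # Eq. 4.1
    found: Set[Pattern] = set()
    for i, cols in enumerate(B):
        missing = set(range(i))  # rows examined before row i
        if len(cols) >= min_cols and is_closed(A, missing, cols, i):
            fcp_miner(A, B, missing, [i], cols, min_rows, min_cols, found)
    return found
```

As a usage example, calling `fcp_main(A, 2, 2)` on the hypothetical matrix `A = [[1, 1, 1, 0], [1, 1, 1, 0], [-1, -1, 0, 1], [0, 1, 1, 1]]` returns three patterns: rows {0, 1} on columns {0, 1, 2}, rows {0, 1, 2} on columns {0, 1}, and rows {0, 1, 3} on columns {1, 2}.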

Fig. 4.3 illustrates how the proposed method discovers all frequent closed patterns from a simple example data matrix with minimum support thresholds min_rows = min_cols = 2. The process flow is marked by solid arrows, while the recursive steps are highlighted by dashed arrows. A bold cross sign indicates that the investigated pattern is not closed, or that it does not satisfy the minimum support conditions for rows or columns. The discovered frequent closed patterns are surrounded by solid rectangles.

Figure 4.3. A simple example illustrating how the proposed method works.

The minimum support thresholds have been set to 2 for both rows and columns. The method starts by transforming the input matrix into a smaller data structure that keeps only the non-zero matrix values. The recursive miner procedure is then called for each row (Steps 1, 9, 12). Subsequent row indexes are added to the candidate pattern as long as the calculated changes between the first row and the added rows are identical or opposite for all column values, i.e. 1 or -1. For example, in Step 2 the change between the values at column indexes 1, 2, 3, 4 and 6 is always 1, and therefore row 2 (r2) is added to the first row (r1) with column indexes 1, 2, 3, 4, 6. This pattern is a valid frequent closed pattern as it is not contained in any other closed pattern. In Step 3, a new recursion is initiated for r1, r2, r3 because only a subset of columns (2, 3, 4) gives the same change (-1) between the first and the third row. This pattern is also a valid frequent closed pattern. The same applies to the patterns at Steps 5 and 10.

During the mining process there are many candidate patterns that are not added to the result list of valid frequent closed patterns. The patterns at Steps 6, 7, 8, 11, 12 and 13 are not closed as they are subsets of other valid frequent closed patterns. For example, the candidate pattern at Step 7 (with row indexes 1 and 3) is not closed as it is part of the closed pattern discovered at Step 4. The IsClosed function ensures that all candidate patterns of this kind are excluded.
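The statement that a candidate is not closed when it is part of a larger pattern can be checked by brute force on small matrices. The sketch below is an illustration only and assumes a particular reading of closedness: a pattern is closed if its rows all agree (same or exactly opposite) on its columns and neither a further row nor a further column can be added without breaking that agreement. The names `row_sign` and `is_maximal` are hypothetical, indices are 0-based, and the matrix in the usage note is not the one shown in Fig. 4.3.

```python
from typing import List, Set

Matrix = List[List[int]]  # entries assumed to be -1, 0 or 1


def row_sign(A: Matrix, ref: int, r: int, cols: Set[int]) -> int:
    """+1 if row r equals row ref on all cols, -1 if exactly opposite, else 0."""
    products = {A[ref][j] * A[r][j] for j in cols}
    if products == {1}:
        return 1
    if products == {-1}:
        return -1
    return 0


def is_maximal(A: Matrix, rows: Set[int], cols: Set[int]) -> bool:
    """Brute-force check that (rows, cols) is a valid pattern that cannot
    be extended by any other row or column."""
    ordered = sorted(rows)
    ref = ordered[0]
    signs = {r: row_sign(A, ref, r, cols) for r in ordered}
    if any(s == 0 for s in signs.values()):
        return False  # the rows do not agree on cols: not a valid pattern
    for r in range(len(A)):
        if r not in signs and row_sign(A, ref, r, cols) != 0:
            return False  # row r could be added
    for c in range(len(A[0])):
        if c not in cols and all(A[ref][c] * A[r][c] == signs[r] for r in ordered):
            return False  # column c could be added
    return True
```

For the hypothetical matrix `A = [[1, 1, 1, 0], [1, 1, 1, 0], [-1, -1, 0, 1], [0, 1, 1, 1]]`, the candidate with rows {0, 2} and columns {0, 1} is rejected because row 1 can still be added, which mirrors the Step 7 situation described above, while rows {0, 1, 2} on columns {0, 1} pass the check.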