(1)

Surprising results of trie-based FIM algorithms

Ferenc Bodon

bodon@cs.bme.hu

Department of Computer Science and Information Theory, Budapest University of Technology and Economics

supervisor: Lajos Rónyai

(2)

Purpose of the work

The three central FIM algorithms:

APRIORI,

Eclat,

FP-growth.

Two of them use tries.

Small details have considerable influence on efficiency.

Five details were examined theoretically and experimentally.

(3)

What kind of a trie

Three-level specification:

trie type: full trie, pruned trie, collapsed trie (≈ Patricia), O-trie,

edge representation, for a node whose edges B, D, G lead to the nodes nB, nD, nG:

doubly-linked:

[(B,&nB), (D,&nD), (G,&nG)]

tabular:

[NIL, &nB, NIL, &nD, NIL, NIL, &nG, NIL, · · ·]
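The two edge representations can be compared with a small sketch. It is only an illustration: the `Node` class, the item universe `ALPHABET` and the helper names are assumptions, and the list form is modelled by a plain Python list of (label, child) pairs rather than by a linked structure.

```python
from typing import List, Optional, Tuple


class Node:
    """Hypothetical trie node, used only for this illustration."""

    def __init__(self, name: str) -> None:
        self.name = name


# Assumed item universe; the index of an item is its position in this string.
ALPHABET = "ABCDEFGH"

nB, nD, nG = Node("nB"), Node("nD"), Node("nG")

# List representation: only the existing edges are stored, sorted by label.
edge_list: List[Tuple[str, Node]] = [("B", nB), ("D", nD), ("G", nG)]

# Tabular representation: one slot per possible item, None plays the role of NIL.
edge_table: List[Optional[Node]] = [None] * len(ALPHABET)
for label, child in edge_list:
    edge_table[ALPHABET.index(label)] = child


def follow_list(edges: List[Tuple[str, Node]], label: str) -> Optional[Node]:
    """Scan the stored edges until the label is found."""
    for edge_label, child in edges:
        if edge_label == label:
            return child
    return None


def follow_table(table: List[Optional[Node]], label: str) -> Optional[Node]:
    """Look the child up directly by the index of the item."""
    return table[ALPHABET.index(label)]
```

The list form needs space only for the edges that exist, while the table trades memory (one slot per item) for a constant-time lookup.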

(7)

What kind of trie

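memory occupation, for a root with counter 167 whose edges B and D lead to nodes with counters 123 and 102:

contiguous-memory based:

[2,167,B,6,D,8,0,123,0,102]

node-based:

167,[B,·,D,·]

123,[]

102,[]

                 contiguous-memory based    node-based
memory need:     16n (20n)                  24n (28n)
modification:    difficult                  easy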
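One way to read the contiguous array is sketched below. The layout is an assumption inferred from the example, not something the slides state: each node is taken to be stored as its edge count, its counter, and then one (label, offset of child) pair per edge. The function name `decode` is hypothetical.

```python
def decode(buf, pos=0):
    """Rebuild a (counter, children) pair from the assumed contiguous layout.

    Assumed layout of a node starting at `pos`:
    [edge_count, counter, label_1, offset_1, ..., label_k, offset_k]
    """
    edge_count = buf[pos]
    counter = buf[pos + 1]
    children = {}
    for i in range(edge_count):
        label = buf[pos + 2 + 2 * i]
        child_pos = buf[pos + 3 + 2 * i]
        children[label] = decode(buf, child_pos)
    return (counter, children)


buf = [2, 167, "B", 6, "D", 8, 0, 123, 0, 102]
tree = decode(buf)
```

Under this reading, inserting a new edge changes the length of a node and shifts every offset behind it, which is consistent with modification being listed as difficult for the contiguous form.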

(11)

1/5: effect of ordering

Tries store sequences.

sequence ← itemset + total order on the items

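The itemsets and the order together determine the trie.

[Figure: two tries storing the same itemsets under different item orders. The first has 4 nodes (0 to 3) and edges A, B, C; the second has 5 nodes (0 to 4) and edges B, C, A, A.]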

Question: Which order results in the minimum-size trie?

Theorem (Comer and Sethi). Given I, T ⊆ 2^I and an integer k, it is NP-complete to decide whether there exists a full trie that stores T and has no more than k nodes.
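The dependence of the trie size on the order can be checked with a short sketch. The itemsets {A, B} and {A, C} are an assumed example, and `trie_size` is a hypothetical helper that counts the nodes of a full trie as the number of distinct prefixes, the empty prefix being the root. Trying every order, as done here, is only feasible for a handful of items.

```python
from itertools import permutations


def trie_size(itemsets, order):
    """Number of nodes of the full trie storing `itemsets` under `order`."""
    rank = {item: i for i, item in enumerate(order)}
    prefixes = {()}  # the empty prefix is the root
    for itemset in itemsets:
        seq = tuple(sorted(itemset, key=rank.__getitem__))
        for end in range(1, len(seq) + 1):
            prefixes.add(seq[:end])
    return len(prefixes)


# Assumed example itemsets.
itemsets = [{"A", "B"}, {"A", "C"}]

# Brute force over all 3! orders.
sizes = {"".join(p): trie_size(itemsets, p) for p in permutations("ABC")}
smallest = min(sizes.values())
```

Here the orders that put A first give 4 nodes, and every other order gives 5.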

(12)

1/5: effect of ordering

A simple heuristic: order the items by descending frequency.

Reasoning: this order gives the best chance that two randomly chosen itemsets share a prefix.
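A minimal sketch of the heuristic, assuming that transactions are given as Python sets of string items. The function names and the alphabetical tie-breaking rule are assumptions made for the illustration.

```python
from collections import Counter


def descending_frequency_order(transactions):
    """Items sorted by descending support; ties are broken alphabetically."""
    counts = Counter(item for t in transactions for item in set(t))
    return sorted(counts, key=lambda item: (-counts[item], item))


def to_sequence(itemset, order):
    """Turn an itemset into the sequence that is inserted into the trie."""
    rank = {item: i for i, item in enumerate(order)}
    return sorted(itemset, key=rank.__getitem__)


# Assumed toy data: A and B occur 3 times, C occurs twice.
transactions = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B"}]
order = descending_frequency_order(transactions)
sequence = to_sequence({"C", "A"}, order)
```

Frequent items end up at the front of every sequence, so they sit near the root and are shared by many itemsets.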

Example where the heuristic does not result in the smallest trie:

[Figure: two tries over the items A, B, K, L, M, N, X, Y storing the same itemsets. The first has 11 nodes (0 to 10); the second has 10 nodes (0 to 9).]

(13)

1/5: effect of ordering

The heuristic works well on synthetic and real-life datasets. A kind of homogeneity exists.

Why is this important?

In FP-growth: size of FP-tree is critical.

In APRIORI: (1.) size of the trie that stores candidates is critical, (2.) order affects the support count method

Sensitivity of FP-tree:

min_freq (%)   1      0.2    0.09   0.035  0.02
ascending      42.48  58.03  61.34  63.6   65.04
descending     27.58  39.74  41.69  43.66  44.10
random 1       29.84  42.30  44.49  46.60  46.41
random 2       36.98  48.97  55.02  56.85  56.72
random 3       34.87  52.18  55.68  58.01  55.50

Database: BMS-POS

(14)

1/5: effect of ordering

In FIM the sensitivity does not matter.

In FIM-related problems where the order cannot be chosen freely, this side effect has to be taken into consideration.

Support count of APRIORI and the order

[Figure: run time (sec) versus support threshold for ascending and descending order. Database: BMS-POS]

(15)

1/5: effect of ordering

Results of the experiments:

The memory need of APRIORI is not sensitive to the order.

Ascending order according to frequencies results in the fastest APRIORI.

Argument: the most selective items are checked first.

(16)

2/5: storing the transactions

Let t be a transaction.

filtered t: infrequent items are removed from t.

Collect and store filtered transactions in memory.

Advantages:

IO cost is reduced,

parsing costs are reduced,

the number of support count method calls is reduced.

Disadvantage:

needs extra memory.

Question: What data-structure should be used?

Some possibilities: ordered list, trie, red-black tree
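The filtering step can be sketched as follows. This is an illustration under assumptions: `min_support` is an absolute count, identical filtered transactions are merged and kept with a multiplicity, and the result is returned as an ordered list, which is only one of the possibilities named above.

```python
from collections import Counter


def filter_transactions(transactions, min_support):
    """Remove infrequent items and merge identical filtered transactions.

    Returns an ordered list of (filtered transaction, multiplicity) pairs.
    """
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_support}
    stored = Counter()
    for t in transactions:
        filtered = tuple(sorted(set(t) & frequent))
        if filtered:  # transactions that became empty are dropped
            stored[filtered] += 1
    return sorted(stored.items())


# Assumed toy data: only A (3 occurrences) and B (2 occurrences) are frequent.
transactions = [["A", "B", "X"], ["B", "A"], ["A", "C"], ["Y"]]
stored = filter_transactions(transactions, min_support=2)
```

Four raw transactions shrink to two stored entries, so a support count routine that accepts a multiplicity would be called twice instead of four times.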

(20)

2/5: storing the transactions

Expectation: a trie needs the least memory.

Reasoning: it stores same prefixes only once.

Experiments:

min_freq   sorted list   trie   RB-tree
0.05       12.4          52.5   13.8
0.02       16.2          76.0   17.1
0.0073     17.0          81.5   18.0
0.006      17.1          81.7   18.1

Database: T40I10D100K

In most cases the trie needs the most memory (exceptions: connect, accidents).

Cause: a trie has many more nodes than an RB-tree, and each node is expensive.

(21)

3/5: routing strategies at the nodes

How to find the edge to follow in APRIORI?

Given a node with a list of n edges and a part of the filtered transaction (t′), find matching labels.

simultaneous traversal, O(n + |t′|)

binary search, O(|t′| log n) or O(n log |t′|)

binary vector based, O(n)

indexvector based, O(n)

Experiment: APRIORI is sensitive to the routing strategy.

Winner: indexvector based

Runner up: simultaneous traversal
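Three of the routing strategies can be sketched as follows. Items are assumed to be integers, and both the edge labels and the transaction are assumed to be sorted. The index vector is read here as a vector that stores, for each item, its position in the filtered transaction (or -1 if absent); this reading and all function names are assumptions.

```python
from bisect import bisect_left


def simultaneous(edge_labels, t_part):
    """Merge-style scan of two sorted lists."""
    matches, i, j = [], 0, 0
    while i < len(edge_labels) and j < len(t_part):
        if edge_labels[i] < t_part[j]:
            i += 1
        elif edge_labels[i] > t_part[j]:
            j += 1
        else:
            matches.append(edge_labels[i])
            i += 1
            j += 1
    return matches


def binary_search(edge_labels, t_part):
    """Search each transaction item among the sorted edge labels."""
    matches = []
    for item in t_part:
        pos = bisect_left(edge_labels, item)
        if pos < len(edge_labels) and edge_labels[pos] == item:
            matches.append(item)
    return matches


def build_index_vector(transaction, num_items):
    """vec[item] = position of the item in the transaction, or -1."""
    vec = [-1] * num_items
    for pos, item in enumerate(transaction):
        vec[item] = pos
    return vec


def index_vector_based(edge_labels, index_vector, start):
    """One constant-time check per edge; `start` is where the remaining part begins."""
    return [label for label in edge_labels if index_vector[label] >= start]


edge_labels = [1, 3, 6]
transaction = [0, 3, 4, 6, 7]
start = 1  # the first item has already been consumed higher up in the trie
t_part = transaction[start:]
vec = build_index_vector(transaction, num_items=8)
```

The index vector is built once per transaction and reused at every node, whereas the other two strategies work on the remaining part of the transaction at each node.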

(26)

4/5: storing frequent itemsets

Only frequent itemsets of size ℓ are needed for generating candidates of size ℓ + 1.

Nodes that are not on a path to any candidate slow down the support count method.

remove from the trie

store maximum length values for each node

differentiate edges

Experiments:

run-time is insensitive

memory need can be greatly reduced.
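The second option, storing a maximum length value in each node, might look like the sketch below. It is an assumption about how such a value could be used: every node records the length of the longest itemset stored at or below it, and a traversal towards the candidates skips subtrees that are too shallow. The transaction is left out to keep the example small, and the class and function names are hypothetical.

```python
class Node:
    """Hypothetical trie node with a maximum length value."""

    def __init__(self):
        self.children = {}
        self.max_depth = 0  # length of the longest itemset at or below this node


def insert(root, sequence):
    """Insert an itemset and update max_depth along its path."""
    node = root
    node.max_depth = max(node.max_depth, len(sequence))
    for item in sequence:
        node = node.children.setdefault(item, Node())
        node.max_depth = max(node.max_depth, len(sequence))


def nodes_visited(node, target_depth, use_max_depth, depth=0):
    """Count the nodes touched while walking down to depth `target_depth`."""
    if use_max_depth and node.max_depth < target_depth:
        return 0  # no candidate below this node, skip the subtree
    visited = 1
    if depth < target_depth:
        for child in node.children.values():
            visited += nodes_visited(child, target_depth, use_max_depth, depth + 1)
    return visited


root = Node()
for itemset in (["A"], ["B"], ["C"]):  # assumed frequent 1-itemsets
    insert(root, itemset)
insert(root, ["A", "B"])               # assumed single candidate of size 2

plain = nodes_visited(root, 2, use_max_depth=False)
pruned = nodes_visited(root, 2, use_max_depth=True)
```

In this toy trie the nodes B and C do not lead to any candidate, so the pruned walk touches 3 nodes instead of 5.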

(30)

5/5: deleting unimportant transactions

A filtered transaction is unimportant from the ℓth iteration if it does not contain any (ℓ − 1)-itemset candidate.

Heuristic: Unimportant transactions should be ignored.

Reasoning: They slow down support count (part of the trie is visited).

Experiments: Ignoring unimportant transactions slows down the algorithm.

Argument: It needs resources to determine if a transaction is unimportant or not. In most cases transactions are important (drawback of generate-and-test, breadth-first search method).
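The test itself can be written in a few lines. The sketch below assumes that the candidates are given as frozensets and that transactions are iterables of items; the function name is hypothetical.

```python
def is_unimportant(transaction, candidates):
    """True if the transaction contains none of the candidates."""
    items = set(transaction)
    return not any(candidate <= items for candidate in candidates)


# Assumed toy data: strings are used as iterables of one-letter items.
candidates = [frozenset("AB"), frozenset("BC")]
transactions = ["ABD", "ACD", "BCE"]
kept = [t for t in transactions if not is_unimportant(t, candidates)]
```

Written this way, the check compares every transaction with the candidates until one matches, which is itself a kind of support counting and gives a feeling for the cost named in the argument above.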

(31)

Conclusion

In trie-based FIM algorithms, trie-related issues have to be carefully examined.

Thank you for your attention!
