On Benchmarking Frequent Itemset Mining Algorithms
Balázs Rácz, Ferenc Bodon, Lars Schmidt-Thieme
Budapest University of Technology and Economics
Computer and Automation Research Institute of the Hungarian Academy of Sciences
Computer-Based New Media Group, Institute for Computer Science
History
Over 100 papers on Frequent Itemset Mining
Many of them claim to be the ‘best’
Based on benchmarks run against some publicly available implementation on some datasets
FIMI'03 and FIMI'04 workshops: extensive benchmarks with many implementations and data sets
They have served as a guideline ever since
How 'fair' was the benchmark, and what did it measure?
On FIMI contests
Problem 1: We are interested in the quality of algorithms, but we can only measure implementations.
No good theoretical data model yet for analytical comparison
As we will see later, this would also require a good hardware model
Problem 2: If we handed our algorithms and ideas to a very talented and experienced low-level programmer, they could completely redraw the current FIMI rankings.
A FIMI contest is all about the ‘constant factor’
On FIMI contests (2)
Problem 3: Seemingly unimportant implementation details can hide all algorithmic features when benchmarking.
These details are often unnoticed even by the author and almost never published.
On FIMI contests (3)
Problem 4: FIM implementations are complete 'suites' of a basic algorithm and several algorithmic/implementational optimizations. Comparing such complete 'suites' tells us what is fast, but does not tell us why.
Recommendation:
Modular programming
Benchmarks on the individual features
On FIMI contests (4)
Problem 5: The run time of all 'dense' mining tasks is dominated by I/O.
Problem 6: On 'dense' datasets, FIMI benchmarks measure the submitters' ability to code a fast integer-to-string conversion function.
Recommendation:
Have as much identical code as possible
library of FIM functions
On FIMI contests (5)
Problem 7: Run time differences are small
Problem 8: Run time varies from run to run
The very same executable on the very same input
Bug or feature of modern hardware?
What to measure?
Recommendation: a 'winner takes all' evaluation of a mining task is unfair (see the measurement sketch below)
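To make Problems 7 and 8 concrete, here is a minimal sketch (the command line and file names are hypothetical) of a harness that repeats the same run and reports the spread of wall-clock times instead of a single number:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Runs the same command several times and prints min/median/max wall
// time, so the run-to-run variation is visible instead of being hidden
// behind a single number.  The command below is only a placeholder.
int main() {
    const std::string cmd = "./fim_implementation input.dat 1000 > /dev/null";
    const int repetitions = 5;
    std::vector<double> seconds;

    for (int i = 0; i < repetitions; ++i) {
        auto start = std::chrono::steady_clock::now();
        int rc = std::system(cmd.c_str());
        auto stop = std::chrono::steady_clock::now();
        if (rc != 0) { std::fprintf(stderr, "run %d failed\n", i); return 1; }
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }

    std::sort(seconds.begin(), seconds.end());
    std::printf("min %.3fs  median %.3fs  max %.3fs\n",
                seconds.front(), seconds[seconds.size() / 2], seconds.back());
    return 0;
}

If the difference between two implementations is smaller than the spread observed here, declaring a single winner on that task is not meaningful.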
On FIMI contests (6)
Problem 9: Traditional run-time (plus memory) benchmarks do not tell us whether an implementation is better than another in algorithmic aspects or in implementational (hardware-friendliness) aspects.
Problem 10: Traditional benchmarks do not show whether the conclusions would still hold on a slightly different hardware architecture (e.g. AMD vs. Intel).
Recommendation: extend benchmarks
Library and pluggability
Code reuse, pluggable components, data structures
Object oriented design
Do not sacrifice efficiency
No virtual method calls allowed in the core
Then how?
C++ templates
Allow pluggability with inlining
Plugging requires source code change, but several versions can coexist
Sometimes tricky to code with templates
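A minimal sketch of the template-based pluggability idea, with hypothetical class names: the output routine is passed to the mining core as a template parameter, so its calls can be inlined and the core contains no virtual method calls.

#include <cstdio>
#include <vector>

// Hypothetical output policy: writes each found itemset to stdout.
// Any class with the same interface can be plugged in instead
// (e.g. a counting-only policy for benchmarks without output cost).
struct SimpleTextOutput {
    void found_itemset(const std::vector<int>& itemset, int support) {
        for (int item : itemset) std::printf("%d ", item);
        std::printf("(%d)\n", support);
    }
};

// The mining core takes the policy as a template parameter: calls to
// output_.found_itemset() are resolved at compile time and can be
// inlined -- no virtual dispatch in the core.
template <class OutputPolicy>
class MinerCore {
public:
    explicit MinerCore(OutputPolicy& output) : output_(output) {}

    // Placeholder for the real mining recursion; here we just report
    // one hard-coded itemset to show the plumbing.
    void run() {
        std::vector<int> itemset = {3, 7, 42};
        output_.found_itemset(itemset, 128);
    }

private:
    OutputPolicy& output_;
};

int main() {
    SimpleTextOutput out;
    MinerCore<SimpleTextOutput> miner(out);  // plug the policy in
    miner.run();
    return 0;
}

Swapping in a different policy means recompiling with a different template argument (a source code change), but several instantiations can coexist in one binary.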
I/O efficiency
Variations of output routine:
normal-simple: renders each itemset and each item separately to text
normal-cache: caches the string representation of item identifiers
df-buffered: (depth-first) reuses the string representation of the last line, appends the last item
df-cache: like df-buffered, but also caches the string representation of item identifiers
[Figure: decoder-test — run time in seconds (log scale, 0.1 to 100) of the output variants df-buffered, df-cache, normal-cache, and normal-simple]
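A sketch of the df-buffered idea, with illustrative names not taken from the actual FIMI library: because a depth-first miner extends the current itemset by one item at a time, the text of the previous output line can be reused and only the last item needs to be rendered.

#include <cstdio>
#include <string>
#include <vector>

// 'df-buffered' sketch: the text of the current output line is kept in
// a buffer; descending in the recursion appends one item, returning
// truncates it again, so each item is rendered to text only once per
// recursion branch.
class DepthFirstWriter {
public:
    // Called when the recursion descends with a new item.
    void push_item(int item) {
        lengths_.push_back(line_.size());   // remember the prefix length
        line_ += std::to_string(item);      // render only the new item
        line_ += ' ';
    }

    // Called when the recursion returns: drop the last item's text.
    void pop_item() {
        line_.resize(lengths_.back());
        lengths_.pop_back();
    }

    // Called for every frequent itemset: the line is already built.
    void write(int support) {
        std::printf("%s(%d)\n", line_.c_str(), support);
    }

private:
    std::string line_;
    std::vector<std::size_t> lengths_;
};

int main() {
    DepthFirstWriter w;
    w.push_item(3);  w.write(10);   // prints "3 (10)"
    w.push_item(7);  w.write(4);    // prints "3 7 (4)"
    w.pop_item();
    w.push_item(9);  w.write(6);    // prints "3 9 (6)"
    return 0;
}

The df-cache variant additionally caches the rendered string of each item identifier, so no item is converted to text more than once over the whole run.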
Benchmarking: desiderata
1. The benchmark should be stable and reproducible. Ideally it should have no variation, certainly not on the same hardware.
2. The benchmark numbers should reflect actual performance: the benchmark should be a fairly accurate model of real hardware.
3. The benchmark should be hardware-independent, in the sense that it should be stable against slight variations of the underlying hardware architecture, like changing the processor manufacturer or model.
Benchmarking: reality
Different implementations stress different aspects of the hardware
Migrating to other hardware:
May be better in one aspect, worse in another one
Ranking cannot be migrated between HW
Complex benchmark results are necessary
Win due to algorithmic or HW-friendliness reason?
Performance is not as simple as 'run time in seconds'
Benchmark platform
Virtual machine
How to define?
How to code the implementations?
Cost function?
Instrumentation (simulation of actual CPU)
Slow (100-fold slower than plain run time)
Accuracy?
Cost function?
Benchmark platform (2)
Run-time measurement
Performance counters
Present in all modern processors (since the i586)
Count performance-related events in real time
PerfCtr kernel patch under Linux, vendor-specific software under Windows
Problem: measured numbers reflect the actual execution, thus are subject to variation
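As an illustration (using the modern Linux perf_event_open system call rather than the PerfCtr patch named above), a minimal sketch that counts retired instructions around a measured code region:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Open one hardware counter (retired instructions) for this process.
static int open_instruction_counter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;          // start disabled, enable explicitly
    attr.exclude_kernel = 1;    // count user-space work only
    attr.exclude_hv = 1;
    return static_cast<int>(
        syscall(__NR_perf_event_open, &attr, 0 /*this pid*/, -1 /*any cpu*/,
                -1 /*no group*/, 0));
}

int main() {
    int fd = open_instruction_counter();
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // The code region to measure -- a dummy loop standing in for a
    // mining run.
    volatile std::uint64_t sum = 0;
    for (int i = 0; i < 1000000; ++i) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    std::uint64_t instructions = 0;
    ssize_t n = read(fd, &instructions, sizeof(instructions));
    if (n != static_cast<ssize_t>(sizeof(instructions))) {
        std::perror("read");
        return 1;
    }
    std::printf("retired instructions: %llu\n",
                static_cast<unsigned long long>(instructions));
    close(fd);
    return 0;
}

The same interface exposes cycle counts, cache misses, and branch mispredictions, but every such reading reflects one particular execution and therefore varies from run to run.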
[Figure: run time in seconds (log scale, 1 to 100) on BMS-POS.dat for apriori-noprune, eclat-cover, eclat-diffset, nonordfp-classic-td, nonordfp-dense, and nonordfp-sparse]
[Figure: 'all uops on BMS-POS at 1000' — Gclockticks (0 to 60) broken down into cycles retiring 3/2/1 uops per tick, stall cycles, bogus and non-bogus uops, prefetch pending, and r/w pending]
Three sets of bars per implementation:
Wide, centered: total size shows total clockticks used, i.e. run time; purple shows stall time (the CPU waiting for something).
Narrow, centered: brown shows the number of instructions (u-ops) executed, which is stable; cyan shows u-ops wasted due to branch mispredictions.
Narrow, right: light brown shows ticks spent on memory reads/writes (mostly waiting); black shows read-ahead (prefetch).
Conclusion
We cannot measure algorithms, only implementations
Modular implementations with pluggable features
Shared code for the common functionality (like I/O)
FIMI library with C++ templates
Benchmark: run time varies, depends on hardware used
Complex benchmarks needed
Conclusions on algorithmic aspects or hardware friendliness?
Thank you for your attention