On Benchmarking Frequent Itemset Mining Algorithms
Balázs Rácz, Ferenc Bodon, Lars Schmidt-Thieme
Budapest University of Technology and Economics
Computer and Automation Research Institute of the Hungarian Academy of Sciences
Computer-Based New Media Group, Institute for Computer Science
History
Over 100 papers on Frequent Itemset Mining
Many of them claim to be the ‘best’
Based on benchmarks run against some publicly available implementation on some datasets
FIMI'03 and FIMI'04 workshops: extensive benchmarks with many implementations and data sets
They have served as a guideline ever since
How 'fair' was the benchmark, and what did it measure?
On FIMI contests
Problem 1: We are interested in the quality of algorithms, but we can only measure implementations.
No good theoretical data model yet for analytical comparison
As we will see later, this would also require a good hardware model
Problem 2: If we handed our algorithms and ideas to a very talented and experienced low-level programmer, they could completely redraw the current FIMI rankings.
A FIMI contest is all about the ‘constant factor’
On FIMI contests (2)
Problem 3: Seemingly unimportant implementation details can hide all algorithmic features when benchmarking.
These details are often unnoticed even by the author and almost never published.
On FIMI contests (3)
Problem 4: FIM implementations are complete 'suites' of a basic algorithm and several algorithmic/implementational optimizations. Comparing such complete 'suites' tells us what is fast, but does not tell us why.
Recommendation:
Modular programming
Benchmarks on the individual features
On FIMI contests (4)
Problem 5: The run time of all 'dense' mining tasks is dominated by I/O.
Problem 6: On 'dense' datasets, FIMI benchmarks measure the submitters' ability to code a fast integer-to-string conversion function.
Recommendation:
Have as much identical code as possible
library of FIM functions
On FIMI contests (5)
Problem 7: Run time differences are small
Problem 8: Run time varies from run to run
The very same executable on the very same input
Bug or feature of modern hardware?
What to measure?
Recommendation: a 'winner takes all' evaluation of a mining task is unfair (see the measurement sketch below)
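To make Problems 7 and 8 concrete, here is a minimal sketch (the command line and file names are hypothetical) of a harness that repeats the same run and reports the spread of wall-clock times instead of a single number:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Runs the same command several times and prints min/median/max wall
// time, so the run-to-run variation is visible instead of being hidden
// behind a single number.  The command below is only a placeholder.
int main() {
    const std::string cmd = "./fim_implementation input.dat 1000 > /dev/null";
    const int repetitions = 5;
    std::vector<double> seconds;

    for (int i = 0; i < repetitions; ++i) {
        auto start = std::chrono::steady_clock::now();
        int rc = std::system(cmd.c_str());
        auto stop = std::chrono::steady_clock::now();
        if (rc != 0) { std::fprintf(stderr, "run %d failed\n", i); return 1; }
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }

    std::sort(seconds.begin(), seconds.end());
    std::printf("min %.3fs  median %.3fs  max %.3fs\n",
                seconds.front(), seconds[seconds.size() / 2], seconds.back());
    return 0;
}

If the difference between two implementations is smaller than the spread observed here, declaring a single winner on that task is not meaningful.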
On FIMI contests (6)
Problem 9: Traditional run-time (plus memory) benchmarks do not tell us whether an implementation is better than another in algorithmic aspects or in implementational (hardware-friendliness) aspects.
Problem 10: Traditional benchmarks do not show whether the conclusions would still hold on a slightly different hardware architecture (e.g. AMD vs. Intel).
Recommendation: extend benchmarks
Library and pluggability
Code reuse, pluggable components, data structures
Object oriented design
Do not sacrifice efficiency
No virtual method calls allowed in the core
Then how?
C++ templates
Allow pluggability with inlining
Plugging requires source code change, but several versions can coexist
Sometimes tricky to code with templates
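A minimal sketch of the template-based pluggability idea, with hypothetical class names: the output routine is passed to the mining core as a template parameter, so its calls can be inlined and the core contains no virtual method calls.

#include <cstdio>
#include <vector>

// Hypothetical output policy: writes each found itemset to stdout.
// Any class with the same interface can be plugged in instead
// (e.g. a counting-only policy for benchmarks without output cost).
struct SimpleTextOutput {
    void found_itemset(const std::vector<int>& itemset, int support) {
        for (int item : itemset) std::printf("%d ", item);
        std::printf("(%d)\n", support);
    }
};

// The mining core takes the policy as a template parameter: calls to
// output_.found_itemset() are resolved at compile time and can be
// inlined -- no virtual dispatch in the core.
template <class OutputPolicy>
class MinerCore {
public:
    explicit MinerCore(OutputPolicy& output) : output_(output) {}

    // Placeholder for the real mining recursion; here we just report
    // one hard-coded itemset to show the plumbing.
    void run() {
        std::vector<int> itemset = {3, 7, 42};
        output_.found_itemset(itemset, 128);
    }

private:
    OutputPolicy& output_;
};

int main() {
    SimpleTextOutput out;
    MinerCore<SimpleTextOutput> miner(out);  // plug the policy in
    miner.run();
    return 0;
}

Swapping in a different policy means recompiling with a different template argument (a source code change), but several instantiations can coexist in one binary.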
I/O efficiency
Variations of output routine:
normal-simple: renders each itemset and each item separately to text
normal-cache: caches the string representation of item identifiers
df-buffered: (depth-first) reuses the string representation of the last line, appends the last item
df-cache: like df-buffered, but also caches the string representation of item identifiers
[Figure: decoder-test — run time in seconds (log scale, 0.1 to 100) of the output variants df-buffered, df-cache, normal-cache, and normal-simple]
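A sketch of the df-buffered idea, with illustrative names not taken from the actual FIMI library: because a depth-first miner extends the current itemset by one item at a time, the text of the previous output line can be reused and only the last item needs to be rendered.

#include <cstdio>
#include <string>
#include <vector>

// 'df-buffered' sketch: the text of the current output line is kept in
// a buffer; descending in the recursion appends one item, returning
// truncates it again, so each item is rendered to text only once per
// recursion branch.
class DepthFirstWriter {
public:
    // Called when the recursion descends with a new item.
    void push_item(int item) {
        lengths_.push_back(line_.size());   // remember the prefix length
        line_ += std::to_string(item);      // render only the new item
        line_ += ' ';
    }

    // Called when the recursion returns: drop the last item's text.
    void pop_item() {
        line_.resize(lengths_.back());
        lengths_.pop_back();
    }

    // Called for every frequent itemset: the line is already built.
    void write(int support) {
        std::printf("%s(%d)\n", line_.c_str(), support);
    }

private:
    std::string line_;
    std::vector<std::size_t> lengths_;
};

int main() {
    DepthFirstWriter w;
    w.push_item(3);  w.write(10);   // prints "3 (10)"
    w.push_item(7);  w.write(4);    // prints "3 7 (4)"
    w.pop_item();
    w.push_item(9);  w.write(6);    // prints "3 9 (6)"
    return 0;
}

The df-cache variant additionally caches the rendered string of each item identifier, so no item is converted to text more than once over the whole run.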
Benchmarking: desiderata
1. The benchmark should be stable and reproducible. Ideally it should have no variation, certainly not on the same hardware.
2. The benchmark numbers should reflect actual performance: the benchmark should be a fairly accurate model of real hardware.
3. The benchmark should be hardware-independent, in the sense that it should be stable against slight variations of the underlying hardware architecture, like changing the processor manufacturer or model.
Benchmarking: reality
Different implementations stress different aspects of the hardware
Migrating to other hardware:
May be better in one aspect, worse in another one
Ranking cannot be migrated between HW
Complex benchmark results are necessary
Win due to algorithmic or HW-friendliness reason?
Performance is not as simple as 'run time in seconds'
Benchmark platform
Virtual machine
How to define?
How to code the implementations?
Cost function?
Instrumentation (simulation of actual CPU)
Slow (100-fold slower than plain run time)
Accuracy?
Cost function?
Benchmark platform (2)
Run-time measurement
Performance counters
Present in all modern processors (since the i586)
Count performance-related events in real time
PerfCtr kernel patch under Linux, vendor-specific software under Windows
Problem: measured numbers reflect the actual execution, thus are subject to variation
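As an illustration (using the modern Linux perf_event_open system call rather than the PerfCtr patch named above), a minimal sketch that counts retired instructions around a measured code region:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Open one hardware counter (retired instructions) for this process.
static int open_instruction_counter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;          // start disabled, enable explicitly
    attr.exclude_kernel = 1;    // count user-space work only
    attr.exclude_hv = 1;
    return static_cast<int>(
        syscall(__NR_perf_event_open, &attr, 0 /*this pid*/, -1 /*any cpu*/,
                -1 /*no group*/, 0));
}

int main() {
    int fd = open_instruction_counter();
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // The code region to measure -- a dummy loop standing in for a
    // mining run.
    volatile std::uint64_t sum = 0;
    for (int i = 0; i < 1000000; ++i) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    std::uint64_t instructions = 0;
    ssize_t n = read(fd, &instructions, sizeof(instructions));
    if (n != static_cast<ssize_t>(sizeof(instructions))) {
        std::perror("read");
        return 1;
    }
    std::printf("retired instructions: %llu\n",
                static_cast<unsigned long long>(instructions));
    close(fd);
    return 0;
}

The same interface exposes cycle counts, cache misses, and branch mispredictions, but every such reading reflects one particular execution and therefore varies from run to run.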
[Figure: run time in seconds (log scale, 1 to 100) on BMS-POS.dat for apriori-noprune, eclat-cover, eclat-diffset, nonordfp-classic-td, nonordfp-dense, and nonordfp-sparse]
[Figure: 'all uops on BMS-POS at 1000' — Gclockticks (0 to 60) broken down into cycles retiring 3/2/1 uops per tick, stall cycles, bogus and non-bogus uops, prefetch pending, and r/w pending]
Three sets of bars per implementation:
Wide, centered: total size shows total clockticks used, i.e. run time; purple shows stall time (the CPU waiting for something).
Narrow, centered: brown shows the number of instructions (u-ops) executed, which is stable; cyan shows u-ops wasted due to branch mispredictions.
Narrow, right: light brown shows ticks spent on memory reads/writes (mostly waiting); black shows read-ahead (prefetch).
Conclusion
We cannot measure algorithms, only implementations
Modular implementations with pluggable features
Shared code for the common functionality (like I/O)
FIMI library with C++ templates
Benchmark: run time varies, depends on hardware used
Complex benchmarks needed
Conclusions on algorithmic aspects or hardware friendliness?
Thank you for your attention