Important concepts and notation for frequent pattern mining

In document DATAMINING GÁBORBEREND (Pldal 139-146)


Exercise 7.1. Suppose you have a collection of recipes including a list of ingredients required for them. In case you would like to find recipes that are

7.1 Important concepts and notation for frequent pattern mining

Before delving into the details of frequent pattern mining algorithms, we define a few concepts and introduce some notations first to make the upcoming discussion easier.

To introduce the definitions and concepts, let us consider the example transactional dataset from Table 7.1. This is a rather small dataset that includes only five transactions; it can nevertheless be used conveniently for illustrative purposes. The task of frequent pattern mining naturally becomes more interesting and challenging for transactional datasets of much larger size. Note that in the remainder of the chapter, we will use the terms (market) basket and transaction interchangeably.

Basket ID   Items
1           {milk, bread, salami}
2           {beer, diapers}
3           {beer, wurst}
4           {beer, baby food, diapers}
5           {diapers, coke, bread}

Table 7.1: Example transactional dataset.

7.1.1 Support of item sets

Let us first define the support of an item set. Given some item set I, we say that its support is simply the number of transactions T_j from the entire transactional database T such that T_j ⊇ I, that is, the market basket with index j contains the item set I. Support is often reported as a number between 0 and 1, quantifying the proportion of the transactional database which contains item set I.

Note that a basket increases the support of an item set only if all elements of the item set are found in that basket. Should a single element of the item set be missing from a basket, the basket no longer qualifies to increase the support of that item set. On the other hand, a basket might include an arbitrary number of excess items relative to some item set and still contribute to its overall support.

Example 7.2. Let us calculate the support of the item set {beer} from the example transactional dataset from Table 7.1. The item beer can be found in three transactions (cf. baskets with ID 2, 3 and 4), hence its support is three. For our particular example, in which the transactional database contains 5 transactions, this support can also be expressed as 3/5 = 0.6.

It is also possible to quantify the support of multi-item item sets. The support of the item set {beer, diapers} is two (cf. baskets with ID 2 and 4), or 2/5 = 0.4 in the relative sense when normalized by the size of the transactional dataset.
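The support computations of Example 7.2 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `support` and `relative_support` and the representation of Table 7.1 as a list of sets are assumptions made here, not notation from the text.

```python
# Table 7.1 encoded as a list of sets, one set per basket (assumed representation).
baskets = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]


def support(itemset, transactions):
    """Absolute support: number of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    # `itemset <= t` is the subset test, i.e. T_j is a superset of I.
    return sum(1 for t in transactions if itemset <= t)


def relative_support(itemset, transactions):
    """Support normalized by the number of transactions (a value in [0, 1])."""
    return support(itemset, transactions) / len(transactions)


print(support({"beer"}, baskets))                       # 3
print(relative_support({"beer"}, baskets))              # 0.6
print(support({"beer", "diapers"}, baskets))            # 2
print(relative_support({"beer", "diapers"}, baskets))   # 0.4
```

Baskets with excess items (such as basket 4 for {beer, diapers}) still count, because only the subset relation is tested.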

There is an important property of the support of item sets on which we will rely heavily and which ensures the correctness of the Apriori algorithm, one of the powerful algorithms for tackling the problem of frequent pattern mining. This important property is the anti-monotonicity of the support of item sets. In general, anti-monotonicity of some function f : X → R means that for any x_1, x_2 ∈ X, that is, for any pair of inputs from the domain of the function, the property

\[ x_1 > x_2 \Rightarrow f(x_1) \le f(x_2) \]

holds, meaning that the value returned by the function for a larger input is allowed to be at most as large as any of the outputs returned by the function for any smaller input.

We define a partial ordering over the subsets of items as depicted in Figure 7.1. According to this partial ordering, we say that an item set I is “larger” than item set J whenever the relation I ⊃ J holds between the two item sets. The anti-monotonicity property is then naturally satisfied for the support of item sets: every basket that contains a superset also contains each of its subsets, so the support of a superset of some item set is at most as large as the support of the narrower set.
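The anti-monotone behaviour of support can be checked by brute force on the small dataset of Table 7.1. The sketch below is only an illustration under assumed names (`baskets`, `support`, `all_itemsets`, `violations`); it verifies that adding a single item to an item set never increases its support, which by induction covers every superset.

```python
from itertools import combinations

# Table 7.1 encoded as a list of sets (assumed representation).
baskets = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]
items = sorted(set().union(*baskets))


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def all_itemsets(universe):
    """Yield every non-empty subset of `universe` as a frozenset."""
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            yield frozenset(combo)


# Collect every pair (I, I plus one extra item) whose support would increase.
violations = [
    (smaller, smaller | {extra})
    for smaller in all_itemsets(items)
    for extra in items
    if extra not in smaller
    and support(smaller | {extra}, baskets) > support(smaller, baskets)
]

print(violations)  # [] : no superset has larger support than its subset
```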

In the example illustrated by Figure 7.1, item sets {b}, {c} and {b,c} are assumed to be frequent. We indicate the fact that these item sets are considered frequent by marking them in red. Note how the anti-monotone property of support is reflected graphically in Figure 7.1: all the proper subsets of the frequent item sets are always frequent as well. If this were not the case, it would be a violation of the anti-monotone property.

m i n i n g f r e q u e n t i t e m s e t s 141

          {a,b,c}
  {a,b}    {a,c}    {b,c}
   {a}      {b}      {c}

Figure 7.1: An example Hasse diagram for items a, b and c. Item sets marked in red are frequent.

7.1.2 Association rules

The next important concept is that of association rules. From a market basket analysis point of view, an association rule is a pair of (disjoint) item sets (X, Y) such that the purchase of item set X makes the purchase of item set Y likely. It is denoted as X ⇒ Y.

In order to quantify the strength of an association rule, one can calculate its confidence, i.e.,

\[ c(X \Rightarrow Y) = \frac{\text{support}(X \cup Y)}{\text{support}(X)}, \]

that is, the number of transactions containing all of the items present in the association rule, divided by the number of transactions that include at least the items on the left-hand side of the rule (and potentially, but not necessarily, other items, including the ones on the right-hand side of the association rule). Intuitively, confidence quantifies a conditional probability: it tells us the probability that a basket contains item set Y given that the basket already contains item set X.

Example 7.3. Revisiting the example transactional dataset from Table 7.1, let us calculate the confidence of the association rule {beer} ⇒ {diapers}. In order to do so, we need the support of the item pair {beer, diapers} and that of the single item on the left-hand side of the association rule, i.e., {beer}.

Recall that these support values are exactly the ones we calculated in Example 7.2, that is,

\[ c(\{\text{beer}\} \Rightarrow \{\text{diapers}\}) = \frac{\text{support}(\{\text{beer}, \text{diapers}\})}{\text{support}(\{\text{beer}\})} = \frac{2}{3}. \]

Just as conditional probabilities are not symmetric, i.e., P(A|B) = P(B|A) need not be the case, the same applies to the confidence of association rules, meaning that c(X ⇒ Y) = c(Y ⇒ X) does not necessarily hold.

As an example to see when the symmetry breaks, calculate the confidences of the association rules

{bread} ⇒ {milk} and {milk} ⇒ {bread}.

Also note that association rules can have multiple items on either of their sides, meaning that association rules of the form {beer} ⇒ {diapers, baby food} are perfectly valid.
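The confidence values discussed in this subsection can be reproduced with a short Python sketch on Table 7.1. The helper names `support` and `confidence` and the use of `fractions.Fraction` for exact arithmetic are choices made for this illustration only; the function assumes the left-hand side occurs in at least one basket, otherwise the division fails.

```python
from fractions import Fraction

# Table 7.1 encoded as a list of sets (assumed representation).
baskets = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(lhs, rhs, transactions):
    """c(X => Y) = support(X union Y) / support(X), as an exact fraction.

    Assumes support(X) > 0; Fraction raises ZeroDivisionError otherwise.
    """
    lhs, rhs = set(lhs), set(rhs)
    return Fraction(support(lhs | rhs, transactions), support(lhs, transactions))


print(confidence({"beer"}, {"diapers"}, baskets))               # 2/3
print(confidence({"bread"}, {"milk"}, baskets))                 # 1/2
print(confidence({"milk"}, {"bread"}, baskets))                 # 1
print(confidence({"beer"}, {"diapers", "baby food"}, baskets))  # 1/3
```

The second and third lines show the asymmetry: bread occurs in two baskets but milk in only one, so swapping the two sides of the rule changes the denominator.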

7.1.3 The interestingness of an association rule

One could think that association rules with high confidence are necessarily useful. This is not necessarily the case, however.

Just imagine the simple case when there is some product A which simply gets purchased by every customer. Since this product can be found in every market basket, no matter what product B we choose, the confidence of the association rule B ⇒ A, i.e., c(B ⇒ A), would inevitably be 1.0.

In order to better assess the usefulness of an association rule, we need to devise some notion of true interestingness for association rules. There exists a variety of such interestingness measures. Going through all of them and detailing their properties is beyond our scope; here we simply mention a few of the possible ways to quantify the interestingness of an association rule.

A simple way to measure how interesting an association rule A ⇒ B is, is to calculate the so-called lift of the association rule by the formula

\[ \frac{c(A \Rightarrow B)}{s(B)}, \]

with c(A ⇒ B) and s(B) denoting the confidence of the association rule and the relative support of item set B, respectively. Taking into consideration that the confidence of an association rule can be regarded as the conditional probability of purchasing item set B given that item set A has been purchased, i.e., P(B|A) = P(A,B)/P(A), and that the relative support of an item set is nothing but the probability of purchasing that item set, i.e., s(B) = P(B), it is easy to see that the lift of an association rule can be rewritten as

\[ \frac{P(A,B)}{P(A)\,P(B)}, \]


with P(A,B) indicating the joint probability of buying both item sets A and B simultaneously, and P(A) and P(B) referring to the marginal probabilities of purchasing item sets A and B, respectively. In the end, the lift of a rule investigates to what extent the purchase of item set A is independent of that of item set B. A lift value of 1 means that item sets A and B are purchased independently of each other. Lift values larger than 1 mean a stronger connection between item sets A and B.

We get a further notion of interestingness for an association rule if we calculate

\[ i(A \Rightarrow B) = c(A \Rightarrow B) - s(B), \]

where c(A ⇒ B) denotes the confidence of the association rule and s(B) marks the relative support of the item set on the right-hand side of the association rule. Unlike lift, this quantity can take a negative value, namely when the condition s(B) > c(A ⇒ B) holds. This happens when item set B is present less frequently among the baskets that contain item set A than among all baskets (irrespective of item set A). A value of zero means that we see item set B just as frequently in the baskets that contain item set A as in baskets in general. A positive value, on the other hand, means that the presence of item set A in a basket makes the presence of item set B more likely compared to the case when we do not know whether A is also present in the basket.
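Both interestingness measures of this subsection can be evaluated on Table 7.1 with a brief Python sketch. The function names `lift` and `interest` are labels chosen here for the two formulas (the text itself only writes i(A ⇒ B) for the second one), and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# Table 7.1 encoded as a list of sets (assumed representation).
baskets = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def relative_support(itemset, transactions):
    """s(B): fraction of transactions containing the item set."""
    return Fraction(support(itemset, transactions), len(transactions))


def confidence(lhs, rhs, transactions):
    """c(A => B) = support(A union B) / support(A); assumes support(A) > 0."""
    return Fraction(support(set(lhs) | set(rhs), transactions),
                    support(lhs, transactions))


def lift(lhs, rhs, transactions):
    """c(A => B) / s(B); assumes s(B) > 0. Equals 1 under independence."""
    return confidence(lhs, rhs, transactions) / relative_support(rhs, transactions)


def interest(lhs, rhs, transactions):
    """i(A => B) = c(A => B) - s(B); may be negative, zero or positive."""
    return confidence(lhs, rhs, transactions) - relative_support(rhs, transactions)


print(lift({"beer"}, {"diapers"}, baskets))      # 10/9
print(interest({"beer"}, {"diapers"}, baskets))  # 1/15
print(lift({"beer"}, {"bread"}, baskets))        # 0
print(interest({"beer"}, {"bread"}, baskets))    # -2/5
```

For {beer} ⇒ {diapers} both measures indicate a mild positive association, whereas beer and bread never share a basket in Table 7.1, which yields a lift of 0 and a negative value of i.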

7.1.4 The cardinality of potential association rules

In order to illustrate the difficulty of the problem we try to solve from a combinatorial point of view, let us quantify the number of possible association rules that one can construct out of d products. Intuitively, in an association rule every item can be absent, present on the left-hand side, or present on the right-hand side of the rule. This means that for every item there are three possibilities in which it can be involved in an association rule, meaning that there are exponentially many, i.e., O(3^d), potential association rules that can be assembled from d distinct items.

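Note, however, that the quantity 3^d is an overestimate of the true number of valid association rules, in which we expect both sides to be disjoint and non-empty. By discounting for the invalid association rules and utilizing the equation

\[ (1+x)^d = \sum_{j=1}^{d} \binom{d}{j} x^{d-j} + x^d, \]

we get that the exact number of valid association rules (choosing j items for the left-hand side and any non-empty subset of the remaining d − j items for the right-hand side) is

\[ \sum_{j=1}^{d} \binom{d}{j} \left( 2^{d-j} - 1 \right) = 3^d - 2^{d+1} + 1. \]

Hence, although our original answer was not correct in the strict sense, because it also included the number of invalid association rules, it was correct in the asymptotic sense.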

Based on the above exact result, we can formulate as many as 18,660 possible association rules even when there are just d = 9 single items to form association rules from. This quick exponential growth in the number of potential association rules is illustrated in Figure 7.2, making it apparent that without efficient algorithms, finding association rules would not be feasible in practice. Since association rules can be created by partitioning frequent item sets, it is of utmost importance that we can find frequent item sets efficiently in the first place.
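The count of 18,660 rules for d = 9 can be double-checked by brute force. The sketch below enumerates every way of assigning each item to "absent", "left-hand side" or "right-hand side" and keeps only the assignments in which both sides are non-empty; it then compares the result with the closed form 3^d − 2^(d+1) + 1, which evaluates to 18,660 for d = 9. The function names are hypothetical and chosen only for this illustration.

```python
from itertools import product


def count_valid_rules(d):
    """Brute-force count of association rules over d items.

    Each item is labelled 0 (absent), 1 (left-hand side) or 2 (right-hand side).
    A rule is valid when both sides are non-empty; disjointness is automatic
    because every item receives exactly one label.
    """
    count = 0
    for assignment in product((0, 1, 2), repeat=d):
        if 1 in assignment and 2 in assignment:
            count += 1
    return count


def closed_form(d):
    """3^d minus the assignments with an empty side, counted by inclusion-exclusion."""
    return 3 ** d - 2 ** (d + 1) + 1


for d in range(1, 10):
    assert count_valid_rules(d) == closed_form(d)

print(closed_form(9))  # 18660
```

The enumeration itself visits 3^d assignments, so it is only practical for small d, which is exactly the growth the text warns about.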

Figure 7.2: Illustration of the number of distinct potential association rules as a function of the number of different items/features in our dataset (d).

7.1.5 Special subtypes of frequent item sets

Before delving into the details of actual algorithms which efficiently determine frequent item sets, we first define a few important special subtypes of item sets. These special classes of item sets are beneficial because they allow for a compressed storage of the frequent item sets that can be found in some transactional dataset.

• A maximal frequent item set is a frequent item set none of whose proper supersets is frequent. To put it formally, an item set I is maximal frequent if it belongs to the set

\[ \{ I \mid I \text{ frequent} \wedge \nexists \text{ frequent } J \supset I \}. \]

• A closed item set is an item set none of whose proper supersets has a support equal to its own or, to put it differently, all of whose proper supersets have a strictly smaller support. Formally, an item set I is closed if it belongs to the set

\[ \{ I \mid \nexists J \supset I : s(J) = s(I) \}. \]

• A closed frequent item set is simply an item set which is closed in the above sense and which has a support exceeding some previously defined frequency threshold τ.

It can be easily seen that maximal frequent item sets are always closed as well. This statement can be verified by contradiction. Suppose that there exists some item set I which is maximal frequent but not closed. If item set I is not closed, then there is at least one superset J ⊃ I with the exact same support, i.e., s(I) = s(J). Since I is a frequent item set by our initial assumption, so is J, because it has the same support as I.

This, however, contradicts the assumption of I being a maximal frequent item set, since a maximal frequent item set has no superset that is frequent as well. Hence maximality of a frequent item set implies that it is closed. Figure 7.3 summarizes the relation of the different subtypes of frequent item sets.

As depicted by Figure 7.3, maximal frequent item sets form just a privileged subset of the frequent item sets, i.e., those special ones that are located at the frequent item set border, also referred to as the positive border. Item sets that are located on the positive border are extremely important, since storing only these item sets is sufficient to implicitly store all the frequent item sets. This is again assured by the anti-monotone property of the support of item sets, i.e., all proper subsets of the item sets on the positive border need to be frequent as well, because their support is at least as high as that of the item sets on the positive border.

Additionally, by the definition of maximal frequent item sets, none of their supersets is frequent; hence storing the maximal frequent item sets is indeed sufficient for implicitly storing all the frequent item sets. Storing maximal frequent item sets, however, would not allow us to reconstruct the supports of all the frequent item sets. If it is also important that we can tell the exact support of every frequent item set, then we need to store all the closed frequent item sets.

Figure 7.3: The relation of frequent item sets to closed and maximal frequent item sets. (Maximal frequent item sets lie within the closed frequent item sets, which in turn lie within the frequent item sets.)

Note, however, that the collection of closed frequent item sets is still narrower than that of frequent item sets (cf. Figure 7.3). Hence, storing closed frequent item sets alone implicitly stores all the frequent item sets in a compressed form together with their supports: the support of any frequent item set equals the largest support among the closed frequent item sets that contain it.

Example 7.4. Table 7.2 contains a small transactional dataset alongside the categorization of the different item sets. In this example, we take the minimum support for regarding an item set as frequent to be 3. We can see, as mentioned earlier, that whenever an item set qualifies as maximal frequent, it is always closed as well.

Additionally, we can see that by taking all the proper subsets of the maximal frequent item sets identified in the transactional dataset, we can generate all further item sets that are frequent as well. That is, from the maximal frequent item sets {A,B} and {B,C} it follows that the individual singleton item sets {A}, {B} and {C} also reach our frequency threshold, which was set to 3.
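The three subtypes can be computed by brute force on a toy dataset. The six transactions below are a hypothetical example invented for this sketch (they are not the contents of Table 7.2), chosen so that the maximal frequent item sets are {A,B} and {B,C} with a minimum support of 3, as in Example 7.4. The sketch treats an item set as frequent when its support is at least the threshold, which is an assumption about how the threshold is applied.

```python
from itertools import combinations

# Hypothetical toy dataset (NOT Table 7.2) and threshold.
transactions = [
    {"A", "B"}, {"A", "B"}, {"A", "B", "C"},
    {"B", "C"}, {"B", "C"}, {"B"},
]
min_support = 3  # assumed to mean: support >= 3 counts as frequent

items = sorted(set().union(*transactions))
all_sets = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)]
supports = {s: sum(1 for t in transactions if s <= t) for s in all_sets}

frequent = {s for s in all_sets if supports[s] >= min_support}

# Closed: no proper superset (s < t) has the same support.
closed_frequent = {
    s for s in frequent
    if not any(s < t and supports[t] == supports[s] for t in all_sets)
}

# Maximal: no proper superset is frequent.
maximal_frequent = {s for s in frequent if not any(s < t for t in frequent)}

# Support of each frequent item set recovered from the closed frequent ones only.
recovered = {
    s: max(supports[c] for c in closed_frequent if s <= c) for s in frequent
}


def show(sets):
    """Sorted, readable listing of a collection of item sets."""
    return sorted("".join(sorted(s)) for s in sets)


print(show(frequent))          # ['A', 'AB', 'B', 'BC', 'C']
print(show(closed_frequent))   # ['AB', 'B', 'BC']
print(show(maximal_frequent))  # ['AB', 'BC']
print(recovered == {s: supports[s] for s in frequent})  # True
```

In this toy example {A} and {C} are frequent but not closed, because {A,B} and {B,C} have the same support of 3, while {B} is closed but not maximal; this reproduces the nesting maximal ⊆ closed ⊆ frequent.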
