An Algebraic Representation of Frequent Market Baskets and Association Rules

BULGARIAN ACADEMY OF SCIENCES

CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 11, No 2 Sofia 2011

J. Demetrovics¹, Hua Nam Son², Akos Guban²

1 MTA SZTAKI, 1111 Budapest, Lágymányosi u. 11

2 Budapest Business School, 1149 Budapest, Buzogány u. 11-13

Abstract: This study proposes an algebraic approach to the formal representation of the Market Basket (MB) model. In a more general model that takes into account the quantity of items in transactions, and using tools of lattice theory, we reconsider well-known problems and give explicit representations of frequent MBs, basic frequent MBs and association rules. As straightforward consequences, algorithms to find them are presented.

Keywords: market basket, frequent item, association rule, lattice.

1. Introduction

Great efforts have been made to discover the information hidden in customer transactions. The study of customer Market Baskets (MB) and the mining of association rules are important in various applications, for example, in decision making and strategy determination in the retail economy [1]. In those studies the market baskets (transactions) are often considered as sets of items purchased by customers. The discovery of large itemsets and association rules attracts the interest of researchers. One can notice that in these studies the researchers are interested in the set of items (e.g. bread, milk, ...) purchased by customers in the supermarket, and do not take into account the quantity of each item. However, it is also of interest to know not only that 70% of customers buy bread and milk, but also that 50% of customers buy 1 kg of bread and 2 l of milk, while 1% of customers buy 10 kg of bread and 1 l of milk. Similar examples can be found for association rules. The value of a quantitative analysis of transactions is evident.

In this study we introduce a quantitative analysis of transactions and of association rules between transactions. The quantitative analysis may reveal information hidden in the transactions. We are interested not only in the statement “90% of customers who buy bread and milk also purchase butter”, but also in the statement “90% of customers who buy 1 kg of bread and 2 l of milk also purchase 0.5 kg of butter”. Because we deal with the quantity of items, our setting differs somewhat from those of previous studies (see [1]). That is why instead of itemsets (see [1]) we use market baskets or transactions. The main advantage of this approach is that all transactions can be examined as elements of a lattice with a natural partial order, so lattice-theoretic methods can be applied to the examination of transactions.

2. A generalized setting for the Market Basket Model

For a finite set of items $P=\{p_1,p_2,\dots,p_n\}$ we consider an MB as a tuple $\alpha=(\alpha[1],\alpha[2],\dots,\alpha[n])$, where $\alpha[i]\in\mathbb{N}$ is the quantity of $p_i$ in the basket $\alpha$. The set of all MBs is denoted by $\Omega$.

For α,β∈Ω where α=(α[1],α[2],...,α[n]), β =(β[1],β[2],...,β[n]) we write α ≤β if for all i=1,2,...,n we have α[i]≤β[i]. 〈Ω,≤〉 is a lattice with the natural partial order ≤. For a set A⊆Ω we denote

$$U(A)=\{\alpha\in\Omega \mid \forall\beta\in A:\ \beta\le\alpha\},$$

$$L(A)=\{\alpha\in\Omega \mid \forall\beta\in A:\ \alpha\le\beta\}.$$

We denote also

$$\sup(A)=\{\alpha\in U(A) \mid \nexists\,\beta\in U(A):\ \beta<\alpha\},\qquad \inf(A)=\{\alpha\in L(A) \mid \nexists\,\beta\in L(A):\ \alpha<\beta\}.$$

One should remark that $\sup(A)$ and $\inf(A)$ are single elements of $\Omega$, namely $\sup(A)=u\in\Omega$, where $u[i]=\max\{\alpha[i] \mid \alpha\in A\}$, and $\inf(A)=v\in\Omega$, where $v[i]=\min\{\alpha[i] \mid \alpha\in A\}$.
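The componentwise description of sup(A) and inf(A) can be illustrated with a minimal Python sketch. The names `Basket`, `leq`, `sup` and `inf` are illustrative choices, not part of the paper, and the sample baskets are those of Example 1 below.

```python
from typing import Iterable, Tuple

Basket = Tuple[int, ...]  # hypothetical alias: one quantity per item of P

def leq(a: Basket, b: Basket) -> bool:
    """Natural partial order: a <= b iff a[i] <= b[i] for every item i."""
    return all(x <= y for x, y in zip(a, b))

def sup(baskets: Iterable[Basket]) -> Basket:
    """Componentwise maximum, the least upper bound of the baskets."""
    return tuple(max(col) for col in zip(*baskets))

def inf(baskets: Iterable[Basket]) -> Basket:
    """Componentwise minimum, the greatest lower bound of the baskets."""
    return tuple(min(col) for col in zip(*baskets))

A = [(2, 1, 0), (1, 1, 1), (1, 0, 1), (2, 2, 0)]
print(sup(A))  # (2, 2, 1)
print(inf(A))  # (1, 0, 0)
```

Every basket of A lies between `inf(A)` and `sup(A)` in the order `leq`, which is the lattice property used throughout the paper.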

For a set $A\subseteq\Omega$ and $\alpha\in\Omega$ we denote by

$$\operatorname{supp}_A(\alpha)=\frac{|\{\beta\in A \mid \alpha\le\beta\}|}{|A|}$$

the support of $\alpha$ in $A$. In words, $\operatorname{supp}_A(\alpha)$ is the ratio of the number of market baskets in $A$ that reach or exceed the given threshold $\alpha$ (given in the form of a sample market basket) to the total number of baskets in $A$. The support of a market basket is a statistical index and, naturally, market baskets with higher support are more significant and attract the attention of managers as well as of researchers.
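As an illustration of the definition, the following Python sketch computes the support as an exact fraction. The function names are hypothetical, and A is represented as a list of tuples; the sample data are taken from Example 1 below.

```python
from fractions import Fraction
from typing import Sequence, Tuple

Basket = Tuple[int, ...]  # hypothetical alias: one quantity per item

def leq(a: Basket, b: Basket) -> bool:
    """a <= b in the natural (componentwise) partial order."""
    return all(x <= y for x, y in zip(a, b))

def support(A: Sequence[Basket], alpha: Basket) -> Fraction:
    """Share of baskets in A that contain at least the quantities of alpha."""
    covering = sum(1 for beta in A if leq(alpha, beta))
    return Fraction(covering, len(A))

A = [(2, 1, 0), (1, 1, 1), (1, 0, 1), (2, 2, 0)]
print(support(A, (1, 1, 0)))  # 3/4
print(support(A, (1, 2, 0)))  # 1/4
```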

One can notice that an item $p_i$ (discussed in other studies, see, for example, [1]) in our study should be identified with $U(\alpha_i)$, where $\alpha_i=(\alpha[1],\alpha[2],\dots,\alpha[n])$, $\alpha[k]=0$ if $k\ne i$ and $\alpha[i]=1$. We should not confuse $p_i$ with $\alpha_i$.

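For $\alpha,\beta\in\Omega$ where $\alpha=(\alpha[1],\alpha[2],\dots,\alpha[n])$ and $\beta=(\beta[1],\beta[2],\dots,\beta[n])$ we write $\gamma=\alpha\cup\beta$ if $\gamma[i]=\max\{\alpha[i],\beta[i]\}$ for all $i=1,2,\dots,n$. We call $\alpha\to\beta$ an association rule of $\beta$ to $\alpha$. By the confidence of $\alpha\to\beta$ in a set of MBs $A$ we understand the ratio

$$\operatorname{conf}_A(\alpha\to\beta)=\frac{\operatorname{supp}_A(\alpha\cup\beta)}{\operatorname{supp}_A(\alpha)}.$$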
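A small Python sketch of the confidence, under the assumption that A is given as a list of tuples; `join`, `support` and `confidence` are illustrative names. The sample rule uses the baskets of Example 1 below.

```python
from fractions import Fraction
from typing import Sequence, Tuple

Basket = Tuple[int, ...]  # hypothetical alias: one quantity per item

def leq(a: Basket, b: Basket) -> bool:
    return all(x <= y for x, y in zip(a, b))

def join(a: Basket, b: Basket) -> Basket:
    """The union of two baskets: componentwise maximum."""
    return tuple(max(x, y) for x, y in zip(a, b))

def support(A: Sequence[Basket], alpha: Basket) -> Fraction:
    return Fraction(sum(1 for g in A if leq(alpha, g)), len(A))

def confidence(A: Sequence[Basket], alpha: Basket, beta: Basket) -> Fraction:
    """conf_A(alpha -> beta) = supp_A(alpha U beta) / supp_A(alpha)."""
    # Raises ZeroDivisionError when no basket in A covers alpha.
    return support(A, join(alpha, beta)) / support(A, alpha)

A = [(2, 1, 0), (1, 1, 1), (1, 0, 1), (2, 2, 0)]
print(confidence(A, (1, 1, 0), (2, 0, 0)))  # 2/3
```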

As remarked in [1] the support of MBs is a kind of statistical index, while the confidence of association rules is a measure of their “strength”.

3. Frequent Market Baskets

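For a set $A\subseteq\Omega$, $\alpha\in\Omega$ and $0\le\varepsilon\le 1$ we say that $\alpha$ is an $\varepsilon$-frequent MB if $\operatorname{supp}_A(\alpha)\ge\varepsilon$. The set of all $\varepsilon$-frequent MBs is denoted by $\Phi^{\varepsilon}_A$. We have the following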

Apriori Principle. For a set $A\subseteq\Omega$, $\alpha,\beta\in\Omega$ and $0\le\varepsilon\le 1$, if $\alpha\le\beta$ and $\beta$ is $\varepsilon$-frequent then $\alpha$ is $\varepsilon$-frequent.
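The principle follows directly from the definition of support. The chain below spells out the step, using only the notation introduced in Section 2: every basket of A that covers β also covers α.

$$\alpha\le\beta\ \Longrightarrow\ \{\gamma\in A \mid \beta\le\gamma\}\subseteq\{\gamma\in A \mid \alpha\le\gamma\}\ \Longrightarrow\ \operatorname{supp}_A(\alpha)\ \ge\ \operatorname{supp}_A(\beta)\ \ge\ \varepsilon.$$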

Example 1. Consider a set of items $P=\{a,b,c\}$ and a set of transactions $A=\{\alpha,\beta,\gamma,\delta\}$, where $\alpha=(2,1,0)$, $\beta=(1,1,1)$, $\gamma=(1,0,1)$, $\delta=(2,2,0)$. One can see that for $\sigma=(1,1,0)$, $\eta=(1,2,0)$ we have $\operatorname{supp}_A(\sigma)=\frac{3}{4}$ and $\operatorname{supp}_A(\eta)=\frac{1}{4}$. For the threshold $\varepsilon=\frac{1}{2}$ the $\varepsilon$-frequent MBs of $A$ are:

$$\Phi^{1/2}_A=\{(2,1,0),\ (1,0,1),\ (1,1,0),\ (2,0,0),\ (0,0,1),\ (0,1,0),\ (1,0,0),\ (0,0,0)\}.$$

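Let us denote

$$\Phi_{A,k}=\{\alpha\in\Omega \mid \exists\,\alpha_1,\alpha_2,\dots,\alpha_k\in A:\ \alpha\le\{\alpha_1,\alpha_2,\dots,\alpha_k\}\},$$

where $\alpha\le\{\alpha_1,\alpha_2,\dots,\alpha_k\}$ means $\alpha\le\alpha_i$ for every $i$, and $\alpha_1,\alpha_2,\dots,\alpha_k$ are understood as $k$ distinct baskets of $A$. One can remark that if $k\le l$ then $\Phi_{A,k}\supseteq\Phi_{A,l}$ and $\Phi^{\varepsilon}_A=\Phi_{A,k}$, where $k=\lceil\varepsilon|A|\rceil$ denotes the smallest integer that is greater than or equal to $\varepsilon|A|$. We have the following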

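Theorem 1. For a set of items $P=\{p_1,p_2,\dots,p_n\}$, a set of MBs $A\subseteq\Omega$ and a threshold $0\le\varepsilon\le 1$ an MB $\alpha\in\Omega$ is $\varepsilon$-frequent iff there exist $\alpha_1,\alpha_2,\dots,\alpha_k\in A$ such that $\alpha\in L(\{\alpha_1,\alpha_2,\dots,\alpha_k\})$, where $k=\lceil\varepsilon|A|\rceil$.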

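Proof: If there exist $\alpha_1,\alpha_2,\dots,\alpha_k\in A$, $k=\lceil\varepsilon|A|\rceil$, such that $\alpha\in L(\{\alpha_1,\alpha_2,\dots,\alpha_k\})$ then $\alpha\le\alpha_i$ for all $i=1,2,\dots,k$, i.e.,

$$\operatorname{supp}_A(\alpha)=\frac{|\{\beta\in A \mid \alpha\le\beta\}|}{|A|}\ \ge\ \frac{k}{|A|}\ \ge\ \varepsilon.$$

Vice versa, if $\operatorname{supp}_A(\alpha)\ge\varepsilon$ then $|\{\beta\in A \mid \alpha\le\beta\}|\ge\varepsilon|A|$, i.e. there exist $\alpha_1,\alpha_2,\dots,\alpha_k\in A$, $k=\lceil\varepsilon|A|\rceil$, such that $\alpha\in L(\{\alpha_1,\alpha_2,\dots,\alpha_k\})$. The proof is completed.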

By Theorem 1 we have the following

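Algorithm 1 (Creating all $\varepsilon$-frequent MBs of a given set of transactions $A$).

Input. Set of items $P$, set of MBs $A\subseteq\Omega$ and a threshold $0\le\varepsilon\le 1$.

Output. $\Phi^{\varepsilon}_A$.

Step 1. $\Phi^{\varepsilon}_A:=\emptyset$.

Step 2. $k=\lceil\varepsilon|A|\rceil$.

For all $B\subseteq A$, $|B|=k$

    $\Phi^{\varepsilon}_A:=\Phi^{\varepsilon}_A\cup L(B)$

EndFor;

End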

Let $|P|=n$, $k=\lceil\varepsilon|A|\rceil$, $m=\max\{\alpha[i] \mid \alpha\in A,\ i=1,2,\dots,n\}$. The algorithm requires $O\left(\binom{|A|}{k}\,(m+1)^n\right)$ running time.
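Algorithm 1 can be sketched in Python as follows. This is a brute-force illustration under stated assumptions, not the authors' implementation: A is a list of tuples, ε > 0 (so that k ≥ 1), and L(B) is computed as the set of all baskets below inf(B), which is the same set of lower bounds. All function names are hypothetical.

```python
from itertools import combinations, product
from math import ceil

def inf(baskets):
    """Componentwise minimum of a non-empty collection of baskets."""
    return tuple(min(col) for col in zip(*baskets))

def lower_set(top):
    """L(top): every basket beta with beta <= top componentwise."""
    return set(product(*(range(q + 1) for q in top)))

def frequent_baskets(A, eps):
    """All eps-frequent baskets of A, following Steps 1 and 2 of Algorithm 1."""
    k = ceil(eps * len(A))          # assumes eps > 0, hence k >= 1
    result = set()                  # Step 1
    for B in combinations(A, k):    # Step 2: every k-element subset of A
        result |= lower_set(inf(B)) # L(B) equals L(inf(B))
    return result

A = [(2, 1, 0), (1, 1, 1), (1, 0, 1), (2, 2, 0)]
print(sorted(frequent_baskets(A, 0.5)))
```

For the data of Example 1 the call returns the eight baskets listed there. The threshold is passed as a float here; an exact type such as `fractions.Fraction` avoids rounding surprises in `ceil` for thresholds that are not exactly representable.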

As a consequence of the previous theorem we have the following

Theorem 2 (Explicit representation of large MBs). For a set of items $P=\{p_1,p_2,\dots,p_n\}$, a set of MBs $A\subseteq\Omega$ and a threshold $0\le\varepsilon\le 1$ there exist $\alpha_1,\alpha_2,\dots,\alpha_s\in\Omega$, where $s=\binom{|A|}{\lceil\varepsilon|A|\rceil}$, such that

$$\Phi^{\varepsilon}_A=\bigcup_{i=1}^{s}L(\alpha_i).$$

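Proof: Let $\alpha_1,\alpha_2,\dots,\alpha_s$ be the set of all $\inf\{\beta_1,\beta_2,\dots,\beta_k\}$ where $k=\lceil\varepsilon|A|\rceil$ and $\beta_i\in A$. By Theorem 1 we have

$$\alpha\in\Phi^{\varepsilon}_A\ \Leftrightarrow\ \alpha\le\inf(\{\beta_1,\beta_2,\dots,\beta_k\})$$

for some $\{\beta_1,\beta_2,\dots,\beta_k\}\subseteq A$, where $k=\lceil\varepsilon|A|\rceil$. This implies that $\Phi^{\varepsilon}_A=\bigcup_{i=1}^{s}L(\alpha_i)$. The proof is completed.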

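We should remark that $\alpha_i\le\alpha_j$ iff $L(\alpha_i)\subseteq L(\alpha_j)$. For a set of MBs $A$ and a given threshold $\varepsilon$, a set of MBs $\alpha_1,\alpha_2,\dots,\alpha_s$ for which

(i) $\Phi^{\varepsilon}_A=\bigcup_{i=1}^{s}L(\alpha_i)$,

(ii) for all $i,j$ with $1\le i,j\le s$ and $i\ne j$ we have $\alpha_i\not\le\alpha_j$ and $\alpha_j\not\le\alpha_i$

is called the basic $\varepsilon$-frequent set of MBs of $A$. It is easy to verify that for given $A$ and $\varepsilon$ the basic $\varepsilon$-frequent set of MBs of $A$ is unique; we denote it by $S^{\varepsilon}_A$. Since the determination of $\Phi^{\varepsilon}_A$ (the set of all $\varepsilon$-frequent MBs in $A$) is important, it is interesting to determine its basic $\varepsilon$-frequent set of MBs $S^{\varepsilon}_A$. We have the following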

Theorem 3. For a set of items $P$ and a threshold $0\le\varepsilon\le 1$, every set of MBs $A\subseteq\Omega$ has a unique basic $\varepsilon$-frequent set of MBs $S^{\varepsilon}_A$.

The simple proof is omitted. The following algorithm creates the unique basic $\varepsilon$-frequent set of MBs for a given set of MBs $A\subseteq\Omega$ and a given threshold $\varepsilon$:

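Algorithm 2 (Creating the basic $\varepsilon$-frequent set of MBs $S^{\varepsilon}_A$).

Input. Set of items $P$, set of MBs $A\subseteq\Omega$ and a threshold $0\le\varepsilon\le 1$.

Output. $S^{\varepsilon}_A$.

Step 1. $S^{\varepsilon}_A:=\emptyset$.

Step 2. $k=\lceil\varepsilon|A|\rceil$.

For $B\subseteq A$, $|B|=k$

    For $\alpha\in S^{\varepsilon}_A$

        If $\alpha\le\inf(B)$ or $\inf(B)\le\alpha$ then

            $S^{\varepsilon}_A:=S^{\varepsilon}_A\setminus\{\min(\alpha,\inf(B))\}\cup\{\max(\alpha,\inf(B))\}$.

        else

            $S^{\varepsilon}_A:=S^{\varepsilon}_A\cup\{\inf(B)\}$.

        EndIf

    EndFor

EndFor

End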

For $|P|=n$, $k=\lceil\varepsilon|A|\rceil$, $m=\max\{\alpha[i] \mid i=1,2,\dots,n;\ \alpha\in A\}$ one can see that $|S^{\varepsilon}_A|\le\binom{|A|}{k}$. Therefore the algorithm requires $O\left(\binom{|A|}{k}\,mn\right)$ running time. One can also remark that in the case of a large number of transactions $A$ the basic $\varepsilon$-frequent set of MBs $S^{\varepsilon}_A$ can be generated much more quickly than the set of all $\varepsilon$-frequent MBs $\Phi^{\varepsilon}_A$.
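The idea of Algorithm 2 can be sketched in Python as below. The sketch does not follow the loop structure of the listing literally: it keeps the maximal infima of the k-element subsets, which is what conditions (i) and (ii) of the definition require, and it also covers the case where the set built so far is still empty. It assumes ε > 0, and the function names are hypothetical.

```python
from itertools import combinations
from math import ceil

def leq(a, b):
    """a <= b in the natural (componentwise) partial order."""
    return all(x <= y for x, y in zip(a, b))

def inf(baskets):
    """Componentwise minimum of a non-empty collection of baskets."""
    return tuple(min(col) for col in zip(*baskets))

def basic_frequent_baskets(A, eps):
    """Maximal elements among inf(B) over all k-element subsets B of A."""
    k = ceil(eps * len(A))  # assumes eps > 0, hence k >= 1
    basic = set()
    for B in combinations(A, k):
        candidate = inf(B)
        if any(leq(candidate, kept) for kept in basic):
            continue  # dominated by (or equal to) a kept basket
        # Remove kept baskets that the candidate dominates, then keep it.
        basic = {kept for kept in basic if not leq(kept, candidate)}
        basic.add(candidate)
    return basic

A = [(2, 1, 0), (1, 1, 1), (1, 0, 1), (2, 2, 0)]
print(basic_frequent_baskets(A, 0.5))
```

For the transactions of Example 1 and ε = 1/2 the result consists of the two baskets (2, 1, 0) and (1, 0, 1).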

Example 2. We continue Example 1. For the set of transactions $A$, Algorithm 2 generates the basic $\frac{1}{2}$-frequent set of MBs $S^{1/2}_A=\{\rho,\theta\}$, where $\rho=(2,1,0)$, $\theta=(1,0,1)$. It means that the family of $\frac{1}{2}$-frequent MBs of $A$ is $\Phi^{1/2}_A=L(\rho)\cup L(\theta)$.

4. Association and confidence

In our generalized model of market baskets we can find all associations with a given confidence. For a set of items $P$, a set of MBs $A\subseteq\Omega$ and a threshold $0\le\varepsilon\le 1$ an association $\alpha\to\beta$ is $\varepsilon$-confident if $\operatorname{conf}_A(\alpha\to\beta)\ge\varepsilon$. The set of all $\varepsilon$-confident associations of $A$ is denoted by $C^{\varepsilon}_A$. We have the following

Theorem 4. For a set of products $P$, a set of MBs $A\subseteq\Omega$ and $0\le\varepsilon\le 1$ an association $\alpha\to\beta$ is $\varepsilon$-confident iff

$$\frac{|U(\alpha\cup\beta)\cap A|}{|U(\alpha)\cap A|}\ \ge\ \varepsilon.$$

Proof: Remark that

$$\operatorname{supp}_A(\alpha\cup\beta)=\frac{|U(\alpha\cup\beta)\cap A|}{|A|}\quad\text{and}\quad\operatorname{supp}_A(\alpha)=\frac{|U(\alpha)\cap A|}{|A|}.$$

With these remarks the proof of the theorem is straightforward.
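Spelled out with the definitions of Section 2 (a worked restatement, not an additional result): since $U(\alpha)\cap A=\{\gamma\in A \mid \alpha\le\gamma\}$, the factor $|A|$ cancels in the confidence,

$$\operatorname{conf}_A(\alpha\to\beta)=\frac{\operatorname{supp}_A(\alpha\cup\beta)}{\operatorname{supp}_A(\alpha)}=\frac{|U(\alpha\cup\beta)\cap A|/|A|}{|U(\alpha)\cap A|/|A|}=\frac{|U(\alpha\cup\beta)\cap A|}{|U(\alpha)\cap A|},$$

so $\operatorname{conf}_A(\alpha\to\beta)\ge\varepsilon$ is exactly the inequality of Theorem 4, assuming $U(\alpha)\cap A\ne\emptyset$ so that the ratio is defined.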

A natural question for cross marketing, store layout, ... (see, for example, [1]) is to find all association rules with a given confidence. In our generalized model the following theorem shows, in a sense, an explicit representation of all association rules. More exactly, we show for a given MB $\alpha$ which set of MBs $\beta$ may be associated to $\alpha$ with a given threshold of confidence.

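For MBs $\rho,\sigma$ where $\rho\le\sigma$, let us denote

$$M(\rho,\sigma)=\{\eta\in\Omega \mid \rho\cup\eta\le\sigma\}.$$

It should be remarked that $M(\rho,\sigma)$ can be represented explicitly. If $\rho=(\rho_1,\rho_2,\dots,\rho_n)$, $\sigma=(\sigma_1,\sigma_2,\dots,\sigma_n)$ then $\eta=(\eta_1,\eta_2,\dots,\eta_n)\in M(\rho,\sigma)$ iff $\max(\rho_i,\eta_i)\le\sigma_i$ for all $i=1,2,\dots,n$; since $\rho_i\le\sigma_i$, this amounts to $\eta_i\le\sigma_i$ for all $i$.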
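The explicit form of M(ρ, σ) can be checked with a short Python sketch. The function names are illustrative; returning the empty set when ρ ≤ σ fails is an assumption that simply follows the set-builder formula outside the case ρ ≤ σ considered in the text.

```python
from itertools import product

def leq(a, b):
    """a <= b in the natural (componentwise) partial order."""
    return all(x <= y for x, y in zip(a, b))

def join(a, b):
    """The union of two baskets: componentwise maximum."""
    return tuple(max(x, y) for x, y in zip(a, b))

def M(rho, sigma):
    """All eta with join(rho, eta) <= sigma."""
    if not leq(rho, sigma):
        return set()  # no eta can satisfy the condition in this case
    # With rho <= sigma the condition reduces to eta <= sigma.
    return set(product(*(range(q + 1) for q in sigma)))

print(sorted(M((1, 1, 0), (2, 1, 0))))
```

With ρ = (1, 1, 0) and σ = (2, 1, 0) the call yields the six baskets below (2, 1, 0), the same set that appears in Example 3.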

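Theorem 5 (Explicit representation of association rules). For a set of items $P=\{p_1,p_2,\dots,p_n\}$, a set of MBs $A\subseteq\Omega$, an MB $\alpha\in\Omega$ and a threshold $0\le\varepsilon\le 1$ there exist $\alpha_1,\alpha_2,\dots,\alpha_k\in\Omega$ such that for all $\beta\in\Omega$: $\alpha\to\beta$ is an $\varepsilon$-confident association rule iff

$$\beta\in\bigcup_{i=1}^{k}M(\alpha,\alpha_i).$$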

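Proof: Put $s=\lceil\varepsilon|U(\alpha)\cap A|\rceil$. By Theorem 4, $\alpha\to\beta$ is an $\varepsilon$-confident association rule iff $|U(\alpha\cup\beta)\cap A|\ge s$. Let $\alpha_i$ denote $\inf(B)$, where $B\subseteq A$, $|B|\ge s$. One can verify that $|U(\alpha\cup\beta)\cap A|\ge s$ iff $\beta\in M(\alpha,\alpha_i)$ for some $i$. The proof is completed.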

Theorem 5 gives, in a sense, an explicit representation of association rules. As a straightforward consequence, we have an algorithm to find all ε-confident association rules for a given left-hand side.

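Algorithm 3 (Creating all $\varepsilon$-confident association rules $\alpha\to\beta$ for given $\alpha$).

Input. A set of items $P$, a set of MBs $A\subseteq\Omega$, a threshold $0\le\varepsilon\le 1$ and an MB $\alpha$.

Output. $\bigcup_{i=1}^{k}M(\alpha,\alpha_i)$.

Step 1. $B:=U(\alpha)\cap A=\{\gamma\in A \mid \alpha\le\gamma\}$.

Step 2. $s:=\lceil\varepsilon|B|\rceil$. $k:=|\{C\subseteq B \mid |C|\ge s\}|$.

For $C\subseteq B$, $|C|\ge s$, calculate $\alpha_i=\inf(C)$, $i=1,2,\dots,k$. EndFor

Step 3.

For $i=1,2,\dots,k$ calculate $M(\alpha,\alpha_i)$ EndFor

Step 4.

Output $\bigcup_{i=1}^{k}M(\alpha,\alpha_i)$.

End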
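A brute-force Python sketch of Algorithm 3 is given below, under the assumptions that A is a list of tuples and ε > 0. Because every C in Step 2 is a subset of U(α) ∩ A, each inf(C) lies above α, so M(α, inf(C)) is the set of all baskets below inf(C); the sketch uses this directly. Function names are hypothetical.

```python
from itertools import combinations, product
from math import ceil

def leq(a, b):
    """a <= b in the natural (componentwise) partial order."""
    return all(x <= y for x, y in zip(a, b))

def inf(baskets):
    """Componentwise minimum of a non-empty collection of baskets."""
    return tuple(min(col) for col in zip(*baskets))

def lower_set(top):
    """Every basket eta with eta <= top componentwise."""
    return set(product(*(range(q + 1) for q in top)))

def confident_right_sides(A, alpha, eps):
    """All beta such that alpha -> beta is eps-confident in A."""
    B = [g for g in A if leq(alpha, g)]       # Step 1: U(alpha) ∩ A
    s = ceil(eps * len(B))                    # Step 2
    result = set()
    for size in range(max(s, 1), len(B) + 1):
        for C in combinations(B, size):
            alpha_i = inf(C)                  # alpha <= alpha_i holds here
            result |= lower_set(alpha_i)      # Steps 3-4: M(alpha, alpha_i)
    return result

A = [(2, 1, 0), (1, 1, 1), (1, 0, 1), (2, 2, 0)]
print(sorted(confident_right_sides(A, (1, 1, 0), 0.5)))
```

Enlarging C can only lower inf(C), so restricting the loop to subsets of size exactly s (for s ≥ 1) gives the same union at lower cost; the sketch enumerates all sizes only to mirror the listing. If no basket of A covers α, the function returns the empty set, since the confidence is then undefined.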

Example 3. We continue Example 1. For the set of MBs $A$ (see Example 1), the MB $\sigma=(1,1,0)$ and the threshold $\varepsilon=\frac{1}{2}$ we should find all MBs $\eta$ such that $\sigma\to\eta$ is an $\varepsilon$-confident association rule. We can see that

$$U(\sigma)\cap A=\{(2,1,0),\ (1,1,1),\ (2,2,0)\}$$

and $s:=\lceil\varepsilon|U(\sigma)\cap A|\rceil=2$. By Step 2 in Algorithm 3 we have $k=4$, and the distinct infima are $\alpha_1=(1,1,0)$, $\alpha_2=(2,1,0)$. The set of all MBs $\eta$ such that $\sigma\to\eta$ is a $\frac{1}{2}$-confident association rule is

$$M(\sigma,\alpha_1)\cup M(\sigma,\alpha_2)=\{(1,1,0),\ (1,0,0),\ (0,1,0),\ (0,0,0),\ (2,1,0),\ (2,0,0)\}.$$

As a result we see that besides the trivial association rules of the form $\sigma\to\sigma'$, where $\sigma'\le\sigma$, we obtain the non-trivial association rules $\sigma\to(2,1,0)$ and $\sigma\to(2,0,0)$. In words, among the customers in $A$ who buy $a$ and $b$, more than 50% buy two $a$ items and one $b$ item, and likewise more than 50% buy two $a$ items.
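The two non-trivial rules can be checked by hand from the definition of confidence and the data of Example 1. Both right-hand sides give the same union with $\sigma$, namely $\sigma\cup(2,1,0)=\sigma\cup(2,0,0)=(2,1,0)$, which is covered by the two baskets $(2,1,0)$ and $(2,2,0)$, while $\sigma$ itself is covered by three baskets:

$$\operatorname{conf}_A(\sigma\to(2,1,0))=\operatorname{conf}_A(\sigma\to(2,0,0))=\frac{\operatorname{supp}_A((2,1,0))}{\operatorname{supp}_A(\sigma)}=\frac{2/4}{3/4}=\frac{2}{3}\ \ge\ \frac{1}{2}.$$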

5. Conclusion

In this study we have proposed an algebraic approach to the MB model. Well-known problems are analysed in a new, more general setting. Explicit representations of the set of frequent MBs and of association rules are presented. We define the set of basic frequent MBs, which determines the set of frequent MBs and can be created in a shorter time. We described algorithms that produce the set of frequent MBs and the set of basic frequent MBs, as well as an algorithm that produces the set of association rules for a given left-hand side. The algebraic approach we propose here brings about a clearer representation of well-known results and appears to be a good tool for future study of the market basket model.

References

1. Agrawal, R., R. Srikant. Fast Algorithms for Mining Association Rules. VLDB, 1994, 487-499.

2. Brüggermann, T., P. Hedström, M. Josefsson. Data Mining and Data Based Direct Marketing Activities. Book on Demand GmbH, Norderstedt, Germany, 2004.

3. Han, J., M. Kamber. Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publ., 2006.

4. Mannila, H., H. Toivonen. Discovering Generalized Episodes Using Minimal Occurrences. – In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96). August 1996, AAAI Press, 146-151.

5. Toivonen, H. Sampling Large Databases for Association Rules. Morgan Kaufmann Publ., 1996, 134-145.

6. Pasquier, N., Y. Bastide, R. Taouil, L. Lakhal. Discovering Frequent Closed Itemsets for Association Rules. – ICDT, 1999, 398-416.

7. Hsu, Ping-Yu, Yen-Liang Chen, Chun-Ching Ling. Algorithms for Mining Association Rules in Bag Databases. – Information Sciences, Vol. 166, 2004, Issues 1-4, 31-47.
