
2 Methods applied and collections used in the experiments

2.3 Evaluation methods and measures

This section gives a brief description of the relevance effectiveness measures that were used to compare the newly developed retrieval methods introduced in Chapters 5 and 6.

2.3.1 Main concepts

The effectiveness of an information retrieval system (or method) means how well (or how badly) it performs. Effectiveness is numerically expressed by effectiveness measures, which are elaborated based on different categories such as [46]:

• Relevance,

• Efficiency,

• Utility,

• User satisfaction.

Relevance effectiveness is the ability of a retrieval method or system to return relevant answers. The traditional (and widely used) measures are the following:

Precision: the proportion of relevant documents among those returned.

Recall: the proportion of relevant documents that are returned.

Fallout: the proportion of nonrelevant documents that are returned.

Attempts to balance these measures have been made, and various other complementary or alternative measures have been proposed [19][78][12]. Subsection 2.3.2 introduces the three above-mentioned, widely accepted and used measures, as well as the precision-recall measurement method, since they were used to measure the relevance effectiveness of the developed Entropy- and Probability-based retrieval methods introduced in Chapter 5.

The in vivo measurement of a Web search engine's relevance effectiveness using the traditional precision/recall method is known to be impossible [51]. Recall and fallout cannot be measured (although methods have been suggested to estimate them: [32][17][42][18][64]), since we do not know all the documents on the Web. This means that measuring the relevance effectiveness of search engines requires measures other than the traditional ones. The measurement of the relevance effectiveness of a Web search engine is typically user centred, due to the characteristics of the Web [13]. It is an experimentally established fact that the majority of users examine, in general, only the first two pages of a hit list [9][65]. Thus, the search engine should rank the most relevant pages within the first few pages. When elaborating such new measures, one tries to reuse traditional measures (for example, precision, which can also be calculated for the hit list of a search engine) and to take into account different characteristics of the Web. The methods used for evaluating the newly developed WebCIR Web search engine (introduced in Chapter 7) are described in Subsections 2.3.3 through 2.3.6.

2.3.2 Precision-recall method

The precision-recall measurement method is used for the in vitro (i.e., under laboratory conditions, in a controlled and repeatable manner) measurement of relevance effectiveness [6]. In this measurement method, test collections are used (some of which were introduced in Section 2.1).

Let D denote a collection of documents, q a query, and

• ∆ ≠ 0 denote the total number of documents relevant to query q,

• κ ≠ 0 denote the number of retrieved documents in response to query q,

• α denote the number of retrieved and relevant documents.

From a practical point of view, it is reasonable to assume that the total number of documents to be searched, M, is greater than the number of relevant ones, i.e., |D| = M > ∆. The usual relevance effectiveness measures are defined formally as follows:

1. Recall ρ is defined as ρ = α / ∆.

2. Precision π is defined as π = α / κ.

3. Fallout ϕ is defined as ϕ = (κ − α) / (M − ∆).

Figure 2.1 shows a visual representation of these measures. From definitions 1-3 above, it follows that:

• 0 ≤ ρ ≤ 1; 0 ≤ π ≤ 1,

• ρ = 0 ⇔ π = 0; π = 1 ⇔ ϕ = 0,

• α = κ = ∆ ⇔ (ρ = π = 1 ∧ ϕ = 0).

Figure 2.1 Visual representation of quantities which define precision, recall, fallout.
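
As a concrete illustration of definitions 1-3, the following is a minimal sketch in Python, assuming binary relevance judgements are available as sets of document identifiers (all names are illustrative):

```python
def precision_recall_fallout(retrieved, relevant, collection_size):
    """Compute precision, recall and fallout for one query.

    retrieved       -- set of document ids returned by the system (kappa = |retrieved|)
    relevant        -- set of document ids judged relevant (delta = |relevant|)
    collection_size -- total number of documents M in the collection
    """
    alpha = len(retrieved & relevant)                 # retrieved and relevant
    kappa = len(retrieved)
    delta = len(relevant)
    precision = alpha / kappa if kappa else 0.0       # pi  = alpha / kappa
    recall = alpha / delta if delta else 0.0          # rho = alpha / delta
    nonrelevant = collection_size - delta             # M - delta
    fallout = (kappa - alpha) / nonrelevant if nonrelevant else 0.0
    return precision, recall, fallout

# Example: M = 1000 documents, 10 relevant, 8 retrieved of which 5 are relevant
retrieved = set(range(8))
relevant = set(range(5)) | {100, 101, 102, 103, 104}
print(precision_recall_fallout(retrieved, relevant, 1000))   # (0.625, 0.5, ~0.003)
```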

For every query, retrieval should be performed, using the retrieval method whose relevance effectiveness is to be measured. The hit list is then compared with the relevance list corresponding to the query under focus. The following recall levels are considered to be standard levels:

0.1; 0.2; 0.3; 0.4; 0.5; 0.6; 0.7; 0.8; 0.9; 1;

(these levels can also be given as percentages, for example 0.1 = 10%). For every query, recall-precision pairs are computed. If a computed recall value is not standard (i.e., it is not in the list above), it is approximated. The precision values corresponding to equal recall values are averaged.

When the computed recall value r is not equal to a standard level, the following interpolation method can be used to calculate the precision value p(rj) corresponding to the standard recall value rj:

p(rj) = max { p(r) : rj−1 < r ≤ rj },   j = 1,…,10.

It is known from practice that the values p(rj) are monotonically decreasing. Thus, the value p(r0) is usually chosen so that p(r0) ≥ p(r1). For all queries qi, the precision values pi(rj) can be averaged at all standard recall levels as follows:

P(rj) = (1/n) · Σ_{i=1..n} pi(rj),   j = 0,…,10,

where n denotes the number of queries used. Figure 2.2 shows a typical precision-recall graph (for the test collection ADI).


Figure 2.2 Typical precision-recall graph (for the test collection ADI)

The average of the values P(rj) is called MAP (Mean Average Precision). MAP can also be computed just at the recall values 0.3, 0.6, and 0.9.
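
As an illustration, the following is a minimal Python sketch of the interpolation, the averaging over queries, and the MAP value as defined above, assuming each query's run has already been reduced to a list of (recall, precision) pairs (all names are illustrative; the choice of p(r0) as the highest precision observed is one common convention):

```python
STANDARD_LEVELS = [j / 10 for j in range(11)]      # r_0 = 0.0, r_1 = 0.1, ..., r_10 = 1.0

def interpolate(rp_pairs):
    """Interpolated precision p(r_j) at the standard recall levels for one query.

    rp_pairs -- (recall, precision) pairs computed while walking down the hit list.
    p(r_j) is the maximum precision observed at any recall r with r_{j-1} < r <= r_j;
    p(r_0) is set to the highest precision observed, so that p(r_0) >= p(r_1).
    """
    curve = []
    for j in range(1, 11):
        lo, hi = STANDARD_LEVELS[j - 1], STANDARD_LEVELS[j]
        in_range = [p for r, p in rp_pairs if lo < r <= hi]
        curve.append(max(in_range) if in_range else 0.0)
    p0 = max([p for _, p in rp_pairs] + curve) if rp_pairs else 0.0
    return [p0] + curve                              # p(r_0), ..., p(r_10)

def average_precision_curve(per_query_pairs):
    """P(r_j) = (1/n) * sum_i p_i(r_j) over the n queries, and MAP as defined above."""
    curves = [interpolate(pairs) for pairs in per_query_pairs]
    n = len(curves)
    averaged = [sum(c[j] for c in curves) / n for j in range(11)]
    map_value = sum(averaged) / len(averaged)        # mean of the P(r_j) values
    return averaged, map_value
```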

2.3.3 MLS method

The MLS method [24], based on principles given in [42], measures the ability of a search engine to rank relevant hits within the first 5 or 10 hits, and involves user assessments. The method proceeds as follows:

1. Select search engine to be measured.

2. Define relevance categories, groups, and weights.

3. Define the queries Qi (i = 1,...,s).

4. Compute P5i and/or P10i for Qi (i = 1,...,s).

5. The first 5/10-precision of the search engine is the average over the queries:

P5 = (1/s) · Σ_{i=1..s} P5i   and/or   P10 = (1/s) · Σ_{i=1..s} P10i.

Each hit is judged for relevance. The first ten hits are grouped into three groups as follows:

1. group: the first two hits,
2. group: the next three hits,
3. group: the remaining five hits.

Groups 1 and 2 are based on the assumption that, in practice, the most important hits are the first five (usually on the first screen) [24]. Hits within the same group receive equal weights. The weights reflect the fact that the user is more satisfied if the relevant hits appear on the first screen. According to [24], the group weights were chosen as 20, 17, and 10, respectively. The P10 measure is then computed from the weighted counts of relevant hits in the three groups, where r_hit_{x-y} denotes the number of relevant hits within ranks x through y, and miss_hit_{u-v} denotes the number of missing hits (i.e., hits not returned) within ranks u through v.
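
As an illustration only, the following Python sketch computes a weighted first-ten precision from these quantities; the exact formula and normalisation of [24] are not reproduced here, and dividing the weighted number of relevant hits by the weighted number of hits actually returned (so that missing hits are discounted) is an assumption of the sketch:

```python
GROUPS = [(0, 2, 20), (2, 5, 17), (5, 10, 10)]     # (start, end, weight) over ranks 1-10

def weighted_p10(judgements):
    """Illustrative weighted first-ten precision for one query.

    judgements -- list of at most 10 booleans, True where the hit at that rank was
                  judged relevant; a list shorter than 10 means missing hits.
    """
    gained = 0
    attainable = 0
    for start, end, weight in GROUPS:
        returned = judgements[start:end]             # hits actually present in this group
        gained += weight * sum(1 for rel in returned if rel)
        attainable += weight * len(returned)         # (end - start) - len(returned) = miss_hit
    return gained / attainable if attainable else 0.0

# Example: only 7 hits returned, relevant at ranks 1, 3 and 6
print(weighted_p10([True, False, True, False, False, True, False]))
```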

2.3.4 DCG method

The DCG (Discounted Cumulative Gain; [37]) method makes it possible to measure the cumulative gain a user obtains by examining the hits. Given a ranked hit list H: 1,…,i,…,n with the corresponding relevance degrees r1,…,ri,…,rn, the gain of each hit is discounted progressively with its rank (typically by log2 i for i ≥ 2) and accumulated down the list. Since the relevance degrees r1,…,ri,…,rn require relevance judgements, this method also involves the user's assessments.
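
For illustration, the following is a minimal Python sketch of the cumulative computation in the common base-2 form; the relevance scale and the log2 discount are assumptions of the sketch, not necessarily the exact variant of [37]:

```python
import math

def dcg(relevance_degrees):
    """Discounted cumulative gain computed down the hit list.

    relevance_degrees -- r_1, ..., r_n for the ranked hits; the gain of the hit at
    rank i is discounted by log2(i) for i >= 2, so relevant documents found late
    in the list contribute less to the cumulative gain.
    """
    curve = []
    total = 0.0
    for i, r in enumerate(relevance_degrees, start=1):
        total += r if i == 1 else r / math.log2(i)
        curve.append(total)                          # DCG_i after examining the first i hits
    return curve

# Example: graded judgements on a 0-3 scale for the first five hits
print(dcg([3, 2, 0, 1, 2]))
```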

2.3.5 RC method

The RC (Reference Count; [86]) method allows ranking search engines without relevance judgements. Consider a query Q and n search engines, and let Li = d1i, …, dji, …, dmi be the hit list returned by search engine i in response to Q (i = 1,…,n). Let o(dji) denote the number of occurrences of dji in all the other hit lists. The RCQ,i measure is calculated for a given query Q and search engine i as follows:

RCQ,i = o(d1i) + … + o(dji) + … + o(dmi),   i = 1,…,n.   (2.3)

The values RCQ,i are then averaged over all queries:

RCi = (1/s) · Σ_Q RCQ,i,

where s denotes the total number of queries. Finally, the search engines are ranked ascendingly on RCi. In my experiments m = 5.
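
The following is a minimal Python sketch of the per-query reference counts of eq. (2.3); hit lists are assumed to be lists of document identifiers such as URLs (all names are illustrative):

```python
def reference_counts(hit_lists):
    """RC_{Q,i} for one query Q and every search engine i.

    hit_lists -- hit_lists[i] is the list of the first m documents returned by engine i.
    For each document d in engine i's list, o(d) is the number of occurrences of d in
    the hit lists of all *other* engines; RC_{Q,i} is the sum of these counts.
    """
    scores = []
    for i, own in enumerate(hit_lists):
        others = [d for k, lst in enumerate(hit_lists) if k != i for d in lst]
        scores.append(sum(others.count(d) for d in own))
    return scores

# Example with three engines and m = 5
lists = [["a", "b", "c", "d", "e"],
         ["b", "c", "f", "g", "a"],
         ["c", "h", "a", "i", "j"]]
print(reference_counts(lists))                       # one RC value per engine for this query
```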

2.3.6 RP method

The RP method [24] can be used to compute the relative precision of a search engine compared to other (reference) search engine(s), without relevance judgements. Let q be a query, V the number of hits returned by the search engine under focus, and T the number of those hits, out of these V, that were ranked by at least one of the reference search engines within the first m of their hits. Then, RPq,m is calculated as follows:

RPq,m = T / V.   (2.4)

The value of relative precision should be computed for several queries, and an average should be taken. The steps for computing relative precision are as follows:

1. Select the search engine to be measured. Define queries qi, i = 1,...,n.

2. Define the value of m; typically m = 5 or m = 10.

3. Perform searches for every qi using the search engine as well as the reference search engine(s), i = 1,...,n.

4. Compute relative precision for qi using eq. (2.4).

5. Compute average:

(1/n) · Σ_{i=1..n} RPqi,m.

In my experiments m = 5.
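
The following is a minimal Python sketch of eq. (2.4) and of the averaging in step 5; hit lists are again assumed to be lists of document identifiers (all names are illustrative):

```python
def relative_precision(own_hits, reference_lists, m=5):
    """RP_{q,m} for one query q.

    own_hits        -- hits returned by the search engine under focus (V = len(own_hits))
    reference_lists -- hit lists of the reference search engines
    A hit counts towards T if at least one reference engine ranked it within its first m hits.
    """
    top_m = set()
    for ref in reference_lists:
        top_m.update(ref[:m])
    t = sum(1 for d in own_hits if d in top_m)
    return t / len(own_hits) if own_hits else 0.0

def average_relative_precision(per_query_results, m=5):
    """Average RP_{q_i,m} over the n queries; each entry is (own_hits, reference_lists)."""
    values = [relative_precision(own, refs, m) for own, refs in per_query_results]
    return sum(values) / len(values) if values else 0.0
```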

3 A measure theoretic approach to information