
5.3 The semantic profiles

5.3.1 Constructing the semantic profile for each node

Since we plan to increase the effectiveness of the information retrieval system with the help of semantic information, we need a tool with which the nodes can categorize the stored documents. In the absence of dynamic information, this categorization can be used as initial data to describe the user's fields of interest. Moreover, as the nodes should send the maximum number of known documents in each topic, these values should also be stored.

Definition 5.1 (Synset). A set of one or more concepts that are synonyms.

Definition 5.2 (Semantic profile). The semantic profile is a weighted taxonomy tree whose nodes represent the different synsets describing the stored documents, along with the number of occurrences of the given synset ($D_t$).

The whole structure of a taxonomy is organized according to the "is a" relation; thus, the child nodes are hyponyms and the parent node is a hypernym of the given synset. In our research, we used the generally accepted and up-to-date WordNet taxonomy to reveal the relationships of the stored keywords. WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory [Fellbaum, 1998]. Different research projects in this field prefer using the various versions of WordNet; therefore, we also utilize this taxonomy throughout the chapter. However, any other taxonomy is suitable for this task, such as the ID3 genre meta tag in MP3 music files or the taxonomy of [Haase et al., 2004a].

The stored documents contain enough information for the peers to compose their own semantic profiles. There are several methods to acquire even more complex ontologies from documents ([Assadi, 1998], [Kietz et al., 2000], [IBM UIMA, 2005]); therefore, the user does not need to give the keywords for each document manually, and the process can be fully automated. Alternatively, the metadata may be stored with the documents, such as ID3 tags for music files or Dublin Core metadata [Dublin Core, 2006].

When a new document is stored by a node, it updates its semantic profile by increasing a counter for each keyword’s synset and all of its hypernyms. This is quite different from the related systems described in Subsection 3.3.2, because the semantic profile will contain information about concepts that are not directly gathered from the metadata of the documents, but deduced from them.

As an example, if a document has the word 'cat' as related metadata (keyword), the system stores this information, because cats are probably among the fields of interest of the given user. Unlike the existing systems, our algorithm searches for all the hypernyms of 'cat' in the taxonomy and also marks them as concepts in the fields of interest of the given user. In our specific example, these synsets are 'feline, felid', 'carnivore', 'placental, placental mammal', 'mammal', 'vertebrate, craniate', 'chordate', 'animal', 'living thing', 'object', 'entity', moving from the more specific concepts to the more general ones, where 'entity' is the most general noun in the taxonomy, that is, the root synset.

When we add a new document with the keyword 'dog', the hypernym structure is very similar; the two concepts branch only at the 10th level. We show this part of the taxonomy tree in Figure 5.1.
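The update rule described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in this work: the `HYPERNYM` table is a hand-written, single-parent excerpt of the taxonomy, the 'canine, canid' synset for 'dog' is an assumption added for the example, and the names `hypernym_path` and `add_keyword` are hypothetical.

```python
from collections import Counter

# Toy excerpt of the taxonomy: each synset maps to its direct hypernym.
# The 'cat' chain follows the example in the text; the 'canine, canid'
# synset for 'dog' is an assumption made for this illustration.
HYPERNYM = {
    "cat": "feline, felid",
    "dog": "canine, canid",
    "feline, felid": "carnivore",
    "canine, canid": "carnivore",
    "carnivore": "placental, placental mammal",
    "placental, placental mammal": "mammal",
    "mammal": "vertebrate, craniate",
    "vertebrate, craniate": "chordate",
    "chordate": "animal",
    "animal": "living thing",
    "living thing": "object",
    "object": "entity",
}


def hypernym_path(synset):
    """Return the synset followed by all of its hypernyms up to the root."""
    path = [synset]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def add_keyword(profile, keyword):
    """Increase the counter of the keyword's synset and of every hypernym."""
    for synset in hypernym_path(keyword):
        profile[synset] += 1


profile = Counter()
add_keyword(profile, "cat")
add_keyword(profile, "dog")

# The two chains share every synset from 'carnivore' up to 'entity'.
print(profile["feline, felid"], profile["carnivore"], profile["entity"])
```

After inserting 'cat' and 'dog', the shared hypernyms carry a counter of 2, while 'feline, felid' and 'canine, canid' each carry 1, which is the branching point discussed above.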

With this approach, the nodes of the semantic profile contain the approximate weight of each topic in the user's fields of interest. The explanation is as follows. The numbers next to the concepts in Figure 5.1 are counters that show how many times a given concept occurs as a keyword or as a generalization (hypernym) of a keyword. The sum of these counters at each level has to be less than or equal to the number of all keywords inserted in the semantic profile:

Chapter 5. The Semantic Profiles 58

Figure 5.1. A taxonomy part

$$N_k \geq \sum_{i=1}^{N_{t^l}} C_{t_i^l}, \tag{5.3}$$

where $N_k$ is the number of inserted keywords, $N_{t^l}$ is the number of synsets at the $l$th level of the semantic profile, and $C_{t_i^l}$ is the value of the occurrence counter for the $i$th synset at the $l$th level.

The sum of the counters at the $l$th level can be less than $N_k$, because one or more keywords may have been inserted at a higher (more general) level of the taxonomy, so that neither the keyword nor any of its hypernyms appears at the $l$th level. Let $\bar{N}_k^l$ stand for the number of these "missing", that is, too general, keywords at level $l$. In that case the following holds:

$$N_k = \sum_{i=1}^{N_{t^l}} C_{t_i^l} + \bar{N}_k^l. \tag{5.4}$$
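Formulas 5.3 and 5.4 can be checked numerically on a small example. The sketch below assumes a shortened, hypothetical single-parent taxonomy (the `PARENT` table) with the root 'entity' at level 1; the keyword list and the helper names are likewise illustrative only.

```python
from collections import Counter

# Shortened toy taxonomy (an assumption for illustration): child -> parent.
PARENT = {
    "animal": "entity",
    "mammal": "animal",
    "carnivore": "mammal",
    "feline": "carnivore",
    "canine": "carnivore",
}


def path_to_root(synset):
    """Return the synset followed by all of its hypernyms up to the root."""
    path = [synset]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def level(synset):
    """Level of a synset in the taxonomy, with the root at level 1."""
    return len(path_to_root(synset))


# 'mammal' is a general keyword: it is absent from levels 4 and 5.
keywords = ["feline", "canine", "feline", "mammal"]

counters = Counter()
for kw in keywords:
    counters.update(path_to_root(kw))

n_k = len(keywords)
for l in range(1, 6):
    # Sum of the occurrence counters of all synsets at level l.
    level_sum = sum(c for s, c in counters.items() if level(s) == l)
    # Keywords inserted above level l, i.e. too general to appear there.
    missing = sum(1 for kw in keywords if level(kw) < l)
    assert level_sum <= n_k            # Formula 5.3
    assert level_sum + missing == n_k  # Formula 5.4
    print(l, level_sum, missing)
```

For this keyword list, levels 1 to 3 sum to 4 with no missing keywords, while levels 4 and 5 sum to 3 with one missing keyword ('mammal').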

If we suppose that the concepts extracted from the documents by any of the existing methods suitably describe the content of a document, the following statement is valid. If we randomly select a document from the store of the node, then the probability that it can be described by a selected synset is equal to the occurrence counter of this synset divided by the number of all the keywords inserted into the profile:

$$P_t = \frac{C_t}{N_k}, \tag{5.5}$$

where $P_t$ is the probability of finding a document that can be described by the synset $t$, and $C_t$ is the value of the occurrence counter for synset $t$. This is easy to see, because the synset's occurrence counter in the profile increases either when exactly that keyword is inserted (the obvious case) or when a keyword is inserted that is a hyponym of the concept in question (which follows from the definition of the hyponym/hypernym relation).
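As a small worked illustration of Formula 5.5, the following sketch computes the profile probability of a few synsets as exact fractions. The `PARENT` taxonomy, the keyword list, and the function names `build_profile` and `synset_probability` are assumptions made for this example only.

```python
from collections import Counter
from fractions import Fraction

# Shortened toy taxonomy (an assumption for illustration): child -> parent.
PARENT = {
    "animal": "entity",
    "mammal": "animal",
    "carnivore": "mammal",
    "feline": "carnivore",
    "canine": "carnivore",
}


def build_profile(keywords):
    """Count each keyword's synset and all of its hypernyms."""
    counters = Counter()
    for kw in keywords:
        synset = kw
        counters[synset] += 1
        while synset in PARENT:
            synset = PARENT[synset]
            counters[synset] += 1
    return counters


def synset_probability(counters, n_keywords, synset):
    """Formula 5.5: occurrence counter divided by the number of keywords."""
    return Fraction(counters[synset], n_keywords)


keywords = ["feline", "canine", "feline", "mammal"]
counters = build_profile(keywords)

for t in ("entity", "carnivore", "feline", "canine"):
    print(t, synset_probability(counters, len(keywords), t))
```

With these four keywords, 'entity' has probability 1, 'carnivore' 3/4, 'feline' 1/2, and 'canine' 1/4.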

From Formulas 5.4 and 5.5 it follows that one level of the taxonomy forms a probability space:

$$\sum_{i=1}^{N_{t^l}} P_{t_i^l} + \frac{\bar{N}_k^l}{N_k} = 1. \tag{5.6}$$
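The step from Formulas 5.4 and 5.5 to Formula 5.6 can be written out explicitly. The derivation below is a sketch that assumes at least one keyword has been inserted ($N_k > 0$): dividing Formula 5.4 by $N_k$ and substituting Formula 5.5 for each synset at level $l$ gives the result.

```latex
\begin{align*}
N_k &= \sum_{i=1}^{N_{t^l}} C_{t_i^l} + \bar{N}_k^l
      && \text{(Formula 5.4)} \\
1   &= \sum_{i=1}^{N_{t^l}} \frac{C_{t_i^l}}{N_k} + \frac{\bar{N}_k^l}{N_k}
      && \text{(divide by } N_k\text{)} \\
    &= \sum_{i=1}^{N_{t^l}} P_{t_i^l} + \frac{\bar{N}_k^l}{N_k}
      && \text{(Formula 5.5)}
\end{align*}
```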

Assuming that the stored documents reflect the user's fields of interest, this means that the semantic profile describes these fields of interest at a specific generalization level as accurately as the extracted concepts cover the content of the documents. Therefore, later on we often use the expression (the probability of) the semantic profile for a synset $t$ as a name for $P_t$.

Our next contribution is to select the node's neighbors in a more intelligent manner than Gnutella does, based on this semantic profile.

5.3.2 Constructing and maintaining the profile for the