4.4 Modeling the Networks

4.4.1 Homogeneous Links

Chapter 4. An Analytic Model for Peer-to-Peer Systems with

Semantic Overlay Network 39

$C_{mod1}$ is the most accurate form when we want to describe a known network from the individual nodes' $C_{mod1,r}$ values. This formula is very similar to the original clustering coefficient. However, the average $C_{mod2}$ value can be approximated in certain topologies. For this reason we use $C_{mod2}$ in the construction of the model.

variable with some distribution that represents the number of documents stored at a node, and $E(D_n)$ is its expected value. For model testing or network simulations in a mobile environment, selecting $D_n$ as a constant might be a good choice because of the limited storage capacity of such devices; however, [Kant and Iyer, 2003] report results that approximate this distribution in different contexts.

We suppose that the initiating links between the nodes are selected randomly with uniform distribution, as this is mostly the case with the bootstrapping agents (for example, webcaches).

We can approximate the probability of a successful query in case of Gnutella as shown below:

\[
P_{success,Gnutella} \le 1 - \left(1 - \frac{1}{D}\right)^{D_n E_q}. \tag{4.13}
\]

In this formula, we compute the probability of not finding the requested document at any reached node and subtract it from 1. We suppose that a node stores a specific document with probability $1/D$. The expression in parentheses is the probability that a document selected from all the documents in the network is not the requested one. Note that this formula and the next one assume that duplicates of a document can exist at a node. In most networks this is not the case; however, when the number of documents stored at each node is significantly lower than the number of relevant documents available in the network (that is, all documents, or the documents in a specific topic), the assumption does not noticeably influence the results, and it makes our model more compact.
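
The effect of allowing duplicates can be checked numerically for a single node. The following is a minimal sketch, assuming that the documents of a node are drawn uniformly from the D documents in the network; the function names and the sample values are illustrative and are not part of the model.

```python
def miss_with_duplicates(total_docs: int, docs_at_node: int) -> float:
    """Probability that none of the node's documents is the requested one,
    when duplicates are allowed (the simplification used in the model)."""
    return (1.0 - 1.0 / total_docs) ** docs_at_node


def miss_without_duplicates(total_docs: int, docs_at_node: int) -> float:
    """Same probability when the node stores distinct documents
    chosen uniformly from all documents (assumption of this sketch)."""
    return 1.0 - docs_at_node / total_docs


D = 10_000  # illustrative total number of documents

# Few documents per node: the two expressions are almost identical.
gap_small = abs(miss_with_duplicates(D, 50) - miss_without_duplicates(D, 50))

# A node storing half of all documents: the simplification is no longer harmless.
gap_large = abs(miss_with_duplicates(D, 5_000) - miss_without_duplicates(D, 5_000))

print(f"gap with 50 documents per node:   {gap_small:.6f}")
print(f"gap with 5000 documents per node: {gap_large:.6f}")
```

The gap stays small only while the number of documents per node is much lower than D, which is the condition stated above.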

If the standard deviation of the number of stored documents at each node is low (for example, due to low storage capacity), the following form of (4.13) is more usable.

\[
P_{success,Gnutella} \le 1 - \left(1 - \frac{1}{D}\right)^{E(D_n)\,E_q}. \tag{4.14}
\]
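
As an illustration, the bound (4.14) might be evaluated as in the following minimal sketch. The function `success_bound`, its parameter names, and the sample values are hypothetical and serve only to show the arithmetic.

```python
def success_bound(total_docs: int, expected_docs_per_node: float,
                  reached_nodes: float) -> float:
    """Upper bound of (4.14): 1 - (1 - 1/D) ** (E(D_n) * E_q)."""
    if total_docs < 1:
        raise ValueError("total_docs must be at least 1")
    # Probability that one stored document is not the requested one.
    miss_one = 1.0 - 1.0 / total_docs
    # Number of stored documents seen over all reached nodes.
    docs_seen = expected_docs_per_node * reached_nodes
    return 1.0 - miss_one ** docs_seen


# Illustrative values only: 10,000 documents, 50 per node, 100 reached nodes.
bound = success_bound(total_docs=10_000, expected_docs_per_node=50,
                      reached_nodes=100)
print(f"success bound: {bound:.4f}")
```

With these values the query sees 5,000 stored documents, half of D, yet the bound stays well below 0.5 because duplicates are counted.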

$E_q$, the number of reached nodes, can be calculated with the modified clustering coefficient from equation (4.10) as follows:

\[
E_q = \sum_{i=1}^{\mathit{TTL}} \left[ (1 - C_{mod2})\,k \right]^i. \tag{4.15}
\]
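
The sum in (4.15) can be evaluated directly. The sketch below assumes that k is the average number of connections per node; the function name `reached_nodes` and the sample values are illustrative.

```python
def reached_nodes(k: float, c_mod2: float, ttl: int) -> float:
    """E_q of (4.15): sum over i = 1..TTL of ((1 - C_mod2) * k) ** i."""
    productive = (1.0 - c_mod2) * k  # productive connections per node
    return sum(productive ** i for i in range(1, ttl + 1))


# Without clustering every link is productive: 4 + 16 + 64 nodes.
no_clustering = reached_nodes(k=4, c_mod2=0.0, ttl=3)

# If half of the links are counterproductive: 2 + 4 + 8 nodes.
half_clustering = reached_nodes(k=4, c_mod2=0.5, ttl=3)

print(no_clustering, half_clustering)
```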

Recall that $C_{mod2}$ is the modified clustering coefficient normalized for the propagation path with a given TTL value. With an accurate value of $C_{mod2}$, the number of reached nodes can therefore also be computed accurately. In the following calculations, we regard the basic Gnutella network as a random mesh. As we show later, the distribution of the number of in-edges in an unstructured network is better modeled by a Zipf distribution. However, our results in Chapter 6 show that this has no major influence on the calculation, so we use the uniform distribution here to obtain more compact expressions. The following value describes the probability that a link is counterproductive in the propagation path:

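\[
P \le \frac{\displaystyle \sum_{m=1}^{\mathit{TTL}} k^m \left( \sum_{n=0}^{m-1} k^n + k^m - 1 \right) + \sum_{m=1}^{\mathit{TTL}-1} k^m \left( k^{m+1} - k \right)}{\displaystyle V \sum_{i=0}^{\mathit{TTL}} k^i} \tag{4.16}
\]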

In this formula, we calculate the ratio of the number of possible counterproductive links in the propagation path of a node to the count of all the links in each node's independent propagation path. This ratio is the probability that a link in the network acts as a counterproductive link in a node's propagation path. It yields the average number of productive connections per node:

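\[
(1 - C_{mod2})\,k = k \left( 1 - \frac{\displaystyle \sum_{m=1}^{\mathit{TTL}} k^m \left( \sum_{n=0}^{m-1} k^n + k^m - 1 \right) + \sum_{m=1}^{\mathit{TTL}-1} k^m \left( k^{m+1} - k \right)}{\displaystyle V \sum_{i=0}^{\mathit{TTL}} k^i} \right) \tag{4.17}
\]

From (4.17), $C_{mod2}$ can be calculated with the following fraction: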

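\[
C_{mod2} = \frac{\displaystyle \sum_{m=1}^{\mathit{TTL}} k^m \left( \sum_{n=0}^{m-1} k^n + k^m - 1 \right) + \sum_{m=1}^{\mathit{TTL}-1} k^m \left( k^{m+1} - k \right)}{\displaystyle V \sum_{i=0}^{\mathit{TTL}} k^i}. \tag{4.18}
\]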
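
The fraction in (4.18) might be computed as in the following minimal sketch. It assumes one reading of the numerator: for each level m of the propagation path, the k^m nodes of that level can have counterproductive links to the nodes of earlier levels and to the other nodes of the same level, and, for m below TTL, to the nodes of the next level that are not their own children. The function name `c_mod2` and the sample values are illustrative.

```python
def c_mod2(k: int, ttl: int, num_nodes: int) -> float:
    """Modified clustering coefficient in a random mesh, following (4.18)."""
    numerator = 0
    for m in range(1, ttl + 1):
        earlier_levels = sum(k ** n for n in range(m))  # nodes on levels 0..m-1
        same_level = k ** m - 1                         # other nodes on level m
        numerator += k ** m * (earlier_levels + same_level)
    for m in range(1, ttl):
        # Nodes on level m+1 that are not children of the given node.
        numerator += k ** m * (k ** (m + 1) - k)
    # All links of the independent propagation paths of the V nodes.
    denominator = num_nodes * sum(k ** i for i in range(ttl + 1))
    return numerator / denominator


# Illustrative values: k = 2, TTL = 2, V = 1000.
# Numerator: 2*(1+1) + 4*(3+3) + 2*(4-2) = 32; denominator: 1000 * 7.
example = c_mod2(k=2, ttl=2, num_nodes=1000)
print(f"C_mod2 = {example:.6f}")
```

Because V appears only in the denominator, halving the number of nodes doubles the coefficient for fixed k and TTL.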

In the following, we modify (4.13) to compute the probability that we could achieve with an optimal semantic protocol in an ideal environment. We define a relaxed problem where, by neglecting some properties of the real environment, it is easier to achieve the optimal result.

Definition 4.16 (The IdealSON relaxed problem). The IdealSON protocol (or IdealSON problem) is a semantic layer over an unstructured P2P network that has the following characteristics:

a. The nodes in the network are fixed: there are no joins and leaves.

b. The documents stored by each node never change.

c. The connections of the nodes are fixed.

d. The semantic connections of the nodes point to nodes with similar fields of interest.

e. The graph components formed by the nodes with similar interest are coherent.

It is straightforward why the first three conditions result in an easier problem than our original problem statement. The significance of the fourth condition (together with the third) is that, in the IdealSON network, nodes with a given field of interest can never fail to discover and connect to similar nodes, since these connections are given by definition.

In the case of the IdealSON protocol, the aim of the nodes is to maximize the number of distinct documents reachable through a given semantic connection in the given topic. Since the network is static, maximizing the number of documents results in the highest recall value achievable with an unstructured, message-flooding protocol under a given classification of the topics.

In the next chapters, we analyze the local decisions of the nodes in different situations. Our methodology often regards the network as if it were constructed according to the IdealSON protocol and then increases its performance from the perspective of the analyzed node, for example by replacing its connections, even if this breaks the rules of the relaxed problem (in this case, rule c.). However, it is the duty of the semantic protocol to keep recall values as high as possible even if the conditions of IdealSON are not satisfied.

The benefit of the IdealSON protocol is that nodes with a given field of interest need to search only in a set of nodes that surely contain documents in the specified topic. The IdealSON protocol can be optimal if it ensures that each node can reach the maximum number of distinct documents in its fields of interest.

In the ideal and homogeneous situation, the nodes are connected only to other nodes that have the same field of interest. Therefore, a search is performed only in the set of documents related to a single topic. It is the duty of the protocol to maximize the probability of a successful query. As one can see in the next chapter, the IdealSON protocol should construct an overlay to achieve this optimal situation in a distributed manner. The approximate probability of a successful query under these circumstances is as follows:

\[
P_{success,IdealSON,t} \le 1 - \left(1 - \frac{1}{D_t}\right)^{D_n E_q}. \tag{4.19}
\]

In this formula, $D_t$ is the number of documents in the topic $t$, and we suppose that each document can be found with probability $1/D_t$.

The upper limit is calculated similarly to the case of Gnutella:

\[
P_{success,IdealSON,t} \le 1 - \left(1 - \frac{1}{D_t}\right)^{E(D_n)\,E_q}. \tag{4.20}
\]
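
To illustrate the difference between (4.14) and (4.20), the sketch below evaluates both bounds with the same number of reached nodes, so that only the size of the searched document set changes from D to D_t. The function name and all numbers are hypothetical and are not results of the model.

```python
def success_bound(docs_in_scope: int, expected_docs_per_node: float,
                  reached_nodes: float) -> float:
    """1 - (1 - 1/docs_in_scope) ** (E(D_n) * E_q), as in (4.14) and (4.20)."""
    miss_one = 1.0 - 1.0 / docs_in_scope
    return 1.0 - miss_one ** (expected_docs_per_node * reached_nodes)


E_DN = 50   # illustrative expected number of documents per node
E_Q = 100   # illustrative number of reached nodes, kept equal in both cases

# Gnutella: the query competes with all D documents in the network.
gnutella = success_bound(docs_in_scope=100_000, expected_docs_per_node=E_DN,
                         reached_nodes=E_Q)

# IdealSON: the query is confined to the D_t documents of topic t.
idealson = success_bound(docs_in_scope=5_000, expected_docs_per_node=E_DN,
                         reached_nodes=E_Q)

print(f"Gnutella bound: {gnutella:.4f}")
print(f"IdealSON bound: {idealson:.4f}")
```

The comparison holds the number of reached nodes fixed; in a random-mesh overlay that number itself changes with the clustering coefficient.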

It is important to note that if the topology of the overlay network is still a random mesh, the clustering coefficient also increases, because the number of nodes $V$ in the denominator of (4.18) decreases to $V_t$, the number of nodes that have the topic $t$ as a field of interest. This decreases the performance of the overlay protocol. Therefore, we might use a different topology for the overlay network. We propose a suitable topology that eliminates clustering in the query propagation graph for IdealSON in Section 6.2.
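
The size of this effect can be estimated numerically. The following minimal sketch, under the same random-mesh assumption, computes the clustering coefficient of (4.18) and the number of reached nodes of (4.15) for the full network and for a topic overlay. The numerator of (4.18) is read here as the number of possible links to earlier-level, same-level, and non-child next-level nodes; all names and values are illustrative.

```python
def c_mod2(k: int, ttl: int, num_nodes: int) -> float:
    """Clustering coefficient of (4.18) for a random mesh with num_nodes nodes."""
    levels = [k ** i for i in range(ttl + 1)]  # nodes per propagation level
    back = sum(levels[m] * (sum(levels[:m]) + levels[m] - 1)
               for m in range(1, ttl + 1))
    forward = sum(levels[m] * (levels[m + 1] - k) for m in range(1, ttl))
    return (back + forward) / (num_nodes * sum(levels))


def reached_nodes(k: int, clustering: float, ttl: int) -> float:
    """E_q of (4.15) for the given clustering coefficient."""
    productive = (1.0 - clustering) * k
    return sum(productive ** i for i in range(1, ttl + 1))


K, TTL = 4, 3
V, V_T = 100_000, 5_000  # illustrative sizes of the network and the overlay

c_full = c_mod2(K, TTL, V)
c_topic = c_mod2(K, TTL, V_T)
eq_full = reached_nodes(K, c_full, TTL)
eq_topic = reached_nodes(K, c_topic, TTL)

print(f"full network:  C_mod2 = {c_full:.5f}, E_q = {eq_full:.2f}")
print(f"topic overlay: C_mod2 = {c_topic:.5f}, E_q = {eq_topic:.2f}")
```

With these values the coefficient grows by the factor V / V_t, and the number of reached nodes drops accordingly.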