Budapest University of Technology and Economics
Department of Automation and Applied Informatics

ANALYSIS AND PERFORMANCE EVALUATION OF MOBILE RELATED SOCIAL NETWORKS

MOBIL KÉSZÜLÉKEKET TÁMOGATÓ KÖZÖSSÉGI HÁLÓZATOK TELJESÍTMÉNY- ÉS HATÉKONYSÁGVIZSGÁLATA

Ph.D. Thesis

Péter Ekler

Advisor: Hassan Charaf Ph.D.

September 2010

Internet-based solutions have developed rapidly in the last decade, and online social networks are among the most popular of them. The capabilities of mobile phones make it possible to reach social networks via different interfaces. However, it remains an open question how to involve these devices efficiently in the social network architecture.

We should recognize that the phonebook of a mobile device describes the social relationships of its owner. Based on that, relationships can be discovered automatically. Content distribution is becoming more and more important in social networks. BitTorrent is one of the most efficient peer-to-peer content distribution solutions. Involving mobile phones in such a network is challenging; however, it has great benefits, e.g. when such an architecture is applied in mobile related social networks. This thesis addresses these issues and proposes novel and efficient approaches that take the mobile environment into account. The results can be divided into three main parts.

The first result group proposes a model for mobile related social networks. Based on that model, an algorithm for detecting and handling similarities between the social network and the phonebooks is proposed. We show the resource requirements of similarity handling with analytical models and measurements.

The second contribution focuses on the performance and scalability of mobile related social networks. The number of identity links is a key measure, since these links determine the synchronizations between phonebooks and the social network. We propose a model to estimate the total number of identities in the network and we prove the accuracy of the model. The proof technique can also be applied in general to similar probability distributions. We verify the model with measurements based on a real database.

The third result group examines peer-to-peer file sharing on mobile devices. The analysis focuses on the BitTorrent technology. For the research, the MobTorrent software package was created. MobTorrent supports feature phones with limited resources. In this result group we introduce the algorithm, give a model to estimate the energy consumption and show that the proposed mobile BitTorrent engine adapts to the capabilities of the mobile devices. Furthermore, we show with measurements that the proposed solution operates well in real BitTorrent networks, while memory and energy consumption do not increase during operation. Finally, a hybrid content sharing architecture is proposed which supports mobile devices as well. The goal of the solution is to decrease the load of the central server. We provide an analytical model to prove the advantages of the proposed architecture.

The results of the theses were also put into practice: they were applied by Nokia Siemens Networks and Nokia Research Center.

In recent decades, Internet-based solutions have developed rapidly. Social networks are among the most popular of these solutions. The capabilities of mobile devices make it possible to access social networks in different ways. However, it is an open question how these devices can be involved efficiently in the operation of social networks.

It is important to recognize that the phonebooks of mobile devices contain the personal relationships of their owners. Based on this, systems can be built in which acquaintance relationships are discovered automatically from the phonebooks. Nowadays BitTorrent is one of the most efficient content sharing technologies. Involving mobile devices in peer-to-peer networks poses numerous challenges; at the same time, the solution would bring significant benefits, for example for content sharing in social networks. The thesis examines these questions and proposes new, efficient solutions.

The results can be divided into three main thesis groups.

The first thesis group gives a model for describing social networks that support mobile devices. Based on the model, a similarity detecting and describing algorithm is presented, which is able to discover the connections between the network and the phonebooks. We present the resource requirements of the algorithm and verify the correctness of its operation with measurements.

The second thesis group examines the scalability of mobile based social networks. The number of identity edges is a key measure in such networks, since it determines the synchronizations required between the network and the mobile phones. We give a model for estimating the number of identities and prove the accuracy of the model. The results and the steps used in the proof can also be applied in general to similar probability distributions. We also support the results with measurements based on a real database.

The third thesis group examines the possibility of peer-to-peer content sharing that also supports mobile devices. The thesis group focuses on the BitTorrent technology. For the research, the MobTorrent software package was created, which also supports devices with limited resources. The thesis group presents the applied algorithm and gives a model for estimating the energy consumption. In addition, we show with measurements that the solution operates efficiently in a real BitTorrent network, while the memory requirement and the energy consumption do not increase during use. Finally, the thesis group describes a hybrid content sharing architecture that also supports mobile devices. The goal of the solution is to relieve the central server. To describe the efficiency of the architecture, we construct an analytical model with which we prove the decrease in load.

The results of the thesis were also used in practice by Nokia Siemens Networks and Nokia Research Center.

Acknowledgments

"It would appear that we have reached the limits of what it is possible to achieve with computer technology, although one should be careful with such statements, as they tend to sound pretty silly in 5 years."

(János Neumann)

I would like to express my thanks to my scientific advisor, Hassan Charaf, who has motivated and supported me in every manner. I am also grateful to Bertalan Forstner, who always helped me when I had questions. Imre Kelényi from the mobile team also helped my work in different ways. I would like to thank István Vajk and all of my colleagues at the Department of Automation and Applied Informatics for the truly inspiring environment and the large amount of joint work.

I would also like to express my gratitude to Balázs Bakos, Zoltán Ivánfi, Attila Kiss and Szabolcs Fodor from Nokia Siemens Networks and to Jukka Nurminen from Nokia Research Center for their invaluable support.

This thesis could not have been written without the support of my family. They always tolerated the hard work and helped me in every manner. Last but not least, I am thankful to my grandparents, who were always patient and supportive.

I am very grateful to all of them.

This work is connected to the scientific program of the "Development of quality-oriented and harmonized R+D+I strategy and functional model at BME" project. This project is supported by the New Hungary Development Plan (Project ID: TÁMOP-4.2.1/B-09/1/KMR-2010-0002).

Declaration

The content of this thesis is a product of the author’s original work except where explicitly stated otherwise.

Statement*

I, the undersigned Péter Ekler, declare that I prepared this doctoral thesis myself and that I used only the sources indicated in it. Every part that I have taken from another source, either verbatim or with identical content but rephrased, I have clearly marked by citing the source.

Budapest, September 2010

(Péter Ekler)

* The reviews and the minutes of the defense will later be available at the Dean's Office.

Contents

Acknowledgments
List of Figures
List of Tables

Chapter 1 Introduction
  1.1 Thesis Outline

Chapter 2 Motivations and Open Problems

Chapter 3 Related Work and Background Information
  3.1 Overview
  3.2 Social Networks
    3.2.1 Social Network Structure
    3.2.2 Performance and Functionality of Social Networks
  3.3 Information Retrieval and Similarity Detecting
    3.3.1 Recall and Relevance
    3.3.2 Retrieval Strategies and Similarity Detecting
  3.4 Analyzing Dynamically Evolving Large Networks
    3.4.1 Power Law Distribution
    3.4.2 Examples for Power Law Distribution
  3.5 Queuing Theory
    3.5.1 Evolution Equation
    3.5.2 Calculating Queue Length
  3.6 Peer-to-Peer Content Sharing
    3.6.1 Peer-to-Peer Systems
    3.6.2 The BitTorrent Protocol
    3.6.5 Fairness with Credit Based Extension

Chapter 4 Similarity Detecting and Handling in Mobile Related Social Networks
  4.1 Introduction
  4.2 Problem Statement
  4.3 Structure and Analysis of Similarities
    4.3.1 Structure of Mobile Related Social Networks
    4.3.2 Dealing with Similarities in Real Environment
    4.3.3 Similarity Detecting Algorithm
    4.3.4 Queuing Model
  4.4 Conclusions

Chapter 5 Scalability and Performance Planning in Mobile Related Social Networks
  5.1 Introduction
  5.3 Distribution Measurements in Mobile Related Social Networks
    5.3.1 In-Degree Distribution
    5.3.2 Out-Degree Distribution
    5.3.3 Distribution of Similarities
  5.4 Estimating the Total Number of Identities
  5.5 Accuracy of the Identity Model
    5.5.1 Calculating the Variance of Similarities
    5.5.2 Applying Central Limit Theorem for the Distribution of Similarities

Chapter 6 Efficient Content Sharing in Mobile Environment
  6.1 Introduction
  6.3 Limitations on Feature Phones
    6.3.1 Network Handling Limitations
    6.3.2 Processing Power Limitations
    6.3.3 File Handling Limitations
  6.4 BitTorrent in Mobile Environment
    6.4.1 Considering Feature Phone Limitations in BitTorrent Engine
    6.4.2 MobTorrent Engine and Parameters
    6.5.2 Memory Requirements
    6.5.3 Energy Consumption
  6.6 Content Distribution via Hybrid Peer-to-Peer Architecture
    6.6.1 Bringing Mobile BitTorrent into Content Sharing Systems
    6.6.2 Architecture Description
    6.6.3 Analysis of Peers
  6.7 Efficiency of the Hybrid Architecture
    6.7.1 Overhead Measurements
      6.7.1.1 Storage Overhead
      6.7.1.2 Network Overhead
    6.7.2 Benefits of the Solution
    6.7.3 Extension with Local Cooperation Support

Chapter 7 Practical Application of the Scientific Results

Chapter 8 Conclusions
  8.1 Summary
    8.1.1 Thesis I
    8.1.2 Thesis II
    8.1.3 Thesis III
  8.2 Future work

Bibliography

List of Figures

3.1 Social network structure
3.2 SMS based phonebook search mechanism
3.3 Google Scholar precision based on the number of search results
3.4 Document categories to an issued query
3.5 Power law distribution
3.6 Typical architecture of peer-to-peer and client-server systems
3.7 Internet traffic from the perspective of different protocols
3.8 BitTorrent protocol
3.9 Torrent file structure
4.1 Basic structure of mobile related social networks
4.2 Mobile related social networks with detected similarities
4.3 Final structure of mobile related social networks
4.4 Multiple similarities
4.5 Merge interface for similarity resolving
4.6 Size of phonebooks in Phonebookmark
4.7 Size of phonebooks with logarithmic scale
4.8 Queue length for similarity calculation with 2C ∗ N_M and 2.5C ∗ N_M processing rates
4.9 Normalized queue length for similarity calculation with 2C ∗ N_M and 2.5C ∗ N_M processing rates
5.1 Distribution of in-degree
5.2 Distribution of in-degree using log-log scale
5.3 Distribution of in-degree using log-log scale and logarithmic binning procedure
5.4 Distribution of out-degree
5.5 Distribution of out-degree using log-log scale
5.6 Distribution of out-degree using log-log scale and logarithmic binning procedure
5.7 Distribution of similarities
... ning procedure
5.10 Trend of identities and members
5.11 Average identities per user in Phonebookmark
5.12 Trend of private contacts
5.13 Staged estimation function
6.1 Download speed in real environment
6.2 Memory requirements in real environment
6.3 Energy consumption in real environment
6.4 Objectives of the central element in hybrid architecture
6.5 High level architecture
6.6 Relative cost of content sharing in hybrid architecture with different content sizes
6.7 Extending hybrid architecture with local cooperation support
6.8 MobTorrent engine class diagram

List of Tables

4.1 Conditional probabilities calculated during the operational period of Phonebookmark
4.2 Similar terms table
6.1 File read performance
6.2 Download speed measurement via 3G network
6.3 Download speed measurement via WLAN network
6.4 Size of torrent files
6.5 Network overhead in the proposed hybrid solution
6.6 Relative overheads

Chapter 1 Introduction

Nowadays social networking sites are among the most popular sites on the Internet. These systems attract millions of users and their functionality is growing rapidly. The basic idea behind such networks is that users can manage personal relationships online through a web user interface. Since their introduction, social network sites such as Facebook, Myspace and LinkedIn have attracted millions of users, many of whom have integrated these sites into their daily practices and even visit them multiple times a day. These sites become an environment created by the people who use them.

When talking about developing technologies and social networks, we should pay attention to mobile devices. The capabilities of mobile devices and mobile phones allow them to participate in rich Internet applications. Facebook statistics [Facebook statistics, 2010] show that more than 65 million active users currently access Facebook through their mobile devices. People who use Facebook on their mobile devices are almost 50% more active on Facebook than non-mobile users. Mobile phone support in general social networks is usually limited mainly to photo and video upload and to accessing the social network site through the mobile web browser. A new range of solutions, such as applications integrated into the mobile platforms, has appeared in the last two or three years.

In this research we investigated how to bring online and mobile relationships closer together. For this, we should consider the fact that the phonebook of a mobile device describes the social relationships of its owner. Given a mechanism that allows the system to detect that some of the private contacts in a member's phonebook are similar to other registered members of the social network (i.e. may identify the same person), the system can discover social relationships automatically. In addition, a synchronization mechanism keeps phonebooks up to date with the information provided by the social network members.

Discovering relationships in social networks is beneficial both for users and the service provider, since this is one of the key characteristics of the network. In order to enable such discovery, an efficient and narrow similarity detecting and handling mechanism is needed. When similarities are accepted, identity links represent the connection between phonebooks and the social network. Therefore the performance and scalability of such social networks mainly depend on the number of identity links, since they indicate the number of required synchronizations.

Content sharing is a common feature of social networks nowadays. Members are able to share photos, photo albums or even videos with their friends. In mobile related social networks, mobile phones are involved as individual entities, but how can we involve them in content sharing efficiently if the goal is also to decrease the load of the server?

Another objective of this research is to investigate the role of mobile devices in content sharing. One of the most efficient content sharing solutions nowadays is the peer-to-peer (P2P) based BitTorrent technology. Involving mobile phones in such large networks is challenging if we consider the limitations of mobile phones. Previously we did not apply any constraints to the mobile phones when we considered involving them in the social network. In this investigation our goal is also to support as many mobile devices as possible, even those with different platforms and limited resources.

The result of this research can also be considered a proof of concept that mobile phones, even with limited resources, have an important role in social networks and are also able to participate in large peer-to-peer networks like BitTorrent.

The novel results of this research are divided into the following three theses.

Thesis I considers social networks as graphs and gives a model for mobile related social networks with node type and edge rule definitions. Next, it proposes a mechanism which automatically discovers similarities between members of the network and phonebook entries. Then it describes the learning based similarity handling algorithm and proves that the learning methods do not decrease the performance. Finally, it shows that the distribution of phonebook sizes is exponential and gives a queuing based analytical model for the performance of the algorithm. Based on that, it shows the stability requirement and the expected length of the similarity processing queue. The results are demonstrated with measurements.

Thesis II shows with measurements that mobile related social networks have the same characteristics as general social networks with respect to in- and out-degree distribution. After that, focusing on scalability and performance, it gives a measurement method for estimating the distribution of similarities and shows that it follows a power law distribution. It gives an analytical model to calculate the expected number of identities in the network. Next, it gives a general model for calculating the variance of a power law distribution in the special case when an upper bound exists. Finally, this model is used to prove the accuracy of the identity model.
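Why an upper bound matters for the variance can be illustrated numerically. The following is a minimal sketch, assuming a discrete power law of the form P(k) proportional to k^(-gamma) on k = 1..k_max; the functional form, the function name and the parameter values are illustrative assumptions, not the model or the data of this thesis. For an unbounded distribution of this form the second moment is finite only when gamma > 3, whereas with an upper bound the variance is always finite.

```python
def truncated_power_law_moments(gamma: float, k_max: int):
    """Mean and variance of P(k) = k**(-gamma) / Z for k = 1..k_max.

    Hypothetical helper for illustration; gamma and k_max are assumptions.
    """
    ks = range(1, k_max + 1)
    weights = [k ** -gamma for k in ks]
    z = sum(weights)  # normalizing constant Z
    mean = sum(k * w for k, w in zip(ks, weights)) / z
    second_moment = sum(k * k * w for k, w in zip(ks, weights)) / z
    return mean, second_moment - mean ** 2


if __name__ == "__main__":
    # With gamma = 2.5 the variance keeps growing as the bound is raised,
    # but it stays finite for every finite bound.
    for bound in (100, 1000, 10000):
        mean, var = truncated_power_law_moments(2.5, bound)
        print(bound, round(mean, 3), round(var, 3))
```

In this sketch the bound plays the role of a natural limit such as the maximum number of entries a phonebook can hold, which is again only an assumption made for the example.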

Thesis III focuses on peer-to-peer based content distribution considering mobile phones. The BitTorrent protocol was chosen for content distribution because of its popularity and efficiency. First, the thesis introduces our complete BitTorrent software package and implementation, which runs on feature phones with limited resources and is able to participate in any BitTorrent network. It describes the algorithm which was applied in a real environment and shows how it adapts to the capabilities of the mobile devices. Next, it gives a model to estimate the energy consumption. Then, it shows measurements related to resource requirements (memory requirements, power consumption, etc.) and performance (download speed). It shows that the proposed mobile BitTorrent engine is general and is applicable on other platforms as well. The measurements are authoritative for any mobile based peer-to-peer solution. After that, an efficient hybrid content distribution architecture is proposed which supports mobile devices as well. The goal of the solution is to decrease the load of the server. The analysis shows that the proposed architecture is applicable in social networks and proves that it decreases the load of the server. However, the architecture is still operable in the case of zero share ratio because of the included backend functionality. The thesis proposes a model to calculate the cost of the solution and compares it with the general client-server based architecture, considering a lower seed ratio for mobile peers. It proves that the efficiency of the solution increases as the share ratio increases.

The theses also produced practical results, which were applied at Nokia Siemens Networks and Nokia Research Center. The similarity detecting and handling proposed in Thesis I and the identity calculation of Thesis II were applied in the Phonebookmark project. Phonebookmark is a mobile related social network which supports different types of mobile clients as well. Before its public introduction, Phonebookmark was available to a group of users from April to December 2008. During this period we examined the operation of the system and collected and measured different types of anonymous data related to the behavior of the social network. The mobile BitTorrent software package for feature phones, introduced in Thesis III, was a main part of the peer-to-peer research project at Nokia Research Center and it was used in several additional measurements as well. Finally, a hybrid peer-to-peer content distribution architecture was implemented in the Swarm project, where the proposed cost model of Thesis III was applied. We would like to emphasize that, based on feedback, both the similarity detecting algorithm and the mobile BitTorrent solution operate well in a real environment.

1.1 Thesis Outline

The organization of this thesis is as follows.

• Chapter 2 introduces the motivations of this research. It shows the problem statements and enumerates the open issues.

• Chapter 3 presents the related work and the applied background knowledge. It also highlights open problems with references.

• Chapter 4 presents the model of mobile related social networks and efficient similarity detecting and handling. It also contains a model for the stability requirement in case of similarity handling. Measurements in this chapter were made by using a real database.

• Chapter 5 is devoted to the performance and scalability of mobile related social networks. Measurements from a real database verify that these networks share characteristics with general social networks, which also supports the subsequent analysis. The proposed mathematical model is also applicable to similar distributions.

• Chapter 6 presents the algorithm of the mobile BitTorrent solution, together with design considerations and measurements related to performance, memory and energy consumption. The chapter then describes the architecture of a general hybrid peer-to-peer content sharing solution and highlights that it is applicable even in mobile related social networks. An analytical model for cost efficiency is also proposed in this chapter.

• Chapter 7 highlights the practical results of the research and the solutions in which they were actually used.

• Chapter 8 summarizes the results and proposes future work related to mobile based social networks and peer-to-peer solutions.


Chapter 2 Motivations and Open Problems

As social networks developed rapidly, more and more mobile clients appeared. However, the role of mobile devices in social networks is still an open question. We should consider the fact that the phonebook of a mobile device describes the social relationships of its owner. Based on this observation, the social networking system can detect that some of the private contacts in a member's phonebook are similar to other registered users of the social network (i.e. may identify the same person). Such similarities accepted by users are called identities. With the help of these identities, the system can keep the phonebooks up to date with information provided by the users.
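The flow from a detected similarity to an accepted identity and a subsequent synchronization can be sketched as follows. This is a minimal illustration only: matching on normalized phone numbers is an assumed stand-in for the similarity detecting algorithm presented in Chapter 4, and all names (`Contact`, `detect_similarities`, `accept`, `synchronize`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    owner: str  # member who owns the phonebook
    name: str   # name as stored in the phonebook
    phone: str  # phone number as stored in the phonebook


def normalize_phone(phone: str) -> str:
    """Keep digits only, so differently formatted numbers compare equal."""
    return "".join(ch for ch in phone if ch.isdigit())


def detect_similarities(contacts, members):
    """Return (contact, member_id) pairs that may identify the same person.

    `members` maps a member id to the phone number in that member's profile.
    Assumption: equal normalized phone numbers indicate a similarity.
    """
    by_phone = {normalize_phone(p): m for m, p in members.items()}
    found = []
    for contact in contacts:
        member = by_phone.get(normalize_phone(contact.phone))
        if member is not None and member != contact.owner:
            found.append((contact, member))
    return found


def accept(similarity, identities):
    """The phonebook owner confirms a similarity, which makes it an identity."""
    identities.add(similarity)


def synchronize(identities, members):
    """Yield refreshed contacts that carry the member's current phone number."""
    for contact, member in identities:
        yield Contact(contact.owner, contact.name, members[member])
```

The sketch keeps detection and acceptance separate on purpose: a similarity is only a candidate, and it affects synchronization only after the user has accepted it as an identity.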

These types of social networks, however, raise interesting research problems. First, their operation requires an efficient similarity detecting and handling algorithm which makes it possible to detect possible relations between users of the network and phonebook contacts. Second, the performance of such networks should be investigated from two different perspectives: (1) the performance of the similarity detecting algorithm and, (2) after similarity resolution, reliable synchronization between phonebooks and profiles in the system.

Content sharing is a common feature of social networks nowadays. Members are able to share photos, photo albums or even videos with their friends. In the mobile related social networks described above, mobile phones are involved as individual entities, but how can we involve them in content sharing efficiently if the goal is also to decrease the load of the server? One of the most efficient content sharing solutions nowadays is BitTorrent. Previously it was shown that smartphones can participate in P2P networks like Gnutella or BitTorrent [I. Kelényi, 2007]. However, these solutions do not consider feature phones. The term feature phone is applied to any mobile phone that is not a smartphone or PDA phone. These devices usually have limited memory capacity, processing power and network capabilities.

In conclusion, we have defined the following open problems:

• What does the exact structure of mobile related social networks look like, and what edge rules should be defined? What are the resource requirements of the user initiated operations from the perspective of similarities? How should multiple similarities be handled?

• How should the similarity detecting and handling algorithm operate? What is the accuracy of the algorithm? How can we model the efficiency, performance and resource requirements of this algorithm?

• What are the most important characteristics of mobile related social networks, and which properties influence performance and scalability?

• How can the expected number of identities be calculated, and what is the accuracy of the model?

• Are feature phones with limited resources able to participate in large peer-to-peer networks like BitTorrent? What are the characteristics of such solutions?

• How can we create an efficient content distribution system for mobile related social networks? What are the benefits of such solutions?


Chapter 3 Related Work and Background Information

3.1 Overview

This chapter gives a brief overview of related work in the field of social networks and peer-to-peer systems. We discuss the structure of social networks and how important it is to consider resource requirements in such networks. Then the power law distribution is introduced, which occurs quite often in social networks and peer-to-peer systems. We also show several examples where such a distribution was discovered. After that we give an introduction to the BitTorrent protocol, which is considered one of the most efficient content sharing solutions nowadays. At the end of this chapter we give an overview of how BitTorrent can be applied in social networks.

3.2 Social Networks

3.2.1 Social Network Structure

In the last decade Internet related technologies developed rapidly. As a result of this growth, new types of solutions and applications have appeared. Among the most popular of these are social network sites (SNSs).

In [Boyd and Ellison, 2007] the authors have defined social network sites as web-based services that allow individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system. The structure of such social networks can be described with a graph:

$$G_{SNS} = (U_U, E_{UU})$$

$$E_{UU} \subseteq \{(u_U, u'_U) : u_U, u'_U \in U_U,\ u_U \neq u'_U\},$$

where $U_U$ is the set of users of the system and $E_{UU}$ is the set of edges representing relationships between users. The graph structure of such social networks is illustrated in Figure 3.1.

Figure 3.1. Social network structure
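The graph definition above can be sketched as a small data structure. This is a minimal illustration, assuming that users are identified by strings and that edges are stored as ordered pairs exactly as written in the definition; whether relationships are symmetric is not specified at this point, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SocialNetwork:
    """Graph of users (the set U_U) and relationships (the set E_UU)."""

    users: set = field(default_factory=set)
    # Each edge is an ordered pair (u, v) of distinct users.
    edges: set = field(default_factory=set)

    def add_user(self, user: str) -> None:
        self.users.add(user)

    def add_relationship(self, u: str, v: str) -> None:
        # Enforce the edge rule: both endpoints are users and u != v.
        if u == v:
            raise ValueError("a user cannot be related to itself")
        if u not in self.users or v not in self.users:
            raise KeyError("both endpoints must be users of the system")
        self.edges.add((u, v))

    def connections_of(self, user: str) -> set:
        """Users v for which the pair (user, v) is an edge."""
        return {v for (u, v) in self.edges if u == user}
```

Storing ordered pairs keeps the sketch faithful to the notation; a site with mutual friendships could add both (u, v) and (v, u) for each relationship.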

The nature and nomenclature of these connections may vary from site to site. The basic idea behind such networks is that users can manage personal relationships online through a web user interface. According to this definition, the first recognizable social network site, SixDegrees.com, was launched in 1997.

Profiles previously existed on most major community sites. AIM and ICQ buddy lists supported lists of Friends, although those Friends were not visible to others. Classmates.com allowed people to affiliate with their high school or college and surf the network for others who were also affiliated, but users could not create profiles or list friends until years later. SixDegrees was the first to combine these features.

After that, social networks developed rapidly and the number of features increased. Nowadays most sites support the maintenance of pre-existing social networks, but others help strangers connect based on shared interests, political views, or activities. Some sites cater to diverse audiences, while others attract people based on common language or shared racial, sexual, religious, or nationality-based identities. Sites also vary in the extent to which they incorporate new information and communication tools, such as mobile connectivity, blogging, and photo/video-sharing.

Nowadays there are even different types of techniques to develop applications on top of a social network that basically consider the network as a framework. One of the most popular techniques from this perspective is OpenSocial.

"Unlike the physical world where social ecosystems are formed from the integrated and managed relationships between individuals and organizations, the online digital world consists of many independent, isolated and incompatible social networks established by organizations that have overlapping and manually managed relationships. To bring the online digital world in-line with the physical world, integration of social networks, identification of overlapping relationships in social networks, and automation of relationship management in social networks are required. OpenSocial is a framework that enables social networks to interlink and self-organize into a social ecosystem guided by the policies of individuals and organizations." ([Julianna et al., 2007])

3.2.2 Performance and Functionality of Social Networks

As the functionality of SNSs expanded, the number of users increased rapidly. Handling the growing number of users efficiently is a key issue for SNSs, as the case of Friendster showed. Friendster was launched in 2002 as a social complement to Ryze. It was designed to help friends-of-friends meet. As Friendster’s popularity surged, the site encountered technical and social difficulties. Friendster’s servers and databases were ill-equipped to handle its rapid growth, and the site faltered regularly, frustrating users who had replaced email with Friendster.

Since their introduction, social network sites such as Facebook, Myspace and LinkedIn have attracted millions of users, many of whom have integrated these sites into their daily practices and they even visit these multiple times a day.

The sites become an environment, created by the people who are using them.


These popular online social networks are among the top ten visited websites on the Internet [Top Sites, 2010].

Facebook was launched in early 2004. According to recent statistics [Facebook statistics, 2010], Facebook has more than 400 million active users, 50% of the active users log on to Facebook on any given day, more than 35 million users update their status each day, and an average user spends more than 55 minutes per day on Facebook. These examples show that popular social networks can grow enormously, which has to be considered during the design of any SNS.

Newman et al. [Newman et al., 2002] describe some novel uniquely solvable models of the structure of social networks based on random graphs with arbitrary degree distributions. They give models both for simple unipartite networks such as acquaintance networks and bipartite networks such as affiliation networks. They compare the predictions of their models to data from a number of real-world social networks and find that in some cases, the models show high correlation with the data, whereas in others the correlation is lower, perhaps indicating the presence of additional social structure in the network that is not captured by the random graph.

Bakos et al. [Bakos et al., 2005] have used the phonebook of mobile phones for a socially relevant search. They found that search engines generally lack the trust and level of personalization needed for recommendation systems to answer searches like: I need a reliable plumber close to my house. In order to achieve personalization, social relevance and an acceptable level of privacy, the search database itself needs to be personalized. One possible dimension of personalization is the social neighborhood of the searcher. In particular, phonebook links represent a readily available infrastructure for such search. They have demonstrated their concept via a novel search engine algorithm for social networks that operates on S60 and uses SMS messages to communicate.

This search mechanism is shown in more detail in Figure 3.2. An example query could be: search for a person whose job is plumber and whose address contains the string Budapest. The user enters the desired query on his phone (step 1). In the next phase (step 2) the phone sends the query with the following parameters:


• to the predefined contacts (in the settings it could be set to "everybody", "predefined group", and "nobody")

• with a preset time to live (TTL) field, having as effect a larger or smaller propagation horizon in the social neighborhood

Figure 3.2. SMS based phonebook search mechanism

The smart phones receiving the query check a match with their profile (step 3). If there is a match, a query hit message is returned to the originator (step 5).

In all cases, if the TTL is not expired (value = 1), it is decremented and the query is forwarded to the contacts of this phone (step 4). The query will reach all the phones within the TTL range of contacts of the query initiator and all query hits will be returned to the originator.

It is important to note that the returned query hit (step 5) does not reveal which persons had the person matching the query as a contact, so the privacy of these people is not violated. It is also emphasized that this works automatically: no user intervention is required (on the other hand, the user might set the application to reply and forward only if he explicitly chooses to reply and/or forward).
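The propagation described above can be illustrated with a minimal sketch. It is not the implementation of [Bakos et al., 2005]: the SMS transport is replaced by an in-memory queue, the TTL is assumed to count hops, and the suppression of duplicate deliveries, the function name `propagate_query` and the sample data are assumptions made only for illustration.

```python
from collections import deque

def propagate_query(phonebooks, profiles, originator, matches, ttl):
    """Flood a query through phonebook links for at most `ttl` hops.

    phonebooks: dict mapping a phone to the list of its contacts
    profiles:   dict mapping a phone to its profile (a dict of fields)
    matches:    predicate applied to each receiving phone's profile
    Returns the set of phones that answered with a query hit.
    """
    hits = set()
    seen = {originator}  # assumption: a phone handles a query only once
    queue = deque((c, ttl) for c in phonebooks.get(originator, []))
    while queue:
        phone, remaining = queue.popleft()
        if phone in seen:
            continue
        seen.add(phone)
        if matches(profiles.get(phone, {})):
            # Only the hit itself goes back to the originator; the chain of
            # phones that forwarded the query is not recorded anywhere.
            hits.add(phone)
        if remaining > 1:  # TTL not expired: decrement and forward
            for contact in phonebooks.get(phone, []):
                queue.append((contact, remaining - 1))
    return hits

# Hypothetical sample data for illustration only.
phonebooks = {"alice": ["bob"], "bob": ["alice", "carol"],
              "carol": ["dave"], "dave": []}
profiles = {"bob": {"job": "teacher", "address": "Budapest"},
            "carol": {"job": "plumber", "address": "Budapest"},
            "dave": {"job": "plumber", "address": "Budapest"}}

def is_budapest_plumber(profile):
    return (profile.get("job") == "plumber"
            and "Budapest" in profile.get("address", ""))
```

With this data, a query from `alice` with a TTL of 2 reaches `bob` and `carol` only, while a TTL of 3 also reaches `dave`, which shows how the TTL controls the propagation horizon.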

In a social network nodes and links represent participants and their relationships, respectively. Tomiyasu et al. [Tomiyasu et al., 2006] have designed and implemented a query propagation mechanism and its applications to realize a social network composed of cellular phone users. In these applications, users can retrieve information on their friends or their friends’ friends by propagating the query in the network. To propagate a query over a wide range and improve the query success ratio, most users who receive the query must relay it to all their friends. However, this increases network traffic. In their paper they have proposed a query routing method that decreases the number of communication packets by using user profiles.

Watts et al. [Watts and Dodds, 2002] present a model that offers an explanation of social network searchability in terms of recognizable personal identities defined along a number of social dimensions. Their model defines a class of searchable networks and a method for searching them that may be applicable to many network search problems including the location of data files in peer-to-peer networks, pages on the World Wide Web, and information in distributed databases.

In [Eagle and Pentland, 2005] Eagle and Pentland propose the Serendipity system that senses a social environment and cues informal interactions between nearby users who might know each other. Their system uses Bluetooth addresses to detect and identify proximate people and matches them from a database of user profiles. They show how inferred information from the mobile phone can augment existing profiles, and they present a novel architecture for investigating face-to-face interaction designed to meet various levels of privacy requirements.

In [Forstner and Kelényi, 2009] the authors argue that social networking on mobile phones is not only a buzz term for today’s enthusiasts but also provides real possibilities to users.

Previously we investigated [Ekler and Charaf, 2008c] how mobile devices can connect to a web-based social network. We outlined an architecture where mobile devices connect to a social network via web services. Additionally, we demonstrated this solution with a web-based social network application that offered all of its main functionality to mobile clients through a web-service interface.

3.3 Information Retrieval and Similarity Detecting

In [Grossman and Frieder, 2004] the authors summarized information retrieval statistics and related algorithms. The following is cited from this book. The term information retrieval refers to a search that may cover any form of information: structured data, text, video, image, sound, musical scores, etc. Their work focused on ad hoc information retrieval problems. Ad hoc information retrieval allows users to search for documents that are relevant to user-provided queries. It may appear that systems such as Google have solved this problem, but the results can still be improved. According to [TRE, 2003] typical systems have an effectiveness (accuracy) of, at best, forty percent. A more recent measurement [Walters and H., 2009] showed that 55 percent of the first 20 records retrieved by Google Scholar (GS) are relevant. As shown in Figure 3.3, the precision of Google Scholar remains relatively high even after the first 50 hits. Within the first 100 search results, 39 percent of GS records are relevant.

Figure 3.3. Google Scholar precision based on the number of search results

Figure 3.3 also reveals that the utility of GS could be improved if relevant results were concentrated more heavily within the first 20 or 30 hits rather than the first 50 or 100.

3.3.1 Recall and Relevance

Information retrieval is devoted to finding relevant documents, not finding simple matches to patterns. Early evaluations of information retrieval systems found that they missed numerous relevant documents [Blair and Maron, 1985].

Moreover, users have become complacent in their expectation of accuracy of information retrieval systems [Gordon, 1997].

Nowadays information retrieval is developing rapidly: new methods are introduced and new solutions often exploit the particularities of the operational environment; however, the accuracy can still be increased. In [Buckley, 2009] information retrieval engines were examined from the failure point of view. The analysis showed that the top-retrieved documents do not reflect some aspect of the query at all. In document retrieval systems, relevance is provided by scoring the match between individual documents and the query. For example, considering the context of website advertisement, in [Chakrabarti et al., 2008] the authors show how the match can be improved significantly by augmenting the ad-page scoring function with extra parameters from a logistic regression model on the words in the pages and ads. A key property of their model is that it can be mapped to standard cosine similarity matching and is suitable for efficient and scalable implementation over inverted indexes. In the best case their model achieves a 25% lift in precision relative to a traditional information retrieval model.

Figure 3.4 illustrates in general the document categories that correspond to any issued query. Namely, in the collection there are documents which are retrieved and there are those documents that are relevant.

Figure 3.4. Document categories to an issued query

In a perfect system, these two sets would be equivalent; we would only retrieve relevant documents. In reality, systems retrieve many non-relevant documents. To measure effectiveness, two ratios are used: precision and recall.

Definition 3.1 (Precision). Precision is the ratio of the number of relevant documents retrieved to the total number retrieved.


$$\mathrm{Precision} = \frac{\mathrm{RelevantRetrieved}}{\mathrm{Retrieved}} \tag{3.1}$$

Precision provides an indication of the quality of the answer set. However, this does not consider the total number of relevant documents. A system might have good precision by retrieving ten documents and finding that nine are relevant (a 0.9 precision), but the total number of relevant documents also matters. If there were only nine relevant documents, the system would be a huge success; however if millions of documents were relevant and desired, this would not be a good result set.

Recall considers the total number of relevant documents.

Definition 3.2 (Recall). Recall is the ratio of the number of relevant documents retrieved to the total number of documents in the collection that are believed to be relevant.

$$\mathrm{Recall} = \frac{\mathrm{RelevantRetrieved}}{\mathrm{Relevant}} \tag{3.2}$$

Computing the total number of relevant documents is non-trivial. The only sure means of doing this is to read the entire document collection. A good survey of effectiveness measures as well as a brief overview of information retrieval is found in [Kantor, 1994].
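The two ratios of Definitions 3.1 and 3.2 can be computed as in the following minimal sketch. It assumes that the set of relevant documents is already known, which, as noted above, is the hard part in practice. The function name `precision_recall` and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one issued query.

    retrieved: identifiers of the documents returned by the system
    relevant:  identifiers of the documents judged relevant (assumed known)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = retrieved & relevant
    # Empty sets are mapped to 0.0 to avoid a division by zero.
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

# Ten retrieved documents, nine of which are relevant.
retrieved = [f"d{i}" for i in range(10)]
only_nine_relevant = [f"d{i}" for i in range(9)]
ninety_relevant = [f"d{i}" for i in range(9)] + [f"r{i}" for i in range(81)]
```

For the ten retrieved documents, the precision is 0.9 in both cases, but the recall is 1.0 when only nine relevant documents exist and 0.1 when ninety exist. This is why precision alone does not characterize the quality of a result set.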

3.3.2 Retrieval Strategies and Similarity Detecting

Retrieval strategies assign a measure of similarity between a query and a document.

These strategies are based on the common notion that the more terms are found in both the query and the document, the more relevant the document is. Some of these strategies employ countermeasures to alleviate problems that occur due to ambiguities inherent in language. The reality is that the same concept can often be described with many different terms (e.g. Joe, Joseph).

A retrieval strategy is an algorithm that takes a query Q and a set of documents D_1, D_2, ..., D_n and identifies the Similarity Coefficient SC(Q, D_i) for each of the documents 1 ≤ i ≤ n.

The retrieval strategies collected in [Grossman and Frieder, 2004] are:


• Vector Space Model: Both the query and each document are represented as vectors in the term space. A measure of the similarity between the two vectors is computed.

• Probabilistic Retrieval: A probability based on the likelihood that a term will appear in a relevant document is computed for each term in the collection. For terms that match between a query and a document, the similarity measure is computed as the combination of the probabilities of each of the matching terms.

• Language Models: A language model is built for each document, and the likelihood that the document will generate the query is computed.

• Inference Networks: A Bayesian network is used to infer the relevance of a document to a query. This is based on the evidence in a document that allows an inference to be made about the relevance of the document. The strength of this inference is used as the similarity coefficient.

• Boolean Indexing: A score is assigned such that an initial Boolean query results in a ranking. This is done by associating a weight with each query term so that this weight is used to compute the similarity coefficient.

• Latent Semantic Indexing: The occurrence of terms in documents is represented with a term-document matrix. The matrix is reduced via Singular Value Decomposition to filter out the noise found in a document so that two documents which have the same semantics are located close to one another in a multi-dimensional space.

• Neural Networks: A sequence of neurons, or nodes in networks, that fire when activated by a query triggering links to documents. The strength of each link in the network is transmitted to the document and collected to form a similarity coefficient between the query and the document. Networks are trained by adjusting the weights on links in response to predetermined relevant and irrelevant documents.

• Genetic Algorithms: An optimal query to find relevant documents can be generated by evolution. An initial query is used with either random or estimated term weights. New queries are generated by modifying these weights. A new query survives by being close to known relevant documents, and queries with less fitness are removed from subsequent generations.

• Fuzzy Set Retrieval: A document is mapped to a fuzzy set (a set that contains not only the elements but a number associated with each element that indicates the strength of membership). Boolean queries are mapped into fuzzy set intersection, union and complement operations that result in a strength of membership associated with each document that is relevant to the query. This strength is used as a similarity coefficient.
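As an illustration of the first strategy in the list, the following minimal sketch computes SC(Q, D_i) as the cosine of the angle between the query and document vectors. It assumes raw term frequencies as vector components and whitespace tokenization; practical systems typically use weighted terms. The names `similarity_coefficient` and `rank` are hypothetical.

```python
import math
from collections import Counter

def similarity_coefficient(query, document):
    """Cosine similarity between raw term-frequency vectors (an assumption)."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(q[term] * d[term] for term in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def rank(query, documents):
    """Order the documents D_1..D_n by decreasing SC(Q, D_i)."""
    return sorted(documents,
                  key=lambda doc: similarity_coefficient(query, doc),
                  reverse=True)
```

A document that shares no term with the query receives a coefficient of 0, while a document with exactly the same term frequencies as the query receives a coefficient of 1 (up to floating-point rounding).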

For a given retrieval strategy, many different utilities are employed to improve its results.

In Phonebookmark we needed a similarity detecting algorithm. Our goal was to find the relevant similarities, provide a proper ordering for users and design an efficient resolving mechanism. When detecting similar person structures, efficient handling of similar terms was also needed, which detects, for example, similar names like Sam and Samantha. Details are discussed in Chapter 4.

During the development period of Phonebookmark, we observed other phonebook related solutions on the web. Zyb [Zyb, 2010] and Plaxo [Plaxo, 2010] allow synchronizing with mobile phones and managing contacts using a web browser. Xing [Xing, 2010] also has mobile access, but focuses more on business relationships. Automatic similarity detection and efficient identity handling are missing from these systems, and there is no automatic notification like in Phonebookmark when one of a user’s phonebook contacts becomes (or already is) a user of the system. However, these networks are very similar to Phonebookmark in structure and functionality, which shows the spread of mobile based social networks.

The convergence between social networks and phonebooks is also noticeable in the new Android 2.0 platform [Android 2.0, 2010], which supports sync adapters that provide synchronization with additional data sources like social networks. In addition, there are already ongoing implementations for enabling synchronization between Android based phonebooks and Facebook. Similar approaches can be observed on other mobile platforms. For example, on the Maemo [Maemo Synchronization Support, 2010] platform we can see our contacts from Skype, GTalk, MSN, etc., but phonebook contacts have to be merged manually.

The existing solutions for synchronizing contact lists and social networks usually do not publish their algorithm for finding similar entries. In the [Cheng and Lin, 2006] patent, the authors proposed a method for synchronizing two sets of contact data of two storage media in a communications system. The method updates the content of the contact data of one storage medium with the content of the contact data of the other storage medium according to the corresponding mapping relations between the two contact data arrays and to the choice of an operator that can update one of a plurality of contact data with one or more of a plurality of contact data. However, this solution mainly considered only names and phone numbers.

There are active discussions about the fact that in many cases the synchronization algorithms between contact lists and social networks are not accurate. Usually the problem is that details are entered with different syntax in the systems, names are written differently, or there is simply not enough information to match contacts with full certainty. In addition, language differences are also a major problem.

3.4 Analyzing Dynamically Evolving Large Networks

A large number of papers and popular books, such as Barabási’s Linked [Barabási, 2002], study the structure and principles of dynamically evolving large scale networks like the Internet and networks of social interactions.

There has been a great deal of theoretical work on designing random graph models that result in a Web-like graph. Barabási and Albert [Barabási and Albert., 1999] describe the preferential attachment model, where the graph grows continuously by inserting nodes, and each new node establishes a link to an older node with a probability which is proportional to the current degree of the older node. Bollobás et al. [Bollobás et al., 2001] analyze this process rigorously and show that the degree distribution of the resulting graph follows a power law. Another model based on a local optimization process is described by Fabrikant et al. [Fabrikant et al., 2002].
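The growth rule of the preferential attachment model can be sketched as follows. This is a simplified illustration, not the exact construction analyzed in the cited works: it assumes that every new node creates exactly one link and that the graph starts from a single edge. The function names and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Grow a graph in which each new node links to one older node chosen
    with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]       # assumption: start from a single edge
    endpoints = [0, 1]     # every node appears once per incident edge
    for new_node in range(2, n_nodes):
        # A uniform choice from the endpoint list selects a node with
        # probability degree(node) / sum of all degrees.
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend((new_node, target))
    return edges

def degree_counts(edges):
    """Return a Counter mapping each node to its degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree
```

Plotting the histogram of `degree_counts(preferential_attachment(n)).values()` on a log-log scale for a large n is a simple way to inspect the heavy tail of the resulting degree distribution.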

Kumar et al. [Kumar et al., 2006] presented measurements related to online social networks. They analyzed data of five million people and ten million friendship links, annotated with metadata capturing the time of every event in the life of the network. Their measurements expose a surprising segmentation of these networks into three regions: singletons who do not participate in the network; isolated communities which overwhelmingly display star structure; and a giant component anchored by a well-connected core region which persists even in the absence of stars. They also presented a simple model of network growth which captures these aspects of component structure. Such models are important when analyzing the scalability of such networks. However, the structures of these social networks do not consider phonebook support of mobile devices.

3.4.1 Power Law Distribution

Following the terminology in [Fabrikant et al., 2002], a nonnegative random variable X is said to have a power law distribution if Pr[X ≥ x] ∼ cx^{−α}, for constants c > 0 and α > 0. In a power law distribution the tails asymptotically fall according to the power α, which leads to much heavier tails than other common models.

Mitzenmacher [Mitzenmacher, 2001] gives an excellent survey on the history and generative models for power law distributions.

A typical power law distribution is shown in Figure 3.5.
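The defining property can be checked numerically with a short sketch. It evaluates the tail Pr[X ≥ x] = cx^{−α} directly, treating the asymptotic relation as an equality for simplicity, and compares it with an exponential tail e^{−x}, chosen here only as an example of a lighter-tailed model. The function names are hypothetical.

```python
import math

def tail_probability(x, c, alpha):
    """Power law tail Pr[X >= x] = c * x**(-alpha), taken as exact here."""
    return c * x ** (-alpha)

def log_log_slope(x1, x2, c, alpha):
    """Slope of the tail between x1 and x2 on a log-log plot."""
    y1 = math.log(tail_probability(x1, c, alpha))
    y2 = math.log(tail_probability(x2, c, alpha))
    return (y2 - y1) / (math.log(x2) - math.log(x1))

# With c = 1 and alpha = 2 the power law tail at x = 100 is 1e-4, while an
# exponential tail exp(-x) at the same point is smaller than 1e-40.
power_tail = tail_probability(100, 1.0, 2.0)
exponential_tail = math.exp(-100)
```

On a log-log plot the tail is a straight line with slope −α, which is the usual visual test for power law behavior in measured degree distributions.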

3.4.2 Examples for Power Law Distribution

Distributions with an inverse polynomial tail were first observed in 1896 by Pareto [Pareto, 1896] (see [Ripeanu et al., 2002]) while describing the distribution of income in the population. Zipf in 1935 [Zipf, 1935] and Yule in 1944 [Yule, 1944] investigated word frequencies in languages; based on empirical studies, Zipf stated that the frequency of the n-th most frequent word is proportional to 1/n. Zipf also observed statistical behavior similar to that reported by Pareto in the distribution of inhabitants in cities [Zipf, 1949].


Figure 3.5. Power law distribution

Mislove et al. [Mislove et al., 2007] studied the graph properties of several online real-world social networks. They present a large-scale measurement study and analysis of the structure of multiple online social networks. They examined data gathered from four popular online social networks: Flickr, YouTube, LiveJournal, and Orkut. They crawled the publicly accessible user links on each site, obtaining a large portion of each social network’s graph. Their data set contains over 11.3 million users and 328 million links. Their measurements show that high link symmetry implies that in-degree equals out-degree; users tend to receive as many links as they give, and the observed networks are power-law with high symmetry.

In [Broder et al., 2000], the graph structure of the Web has been investigated and it was shown that the distribution of in- and out-degree of the web graph and the size of weakly and strongly connected components are well approximated by power law distributions. Nazir et al. [Nazir et al., ] showed that the in- and out-degree distribution of the interaction graph of the studied Myspace applications also follows such distributions. In [Ripeanu et al., 2002] the authors also discovered power law behavior in the degree distribution of the Gnutella network, and Crovella et al. [Crovella et al., 1998] observed similar distributions in the sizes of files and transmission times in the Internet.


3.5 Queuing Theory

Queuing theory is basically a model for waiting lines. Queuing theory enables the mathematical analysis of different processes, including arriving, waiting and being served in the queue. In general the theory can be used to calculate wait queue length and average wait time in the system.

In 1909 Agner Krarup Erlang [Erlang, 1909] applied the theory of probabilities to problems of telephone traffic. He proved that telephone calls distributed at random follow the Poisson law of distribution. A notation for describing the characteristics of a queuing model was first suggested by David G. Kendall in 1953 [Kendall, 1953]. Kendall introduced an A/B/C queuing notation that can be found in all standard modern works on queuing theory. Turning to computers and networks, Leonard Kleinrock published the theory of queuing systems [Kleinrock, 1975] in 1975.

3.5.1 Evolution Equation

By the term evolution equation we mean the following equation:

$$X_{n+1} = (X_n - V_{n+1})^{+} + Y_{n+1}, \tag{3.3}$$

where X_n is the length of the serving queue, V_n is the number of served requests in period n, and Y_n is the number of newly arrived requests in period n.

X_n is considered as a homogeneous Markov chain when the V_i are independent with a common distribution, the Y_i are independent with a common distribution, and V_n and Y_n are also independent.

By using Foster’s criteria [Foster, 1953] the following stability statement can be proven: if E[Y] < E[V] and p_ij > 0 (the transition probability, when |i − j| = 1), then X_n is stable.
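The recursion of Equation 3.3 can be traced with a minimal sketch, where the positive part (a)⁺ is computed as max(a, 0). The serving and arrival sequences are given as plain lists here, whereas in the model they are random variables. The function name `evolve_queue` and the sample values are hypothetical.

```python
def evolve_queue(x0, served, arrived):
    """Apply X_{n+1} = (X_n - V_{n+1})^+ + Y_{n+1} step by step.

    x0:      initial queue length X_0
    served:  list whose k-th element is V_{k+1}
    arrived: list whose k-th element is Y_{k+1}
    Returns the list [X_0, X_1, ..., X_n].
    """
    lengths = [x0]
    for v, y in zip(served, arrived):
        # The queue cannot become negative: at most X_n requests are served.
        lengths.append(max(lengths[-1] - v, 0) + y)
    return lengths

# Starting with 3 queued requests and a capacity of 2 per period.
trace = evolve_queue(3, served=[2, 2, 2], arrived=[1, 0, 3])
```

For the sample values the trace is [3, 2, 0, 3]: in the third period the serving capacity exceeds the queue length, so the queue is emptied before the three new requests arrive.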

3.5.2 Calculating Queue Length

The M/M/1 model [Kleinrock, 1975] shows the basic ideas and methods of Queuing Theory. In this case the arrival of requests is modeled as a Poisson process with rate λ, the serving time has an exponential distribution with parameter µ, and there is one serving unit.

Consider N(t) as the wait queue length at time t; N(t) can then be considered as a Markov chain. In the case of an infinite state space, the stability condition is:

$$\frac{\lambda}{\mu} < 1 \tag{3.4}$$

It was also proved by Kleinrock that the expected queue length can be calculated as:

$$E[N(t)] = \frac{\lambda}{\mu - \lambda} \tag{3.5}$$
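As a numerical illustration, the following sketch evaluates the standard steady-state M/M/1 result, in which the expected number of requests in the system is ρ/(1 − ρ) = λ/(µ − λ) with utilization ρ = λ/µ. It also enforces the stability condition λ/µ < 1. The function name `mm1_mean_queue_length` is hypothetical.

```python
def mm1_mean_queue_length(lam, mu):
    """Expected number of requests in a stable M/M/1 system.

    lam: arrival rate (lambda), mu: service rate, both per unit time.
    """
    if lam <= 0 or mu <= 0:
        raise ValueError("rates must be positive")
    rho = lam / mu  # utilization of the single serving unit
    if rho >= 1:
        raise ValueError("unstable queue: lambda / mu must be below 1")
    return rho / (1 - rho)  # equals lam / (mu - lam)
```

For example, with λ = 2 and µ = 4 requests per second the expected length is 1, and with λ = 3 it grows to 3. The length increases without bound as λ approaches µ.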

3.6 Peer-to-Peer Content Sharing

3.6.1 Peer-to-Peer Systems

The following definition is usually used for peer-to-peer systems: "A network service, where the entities are using the resources of each other (e.g.: network bandwidth, storage capacity, processing power) instead of a central component."

Peer-to-peer systems differ greatly from client-server architectures, where the service is provided by servers and clients only use those services. In a peer-to-peer solution, service provision is distributed among peers. This way, in many cases peers behave as clients and servers at the same time. Figure 3.6 shows the typical architecture of peer-to-peer and client-server systems.

Figure 3.6. Typical architecture of peer-to-peer and client-server systems

Nowadays peer-to-peer systems are used in different situations:


• Content sharing

• Audio- and video streaming

• Communication

• Distributed search

• Distributed computing

Peer-to-peer solutions are very efficient from the content distribution point of view (especially if the content is popular). Pure peer-to-peer solutions include Gnutella (LimeWire, BearShare), Napster, Kazaa. In these systems there is no need for large central servers; the participants (usually PCs connected through the Internet) transfer the content between one another. In these systems the infrastructure and operational costs (internet access, energy consumption) are implicitly divided among the end-users. Content search is also conducted in a distributed manner.

Peer-to-peer systems are often regarded as illegal services, because in many cases the technology is used to distribute protected content illegally. However, the technology has many advantages; therefore it has already been applied in several solutions, and in many cases the architecture is hidden from the user. The major advantages of the technology are scalability, fault tolerance and robustness. Figure 3.7 shows that according to a measurement in 2008 [Internet traffic, 2010] peer-to-peer traffic accounts for a large part of the overall Internet traffic and these numbers are growing.

Figure 3.7. Internet traffic from the perspective of different protocols

From the point of view of node types and roles, peer-to-peer systems are usually divided into two main categories:


• Pure peer-to-peer: every node is equal with the same functionality; there is no central server or router. This decentralized topology follows the principles of the peer-to-peer architecture most closely. An example is the original Gnutella, where network administration and search are also the responsibility of peers.

• Hybrid peer-to-peer: There are central elements in the network that manage and oversee the peers. There are also multi-level architectures where a group of the peers has special roles and forms an additional sub-network. These entities are usually called supernodes or superpeers. The new Gnutella also uses this architecture.

In the following we consider content sharing peer-to-peer systems. BitTorrent is one of the most efficient content sharing solutions nowadays. In the next subsection we briefly introduce the protocol, since it is used in Chapter 6.

3.6.2 The BitTorrent Protocol

BitTorrent concentrates on efficient, distributed file transfer [BitTorrent 1.0, 2010]. It is designed to distribute large amounts of data without incurring the corresponding consumption in costly server and bandwidth resources; hence, it can be adequate for mobile file-sharing.

In BitTorrent, when peers (clients) are downloading the same content simultaneously, they also send different pieces (small parts of the whole content) of the already downloaded content to each other. This behavior is one of the major advantages of the protocol.

Sharing files over BitTorrent requires at least one dedicated node in the network, which is called the "Tracker". The tracker coordinates file distribution and can be queried for the shared resources under its supervision. If a peer requests specific content from the tracker, it returns several IP addresses which belong to other peers who have the whole content (seeders) or parts of it. The network can also contain several trackers in order to distribute the traffic further.

Figure 3.8 illustrates how download and upload operate at the same time in the BitTorrent protocol. The percentages in the figure illustrate how much of the content has already been downloaded by the relevant peers. Note that nodes marked as Seeder have already downloaded the full content and their current objective is to share it with the other peers.

Figure 3.8. BitTorrent protocol

The process of content sharing via BitTorrent begins with the creation of a torrent file. This file contains metadata related to the tracker and the files to be shared. Figure 3.9 illustrates the structure of a torrent file.

Figure 3.9. Torrent file structure

The size of a torrent file can be calculated with the following formula:

$$S_{TorrentFile} = S_{gi} + S_{ci} + S_{ctrli} \tag{3.6}$$


A torrent file has three main parts. The first one is where general information is stored. This part has constant size (S_gi); it contains the HTTP address of the tracker and information about the creator. The second part contains the names, the sizes and the directory hierarchy related to the shared files. Its size is also small (S_ci); it depends on the number of shared files.

The shared content in BitTorrent is transferred in small pieces. The size of the pieces is pre-defined; it is usually around 64-256 KB. The third part of the torrent file contains the 20-byte SHA1 hash values of these pieces. During operation, BitTorrent calculates another 20-byte SHA1 value from the last two parts of the torrent file, which is used to identify the torrent in the network.

The piece hashes are used during the protocol to check the consistency of the downloaded pieces. Thus the size of the third part (S_ctrli) depends on the size of the content. For example, when the size of the shared content is 2048 KB and it is divided into 64 KB pieces, then

$$S_{ctrli} = \frac{2048\,\mathrm{KB}}{64\,\mathrm{KB}} \cdot 20\,\mathrm{B} = 640\,\mathrm{B} \tag{3.7}$$

We can see from the calculations that the size of a torrent file is relatively small compared to the whole content; thus, transferring it over the network does not consume a significant amount of bandwidth. A torrent file can be transferred in numerous ways; however, it is usually hosted on a web server.
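The size of the third part can be reproduced with a short sketch that also shows how the per-piece SHA1 values are obtained. It is an illustration of the calculation only and does not build a real torrent file. It assumes that a final, shorter piece still contributes a full 20-byte hash. The function names are hypothetical, and the zero-filled content stands in for a shared file.

```python
import hashlib

SHA1_BYTES = 20  # size of one SHA1 digest

def control_info_size(content_size, piece_size):
    """Size in bytes of the piece-hash part (S_ctrli) of a torrent file."""
    pieces = -(-content_size // piece_size)  # ceiling division
    return pieces * SHA1_BYTES

def piece_hashes(content, piece_size):
    """SHA1 digest of every piece of the content, in order."""
    return [hashlib.sha1(content[i:i + piece_size]).digest()
            for i in range(0, len(content), piece_size)]

KB = 1024
content = bytes(2048 * KB)               # placeholder 2048 KB content
hashes = piece_hashes(content, 64 * KB)  # 32 pieces of 64 KB each
```

For the 2048 KB example the content is split into 32 pieces, so the hashes occupy 32 · 20 = 640 bytes. A downloading peer can recompute the digest of each received piece and compare it with the stored value to detect corrupted data.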

In order to share content via BitTorrent, the torrent file related to the content needs to be registered with a tracker; afterwards, any client which obtains the torrent file can connect to the swarm and download or upload the content. Peers are required to periodically check in with the tracker (this process is called "announcing"); thus, the tracker can maintain an up-to-date list of the participating clients.

Concerning legal issues, BitTorrent, similarly to any other file transfer protocol, can be used to distribute files without the permission of the copyright holder.

However, a person who wishes to make a file available must run a tracker on a specific host and distribute the tracker’s address in the torrent file. This feature of the basic protocol also makes it possible to locate the trackers that are responsible for illegal content. It is far easier to request the service provider of the tracker to shut the server down than to find every user sharing a file on a fully decentralized peer-to-peer network.

3.6.3 Analysis of BitTorrent

Despite its popularity, the actual behavior of BitTorrent over prolonged periods of time is still poorly understood. Pouwelse et al. [Pouwelse et al., 2007] present a detailed measurement study of BitTorrent over a period of eight months. They show measurement results on the popularity and availability of BitTorrent, its download performance, the content lifetime, and the structure of the community responsible for verifying the uploaded content. The results show that the system is quite popular, but the number of active users in the system is strongly influenced by the availability of the central components. They also found that 90% of the peers experienced speeds below 65 kB/sec. Regarding lifetime, they showed that only 9,219 out of 53,883 peers (17%) have an uptime longer than one hour after they finished downloading. For 10 hours this number decreases to only 1,649 peers (3.1%), and for 100 hours to a mere 183 peers (0.34%).

Guo et al. [Guo et al., 2005] show that existing studies on BitTorrent systems are single-torrent based, while more than 85% of all peers participate in multiple torrents according to their trace analysis. They present measurements and analysis for multiple torrent environments. In a BitTorrent system, the service policy of seeds favors peers with high downloading speed in order to improve the seed production rate in the system. They demonstrate with measurements that the higher the downloading performance peers have, the less uploading service they actually contribute. This indicates that peers with high speed finish downloading quickly and then quit the system soon, which defeats the design purpose of the seed service policy.

Free-riders in BitTorrent systems are those who download but do not upload any data. This may happen when the user specially configures or modifies his client software. In our case this behavior is not tolerated, because it allows users to manually reduce the cooperation between peers. Andrade et al. [Andrade et al., 2005] present BitTorrent usage data collected across multiple file-sharing communities and analyze the factors that affect users’ cooperative behavior. They found evidence that the design of the BitTorrent protocol results in increased cooperative behavior compared to other P2P protocols used to share similar content. They also showed that in torrents with a relatively low number of seeders, BitTorrent is successful in penalizing free-riding, in effect by increasing the download times of peers that free-ride. However, in torrents where seeders are plentiful, i.e., torrents with high seeding ratios, free-riders may download faster than collaborating peers.

In [Barbera et al., 2005] the goal of the authors was to develop an analytical model of a free-rider in a BitTorrent network. They derived a continuous-time Markov model of a free-rider. Unlike previous analytical models which capture the behavior of the network as a whole, their proposed model is able to analyze the performance from the user’s perspective.

While it is well-known that BitTorrent is vulnerable to selfish behavior, Locher et al. [Locher et al., 2006] demonstrate that even entire files can be downloaded without reciprocating at all in BitTorrent. To this end, they present BitThief, a free-riding client that never contributes any real data. They showed that simple tricks suffice in order to achieve high download rates, even in the absence of seeders. They also illustrated how peers in a swarm react to various sophisticated attacks.

Minglu Li, Jiadi Yu and Jie Wu [Li et al., 2007] present a fluid model with two different classes of peers to capture the effect of free-riding on BitTorrent-like systems. With the model, they found that the incentive mechanism of BitTorrent is successful in preventing free-riding in a system without seeds, but may not succeed in producing a disincentive for free-riding in a system with a high number of seeds.

3.6.4 Applying BitTorrent in Social Networks

The functioning of any peer-to-peer system relies on cooperation between the peers. In a social P2P system, communication occurs almost exclusively between nodes whose owners have a trustworthy social relationship. These systems are also known as darknets [Biddle et al., 2002], [Rogers and Bhatti, 2007] or friend-to-friend (F2F) systems [Li and Dabek, 2006]. The basic assumption behind such systems is that, just as people interacting in the real world are much less likely to cheat friends than strangers, peers in an F2F system are likely to behave cooperatively towards other peers knowing that they belong to friends. There is nevertheless also a need for additional cooperation enforcing mechanisms, like the one described in Section 3.6.5.

There are several P2P systems that use the concept of social networks in one way or another. Li and Dabek [Li and Dabek, 2006] propose an F2F system for scalable and durable storage. The Freenet project [Clarke et al., 2002] turns a social network into a privacy-preserving file sharing application with a focus on efficient search [Sandberg, 2006]. Tribler [Pouwelse et al., 2008], another file-sharing application, allows the creation of social groups. Peers can recruit their friends from the group as download helpers to improve performance. The helpers contribute their bandwidth twice: first downloading from the source and then uploading to the friend that requested help. In contrast, in our system there is no recruitment, but we do take advantage of the fact that two friends are downloading the same content.

None of the above works treats the social network topology as the sole means of content distribution, nor do they clearly measure the performance benefits gained from using the social links.

The popular BitTorrent client, Vuze, has recently added a friends feature. Users can allocate more upload bandwidth to their friends and easily share torrent links with them. The eMule client [Kulbak and Bickson, 2010] has friend lists as well. Users can allocate upload slots to friends so that the friends do not need to wait in the queue.
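The idea of letting friends bypass the upload queue can be illustrated with a small Python sketch. It is a toy model under stated assumptions and not the actual Vuze or eMule implementation: the class UploadQueue, its methods and the peer names are hypothetical, and real clients use more elaborate queue scoring.

```python
from collections import deque


class UploadQueue:
    """Toy upload queue in which friends are served before other peers.

    Hypothetical illustration only; not taken from any real client.
    """

    def __init__(self, slots, friends):
        self.slots = slots            # uploads served per round (assumed fixed)
        self.friends = set(friends)   # peer ids the user marked as friends
        self.friend_queue = deque()   # waiting friends, in arrival order
        self.other_queue = deque()    # all other waiting peers, in arrival order

    def request(self, peer_id):
        """Register an upload request from a peer."""
        if peer_id in self.friends:
            self.friend_queue.append(peer_id)
        else:
            self.other_queue.append(peer_id)

    def next_round(self):
        """Return the peers that receive an upload slot in this round."""
        served = []
        while len(served) < self.slots and (self.friend_queue or self.other_queue):
            # An empty deque is falsy, so friends are always drained first.
            queue = self.friend_queue or self.other_queue
            served.append(queue.popleft())
        return served


if __name__ == "__main__":
    q = UploadQueue(slots=2, friends={"alice"})
    for peer in ["bob", "carol", "alice"]:
        q.request(peer)
    print(q.next_round())  # alice is served first although she arrived last
    print(q.next_round())
```

In this sketch a friend that arrives last is still served in the first round, while the remaining slots go to other peers in arrival order.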

Galuba et al. [Galuba et al., 2010] evaluate a socially-aware BitTorrent, in which peers connect both to the peers obtained from the trackers and to the peers belonging to the user’s friends. The three core problems that they address are: (1) whether the social network topology alone can function as a scalable and efficient content distribution medium, (2) whether it is possible to take advantage of the social links for solving the free-riding problem without sacrificing performance, and (3) at what number of downloaders in the social network the benefit of using the social links becomes significant. They found that a hybrid solution in which peers download from both their friends and other peers obtained from