
4.3 Structure and Analysis of Similarities

4.3.3 Similarity Detecting Algorithm

This detecting algorithm runs continuously in the background and is triggered whenever operation (1), (2) or (3), described in 4.3.1, occurs. Algorithm 4.1 gives the pseudocode of the similarity detecting algorithm; a comment before each function indicates the operation to which it belongs. Note that member registration calls operations (2) and (3).

Functions (1), (2) and (3) operate as discussed in subsection 4.3.1. The function searchForSimilarities first goes through the phonebook of the new member and checks, by calling searchForSimilaritiesToPC, whether any private contact in this phonebook is similar to a member of the system. It then calls searchForSimilaritiesToM, which goes through every other phonebook in the system and checks for similarities to this member.

Proposition 4.12. Measurements show that the precision rate of the similarity detecting algorithm (Algorithm 4.1) is above 90%.

Proof. During the operational period of Phonebookmark we collected statistics about similarities. From the point of view of precision, it makes no difference whether a detection relates to a duplication or to a similarity between a member and a private contact, since in both cases person details are compared.

According to these statistics, 1565 duplications and similarities were detected during the operational period, and 1417 of them were accepted by members. That is, 90.54% of the detected similarities were accurate.

In the following we refer to this similarity accuracy rate as PR ∼ 0.9. This parameter can be changed according to the accuracy of the applied similarity detecting algorithm.
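As a quick check of the arithmetic, the rate follows directly from the two counts reported in the proof:

```latex
\[
PR = \frac{1417}{1565} \approx 0.9054 = 90.54\%
\]
```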

As the description of Algorithm 4.1 shows, the getSimilarityProbability function is called to check whether a private contact and a member are similar. This function is the core of similarity detection: it compares two person entries and calculates a similarity probability value for them. If this probability is greater than 0, the similarity is stored and displayed to the related member. The probability value is also used to order the proposals when multiple similarities are found.
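The store-if-positive rule and the ordering of multiple proposals can be sketched in Python as follows. This is a minimal illustration, not the Phonebookmark implementation: the dictionary-based person records, the `id` field and the `toy_probability` scoring function are all assumptions made for the example, and the real probability is the learned value discussed in the rest of this section.

```python
from typing import Callable, Dict, List, Tuple

Person = Dict[str, str]               # hypothetical person record
Proposal = Tuple[str, str, float]     # (contact id, member id, probability)


def search_for_similarities_to_pc(
    private_contact: Person,
    members: List[Person],
    get_similarity_probability: Callable[[Person, Person], float],
) -> List[Proposal]:
    """Compare one private contact with every member, as in function (2)."""
    found: List[Proposal] = []
    for member in members:
        probability = get_similarity_probability(private_contact, member)
        if probability > 0:  # only non-zero probabilities are stored
            found.append((private_contact["id"], member["id"], probability))
    # Highest probability first, so the most likely match is proposed first.
    found.sort(key=lambda proposal: proposal[2], reverse=True)
    return found


def toy_probability(a: Person, b: Person) -> float:
    """Assumed toy scoring for the example; NOT the learned probability."""
    score = 0.0
    if a.get("name") == b.get("name"):
        score += 0.5
    if a.get("phone") is not None and a.get("phone") == b.get("phone"):
        score += 0.4
    return score


members = [
    {"id": "m1", "name": "Joe Smith", "phone": "111"},
    {"id": "m2", "name": "Joe Smith", "phone": "222"},
    {"id": "m3", "name": "Ann Lee", "phone": "333"},
]
contact = {"id": "pc1", "name": "Joe Smith", "phone": "222"}
proposals = search_for_similarities_to_pc(contact, members, toy_probability)
```

Here the member sharing both name and phone number is proposed before the member sharing only the name, and the unrelated member is not proposed at all.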

For similarity detection in mobile related social networks, we considered person profiles as structures. The attributes of such a structure are profile details such as phone number, e-mail address, birthday, etc. In Phonebookmark we selected those attributes that are also available on mobile phones. Let

Algorithm 4.1 Similarity detecting

// Array of members who are already in the network
var members[] = ArrayOfMembers;
// Array of relevant similarity events (M)
var matchTermsAllM[];

// (1)
function searchForSimilarities(newMember)
    for all privateContact (aPC) in newMember.phonebook do
        searchForSimilaritiesToPC(aPC)
    searchForSimilaritiesToM(newMember)
    members.add(newMember)

// (2)
function searchForSimilaritiesToPC(privateContact)
    for all member (M) in members do
        var similarityProbability = getSimilarityProbability(privateContact, M)
        if similarityProbability > 0 then
            // store the similarity
            storeSimilarity(privateContact, M, similarityProbability)

// (3)
function searchForSimilaritiesToM(updatedMember)
    for all member (M) in members do
        for all privateContact (PC) in M.phonebook do
            var similarityProbability = getSimilarityProbability(updatedMember, PC)
            if similarityProbability > 0 then
                // store the similarity
                storeSimilarity(PC, M, similarityProbability)

function getSimilarityProbability(person1, person2)
    // The array stores similarity events which are true for person1 and person2 (H)
    var relevantEventsH[] = new Array();
    for all matchterm (M) in matchTermsAllM do
        // check whether M is true for person1 and person2
        if checkMatchTerm(M, person1, person2) then
            relevantEventsH.add(M);
    if relevantEventsH.size > 0 then
        // get the current probability (weight) for the relevant events
        return getEventsProbability(relevantEventsH)
    else
        return 0

M = {M1, M2, ..., Mk}. We define the following events for the comparison of two profiles: M = {M1, M2, M3, M4, M5}. Note that this list can be extended if needed:

• M1: they have the same first name and last name,

• M2: they have the same private phone number,

• M3: they have the same public phone number (fax, office number, etc.),

• M4: they have the same e-mail address,

• M5: they have the same birthday.
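The events listed above can be evaluated for two profiles as in the following Python sketch. The `Profile` field names are illustrative assumptions rather than the Phonebookmark schema, and treating a missing detail as a non-match is likewise an assumption of this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Profile:
    # Field names are assumptions made for this sketch.
    first_name: Optional[str] = None
    last_name: Optional[str] = None
    private_phone: Optional[str] = None
    public_phone: Optional[str] = None
    email: Optional[str] = None
    birthday: Optional[str] = None


def _same(a: Optional[str], b: Optional[str]) -> bool:
    """Two details match only if both are present and equal (assumption)."""
    return a is not None and a == b


def relevant_events(p1: Profile, p2: Profile) -> FrozenSet[str]:
    """Return the subset H of {M1, ..., M5} that holds for the two profiles."""
    checks = {
        "M1": _same(p1.first_name, p2.first_name)
        and _same(p1.last_name, p2.last_name),
        "M2": _same(p1.private_phone, p2.private_phone),
        "M3": _same(p1.public_phone, p2.public_phone),
        "M4": _same(p1.email, p2.email),
        "M5": _same(p1.birthday, p2.birthday),
    }
    return frozenset(name for name, holds in checks.items() if holds)


contact = Profile("Joe", "Smith", private_phone="+36 1 555 0100")
member = Profile(
    "Joe", "Smith", private_phone="+36 1 555 0100", email="joe@example.com"
)
H = relevant_events(contact, member)  # M1 and M2 hold for this pair
```

Returning the events as a frozen set makes the result directly usable as a dictionary key when statistics are kept per event subset.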

We refer to the event of identity (accepted similarity) as I. Based on the frequency of the events in the past, the algorithm maintains, for each subset H ⊆ M, the conditional probability Pr[I|H], i.e. the conditional probability that the similarity becomes accepted if the set of events H occurs. For each subset H ⊆ M, Pr[I|H] can be calculated as follows:

\[
\Pr[I \mid H] = \frac{\Pr[I \wedge H]}{\Pr[H]}
= \frac{\dfrac{n_{IH}}{(|U_M| - 1)\,|U_M|}}{\dfrac{n_H}{(|U_M| - 1)\,|U_M|}}
= \frac{n_{IH}}{n_H} \tag{4.2}
\]

In (4.2), n_IH is the number of cases where the events of H occur and the similarity has been accepted, and n_H is the number of all cases where the events of H occur. (|U_M| − 1) U_PC is the number of private contact and member pairs that may define a similarity in the mobile related social network.
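Since the normalising term cancels in (4.2), the estimate only needs the two counts. A minimal Python sketch, assuming the counts are kept in a dictionary keyed by the event subset H; the counts below are made up for illustration and are not the Phonebookmark data, and returning 0 for a subset that has never been observed is an assumption of this example.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical storage: for each event subset H, the pair (n_IH, n_H).
Counts = Dict[FrozenSet[str], Tuple[int, int]]


def conditional_probability(counts: Counts, events: FrozenSet[str]) -> float:
    """Estimate Pr[I|H] as n_IH / n_H; 0.0 if H was never observed."""
    n_ih, n_h = counts.get(events, (0, 0))
    return n_ih / n_h if n_h else 0.0


# Made-up counts, for illustration only.
counts: Counts = {
    frozenset({"M1", "M2"}): (45, 50),
    frozenset({"M1"}): (30, 40),
}
p_name_and_phone = conditional_probability(counts, frozenset({"M1", "M2"}))
p_unseen = conditional_probability(counts, frozenset({"M5"}))
```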

Based on this consideration, we implemented the algorithm with learning methods. For each H ⊆ M, the values n_IH and n_H can be initialized from a collected data set. During the operational period of Phonebookmark (subsection 4.3.2) we collected a base data set that can be used as starting values for any mobile related social network. Table 4.1 contains the collected base data set of Phonebookmark.

During the operational period, when the system is used actively, these conditional probabilities can be maintained dynamically based on the decisions of the members, i.e. whether they accept or ignore the proposed similarities. We obtain the following proposition:

Proposition 4.13. In a mobile related social network that applies the model (4.2), the time needed to maintain the conditional probabilities for the similarity handling algorithm after each member decision is O(1).

Table 4.1. Conditional probabilities calculated during the operational period of Phonebookmark

H                   Pr[I|H]
M1,M2,M3,M4,M5      1
M1,M2,M3,M4         0.9784
M1,M3,M4,M5         0.9897
M1,M2,M4,M5         0.9913
...                 ...
M1,M2,M4            0.9605
M1,M2,M5            0.9090
...                 ...
M2                  0.8730
M1                  0.8794

Proof. If a member accepts a proposed similarity for which the events of H occur, then the values n_IH and n_H both increase by one. Therefore, Pr[I|H] becomes

\[
\Pr[I \mid H] = \frac{n_{IH} + 1}{n_H + 1} \tag{4.3}
\]

If the member ignores the proposed similarity for which the events of H occur, then n_H increases by one and n_IH does not change. Therefore, Pr[I|H] becomes

\[
\Pr[I \mid H] = \frac{n_{IH}}{n_H + 1} \tag{4.4}
\]

In either case, whether the similarity is accepted or ignored, at most two counters have to be modified, which can be done in constant time.
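The update rules (4.3) and (4.4) can be sketched in Python as follows. The class and method names are hypothetical, and the constant-time claim for the lookup relies on the usual average-case behaviour of a hash table, which is an assumption about the storage and not something stated in the text.

```python
from collections import defaultdict
from typing import DefaultDict, FrozenSet, List


class SimilarityStats:
    """Keeps the counters n_IH and n_H for each event subset H (a sketch)."""

    def __init__(self) -> None:
        # counters[H] == [n_IH, n_H]
        self.counters: DefaultDict[FrozenSet[str], List[int]] = defaultdict(
            lambda: [0, 0]
        )

    def record_decision(self, events: FrozenSet[str], accepted: bool) -> None:
        """One lookup and at most two increments per member decision."""
        entry = self.counters[events]
        if accepted:
            entry[0] += 1  # n_IH grows only on acceptance, as in (4.3)
        entry[1] += 1      # n_H grows on every decision, as in (4.3) and (4.4)

    def probability(self, events: FrozenSet[str]) -> float:
        """Current estimate of Pr[I|H]; 0.0 if H has no decisions yet."""
        n_ih, n_h = self.counters.get(events, (0, 0))
        return n_ih / n_h if n_h else 0.0


stats = SimilarityStats()
h = frozenset({"M1", "M2"})
for accepted in (True, True, True, False):
    stats.record_decision(h, accepted)
current = stats.probability(h)  # three acceptances out of four decisions
```

The work per decision does not depend on how many decisions have been recorded before, which is the point of the proposition.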

In several cases, exact string comparison is not enough for similarity detection. We have created an extension of Algorithm 4.1 that deals efficiently with similar terms, e.g. similar names such as Joe and Joseph. This extension can also be applied to other details, e.g. nickname, last name, etc.

For example, the similar name handling uses the following procedure:

Algorithm 4.2 Learning and handling similar terms

function storeSimilarity(person1, person2, simProbability)
    for term in person1 and person2 do
        // only terms that differ between the two persons can form a similar term pair
        if term in possibleSimilarTerms and person1.term != person2.term then
            // stores or updates the counter of the similarity term
            storeOrUpdateSimilarityTerm(person1.term, person2.term);
            // get the counter for the similarity term
            var simTermCounter = getSimTermCounter(person1.term, person2.term);
            if simTermCounter >= KSimilarityToIdentityLimit then
                // adds a new match term to matchTermsAllM
                addMatchTermToM(new MatchTerm(person1.term, person2.term));
    // save the similarity in DB
    saveSimilarity(person1, person2, simProbability);

1. When a member uses the system and accepts a similarity, the algorithm checks whether the corresponding phonebook contact and the member have the same first name.

2. If they do not, but the identity was nevertheless marked between these two people, the similar name pair is saved.

3. If this name pair has never been detected before, a new entry is stored in the similar terms table with weight value 1.

4. If this name pair is already in the similar terms table, its weight value is incremented.

5. When the weight value of a similar name pair exceeds a certain limit, the related names are considered as identical.
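The five steps above can be sketched in Python as follows. This is an illustration under stated assumptions: the class name, the in-memory dictionary standing in for the similar terms table, the order-independent pair key and the limit value are all hypothetical; the text only says that the weight must exceed "a certain limit".

```python
from typing import Dict, Set, Tuple

NamePair = Tuple[str, str]


class SimilarTerms:
    """In-memory stand-in for the similar terms table (a sketch)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit                      # assumed threshold value
        self.weights: Dict[NamePair, int] = {}  # name pair -> weight
        self.identical: Set[NamePair] = set()   # pairs treated as identical

    @staticmethod
    def _key(a: str, b: str) -> NamePair:
        # Order-independent key, so (Joe, Joseph) and (Joseph, Joe) coincide.
        return (a, b) if a <= b else (b, a)

    def on_accepted(self, contact_name: str, member_name: str) -> None:
        """Called when a member accepts a similarity (step 1)."""
        if contact_name == member_name:
            return  # same first name: nothing to learn
        key = self._key(contact_name, member_name)
        # Steps 3 and 4: a new pair starts at weight 1, a known pair grows.
        self.weights[key] = self.weights.get(key, 0) + 1
        # Step 5: above the limit, the names are considered identical.
        if self.weights[key] > self.limit:
            self.identical.add(key)

    def are_identical(self, a: str, b: str) -> bool:
        return a == b or self._key(a, b) in self.identical


terms = SimilarTerms(limit=2)
for _ in range(3):
    terms.on_accepted("Joe", "Joseph")
terms.on_accepted("Sam", "Samantha")
```

After three accepted similarities the pair Joe and Joseph exceeds the assumed limit of 2 and is treated as identical, while Sam and Samantha, seen once, is only recorded with weight 1.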

Table 4.2 illustrates the similar terms table. Because of anonymity issues related to the Phonebookmark database, Table 4.2 is filled with example data, but the real structure is the same.

The proposed similar term handling method is also applicable in an international environment, where some people enter names without accents, e.g. Zoltan and Zoltán. In this way the algorithm is also able to deal with international accents.

Table 4.2. Similar terms table

similar term 1    similar term 2    weight
Joe               Joseph            9
Sam               Samantha          3
Katharine         Kate              6

Proposition 4.14. The similar term handling method of Section 4.3.3 increases the number of detected relevant similarities, while it increases the person comparison time only by a constant.

Proof. Let M = {M1, M2, ..., Mk}. Without loss of generality, assume that M1 is the name match event, and let S_M1 denote the number of similarities detected because of condition M1. Formally, the similar term handling extension replaces M1 with M1′, where M1′ means that the algorithm checks whether the two profile structures contain the same name or the names are in the similar terms table. Let S_M1′ denote the number of similarities detected because of condition M1′. Since Pr[M1] ≤ Pr[M1′], we have S_M1 ≤ S_M1′.

The execution time of the algorithm increases only by one table select step, performed when the similar terms table is checked. Only those rows are relevant whose weight value is greater than a specific limit. If similar term handling is applied to other profile details as well, the comparison time increases linearly with the number of such details.
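The relation between the original and the extended name match event can be illustrated with a short Python sketch. The dictionary below is a stand-in for the similar terms table, filled with example weights in the spirit of Table 4.2; the limit value and the function names are assumptions made for this example.

```python
from typing import Dict, List, Tuple

NamePair = Tuple[str, str]

# Example weights; keys are stored in sorted order. The limit is assumed.
SIMILAR_TERMS: Dict[NamePair, int] = {
    ("Joe", "Joseph"): 9,
    ("Sam", "Samantha"): 3,
    ("Kate", "Katharine"): 6,
}
LIMIT = 5


def exact_match(a: str, b: str) -> bool:
    """Original name match event: the names are identical strings."""
    return a == b


def extended_match(a: str, b: str) -> bool:
    """Extended event: identical, or a similar pair weighted above the limit."""
    if a == b:
        return True
    key = (a, b) if a <= b else (b, a)
    # A single extra lookup, corresponding to the table select step.
    return SIMILAR_TERMS.get(key, 0) > LIMIT


pairs: List[NamePair] = [
    ("Joe", "Joe"),
    ("Joseph", "Joe"),
    ("Sam", "Samantha"),
    ("Kate", "Ann"),
]
exact_hits = [p for p in pairs if exact_match(*p)]
extended_hits = [p for p in pairs if extended_match(*p)]
```

Every pair found by the exact comparison is also found by the extended one, and the extended comparison additionally finds Joseph and Joe; Sam and Samantha stays undetected because its example weight is below the assumed limit.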

Similarity handling is a key issue in mobile related social networks. Since we proposed a semi-automatic similarity handling solution, users have to decide whether to accept or ignore detected similarities. For this solution to operate efficiently, the algorithm should be as precise as possible, which makes the decision easier for users.

The objective of the similarity detecting algorithm is therefore to detect all relevant similarities with as high a precision rate as possible.