

Remember that the stationary distribution we obtained earlier for the Markov chain without applying a damping factor was

[0.333, 0.222, 0.222, 0.222]

as also included in Table 8.1. This time, however, when applying β = 0.8 and allowing restarts from states 1 and 2 alone, we obtain a stationary distribution of

[0.349, 0.274, 0.174, 0.203].

As a consequence of biasing the random walk towards states {1, 2}, we observe an increase in the stationary distribution for these states.

Naturally, the amount of probability mass that moved to the favored states is now missing from the remaining two states.

Even though originally p3 = p4 held, we no longer see this in the stationary distribution of the personalized PageRank. Can you explain why the probability in the stationary distribution for state 3 drops more than that of state 4 when we perform the personalized PageRank algorithm with restarts over states {1, 2}?
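The biasing effect described above is easy to reproduce in a few lines of Python. The sketch below runs the power iteration for personalized PageRank with restarts confined to a chosen set of states; note that the 4-state transition matrix P here is a made-up example, not the chain from this chapter.

```python
def personalized_pagerank(P, restart_states, beta=0.8, iters=200):
    """Power iteration for personalized PageRank.

    P is a row-stochastic transition matrix (list of rows); with
    probability beta we follow the chain, otherwise we restart
    uniformly at random from one of the given restart states.
    """
    n = len(P)
    # restart distribution: uniform over the favored states only
    r = [1.0 / len(restart_states) if s in restart_states else 0.0
         for s in range(n)]
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        p = [beta * sum(p[i] * P[i][j] for i in range(n)) + (1 - beta) * r[j]
             for j in range(n)]
    return p

# A hypothetical 4-state row-stochastic matrix (NOT the book's example)
P = [[0.0, 1/3, 1/3, 1/3],
     [0.5, 0.0, 0.5, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0]]

# beta = 1.0 never restarts, so this recovers the plain stationary distribution
plain = personalized_pagerank(P, restart_states={0, 1, 2, 3}, beta=1.0)
# restarts confined to states 1 and 2 (indices 0 and 1) shift mass towards them
biased = personalized_pagerank(P, restart_states={0, 1}, beta=0.8)
```

Comparing `plain` and `biased` shows the same phenomenon as in the text: the favored states gain probability mass at the expense of the remaining ones.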

8.4 Hubs and Authorities ?

The Hubs and Authorities algorithm¹¹ (also referred to as the Hyperlink-Induced Topic Search or HITS algorithm) resembles PageRank in certain aspects, but there are also important differences between the two. The HITS and PageRank algorithms are similar in that both of them assign importance scores to the vertices of a complex network. The HITS algorithm, however, differs from PageRank in that it calculates two scores for each vertex and that it operates directly on the adjacency matrix of the network.

¹¹ Kleinberg 1999


The fundamental difference between PageRank and HITS is that while the former assigns a single importance score to each vertex, HITS characterizes every node from two perspectives. Putting the HITS algorithm into the context of web browsing, a website can become prestigious by

1. directly providing relevant contents or

2. providing valuable links to websites that offer relevant contents.

The two kinds of prestige scores are tightly coupled with each other, since a page is deemed to provide relevant contents if it is referenced by websites that are deemed to reference relevant websites.

Conversely, a website is said to reference relevant pages if the direct hyperlink connections it has point to websites with relevant contents.

As such, the two kinds of relevance scores mutually depend on each other. We call the first kind of relevance the authority of a node, whereas we call the second type of relevance the hubness of a node.

The authority score of some node j ∈ V is defined as

a_j = Σ_{(i,j)∈E} h_i.    (8.7)

The hubness score of a node j ∈ V is defined analogously as

h_j = Σ_{(j,k)∈E} a_k.    (8.8)

Note, however, that the calculation of the authority score of node j ∈ V involves a summation over its incoming edges, whereas its hubness score depends on the authority scores of the neighbors accessible via its outgoing edges.

The way we can calculate these two scores is quite reminiscent of the calculation of the PageRank scores, i.e., we follow an iterative algorithm. As with the calculation of PageRank scores, we also have a globally convergent algorithm for calculating the hubness and authority scores of the individual nodes of the network.

One difference compared to the PageRank algorithm is that HITS relies on the adjacency matrix during the calculation of the hub and authority scores of the nodes, instead of the row-stochastic transition matrix used by PageRank. The adjacency matrix is a simple square binary matrix A ∈ {0, 1}^{n×n} which records, for every pair of vertices in the network, whether there exists a directed edge between them. An a_ij = 1 entry indicates that there is a directed edge from node i to node j. Analogously, a_ij = 0 means that no such directed edge exists.
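As a quick illustration, an adjacency matrix can be built from a directed edge list in a few lines; the three-vertex network below is a made-up example.

```python
def adjacency_matrix(n, edges):
    """Build the n-by-n binary adjacency matrix of a directed graph.

    A[i][j] == 1 iff there is a directed edge from vertex i to vertex j.
    """
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return A

# a small hypothetical directed network on 3 vertices
A = adjacency_matrix(3, [(0, 1), (0, 2), (1, 2)])
# A == [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
```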

For the globally convergent solution of (8.7) and (8.8) we have h = ξAa and a = νA^⊺h for some appropriately chosen scalars ξ and ν. This further implies that for the solutions we have

h = ξν AA^⊺h
a = νξ A^⊺Aa.    (8.9)

According to (8.9), we can obtain the ideal hubness and authority vectors as eigenvectors of the matrices AA^⊺ and A^⊺A, respectively. The problem with this kind of solution is that even though networks (and their respective adjacency matrices) are typically sparse, by forming the Gram matrices AA^⊺ or A^⊺A we would likely obtain matrices that are no longer sparse. The sparsity of matrices, however, is something that we typically value and seldom want to sacrifice. In order to overcome this issue, we rather employ an asynchronous strategy for obtaining h and a.
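The eigenvector connection in (8.9) is easy to check numerically: after alternating the two updates until convergence, applying A^⊺ and then A (i.e., multiplying by the Gram matrix AA^⊺ without ever forming it explicitly) rescales h without changing its direction. The three-vertex adjacency matrix below is a made-up example.

```python
def matvec(M, v):
    # multiply matrix M (a list of rows) with vector v
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# adjacency matrix of a small hypothetical directed network
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
At = [list(col) for col in zip(*A)]  # A transposed

# alternate the authority/hubness updates, normalizing by the maximum entry
h = [1.0, 1.0, 1.0]
for _ in range(100):
    a = matvec(At, h)
    a = [x / max(a) for x in a]
    h = matvec(A, a)
    h = [x / max(h) for x in h]

# at the fixed point, (A A^T) h is proportional to h, i.e. h is an
# eigenvector of the Gram matrix; we apply A^T then A in sequence
# rather than materializing the (potentially dense) product A A^T
g = matvec(A, matvec(At, h))
```

The component-wise ratios g_i / h_i (over the non-zero entries of h) all agree, and their common value is the dominant eigenvalue of AA^⊺.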

Algorithm 4 gives the pseudocode for HITS following the principles of asynchronicity. Lines 5 and 7 of Algorithm 4 reveal that after each matrix multiplication performed in order to get an updated solution for h and a, we employ a normalization step ensuring that the largest element within these vectors equals one. Without these normalization steps, the values in h and a would increase without bound and the algorithm would diverge. The normalization step divides each element of the vectors h and a by the largest element included in them.

Algorithm 4: Pseudocode for the HITS algorithm.

Require: adjacency matrix A
Ensure: vectors a, h storing the authority and hubness of the vertices of the network

1: function HITS(A)
2:     h = 1                 // initialize the hubness vector to all ones
3:     while not converged do
4:         a = A^⊺h
5:         a = a / max(a)    // normalize a such that all its entries are ≤ 1
6:         h = Aa
7:         h = h / max(h)
8:     end while
9:     return a, h
10: end function
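Algorithm 4 translates almost line by line into Python. The sketch below uses plain lists; the 5-vertex adjacency matrix is a hypothetical reconstruction consistent with the verbal description of the example network discussed below (vertex 1 linking to vertices 2, 3 and 4, vertex 4 linking to 2 and 3, and vertex 3 linking only to vertex 5), not necessarily the book's actual Figure 8.10.

```python
def hits(A, iters=100):
    """HITS via the alternating updates of Algorithm 4 on adjacency matrix A."""
    n = len(A)
    h = [1.0] * n  # initialize the hubness vector to all ones
    a = [0.0] * n
    for _ in range(iters):
        # authority update: a = A^T h, then scale so that max(a) == 1
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        a = [x / max(a) for x in a]
        # hubness update: h = A a, then scale so that max(h) == 1
        h = [sum(A[j][k] * a[k] for k in range(n)) for j in range(n)]
        h = [x / max(h) for x in h]
    return a, h

# hypothetical 5-vertex network (row/column i corresponds to vertex i+1):
# 1 -> {2, 3, 4}, 3 -> {5}, 4 -> {2, 3}; vertices 2 and 5 have no out-edges
A = [[0, 1, 1, 1, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 1, 1, 0, 0],
     [0, 0, 0, 0, 0]]
a, h = hits(A)
```

On this network the qualitative behavior described in the text emerges: vertices 2 and 3 share the highest authority score, vertex 1 is a better hub than vertex 4, and vertices 3 and 5 end up with (near-)zero hubness.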

Table 8.5 shows the convergence of HITS that we get when applying Algorithm 4 over the sample network structure presented in Figure 8.10. We can see that the vertices indexed as 3 and 5 are totally useless as hubs. This is not surprising, as a vertex with a good hubness score needs to link to vertices which have high authority scores.


Figure 8.10: Example network to perform the HITS algorithm on. (Panel (b) shows the adjacency matrix of the network.)

Table 8.5: Illustration of the convergence of the HITS algorithm for the example network included in Figure 8.10. Each row corresponds to a vertex from the network.

Even though vertex 3 has an outgoing edge, the vertex it links to is not authoritative. The case of vertex 5 is even simpler, as it has no outgoing edges at all; hence it is completely incapable of linking to vertices with high authority scores (in fact, it even fails to link to vertices with any non-zero authority score).

Vertex 5 turns out to be a poor authority as well. This is in accordance with the fact that its single incoming edge arrives from vertex 3, which has been assigned a low hubness score.

Although vertex 3 is one of the worst hubs, it is at the same time one of the vertices with the highest authority score. Indeed, it is vertices 2 and 3 that obtain the highest authority scores. It is not surprising at all that vertices 2 and 3 have the same authority scores, since they receive the same incoming edges from vertices 1 and 4. As a consequence of vertices 2 and 3 turning out to be highly authoritative, the vertices that link to them obtain high hubness. Since the vertices directly accessible from vertex 1 form a superset of those accessible from vertex 4, the hubness score of vertex 1 surpasses that of vertex 4. As the above example suggests, the a and h scores of the HITS algorithm reach an equilibrium after only a few iterations.

8.5 Further reading

PageRank and its vast number of variants¹² admittedly belong to the most popular approaches to network analysis. As mentioned previously, we can conveniently model a series of real-world processes and phenomena with the help of complex networks, including but not limited to social interactions¹³, natural language¹⁴ or biomedicine¹⁵.

¹² Wu et al. 2013; Mihalcea and Tarau 2004; Jeh and Widom 2002
¹³ Newman 2001; Hasan et al. 2006; Leskovec et al. 2010; Backstrom and Leskovec 2011; Matakos et al. 2017; Tasnádi and Berend 2015
¹⁴ Erkan and Radev 2004; Mihalcea and Tarau 2004; Toutanova et al. 2004
¹⁵ Wu et al. 2013; Ji et al. 2015

Liu [2006] provides a comprehensive overview of web data mining, and Manning et al. [2008] offer a thorough description of information retrieval techniques.

8.6 Summary of the chapter

Many real-world phenomena can be conveniently modeled as a network. As these networks have a potentially enormous number of vertices, efficient ways of handling these datasets are of utmost importance. Efficiency is required both in the storage of the networks and in the algorithms employed over them.

In this chapter we reviewed typical ways networks can be represented, together with multiple algorithms that assign relevance scores to the individual vertices of a network. Readers are expected to understand the working mechanisms behind these algorithms.

9 | CLUSTERING

This chapter provides an introduction to clustering algorithms.

Throughout the chapter, we overview two main paradigms of clustering techniques, i.e., algorithms that perform clustering in a hierarchical and in a partitioning manner. Readers will learn about the task of clustering itself, understand the working mechanisms of these algorithms and develop an awareness of their potential limitations and how to mitigate those.
