
Remember that the stationary distribution we obtained earlier for the Markov chain without applying a damping factor was

[0.333, 0.222, 0.222, 0.222]

as also included in Table 8.1. This time, however, when applying
*β* = 0.8 and allowing restarts from states 1 and 2 alone, we
obtain a stationary distribution of

[0.349, 0.274, 0.174, 0.203].

As a consequence of biasing the random walk towards states {1, 2},
we observe an increase in the stationary distribution for these states.

Naturally, the probability mass that moved to the favored states is now missing from the remaining two states.

Even though originally **p**^{∗}_{3} = **p**^{∗}_{4} held, we no longer see this in the
stationary distribution of the personalized PageRank. Can you explain why the
probability in the stationary distribution for state 3 drops more than
that of state 4 when we perform the personalized PageRank algorithm
with restarts over states {1, 2}?
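The effect described above is easy to reproduce in code. The sketch below runs the personalized PageRank power iteration **p** ← *β*P^⊤**p** + (1 − *β*)**r**. Note that the 4-state transition matrix `P` below is a hypothetical toy chain, not the one from Table 8.1, so the exact numbers differ from the ones above; the qualitative effect — the restart states gaining probability mass at the expense of the others — is the same.

```python
def personalized_pagerank(P, restart_states, beta=0.8, iters=200):
    """Power iteration for personalized PageRank.

    P is a row-stochastic transition matrix (list of rows);
    restart_states is the set of states the walk may teleport to."""
    n = len(P)
    # Restart distribution: uniform over the favored states only.
    r = [1.0 / len(restart_states) if i in restart_states else 0.0
         for i in range(n)]
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # p_new[j] = beta * sum_i p[i] * P[i][j] + (1 - beta) * r[j]
        p = [beta * sum(p[i] * P[i][j] for i in range(n)) + (1 - beta) * r[j]
             for j in range(n)]
    return p

# A hypothetical row-stochastic chain on 4 states (illustrative only).
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
]

# Without damping (beta = 1), this chain's stationary distribution is uniform.
plain = personalized_pagerank(P, restart_states={0, 1, 2, 3}, beta=1.0)
# Restarting from the first two states alone shifts mass towards them:
biased = personalized_pagerank(P, restart_states={0, 1}, beta=0.8)
# biased converges to [0.30, 0.30, 0.20, 0.20] for this particular chain.
```

Since the update is a contraction with factor *β*, a couple of hundred iterations are far more than needed for convergence here.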

*8.4* Hubs and Authorities

The **Hubs and Authorities** algorithm (Kleinberg 1999), also referred to as
**Hyperlink-Induced Topic Search** or the **HITS** algorithm, resembles PageRank
in certain aspects, but there are also important differences between
the two. The HITS and PageRank algorithms are similar in that both
of them assign importance scores to the vertices of a complex
network. The HITS algorithm, however, differs from PageRank
in that it calculates two scores for each vertex and that it operates
directly on the adjacency matrix of the network.


The fundamental difference between PageRank and HITS is that while the former assigns a single importance score to each vertex, HITS characterizes every node from two perspectives. Putting the HITS algorithm into the context of web browsing, a website can become prestigious by

1. directly providing relevant contents or

2. providing valuable links to websites that offer relevant contents.

The two kinds of prestige scores are tightly coupled with each other, since a page is deemed to provide relevant contents if it is referenced by websites that are deemed to reference relevant websites.

Conversely, a website is said to reference relevant pages if the direct hyperlink connections it has point to websites with relevant contents.

As such, the two kinds of relevance scores mutually depend on each
other. We call the first kind of relevance the **authority** of a node,
whereas we call the second type of relevance the **hubness** of a
node.

The authority score of a node j ∈ V is defined as

**a**_j = ∑_{(i,j)∈E} **h**_i.    (8.7)

The hubness score of a node j ∈ V is defined analogously as

**h**_j = ∑_{(j,k)∈E} **a**_k.    (8.8)

Recall, however, that the calculation of the authority score for node
j ∈ V involves a summation over its incoming edges, whereas its
hubness score depends on the authority scores of its neighbors
accessible via outgoing edges.

The way we can calculate these two scores is quite reminiscent
of the calculation of the PageRank scores, i.e., we follow an iterative
algorithm. As with the calculation of PageRank scores,
we also have a **globally convergent** algorithm for calculating the
hubness and authority scores of the individual nodes of the network.

One difference compared to the PageRank algorithm is that HITS
relies on the adjacency matrix during the calculation of the hub and
authority scores of the nodes, instead of the row-stochastic transition
matrix used by PageRank. The **adjacency matrix** is a simple
square binary matrix A ∈ {0, 1}^{n×n} which encodes, for every
pair of vertices in the network, whether a directed edge exists between them. An
a_{ij} = 1 entry indicates that there is a directed edge from node i
to node j. Analogously, a_{ij} = 0 means that no such directed edge
exists.
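As a quick illustration, building such a binary adjacency matrix from a list of directed edges takes only a few lines (the three-node edge list here is a made-up example):

```python
def adjacency_matrix(n, edges):
    """Build the n-by-n binary adjacency matrix of a directed graph.

    A[i][j] == 1 iff there is a directed edge from node i to node j."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return A

# A small hypothetical graph: 0 -> 1, 1 -> 2, 2 -> 0
A = adjacency_matrix(3, [(0, 1), (1, 2), (2, 0)])
# A == [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```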

For the globally convergent solution of (8.7) and (8.8) we have
**h**^∗ = ξA**a**^∗ and **a**^∗ = νA^⊤**h**^∗ for some appropriately chosen scalars ξ
and ν. What this further implies is that for the solutions we have

**h**^∗ = ξν AA^⊤ **h**^∗

**a**^∗ = νξ A^⊤A **a**^∗.    (8.9)

According to (8.9), we can obtain the ideal hubness and
authority vectors as eigenvectors of the matrices AA^⊤ and A^⊤A,
respectively. The problem with this kind of solution is that even
though real-world networks – and their respective adjacency matrices – are
typically sparse, by forming the **Gram matrices** AA^⊤ or A^⊤A, we would
likely obtain matrices that are no longer sparse. The sparsity of
matrices, however, is something that we typically value and
rarely want to sacrifice. In order to overcome this issue, we instead
employ an asynchronous strategy for obtaining **h**^∗ and **a**^∗.
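The point of the asynchronous strategy can be demonstrated numerically: multiplying a vector by A^⊤ and then by A yields exactly (AA^⊤)**h** without ever materializing the potentially dense Gram matrix. A minimal sketch, using an arbitrary 4 × 4 adjacency matrix chosen purely for illustration:

```python
def matvec(M, v):
    # Multiply matrix M (list of rows) by vector v.
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    # Dense matrix product; only used here to verify the shortcut.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# An arbitrary adjacency matrix, for illustration only.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
]
h = [1.0, 1.0, 1.0, 1.0]
At = transpose(A)

two_sparse_steps = matvec(A, matvec(At, h))  # A (A^T h): two sparse products
one_dense_step = matvec(matmul(A, At), h)    # (A A^T) h: forms the Gram matrix
# Both routes give the same vector, so the Gram matrix never has to be
# formed explicitly.
```

When A is stored sparsely, the two-step route costs time proportional to the number of edges per multiplication, whereas AA^⊤ may have far more non-zeros than A itself.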

Algorithm 4 gives the pseudocode for HITS following the
principles of asynchronicity. Lines 5 and 7 of Algorithm 4 reveal that after
each matrix multiplication performed in order to get an updated
solution for **h** and **a**, we employ a normalization step ensuring that
the largest element within these vectors equals one. Without these
normalization steps, the values in **h** and **a** would increase without
bound and the algorithm would diverge. The normalization step
divides each element of vectors **h** and **a** by the largest value included in
them.

**Algorithm 4**: Pseudocode for the HITS algorithm.

**Require:** adjacency matrix A

**Ensure:** vectors **a**, **h** storing the authority and hubness of the vertices
of the network

1: **function** HITS(A)

2:   **h** = **1** // Initialize the hubness vector to all ones

3:   **while** not converged **do**

4:     **a** = A^⊤**h**

5:     **a** = **a**/max(**a**) // normalize **a** such that all its entries are ≤ 1

6:     **h** = A**a**

7:     **h** = **h**/max(**h**) // normalize **h** such that all its entries are ≤ 1

8:   **end while**

9:   **return a**, **h**

10: **end function**
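A direct Python transcription of Algorithm 4 might look as follows. This is a sketch: a fixed iteration count stands in for a proper convergence test, and the 5-vertex adjacency matrix is reconstructed from the textual description of Figure 8.10 below, not necessarily the book's exact network.

```python
def hits(A, iters=100):
    """HITS per Algorithm 4; A is a binary adjacency matrix (list of rows)."""
    n = len(A)
    h = [1.0] * n                                   # line 2: all-ones hubness
    a = [0.0] * n
    for _ in range(iters):                          # line 3: iterate
        # line 4: a = A^T h -- authority sums hubness over incoming edges
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        m = max(a)
        a = [x / m for x in a]                      # line 5: largest entry -> 1
        # line 6: h = A a -- hubness sums authority over outgoing edges
        h = [sum(A[j][k] * a[k] for k in range(n)) for j in range(n)]
        m = max(h)
        h = [x / m for x in h]                      # line 7: largest entry -> 1
    return a, h

# Hypothetical 5-vertex network (vertices 1-5 mapped to indices 0-4):
# 1 -> {2, 3, 4}, 4 -> {2, 3}, 3 -> {5}; vertex 5 has no outgoing edges.
A = [
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
a, h = hits(A)
# Vertices 2 and 3 share the top authority score; vertices 3 and 5 end up
# with (essentially) zero hubness; vertex 1 out-hubs vertex 4.
```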

Table 8.5 shows the convergence of HITS when applying Algorithm 4 over the sample network structure presented in Figure 8.10. We can see that vertices indexed as 3 and 5 are totally useless as hubs. This is not surprising, as a vertex with a good hubness score needs to link to vertices which have high authority scores.


(b) Adjacency matrix of the network.

Figure 8.10: Example network to perform the HITS algorithm on.

Table 8.5: Illustration of the convergence of the HITS algorithm for the example network included in Figure 8.10. Each row corresponds to a vertex from the network.

Even though vertex 3 has an outgoing edge, it links to a vertex which is not authoritative. The case of vertex 5 is even simpler, as it has no outgoing edges at all, hence it is completely incapable of linking to vertices with high authority scores (in particular, it even fails at linking to vertices with *any* non-zero authority score).

Vertex 5 turns out to be a poor authority as well. This is in accordance with the fact that it receives a single incoming edge from vertex 3, which has been assigned a low hubness score.

Although vertex 3 is one of the worst hubs, it is at the same time
one of the vertices with the highest authority score. Indeed, it is
vertices 2 and 3 that obtain the highest authority scores. It is not
surprising at all that vertices 2 and 3 have the same authority scores,
since they have the same incoming edges from vertices 1 and 4. As a
consequence of vertices 2 and 3 ending up highly authoritative,
the vertices that link to them manage to obtain high hubness. Since
the vertices that are directly accessible from vertex 1 are a superset
of those of vertex 4, the hubness score of vertex 1 surpasses that of
vertex 4. As the above example suggests, the **a** and **h** scores of the
HITS algorithm reach an equilibrium after only a few iterations.

*8.5* Further reading

PageRank and its vast number of variants (e.g., Wu et al. 2013, Mihalcea and Tarau 2004, Jeh and Widom 2002) admittedly belong to the most popular approaches of network analysis. As mentioned previously, we can conveniently model a series of real-world processes and phenomena with the help of complex networks, including but not limited to social interactions (Newman 2001, Hasan et al. 2006, Leskovec et al. 2010, Backstrom and Leskovec 2011, Matakos et al. 2017, Tasnádi and Berend 2015), natural language (Erkan and Radev 2004, Mihalcea and Tarau 2004, Toutanova et al. 2004) or biomedicine (Wu et al. 2013, Ji et al. 2015).

Liu [2006] provides a comprehensive overview of web data mining and Manning et al. [2008] offer a thorough description of information retrieval techniques.

*8.6* Summary of the chapter

Many real-world phenomena can be conveniently modeled as a network. As these networks have a potentially enormous number of vertices, efficient ways of handling these datasets are of utmost importance. Efficiency is required both in terms of the storage of the networks and of the algorithms employed over them.

In this chapter we reviewed typical ways networks can be represented, as well as multiple algorithms that assign relevance scores to the individual vertices of a network. Readers are expected to understand the working mechanisms behind these algorithms.

**9 | Clustering**

This chapter provides an introduction to clustering algorithms.

Throughout the chapter, we overview two main paradigms of clustering techniques, i.e., algorithms that perform clustering in a hierarchical and in a partitioning manner. Readers will learn about the task of clustering itself, understand the working mechanisms of these algorithms and develop an awareness of their potential limitations and how to mitigate those.