Data Mining:
Concepts and Techniques
— Chapter 9 —
9.2. Social Network Analysis
Jiawei Han and Micheline Kamber Department of Computer Science
University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgements: Based on the slides by Sangkyum Kim and Chen Chen
Social Network Analysis
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological Systems
Mining on Social Networks
Summary
Society
Nodes: individuals
Links: social relationship (family/work/friendship/etc.)
S. Milgram (1967)
Social networks: Many individuals with
diverse social interactions between them.
John Guare Six Degrees of Separation
Communication networks
The Earth is developing an electronic nervous system, a network with diverse nodes and links:
-computers -routers -satellites
-phone lines -TV cables -EM waves
Communication networks: Many non-identical
components with diverse
connections between them.
Complex systems
Made of many non-identical elements connected by diverse interactions.
“Natural” Networks and Universality
Consider many kinds of networks:
social, technological, business, economic, content,…
These networks tend to share certain informal properties:
large scale; continual growth
distributed, organic growth: vertices “decide” who to link to
interaction restricted to links
mixture of local and long-distance connections
abstract notions of distance: geographical, content, social,…
Do natural networks share more quantitative universals?
What would these “universals” be?
How can we make them precise and measure them?
How can we explain their universality?
This is the domain of social network theory
Sometimes also referred to as link analysis
Some Interesting Quantities
Connected components:
how many, and how large?
Network diameter:
maximum (worst-case) or average?
exclude infinite distances? (disconnected components)
the small-world phenomenon
Clustering:
to what extent do links tend to cluster “locally”?
what is the balance between local and long-distance connections?
what roles do the two types of links play?
Degree distribution:
what is the typical degree in the network?
what is the overall distribution?
A “Canonical” Natural Network has…
Few connected components:
often only 1 or a small number, indep. of network size
Small diameter:
often a constant independent of network size (like 6)
or perhaps growing only logarithmically with network size, or even shrinking?
typically exclude infinite distances
A high degree of clustering:
considerably more so than for a random network
in tension with small diameter
A heavy-tailed degree distribution:
a small but reliable number of high-degree vertices
often of power law form
Probabilistic Models of Networks
All of the network generation models we will study are probabilistic or statistical in nature
They can generate networks of any size
They often have various parameters that can be set:
size of network generated
average degree of a vertex
fraction of long-distance connections
The models generate a distribution over networks
Statements are always statistical in nature:
with high probability, diameter is small
on average, degree distribution has heavy tail
Thus, we’re going to need some basic statistics and
Social Network Analysis
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological Systems
Mining on Social Networks
Summary
Probability and Random Variables
A random variable X is simply a variable that probabilistically assumes values in some set
set of possible values sometimes called the sample space S of X
sample space may be small and simple or large and complex
S = {Heads, Tails}, X is outcome of a coin flip
S = {0,1,…,U.S. population size}, X is number voting democratic
S = all networks of size N, X is generated by preferential attachment
Behavior of X determined by its distribution (or density)
for each value x in S, specify Pr[X = x]
these probabilities sum to exactly 1 (mutually exclusive outcomes)
complex sample spaces (such as large networks):
distribution often defined implicitly by simpler components
might specify the probability that each edge appears independently
Some Basic Notions and Laws
Independence:
let X and Y be random variables
independence: for any x and y, Pr[X = x & Y = y] = Pr[X=x]Pr[Y=y]
intuition: value of X does not influence value of Y, vice-versa
dependence:
e.g. X, Y coin flips, but Y is always opposite of X
Expected (mean) value of X:
only makes sense for numeric random variables
“average” value of X according to its distribution
formally, E[X] = Σ_x Pr[X = x]·x, where the sum is over all x in S
often denoted by μ
always true: E[X + Y] = E[X] + E[Y]
true only for independent random variables: E[XY] = E[X]E[Y]
Variance of X:
Var(X) = E[(X – μ)^2]; often denoted by σ^2
standard deviation is sqrt(Var(X)) = σ
Union bound:
for any X, Y, Pr[X=x or Y=y] <= Pr[X=x] + Pr[Y=y]
Convergence to Expectations
Let X1, X2,…, Xn be:
independent random variables
with the same distribution Pr[X=x]
expectation μ = E[X] and variance σ^2
independent and identically distributed (i.i.d.)
essentially n repeated “trials” of the same experiment
natural to examine r.v. Z = (1/n) Σ_i Xi, where sum is over i=1,…,n
example: number of heads in a sequence of coin flips
example: degree of a vertex in the random graph model
E[Z] = E[X]; what can we say about the distribution of Z?
Central Limit Theorem:
as n becomes large, Z becomes normally distributed
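A quick simulation makes the CLT concrete. This is an illustrative sketch, not part of the slides, using fair coin flips as the i.i.d. trials:

```python
import random
import statistics

def sample_mean(n, trials=2000, seed=0):
    """Z = average of n fair coin flips (1 = heads), repeated `trials` times."""
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(n)) / n
            for _ in range(trials)]

# As n grows, Z concentrates around E[X] = 0.5 and its histogram
# approaches a normal with standard deviation sigma/sqrt(n)
for n in (10, 100, 1000):
    means = sample_mean(n)
    print(n, round(statistics.mean(means), 3), round(statistics.stdev(means), 4))
```

The spread of the sample mean shrinks like σ/sqrt(n), which is exactly what the normal approximation predicts.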
The Normal Distribution
The normal or Gaussian density:
applies to continuous, real-valued random variables
characterized by mean (average) μ and standard deviation σ
density at x is defined as
(1/(σ sqrt(2π))) exp(-(x-μ)^2/(2σ^2))
special case μ = 0, σ = 1: a exp(-x^2/b) for some constants a,b > 0
peaks at x = μ, then dies off exponentially rapidly
the classic “bell-shaped curve”
exam scores, human body temperature,
remarks:
can control mean and standard deviation independently
can make as “broad” as we like, but always have finite variance
The Binomial Distribution
coin with Pr[heads] = p, flip n times
probability of getting exactly k heads:
choose(n,k) p^k (1-p)^(n-k)
for large n and p fixed:
approximated well by a normal with
μ = np, σ = sqrt(np(1-p))
σ/μ → 0 as n grows
leads to strong large deviation bounds
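The normal approximation above is easy to check numerically; a small illustrative sketch (not from the slides):

```python
import math

def binom_pmf(n, k, p):
    """Exact probability of k heads in n flips of a p-biased coin."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu, sigma):
    """Gaussian density with mean mu, standard deviation sigma."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

n, p = 1000, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
# Near the mean, the normal density matches the binomial pmf closely
for k in (480, 500, 520):
    print(k, round(binom_pmf(n, k, p), 5), round(normal_pdf(k, mu, sigma), 5))
print("sigma/mu =", sigma / mu)   # -> 0 as n grows: sharp concentration
```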
The Binomial Distribution
(figure: www.professionalgambler.com/binomial.html)
The Poisson Distribution
like binomial, applies to variables taking integer values >= 0
often used to model counts of events
number of phone calls placed in a given time period
number of times a neuron fires in a given time period
single free parameter λ (the mean)
probability of exactly x events:
exp(-λ) λ^x / x!
mean and variance are both λ
binomial distribution with n large, p = λ/n (λ fixed)
converges to Poisson with mean λ
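The binomial-to-Poisson limit can be verified numerically; an illustrative sketch (the choice λ = 3 is arbitrary):

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability of exactly x events with mean lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def binom_pmf(n, k, p):
    """Exact binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

lam = 3.0
# Binomial(n, lam/n) approaches Poisson(lam) as n grows
for n in (10, 100, 10000):
    p = lam / n
    print("n =", n, [round(binom_pmf(n, k, p), 4) for k in range(5)])
print("Poisson", [round(poisson_pmf(k, lam), 4) for k in range(5)])
```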
The Poisson Distribution
(figure: single photoelectron distribution)
Heavy-tailed Distributions
Pareto or power law distributions:
for variables assuming integer values > 0
probability of value x ~ 1/x^a
typically 0 < a < 2; smaller a gives heavier tail
sometimes also referred to as being scale-free
For binomial, normal, and Poisson distributions the tail probabilities approach 0 exponentially fast
Inverse polynomial decay vs. inverse exponential decay
What kind of phenomena does this distribution model?
What kind of process would generate it?
Distributions vs. Data
All these distributions are idealized models
In practice, we do not see distributions, but data
Thus, there will be some largest value we observe
Also, can be difficult to “eyeball” data and choose model
So how do we distinguish between Poisson, power law, etc?
Typical procedure:
might restrict our attention to a range of values of interest
accumulate counts of observed data into equal-sized bins
look at counts on a log-log plot
note that
power law:
log(Pr[X = x]) = log(1/x^α) = –α log(x)
linear in log(x), slope –α
Normal:
log(Pr[X = x]) = log(a exp(-x^2/b)) = log(a) – x^2/b
non-linear, concave near mean
Poisson:
log(Pr[X = x]) = log(exp(-λ) λ^x/x!) = –λ + x log(λ) – log(x!)
also non-linear in log(x)
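The binning-and-log-log procedure can be sketched in code. Everything below is illustrative (the Pareto sampler, α = 2.5, and the power-of-two bins are assumptions, not from the slides):

```python
import math
import random

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx, ly = [math.log(x) for x in xs], [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return num / sum((a - mx) ** 2 for a in lx)

# Pareto samples with density ~ x^(-alpha) for x >= 1, via inverse transform
alpha = 2.5
rng = random.Random(7)
data = [(1.0 - rng.random()) ** (-1.0 / (alpha - 1)) for _ in range(200000)]

# Accumulate counts into bins and look at them on a log-log scale
bins = [2 ** i for i in range(12)]
centers, heights = [], []
for lo, hi in zip(bins, bins[1:]):
    count = sum(lo <= x < hi for x in data)
    if count > 0:
        centers.append(math.sqrt(lo * hi))
        heights.append(count / (hi - lo))   # normalize by bin width

slope = loglog_slope(centers, heights)
print("fitted log-log slope:", round(slope, 2))   # should be near -alpha
```

A power law shows up as a straight line on this plot; a normal or Poisson would bend downward instead.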
Zipf’s Law
Look at the frequency of English words:
“the” is the most common, followed by “of”, “to”, etc.
claim: frequency of the n-th most common ~ 1/n (power law, α = 1)
General theme:
rank events by their frequency of occurrence
resulting distribution often is a power law!
Other examples:
North America city sizes
personal income
file sizes
genus sizes (number of species)
People seem to dither over exact form of these distributions
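The rank-by-frequency procedure itself takes only a few lines; a toy sketch (a real corpus would be needed to actually observe the ~1/n law):

```python
from collections import Counter

def rank_frequency(tokens):
    """Rank distinct items by frequency: returns [(rank, item, count), ...]."""
    counts = Counter(tokens).most_common()
    return [(rank, item, count) for rank, (item, count) in enumerate(counts, 1)]

# Tiny stand-in for a corpus; plot count vs. rank on log-log axes for real data
text = "the cat and the dog and the bird".split()
for rank, item, count in rank_frequency(text):
    print(rank, item, count)
```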
(figure: the same data plotted on linear and on logarithmic scales; both plots show a Zipf distribution with 300 datapoints)
Social Network Analysis
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological Systems
Mining on Social Networks
Summary
Some Models of Network Generation
Random graphs (Erdös-Rényi models):
give few components and small diameter
do not give high clustering or heavy-tailed degree distributions
the mathematically most well-studied and understood model
Watts-Strogatz models:
give few components, small diameter and high clustering
do not give heavy-tailed degree distributions
Scale-free networks:
give few components, small diameter and heavy-tailed degree distributions
do not give high clustering
Hierarchical networks:
few components, small diameter, high clustering, heavy-tailed
Affiliation networks:
models group-actor formation
Models of Social Network Generation
Random Graphs (Erdös-Rényi models)
Watts-Strogatz models
Scale-free Networks
The Erdös-Rényi (ER) Model (Random Graphs)
All edges are equally probable and appear independently
NW size N > 1 and probability p: distribution G(N,p)
each edge (u,v) chosen to appear with probability p
N(N-1)/2 trials of a biased coin flip
The usual regime of interest is when p ~ 1/N, N is large
e.g. p = 1/2N, p = 1/N, p = 2/N, p=10/N, p = log(N)/N, etc.
in expectation, each vertex will have a “small” number of neighbors
will then examine what happens as N → infinity
can thus study properties of large networks with bounded degree
Degree distribution of a typical G drawn from G(N,p):
draw G according to G(N,p); look at a random vertex u in G
what is Pr[deg(u) = k] for any fixed k?
Poisson distribution with mean λ = p(N-1) ~ pN
Sharply concentrated; not heavy-tailed
Especially easy to generate NWs from G(N,p)
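Generating a draw from G(N,p) really is easy; a minimal sketch (the adjacency-set representation is an arbitrary choice):

```python
import random

def erdos_renyi(N, p, seed=None):
    """Sample from G(N, p): each of the N(N-1)/2 possible edges appears
    independently with probability p."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(N)}
    for u in range(N):
        for v in range(u + 1, N):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

# With p = c/N, the degree of a vertex is Binomial(N-1, p) ~ Poisson(c)
N, c = 2000, 5
G = erdos_renyi(N, c / N, seed=42)
mean_degree = sum(len(G[u]) for u in G) / N
print("mean degree:", mean_degree)   # should be close to c = 5
```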
Erdös-Rényi Model (1960)
Pál Erdős (1913-1996)
- Democratic - Random
Connect each pair of nodes with probability p
Example: N = 10, p = 1/6; mean degree ⟨k⟩ ~ 1.5; Poisson degree distribution
A Closely Related Model
For any fixed m <= N(N-1)/2, define distribution G(N,m):
choose uniformly at random from all graphs with exactly m edges
G(N,m) is “like” G(N,p) with p = m/(N(N-1)/2) ~ 2m/N^2
this intuition can be made precise, and is correct
if m = cN then p = 2c/(N-1) ~ 2c/N
mathematically trickier than G(N,p)
Another Closely Related Model
Graph process model:
start with N vertices and no edges
at each time step, add a new edge
choose new edge randomly from among all missing edges
Allows study of the evolution or emergence of properties:
as the number of edges m grows in relation to N
equivalently, as p is increased
For all of these models:
statements hold with high probability for “almost all” large graphs of a given size and density
Evolution of a Random Network
We have a large number n of vertices
We start randomly adding edges one at a time
At what time t will the network:
have at least one “large” connected component?
have a single connected component?
have “small” diameter?
have a “large” clique?
have a “large” chromatic number?
How gradually or suddenly do these properties appear?
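These evolution questions can be explored by direct simulation; an illustrative sketch using union-find to track components (compare the output against the thresholds m ~ N/2 for a giant component and m ~ (N/2)ln N for connectivity):

```python
import math
import random

class UnionFind:
    """Tracks connected components as edges are added."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.max_size = 1
        self.components = n
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.max_size = max(self.max_size, self.size[ra])
        self.components -= 1

def graph_process(N, seed=0):
    """Add random missing edges one at a time; return the edge counts at
    which a component of size > N/2 appears and the graph becomes connected."""
    rng = random.Random(seed)
    edges = [(u, v) for u in range(N) for v in range(u + 1, N)]
    rng.shuffle(edges)
    uf = UnionFind(N)
    giant_at = connected_at = None
    for m, (u, v) in enumerate(edges, start=1):
        uf.union(u, v)
        if giant_at is None and uf.max_size > N // 2:
            giant_at = m
        if uf.components == 1:
            connected_at = m
            break
    return giant_at, connected_at

N = 400
giant_at, connected_at = graph_process(N, seed=3)
print("giant component at", giant_at, "edges; connected at", connected_at)
print("compare: N/2 =", N // 2, ", (N/2) ln N ~", round(N / 2 * math.log(N)))
```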
Recap
Model G(N,p):
select each of the possible edges independently with prob. p
expected total number of edges is pN(N-1)/2
expected degree of a vertex is p(N-1)
degree will obey a Poisson distribution (not heavy-tailed)
Model G(N,m):
select exactly m of the N(N-1)/2 edges to appear
all sets of m edges equally likely
Graph process model:
starting with no edges, just keep adding one edge at a time
always choose next edge randomly from among all missing edges
Threshold or tipping for (say) connectivity:
fewer than m(N) edges → graph almost certainly not connected
more than m(N) edges → graph almost certainly is connected
Combining and Formalizing Familiar Ideas
Explaining universal behavior through statistical models
our models will always generate many networks
almost all of them will share certain properties (universals)
Explaining tipping through incremental growth
we gradually add edges, or gradually increase edge probability p
many properties will emerge very suddenly during this process
(figure: tipping analogy: crime rate vs. size of police force; probability NW connected vs. number of edges)
Monotone Network Properties
Often interested in monotone graph properties:
let G have the property
add edges to G to obtain G’
then G’ must have the property also
Examples:
G is connected
G has diameter <= d (not exactly d)
G has a clique of size >= k (not exactly k)
G has chromatic number >= c (not exactly c)
G has a matching of size >= m
d, k, c, m may depend on NW size N (How?)
Difficult to study emergence of non-monotone properties as the number of edges is increased
what would it mean?
Formalizing Tipping:
Thresholds for Monotone Properties
Consider Erdos-Renyi G(N,m) model
select m edges at random to include in G
Let P be some monotone property of graphs
P(G) = 1 → G has the property
P(G) = 0 → G does not have the property
Let m(N) be some function of NW size N
formalize idea that property P appears “suddenly” at m(N) edges
Say that m(N) is a threshold function for P if:
let m’(N) be any function of N
look at ratio r(N) = m’(N)/m(N) as N → infinity
if r(N) → 0: probability that P(G) = 1 in G(N,m’(N)) → 0
if r(N) → infinity: probability that P(G) = 1 in G(N,m’(N)) → 1
A purely structural definition of tipping
tipping results from incremental increase in connectivity
So Which Properties Tip?
Just about all of them!
The following properties all have threshold functions:
having a “giant component”
being connected
having a perfect matching (N even)
having “small” diameter
With remarkable consistency (N = 50):
giant component ~ 40 edges, connected ~ 100, small diameter ~ 180
Ever More Precise…
Connected component of size > N/2:
threshold function is m(N) = N/2 (or p ~ 1/N)
note: full connectivity almost certainly impossible at this density (isolated vertices remain)
Fully connected:
threshold function is m(N) = (N/2)log(N) (or p ~ log(N)/N)
NW remains extremely sparse: only ~ log(N) edges per vertex
Small diameter:
threshold is m(N) ~ N^(3/2) for diameter 2 (or p ~ 2/sqrt(N))
fraction of possible edges still ~ 2/sqrt(N) → 0
generate very small worlds
Other Tipping Points?
Perfect matchings
consider only even N
threshold function: m(N) = (N/2)log(N) (or p ~ log(N)/N)
same as for connectivity!
Cliques
k-clique threshold is m(N) = (1/2)N^(2 – 2/(k-1)) (or p ~ N^(-2/(k-1)))
edges appear immediately; triangles at N/2; etc.
Coloring
k colors required just as k-cliques appear
Erdos-Renyi Summary
A model in which all connections are equally likely
each of the N(N-1)/2 edges chosen randomly &
independently
As we add edges, a precise sequence of events unfolds:
graph acquires a giant component
graph becomes connected
graph acquires small diameter
Many properties appear very suddenly (tipping, thresholds)
All statements are mathematically precise
But is this how natural networks form?
If not, which aspects are unrealistic?
maybe all edges are not equally likely!
The Clustering Coefficient of a Network
Let nbr(u) denote the set of neighbors of u in a graph
all vertices v such that the edge (u,v) is in the graph
The clustering coefficient of u:
let k = |nbr(u)| (i.e., number of neighbors of u)
choose(k,2): max possible # of edges between vertices in nbr(u)
c(u) = (actual # of edges between vertices in nbr(u))/choose(k,2)
0 <= c(u) <= 1; measure of cliquishness of u’s neighborhood
Clustering coefficient of a graph:
average of c(u) over all vertices u
Example: vertex u with k = 4 neighbors; choose(k,2) = 6 possible edges among them
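The definitions above translate directly into code; a small illustrative sketch on a toy graph:

```python
from itertools import combinations

def clustering_coefficient(adj, u):
    """c(u): fraction of the choose(k,2) pairs of u's neighbors
    that are themselves connected by an edge."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0
    actual = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
    return actual / (k * (k - 1) // 2)

def graph_clustering(adj):
    """Clustering coefficient of the graph: average of c(u) over all u."""
    return sum(clustering_coefficient(adj, u) for u in adj) / len(adj)

# Toy graph: triangle 0-1-2 plus a pendant vertex 3 attached to 0
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(clustering_coefficient(adj, 0))   # 1 linked pair out of 3 -> 1/3
print(graph_clustering(adj))
```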
Clustering: my friends will likely know each other!
Clustering coefficient: C = (# of links between the n neighbors) / (n(n-1)/2); empirically C » p
Networks are clustered [large C(p)] but have a small characteristic path length [small L(p)].
Network        C         Crand     L          N
WWW            0.1078    0.00023   3.1        153127
Internet       0.18-0.3  0.001     3.7-3.76   3015-6209
Actor          0.79      0.00027   3.65       225226
Coauthorship   0.43      0.00018   5.9        52909
Metabolic      0.32      0.026     2.9        282
Foodweb        0.22      0.06      2.43       134
Erdos-Renyi: Clustering Coefficient
Generate a network G according to G(N,p)
Examine a “typical” vertex u in G
choose u at random among all vertices in G
what do we expect c(u) to be?
Answer: exactly p!
In G(N,m), expect c(u) to be 2m/(N(N-1))
Both cases: c(u) entirely determined by overall density
Baseline for comparison with “more clustered” models
Erdos-Renyi has no bias towards clustered or local edges
Models of Social Network Generation
Random Graphs (Erdös-Rényi models)
Watts-Strogatz models
Scale-free Networks
Caveman and Solaria
Erdos-Renyi:
sharing a common neighbor makes two vertices no more likely to be directly connected than two very “distant” vertices
every edge appears entirely independently of existing structure
But in many settings, the opposite is true:
you tend to meet new friends through your old friends
two web pages pointing to a third might share a topic
two companies selling goods to a third are in related industries
Watts’ Caveman world:
overall density of edges is low
but two vertices with a common neighbor are likely connected
Watts’ Solaria world
overall density of edges low; no special bias towards local edges
Making it (Somewhat) Precise: the α-model
The α-model has the following parameters or “knobs”:
N: size of the network to be generated
k: the average degree of a vertex in the network to be generated
p: the default probability two vertices are connected
α: adjustable parameter dictating bias towards local connections
For any vertices u and v:
define m(u,v) to be the number of common neighbors (so far)
Key quantity: the propensity R(u,v) of u to connect to v
if m(u,v) >= k, R(u,v) = 1 (share too many friends not to connect)
if m(u,v) = 0, R(u,v) = p (no mutual friends → no bias to connect)
else, R(u,v) = p + (m(u,v)/k)^α (1-p)
Generate NW incrementally
using R(u,v) as the edge probability; details omitted
Note: α = infinity is “like” Erdös-Rényi (but not exactly)
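The slides omit the incremental construction details. Below is one plausible realization (an assumption, not the exact Watts procedure): keep connecting random non-adjacent pairs with probability R(u,v) until the average degree reaches k:

```python
import random

def alpha_model(N, k, p, alpha, seed=0):
    """Sketch of the Watts alpha-model. The stopping rule (average degree
    reaches k) and random pair order are assumed, not from the slides."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(N)}

    def R(u, v):
        m = len(adj[u] & adj[v])          # mutual friends so far
        if m >= k:
            return 1.0                    # too many shared friends not to connect
        if m == 0:
            return p                      # no mutual friends -> no bias
        return p + ((m / k) ** alpha) * (1 - p)

    edges, target = 0, N * k // 2         # target = average degree k
    while edges < target:
        u, v = rng.sample(range(N), 2)
        if v not in adj[u] and rng.random() < R(u, v):
            adj[u].add(v)
            adj[v].add(u)
            edges += 1
    return adj

G = alpha_model(N=100, k=6, p=0.01, alpha=1.0, seed=1)
print("edges:", sum(len(s) for s in G.values()) // 2)   # N*k/2 = 300
```

Small α makes (m/k)^α close to 1 even for one shared friend, so edges pile up locally and clustering rises.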
Watts-Strogatz Model
Small Worlds and Occam’s Razor
For small α, should generate large clustering coefficients
we “programmed” the model to do so
Watts claims that proving precise statements is hard…
But we do not want a new model for every little property
Erdos-Renyi → small diameter
α-model → high clustering coefficient
In the interests of Occam’s Razor, we would like to find
a single, simple model of network generation…
… that simultaneously captures many properties
Watts’ small world: small diameter and high clustering
Meanwhile, Back in the Real World…
Watts examines three real networks as case studies:
the Kevin Bacon graph
the Western states power grid
the C. elegans nervous system
For each of these networks, he:
computes its size, diameter, and clustering coefficient
compares diameter and clustering to best Erdos-Renyi approx.
shows that the best -model approximation is better
important to be “fair” to each model by finding best fit
Overall moral:
if we care only about diameter and clustering, the α-model is better
Case 1: Kevin Bacon Graph
Vertices: actors and actresses
Edge between u and v if they appeared in a film together
Is Kevin Bacon the most
connected actor?
NO!
Rank   Name                 Average distance   # of movies   # of links
1 Rod Steiger 2.537527 112 2562
2 Donald Pleasence 2.542376 180 2874
3 Martin Sheen 2.551210 136 3501
4 Christopher Lee 2.552497 201 2993
5 Robert Mitchum 2.557181 136 2905
6 Charlton Heston 2.566284 104 2552
7 Eddie Albert 2.567036 112 3333
8 Robert Vaughn 2.570193 126 2761
9 Donald Sutherland 2.577880 107 2865
10 John Gielgud 2.578980 122 2942
11 Anthony Quinn 2.579750 146 2978
12 James Earl Jones 2.584440 112 3787
…
876    Kevin Bacon          2.786981           46            1811
(figure: Kevin Bacon: 46 movies, 1811 actors, average separation 2.79; compare Rod Steiger #1 and Donald Pleasence #2 vs. Kevin Bacon #876)
Case 2: New York State Power Grid
Vertices: generators and substations
Edges: high-voltage power transmission lines and transformers
Line thickness and color indicate the voltage level
Red 765 kV, 500 kV; brown 345 kV; green 230 kV; grey 138 kV
Case 3: C. Elegans Nervous System
Vertices: neurons in the C. elegans worm
Edges: axons/synapses between neurons
Two More Examples
M. Newman on scientific collaboration networks
coauthorship networks in several distinct communities
differences in degrees (papers per author)
empirical verification of
giant components
small diameter (mean distance)
high clustering coefficient
Alberich et al. on the Marvel Universe
purely fictional social network
two characters linked if they appeared together in an issue
“empirical” verification of
heavy-tailed distribution of degrees (issues and characters)
giant component
rather small clustering coefficient
One More (Structural) Property…
A properly tuned α-model can simultaneously explain
small diameter
high clustering coefficient
But what about heavy-tailed degree distributions?
α-model and simple variants will not explain this
intuitively, no “bias” towards large degree evolves
all vertices are created equal
Can concoct many bad generative models to explain
generate NW according to Erdos-Renyi, reject if tails not heavy
describe fixed NWs with heavy tails
all connected to v1; N/2 connected to v2; etc.
not clear we can get a precise power law
not modeling variation
Models of Social Network Generation
Random Graphs (Erdös-Rényi models)
Watts-Strogatz models
Scale-free Networks
World Wide Web
800 million documents (S. Lawrence, 1999)
Nodes: WWW documents; Links: URL links
ROBOT: collects all URLs found in a document and follows them recursively
⟨k⟩ ~ 6, N_WWW ~ 10^9
Expected result (exponentially bounded tail): P(k=500) ~ 10^-99, so N(k=500) ~ 10^-90
Real result (power law): P_out(k) ~ k^(-γout), P_in(k) ~ k^(-γin)
γout = 2.45, γin = 2.1
P(k=500) ~ 10^-6, so N(k=500) ~ 10^3
World Wide Web
Finite size scaling: create a network of N nodes with P_in(k) and P_out(k)
Average shortest path: ⟨l⟩ = 0.35 + 2.06 log(N)
e.g. l(1,5) = 2 via path [1-2-5]; l(1,7) = 4 via path [1-3-4-6-7]
For the full Web (~800 million pages, nd.edu crawl): ⟨l⟩ ~ 19, i.e. “19 degrees of separation”
R. Albert et al., Nature (1999); based on 800 million webpages [S. Lawrence et al., Nature (1999)]; A. Broder et al. (IBM), WWW9 (2000)
World Wide Web
What does that mean?
Poisson distribution: exponentially bounded tail
Power-law distribution: scale-free
Scale-free Networks
The number of nodes (N) is not fixed
Networks continuously expand by additional new nodes
WWW: addition of new nodes
Citation: publication of new papers
The attachment is not uniform
A node is linked with higher probability to a node that already has a large number of links
WWW: new documents link to well known sites (CNN, Yahoo, Google)
Citation: well-cited papers are more likely to be cited again
Scale-Free Networks
Start with (say) two vertices connected by an edge
For i = 3 to N:
for each 1 <= j < i, d(j) = degree of vertex j so far
let Z = Σ_j d(j) (sum of all degrees so far)
add new vertex i with k edges back to {1, …, i-1}:
i is connected back to j with probability d(j)/Z
Vertices j with high degree are likely to get more links!
“Rich get richer”
Natural model for many processes:
hyperlinks on the web
new business and social contacts
transportation networks
Generates a power law distribution of degrees
exponent depends on value of k
Preferential attachment explains
heavy-tailed degree distributions
small diameter (~log(N), via “hubs”)
Will not generate high clustering coefficient
no bias towards local connectivity, but towards hubs
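Preferential attachment is straightforward to implement; an illustrative sketch (the repeated-endpoints list makes each degree-proportional draw a uniform choice):

```python
import random

def preferential_attachment(N, k, seed=0):
    """Scale-free growth: each new vertex i attaches k edges, choosing
    existing endpoints with probability proportional to current degree.
    `endpoints` holds vertex j once per unit of degree, so a uniform
    draw from it is exactly a degree-proportional draw."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}               # start: two vertices, one edge
    endpoints = [0, 1]
    for i in range(2, N):
        targets = set()
        while len(targets) < min(k, i):  # k distinct neighbors (or all so far)
            targets.add(rng.choice(endpoints))
        adj[i] = set()
        for j in targets:
            adj[i].add(j)
            adj[j].add(i)
            endpoints.extend([i, j])
    return adj

G = preferential_attachment(5000, k=2, seed=4)
degrees = sorted((len(s) for s in G.values()), reverse=True)
# Heavy tail: a few hubs with degree far above the mean of ~2k
print("max degree:", degrees[0], "mean degree:", sum(degrees) / len(degrees))
```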
Case1: Internet Backbone
(Faloutsos, Faloutsos and Faloutsos, 1999)
Nodes: computers, routers Links: physical lines
Case2: Actor Connectivity
Nodes: actors Links: cast jointly
N = 212,250 actors; ⟨k⟩ = 28.78
P(k) ~ k^(-γ), γ = 2.3
(figure examples: Days of Thunder (1990), Far and Away (1992), Eyes Wide Shut (1999))
Case 3: Science Citation Index
Nodes: papers Links: citations
P(k) ~ k^(-γ), γ = 3
(data: 1736 PRL papers (1988); Witten-Sander, PRL 1981)
Case 4: Science Coauthorship
Nodes: scientists (authors) Links: write paper together
(Newman, 2000; H. Jeong et al., 2001)
Case 5: Food Web
Nodes: trophic species Links: trophic interactions
Case 6: Sex-Web
Nodes: people (Females; Males) Links: sexual relationships
Liljeros et al., Nature 2001; 4781 Swedes, ages 18-74, 59% response rate
Robustness of
Random vs. Scale-Free Networks
The accidental failure of a number of nodes in a random network can fracture the system into non-communicating islands.
Scale-free networks are more robust in the face of such failures.
Scale-free networks are highly vulnerable to a coordinated attack against their hubs.
Scale-free networks are highly vulnerable to a coordinated attack against their hubs.
Social Network Analysis
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological Systems
Mining on Social Networks
Summary
Bio-Map: GENOME (protein-gene interactions), PROTEOME (protein-protein interactions), METABOLISM (bio-chemical reactions, e.g. the citrate cycle)
Metabolic Network
Nodes: chemicals (substrates)
Links: bio-chemical reactions
Metabolic Network
Organisms from all three domains of life have scale-free metabolic networks!
Archaea Bacteria Eukaryotes
Metabolic Network
PROTEOME: protein-protein interactions
Protein Network
Nodes: proteins
Links: physical interactions (binding)
Yeast Protein Network
P(k) ~ (k + k0)^(-γ) exp(-(k + k0)/kc)
H. Jeong, S.P. Mason, A.-L. Barabasi, Z.N. Oltvai, Nature 411, 41-42 (2001)
Topology of the Protein Network
Nature 408 307 (2000)
“One way to understand the p53 network is to compare it to the Internet.
The cell, like the Internet, appears to be a ‘scale-free network’.”
p53 Network (mammals)
Social Network Analysis
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological Systems
Mining on Social Networks
Summary
Information on the Social Network
Heterogeneous, multi-relational data represented as a graph or network
Nodes are objects
May have different kinds of objects
Objects have attributes
Objects may have labels or classes
Edges are links
May have different kinds of links
Links may have attributes
Links may be directed, and are not required to be binary
Links represent relationships and interactions between objects
What is New for Link Mining Here
Traditional machine learning and data mining approaches assume:
A random sample of homogeneous objects from a single relation
Real world data sets:
Multi-relational, heterogeneous and semi-structured
Link Mining
Newly emerging research area at the intersection of social network and link analysis, hypertext and Web mining, graph mining, relational learning, and inductive logic programming
A Taxonomy of Common Link Mining Tasks
Object-Related Tasks
Link-based object ranking
Link-based object classification
Object clustering (group detection)
Object identification (entity resolution)
Link-Related Tasks
Link prediction
Graph-Related Tasks
Subgraph discovery
Graph classification
Generative model for graphs
What Is a Link in Link Mining?
Link: relationship among data
Two kinds of linked networks
homogeneous vs. heterogeneous
Homogeneous networks
Single object type and single link type
Single-mode social networks (e.g., friendship)
WWW: a collection of linked Web pages
Heterogeneous networks
Multiple object and link types
Medical network: patients, doctors, disease, contacts, treatments
Bibliographic network: publications, authors, venues
Link-Based Object Ranking (LBR)
LBR: Exploit the link structure of a graph to order or prioritize the set of objects within the graph
Focused on graphs with single object type and single link type
This is a primary focus of link analysis community
Web information analysis
PageRank and Hits are typical LBR approaches
In social network analysis (SNA), LBR is a core analysis task
Objective: rank individuals in terms of “centrality”
Degree centrality vs. eigenvector/power centrality
Rank objects relative to one or more relevant objects in the graph
PageRank: Capturing Page Popularity
(Brin & Page’98) Intuitions
Links are like citations in literature
A page that is cited often can be expected to be more useful in general
PageRank is essentially “citation counting”, but improves over simple counting
Consider “indirect citations” (being cited by a highly cited paper counts a lot…)
Smoothing of citations (every page is assumed to have a non-zero citation count)
PageRank can also be interpreted as random surfing (thus capturing popularity)
The PageRank Algorithm
(Brin & Page’98)
Example: d1 → d3, d4; d2 → d1; d3 → d2; d4 → d1, d2
“Transition matrix” (row i lists where di’s out-links go):
M = [ 0    0    1/2  1/2
      1    0    0    0
      0    1    0    0
      1/2  1/2  0    0 ]
Random surfing model: at any page,
With prob. α, randomly jumping to a page (Iij = 1/N; why? the jump spreads probability uniformly)
With prob. (1 – α), randomly picking a link to follow
p_{t+1}(d_i) = (1 – α) Σ_{d_j ∈ IN(d_i)} m_{ji} p_t(d_j) + α (1/N)
Stationary (“stable”) distribution, so we ignore time:
p(d_i) = Σ_j [α (1/N) + (1 – α) m_{ji}] p(d_j)
Same as: p = (α I + (1 – α) M)^T p, with I_ij = 1/N
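The random-surfing iteration can be sketched directly; illustrative code on the 4-page example (assumes every page has at least one out-link, so dangling pages are not handled):

```python
def pagerank(out_links, alpha=0.15, iters=100):
    """Power iteration for the random-surfing model: with prob. alpha jump
    to a uniformly random page, with prob. 1 - alpha follow a random
    out-link of the current page."""
    pages = sorted(out_links)
    N = len(pages)
    p = {d: 1.0 / N for d in pages}
    for _ in range(iters):
        nxt = {d: alpha / N for d in pages}          # random-jump mass
        for d in pages:
            share = (1 - alpha) * p[d] / len(out_links[d])
            for e in out_links[d]:
                nxt[e] += share                      # link-following mass
        p = nxt
    return p

# The 4-page example: d1 -> d3, d4; d2 -> d1; d3 -> d2; d4 -> d1, d2
out_links = {"d1": ["d3", "d4"], "d2": ["d1"], "d3": ["d2"], "d4": ["d1", "d2"]}
p = pagerank(out_links)
print({d: round(v, 3) for d, v in sorted(p.items())})
```

d1 ends up with the highest score: it is pointed to by both d2 and d4, and d2 (its main citer) is itself well cited.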
HITS: Capturing Authorities & Hubs
(Kleinberg’98)
Intuitions
Pages that are widely cited are good authorities
Pages that cite many other pages are good hubs
The key idea of HITS
Good authorities are cited by good hubs
Good hubs point to good authorities
Iterative reinforcement …
The HITS Algorithm
(Kleinberg’98)
Same example graph: d1 → d3, d4; d2 → d1; d3 → d2; d4 → d1, d2
“Adjacency matrix”:
A = [ 0  0  1  1
      1  0  0  0
      0  1  0  0
      1  1  0  0 ]
h(d_i) = Σ_{d_j ∈ OUT(d_i)} a(d_j)
a(d_i) = Σ_{d_j ∈ IN(d_i)} h(d_j)
In matrix form: h = A a; a = A^T h; hence h = A A^T h and a = A^T A a
Initial values: a = h = 1
Iterate, then normalize: Σ_i a(d_i)^2 = Σ_i h(d_i)^2 = 1
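The HITS iteration on the same 4-page example; an illustrative sketch:

```python
import math

def hits(out_links, iters=50):
    """Iterative HITS: a = A^T h (authorities collect hub weight from their
    in-links), h = A a (hubs collect authority weight from their out-links),
    each vector re-normalized to unit length every round."""
    pages = sorted(out_links)
    a = {d: 1.0 for d in pages}
    h = {d: 1.0 for d in pages}
    for _ in range(iters):
        a = {d: sum(h[e] for e in pages if d in out_links[e]) for d in pages}
        h = {d: sum(a[e] for e in out_links[d]) for d in pages}
        for vec in (a, h):
            norm = math.sqrt(sum(x * x for x in vec.values()))
            for d in vec:
                vec[d] /= norm
    return a, h

# Same 4-page graph as the PageRank example
out_links = {"d1": ["d3", "d4"], "d2": ["d1"], "d3": ["d2"], "d4": ["d1", "d2"]}
a, h = hits(out_links)
print("authorities:", {d: round(v, 3) for d, v in a.items()})
print("hubs:", {d: round(v, 3) for d, v in h.items()})
```

d1 and d2 emerge as the top authorities (both are cited by the strong hub d4), while d4 is the top hub.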
Block-level Link Analysis
(Cai et al. 04)
Most of the existing link analysis algorithms, e.g.
PageRank and HITS, treat a web page as a single node in the web graph
However, in most cases a web page contains multiple semantics, and hence might not be considered an atomic, homogeneous node
Web page is partitioned into blocks using the vision-based page segmentation algorithm
extract page-to-block, block-to-page relationships
Block-level PageRank and Block-level HITS
Link-Based Object Classification (LBC)
Predicting the category of an object based on its
attributes, its links and the attributes of linked objects
Web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc.
Citation: Predict the topic of a paper, based on word occurrence, citations, co-citations
Epidemics: Predict disease type based on characteristics of the patients infected by the disease
Communication: Predict whether a communication contact is by email, phone call or mail
Challenges in Link-Based Classification
Labels of related objects tend to be correlated
Collective classification: Explore such correlations and jointly infer the categorical values associated with the objects in the graph
Ex: Classify related news items in Reuter data sets (Chak’98)
Simply incorporating words from neighboring documents: not helpful
Multi-relational classification is another solution for link-based classification
Group Detection
Cluster the nodes in the graph into groups that share common characteristics
Web: identifying communities
Citation: identifying research communities
Methods
Hierarchical clustering
Blockmodeling of SNA
Spectral graph partitioning
Stochastic blockmodeling
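Of the methods listed, spectral graph partitioning is easy to sketch. The code below is illustrative (the power-iteration shortcut and the toy graph are my own choices): it approximates the Fiedler vector of the graph Laplacian and splits vertices by its sign:

```python
import math

def fiedler_partition(adj, iters=500):
    """Toy spectral partitioning: power iteration on the shifted Laplacian
    c*I - L, with the all-ones eigenvector projected out each step,
    converges to the Fiedler vector; its sign pattern gives two groups."""
    nodes = sorted(adj)
    n = len(nodes)
    idx = {u: i for i, u in enumerate(nodes)}
    deg = [len(adj[u]) for u in nodes]
    c = 2 * max(deg) + 1                  # shift keeps all eigenvalues positive
    v = [float((-1) ** i) for i in range(n)]   # arbitrary non-constant start
    for _ in range(iters):
        # w = (c*I - L) v, where (L v)_u = deg(u) v_u - sum of v over neighbors
        w = [c * v[i] - deg[i] * v[i] + sum(v[idx[x]] for x in adj[nodes[i]])
             for i in range(n)]
        mean = sum(w) / n
        w = [x - mean for x in w]         # project out the constant vector
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return {u: v[idx[u]] >= 0 for u in nodes}

# Two triangles {0,1,2} and {3,4,5} joined by the single bridge edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
part = fiedler_partition(adj)
print(part)   # the two triangles land on opposite sides of the cut
```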
Entity Resolution
Predicting when two objects are the same, based on their attributes and their links
Also known as: deduplication, reference reconciliation, co-reference resolution, object consolidation
Applications
Web: predict when two sites are mirrors of each other
Citation: predicting when two citations are referring to the same paper
Epidemics: predicting when two disease strains are the same
Biology: learning when two names refer to the same protein
Entity Resolution Methods
Earlier viewed as a pair-wise resolution problem: objects resolved based on the similarity of their attributes
Importance of considering links
Coauthor links in bib data, hierarchical links between spatial references, co-occurrence links between name references in documents
Use of links in resolution
Collective entity resolution: one resolution decision affects another if they are linked
Propagating evidence over links in a dependency graph
Probabilistic models of interactions among different entities