Data Mining:
Concepts and Techniques
— Chapter 9 —
9.2. Social Network Analysis
Jiawei Han and Micheline Kamber Department of Computer Science
University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgements: Based on the slides by Sangkyum Kim and Chen Chen
Social Network Analysis
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological Systems
Mining on Social Networks
Summary
Society
Nodes: individuals
Links: social relationship (family/work/friendship/etc.)
S. Milgram (1967)
Social networks: Many individuals with
diverse social interactions between them.
John Guare Six Degrees of Separation
Communication networks
The Earth is developing an electronic nervous system, a network with diverse nodes and links:
-computers -routers -satellites
-phone lines -TV cables -EM waves
Communication networks: Many non-identical
components with diverse
connections between them.
Complex systems
Made of many non-identical elements connected by diverse interactions.
“Natural” Networks and Universality
Consider many kinds of networks:
social, technological, business, economic, content,…
These networks tend to share certain informal properties:
large scale; continual growth
distributed, organic growth: vertices “decide” who to link to
interaction restricted to links
mixture of local and long-distance connections
abstract notions of distance: geographical, content, social,…
Do natural networks share more quantitative universals?
What would these “universals” be?
How can we make them precise and measure them?
How can we explain their universality?
This is the domain of social network theory
Sometimes also referred to as link analysis
Some Interesting Quantities
Connected components:
how many, and how large?
Network diameter:
maximum (worst-case) or average?
exclude infinite distances? (disconnected components)
the small-world phenomenon
Clustering:
to what extent do links tend to cluster “locally”?
what is the balance between local and long-distance connections?
what roles do the two types of links play?
Degree distribution:
what is the typical degree in the network?
what is the overall distribution?
A “Canonical” Natural Network has…
Few connected components:
often only 1 or a small number, indep. of network size
Small diameter:
often a constant independent of network size (like 6)
or perhaps growing only logarithmically with network size, or even shrinking?
typically exclude infinite distances
A high degree of clustering:
considerably more so than for a random network
in tension with small diameter
A heavy-tailed degree distribution:
a small but reliable number of high-degree vertices
often of power law form
Probabilistic Models of Networks
All of the network generation models we will study are probabilistic or statistical in nature
They can generate networks of any size
They often have various parameters that can be set:
size of network generated
average degree of a vertex
fraction of long-distance connections
The models generate a distribution over networks
Statements are always statistical in nature:
with high probability, diameter is small
on average, degree distribution has heavy tail
Thus, we’re going to need some basic statistics and
Social Network Analysis
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological Systems
Mining on Social Networks
Summary
Probability and Random Variables
A random variable X is simply a variable that probabilistically assumes values in some set
set of possible values sometimes called the sample space S of X
sample space may be small and simple or large and complex
S = {Heads, Tails}, X is outcome of a coin flip
S = {0,1,…,U.S. population size}, X is number voting democratic
S = all networks of size N, X is generated by preferential attachment
Behavior of X determined by its distribution (or density)
for each value x in S, specify Pr[X = x]
these probabilities sum to exactly 1 (mutually exclusive outcomes)
complex sample spaces (such as large networks):
distribution often defined implicitly by simpler components
might specify the probability that each edge appears independently
Some Basic Notions and Laws
Independence:
let X and Y be random variables
independence: for any x and y, Pr[X = x & Y = y] = Pr[X=x]Pr[Y=y]
intuition: value of X does not influence value of Y, vice-versa
dependence:
e.g. X, Y coin flips, but Y is always opposite of X
Expected (mean) value of X:
only makes sense for numeric random variables
“average” value of X according to its distribution
formally, E[X] = Σ_x Pr[X = x]·x, where the sum is over all x in S
often denoted by μ
always true: E[X + Y] = E[X] + E[Y]
true only for independent random variables: E[XY] = E[X]E[Y]
Variance of X:
Var(X) = E[(X – μ)^2]; often denoted by σ^2
standard deviation is sqrt(Var(X)) = σ
Union bound:
for any X, Y, Pr[X=x or Y=y] <= Pr[X=x] + Pr[Y=y]
Convergence to Expectations
Let X1, X2,…, Xn be:
independent random variables
with the same distribution Pr[X=x]
expectation μ = E[X] and variance σ^2
independent and identically distributed (i.i.d.)
essentially n repeated “trials” of the same experiment
natural to examine r.v. Z = (1/n) Σ_i Xi, where sum is over i=1,…,n
example: number of heads in a sequence of coin flips
example: degree of a vertex in the random graph model
E[Z] = E[X]; what can we say about the distribution of Z?
Central Limit Theorem:
as n becomes large, Z becomes normally distributed
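A quick simulation makes the CLT concrete. This is an illustrative sketch, not part of the slides, using fair coin flips as the i.i.d. trials:

```python
import random
import statistics

def sample_mean(n, trials=2000, seed=0):
    """Z = average of n fair coin flips (1 = heads), repeated `trials` times."""
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(n)) / n
            for _ in range(trials)]

# As n grows, Z concentrates around E[X] = 0.5 and its histogram
# approaches a normal with standard deviation sigma/sqrt(n)
for n in (10, 100, 1000):
    means = sample_mean(n)
    print(n, round(statistics.mean(means), 3), round(statistics.stdev(means), 4))
```

The spread of the sample mean shrinks like σ/sqrt(n), which is exactly what the normal approximation predicts.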
The Normal Distribution
The normal or Gaussian density:
applies to continuous, real-valued random variables
characterized by mean (average) μ and standard deviation σ
density at x is defined as
(1/(σ sqrt(2π))) exp(-(x-μ)^2/(2σ^2))
special case μ = 0, σ = 1: a exp(-x^2/b) for some constants a,b > 0
peaks at x = μ, then dies off exponentially rapidly
the classic “bell-shaped curve”
exam scores, human body temperature,
remarks:
can control mean and standard deviation independently
can make as “broad” as we like, but always have finite variance
The Binomial Distribution
coin with Pr[heads] = p, flip n times
probability of getting exactly k heads:
choose(n,k) p^k (1-p)^(n-k)
for large n and p fixed:
approximated well by a normal with
μ = np, σ = sqrt(np(1-p))
σ/μ → 0 as n grows
leads to strong large deviation bounds
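The normal approximation above is easy to check numerically; a small illustrative sketch (not from the slides):

```python
import math

def binom_pmf(n, k, p):
    """Exact probability of k heads in n flips of a p-biased coin."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu, sigma):
    """Gaussian density with mean mu, standard deviation sigma."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

n, p = 1000, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
# Near the mean, the normal density matches the binomial pmf closely
for k in (480, 500, 520):
    print(k, round(binom_pmf(n, k, p), 5), round(normal_pdf(k, mu, sigma), 5))
print("sigma/mu =", sigma / mu)   # -> 0 as n grows: sharp concentration
```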
The Binomial Distribution
(figure: www.professionalgambler.com/binomial.html)
The Poisson Distribution
like binomial, applies to variables taking integer values >= 0
often used to model counts of events
number of phone calls placed in a given time period
number of times a neuron fires in a given time period
single free parameter λ (the mean)
probability of exactly x events:
exp(-λ) λ^x / x!
mean and variance are both λ
binomial distribution with n large, p = λ/n (λ fixed)
converges to Poisson with mean λ
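The binomial-to-Poisson limit can be verified numerically; an illustrative sketch (the choice λ = 3 is arbitrary):

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability of exactly x events with mean lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def binom_pmf(n, k, p):
    """Exact binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

lam = 3.0
# Binomial(n, lam/n) approaches Poisson(lam) as n grows
for n in (10, 100, 10000):
    p = lam / n
    print("n =", n, [round(binom_pmf(n, k, p), 4) for k in range(5)])
print("Poisson", [round(poisson_pmf(k, lam), 4) for k in range(5)])
```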
The Poisson Distribution
(figure: single photoelectron distribution)
Heavy-tailed Distributions
Pareto or power law distributions:
for variables assuming integer values > 0
probability of value x ~ 1/x^a
typically 0 < a < 2; smaller a gives heavier tail
sometimes also referred to as being scale-free
For binomial, normal, and Poisson distributions the tail probabilities approach 0 exponentially fast
Inverse polynomial decay vs. inverse exponential decay
What kind of phenomena does this distribution model?
What kind of process would generate it?
Distributions vs. Data
All these distributions are idealized models
In practice, we do not see distributions, but data
Thus, there will be some largest value we observe
Also, can be difficult to “eyeball” data and choose model
So how do we distinguish between Poisson, power law, etc?
Typical procedure:
might restrict our attention to a range of values of interest
accumulate counts of observed data into equal-sized bins
look at counts on a log-log plot
note that
power law:
log(Pr[X = x]) = log(1/x^α) = –α log(x)
linear in log(x), slope –α
Normal:
log(Pr[X = x]) = log(a exp(-x^2/b)) = log(a) – x^2/b
non-linear, concave near mean
Poisson:
log(Pr[X = x]) = log(exp(-λ) λ^x/x!) = –λ + x log(λ) – log(x!)
also non-linear in log(x)
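The binning-and-log-log procedure can be sketched in code. Everything below is illustrative (the Pareto sampler, α = 2.5, and the power-of-two bins are assumptions, not from the slides):

```python
import math
import random

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx, ly = [math.log(x) for x in xs], [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return num / sum((a - mx) ** 2 for a in lx)

# Pareto samples with density ~ x^(-alpha) for x >= 1, via inverse transform
alpha = 2.5
rng = random.Random(7)
data = [(1.0 - rng.random()) ** (-1.0 / (alpha - 1)) for _ in range(200000)]

# Accumulate counts into bins and look at them on a log-log scale
bins = [2 ** i for i in range(12)]
centers, heights = [], []
for lo, hi in zip(bins, bins[1:]):
    count = sum(lo <= x < hi for x in data)
    if count > 0:
        centers.append(math.sqrt(lo * hi))
        heights.append(count / (hi - lo))   # normalize by bin width

slope = loglog_slope(centers, heights)
print("fitted log-log slope:", round(slope, 2))   # should be near -alpha
```

A power law shows up as a straight line on this plot; a normal or Poisson would bend downward instead.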
Zipf’s Law
Look at the frequency of English words:
“the” is the most common, followed by “of”, “to”, etc.
claim: frequency of the n-th most common ~ 1/n (power law, α = 1)
General theme:
rank events by their frequency of occurrence
resulting distribution often is a power law!
Other examples:
North America city sizes
personal income
file sizes
genus sizes (number of species)
People seem to dither over exact form of these distributions
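The rank-by-frequency procedure itself takes only a few lines; a toy sketch (a real corpus would be needed to actually observe the ~1/n law):

```python
from collections import Counter

def rank_frequency(tokens):
    """Rank distinct items by frequency: returns [(rank, item, count), ...]."""
    counts = Counter(tokens).most_common()
    return [(rank, item, count) for rank, (item, count) in enumerate(counts, 1)]

# Tiny stand-in for a corpus; plot count vs. rank on log-log axes for real data
text = "the cat and the dog and the bird".split()
for rank, item, count in rank_frequency(text):
    print(rank, item, count)
```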
(figure: the same data plotted on linear and on logarithmic scales; both plots show a Zipf distribution with 300 datapoints)
Social Network Analysis
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological Systems
Mining on Social Networks
Summary
Some Models of Network Generation
Random graphs (Erdös-Rényi models):
give few components and small diameter
do not give high clustering or heavy-tailed degree distributions
the mathematically most well-studied and understood model
Watts-Strogatz models:
give few components, small diameter and high clustering
do not give heavy-tailed degree distributions
Scale-free networks:
give few components, small diameter and heavy-tailed degree distributions
do not give high clustering
Hierarchical networks:
few components, small diameter, high clustering, heavy-tailed
Affiliation networks:
models group-actor formation
Models of Social Network Generation
Random Graphs (Erdös-Rényi models)
Watts-Strogatz models
Scale-free Networks
The Erdös-Rényi (ER) Model (Random Graphs)
All edges are equally probable and appear independently
NW size N > 1 and probability p: distribution G(N,p)
each edge (u,v) chosen to appear with probability p
N(N-1)/2 trials of a biased coin flip
The usual regime of interest is when p ~ 1/N, N is large
e.g. p = 1/2N, p = 1/N, p = 2/N, p=10/N, p = log(N)/N, etc.
in expectation, each vertex will have a “small” number of neighbors
will then examine what happens as N → infinity
can thus study properties of large networks with bounded degree
Degree distribution of a typical G drawn from G(N,p):
draw G according to G(N,p); look at a random vertex u in G
what is Pr[deg(u) = k] for any fixed k?
Poisson distribution with mean λ = p(N-1) ~ pN
Sharply concentrated; not heavy-tailed
Especially easy to generate NWs from G(N,p)
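Generating a draw from G(N,p) really is easy; a minimal sketch (the adjacency-set representation is an arbitrary choice):

```python
import random

def erdos_renyi(N, p, seed=None):
    """Sample from G(N, p): each of the N(N-1)/2 possible edges appears
    independently with probability p."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(N)}
    for u in range(N):
        for v in range(u + 1, N):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

# With p = c/N, the degree of a vertex is Binomial(N-1, p) ~ Poisson(c)
N, c = 2000, 5
G = erdos_renyi(N, c / N, seed=42)
mean_degree = sum(len(G[u]) for u in G) / N
print("mean degree:", mean_degree)   # should be close to c = 5
```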
Erdös-Rényi Model (1960)
Pál Erdős (1913-1996)
- Democratic - Random
Connect each pair of nodes with probability p
Example: N = 10, p = 1/6; mean degree ⟨k⟩ ~ 1.5; Poisson degree distribution
A Closely Related Model
For any fixed m <= N(N-1)/2, define distribution G(N,m):
choose uniformly at random from all graphs with exactly m edges
G(N,m) is “like” G(N,p) with p = m/(N(N-1)/2) ~ 2m/N^2
this intuition can be made precise, and is correct
if m = cN then p = 2c/(N-1) ~ 2c/N
mathematically trickier than G(N,p)
Another Closely Related Model
Graph process model:
start with N vertices and no edges
at each time step, add a new edge
choose new edge randomly from among all missing edges
Allows study of the evolution or emergence of properties:
as the number of edges m grows in relation to N
equivalently, as p is increased
For all of these models:
statements hold with high probability for “almost all” large graphs of a given size and density
Evolution of a Random Network
We have a large number n of vertices
We start randomly adding edges one at a time
At what time t will the network:
have at least one “large” connected component?
have a single connected component?
have “small” diameter?
have a “large” clique?
have a “large” chromatic number?
How gradually or suddenly do these properties appear?
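These evolution questions can be explored by direct simulation; an illustrative sketch using union-find to track components (compare the output against the thresholds m ~ N/2 for a giant component and m ~ (N/2)ln N for connectivity):

```python
import math
import random

class UnionFind:
    """Tracks connected components as edges are added."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.max_size = 1
        self.components = n
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.max_size = max(self.max_size, self.size[ra])
        self.components -= 1

def graph_process(N, seed=0):
    """Add random missing edges one at a time; return the edge counts at
    which a component of size > N/2 appears and the graph becomes connected."""
    rng = random.Random(seed)
    edges = [(u, v) for u in range(N) for v in range(u + 1, N)]
    rng.shuffle(edges)
    uf = UnionFind(N)
    giant_at = connected_at = None
    for m, (u, v) in enumerate(edges, start=1):
        uf.union(u, v)
        if giant_at is None and uf.max_size > N // 2:
            giant_at = m
        if uf.components == 1:
            connected_at = m
            break
    return giant_at, connected_at

N = 400
giant_at, connected_at = graph_process(N, seed=3)
print("giant component at", giant_at, "edges; connected at", connected_at)
print("compare: N/2 =", N // 2, ", (N/2) ln N ~", round(N / 2 * math.log(N)))
```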
Recap
Model G(N,p):
select each of the possible edges independently with prob. p
expected total number of edges is pN(N-1)/2
expected degree of a vertex is p(N-1)
degree will obey a Poisson distribution (not heavy-tailed)
Model G(N,m):
select exactly m of the N(N-1)/2 edges to appear
all sets of m edges equally likely
Graph process model:
starting with no edges, just keep adding one edge at a time
always choose next edge randomly from among all missing edges
Threshold or tipping for (say) connectivity:
fewer than m(N) edges → graph almost certainly not connected
more than m(N) edges → graph almost certainly is connected
Combining and Formalizing Familiar Ideas
Explaining universal behavior through statistical models
our models will always generate many networks
almost all of them will share certain properties (universals)
Explaining tipping through incremental growth
we gradually add edges, or gradually increase edge probability p
many properties will emerge very suddenly during this process
(figure: tipping analogy: crime rate vs. size of police force; probability NW connected vs. number of edges)
Monotone Network Properties
Often interested in monotone graph properties:
let G have the property
add edges to G to obtain G’
then G’ must have the property also
Examples:
G is connected
G has diameter <= d (not exactly d)
G has a clique of size >= k (not exactly k)
G has chromatic number >= c (not exactly c)
G has a matching of size >= m
d, k, c, m may depend on NW size N (How?)
Difficult to study emergence of non-monotone properties as the number of edges is increased
what would it mean?
Formalizing Tipping:
Thresholds for Monotone Properties
Consider Erdos-Renyi G(N,m) model
select m edges at random to include in G
Let P be some monotone property of graphs
P(G) = 1 → G has the property
P(G) = 0 → G does not have the property
Let m(N) be some function of NW size N
formalize idea that property P appears “suddenly” at m(N) edges
Say that m(N) is a threshold function for P if:
let m’(N) be any function of N
look at ratio r(N) = m’(N)/m(N) as N → infinity
if r(N) → 0: probability that P(G) = 1 in G(N,m’(N)) → 0
if r(N) → infinity: probability that P(G) = 1 in G(N,m’(N)) → 1
A purely structural definition of tipping
tipping results from incremental increase in connectivity
So Which Properties Tip?
Just about all of them!
The following properties all have threshold functions:
having a “giant component”
being connected
having a perfect matching (N even)
having “small” diameter
With remarkable consistency (N = 50):
giant component ~ 40 edges, connected ~ 100, small diameter ~ 180
Ever More Precise…
Connected component of size > N/2:
threshold function is m(N) = N/2 (or p ~ 1/N)
note: full connectivity almost certainly impossible at this density (isolated vertices remain)
Fully connected:
threshold function is m(N) = (N/2)log(N) (or p ~ log(N)/N)
NW remains extremely sparse: only ~ log(N) edges per vertex
Small diameter:
threshold is m(N) ~ N^(3/2) for diameter 2 (or p ~ 2/sqrt(N))
fraction of possible edges still ~ 2/sqrt(N) → 0
generate very small worlds
Other Tipping Points?
Perfect matchings
consider only even N
threshold function: m(N) = (N/2)log(N) (or p ~ log(N)/N)
same as for connectivity!
Cliques
k-clique threshold is m(N) = (1/2)N^(2 – 2/(k-1)) (or p ~ N^(-2/(k-1)))
edges appear immediately; triangles at N/2; etc.
Coloring
k colors required just as k-cliques appear
Erdos-Renyi Summary
A model in which all connections are equally likely
each of the N(N-1)/2 edges chosen randomly &
independently
As we add edges, a precise sequence of events unfolds:
graph acquires a giant component
graph becomes connected
graph acquires small diameter
Many properties appear very suddenly (tipping, thresholds)
All statements are mathematically precise
But is this how natural networks form?
If not, which aspects are unrealistic?
maybe all edges are not equally likely!
The Clustering Coefficient of a Network
Let nbr(u) denote the set of neighbors of u in a graph
all vertices v such that the edge (u,v) is in the graph
The clustering coefficient of u:
let k = |nbr(u)| (i.e., number of neighbors of u)
choose(k,2): max possible # of edges between vertices in nbr(u)
c(u) = (actual # of edges between vertices in nbr(u))/choose(k,2)
0 <= c(u) <= 1; measure of cliquishness of u’s neighborhood
Clustering coefficient of a graph:
average of c(u) over all vertices u
Example: vertex u with k = 4 neighbors; choose(k,2) = 6 possible edges among them
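The definitions above translate directly into code; a small illustrative sketch on a toy graph:

```python
from itertools import combinations

def clustering_coefficient(adj, u):
    """c(u): fraction of the choose(k,2) pairs of u's neighbors
    that are themselves connected by an edge."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0
    actual = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
    return actual / (k * (k - 1) // 2)

def graph_clustering(adj):
    """Clustering coefficient of the graph: average of c(u) over all u."""
    return sum(clustering_coefficient(adj, u) for u in adj) / len(adj)

# Toy graph: triangle 0-1-2 plus a pendant vertex 3 attached to 0
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(clustering_coefficient(adj, 0))   # 1 linked pair out of 3 -> 1/3
print(graph_clustering(adj))
```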
Clustering: my friends will likely know each other!
Clustering coefficient: C = (# of links between the n neighbors) / (n(n-1)/2); empirically C » p
Networks are clustered [large C(p)] but have a small characteristic path length [small L(p)].
Network        C         Crand     L          N
WWW            0.1078    0.00023   3.1        153127
Internet       0.18-0.3  0.001     3.7-3.76   3015-6209
Actor          0.79      0.00027   3.65       225226
Coauthorship   0.43      0.00018   5.9        52909
Metabolic      0.32      0.026     2.9        282
Foodweb        0.22      0.06      2.43       134
Erdos-Renyi: Clustering Coefficient
Generate a network G according to G(N,p)
Examine a “typical” vertex u in G
choose u at random among all vertices in G
what do we expect c(u) to be?
Answer: exactly p!
In G(N,m), expect c(u) to be 2m/(N(N-1))
Both cases: c(u) entirely determined by overall density
Baseline for comparison with “more clustered” models
Erdos-Renyi has no bias towards clustered or local edges
Models of Social Network Generation
Random Graphs (Erdös-Rényi models)
Watts-Strogatz models
Scale-free Networks
Caveman and Solaria
Erdos-Renyi:
sharing a common neighbor makes two vertices no more likely to be directly connected than two very “distant” vertices
every edge appears entirely independently of existing structure
But in many settings, the opposite is true:
you tend to meet new friends through your old friends
two web pages pointing to a third might share a topic
two companies selling goods to a third are in related industries
Watts’ Caveman world:
overall density of edges is low
but two vertices with a common neighbor are likely connected
Watts’ Solaria world
overall density of edges low; no special bias towards local edges
Making it (Somewhat) Precise: the α-model
The α-model has the following parameters or “knobs”:
N: size of the network to be generated
k: the average degree of a vertex in the network to be generated
p: the default probability two vertices are connected
α: adjustable parameter dictating bias towards local connections
For any vertices u and v:
define m(u,v) to be the number of common neighbors (so far)
Key quantity: the propensity R(u,v) of u to connect to v
if m(u,v) >= k, R(u,v) = 1 (share too many friends not to connect)
if m(u,v) = 0, R(u,v) = p (no mutual friends → no bias to connect)
else, R(u,v) = p + (m(u,v)/k)^α (1-p)
Generate NW incrementally
using R(u,v) as the edge probability; details omitted
Note: α = infinity is “like” Erdös-Rényi (but not exactly)
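The slides omit the incremental construction details. Below is one plausible realization (an assumption, not the exact Watts procedure): keep connecting random non-adjacent pairs with probability R(u,v) until the average degree reaches k:

```python
import random

def alpha_model(N, k, p, alpha, seed=0):
    """Sketch of the Watts alpha-model. The stopping rule (average degree
    reaches k) and random pair order are assumed, not from the slides."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(N)}

    def R(u, v):
        m = len(adj[u] & adj[v])          # mutual friends so far
        if m >= k:
            return 1.0                    # too many shared friends not to connect
        if m == 0:
            return p                      # no mutual friends -> no bias
        return p + ((m / k) ** alpha) * (1 - p)

    edges, target = 0, N * k // 2         # target = average degree k
    while edges < target:
        u, v = rng.sample(range(N), 2)
        if v not in adj[u] and rng.random() < R(u, v):
            adj[u].add(v)
            adj[v].add(u)
            edges += 1
    return adj

G = alpha_model(N=100, k=6, p=0.01, alpha=1.0, seed=1)
print("edges:", sum(len(s) for s in G.values()) // 2)   # N*k/2 = 300
```

Small α makes (m/k)^α close to 1 even for one shared friend, so edges pile up locally and clustering rises.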
Watts-Strogatz Model
Small Worlds and Occam’s Razor
For small α, should generate large clustering coefficients
we “programmed” the model to do so
Watts claims that proving precise statements is hard…
But we do not want a new model for every little property
Erdos-Renyi → small diameter
α-model → high clustering coefficient
In the interests of Occam’s Razor, we would like to find
a single, simple model of network generation…
… that simultaneously captures many properties
Watts’ small world: small diameter and high clustering
Meanwhile, Back in the Real World…
Watts examines three real networks as case studies:
the Kevin Bacon graph
the Western states power grid
the C. elegans nervous system
For each of these networks, he:
computes its size, diameter, and clustering coefficient
compares diameter and clustering to best Erdos-Renyi approx.
shows that the best -model approximation is better
important to be “fair” to each model by finding best fit
Overall moral:
if we care only about diameter and clustering, the α-model is better
Case 1: Kevin Bacon Graph
Vertices: actors and actresses
Edge between u and v if they appeared in a film together
Is Kevin Bacon the most
connected actor?
NO!
Rank   Name                 Average distance   # of movies   # of links
1 Rod Steiger 2.537527 112 2562
2 Donald Pleasence 2.542376 180 2874
3 Martin Sheen 2.551210 136 3501
4 Christopher Lee 2.552497 201 2993
5 Robert Mitchum 2.557181 136 2905
6 Charlton Heston 2.566284 104 2552
7 Eddie Albert 2.567036 112 3333
8 Robert Vaughn 2.570193 126 2761
9 Donald Sutherland 2.577880 107 2865
10 John Gielgud 2.578980 122 2942
11 Anthony Quinn 2.579750 146 2978
12 James Earl Jones 2.584440 112 3787
…
876    Kevin Bacon          2.786981           46            1811
(figure: Kevin Bacon: 46 movies, 1811 actors, average separation 2.79; compare Rod Steiger #1 and Donald Pleasence #2 vs. Kevin Bacon #876)
Case 2: New York State Power Grid
Vertices: generators and substations
Edges: high-voltage power transmission lines and transformers
Line thickness and color indicate the voltage level
Red 765 kV, 500 kV; brown 345 kV; green 230 kV; grey 138 kV
Case 3: C. Elegans Nervous System
Vertices: neurons in the C. elegans worm
Edges: axons/synapses between neurons
Two More Examples
M. Newman on scientific collaboration networks
coauthorship networks in several distinct communities
differences in degrees (papers per author)
empirical verification of
giant components
small diameter (mean distance)
high clustering coefficient
Alberich et al. on the Marvel Universe
purely fictional social network
two characters linked if they appeared together in an issue
“empirical” verification of
heavy-tailed distribution of degrees (issues and characters)
giant component
rather small clustering coefficient
One More (Structural) Property…
A properly tuned α-model can simultaneously explain
small diameter
high clustering coefficient
But what about heavy-tailed degree distributions?
α-model and simple variants will not explain this
intuitively, no “bias” towards large degree evolves
all vertices are created equal
Can concoct many bad generative models to explain
generate NW according to Erdos-Renyi, reject if tails not heavy
describe fixed NWs with heavy tails
all connected to v1; N/2 connected to v2; etc.
not clear we can get a precise power law
not modeling variation
Models of Social Network Generation
Random Graphs (Erdös-Rényi models)
Watts-Strogatz models
Scale-free Networks
World Wide Web
800 million documents (S. Lawrence, 1999)
Nodes: WWW documents; Links: URL links
ROBOT: collects all URLs found in a document and follows them recursively
⟨k⟩ ~ 6, N_WWW ~ 10^9
Expected result (exponentially bounded tail): P(k=500) ~ 10^-99, so N(k=500) ~ 10^-90
Real result (power law): P_out(k) ~ k^(-γout), P_in(k) ~ k^(-γin)
γout = 2.45, γin = 2.1
P(k=500) ~ 10^-6, so N(k=500) ~ 10^3
World Wide Web
Finite size scaling: create a network of N nodes with P_in(k) and P_out(k)
Average shortest path: ⟨l⟩ = 0.35 + 2.06 log(N)
e.g. l(1,5) = 2 via path [1-2-5]; l(1,7) = 4 via path [1-3-4-6-7]
For the full Web (~800 million pages, nd.edu crawl): ⟨l⟩ ~ 19, i.e. “19 degrees of separation”
R. Albert et al., Nature (1999); based on 800 million webpages [S. Lawrence et al., Nature (1999)]; A. Broder et al. (IBM), WWW9 (2000)
World Wide Web
What does that mean?
Poisson distribution: exponentially bounded tail
Power-law distribution: scale-free
Scale-free Networks
The number of nodes (N) is not fixed
Networks continuously expand by additional new nodes
WWW: addition of new nodes
Citation: publication of new papers
The attachment is not uniform
A node is linked with higher probability to a node that already has a large number of links
WWW: new documents link to well known sites (CNN, Yahoo, Google)
Citation: well-cited papers are more likely to be cited again
Scale-Free Networks
Start with (say) two vertices connected by an edge
For i = 3 to N:
for each 1 <= j < i, d(j) = degree of vertex j so far
let Z = Σ_j d(j) (sum of all degrees so far)
add new vertex i with k edges back to {1, …, i-1}:
i is connected back to j with probability d(j)/Z
Vertices j with high degree are likely to get more links!
“Rich get richer”
Natural model for many processes:
hyperlinks on the web
new business and social contacts
transportation networks
Generates a power law distribution of degrees
exponent depends on value of k
Preferential attachment explains
heavy-tailed degree distributions
small diameter (~log(N), via “hubs”)
Will not generate high clustering coefficient
no bias towards local connectivity, but towards hubs
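Preferential attachment is straightforward to implement; an illustrative sketch (the repeated-endpoints list makes each degree-proportional draw a uniform choice):

```python
import random

def preferential_attachment(N, k, seed=0):
    """Scale-free growth: each new vertex i attaches k edges, choosing
    existing endpoints with probability proportional to current degree.
    `endpoints` holds vertex j once per unit of degree, so a uniform
    draw from it is exactly a degree-proportional draw."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}               # start: two vertices, one edge
    endpoints = [0, 1]
    for i in range(2, N):
        targets = set()
        while len(targets) < min(k, i):  # k distinct neighbors (or all so far)
            targets.add(rng.choice(endpoints))
        adj[i] = set()
        for j in targets:
            adj[i].add(j)
            adj[j].add(i)
            endpoints.extend([i, j])
    return adj

G = preferential_attachment(5000, k=2, seed=4)
degrees = sorted((len(s) for s in G.values()), reverse=True)
# Heavy tail: a few hubs with degree far above the mean of ~2k
print("max degree:", degrees[0], "mean degree:", sum(degrees) / len(degrees))
```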
Case1: Internet Backbone
(Faloutsos, Faloutsos and Faloutsos, 1999)
Nodes: computers, routers Links: physical lines
Case2: Actor Connectivity
Nodes: actors Links: cast jointly
N = 212,250 actors; ⟨k⟩ = 28.78
P(k) ~ k^(-γ), γ = 2.3
(figure examples: Days of Thunder (1990), Far and Away (1992), Eyes Wide Shut (1999))
Case 3: Science Citation Index
Nodes: papers Links: citations
P(k) ~ k^(-γ), γ = 3
(data: 1736 PRL papers (1988); Witten-Sander, PRL 1981)
Case 4: Science Coauthorship
Nodes: scientists (authors) Links: write paper together
(Newman, 2000; H. Jeong et al., 2001)
Case 5: Food Web
Nodes: trophic species Links: trophic interactions
Case 6: Sex-Web
Nodes: people (Females; Males) Links: sexual relationships
Liljeros et al., Nature 2001; 4781 Swedes, ages 18-74, 59% response rate
Robustness of
Random vs. Scale-Free Networks
The accidental failure of a number of nodes in a random network can fracture the system into non-communicating islands.
Scale-free networks are more robust in the face of such failures.
Scale-free networks are highly vulnerable to a coordinated attack against their hubs.
Scale-free networks are highly vulnerable to a coordinated attack against their hubs.
Social Network Analysis
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological Systems
Mining on Social Networks
Summary
Bio-Map: GENOME (protein-gene interactions), PROTEOME (protein-protein interactions), METABOLISM (bio-chemical reactions, e.g. the citrate cycle)
Metabolic Network
Nodes: chemicals (substrates)
Links: bio-chemical reactions
Metabolic Network
Organisms from all three domains of life have scale-free metabolic networks!
Archaea Bacteria Eukaryotes
Metabolic Network
PROTEOME: protein-protein interactions
Protein Network
Nodes: proteins
Links: physical interactions (binding)
Yeast Protein Network
P(k) ~ (k + k0)^(-γ) exp(-(k + k0)/kc)
H. Jeong, S.P. Mason, A.-L. Barabasi, Z.N. Oltvai, Nature 411, 41-42 (2001)
Topology of the Protein Network
Nature 408 307 (2000)
“One way to understand the p53 network is to compare it to the Internet.
The cell, like the Internet, appears to be a ‘scale-free network’.”
p53 Network (mammals)
Social Network Analysis
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological Systems
Mining on Social Networks
Summary
Information on the Social Network
Heterogeneous, multi-relational data represented as a graph or network
Nodes are objects
May have different kinds of objects
Objects have attributes
Objects may have labels or classes
Edges are links
May have different kinds of links
Links may have attributes
Links may be directed, and are not required to be binary
Links represent relationships and interactions between objects
What is New for Link Mining Here
Traditional machine learning and data mining approaches assume:
A random sample of homogeneous objects from a single relation
Real world data sets:
Multi-relational, heterogeneous and semi-structured
Link Mining
Newly emerging research area at the intersection of social network and link analysis, hypertext and Web mining, graph mining, relational learning, and inductive logic programming
A Taxonomy of Common Link Mining Tasks
Object-Related Tasks
Link-based object ranking
Link-based object classification
Object clustering (group detection)
Object identification (entity resolution)
Link-Related Tasks
Link prediction
Graph-Related Tasks
Subgraph discovery
Graph classification
Generative model for graphs
What Is a Link in Link Mining?
Link: relationship among data
Two kinds of linked networks
homogeneous vs. heterogeneous
Homogeneous networks
Single object type and single link type
Single-mode social networks (e.g., friendship)
WWW: a collection of linked Web pages
Heterogeneous networks
Multiple object and link types
Medical network: patients, doctors, disease, contacts, treatments
Bibliographic network: publications, authors, venues
Link-Based Object Ranking (LBR)
LBR: Exploit the link structure of a graph to order or prioritize the set of objects within the graph
Focused on graphs with single object type and single link type
This is a primary focus of link analysis community
Web information analysis
PageRank and Hits are typical LBR approaches
In social network analysis (SNA), LBR is a core analysis task
Objective: rank individuals in terms of “centrality”
Degree centrality vs. eigenvector/power centrality
Rank objects relative to one or more relevant objects in the graph
PageRank: Capturing Page Popularity
(Brin & Page’98) Intuitions
Links are like citations in literature
A page that is cited often can be expected to be more useful in general
PageRank is essentially “citation counting”, but improves over simple counting
Consider “indirect citations” (being cited by a highly cited paper counts a lot…)
Smoothing of citations (every page is assumed to have a non-zero citation count)
PageRank can also be interpreted as random surfing (thus capturing popularity)
The PageRank Algorithm
(Brin & Page’98)
Example: d1 → d3, d4; d2 → d1; d3 → d2; d4 → d1, d2
“Transition matrix” (row i lists where di’s out-links go):
M = [ 0    0    1/2  1/2
      1    0    0    0
      0    1    0    0
      1/2  1/2  0    0 ]
Random surfing model: at any page,
With prob. α, randomly jumping to a page (Iij = 1/N; why? the jump spreads probability uniformly)
With prob. (1 – α), randomly picking a link to follow
p_{t+1}(d_i) = (1 – α) Σ_{d_j ∈ IN(d_i)} m_{ji} p_t(d_j) + α (1/N)
Stationary (“stable”) distribution, so we ignore time:
p(d_i) = Σ_j [α (1/N) + (1 – α) m_{ji}] p(d_j)
Same as: p = (α I + (1 – α) M)^T p, with I_ij = 1/N
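The random-surfing iteration can be sketched directly; illustrative code on the 4-page example (assumes every page has at least one out-link, so dangling pages are not handled):

```python
def pagerank(out_links, alpha=0.15, iters=100):
    """Power iteration for the random-surfing model: with prob. alpha jump
    to a uniformly random page, with prob. 1 - alpha follow a random
    out-link of the current page."""
    pages = sorted(out_links)
    N = len(pages)
    p = {d: 1.0 / N for d in pages}
    for _ in range(iters):
        nxt = {d: alpha / N for d in pages}          # random-jump mass
        for d in pages:
            share = (1 - alpha) * p[d] / len(out_links[d])
            for e in out_links[d]:
                nxt[e] += share                      # link-following mass
        p = nxt
    return p

# The 4-page example: d1 -> d3, d4; d2 -> d1; d3 -> d2; d4 -> d1, d2
out_links = {"d1": ["d3", "d4"], "d2": ["d1"], "d3": ["d2"], "d4": ["d1", "d2"]}
p = pagerank(out_links)
print({d: round(v, 3) for d, v in sorted(p.items())})
```

d1 ends up with the highest score: it is pointed to by both d2 and d4, and d2 (its main citer) is itself well cited.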
HITS: Capturing Authorities & Hubs
(Kleinberg’98)
Intuitions
Pages that are widely cited are good authorities
Pages that cite many other pages are good hubs
The key idea of HITS
Good authorities are cited by good hubs
Good hubs point to good authorities
Iterative reinforcement …
The HITS Algorithm
(Kleinberg’98)
Same example graph: d1 → d3, d4; d2 → d1; d3 → d2; d4 → d1, d2
“Adjacency matrix”:
A = [ 0  0  1  1
      1  0  0  0
      0  1  0  0
      1  1  0  0 ]
h(d_i) = Σ_{d_j ∈ OUT(d_i)} a(d_j)
a(d_i) = Σ_{d_j ∈ IN(d_i)} h(d_j)
In matrix form: h = A a; a = A^T h; hence h = A A^T h and a = A^T A a
Initial values: a = h = 1
Iterate, then normalize: Σ_i a(d_i)^2 = Σ_i h(d_i)^2 = 1
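The HITS iteration on the same 4-page example; an illustrative sketch:

```python
import math

def hits(out_links, iters=50):
    """Iterative HITS: a = A^T h (authorities collect hub weight from their
    in-links), h = A a (hubs collect authority weight from their out-links),
    each vector re-normalized to unit length every round."""
    pages = sorted(out_links)
    a = {d: 1.0 for d in pages}
    h = {d: 1.0 for d in pages}
    for _ in range(iters):
        a = {d: sum(h[e] for e in pages if d in out_links[e]) for d in pages}
        h = {d: sum(a[e] for e in out_links[d]) for d in pages}
        for vec in (a, h):
            norm = math.sqrt(sum(x * x for x in vec.values()))
            for d in vec:
                vec[d] /= norm
    return a, h

# Same 4-page graph as the PageRank example
out_links = {"d1": ["d3", "d4"], "d2": ["d1"], "d3": ["d2"], "d4": ["d1", "d2"]}
a, h = hits(out_links)
print("authorities:", {d: round(v, 3) for d, v in a.items()})
print("hubs:", {d: round(v, 3) for d, v in h.items()})
```

d1 and d2 emerge as the top authorities (both are cited by the strong hub d4), while d4 is the top hub.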
Block-level Link Analysis
(Cai et al. 04)
Most of the existing link analysis algorithms, e.g.
PageRank and HITS, treat a web page as a single node in the web graph
However, in most cases a web page contains multiple semantics, and hence might not be considered an atomic, homogeneous node
Web page is partitioned into blocks using the vision-based page segmentation algorithm
extract page-to-block, block-to-page relationships
Block-level PageRank and Block-level HITS
Link-Based Object Classification (LBC)
Predicting the category of an object based on its
attributes, its links and the attributes of linked objects
Web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc.
Citation: Predict the topic of a paper, based on word occurrence, citations, co-citations
Epidemics: Predict disease type based on characteristics of the patients infected by the disease
Communication: Predict whether a communication contact is by email, phone call or mail
Challenges in Link-Based Classification
Labels of related objects tend to be correlated
Collective classification: Explore such correlations and jointly infer the categorical values associated with the objects in the graph
Ex: Classify related news items in Reuter data sets (Chak’98)
Simply incorporating words from neighboring documents: not helpful
Multi-relational classification is another solution for link-based classification
Group Detection
Cluster the nodes in the graph into groups that share common characteristics
Web: identifying communities
Citation: identifying research communities
Methods
Hierarchical clustering
Blockmodeling of SNA
Spectral graph partitioning
Stochastic blockmodeling
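Of the methods listed, spectral graph partitioning is easy to sketch. The code below is illustrative (the power-iteration shortcut and the toy graph are my own choices): it approximates the Fiedler vector of the graph Laplacian and splits vertices by its sign:

```python
import math

def fiedler_partition(adj, iters=500):
    """Toy spectral partitioning: power iteration on the shifted Laplacian
    c*I - L, with the all-ones eigenvector projected out each step,
    converges to the Fiedler vector; its sign pattern gives two groups."""
    nodes = sorted(adj)
    n = len(nodes)
    idx = {u: i for i, u in enumerate(nodes)}
    deg = [len(adj[u]) for u in nodes]
    c = 2 * max(deg) + 1                  # shift keeps all eigenvalues positive
    v = [float((-1) ** i) for i in range(n)]   # arbitrary non-constant start
    for _ in range(iters):
        # w = (c*I - L) v, where (L v)_u = deg(u) v_u - sum of v over neighbors
        w = [c * v[i] - deg[i] * v[i] + sum(v[idx[x]] for x in adj[nodes[i]])
             for i in range(n)]
        mean = sum(w) / n
        w = [x - mean for x in w]         # project out the constant vector
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return {u: v[idx[u]] >= 0 for u in nodes}

# Two triangles {0,1,2} and {3,4,5} joined by the single bridge edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
part = fiedler_partition(adj)
print(part)   # the two triangles land on opposite sides of the cut
```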
Entity Resolution
Predicting when two objects are the same, based on their attributes and their links
Also known as: deduplication, reference reconciliation, co-reference resolution, object consolidation
Applications
Web: predict when two sites are mirrors of each other
Citation: predicting when two citations are referring to the same paper
Epidemics: predicting when two disease strains are the same
Biology: learning when two names refer to the same protein
Entity Resolution Methods
Earlier viewed as a pair-wise resolution problem: objects resolved based on the similarity of their attributes
Importance of considering links
Coauthor links in bib data, hierarchical links between spatial references, co-occurrence links between name references in documents
Use of links in resolution
Collective entity resolution: one resolution decision affects another if they are linked
Propagating evidence over links in a dependency graph
Probabilistic models of interactions among different entities