
2. Analytical background


2.1. Analytical background

We need tools to study the Internet in a quantitative fashion:


• Linear algebra

• Probability and statistics

• Graph theory

Further readings in these topics:

1. Linear algebra wikibook: http://en.wikibooks.org/wiki/Linear_Algebra

2. Mario F. Triola: Elementary Statistics

3. Reinhard Diestel: Graph Theory, http://diestel-graph-theory.com/index.html

2.2. LINEAR ALGEBRA

2.3. Notations

2.4. Norms and orthogonality

2.5. Matrices

2.6. Eigenvectors and eigenvalues

2.7. Alternate algebras

2.8. PROBABILITY AND STATISTICS

2.9. Why do we need statistics and probability theory?

• Most of the mechanisms in networks are not deterministic

• Randomized algorithms

• Improved robustness, load balancing, etc.

• Stochastic behavior of incoming traffic

• Without probability theory and statistics it would be hard to analyze them


2.10. Notations

2.11. Definitions

2.12. Definitions - II

2.13. Expected values and moments

2.14. Variance and standard deviation

2.15. Joint probability

2.16. Conditional probability

2.17. Central limit theorem

2.18. Distributions for Internet measurements

2.19. Stochastic processes

• Typically, Internet measurements arrive over time, in some order

• To use the tools of probability in this setting, we need to define a sequence of random variables, which is called a stochastic process.


2.20. Stochastic processes

2.21. Stochastic processes

2.22. Characterization of a stochastic process

2.23. Simpler stationary conditions

2.24. Measures of dependence

2.25. Measures of dependence

2.26. Measures of dependence

2.27. Modeling network traffic and user activity

2.28. Modeling network traffic and user activity

2.29. Short and long tailed distributions

2.30. Short and long tailed distributions

2.31. Short and long tailed distributions

2.32. Heavy tailed/power-law distribution

2.33. Heavy tailed distribution


• Example: link lengths (in km) in the New York City area road map

2.34. Measured data

• Describing data

• For example: "mean of a dataset"

• An objectively measurable quantity which is the average of a set of known values

• Describing probability models

• For example: "mean of a random variable"

• A property of an abstract mathematical construct

• To emphasize the distinction, we add the adjective "empirical" to describe data

• Empirical mean vs. mean (see the sketch after this list)

• Classification of measured data

• Numerical: i.e. numbers

• Categorical: i.e. symbols, names, tokens, etc.
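To make the distinction concrete, here is a minimal sketch; the exponential dataset below is purely illustrative:

```python
import numpy as np

# Hypothetical dataset: 1000 samples drawn from an exponential model.
rng = np.random.default_rng(seed=0)
samples = rng.exponential(scale=2.0, size=1000)

model_mean = 2.0                 # mean of the Exp(scale=2) model: a property of the abstraction
empirical_mean = samples.mean()  # average of the observed values: a property of the data

print(f"model mean = {model_mean}, empirical mean = {empirical_mean:.3f}")
```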

2.35. Describing data

2.36. More detailed descriptions

• Quantiles

• The pth quantile is the value below which the fraction p of the values lies.

• Median is the 0.5-quantile

• Percentile

• Quantiles can be expressed as percentiles as well.

• E.g. the 90th percentile is the value that is larger than 90 percent of the data (see the sketch below)
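A minimal sketch of computing quantiles and percentiles with NumPy; the data values are made up:

```python
import numpy as np

data = np.array([1, 3, 3, 4, 7, 9, 12, 15, 20, 41])  # hypothetical measurements

median = np.quantile(data, 0.5)  # the 0.5-quantile
p90 = np.percentile(data, 90)    # the 90th percentile
print(f"median = {median}, 90th percentile = {p90}")
```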


2.37. Histogram

• Defined in terms of bins, which form a partition of the range of the observed values

• Counts how many values fall in each bin

• A natural empirical analog of a random variable’s probability density function (PDF) or distribution function

• Practical problem:

• How to determine the bin boundaries (see the sketch below)
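A minimal sketch of the binning problem, assuming NumPy; the Pareto data is illustrative:

```python
import numpy as np

# Hypothetical skewed dataset where the bin choice matters.
data = np.random.default_rng(1).pareto(a=2.0, size=10_000)

# Fixed number of equal-width bins: simple, but the choice is arbitrary.
counts, edges = np.histogram(data, bins=30)

# A data-driven rule can pick the boundaries instead.
counts_auto, edges_auto = np.histogram(data, bins="auto")
print(f"fixed: {len(edges) - 1} bins, auto: {len(edges_auto) - 1} bins")
```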

2.38. Empirical cumulative distribution function (CDF)

• Involves no binning or averaging of data values

• Provides more information about the dataset than the histogram.

• For each unique value in the dataset, it gives the fraction of data items that are smaller than that value (its quantile).

• The empirical CCDF can be used similarly (see the sketch below)
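A minimal sketch of the empirical CDF and CCDF, with no binning involved; the dataset is made up:

```python
import numpy as np

def empirical_cdf(data):
    """Sorted values and, for each, the fraction of items at or below it."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

data = np.random.default_rng(2).lognormal(size=1000)  # hypothetical dataset
x, cdf = empirical_cdf(data)
ccdf = 1.0 - cdf  # the complementary CDF
```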

2.39. Categorical data description

• Probability distribution

• An analog of the histogram for categorical data

• Measure the empirical probability of each symbol in the dataset

• Plot the histogram with the symbols sorted in decreasing order of empirical probability (see the sketch below)
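A minimal sketch for categorical data, using a few hypothetical HTTP method tokens:

```python
from collections import Counter

symbols = ["GET", "GET", "POST", "GET", "HEAD", "POST", "GET"]  # hypothetical tokens
counts = Counter(symbols)
n = len(symbols)

# Empirical probability of each symbol, in decreasing order.
for sym, c in counts.most_common():
    print(sym, c / n)
```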


2.40. Describing memory and stability

Time series data

• Question: Do successive measurements tend to have any relation to each other?

Memory

• When the value of a measurement tends to give some information about the likely values of future measurements

• Empirical autocorrelation function (ACF):
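A standard form of the empirical ACF at lag k, for observations x_1, ..., x_n with empirical mean x̄:

```latex
r(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}
```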

Stability

• A dataset is stable if its empirical statistics do not seem to change over time.

• Subjective

• Objective measures

• A typical approach is to break the dataset into windows

• E.g. a set of 1000 observations can be divided into 10 windows consisting of the 1st 100 observations, the 2nd 100 observations, and so on.

• Empirical statistics are calculated for each window; then one looks for consistency, trends, predictable variation, etc. (a sketch follows below)
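A minimal sketch of the windowing approach, assuming NumPy; the time series is made up:

```python
import numpy as np

data = np.random.default_rng(4).normal(size=1000)  # hypothetical time series
windows = data.reshape(10, 100)                    # ten windows of 100 observations each

# Per-window empirical statistics: compare them across windows
# to look for trends, drift, or predictable variation.
print(windows.mean(axis=1))
print(windows.std(axis=1))
```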

2.41. High variability in Internet data

• Traditional statistical methods focus on low or moderate variability in the data, e.g. the normal distribution

• However, Internet data shows high variability

• It consists of many small values mixed with a small number of large values

• A significant fraction of the data may fall many standard deviations from the mean

• Empirical distribution is highly skewed, and empirical mean and variance are strongly affected by the rare, large observations

• It may be modeled with a subexponential or heavy tailed distribution

• Mean and variance are not good metrics for high-variability data; quantiles and the empirical distribution are better

• E.g. the empirical CCDF on log-log axes for a long-tailed distribution (see the sketch below)
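A minimal sketch of an empirical CCDF on log-log axes, assuming Matplotlib; the Pareto data is illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(3).pareto(a=1.5, size=10_000)  # hypothetical heavy-tailed data

x = np.sort(data)
ccdf = 1.0 - np.arange(len(x)) / len(x)  # P(X >= x), never reaches zero

plt.loglog(x, ccdf)  # a power-law tail looks roughly linear on log-log axes
plt.xlabel("value")
plt.ylabel("P(X >= x)")
plt.show()
```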

2.42. Zipf’s law

• Categorical distributions can also be highly skewed

• A model for the shape of a categorical distribution when data values are ordered by decreasing empirical probability,

• e.g. URLs of Web pages

• Zipf’s law refers to the situation where the empirical probability of the item of rank r is approximately p(r) = c · r^(-B) for some positive constants c and B
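A minimal sketch of estimating the Zipf parameters by a least-squares line on log-log axes; the token stream is made up:

```python
import numpy as np
from collections import Counter

tokens = ["a", "b", "a", "c", "a", "b", "d", "a"]  # hypothetical categorical data
freqs = np.array(sorted(Counter(tokens).values(), reverse=True))
probs = freqs / freqs.sum()
ranks = np.arange(1, len(probs) + 1)

# Under Zipf's law p(r) ~ c * r^(-B), log p(r) is linear in log r.
slope, intercept = np.polyfit(np.log(ranks), np.log(probs), 1)
print(f"B = {-slope:.2f}, c = {np.exp(intercept):.2f}")
```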

2.43. GRAPH THEORY

2.44. Graph theory

• Generally, networks can be handled as directed or undirected graphs

• Moreover, other phenomena can also be analyzed with graph theory

• E.g. retweet graph in social network analysis


• Graph theory helps us characterize networks and other phenomena and analyze their properties

2.45. Graphs

• A graph is a pair G = (V, E) of a set V of vertices and a set E of edges

• Undirected and directed

• Unweighted and weighted

(Figure: example graphs; the numeric labels are edge weights.)

2.46. Subgraphs

2.47. Connected graphs

2.48. Metrics for characterization

2.49. Metrics for characterization

2.50. Matrix representation

2.51. Applications of Routing Matrix

• Origin-destination flow

2.52. Applications of routing matrix

• Delay of paths

• The routing matrix and the end-to-end path delays are known (see the sketch below)
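A minimal sketch, assuming end-to-end delay is the sum of the link delays along a path; the routing matrix and measurements are made up:

```python
import numpy as np

# Rows: paths, columns: links; G[i, j] = 1 if path i traverses link j.
G = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]], dtype=float)

y = np.array([12.0, 9.0, 11.0])  # hypothetical measured end-to-end delays (ms)

# If y = G @ d, recover the per-link delays d in the least-squares sense.
d, *_ = np.linalg.lstsq(G, y, rcond=None)
print(d)  # estimated per-link delays
```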


2.53. Artificial graph constructions

• To model real networks by generating random graphs

• Erdős-Rényi model

• Random graphs

• Theoretical relevance

• Binomial degree distribution

• Barabási-Albert model

• Random scale-free networks

• Modeling natural and human-made systems

• Power-law degree distribution

• Other models, like the Watts-Strogatz model (see the sketches below)
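A minimal sketch generating Erdős-Rényi and Barabási-Albert graphs with NetworkX; the sizes and parameters are arbitrary:

```python
import networkx as nx

n = 1000
er = nx.erdos_renyi_graph(n, p=0.01, seed=0)   # binomial degree distribution
ba = nx.barabasi_albert_graph(n, m=5, seed=0)  # power-law degree distribution

for name, g in [("ER", er), ("BA", ba)]:
    degrees = [d for _, d in g.degree()]
    print(name, "mean degree:", sum(degrees) / n, "max degree:", max(degrees))
```

The maximum degree of the BA graph typically far exceeds that of the ER graph, reflecting its heavy-tailed degree distribution.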

2.54. Erdős-Rényi random graph

2.55. Erdős-Rényi random graph

2.56. Generalized random graph

2.57. Preferential attachment model

2.58. Preferential attachment model

2.59. Regular vs Random graphs

• Regular graph

• Long characteristic path length

• High degree of clustering

• Random Graph

• Short paths

• Low degree of clustering

• Small world graph (see the sketch below)

• Short characteristic path length

• High degree of clustering
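A minimal sketch with NetworkX's Watts-Strogatz generator; the parameters are arbitrary:

```python
import networkx as nx

# The rewiring probability p interpolates between a regular ring lattice
# (p=0) and a random graph (p=1); small p gives the small-world regime:
# short paths together with high clustering.
for p in [0.0, 0.1, 1.0]:
    g = nx.connected_watts_strogatz_graph(n=500, k=10, p=p, seed=0)
    L = nx.average_shortest_path_length(g)
    C = nx.average_clustering(g)
    print(f"p={p}: avg path length={L:.2f}, clustering={C:.3f}")
```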

2.60. AS level topology


2.61. AS level topology

• High variability in degree distribution

• Some ASes are very highly connected

• Different ASes have dramatically different roles in the network

• Node degree seems to be highly correlated with AS size

• Generative models of AS graph

• "Rich get richer" model

• Newly added nodes connect to existing nodes in a way that tends to simultaneously minimize the physical length of the new connection, as well as the average number of hops to other nodes

• New ASes appear at an exponentially increasing rate, and each AS grows exponentially as well

2.62. AS level topology

2.63. MODELING

2.64. Measurement and modeling

• Model

• Simplified version of something else

• Classification

• A system model: simplified descriptions of computer systems

• Data models: simplified descriptions of measurements

• Data models

• Descriptive data models

• Constructive data models

2.65. Descriptive data model

• Compact summary of a set of measurements

• E.g. summarize the variation of traffic on a particular network as "a sinusoid with period 24 hours"

• An underlying idealized representation

• Contains parameters whose values are based on the measured data

• Drawback

• Cannot use all available information

• Hard to answer "why is the data like this?" and "what will happen if the system changes?"


2.66. Constructive data model

• Succinct description of a process that gives rise to an output of interest

• E.g. model network traffic as "the superposition of a set of flows arriving independently, each consisting of a random number of packets"

• The main purpose is to concisely characterize a dataset, instead of representing or simulating the real system

• Drawback

• The model is hard to generalize: such models may have many parameters

• The nature of the output is not obvious without simulation or analysis

• It is impossible to match the data in every aspect

2.67. Data model

• "All models are wrong, but some models are useful"

• A model is approximate

• Models omit details of the data by their very nature

• Modeling introduces the tension between the simplicity and utility of a model

• Under which model is the observed data more likely?

• Models involve a random process or component

• Three key steps in building a data model:

• Model selection

• Parsimonious: prefer models with fewer parameters over those with a larger number of parameters

• Parameter estimation

• Validating the model (see the sketch below)
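A minimal sketch of the estimation and validation steps, assuming an exponential model and SciPy; note that fitting and testing on the same data is only a rough check:

```python
import numpy as np
from scipy import stats

data = np.random.default_rng(5).exponential(scale=3.0, size=500)  # hypothetical measurements

# Parameter estimation: fit the exponential model by maximum likelihood.
loc, scale = stats.expon.fit(data, floc=0)

# Validation: Kolmogorov-Smirnov test of the data against the fitted model.
stat, p_value = stats.kstest(data, "expon", args=(loc, scale))
print(f"scale = {scale:.2f}, KS stat = {stat:.3f}, p = {p_value:.3f}")
```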

2.68. Why build models

• Provides a compact summary of a set of measurements

• Exposes properties of measurements that are important for particular engineering problems, provided the parameters are interpretable

• Serves as a starting point for generating random but "realistic" data as simulation input

2.69. Probability models

• Why use random models in the Internet?

• Fundamentally, the processes involved are random

• Their value is that they summarize an immense number of particular system properties that would be far too tedious to specify

• Random models and real systems are very different things

• It is important to distinguish between the properties of a probabilistic model and the properties of real data.

3. Network measurement infrastructures: ETOMIC
