Internet measurement and data analysis


Now that the Internet has become part of society's infrastructure, it is increasingly important to understand how it is currently used and how it behaves, and to predict its future, not only for technical purposes but also for investment decisions and policy making.

However, the Internet is a gigantic and complex system, and grasping it is challenging: large-scale measurement covering the entire Internet is not realistic, and traditional sampling methods often cannot be applied. Moreover, there are various technical, social, economic, and legal constraints, and problems must be solved within these constraints.

In this class, you will get an overview of Internet measurement and large-scale data analysis, and acquire the basic skills needed in the coming information society to obtain new knowledge from massive amounts of information.


Theme, Goals, Methods

In this class, you will learn Internet measurement and data analysis methods in order to gain knowledge and understanding of networking technologies and large-scale data analysis. Each class covers a specific topic, through which you will learn about the problems, constraints, and solutions involved. At the same time, you will learn the technical and theoretical background of each topic, such as networking technologies, statistics, and algorithms. Each class consists of a lecture and data analysis exercises.

Textbooks, References

The lecture slide materials will be provided online.

ruby: gnuplot:

[1] Mark Crovella and Balachander Krishnamurthy. Internet Measurement: Infrastructure, Traffic, and Applications. Wiley, 2006.
[2] Antonio Nucci and Konstantina Papagiannaki. Design, Measurement and Management of Large-Scale IP Networks: Bridging the Gap Between Theory and Practice. Cambridge University Press, 2008.
[3] Pang-Ning Tan, Michael Steinbach and Vipin Kumar. Introduction to Data Mining. Addison Wesley, 2006.
[4] Raj Jain. The Art of Computer Systems Performance Analysis. Wiley, 1991.


2 assignments and a final report.


The prerequisites for the class are basic programming skills and basic knowledge about statistics.

In the exercises and assignments, you will need to write programs to process large data sets, using the Ruby scripting language and the Gnuplot plotting tool. To understand the theoretical aspects, you will need basic knowledge about algebra and statistics. However, the focus of the class is to understand how mathematics is used for engineering applications.


Class 1 Introduction (9/28)

network measurement and Internet measurement, network management tools, network measurement tools, exercise: introduction to the Ruby scripting language, lecture slides

Class 2 Measuring the size of the Internet (10/5)

the number of users and hosts, the number of web pages, precision, errors, significant digits, how to make good graphs, exercise: graph plotting with Gnuplot (marathon, stock prices), lecture slides

Class 3 Data recording and log analysis (10/12)

data format, log analysis methods, exercise: log data and regular expressions (access log data, test data, scripts), lecture slides
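
As a rough illustration of the log analysis exercise, the sketch below uses a regular expression to count requests per HTTP status code. It assumes the access log is in Apache Common Log Format, which the syllabus does not state. The names `CLF`, `parse_line`, and `count_by_status` and the sample lines are made up for this example and are not the class scripts.

```ruby
# Hypothetical sketch: count requests per status code in an access log.
# Assumes Apache Common Log Format; the sample lines below are invented.
CLF = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)$/

# Returns a Hash of fields, or nil if the line does not match the format.
def parse_line(line)
  m = CLF.match(line.chomp)
  return nil unless m
  { host: m[1], time: m[2], method: m[3], path: m[4],
    status: m[5].to_i, bytes: (m[6] == '-' ? 0 : m[6].to_i) }
end

# Counts matching lines per status code; non-matching lines are skipped.
def count_by_status(lines)
  counts = Hash.new(0)
  lines.each do |line|
    rec = parse_line(line)
    counts[rec[:status]] += 1 if rec
  end
  counts
end

sample = [
  '192.0.2.1 - - [12/Oct/2011:10:00:00 +0900] "GET /index.html HTTP/1.1" 200 1024',
  '192.0.2.2 - - [12/Oct/2011:10:00:01 +0900] "GET /missing HTTP/1.1" 404 -',
  'not a log line'
]
count_by_status(sample).sort.each { |status, n| puts "#{status} #{n}" }
```

For a real file, the same function could be fed `File.foreach(path)` instead of an array, so that the log is read line by line rather than loaded into memory at once.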

Class 4 Measuring the speed of the Internet (10/19)

bandwidth measurement, inferring available bandwidth, mean, standard deviation, linear regression, exercise: mean, standard deviation, linear regression, assignment 1, lecture slides
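
The three statistics named in this exercise can be sketched in a few lines of Ruby. This is a minimal illustration, not the class material: the function names and the sample data are invented, the standard deviation divides by n - 1 (the class may use n), and `Array#sum` requires Ruby 2.4 or later.

```ruby
# Sketch: mean, standard deviation, and least-squares line fit.
def mean(xs)
  xs.sum.to_f / xs.size
end

# Sample standard deviation (divides by n - 1); this choice is an assumption.
def stddev(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / (xs.size - 1))
end

# Returns [slope, intercept] of the least-squares line y = slope * x + intercept.
def linear_regression(xs, ys)
  mx = mean(xs)
  my = mean(ys)
  sxy = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sxx = xs.sum { |x| (x - mx)**2 }
  slope = sxy / sxx
  [slope, my - slope * mx]
end

# Invented data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = linear_regression(xs, ys)
puts "mean=#{mean(xs)} stddev=#{stddev(xs)} slope=#{slope} intercept=#{intercept}"
```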

Class 5 Measuring the structure of the Internet (10/26)

Internet architecture, network layers, topologies, graph theory, exercise: topology analysis (dijkstra.rb, topology.txt), lecture slides
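
For readers who want a feel for the topology exercise before opening the class script, here is a small sketch of Dijkstra's shortest-path algorithm. It is not the course's dijkstra.rb, and it does not read topology.txt: the graph is an invented in-memory Hash in which every node appears as a key, and all link costs are assumed to be non-negative. It picks the next node with a linear scan, which is simple but slower than a priority queue on large topologies.

```ruby
# Sketch of Dijkstra's algorithm. The graph is a Hash: node => { neighbor => cost }.
# Assumes every node is a key of the Hash and all costs are non-negative.
def dijkstra(graph, source)
  dist = Hash.new(Float::INFINITY)   # unknown nodes are infinitely far away
  dist[source] = 0
  unvisited = graph.keys
  until unvisited.empty?
    u = unvisited.min_by { |n| dist[n] }      # closest unvisited node
    break if dist[u] == Float::INFINITY       # the rest are unreachable
    unvisited.delete(u)
    graph[u].each do |v, cost|
      alt = dist[u] + cost
      dist[v] = alt if alt < dist[v]          # found a shorter path to v
    end
  end
  dist
end

# Invented four-node topology.
graph = {
  'A' => { 'B' => 1, 'C' => 4 },
  'B' => { 'A' => 1, 'C' => 2, 'D' => 5 },
  'C' => { 'A' => 4, 'B' => 2, 'D' => 1 },
  'D' => { 'B' => 5, 'C' => 1 }
}
dijkstra(graph, 'A').sort.each { |node, d| puts "#{node} #{d}" }
```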

Class 6 Measuring the characteristics of the Internet (11/2)

delay, packet loss, jitter, correlation and multivariate analysis, principal component analysis, exercise: correlation analysis (correlation.rb), lecture slides
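
As a minimal illustration of correlation analysis, the sketch below computes the Pearson correlation coefficient of two equally long series. It is not the course's correlation.rb; the function name and the data are invented, and both series are assumed to have non-zero variance.

```ruby
# Sketch: Pearson correlation coefficient of two series of equal length.
# Assumes neither series is constant (otherwise the denominator is zero).
def correlation(xs, ys)
  n = xs.size.to_f
  mx = xs.sum / n
  my = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cov / (sx * sy)
end

# Invented data: the second series is a linear function of the first.
puts correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

The result lies between -1 and 1; values near zero indicate no linear relationship, which does not rule out a non-linear one.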

Class 7 Measuring the diversity and complexity of the Internet (11/9)

sampling, statistical analysis, histogram, exercise: histogram, CDF, lecture slides
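
The histogram and CDF exercise can be sketched as follows. This is an illustration under assumptions, not the class solution: the bins have a fixed width chosen by the caller, the CDF is the empirical one (the fraction of samples less than or equal to each value), and the function names and data are invented.

```ruby
# Bin values into fixed-width bins; returns { bin_start => count }, sorted by bin.
def histogram(values, bin_width)
  counts = Hash.new(0)
  values.each { |v| counts[(v / bin_width).floor * bin_width] += 1 }
  counts.sort.to_h
end

# Empirical CDF: returns [[value, fraction of samples <= value], ...].
def cdf(values)
  n = values.size.to_f
  cum = 0
  values.group_by { |v| v }.sort.map do |v, group|
    cum += group.size          # running count of samples <= v
    [v, cum / n]
  end
end

# Invented sample data.
values = [1, 2, 2, 3, 12, 25]
histogram(values, 10).each { |bin, count| puts "#{bin} #{count}" }
cdf(values).each { |v, f| puts "#{v} #{f}" }
```

Both functions print one "x y" pair per line, a layout that Gnuplot can plot directly from a file.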

Class 8 Distributions (11/16)

normal distribution and other distributions, confidence intervals, statistical tests, exercise: generating distributions, confidence intervals, assignment 2, lecture slides
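
To make the idea of a confidence interval concrete, here is a sketch that computes a 95% confidence interval for a sample mean. It assumes the normal approximation with the critical value 1.96, which is reasonable for large samples; for small samples a value from the t-distribution would be more appropriate. The function name and the data are invented.

```ruby
# Sketch: 95% confidence interval for the mean, using the normal approximation.
# Assumes a reasonably large sample; small samples call for the t-distribution.
def confidence_interval_95(xs)
  n = xs.size
  m = xs.sum.to_f / n
  s = Math.sqrt(xs.sum { |x| (x - m)**2 } / (n - 1))   # sample standard deviation
  half = 1.96 * s / Math.sqrt(n)                        # half-width of the interval
  [m - half, m + half]
end

# Invented measurements.
low, high = confidence_interval_95([10, 12, 14, 16, 18])
puts "#{low} #{high}"
```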

Class 9 Discussion (11/18, Friday) 9:25-10:55 e11

ref-1 ref-2

Class 10 Discussion (11/18, Friday) 11:10-12:40 e11

Class 11 Measuring time series of the Internet (11/30)

Internet and time, network time protocol, time series analysis, exercise: time series analysis (autocorr.rb, autocorr_5min_data.txt), lecture slides
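
One common tool in time series analysis is the autocorrelation at a given lag, sketched below. This is not the course's autocorr.rb and may normalize differently: here the lagged covariance is divided by the total sum of squared deviations, so the value at lag 0 is exactly 1. The series is invented and assumed not to be constant.

```ruby
# Sketch: autocorrelation of series xs at lag k (0 <= k < xs.size).
# Normalized by the lag-0 sum of squares; assumes xs is not constant.
def autocorrelation(xs, k)
  n = xs.size
  m = xs.sum.to_f / n
  var = xs.sum { |x| (x - m)**2 }
  cov = (0...(n - k)).sum { |i| (xs[i] - m) * (xs[i + k] - m) }
  cov / var
end

# Invented series with period 2: strongly negative at lag 1, positive at lag 2.
series = [1, -1, 1, -1, 1, -1, 1, -1]
(0..3).each { |k| puts "#{k} #{autocorrelation(series, k)}" }
```

Peaks at non-zero lags suggest periodicity; with 5-minute traffic data, for example, a daily cycle would show up around lag 288.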

NO Class (12/7)

Class 12 Measuring anomalies of the Internet (12/14)

anomaly detection, spam filters, Bayes' theorem, exercise: anomaly detection, lecture slides
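
The link between spam filters and Bayes' theorem can be shown with a single word. The sketch below computes P(spam | word) from P(word | spam), P(word | ham), and the prior P(spam). The function name and the probabilities are invented for illustration; a real filter would combine many words and estimate these probabilities from training mail.

```ruby
# Sketch: Bayes' theorem for a one-word spam filter.
# P(spam | word) = P(word | spam) * P(spam) / P(word)
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam)
  p_ham = 1.0 - p_spam
  # Total probability of seeing the word in any message.
  evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
  p_word_given_spam * p_spam / evidence
end

# Invented numbers: the word appears in 80% of spam and 10% of legitimate mail,
# and half of all mail is spam.
puts posterior_spam(0.8, 0.1, 0.5)
```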

Class 13 Data mining (12/21)

pattern extraction, classification, clustering, exercise: clustering (k-means.rb, km-1.txt, km-2.txt, km-3.txt), lecture slides
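
For orientation, the clustering exercise can be approximated by the following sketch of k-means (Lloyd's algorithm). It is not the course's k-means.rb and does not read the km-*.txt files: the points are invented, the initial centroids are passed in by the caller rather than chosen at random, and a centroid whose cluster becomes empty is simply left where it is.

```ruby
# Squared Euclidean distance between two points of the same dimension.
def dist2(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Sketch of k-means. Returns [centroids, clusters], where clusters maps a
# centroid index to the points assigned to it.
def kmeans(points, centroids, max_iter = 100)
  clusters = nil
  max_iter.times do
    # Assignment step: each point goes to its nearest centroid.
    clusters = points.group_by do |p|
      centroids.each_index.min_by { |i| dist2(p, centroids[i]) }
    end
    # Update step: move each centroid to the mean of its members.
    updated = centroids.each_with_index.map do |c, i|
      members = clusters[i]
      next c if members.nil?   # empty cluster: keep the centroid in place
      members.transpose.map { |coords| coords.sum.to_f / members.size }
    end
    break if updated == centroids   # converged
    centroids = updated
  end
  [centroids, clusters]
end

# Invented data: two well-separated groups of three points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, clusters = kmeans(points, [[0, 0], [10, 10]])
centroids.each_with_index { |c, i| puts "#{i} #{c.inspect} #{clusters[i].size}" }
```

The result depends on the initial centroids, so in practice the algorithm is often run several times with different starting points.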

Class 14 Scalable measurement and analysis (1/11)

distributed parallel processing, cloud technology, lecture slides

Class 15 Summary (1/18)

summary of the class, Internet measurement and privacy issues


$Date: 2012/01/10 14:12:00 $