Internet measurement and data analysis


A huge amount of diverse data is now accessible through the Internet. This data lets us obtain new knowledge and create new services, driving innovations known as "Big Data" and "Collective Intelligence". To understand such data and use it as a tool, one needs a good grasp of the technical background in statistics, machine learning, and computer network systems.

In this class, you will get an overview of large-scale data analysis on the Internet and learn basic skills for extracting new knowledge from massive amounts of information in the forthcoming information society.


Theme, Goals, Methods

In this class, you will learn methods for collecting and analyzing data on the Internet, and gain knowledge and understanding of networking technologies and large-scale data analysis.

Each class covers specific topics, presenting the technologies and the theories behind them. In addition to the lecture, each class includes programming exercises to build data analysis skills.

Textbooks, References

The lecture slide materials will be provided online.

ruby: gnuplot:

[1] Mark Crovella and Balachander Krishnamurthy. Internet Measurement: Infrastructure, Traffic, and Applications. Wiley, 2006.
[2] Pang-Ning Tan, Michael Steinbach and Vipin Kumar. Introduction to Data Mining. Addison Wesley, 2006.
[3] Raj Jain. The Art of Computer Systems Performance Analysis. Wiley, 1991.
[4] Toby Segaran. Programming Collective Intelligence. O'Reilly Media, 2007.
[5] Allen B. Downey. Think Stats: Probability and Statistics for Programmers. O'Reilly Media, 2011.
[6] Chris Sanders. Practical Packet Analysis, 2nd Edition. No Starch Press, 2011.


2 assignments and a final report.


The prerequisites for the class are basic programming skills and basic knowledge about statistics.

In the exercises and assignments, you will need to write programs to process large data sets, using the Ruby scripting language and the Gnuplot plotting tool. To understand the theoretical aspects, you will need basic knowledge about algebra and statistics. However, the focus of the class is to understand how mathematics is used for engineering applications.


Class 1 Introduction (4/11)

Big Data and Collective Intelligence, Internet measurement, Large-scale data analysis, exercise: introduction of Ruby scripting language, lecture slides, exercise script(count.rb, count-rubyish.rb), (optional reading material)

Class 2 Data and variability (4/18)

Summary statistics, Sampling, How to make good graphs, exercise: graph plotting by Gnuplot, lecture slides, exercise data(marathon.txt) exercise script(mean.rb, stddev.rb, stddev2.rb, median.rb, marathon.plt, marathon-cdf.rb, marathon-cdf.plt)
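
The exercise scripts are not reproduced on this page. As a rough illustration of the summary statistics covered in this class, here is a minimal sketch; the method names and the sample values are assumptions made for this example, and the standard deviation shown is the population form (dividing by n).

```ruby
# Summary statistics for a small array of numbers.
# The sample values below are made up for illustration.

def mean(values)
  values.sum.to_f / values.size
end

# Population standard deviation (divides by n, not n - 1).
def stddev(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / values.size)
end

# Middle value of the sorted data; average of the two middle
# values when the number of samples is even.
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

data = [2, 4, 4, 4, 5, 5, 7, 9]
puts "mean=#{mean(data)} stddev=#{stddev(data)} median=#{median(data)}"
```

For a sample standard deviation, the divisor would be `values.size - 1` instead.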

Class 3 Data recording and log analysis (4/25)

Network management tools, Data format, Log analysis methods, exercise: log data and regular expression, lecture slides, exercise data(, parse_accesslog.rb, access.plt)
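
As an illustration of log analysis with regular expressions, the following sketch parses one web server access log line. It assumes the Common Log Format; the pattern, the `parse_line` helper, and the sample line are hypothetical and are not taken from parse_accesslog.rb.

```ruby
# Pattern for one line in Common Log Format (an assumed input format):
# host ident user [time] "method path protocol" status bytes
LOG_PATTERN = /\A(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)/

# Returns a hash of fields, or nil when the line does not match.
def parse_line(line)
  m = LOG_PATTERN.match(line)
  return nil unless m

  { host: m[1], time: m[2], method: m[3], path: m[4],
    status: m[5].to_i, bytes: m[6] == "-" ? 0 : m[6].to_i }
end

sample = '192.0.2.1 - - [25/Apr/2016:10:00:00 +0900] "GET /index.html HTTP/1.1" 200 1234'
p parse_line(sample)
```

Returning nil for non-matching lines lets a caller count or skip malformed entries instead of aborting on them.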

Class 4 Distribution and confidence intervals (5/2)

Normal distribution, Confidence intervals and statistical tests, Distribution generation, exercise: confidence intervals, assignment 1, lecture slides, exercise scripts(box-muller.rb, box-muller-hist.rb, box-muller-hist.plt, conf-interval.rb, conf-interval.plt) data for assignment 1 (honolulu2015.txt) (3.6MB)
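
The script names suggest that distribution generation uses the Box-Muller transform. The following is a minimal sketch of that transform, not the contents of box-muller.rb; the `box_muller` method, the seed, and the sample count are assumptions chosen for illustration.

```ruby
# Box-Muller transform: turns two uniform random numbers into
# two independent samples from the standard normal distribution.
def box_muller(rng = Random.new)
  u1 = 1.0 - rng.rand # in (0, 1], which avoids log(0)
  u2 = rng.rand
  r = Math.sqrt(-2.0 * Math.log(u1))
  [r * Math.cos(2 * Math::PI * u2), r * Math.sin(2 * Math::PI * u2)]
end

rng = Random.new(42) # fixed seed so that runs are repeatable
samples = Array.new(5000) { box_muller(rng) }.flatten
sample_mean = samples.sum / samples.size
sample_var = samples.sum { |x| (x - sample_mean)**2 } / samples.size
puts format("mean=%.3f variance=%.3f", sample_mean, sample_var)
```

With enough samples, the printed mean should be close to 0 and the variance close to 1.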

Class 5 Diversity and complexity (5/9)

Long tail, Web access and content distribution, Power-law and complex systems, exercise: power-law analysis, lecture slides, exercise data(us-surnames.csv), exercise scripts(make_ccdf.rb, ccdf.plt, count_contents.rb)

Class 6 Correlation (5/16)

Online recommendation systems, Distance, Correlation coefficient, exercise: correlation analysis, lecture slides, exercise script(correlation.rb) exercise data(correlation-data-1.txt, correlation-data-2.txt) exercise script(similarity.rb) exercise data(scores.txt)
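
For the correlation coefficient, a minimal sketch of Pearson's product-moment correlation is shown here. The `pearson` method and the sample data are assumptions for this example and may differ from what correlation.rb does.

```ruby
# Pearson correlation coefficient between two equal-length samples.
def pearson(xs, ys)
  raise ArgumentError, "length mismatch" unless xs.size == ys.size

  n = xs.size.to_f
  mx = xs.sum / n
  my = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cov / (sx * sy)
end

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
puts format("r=%.3f", pearson(xs, ys))
```

The result lies between -1 and 1; the sketch does not guard against a sample with zero variance, which would make the denominator zero.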

Class 7 Multivariate analysis (5/23)

Data sensing and GeoLocation, Linear regression, Principal Component Analysis, exercise: linear regression, lecture slides, exercise script(leastsquare.rb, pca.rb), exercise data(pca-data.txt)
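
Linear regression by the least squares method can be sketched as follows. This is an illustrative version for a single explanatory variable; the `least_squares` method and the data points are assumptions and are not taken from leastsquare.rb.

```ruby
# Fits y = slope * x + intercept by ordinary least squares.
# Returns [slope, intercept].
def least_squares(xs, ys)
  n = xs.size.to_f
  mx = xs.sum / n
  my = ys.sum / n
  sxy = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sxx = xs.sum { |x| (x - mx)**2 }
  slope = sxy / sxx
  [slope, my - slope * mx]
end

slope, intercept = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
puts "y = #{slope} * x + #{intercept}"
```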

Class 8 Time-series analysis (6/6)

Internet and time, Network Time Protocol, Time series analysis, exercise: time-series analysis, assignment 2, lecture slides, exercise data(autocorr_5min_data.txt, ifbps-201205.txt), exercise script(autocorr.rb, autocorr.plt, hourly_out.rb, hourly_out.plt, week_out.rb, week_out.plt, correlation_out.rb) data for assignment 2: ( (164MB), (365MB))

Class 9 Topology and graph (6/13)

Routing protocols, Graph theory, exercise: shortest-path algorithm, lecture slides, exercise script (dijkstra.rb), exercise data (topology.txt, jr.txt)
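
The shortest-path exercise uses Dijkstra's algorithm. The sketch below is a simple version that picks the closest unvisited node with a linear scan rather than a priority queue; the graph representation (a hash of neighbor costs) and the sample topology are assumptions, not the format of topology.txt.

```ruby
# graph: { node => { neighbor => cost } } with non-negative costs.
# Returns a hash of shortest distances from source.
def dijkstra(graph, source)
  dist = Hash.new(Float::INFINITY)
  dist[source] = 0
  unvisited = graph.keys

  until unvisited.empty?
    # Pick the unvisited node with the smallest known distance.
    u = unvisited.min_by { |n| dist[n] }
    break if dist[u] == Float::INFINITY # remaining nodes are unreachable

    unvisited.delete(u)
    graph[u].each do |v, cost|
      dist[v] = dist[u] + cost if dist[u] + cost < dist[v]
    end
  end
  dist
end

graph = {
  "A" => { "B" => 1, "C" => 4 },
  "B" => { "A" => 1, "C" => 2, "D" => 5 },
  "C" => { "A" => 4, "B" => 2, "D" => 1 },
  "D" => { "B" => 5, "C" => 1 }
}
p dijkstra(graph, "A")
```

The linear scan makes this O(n^2) in the number of nodes, which is fine for small topologies; a binary heap would be the usual choice for larger graphs.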

Class 10 Anomaly detection and machine learning (6/20)

Anomaly detection, Machine Learning, SPAM filtering and Bayes theorem, exercise: naive Bayesian filter, lecture slides, exercise script (naivebayes.rb)

Class 11 Data Mining (6/27)

Pattern extraction, Classification, Clustering, exercise: clustering, lecture slides, exercise scripts (k-means.rb, km-data.txt, km-results.plt)

Class 12 Search and Ranking (7/4)

Search systems, PageRank, exercise: PageRank algorithm, lecture slides, exercise scripts (pagerank.rb, sample-links.txt, links-100k.txt (38MB)), wikimedia pageview count dataset for the final report ( (1.5GB)
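
As a rough illustration of the PageRank algorithm, the sketch below runs a fixed number of power iterations on a tiny link graph. The damping factor of 0.85, the iteration count, the even redistribution of rank from pages without outgoing links, and the `pagerank` method itself are assumptions for this example; pagerank.rb may differ.

```ruby
# links: { page => [pages it links to] }
# Returns { page => rank }, with ranks summing to 1.
def pagerank(links, damping: 0.85, iterations: 50)
  nodes = (links.keys + links.values.flatten).uniq
  n = nodes.size
  rank = nodes.to_h { |node| [node, 1.0 / n] }

  iterations.times do
    # Rank held by pages with no outgoing links is spread evenly.
    dangling = nodes.sum { |node| links.fetch(node, []).empty? ? rank[node] : 0.0 }
    new_rank = nodes.to_h { |node| [node, (1 - damping) / n + damping * dangling / n] }
    links.each do |src, dsts|
      dsts.each { |dst| new_rank[dst] += damping * rank[src] / dsts.size }
    end
    rank = new_rank
  end
  rank
end

links = { "A" => ["C"], "B" => ["C"], "C" => ["A"] }
pagerank(links).each { |page, r| puts format("%s %.4f", page, r) }
```

In this sample graph, page C receives links from both A and B, so it ends up with the highest rank.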

Class 13 Scalable measurement and analysis (7/11)

Distributed parallel processing, Cloud computing technology, exercise: MapReduce algorithm, lecture slides, exercise script (wc-map.rb, wc-reduce.rb, wc-data.txt)
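
The MapReduce exercise is a word count split into a map script and a reduce script. The sketch below shows the same three steps (map, shuffle, reduce) inside a single process; the method names and the input lines are assumptions for illustration and are not the contents of wc-map.rb or wc-reduce.rb.

```ruby
# Map: emit a [word, 1] pair for every word in a line.
def map_phase(line)
  line.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Shuffle: group the emitted values by key (word).
def shuffle(pairs)
  pairs.group_by(&:first).transform_values { |kvs| kvs.map(&:last) }
end

# Reduce: sum the values collected for one word.
def reduce_phase(word, counts)
  [word, counts.sum]
end

lines = ["the quick brown fox", "the lazy dog", "The fox"]
pairs = lines.flat_map { |line| map_phase(line) }
counts = shuffle(pairs).map { |word, ones| reduce_phase(word, ones) }.to_h
p counts
```

In a real distributed setting, the map and reduce steps run on different machines and the framework performs the shuffle between them.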


$Date: 2016/07/08 06:40:33 $