Internet measurement and data analysis

2012 Fall Semester, Wednesday (14:45-16:15)

Faculty: Kenjiro Cho (kjc at sfc.keio.ac.jp)

TA: Yohei Kuga (sora at sfc.wide.ad.jp)

SA: Yukito Ueno (eden at sfc.wide.ad.jp)

Class home page: http://web.sfc.keio.ac.jp/~kjc/classes/sfc2012fmeasurement/

Class support mail (Faculty, TA and SA): imda at sfc.wide.ad.jp
Overview
It has become possible to access a huge amount of diverse data through
the Internet. This allows us to obtain new knowledge and create new
services, leading to innovations known as "Big Data" and "Collective
Intelligence".
In order to understand such data and use it as a tool, one needs to
have a good understanding of the technical background in
statistics, machine learning, and computer network systems.
In this class, you will get an overview of large-scale data analysis
on the Internet and learn basic skills for obtaining new knowledge
from massive amounts of information in the forthcoming information society.
Syllabus
Theme, Goals, Methods
In this class, you will learn data collection and data analysis
methods for the Internet, gaining knowledge and understanding of
networking technologies and large-scale data analysis.
Each class covers a specific topic, through which you will learn the
relevant technologies and the theories behind them.
In addition to the lecture, each class includes programming exercises
for building data analysis skills.
Textbooks, References
The lecture slide materials will be provided online.
ruby: http://www.ruby-lang.org/
gnuplot: http://gnuplot.info/
[1] Mark Crovella and Balachander Krishnamurthy.
Internet measurement: infrastructure, traffic, and applications.
Wiley, 2006.
[2] Pang-Ning Tan, Michael Steinbach and Vipin Kumar.
Introduction to Data Mining.
Addison Wesley, 2006.
[3] Raj Jain.
The art of computer systems performance analysis.
Wiley, 1991.
[4] Toby Segaran.
Programming Collective Intelligence.
O'Reilly Media, 2007.
[5] Allen B. Downey.
Think Stats: Probability and Statistics for Programmers.
O'Reilly Media, 2011.
Evaluation
2 assignments and a final report.
Prerequisites
The prerequisites for the class are basic programming skills and basic
knowledge of statistics.
In the exercises and assignments, you will need to write programs to
process large data sets, using the Ruby scripting language and the
Gnuplot plotting tool.
To understand the theoretical aspects, you will need basic knowledge
about algebra and statistics. However, the focus of the class is to
understand how mathematics is used for engineering applications.
Schedule
Class 1 Introduction (9/26)
Big Data and Collective Intelligence,
Internet measurement,
Large-scale data analysis,
exercise: introduction to the Ruby scripting language,
lecture slides,
(reading material)
Class 2 Data and variability (10/3)
Summary statistics,
Sampling,
How to make good graphs,
exercise: graph plotting with Gnuplot,
lecture slides,
exercise data (marathon.txt)
NO CLASS on 10/10
Class 3 Data recording and log analysis (10/17)
Network management tools,
Data format,
Log analysis methods,
exercise: log data and regular expressions,
lecture slides,
exercise data (sample_access_log.bz2 (14MB),
zip version (28MB),
test100lines)
Class 4 Distribution and confidence intervals (10/24)
Normal distribution,
Confidence intervals and statistical tests,
Distribution generation,
exercise: confidence intervals,
assignment 1,
lecture slides,
data for assignment 1 (honolulu2010.txt)
Class 5 Diversity and complexity (10/31)
Long tail,
Web access and content distribution,
Power-law and complex systems,
exercise: power-law analysis,
lecture slides
Class 6 Correlation (11/7)
Online recommendation systems,
Distance,
Correlation coefficient,
exercise: correlation analysis,
lecture slides,
exercise data:
(correlationdata1.txt,
correlationdata2.txt)
Class 7 Multivariate analysis (11/14)
Data sensing,
Linear regression,
Principal Component Analysis,
exercise: linear regression,
lecture slides
Class 8 Time-series analysis (11/20) ***makeup class
Internet and time,
Network Time Protocol,
Time series analysis,
exercise: time-series analysis,
assignment 2,
lecture slides,
exercise data:
(autocorr_5min_data.txt,
ifbps2011.txt),
data for assignment 2 (ifbps2012.txt)
Class 9 Topology and graph (11/28)
Routing protocols,
Graph theory,
exercise: shortest-path algorithm,
lecture slides,
exercise data (topology.txt,
dijkstra.rb)
Class 10 Anomaly detection and machine learning (12/5)
Anomaly detection,
Machine Learning,
Spam filtering and Bayes' theorem,
exercise: naive Bayesian filter,
lecture slides,
exercise script (naivebayes.rb)
Class 11 Data Mining (12/12)
Pattern extraction,
Classification,
Clustering,
exercise: clustering,
lecture slides,
exercise script and data (kmeans.rb, kmdata.txt)
Class 12 Search and Ranking (12/19)
Search systems,
PageRank,
exercise: PageRank algorithm,
lecture slides,
exercise data (pagerank.rb, samplelinks.txt)
Class 13 Scalable measurement and analysis (12/26)
Distributed parallel processing,
Cloud computing technology,
MapReduce,
exercise: MapReduce algorithm,
lecture slides
Class 14 Privacy Issues (1/9)
Internet data analysis and privacy issues,
Summary of the class,
lecture slides
Last updated: 2013/01/08 09:20:40