Data Mining:
Statistical Modeling and Learning from Data

Schedule (11-15 January)

	Monday General Concepts	Tuesday Linear Models	Wednesday Unsupervised ML Non-linear Models	Thursday SVM and VC theory	Friday Evaluation
9:30-10:30	Theory	Theory	Theory	Theory	Individual Evaluation
10:30-10:45	Break
10:45-11:45	Practice	Practice	Practice	Practice	Individual Evaluation
11:45-13:30	Lunch Break
13:30-14:30	Theory	Theory	Theory	Theory	Group Project
14:30-15:30	Theory	Practice	Theory	Theory
15:45-16:45	Practice		Practice	Practice	Group Project Presentation

Venue: ENS Lyon, site Monod, Amphi B (entrance from the 4th floor)
Time: 9:30 - 16:45
External participants who have no access to the building should contact Marton Karsai (marton.karsai@ens-lyon.fr) in advance.

Bibliography

- Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin, "Learning from Data", AMLBook 2012

- David J. Hand, Heikki Mannila, Padhraic Smyth, "Principles of Data Mining", MIT Press 2011

Final Project

Part of the final evaluation is a group project with an oral presentation of the results. The project involves submitting a solution to a Kaggle class competition. If you are not familiar with how Kaggle works, we strongly recommend that you try making a submission to one of the competitions.

Lecture contents

General concepts of machine learning (learning problem, approximation-generalization, learning curve…)

Linear models (linear regression, logistic regression, Lasso)

Non-linear models (SVM, naive Bayes, decision tree, neural networks)

Unsupervised ML (SVD, NMF, k-means, text analysis)

General Description

The course aims to provide basic skills for the analysis and statistical modeling of data, with special attention to machine learning, both supervised and unsupervised. An important objective is operational knowledge of the techniques and algorithms covered, so the lectures address both theoretical and practical aspects of machine learning. The practical part requires a good knowledge of programming, preferably in Python. The expected outcomes are (1) an understanding of the theoretical foundations of machine learning and (2) the ability to use some Python libraries for machine learning in simple applications.

Topics will include:

The major paradigms of learning from data, the learning problem, the feasibility of learning
The architecture of machine learning algorithms: model structure, scoring, and model selection
The theory of generalization, model complexity, the approximation/generalization trade-off, bias and variance, the learning curve
Score functions and optimization techniques. Gradient descent and stochastic gradient descent.
Validation and Cross-Validation: validation set, leave-one-out cross-validation, K-fold cross-validation
Linear Models: linear classification, linear regression, ordinary least squares, logistic regression, nonlinear transformations
Nonlinear models for classification: support vector machines, tree models, nearest-neighbor methods, Naive Bayes
Overfitting and Regularization: model complexity and overfitting, commonly used regularizers, Lasso.
Unsupervised learning: cluster analysis, the K-means algorithm, hierarchical clustering
Feature selection and dimensionality reduction: Singular Value Decomposition, Matrix Factorisation
Information retrieval, text representation and classification, term weighting
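
Two of the topics listed above, ordinary least squares and K-fold cross-validation, can be illustrated together in a few lines. The following is a minimal sketch and not part of the course material: the function names (`k_fold_indices`, `fit_line`, `cross_validate`) and the toy data are made up for illustration, and only the Python standard library is used. In the practical sessions a library such as scikit-learn (for example `KFold` and `cross_val_score` in `sklearn.model_selection`) would normally do this work.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (closed form, one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    return a, mean_y - a * mean_x

def cross_validate(xs, ys, k=5, seed=0):
    """Return the validation mean squared error averaged over k folds."""
    folds = k_fold_indices(len(xs), k, seed)
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the k-1 training folds, evaluate on the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (a * xs[i] + b)) ** 2 for i in held_out) / len(held_out)
        errors.append(mse)
    return sum(errors) / k

# Toy data (hypothetical): y = 2x + 1 plus small Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]
cv_error = cross_validate(xs, ys, k=5)
print(f"5-fold CV mean squared error: {cv_error:.4f}")
```

Setting `k` equal to the number of samples turns the same procedure into leave-one-out cross-validation.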

An overview of the theoretical aspects of machine learning will be followed by the application of algorithms to real problems such as image classification, text mining, and spam detection. The exercises will be implemented in an interactive Python environment, using standard tools for data analysis and visualization such as the Scientific Python stack, scikit-learn, pandas and NLTK.

Material required

Each participant is asked to bring a personal computer such as a laptop (at least one per two participants), which will be used intensively during the practical lessons and the project development sessions. It should have Wi-Fi and the Anaconda Python distribution installed. The Anaconda distribution can be downloaded [here].
The project defense also requires each group of participants to prepare slides for their presentation (PowerPoint, Keynote, or LaTeX/PDF).
If you cannot bring your own computer, please contact the secretary [mail] as soon as possible to reserve a laptop (very limited quantity, first come, first served).