Massive datasets (M2MO)

Where and when?

  • From February 14 to March 28, 2019
  • Thursdays, from 2 pm to 5 pm
  • Room 202, Olympe de Gouges Building.

Roadmap

  • Nearest neighbor methods and Locality-Sensitive Hashing (LSH)
  • Dimension reduction, SVD, Principal Component Analysis (PCA), Sparse PCA
  • Spectral Clustering
  • Non-negative Matrix Factorization and topic modelling
  • Random projections and compressed sensing
  • Handling streaming data
  • Gradient descent methods

References

  • Foundations of Data Science. John Hopcroft and Ravi Kannan.
  • Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, and Jeff Ullman. Cambridge University Press.
  • Statistics for High-Dimensional Data: Methods, Theory and Applications. Peter Bühlmann and Sara van de Geer. Springer.
  • An Introduction to Statistical Learning: with Applications in R. Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. Springer.
  • Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. Wes McKinney. O’Reilly.
  • Advanced R Programming. Hadley Wickham. Chapman & Hall.
  • Rcpp. Dirk Eddelbuettel. Springer.
  • Automated Data Collection with R. S. Munzert, C. Rubba, P. Meissner, and D. Nyhuis. Wiley.
  • Matrix Methods in Data Mining and Pattern Recognition. L. Eldén. SIAM.

Software

  • Python 3.6.5
  • NumPy, SciPy, pandas
  • spaCy
  • Datasketch
  • CVXPY