Data Science References

I’m involved in several Data Science programs in both initial and continuing education. They differ depending on the audience, but the backbone is the same. To get the most out of such programs, a basic knowledge of math, computer science and programming is required. This spares you from struggling with vocabulary or programming, so that you can focus on data science during your training program.

There are several excellent references to help you before, during and after such a program. I propose here a personal list of my favorites. The + signs indicate the technical level of a book: this ranges from none for a simple, not too technical book to +++++ for a very advanced book (research level, for instance).

My favorite books

[TLDR] My list is long. If you want a short one, here are my choices:

  • If you need a recap on maths:
    • Mathematics for Machine Learning, M. P. Deisenroth, A. A. Faisal & C. S. Ong, Cambridge University Press (2020)
      • A good recap of the essential mathematical tools for machine learning.
  • If you need a recap on statistics:
    • Discovering Statistics Using R, A. Field, J. Miles & Z. Field, SAGE Publications (2012)
      • A plain introduction to statistics using R with as little math as possible while staying correct.
  • If you need an intro to Python or R:
  • If you want a first reference on Python or R for Data Science:
    • Python Data Science Handbook, J. VanderPlas, O’Reilly (2016) (++)
      • An introduction to data science (from data ingestion to modeling) with Python.
    • R for Data Science, H. Wickham & G. Grolemund, O’Reilly (2017) (++)
      • An introduction to the R tidyverse with an eye toward data science applications.
  • If you want a first reference on Learning and Statistical Learning:
    • Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow (2nd ed.), A. Géron, O’Reilly (2019) (+++)
      • A practical ML tutorial in Python with a strong focus on Deep Learning.
    • The Elements of Statistical Learning (2nd ed.), T. Hastie, R. Tibshirani & J. Friedman, Springer (2011) (+++)
      • The most classical reference on statistical learning.

Prerequisite for a Data Science training program

To follow a Data Science training program, you should have the required vocabulary in math and computer science as well as some programming skills. You are not supposed to be an expert, but a basic knowledge, as exemplified by the chapters listed for the books below, will prove very helpful.

Mathematics

  • Essential Mathematics for Political and Social Research, J. Gill, Cambridge University Press (2006)
    • A plain introduction to the essential math tools. Chapters 1, 3, 5, 7 and 8 cover most of the required math skills.
  • Mathematics for Machine Learning, M. P. Deisenroth, A. A. Faisal & C. S. Ong, Cambridge University Press (2020)
    • A slightly more advanced recap of the essential mathematical tools for machine learning.

Probability and Statistics

  • Discovering Statistics Using R, A. Field, J. Miles & Z. Field, SAGE Publications (2012)
    • A plain introduction to statistics using R with as little math as possible while staying correct.
  • All of Statistics, L. Wasserman, Springer (2004) (++)
    • A more mathematical presentation of statistics starting from probability and its vocabulary. Much more advanced than required after the first three chapters. (Chap. 1, 2, 3)

Numerical Analysis

  • Numerical Analysis for Statisticians (2nd ed.), K. Lange, Springer (2010) (++)
    • A survey of analysis and linear algebra used in statistics and machine learning. (Chap. 6, 7, 8, 9)
  • An Introduction to Optimization (4th ed.), E. Chong & S. Żak, Wiley (2013) (++)
    • A classical mathematical introduction to optimization. (Chap. 5, 7, 8, 20)

Database

  • Getting Started with SQL, Th. Nield, O’Reilly (2016) (+)
    • A good introduction to SQL, a language that is often assumed to be known.

Python/R

References for a typical Data Science program

For any topic in a Data Science program, there are many books, and many of them are excellent. Here are my personal choices.

Data Science

  • Think Like a Data Scientist, B. Godsey, Manning (2017) (++)
    • A very nice introduction to Data Science focused on processes rather than tools.
  • Doing Data Science, C. O’Neil & R. Schutt, O’Reilly (2013) (++)
    • A near-verbatim record of a course given at Columbia in 2012. It describes some methods and gives many usage examples.
  • The Data Science Design Manual, S. Skiena, Springer (2016) (+++)
    • A quite comprehensive tour of Data Science with a mild technical level.
  • Build a Career in Data Science, E. Robinson & J. Nolis, Manning (2020) (+)
    • An excellent guide to a career as a Data Scientist.
  • Analytical Skills for AI and Data Science, D. Vaughan, O’Reilly (2020) (++)
    • A very interesting book on how to translate business issues into data science terms.
  • Succeeding with AI, V. Krunic, Manning (2020) (++)
    • A book similar to the previous one and also very interesting.

Data Science with Python

  • Python Data Science Handbook, J. VanderPlas, O’Reilly (2016) (++)
    • An introduction to data science (from data ingestion to modeling) with Python.
  • Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow (2nd ed.), A. Géron, O’Reilly (2019) (+++)
    • A practical ML tutorial in Python with a strong focus on Deep Learning.
  • Python for Data Analysis, W. McKinney, O’Reilly (2017)
    • Another introduction to data science by the author of pandas

Data Science with R

  • R for Data Science, H. Wickham & G. Grolemund, O’Reilly (2017) (++)
    • An introduction to the R tidyverse with an eye toward data science applications.
  • Modern Data Science with R (2nd ed.), B. Baumer, D. Kaplan & J. Horton, CRC Press (2021) (++)
    • An introduction to data science (from data ingestion to modeling) with a focus on the use of R and the tidyverse.

Visualization

  • ggplot2 (2nd ed.), H. Wickham, Springer (2016) (++)
    • A description of an implementation of the Grammar of Graphics in R
  • Data Visualization: A Practical Introduction, K. Healy, Princeton University Press (2018) (+)
    • A very nice introduction to visualization relying on R and ggplot2
  • Visualization Analysis and Design, T. Munzner, CRC Press (2014) (++)
    • A comprehensive book on visualization: from principles to examples.
  • The Truthful Art, A. Cairo, New Riders (2016) (+)
    • A beautiful book on data visualization.

Statistical Learning

  • Linear Regression, D. Olive, Springer (2017)
    • A good introduction to linear models
  • The Elements of Statistical Learning (2nd ed.), T. Hastie, R. Tibshirani & J. Friedman, Springer (2011) (+++)
    • The most classical reference on statistical learning.
  • An Introduction to Statistical Learning with Applications in R (2nd ed.), G. James, D. Witten, T. Hastie & R. Tibshirani, Springer (2021) (++)
    • The updated companion book with R labs.

Unsupervised Learning

  • Handbook of Cluster Analysis, Ch. Hennig, M. Meila, F. Murtagh & R. Rocci, Chapman and Hall/CRC (2015) (+++)
    • A reference on clustering
  • Model-Based Clustering and Classification for Data Science: With Applications in R, Ch. Bouveyron, G. Celeux, B. Murphy & A. Raftery, Cambridge University Press (2019) (+++)
    • A comprehensive book on model-based clustering

NoSQL and Big Data

  • Next Generation Databases, G. Harrison, Apress (2015) (+++)
    • An excellent book on databases, starting from a high-level description of the most common SQL/NoSQL databases and ending with a comprehensive analysis of the issues raised by distributing a database.
  • Spark: The Definitive Guide, M. Zaharia and B. Chambers, O’Reilly (2018) (++)
    • A comprehensive survey on Spark with Scala, Python and R

Deep Learning

  • Deep Learning with Python, F. Chollet, Manning (2017) (++)
  • Deep Learning with R, F. Chollet with J.J. Allaire, Manning (2018) (++)
    • Two versions of the same excellent book on deep learning with Keras.

Advanced Topics

It’s impossible to cover everything in an introductory program. Here are my favorite references on some advanced topics.

Statistics

  • Statistical Rethinking (2nd ed.), R. McElreath, CRC Press (2020)
    • An introduction to Bayesian methods (+++)
  • Introduction to High-Dimensional Statistics, Ch. Giraud, CRC Press (2014)
    • Very interesting survey of high-dimensional issues (+++++)
  • Introduction to Nonparametric Estimation, A. Tsybakov, Springer (2008)
    • Very good mathematical introduction (+++)
  • Exploratory Multivariate Analysis by Example Using R (2nd ed.), F. Husson, S. Lê & J. Pagès, CRC Press (2017)
    • A very good book on Data Analysis à la Benzécri.
  • Computer Age Statistical Inference, B. Efron & T. Hastie, Cambridge University Press (2016)
    • Historical perspective on statistics with insights on modern techniques (+++++)

Machine Learning

  • Foundations of Machine Learning, M. Mohri, A. Rostamizadeh & A. Talwalkar, MIT Press (2012) (++++)
    • An introduction to Machine Learning from a PAC point of view.
  • Understanding Machine Learning, S. Shalev-Shwartz & S. Ben-David, Cambridge University Press (2014) (++++)
    • A more involved description with a similar point of view.

Recommender systems

  • Practical Recommender Systems, K. Falk, Manning (2019) (+)
    • A very nice introduction to recommender systems
  • Recommender Systems, The Textbook, Ch. Aggarwal, Springer (2016) (++)
    • A still very interesting textbook

Reinforcement Learning

  • Reinforcement Learning: An Introduction (2nd ed.), R. Sutton & A. Barto, MIT Press (2018) (++)
    • The most classical introduction to RL

Feature Engineering

  • The Art of Feature Engineering, P. Duboue, Cambridge University Press (2020) (+)
    • A good review of feature engineering

Time Series and Spatial Data

  • Advances in Financial Machine Learning, M. Lopez de Prado, Wiley (2018) (+++)
    • A good survey of time series analysis with a focus on financial applications
  • Forecasting: Principles and Practice (3rd ed.), R. Hyndman & G. Athanasopoulos, OTexts (2021) (++)
    • A quite comprehensive survey of time series
  • Spatio-Temporal Statistics With R, Ch. Wikle, A. Zammit-Mangion & N. Cressie, CRC Press (2019) (++)
    • A very nice book on spatio-temporal data except for the model validation part…

Natural Language Processing

  • Natural Language Processing in Action, H. Lane, C. Howard & H. Hapke, Manning (2019)
    • An excellent introduction to Natural Language Processing in Python

Generative Approach

  • Generative Deep Learning, D. Foster, O’Reilly (2019) (+++)

Explainability

  • Interpretable Machine Learning, Ch. Molnar, leanpub.com (2020) (+++)
    • An excellent survey on interpretability

Large Scale Data Science

  • Big Data Computing: A Guide for Business and Technology Managers, V. Kale, CRC Press (2016) (++)
    • A detailed guide to Big Data frameworks (databases and computing engines) describing the issues and the distributed solutions.
  • Big Data 2.0 Processing Systems. A Survey, S. Sakr, Springer (2016) (++)
    • A shorter survey focusing on distributed computing engines (Hadoop, Spark…)
  • Data Science with Python and Dask, J. Daniel, Manning (2019) (+++)
    • A survey of Dask, a lightweight distributed computing framework
  • Mastering Spark with R, J. Luraschi, K. Kuo & E. Ruiz, O’Reilly (2020) (+++)
    • A guide to Spark with R
  • Learning Spark, J. Damji, B. Wenig, T. Das & D. Lee, O’Reilly (2020) (+++)
    • A comprehensive guide to Spark with Scala and Python

Data Management

  • Master Data Management, D. Loshin, Morgan Kaufmann (2008) (++)
    • A classic but still relevant book
  • Foundations for Architecting Data Solutions, T. Malaska & J. Seidman, O’Reilly (2018) (++)
    • An excellent reference with an emphasis on scaling
  • Data Management at Scale, P. Strengholt, O’Reilly (2020) (++++)
    • An opinionated vision of how to scale Data Management
  • Principles of Database Management, W. Lemahieu, S. vanden Broucke & B. Baesens, Cambridge University Press (2018) (++)
    • A comprehensive survey starting from classical databases and SQL to scaling and hardware issues

ML Ops / Data Ops

  • Effective DevOps, J. Davis and K. Daniels, O’Reilly (2016)
    • A very interesting introduction to the DevOps mindset
  • The Data Ops Cookbook (2nd ed.), Ch. Bergh, G. Benghiat and E. Strod, DataKitchen (2019)
    • A comprehensive introduction to Data Ops, not tightly linked to the Data Kitchen solution
  • Data Teams, J. Anderson, Apress (2020) (+)
    • A good introduction to DevOps and team-related issues
  • Machine Learning Engineering in Action, B. Wilson, Manning (2022) (++)
    • An excellent introduction to ML Engineering going from Business understanding to production

Python

  • Introducing Python: Modern Computing in Simple Packages (2nd ed.), B. Lubanovic (2019) (++)
    • An introduction to Python
  • High Performance Python (2nd ed.), M. Gorelick & I. Ozsvald, O’Reilly (2020) (+++)
    • A good reference on Python optimization

R

  • Hands-On Programming with R, G. Grolemund, O’Reilly (2014) (++)
    • An introduction to R programming.
  • Advanced R (2nd ed.), H. Wickham, CRC Press (2019) (+++)
    • A quite advanced book on R with a lot of insights.
  • Extending R, J. Chambers, CRC Press (2016) (++++)
    • A book from the creator of S, the ancestor of R, interesting from both the historical and the technical point of view thanks to its description of J. Chambers’s vision.

Information Theory

  • Information Theory, Inference and Learning Algorithms, D. MacKay, Cambridge University Press (2003) (+++)
    • A book stressing the connection between Information Theory and learning.
  • Elements of Information Theory, Th. Cover & J. Thomas, Wiley (1991) (++++)
    • The reference book on Information Theory, a topic not so far from learning.

Ethics

  • Weapons of Math Destruction, C. O’Neil, Crown (2016)
    • An excellent book on the dangers of data science
  • Ethics of Big Data, K. Davis, O’Reilly (2012)
    • A good introductory book

Optimization

  • First-Order Methods in Optimization, A. Beck, SIAM (2017)
    • A comprehensive book on first-order optimization

Professor of Applied Mathematics. My research and teaching interests range from signal processing to data science.