Data Science References

I am involved in several Data Science programs in both initial and continuing education. They differ depending on the audience, but the backbone is the same. To get the most out of such a program, you need a basic knowledge of math, computer science and programming. This spares you from struggling with vocabulary or programming, so that you can focus on data science during your training.

There are several excellent references to help you before, during and after such a program. Here is a personal list of my favorites. The number of + signs indicates the technical level of a book: it ranges from none for a simple, not too technical book to +++++ for a very advanced one (research level, for instance).

My favorite books

[TLDR] My list is long. If you want a short one, here are my choices:

  • If you need a recap on maths:
    • Mathematics for Machine Learning, M. P. Deisenroth, A. A. Faisal & C. S. Ong, Cambridge University Press (2020)
      • A good recap of the essential mathematical tools for machine learning.
  • If you need a recap on statistics:
    • Discovering Statistics Using R, A. Field, J. Miles & Z. Field, SAGE Publications (2012)
      • A plain introduction to statistics using R, with as little math as possible while staying correct.
  • If you need an intro to Python or R:
  • If you want a first reference on Python or R for Data Science:
    • Python Data Science Handbook (2nd ed.), J. VanderPlas, O’Reilly (2022) (++)
      • An introduction to data science (from data ingestion to modeling) with Python.
    • R for Data Science (2nd ed.), H. Wickham, M. Çetinkaya-Rundel & G. Grolemund, O’Reilly (2023) (++)
      • An introduction to the R tidyverse with an eye on data science applications.
  • If you want a first reference on Learning and Statistical Learning:
    • Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.), A. Géron, O’Reilly (2022) (+++)
      • A practice-oriented ML tutorial in Python with a strong focus on Deep Learning.
    • The Elements of Statistical Learning (2nd ed.), T. Hastie, R. Tibshirani & J. Friedman, Springer (2011) (+++)
      • The most classical reference on statistical learning.

Prerequisite for a Data Science training program

In order to follow a Data Science training program, you should have the required vocabulary in math and computer science as well as some programming skills. You are not supposed to be an expert, but a basic knowledge, as exemplified by the chapters in the books below, will prove very helpful.

Mathematics

  • Essential Mathematics for Political and Social Research, J. Gill, Cambridge University Press (2006)
    • A plain introduction to the essential math tools. Chapters 1, 3, 5, 7 and 8 cover most of the required math skills.
  • Mathematics for Machine Learning, M. P. Deisenroth, A. A. Faisal & C. S. Ong, Cambridge University Press (2020)
    • A slightly more advanced recap of the essential mathematical tools for machine learning.

Probability and Statistics

  • Discovering Statistics Using R, A. Field, J. Miles & Z. Field, SAGE Publications (2012)
    • A plain introduction to statistics using R, with as little math as possible while staying correct.
  • All of Statistics, L. Wasserman, Springer (2004) (++)
    • A more mathematical presentation of statistics starting from probability and its vocabulary. Much more advanced than required after the first three chapters. (Chap. 1, 2, 3)

Numerical Analysis

  • Numerical Analysis for Statisticians (2nd ed.), K. Lange, Springer (2010) (++)
    • A survey of analysis and linear algebra used in statistics and machine learning. (Chap. 6, 7, 8, 9)
  • An Introduction to Optimization (4th ed.), E. Chong & S. Żak, Wiley (2013) (++)
    • A classical mathematical introduction to optimization. (Chap. 5, 7, 8, 20)

Database

  • Getting Started with SQL, Th. Nield, O’Reilly (2016) (+)
    • A good introduction to SQL, a language that is often assumed to be known.

Python/R

References for a typical Data Science program

For every topic in a Data Science program there are many books, and a lot of them are excellent. Here are my personal choices.

Data Science

  • Think Like a Data Scientist, B. Godsey, Manning (2017) (++)
    • A very nice introduction to Data Science focused on processes rather than tools.
  • Doing Data Science, C. O’Neil & R. Schutt, O’Reilly (2013) (++)
    • A written record of a course given at Columbia in 2012. It describes some methods and a lot of use cases.
  • The Data Science Design Manual, S. Skiena, Springer (2016) (+++)
    • A quite comprehensive tour of Data Science with a mild technical level.
  • Build a Career in Data Science, E. Robinson & J. Nolis, Manning (2020) (+)
    • An excellent guide to a career as a Data Scientist
  • Analytical Skills for AI and Data Science, D. Vaughan, O’Reilly (2020) (++)
    • A very interesting book to learn how to translate business issues into data science terms.
  • Succeeding with AI, V. Krunic, Manning (2020) (++)
    • A book similar to the previous one and also very interesting.
  • Why Data Science Projects Fail, D. Gray & E. Shellshear, CRC Press (2024) (+)
    • A very interesting book giving advice through the lens of failures.

Data Science with Python

  • Python Data Science Handbook (2nd ed.), J. VanderPlas, O’Reilly (2022) (++)
    • An introduction to data science (from data ingestion to modeling) with Python.
  • Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.), A. Géron, O’Reilly (2022) (+++)
    • A practical ML tutorial in Python with a strong focus on Deep Learning.
  • Python for Data Analysis (3rd ed.), W. McKinney, O’Reilly (2022)
    • Another introduction to data science by the author of pandas

Data Science with R

  • R for Data Science (2nd ed.), H. Wickham, M. Çetinkaya-Rundel & G. Grolemund, O’Reilly (2023) (++)
    • An introduction to the R tidyverse with an eye on data science applications.
  • Modern Data Science with R (2nd ed.), B. Baumer, D. Kaplan & J. Horton, CRC Press (2021) (++)
    • An introduction to data science (from data ingestion to modeling) with a focus on the use of R and the tidyverse.

Visualization

  • ggplot2 (2nd ed.), H. Wickham, Springer (2016) (++)
    • A description of an implementation of the Grammar of Graphics in R
  • Data Visualization: A Practical Introduction, K. Healy, Princeton University Press (2018) (+)
    • A very nice introduction to visualization relying on R and ggplot2
  • Visualization Analysis and Design, T. Munzner, CRC Press (2014) (++)
    • A comprehensive book on visualization: from principle to examples.
  • The Truthful Art, A. Cairo, New Riders (2016) (+)
    • A beautiful book on data visualization.

Statistical Learning

  • Linear Regression, D. Olive, Springer (2017)
    • A good introduction to linear models
  • The Elements of Statistical Learning (2nd ed.), T. Hastie, R. Tibshirani & J. Friedman, Springer (2011) (+++)
    • The most classical reference on statistical learning.
  • An Introduction to Statistical Learning with Applications in R (2nd ed.), G. James, D. Witten, T. Hastie & R. Tibshirani, Springer (2021) (++)
  • An Introduction to Statistical Learning with Applications in Python, G. James, D. Witten, T. Hastie & R. Tibshirani, Springer (2023) (++)
    • The updated companion book with labs in R or Python.

Unsupervised Learning

  • Handbook of Cluster Analysis, Ch. Hennig, M. Meila, F. Murtagh & R. Rocci, Chapman and Hall/CRC (2015) (+++)
    • A reference on clustering
  • Model-Based Clustering and Classification for Data Science: With Applications in R, Ch. Bouveyron & G. Celeux & B. Murphy & A. Raftery, Cambridge University Press (2019) (+++)
    • A comprehensive book on model based clustering

NoSQL and Big Data

  • Next Generation Databases, G. Harrison, Apress (2015) (+++)
    • An excellent book on databases starting from a high level description of the most common SQL/NoSQL databases and ending with a comprehensive analysis of the database distribution issues.
  • Spark: The Definitive Guide, M. Zaharia and B. Chambers, O’Reilly (2018) (++)
    • A comprehensive survey on Spark with Scala, Python and R

Deep Learning

  • Deep Learning with Python (2nd ed.), F. Chollet, Manning (2021) (++)
  • Deep Learning with R (2nd ed.), F. Chollet with T. Kalinowski and J.J. Allaire, Manning (2022) (++)
    • Two versions of this excellent book on deep learning with Keras.
  • The Little Book of Deep Learning, F. Fleuret (2023) (++)
    • A very good overview
  • Deep Learning: Foundations and Concepts, Ch. Bishop & H. Bishop, Springer (2023) (++)
    • A comprehensive introduction to deep learning building everything from scratch
  • The Elements of Differentiable Programming, M. Blondel & V. Roulet, arXiv (2024) (+++)
    • A very interesting book on the optimization part of deep learning.

Advanced Topics

It’s impossible to cover everything in an introductory program. Here are my favorite references on some advanced topics.

Statistics

  • Statistical Rethinking (2nd ed.), R. McElreath, CRC Press (2020)
    • An introduction to Bayesian methods (+++)
  • Introduction to High-Dimensional Statistics (2nd ed.), Ch. Giraud, CRC Press (2022)
    • Very interesting survey of high-dimensional issues (+++++)
  • Introduction to Nonparametric Estimation, A. Tsybakov, Springer (2008)
    • Very good mathematical introduction (+++)
  • Exploratory Multivariate Analysis by Example Using R (2nd ed.), F. Husson, S. Lê & J. Pagès, CRC Press (2017)
    • A very good book on Data Analysis à la Benzécri.
  • Computer Age Statistical Inference, B. Efron & T. Hastie, Cambridge University Press (2016)
    • Historical perspective on statistics with insights on modern techniques (+++++)

Machine Learning

  • Foundations of Machine Learning (2nd ed.), M. Mohri, A. Rostamizadeh & A. Talwalkar, MIT Press (2018) (++++)
    • An introduction to Machine Learning from a PAC point of view.
  • Understanding Machine Learning, S. Shalev-Shwartz & S. Ben-David, Cambridge University Press (2014) (++++)
    • A more involved description with a similar point of view.
  • Learning Theory from First Principles, F. Bach, MIT Press (2024) (++++)
    • A comprehensive book putting optimization (and probabilistic analysis) at the core
  • Inference and Learning from Data, A. Sayed, Cambridge University Press (2023)
    • A very comprehensive book with a lot of proofs
  • Probabilistic Machine Learning: an Introduction, K. Murphy, MIT Press (2022)
    • A book covering machine learning from a probabilistic point of view.

Feature Engineering

  • Feature Engineering for Machine Learning, A. Zheng and A. Casari, O’Reilly (2018)
    • A classic on the subject
  • The Art of Feature Engineering, P. Duboue, Cambridge University Press (2020) (+)
    • A good review of feature engineering

Clustering

  • Data Clustering: Algorithms and Applications, Ch. Aggarwal and Ch. Reddy, Chapman and Hall/CRC (2013)
    • A good classic
  • Handbook of Cluster Analysis, Ch. Hennig, M. Meila, F. Murtagh, and R. Rocci, Chapman and Hall/CRC (2015)
    • Another one
  • Model-Based Clustering and Classification for Data Science, Ch. Bouveyron, G. Celeux, B. Murphy, and A. Raftery, Cambridge University Press (2019)
    • A more recent book focusing on probabilistic approaches

Dimension Reduction

  • Elements of Dimensionality Reduction and Manifold Learning, B. Ghojogh, M. Crowley, F. Karray, and A. Ghodsi, Springer (2023)
    • One of the few books on the topic

Generative Modeling

  • Deep Generative Modeling, K. Tomczak, Springer (2021) (+++)
    • A quite comprehensive review
  • Generative Deep Learning (2nd ed.), D. Foster, O’Reilly (2023) (+++)
    • A recent book covering most techniques

Recommender systems

  • Practical Recommender Systems, K. Falk, Manning (2019) (+)
    • A very nice introduction to recommender systems
  • Recommender Systems Handbook (3rd ed.), F. Ricci, L. Rokach, and B. Shapira, Springer (2022) (++)
    • A quite comprehensive textbook

Reinforcement Learning

  • Reinforcement Learning: an Introduction (2nd ed.), R. Sutton & A. Barto, MIT Press (2018) (++)
    • The most classical introduction to RL
  • Markov Decision Processes in Artificial Intelligence, O. Sigaud and O. Buffet, Wiley (2010) (+++)
    • A very good math oriented reference
  • Markov Decision Processes. Discrete Stochastic Dynamic Programming, M. Puterman, Wiley (2005) (+++)
    • The reference on MDP
  • Neuro-Dynamic Programming, D. Bertsekas and J. Tsitsiklis, Athena Scientific (1996) (++++)
    • An impressive book from the 90’s covering almost everything
  • Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions, W. Powell, Wiley (2022) (+++)
    • A book based on the control point of view
  • Control Systems and Reinforcement Learning, S. Meyn, Cambridge University Press (2022) (++++)
    • A very interesting, but challenging, book
  • Stochastic Approximation: A Dynamical Systems Viewpoint, V. Borkar, Springer (2008) (+++)
    • A reference on stochastic approximation
  • Bandit Algorithms, T. Lattimore and Cs. Szepesvári, Cambridge University Press (2020) (+++)
    • A book dedicated to the bandit model

Optimization

  • First-Order Methods in Optimization, A. Beck, SIAM (2017) (+++)
    • A comprehensive book on first order optimization
  • Convex Optimization: Algorithms and Complexity, S. Bubeck, Now Publishers (2015) (+++)
    • A shorter book including stochastic optimization
  • Lectures on Convex Optimization (2nd ed.), Y. Nesterov, Springer (2018) (+++++)
    • A comprehensive reference!

Time Series and Spatial Data

  • Advances in Financial Machine Learning, M. Lopez de Prado, Wiley (2018) (+++)
    • A good survey of time series analysis with a focus on financial applications
  • Forecasting: Principles and Practice (3rd ed.), R. Hyndman & G. Athanasopoulos, OTexts (2021) (++)
    • A quite comprehensive survey of time series
  • Forecasting: Principles and Practice, the Pythonic Way, R. Hyndman, G. Athanasopoulos, A. Garza, C. Challu, M. Mergenthaler Canseco & K. Olivares, OTexts (2025) (++)
    • A quite comprehensive survey of time series in Python
  • Spatio-Temporal Statistics With R, Ch. Wikle, A. Zammit-Mangion & N. Cressie, CRC Press (2019) (++)
    • A very nice book on spatio-temporal data except for the model validation part…

Natural Language Processing

  • Natural Language Processing in Action, H. Lane, C. Howard & H. Hapke, Manning (2019)
    • An excellent introduction to Natural Language Processing in Python
  • Natural Language Processing with Transformers, L. Tunstall, L. von Werra & Th. Wolf, O’Reilly (2022)
    • A book oriented toward the use of Transformers

Explainability

  • Interpretable Machine Learning (2nd ed.), Ch. Molnar, leanpub.com (2022) (+++)
    • An excellent survey on interpretability

Large Scale Data Science

  • Big Data Computing: A Guide for Business and Technology Managers, V. Kale, CRC Press (2016) (++)
    • A detailed guide to the Big Data frameworks (databases and computing engines) describing the issues and the distributed solutions.
  • Big Data 2.0 Processing Systems. A Survey, S. Sakr, Springer (2016) (++)
    • A shorter survey focusing on distributed computing engines (Hadoop, Spark…)
  • Data Science with Python and Dask, J. Daniel, Manning (2019) (+++)
    • A survey of Dask, a lightweight distributed computing framework
  • Scaling Python with Dask, H. Karau & M. Kimmins, O’Reilly (2023) (+++)
    • More on Dask
  • Mastering Spark with R, J. Luraschi, K. Kuo & E. Ruiz, O’Reilly (2020) (+++)
    • A guide to Spark with R
  • Learning Spark, J. Damji, B. Wenig, T. Das & D. Lee, O’Reilly (2020) (+++)
    • A comprehensive guide to Spark with Scala and Python

Data Management

  • Master Data Management, D. Loshin, Morgan Kaufmann (2008) (++)
    • A classic but still relevant book
  • Foundations for Architecting Data Solutions, T. Malaska & J. Seidman, O’Reilly (2018) (++)
    • An excellent reference with an emphasis on scaling
  • Data Management at Scale, P. Strengholt, O’Reilly (2020) (++++)
    • An opinionated vision on how to scale Data Management
  • Principles of Database Management, W. Lemahieu, S. vanden Broucke & B. Baesens, Cambridge University Press (2018) (++)
    • A comprehensive survey starting from classical databases and SQL to scaling and hardware issues
  • Data Governance: The Definitive Guide, E. Eryurek, U. Gilad, V. Lakshmanan, A. Kibunguchy-Grant & J. Ashdown, O’Reilly (2021) (+)
    • A comprehensive book on data governance

ML Ops / Data Ops

  • Effective DevOps, J. Davis & K. Daniels, O’Reilly (2016)
    • A very interesting introduction to the DevOps mindset
  • The Data Ops Cookbook (2nd ed.), Ch. Bergh, G. Benghiat & E. Strod, DataKitchen (2019)
    • A comprehensive introduction to Data Ops, not tightly linked to the DataKitchen solution
  • Data Teams, J. Anderson, Apress (2020) (+)
    • A good introduction to DevOps and team-related issues
  • Machine Learning Engineering in Action, B. Wilson, Manning (2022) (++)
    • An excellent introduction to ML Engineering going from Business understanding to production

Python

  • Introducing Python: Modern Computing in Simple Packages (2nd ed.), B. Lubanovic (2019) (++)
    • An introduction to Python
  • High Performance Python (2nd ed.), M. Gorelick & I. Ozsvald, O’Reilly (2020) (+++)
    • A good reference on Python optimization
  • Fast Python, T. Antao, Manning (2023) (++)
    • A nice reference on Python optimization with a focus on data related tasks

R

  • Hands-On Programming with R, G. Grolemund, O’Reilly (2014) (++)
    • An introduction to R programming.
  • Advanced R (2nd ed.), H. Wickham, CRC Press (2019) (+++)
    • A quite advanced book on R with a lot of insights.
  • Extending R, J. Chambers, CRC Press (2016) (++++)
    • A book from the creator of S, the ancestor of R, interesting from both the historical and the technical points of view thanks to the description of J. Chambers's vision.

Information Theory

  • Information Theory, Inference and Learning Algorithms, D. MacKay, Cambridge University Press (2003) (+++)
    • A book stressing the connection between Information Theory and learning.
  • Elements of Information Theory, Th. Cover & J. Thomas, Wiley (1991) (++++)
    • The reference book on Information Theory, a topic not so far from learning.

Ethics

  • Weapons of Math Destruction, C. O’Neil, Crown (2016)
    • An excellent book on the dangers of DS
  • Ethics of Big Data, K. Davis, O’Reilly (2012)
    • A good introductory book

Professor of Applied Mathematics

As a Professor of Applied Mathematics, my research and teaching interests range from signal processing to data science and artificial intelligence.