Teaching EA (Initiation à La Recherche)
Present and past projects with students, methodological elements.
Methodology
Research should be unbiased and reproducible.
One way to control bias in a research project is to be as clear as possible about the Research Questions and about how the Datasets are collected and filtered. One way to ensure reproducibility is to provide open-source code and to make sure that any member of the project can rerun the full end-to-end research pipeline at any time during the research.
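As a minimal sketch of what "rerunning the full end-to-end pipeline" can look like in practice, the example below puts loading, filtering and analysis behind a single function and writes a small manifest (dataset hash, seed, row counts) next to the results. The function names, the filtering rule, the sampling step and the manifest fields are all hypothetical and chosen only for illustration; a real project would substitute its own steps.

```python
import hashlib
import json
import random
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, to pin the exact dataset used."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_pipeline(data_path: Path, out_dir: Path, seed: int = 0) -> dict:
    """Toy end-to-end run: load, filter, analyse, then write a manifest."""
    rng = random.Random(seed)  # local RNG: no dependence on global random state
    values = [float(token) for token in data_path.read_text().split()]
    kept = [v for v in values if v >= 0.0]  # hypothetical filtering rule
    sample = rng.sample(kept, k=min(3, len(kept)))  # hypothetical analysis step
    manifest = {
        "data_sha256": file_sha256(data_path),
        "seed": seed,
        "n_raw": len(values),
        "n_kept": len(kept),
        "sample_mean": sum(sample) / len(sample) if sample else None,
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Two runs on the same file with the same seed should produce identical manifests, which gives any team member a quick check that their rerun matches the one reported.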
Tools for Reproducibility at Ecole Polytechnique
At the CMAP, there are two main shared computational resources:
- The Cholesky grid, which is dedicated to intensive (and possibly parallel) computations,
- The JupyterHub LabVirt, hosted by the IDCS.
On top of these, github and gitbucket are two ways to share code and libraries, publicly or privately. For setting up a project structure, I recommend
- cookie-cutter
- but I also have my own tool: create-proj, a narrow implementation of Python project structures written in Bash.
Notebooks
Maintain a personal Laboratory / Experiment Notebook
During a research process, it is useful to keep track of your attempts and general thinking in a “Lab Notebook” (or an “Experience Log”, or a personal project blog). You can opt for a digital version of this log, which lets you document your work efficiently and in an organized manner. This is a personal log; it is not meant to keep track of what the group is doing.
This Experience Log should contain the following information:
- What was tried: describe the methods, approaches, and hypotheses that were tested.
- Why you tried it: explain the reasons that led you to choose these approaches and hypotheses.
- Under what experimental conditions: specify the data used, the days the experiments were carried out, the communication channels, the filters applied, and any other relevant details.
- The main results: present the main results obtained, accompanied by tables or charts if necessary.
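The four items above can be captured in a small structured record. The sketch below is one possible digital format, not a prescribed one: each entry is appended as one JSON line to a personal log file. The class name `LogEntry`, the helper names and the file name `lab_log.jsonl` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date
from pathlib import Path


@dataclass
class LogEntry:
    what: str         # what was tried: methods, approaches, hypotheses
    why: str          # why it was tried
    conditions: dict  # data used, filters applied, other relevant details
    results: str      # main results, or a pointer to tables and charts
    day: str = field(default_factory=lambda: date.today().isoformat())


def append_entry(log_path: Path, entry: LogEntry) -> None:
    """Append one entry as a JSON line; earlier entries are never rewritten."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(entry), ensure_ascii=False) + "\n")


def read_entries(log_path: Path) -> list:
    """Read the whole log back, in the order the entries were written."""
    with log_path.open(encoding="utf-8") as handle:
        return [LogEntry(**json.loads(line)) for line in handle if line.strip()]
```

An append-only file keeps unsuccessful attempts visible alongside successful ones, which is the point of the log; a plain text or Markdown notebook with the same four headings would serve equally well.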
By regularly updating your Experience Log, you will benefit from the following advantages:
- Memory of your work: it will be easier to remember the actions taken several weeks ago, in particular when results seem contradictory. You will thus be able to identify the causes of the contradictions and avoid repeating mistakes.
- Prevention of “meta-overfitting”: by systematically documenting your trials and their results, you will protect yourself from the temptation to test only what seems to work, a practice that can lead to manipulating data and obtaining erroneous conclusions.
In summary, keeping an experience log is essential to ensure the rigor and transparency of your research work, facilitate collaboration within the team, and promote continuous learning. I strongly encourage you to adopt this practice.
Maintain an Overleaf project for the group
For students’ EA projects we will use Overleaf, a great platform for sharing and collectively building documents. My recommendations for EA: an Overleaf project is more than just a text file; you can create directories to store different documents.
Please create
- a <presentations> directory, where you can put a copy of your slides (in LaTeX or PDF format), with the date in the file name
- a <reports> directory with, to start with, a single LaTeX document with a section named after each meeting date, containing a summary of the discussions during the meeting and between the meetings (if any)
- a LaTeX document (perhaps in the root directory) that starts with a definition of the research project.
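For those who prefer to prepare files locally before uploading them to Overleaf, the layout above can be sketched as follows. The file names `meetings.tex` and `project_definition.tex` and the `YYYY-MM-DD_label.ext` naming pattern are assumptions made for illustration; the only requirements stated above are the two directories, the two LaTeX documents, and a date in each slide file name.

```python
from datetime import date
from pathlib import Path


def dated_name(day: date, label: str, ext: str) -> str:
    """Build a file name with an ISO date prefix, so files sort chronologically."""
    return f"{day.isoformat()}_{label}.{ext}"


def create_skeleton(root: Path) -> None:
    """Create the directories and empty LaTeX placeholders if they are missing."""
    (root / "presentations").mkdir(parents=True, exist_ok=True)
    (root / "reports").mkdir(parents=True, exist_ok=True)
    # Hypothetical file names; existing files are left untouched.
    for relative in ("reports/meetings.tex", "project_definition.tex"):
        path = root / relative
        if not path.exists():
            path.write_text("% placeholder, to be filled in\n", encoding="utf-8")
```

For example, `dated_name(date(2025, 1, 9), "slides", "pdf")` gives `2025-01-09_slides.pdf`.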
Defining the Research Project and Describing the Dataset
Start with a list of the points that have to be documented in the definition of the project:
- the research question(s)
- why it’s complicated or not yet done (now’s the time to cite articles)
- a definition of the dataset, and how it is selected (this may have an influence)
- the features and why they correspond a priori to the research question
- what are we going to test statistically, with which models or analytics?
- what are the possible results of these tests and why will they help answer the research question?
More to say about describing the datasets.
Projects
- 2024-2025:
  - Informational content of Reddit’s URLs, by Younes Alexandre Bennani, Sara El Baghdadi, and Jean-Luc Tchimbakala
  - Improving the Queue Reactive Model with exogenous information, by Janis AIAD and Edouard LAFERTE
  - Models for Price Formation and Learning in Equilibrium under Asymmetric Information, by Fadi Jemmali and Belgacem Ben Ziada