Session 1: Course Overview

Slides: Machine Learning for Scientific Discovery 2025 - Course Overview

About the Course

Machine Learning for Scientific Discovery is an 8-week intensive course designed to equip scientists with practical machine learning skills for research applications. The course is taught by Marc Lelarge and Tony Bonnaire at ENS, running from September to November 2025, followed by 6 weeks of supervised research projects.

Learning Objectives

Apply ML Tools in Research Contexts: Master machine learning methodologies for scientific applications
Read and Evaluate ML Literature: Develop critical analysis skills for research papers, including theoretical foundations in mathematics and statistics
Implement and Customize Solutions: Gain hands-on experience with Python/PyTorch for modifying and extending ML implementations

Course Structure

Core Curriculum (8 weeks)

Week	Topic
1-2	Statistics foundations and data representation
3	Linear models
4	Optimization techniques
5	Tree-based methods and ensemble techniques
6-8	Deep learning and modern neural network architectures

Projects (6 weeks)

Department-supervised research projects applying course concepts to domain-specific problems.

Why This Course Matters

The field of machine learning is experiencing unprecedented growth:

Scale of Research: AAAI-26 received ~29,000 submissions from 75,000+ unique authors
Scientific Impact: ML is revolutionizing discovery across disciplines:
- Protein structure prediction (AlphaFold - Nobel Prize 2024)
- Drug discovery and antibiotic development
- Climate modeling and materials science
Accessibility Gap: Bridge between powerful ML tools and domain scientists without extensive CS backgrounds

Key Concepts Covered

Fundamental Principles

Inductive Inference: Understanding how patterns learned from training data generalize to new observations
Common Task Framework: Reproducible research through standardized datasets and evaluation metrics
Data Interpretation: Critical analysis skills - “data does not speak for itself”

Technical Content

Supervised and unsupervised learning algorithms
Deep learning architectures and training techniques
Clustering methods (K-means, hierarchical clustering)
Model evaluation and validation strategies

Historical Context

The course traces AI development from philosophical foundations (Descartes, 1637) through modern breakthroughs:

Early AI: Turing Test (1950), Perceptron (1958)
Deep Learning Revolution: Backpropagation (1986) → ImageNet (2012) → Modern LLMs
Current challenges: Scaling laws, computational limits, data scarcity

Practical Information

Prerequisites

Basic programming experience (Python recommended)
Undergraduate mathematics (linear algebra, calculus, statistics)
Scientific research background in any domain

Tools and Technologies

Primary Language: Python
ML Framework: PyTorch
Additional Libraries: NumPy, scikit-learn, matplotlib

Data Representation and Unsupervised Learning

The first lecture concludes with hands-on exploration of key statistical and machine learning concepts. Simpson’s Paradox is demonstrated using the classic UC Berkeley admissions dataset, showing how aggregated statistics can be misleading - while overall admission rates appeared to favor men (44% vs 35% for women), departmental analysis reveals women disproportionately applied to more competitive programs, illustrating the critical importance of proper data stratification and causal reasoning.

Clustering algorithms are introduced through practical implementations, including K-means clustering with its cost function minimization (sum of squared distances to cluster centroids) and applications in image quantization for color reduction. Hierarchical clustering methods are covered with various linkage criteria (single, complete, average linkage), demonstrating how different distance metrics affect cluster formation. These techniques serve as concrete examples of unsupervised learning while reinforcing the theme that algorithmic tools require domain expertise for meaningful interpretation.