Modern biology is a data-rich science, driven by our ability to measure the detailed molecular characteristics of cells, organs, and individuals at many different levels. Interpretation of these large-scale biological data requires the detection of statistical dependencies and patterns in order to establish useful models of complex biological systems. Techniques from machine learning are key in this endeavour. Typical examples are the visualization of single-cell RNA-seq data using dimensionality reduction methods, base calling for nanopore sequencing data using hidden Markov models and (recurrent) neural networks, and classification of high-throughput microscopy image data using convolutional neural networks.
In this one-week course, the foundations of machine learning will be laid out and commonly used methods for unsupervised (clustering, dimensionality reduction, visualization) and supervised (mainly classification) learning will be explained in detail. Methods will be illustrated using recent examples from the fields of systems biology and bioinformatics. Methods discussed in the morning lectures will be put into practice during the afternoon computer lab sessions.
The course covers the following topics:
- Density estimation, including histograms, nearest neighbour, Parzen windows
- Evaluation, including ROC, cross-validation
- Parametric and non-parametric classifiers, including linear discriminant analysis, k-nearest neighbours, logistic regression, decision trees and random forests
- Feature selection, including search algorithms (forward, backward, branch & bound) and regularized classifiers (ridge, lasso, elastic net)
- Dimensionality reduction, including principal component analysis, multi-dimensional scaling, t-SNE
- Clustering, including hierarchical clustering, k-means, Gaussian mixture models
- Hidden Markov models
- (Deep) neural networks
- Kernel-based methods, including support vector machines
After completing this course, students will have a good understanding of a wide range of machine learning techniques and will be able to recognize which method is most applicable to the data analysis problems they encounter in bioinformatics and systems biology.
The course is aimed at PhD students with a background in bioinformatics, systems biology, computer science, the life sciences or a related field. Participants from the private sector are also welcome. A working knowledge of basic statistics and linear algebra is assumed. Preparatory material on statistics and linear algebra will be distributed before the course for students who lack the required background.