Parisian Master of Research in Computer Science
Master Parisien de Recherche en Informatique (MPRI)

Algorithmics and bioinformatics (26h, 3 ECTS)

Coordination: Philippe Gambette (LIGM, Univ. Paris-Est Marne-la-Vallée).

Lecturers

The lecturers are researchers in the Algorithmics for Bioinformatics group at LIGM (Univ. Gustave Eiffel at Marne-la-Vallée).

Practical information

Language

Lectures will be given in English.

Location

The lectures will take place at ENS Paris-Saclay.

Motivations and main objectives

The objective of this course is to study the algorithmic approaches used in bioinformatics. The topics covered in 2022-2023 include scaffolding, reversal and transposition distances, scaffold filling and phylogenetics.

Short description

A total of 26 hours of lectures and exercise sessions, from September 21st to November 9th (to be confirmed):

  • Gregory Kucherov: sequence analysis and data structures for bioinformatics
  • Philippe Gambette (algorithmics for phylogenetics) :
    • phylogenetic reconstruction: distance methods, maximum parsimony, maximum likelihood ; rooting
    • tree combinatorics: compatible characters, clusters, splits
    • tree comparison: maximum agreement subtree, tree distances (Robinson-Foulds, quartet and triplet distance, NNI, TBR, SPR)
    • reticulate evolution: phylogenetic network, hybridization number, tree containment problem

Lecture notes

* Gregory Kucherov:

* Philippe Gambette:

  • Lecture 1:
    • slides: rooted and unrooted trees, clusters and splits, triplets and quartets, distances between leaves, four-point condition, ultrametric, UPGMA algorithm, BUILD algorithm, drawing algorithms: equal angle and equal daylight, tree comparison : MAST (dynamic programming for 2 binary rooted trees), MCT, tanglegrams
    • Reordering a tree according to an order on its leaves: slides, video
  • Lecture 2:
    • slides: Neighbor-Joining algorithm, perfect phylogeny (Gusfield's algorithm for binary characters), maximum parsimony (Fitch's algorithm, 2-approximation), phylogenetic tree rooting, distances between phylogenetic trees.
  • Lecture 3:
    • slides: randomly generating trees, counting tree shapes, issues with real biological data, consensus trees, maximizing triplet consistency, phylogenetic networks, hybridization networks, classes of phylogenetic networks, counting phylogenetic networks, tree containment.

Evaluation / exams

The evaluation of the students will consist in a written exam (coefficient 1), as well as a coding project (optional, coefficient 1).

Written exam

The written exam will be on November 22nd. The exercises will be given in English, and you can write your answers in English or in French. You'll be authorized to have your lecture notes with you (but no computer, cellphone, etc.)

Coding project

Students will also be in charge of programming a solution to a problem occurring in bioinformatic algorithms, in Python. This work will be optional and will only be taken into account if it increases the global mark for the course.

For year 2022, the coding project consists in programming an algorithm to solve the OTCM or the OTDE problem, namely to find the optimal order on the leaves of a phylogenetic tree given as input such that the tree is planar and the order is as close as possible to the optimal order, either counting the number of crossings (OTCM) or the number of leaves to delete to get no crossings (OTDE). More details in this paper and in this video.

The project will be coded in Python, in only one Python file, using only the following libraries: argparse, glob, itertools, numpy, os, pathlib, re, sys and time. Students wanting to use other libraries, or who have any question about the coding project, are invited to contact Philippe Gambette (philippe.gambette@univ-eiffel.fr), for example to ask whether an extra library could be added to this list. If they used already existing code, they have to respect its licence and mention it clearly in the beginning of their file. Commenting the code will be appreciated, for example with approximately one comment every 5 lines.

This Python script is supposed to take as input a text file in UTF8 encoding, called input.txt, located in the same folder, containing a tree with labeled nodes coded in the Newick format on the first line (see this this example or the files in this folder which are described in the readme file of the repository; the input order of the nodes is simply the lexicographic order according to Python) and to output a text file, called output.txt, contained in the same folder, containing on the first line a Newick encoding of the output tree (with reordered nodes) and the number of conflicts, or deleted leaves, on the next line.

This code should have added value compared with the code available in this GitHub repository, which provides examples of input files, for example:

  • new implementations of the same algorithm or other algorithms (to help compare the results for bug tracking)
  • adaptation to take into account unrooted trees, not binary trees
  • adaptation to take into account trees with leaves having the same values (corresponding to equal values in the order)
  • implementation of other algorithms (for example the FPT algorithms in this paper or the reduction to 3-Hitting Set, for OTDE, or published algorithms solving OTCM in worst-case time complexity faster than O(n²), listed in slide 32 of this slideshow)

The project will be evaluated taking into account the correctness of the code, its speed, its readability (quality of comments in particular) as well as the quantity of code written personally and the respect of intellectual property.

The deadline for this coding project task will be November 22nd29th, at 23:59 (there was a possibility to extend the deadline, only by request before the initial deadline at philippe.gambette@univ-eiffel.fr).

By default this project will be done individually, if you want to make a group project, please contact me by email to describe which variant of the algorithm you plan to code, and the composition of the group, so that I can confirm if there is enough work for a group project.

So far, the following projects have been received (please contact me if you sent me your project and your name does not appear below 2 days after you sent it, your email may be in my spam folder):

  • T. B.: 2022-11-30 “Bioinformatics Project”
  • L. H.: 2022-11-30 “Librairies pour le projet de bioinformatique”
  • E. L.: 2022-11-30 “Projet Bioinformatique - ”
  • M. Q.: 2022-11-29 “Projet Bioinformatique”
  • A. B.: 2022-11-29 “Projet Bioinformatique”
  • Y. P.: 2022-11-29 “Bio info project”
  • I. B.: 2022-11-29 “BioInfo project, ”
  • E. C.: 2022-11-29 “Projet Algorithmique & Bioinformatique”
  • H. S.: 2022-11-28 “rendu projet OTCM”
  • G. R.: 2022-11-22 “Projet de Bioinformatique”
  • M. D.: 2022-11-21 “Bioinformatics Project”

Schedule

  • Gregory Kucherov (sequence analysis and data structures for bioinformatics) :
    • 2022-09-20, 8:30-12h45
    • 2022-09-27, 8:30-12h45
    • 2022-10-04, 8:30-12h45
    • 2022-10-11, 8:30-12h45
  • Philippe Gambette (algorithmics for phylogenetics) :
    • 2022-10-18, 8:30-12h45
    • 2022-10-25, 8:30-12h45
    • 2022-11-08, 8:30-12h45
    • 2022-11-22, 9:30-12:30 (exam)

Prerequisites

  • basic algorithms and data structures; basic discrete probability; basics of computational complexity

Related courses

Bibliography

 
Universités partenaires Université Paris-Diderot
Université Paris-Saclay
ENS Cachan École polytechnique Télécom ParisTech
ENS
Établissements associés Université Pierre-et-Marie-Curie CNRS INRIA CEA