Algorithms for Bioinformatics
CMPT 353, Spring 2018
Google classroom: 4gq9i3
This course presents algorithms used to analyze biological data. Its main focus is on string algorithms which are used to extract new knowledge from biological sequences - the
abstractions of ordered information encoded in macromolecules (nucleic acids and proteins).
The algorithmic techniques include discrete algorithms on strings,
Burrows-Whealer transform - BWT, FM-indexes,
dynamic programming, hidden Markov models, and
algorithms for trees.
Table of contents |
---|
1. Course Info
1.1. Contacts |
|
---|---|
Instructor: | Marina Barsky |
Lecture hours: | MF, 3:45 - 5:10 PM, FSH 112 |
Office hours (Tentative): | M, 6:40 - 7:40 PM, FSH 111 |
e-mail: | [email protected] |
1.2. Textbooks
(G). "Algorithms on Strings, Trees, and Sequences:
Computer Science and Computational Biology" by Dan Gusfield.
(D). "Biological Sequence Analysis:
Probabilistic Models of Proteins and Nucleic Acids" by Richard Durbin
1.3. Deliverables |
|||
---|---|---|---|
Weekly quizzes | 10% | (in-class) | |
3 assignments: | 30% | ||
2 midterm tests*: | 30% | ||
Final project*: | 30% | Due: last week of classes |
*Score of at least 50% is required in order to pass the course.
2. Lecture handouts
- Object of study: bio-sequences.
Molecular basis of life. Slides 01.01.
Slides 01.02.
Algorithms for bioinformatics: intro. Slides 01.03.
Readings (optional): Molecular biology. The Beginner's guide to Molecular biology. Mechanical nature of self-reproduction. - Exact string matching. The Knuth-Morris-Pratt algorithm (KMP). Slides 2.
Visualization of the KMP algorithm from this code: kmp1.py.
Shorter version: kmp2.py.
KMP - overlap function. KMP and pattern-recognition automaton. Slides 2.1.
Code of the overlap function: ol.py. Full algorithm code: kmp3.py.
String matching automaton: Capture.
Readings: (G) 1.1 - 1.5, 2.3. - String indexing. Introduction to suffix trees. Slides 3.
Applications of suffix trees: repeats and longest common substring. Slides 4.
Readings: (G) 5.
Suffix Arrays - space-efficient alternative to the Suffix trees. Slides 5. Algorithm for Suffix Array construction by Larsson. Slides 6. Readings: Original paper.
Burrows-Wheeler Matrix. Burrows-Wheeler Transform. Compressed FM-index and optimal search. Slides 7. Readings: Blog post. - Approximate String matching. Cheapest path. Dynamic programming. Edit distance. Slides 8. Optimal alignment in linear space (by Hirschberg). Better algorithm for edit distance (by Miller & Myers). Slides 9. String similarity and alignments. Scoring DNA alignments. Local vs. global alignment. Slides 10. Alignment heuristics: suffix trees, filtration, BLAST. Aligning proteins. Slides 11. Biological applications of string searching algorithms. Slides 12. Readings: (G) 11, 12, 13, 15.
- Probabilistic algorithms. Primer: conditional probability. Slides 11.1. Primer: Markov chains. Slides 11.2. Code example: markov.py. Hidden Markov Models. Application for gene finding and for protein family alignments. Slides 11. Sample code for generating sequences: markov_sequence.py. Gene finding: combination of discrete and probabilistic methods. Slides 12. Readings: The most famous tutorial on HMM. (D) 3,4,5.
- Phylogenies. Multiple sequence alignment. Error-bounded approximation algorithms: the center star algorithm for MSA. Slides 13. Distance-based methods for constructing phylogenetic trees. Slides 14. Parsimony and character-based phylogenies. Slides 15. Readings: (G) 14, 17, (D) 7.
Readings (G) refer to the chapters in the Gusfield book, and readings (D) refer to the chapters in the Durbin book.
3. Assignments
Three assignments:
To submit: join the Google classroom (code 4gq9i3)
4. Tests
4.1. Discrete algorithms on strings
Quiz 1.DNAQuiz 2.DNA2
Quiz 3.KMP
Quiz 4.SuffixTree
Quiz 5.MaximalRepeats
Quiz 6.SuffixArray
Quiz 7.Suffix sorting
Quiz 8.FM index
Quiz 9.Edit distance
Quiz 10.Edit distance algorithms
Quiz 11.Dynamic programming
Quiz 12.Alignment heuristics
4.2. Probabilistic sequence models
Quiz 13.Viterbi algorithmQuiz 14.HMM parameter estimation