Contact:
Bioinformatics Group, School of Computer Science, University of Waterloo
200 University Ave W, Waterloo, ON N2L 3G1, Canada
E-mail: Dan Brown

Brona Brejova, Daniel G. Brown, Tomas Vinar. Optimal Spaced
Seeds for Hidden Markov Models, with Application to Homologous Coding
Regions. Web supplement
This page contains data used in the following papers:
- Brona Brejova, Daniel G. Brown, Tomas Vinar. Optimal Spaced Seeds for
Hidden Markov Models, with Application to Homologous Coding
Regions. In R. Baeza-Yates, E. Chavez, M. Crochemore, eds.,
Combinatorial Pattern Matching, 14th Annual Symposium (CPM), volume 2676
of Lecture Notes in Computer Science, pp. 42-54, Morelia,
Michoacan, Mexico, June 25-27, 2003. Springer.
- Brona Brejova, Daniel Brown, Tomas Vinar. Vector seeds: an
extension to spaced seeds allows substantial improvements in
sensitivity and specificity. In G. Benson, R. Page, eds., Algorithms
in Bioinformatics: 3rd International Workshop (WABI), volume 2812 of
Lecture Notes in Bioinformatics, pp. 39-54, Budapest, Hungary,
September 2003. Springer.
If you use this data, please cite one of the above papers.
Human-mouse dataset
- Training set of ungapped fragments (1 means match, 0 means mismatch,
a hyphen means gap, and a period pads to the nearest codon boundary): 39kB
- Testing set of gapped fragments: 38kB
- Actual gapped DNA alignments corresponding to fragments in the
training set (148kB) and in the testing set (142kB).
(Note that for training, gapped fragments were further split
into ungapped fragments.)
- Blastp alignments of SWISSPROT proteins used to obtain these sets: 29043kB
- Genomic sequences coding for these proteins; coding regions are
in uppercase, introns and intergenic regions in lowercase.
Human: 42413kB
Mouse: 4036kB
- List of alignments used in the training and testing sets: 5kB
- Sequences used to compare the running time of different programs
(that is, genomic sequences from the above files corresponding to the
testing set of alignments).
Human: 2899kB
Mouse: 1377kB
Coding regions of these sequences (non-coding regions masked out by n's).
Human: 216kB
Mouse: 189kB
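
The fragment encoding described above (1 for match, 0 for mismatch, a hyphen for gap, a period for codon-boundary padding) can be explored with a short script. The following is only a minimal sketch: it assumes each fragment is available as a single string over those four characters, which this page does not confirm, and the helper names `summarize` and `seed_hits` are hypothetical. The seed is written in the usual spaced-seed notation, where 1 marks a position that must match and 0 a position that is ignored.

```python
from collections import Counter


def summarize(fragment: str) -> dict:
    """Count each symbol class in one encoded fragment (assumed layout)."""
    counts = Counter(fragment)
    return {
        "match": counts["1"],
        "mismatch": counts["0"],
        "gap": counts["-"],
        "pad": counts["."],
    }


def seed_hits(fragment: str, seed: str) -> list:
    """Return start positions where a spaced seed hits the fragment.

    A window counts as a hit only if it contains no gap or padding
    symbol and every seed position marked '1' is aligned with a match.
    """
    hits = []
    for start in range(len(fragment) - len(seed) + 1):
        window = fragment[start:start + len(seed)]
        if "-" in window or "." in window:
            continue  # seeds are only evaluated inside ungapped stretches
        if all(w == "1" for w, s in zip(window, seed) if s == "1"):
            hits.append(start)
    return hits


if __name__ == "__main__":
    example = "..11-101."  # made-up fragment, not taken from the dataset
    print(summarize(example))
    print(seed_hits("1101011", "101"))
```

Skipping windows that contain a gap or padding symbol is a design choice of this sketch; it reflects the note above that gapped fragments were split into ungapped ones for training.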
Human-drosophila dataset
- Training set of ungapped fragments (1 means match, 0 means mismatch,
a hyphen means gap, and a period pads to the nearest codon boundary): 18kB
- Testing set of gapped fragments: 20kB
- Actual gapped DNA alignments corresponding to fragments in the
training set (62kB) and in the testing set (67kB).
(Note that for training, gapped fragments were further split
into ungapped fragments.)
- Blastp alignments of SWISSPROT proteins used to obtain these sets: 8056kB
- Genomic sequences coding for these proteins; coding regions are
in uppercase, introns and intergenic regions in lowercase.
Human: 42413kB (same as above)
Drosophila: 1258kB
- List of alignments used in the training and testing sets: 2kB
- Sequences used to compare the running time of different programs
(that is, genomic sequences from the above files corresponding to the
testing set of alignments).
Human: 2715kB
Drosophila: 247kB
Coding regions of these sequences (non-coding regions masked out by n's).
Human: 155kB
Drosophila: 108kB
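
The relationship between the case-annotated genomic sequences (coding in uppercase, introns and intergenic regions in lowercase) and the masked coding-region files can be illustrated as follows. This is a hedged sketch, not the procedure actually used to build the files: it assumes the sequence is already loaded as a plain string of letters, and the function names `mask_noncoding` and `coding_intervals` are hypothetical.

```python
import re


def mask_noncoding(seq: str) -> str:
    """Replace every non-uppercase character with 'n', keeping coding bases."""
    return "".join(c if c.isupper() else "n" for c in seq)


def coding_intervals(seq: str) -> list:
    """Return (start, end) pairs, end exclusive, for each uppercase run."""
    return [(m.start(), m.end()) for m in re.finditer(r"[A-Z]+", seq)]


if __name__ == "__main__":
    example = "acgtATGGCCtaggTAA"  # made-up sequence, not from the dataset
    print(mask_noncoding(example))
    print(coding_intervals(example))
```

Masking preserves sequence length, so coordinates in the masked string stay comparable with those in the original genomic sequence.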
Set of human protein alignments
- Set of 566 human protein alignments used in experiments with vector
seeds for protein sequences: 145kB