Contact:
Bioinformatics Group
School of Computer Science
University of Waterloo
200 University Ave W
Waterloo, ON N2L 3G1
Canada

E-mail: Dan Brown

Brona Brejova, Daniel G. Brown, Tomas Vinar. Optimal Spaced Seeds for Hidden Markov Models, with Application to Homologous Coding Regions. Web supplement

This page contains the data used in the following papers:

  • Brona Brejova, Daniel G. Brown, Tomas Vinar. Optimal Spaced Seeds for Hidden Markov Models, with Application to Homologous Coding Regions. In R. Baeza-Yates, E. Chavez, M. Crochemore, eds., Combinatorial Pattern Matching, 14th Annual Symposium (CPM), volume 2676 of Lecture Notes in Computer Science, pp. 42-54, Morelia, Michoacan, Mexico, June 25-27, 2003. Springer.
  • Brona Brejova, Daniel Brown, Tomas Vinar. Vector seeds: an extension to spaced seeds allows substantial improvements in sensitivity and specificity. In G. Benson, R. Page, eds., Algorithms and Bioinformatics: 3rd International Workshop (WABI), volume 2812 of Lecture Notes in Bioinformatics, pp. 39-54, Budapest, Hungary, September 2003. Springer.

If you use this data, please cite one of the above papers.

Human-mouse dataset

  • Training set of ungapped fragments (39kB). In each fragment, 1 means match, 0 means mismatch, - means gap, and . pads the fragment to the nearest codon boundary.
  • Testing set of gapped fragments (38kB).
  • Actual gapped DNA alignments corresponding to the fragments in the training set (148kB) and in the testing set (142kB). Note that for training, the gapped fragments were further split into ungapped fragments.
  • Blastp alignments of SWISSPROT proteins used to obtain these sets (29043kB).
  • Genomic sequences coding for these proteins. Coding regions are in uppercase; introns and intergenic regions are in lowercase.
    Human: 42413kB
    Mouse: 4036kB
  • List of alignments used in the training and testing sets (5kB).
  • Sequences used to compare the running times of different programs (that is, the genomic sequences from the above files that correspond to the testing set of alignments):
    Human: 2899kB
    Mouse: 1377kB
    Coding regions of these sequences (non-coding regions masked out by n's):
    Human: 216kB
    Mouse: 189kB
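
The fragment encoding described above (1 for match, 0 for mismatch, - for gap, . for padding) can be scanned for spaced-seed hits with a few lines of code. The following is a minimal sketch, not the authors' program: the function name `seed_hits`, the example seed, and the example fragment are hypothetical, and it assumes each fragment is available as a single string. A seed hits at an offset when every 1 position of the seed falls on a 1 in the fragment; any other fragment character (0, -, or .) blocks the hit.

```python
def seed_hits(fragment: str, seed: str) -> list[int]:
    """Return the start offsets at which a spaced seed hits a fragment.

    fragment: string over '1' (match), '0' (mismatch), '-' (gap), '.' (padding).
    seed: string over '1' (position must match) and '0' (don't care).
    Assumption: only '1' in the fragment counts as a match.
    """
    # Positions within the seed that are required to be matches.
    required = [i for i, c in enumerate(seed) if c == "1"]
    hits = []
    for start in range(len(fragment) - len(seed) + 1):
        if all(fragment[start + i] == "1" for i in required):
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Hypothetical fragment and seed, for illustration only.
    print(seed_hits("111011011", "1101"))
```

A fragment counts as detected by the seed if the returned list is non-empty; the fraction of detected fragments in a test set gives an empirical estimate of the seed's sensitivity on that set.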

Human-Drosophila dataset

  • Training set of ungapped fragments (18kB). In each fragment, 1 means match, 0 means mismatch, - means gap, and . pads the fragment to the nearest codon boundary.
  • Testing set of gapped fragments (20kB).
  • Actual gapped DNA alignments corresponding to the fragments in the training set (62kB) and in the testing set (67kB). Note that for training, the gapped fragments were further split into ungapped fragments.
  • Blastp alignments of SWISSPROT proteins used to obtain these sets (8056kB).
  • Genomic sequences coding for these proteins. Coding regions are in uppercase; introns and intergenic regions are in lowercase.
    Human: 42413kB (same as above)
    Drosophila: 1258kB
  • List of alignments used in the training and testing sets (2kB).
  • Sequences used to compare the running times of different programs (that is, the genomic sequences from the above files that correspond to the testing set of alignments):
    Human: 2715kB
    Drosophila: 247kB
    Coding regions of these sequences (non-coding regions masked out by n's):
    Human: 155kB
    Drosophila: 108kB
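
Both datasets mark coding regions by letter case in the genomic sequences and provide a second version with non-coding regions masked by n's. The relationship between the two conventions can be sketched as follows. This is an illustration under assumptions, not the procedure used to build the files: the function names are hypothetical, and the input is assumed to be a bare sequence string with any header lines and line breaks already removed.

```python
import re


def mask_noncoding(seq: str) -> str:
    """Replace every non-uppercase character (introns, intergenic) with 'n'."""
    return "".join(c if c.isupper() else "n" for c in seq)


def coding_segments(seq: str) -> list[str]:
    """Return the maximal uppercase runs, i.e. the annotated coding segments."""
    return re.findall(r"[A-Z]+", seq)


if __name__ == "__main__":
    # Hypothetical sequence, for illustration only.
    example = "acgATGgtaagCCTtt"
    print(mask_noncoding(example))
    print(coding_segments(example))
```

Masking keeps the sequence length and coordinates unchanged, which is convenient when hits found in the masked sequence need to be mapped back to the full genomic sequence.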

Set of human protein alignments

  • Set of 566 human protein alignments used in experiments with vector seeds for protein sequences (145kB).


This page is maintained by Mike Gore.
Last modified: 10/03/2003