Hardware and Software Acceleration of Genomic Searching Through Novel Algorithms


Research and development of a new and sensitive genomic search algorithm which uses fragment assembly and indexing techniques to operate up to an order of magnitude faster than existing algorithms. The APAC National Facility makes it possible to provide the benchmark data necessary to verify the speed and sensitivity characteristics compared to the existing algorithms.


Principal Investigator

Greg Knowles
School of Informatics & Engineering
Flinders University

Project

f80

Co-Investigator

Paul Gardner-Stephen
School of Informatics & Engineering
Flinders University

RFCD Codes

270299, 280103, 280204, 280207, 280399, 280506


Significant Achievements, Anticipated Outcomes and Future Work

We have significantly completed the first round of benchmark data collation using an implementation (seqaln) of the Smith-Waterman algorithm for a group of 83 sequences. This information is being combined with the output of our novel algorithm, Dash, to provide insight and direction in the analysis and refinement of that algorithm. Of particular value has been the removal of "artifact chasing" which has resulted from benchmarking against other approximate algorithms. Each Smith-Waterman run required of the order of 100,000 CPU seconds. This contrasts to the existing approximate algorithms which we have compared to in the past, which typically require 10-100 CPU seconds.

Future work consists in the immediate term of comparing these "true" results against the approximate results of successive revisions of our algorithms. In the longer term we indend to obtain results for a larger corpus of searches to provide more statistically significant characterisation of our algorithms.

 

Computational Techniques Used

The Smith-Waterman algorithm is a dynamic programming algorithm which is employed to find the best possible alignment of two DNA or protein sequences against each other, given certain scoring parameters. It has complexity of the order of the product of the lengths of the sequences being aligned. Hence it becomes extremely expensive computationally when attempting alignment of sequences against entire genomes or databases, possibly consisting of tens of billions of residues. In return for this excessive cost, it does guarantee finding the optimal alignment in each case. This is what provides the value for our project where we consider algorithms which are some 10,000 to 1,000,000 times faster, but in return do not guarantee finding the optimal alignments. By having the optimal alignments available it becomes possible to objectively compare the performance of two approximate algorithms.

 

Publications, Awards and External Funding

External Funding and Awards

CRC for Sensor, Signal and Information Processing project "Firmware for Genomic Searching"

Publications

P. Gardner-Stephen and G. Knowles, "DASH: A New High Speed Genomic Search and Alignment Tool", 4th International Conference on Mathematics and Computers in Biology and Chemistry (MCBC'03),1, 2003, 121-127.