|
Authors: | Mark Cieliebak, Thomas Erlebach, Zsuzsanna Liptak, Jens Stoye, Emo Welzl |
Group: | Theory of Communication Networks |
Type: | Inproceedings |
Title: | Algorithmic Complexity of Protein Identification: Combinatorics of Weighted Strings |
Year: | 2001 |
Month: | September |
Pub-Key: | CELSW01 |
Book Titel: | Combinatorics Of Searching, Sorting, And Coding |
Abstract: | We investigate a problem from computational biology: Given a constant-size alphabet A with a weight function m mapping each letter to a positive real number, find an efficient data structure and query algorithm solving the following problem: For a weight M and k strings over A, return a list of strings which contain a substring with weight M. In the one-string version of the problem, we consider only one string and ask whether it has a substring with weight M. We allow preprocessing of the strings resp. the one string, and measure efficiency in two parameters: storage space required for the preprocessed data, and running time of the query algorithm for given M. We are interested in data structures and algorithms requiring subquadratic storage space and sublinear query time, where we measure the input size as the length of the input string (resp. strings). We describe several algorithms solving different variants of the problem: Lookup solves the one-string problem with O(n) space and O(n*(log log n)/log n) time; Interval solves the one-string problem with O(n) space in O(log n) time for alphabets of size 2. Both can be generalized to the multi-string problem. Finally, Cluster, which we only sketch, solves the one-string problem and can be adjusted for a space-time-tradeoff; its analysis, however, depends on combinatorial properties of weighted strings which we solve only in part. |
Location: | Ischia Island, Italy |
Resources: | [BibTeX] |