Seminars & Colloquia

Sharma Thankachan

University of Central Florida

"Provably Efficient String Algorithms, Compressed Data Structures, and their Applications to Computational Biology"

Friday March 04, 2022 10:15 AM
Location: 3211, EB2 NCSU Centennial Campus
Zoom Meeting Info
(Visitor parking instructions)


Abstract: Algorithms on strings (or sequence or text) and related data structures are fundamental to computer science and form the backbone of many data processing tools in Computational Biology. Most problems under exact string matching can be solved optimally using classic data structures like suffix trees/arrays. However, real-world applications are more complex as they often require inexact matching, like matchings that tolerate a small number of mismatches/edits. Although heuristics are popular and frequently employed in practice, a robust algorithmic framework for handling them has been elusive. To that end, I will briefly sketch an algorithmic framework that we developed for solving various approximate matching problems with provably efficient guarantees. At its core, our framework transforms approximate matching to (an arguably simpler) exact matching. The next topic is compressed indexing. The objective is to design compact data structures for various string-matching problems. The first breakthrough in this line of research is from early 2000, known as the compressed suffix trees/arrays and the FM-index, which encapsulates the suffix array's string-matching functionality in optimal bits. They rely essentially on a technique called Burrows-Wheeler Transform (BWT) and its associated operation called LF mapping. However, compressing variants of suffix trees/arrays that handle more complex matchings like parameterized matching and order-isomorphic matchings, 2D matchings, etc., has been elusive. We solved several problems in these directions, including one open for nearly two decades. The final topic comes from pan-genomics, a trending area in bioinformatics, where there is an increasing demand for compactly modeling sequence data into labeled graphs. As a result, many of the existing algorithms and structures for string-to-string matching need to be redesigned for string-to-graph matching. In that context, I will briefly present the main results from our work on BWT-indexable graphs (Wheeler graphs), deeper understandings of De-Bruijn graphs, related hardness results, etc., along with some future research directions.
Short Bio: Sharma Thankachan is an assistant professor of Computer Science at the University of Central Florida (UCF) Orlando. Before joining UCF in 2017, he worked as a postdoc/research scientist in the School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, and the Cheriton School of Computer Science, University of Waterloo, Canada. He received his Ph.D. in Computer Science from Louisiana State University in 2014. His bachelor's degree (B. Tech) is from the National Institute of Technology in Calicut, India. His main research areas include string algorithms, theoretical aspects of data compression, data structures for compressed indexing, and parallel algorithms. A significant part of his work deals with fundamental problems stemming from applied fields like computational biology and information retrieval. He has published in prestigious venues like the Journal of ACM, SODA, RECOMB, ICALP, Supercomputing (SC), etc. His research is supported primarily by the National Science Foundation (NSF), and he is a recipient of the prestigious NSF CAREER award.

Host: Don Sheehy, CSC

Back to Seminar Listings
Back to Colloquia Home Page