CSE Colloquia: Augmenting k-mer sketching for (meta)genomic sequence comparisons

Abstract:
Over the last decade, k-mer sketching (e.g. minimizers or MinHash) to create succinct summaries of long sequences has proven effective at improving the speed of sequence comparisons. However, rigorously characterizing the accuracy of these techniques has been more difficult. In this talk, I'll touch on three results that showcase some of the modern theoretical developments and practical applications of theory to building faster sequence comparison tools for metagenomics.

We begin by rigorously providing average-case guarantees for the popular seed-chain-extend heuristic for pairwise sequence alignment under a random substitution model, showing that it is accurate and runs in close to O(n log n) time for similar sequences. Then, we will turn our focus to metagenomics: our new tool skani computes average nucleotide identity (ANI) using sparse approximate alignments, and is both more accurate and over 20 times faster than the current state-of-the-art FastANI for comparing incomplete, fragmented MAGs (metagenome assembled genomes). This was enabled by Belbasi, et al.'s work showing that minimizers are biased Jaccard estimators, whereas other k-mer sketching does not have that drawback. Finally, we will introduce sylph (unpublished work), which enables fast and accurate database search to find nearest neighbor genomes (in ANI space) of low-coverage sequenced samples by using a combination of k-mer sketching with a zero-inflated Poisson correction (45x faster than MetaPhlAn for screening databases).

All of the work in this talk is joint with my brilliant PhD student Jim Shaw.

Shaw J, Yu YW. Proving sequence aligners can guarantee accuracy in almost O (m log n) time through an average-case analysis of the seed-chain-extend heuristic. Genome Research (2023) 33 (7), 1175-1187
Shaw J, Yu YW. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods (2023). https://doi.org/10.1038/s41592-023-02

Bio:
William is a computational biologist and applied mathematician interested in compression, genomics, privacy, and data sketching. Prior to moving to CMU, he was an assistant professor in the math department at the University of Toronto. He did his masters at Imperial College London as a Marshall Scholar, his doctoral studies with Bonnie Berger in Mathematics at MIT as a Hertz Fellow, and was a postdoctoral fellow with Griffin Weber in Biomedical Informatics at Harvard Medical School.

William is generally interested in developing novel algorithms for bioinformatics applications and translating existing tools from the CS literature to biology. More specifically, his primary research themes are on average-case analysis of string algorithms, probabilistic sketching, compressive algorithms for data science, and private and efficient biomedical database queries, as well as practical software implementing those ideas.

 

Share this event

facebook linked in twitter email

Event Contact: Timothy Zhu

 
 

About

The School of Electrical Engineering and Computer Science was created in the spring of 2015 to allow greater access to courses offered by both departments for undergraduate and graduate students in exciting collaborative research fields.

We offer B.S. degrees in electrical engineering, computer science, computer engineering and data science and graduate degrees (master's degrees and Ph.D.'s) in electrical engineering and computer science and engineering. EECS focuses on the convergence of technologies and disciplines to meet today’s industrial demands.

School of Electrical Engineering and Computer Science

The Pennsylvania State University

207 Electrical Engineering West

University Park, PA 16802

814-863-6740

Department of Computer Science and Engineering

814-865-9505

Department of Electrical Engineering

814-865-7667