Micah Thornton: Examining Uses of DFT Distance Metrics in SARS-CoV-2 Genomes

Co-authors: Monnie McGee

https://youtu.be/bFW9xMpdSp0

The Fourier coefficients (FCs) of a genomic sequence can be calculated according to a method proposed earlier this decade by Yin et al. Here we are concerned with how well these coefficients capture useful information about viral sequences. The FCs are rapidly computable and comparable, which allows for fast, real-time numerical analysis of sequences. In this work we investigate the FCs as summaries of SARS-CoV-2 sequences by applying regional classification procedures and graphical examination. Specifically, we extract the geographic submission location of sequences submitted to the GISAID Initiative and attempt to use the FCs to classify the sequences by location; we also display them visually using dimensionality reduction. We show that the FCs may serve as useful numerical summaries of sequences, allowing manipulation, identification, and differentiation via classical mathematical and statistical methods that are not readily applicable to character strings. Further, we argue that subsets of the FCs may be usable for the same purposes, which would reduce storage requirements. We conclude by offering extensions of the research and potential directions for subsequent analyses, including further theoretical development of techniques specific to the FCs and other kinds of series transforms for discretely indexed signals such as genomes.
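
To make the idea of FCs as numerical summaries concrete, the following is a minimal sketch rather than the method used in this work. It assumes one common encoding (a binary indicator sequence per nucleotide, with the four DFT power spectra summed) and a plain Euclidean distance between spectra of equal-length sequences. The function names `indicator_power_spectrum` and `spectrum_distance` and the parameter `k` are hypothetical and chosen only for illustration.

```python
import numpy as np


def indicator_power_spectrum(seq, alphabet="ACGT"):
    """Sum of DFT power spectra of the per-nucleotide indicator sequences.

    Assumption: each base in `alphabet` is encoded as a 0/1 indicator
    sequence; characters outside `alphabet` contribute nothing.
    """
    chars = np.frombuffer(seq.upper().encode("ascii"), dtype=np.uint8)
    power = np.zeros(len(chars))
    for base in alphabet:
        indicator = (chars == ord(base)).astype(float)
        # Squared magnitude of each Fourier coefficient of this indicator.
        power += np.abs(np.fft.fft(indicator)) ** 2
    return power


def spectrum_distance(a, b, k=None):
    """Euclidean distance between two spectra of equal length.

    If `k` is given, only the first k coefficients are compared, which
    illustrates working with a subset of the FCs.
    """
    if len(a) != len(b):
        raise ValueError("spectra must have equal length in this sketch")
    if k is not None:
        a, b = a[:k], b[:k]
    return float(np.linalg.norm(a - b))


if __name__ == "__main__":
    p1 = indicator_power_spectrum("ACGTACGT")
    p2 = indicator_power_spectrum("AAAACCCC")
    print(spectrum_distance(p1, p2))
    print(spectrum_distance(p1, p2, k=4))
```

Real genomes differ in length, so comparing them this way would need an additional step to bring the spectra to a common length; that step is left out here.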

Micah Thornton
Program: PhD in Biostatistics
Faculty Mentor: Monnie McGee
