Redesign of the Parallelized Kraken Algorithm with the Aim of Achieving Memory Efficiency for Data Classification

Kiyani Joulandan, Tala | 2024

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 57445 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Koohi, Somayyeh
  7. Abstract:
  8. With the remarkable advances in genome sequencing technology, the volume and diversity of metagenomic data have increased significantly. This growth has introduced new challenges in metagenomic data analysis, among which the precise classification of these data into taxonomic groups is one of the most important. Assigning labels to new metagenomic data at higher taxonomic levels has become a particular concern. The main challenges in this area are classification accuracy, processing speed, and memory and resource consumption, and because of them existing methods fall short of the field's growing demands. Kraken2 is one such metagenomic sequence classification tool, based on matching short substrings (k-mers). Despite its good accuracy and popularity, Kraken2 suffers from high resource consumption. In this study, we aim to preserve and improve the accuracy of the best existing method, Kraken2, while addressing its resource and memory consumption. We develop and improve a method built on the primary structure of Kraken2, with the main objective of reducing memory requirements while maintaining accuracy in metagenomic classification. By using advanced computational techniques and specific optimizations, our method seeks to overcome the limitations of current methods and provide an efficient and effective solution for metagenomic data analysis. By reusing the fundamental structures of Kraken2 and introducing a novel encoding for storing nucleotide data in the database, the new approach significantly reduces memory consumption while preserving classification accuracy and speed. Like Kraken2, the method uses minimizer data structures, but it redefines sequences to reduce their size. We also modified the parallelizable structures of the proposed method so that it can run on graphics processing units (GPUs).
These changes exploit the parallel processing capabilities of GPUs and improve execution time by reducing identifier matching times, which translates into faster overall execution of the algorithm. Specifically, compared with the optimized version of Kraken2, our optimized tool showed a 61% improvement in memory usage, a 13% improvement in sensitivity, a 1% improvement in accuracy, a 17% improvement in F1-score, and a 10% reduction in database construction time.
  9. Keywords:
  10. Genome Sequencing ; Graphics Processing Unit (GPU) ; Memory Optimization ; Microbiome ; Metagenomics Data ; Metagenomic Data Classification ; Kraken2 Approach ; K-mer Database
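
For readers unfamiliar with the minimizer idea the abstract refers to, the following is a minimal sketch of how a k-mer can be reduced to a shorter representative substring. It is an illustration only: it assumes the conventional 2-bit packing of nucleotides (A=0, C=1, G=2, T=3) and plain ordering by encoded value, and it is not the novel encoding proposed in the thesis, which the abstract does not specify. The function names `encode`, `minimizer`, and `minimizers` are hypothetical, and Kraken2's own handling of canonical k-mers is omitted.

```python
# Illustrative sketch only: conventional 2-bit packing and a simple minimizer.
# This is NOT the thesis's encoding; all names here are hypothetical.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit alphabet


def encode(seq: str) -> int:
    """Pack a nucleotide string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def minimizer(kmer: str, l: int) -> str:
    """Return the l-mer of `kmer` whose 2-bit encoded value is smallest."""
    lmers = (kmer[i:i + l] for i in range(len(kmer) - l + 1))
    return min(lmers, key=encode)


def minimizers(read: str, k: int, l: int) -> list[str]:
    """Return the minimizer of every k-mer in `read`, in order."""
    return [minimizer(read[i:i + k], l) for i in range(len(read) - k + 1)]


if __name__ == "__main__":
    # Overlapping k-mers often share a minimizer, so fewer distinct
    # keys need to be stored than there are k-mers.
    print(minimizers("GATTACA", k=5, l=3))
```

Because adjacent k-mers frequently share the same minimizer, a database keyed on minimizers stores fewer distinct entries than one keyed on every k-mer, which is the general reason such structures help with memory use.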
