Design and Analysis of DNA Sequencing Methods

Nashtaali, Damoun | 2017

  1. Type of Document: Ph.D. Dissertation
  2. Language: Farsi
  3. Document No: 50137 (05)
  4. University: Sharif University of Technology
  5. Department: Electrical Engineering
  6. Advisor(s): Hossein Khalaj, Babak; Abolfazl, Motahhari
  7. Abstract:
  8. The DNA sequence is the information source of living organisms. This information is carried by the bases that make up the sequence, of which there are four kinds, and sequencing the DNA is necessary to read it. In 1977, Sanger reported the first sequence of a DNA string. Today, a human DNA string can be sequenced for about $1000 in ~2 hours. Knowing the DNA sequence helps to determine the function of each organism and to predict and cure diseases (especially cancer). Next Generation Sequencing (NGS) methods are based on shotgun sequencing, which fragments DNA strings and sequences each fragment. After sequencing, the processing machine handles the resulting data in one of two ways, alignment or assembly, depending on whether a reference genome is available. In this thesis, we propose two improvements, one for the processing machine and one for the sequencing machine. The improvement in the processing machine is an algorithm for aligning the output fragments of the sequencing machine. Classical alignment methods map all fragments independently and sequentially, without considering the structure of the DNA sequence. The proposed algorithm, in contrast, maps fragments based on the structure of the reference genome. For this purpose, a model of the DNA genome is developed. Results on simulated and real data sets show that this method achieves better results than existing alignment methods. The strengths of the proposed method are its speed and accuracy, especially in the first stage of the algorithm. In addition, it is the first method that estimates its running parameters from the reference genome and the input fragments. The improvement in the sequencing machine is a novel method that performs sequencing and processing synchronously, reducing the number of bases that must be read to re-sequence the target genome. Lander-Waterman’s coverage bound establishes the total number of reads required to cover the whole genome of size G bases; as a result, the total number of bases to be sequenced should be O(G log G).
Although this result is a tight bound, it rests on the tacit assumption that the set of reads is first collected through a sequencing process and then processed through a computation process. In this thesis, we present a significant improvement over Lander-Waterman’s result and prove that, by combining the sequencing and computing processes, one can re-sequence the whole genome with as few as O(G) sequenced bases in total. Our approach also dramatically reduces the computational power required for the combined process. Simulations are performed on real genomes with different sequencing error rates. The results support our theory, which predicts a log G improvement in the coverage bound and a corresponding reduction in the total number of bases that must be sequenced.
  9. Keywords:
  10. DNA Sequencing ; Next Generation Sequencing (NGS) ; DNA Sequence Alignment ; DNA Sequence Assembly ; DNA Sequence Statistical Model ; Coverage Bound
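
The O(G log G) figure quoted in the abstract can be illustrated with the usual textbook simplification of the Lander-Waterman argument. The sketch below is not taken from the dissertation: it assumes reads land uniformly at random, ignores edge effects and sequencing errors, and uses hypothetical function names and a hypothetical tolerance parameter epsilon. Under those assumptions, the expected number of uncovered bases at coverage depth c = N*L/G is about G*exp(-c), so driving it below epsilon needs c >= ln(G/epsilon), that is, about G*ln(G/epsilon) sequenced bases. The O(G) scheme proposed in the thesis is not reproduced here.

```python
import math


def expected_uncovered_bases(genome_size: int, read_length: int, num_reads: float) -> float:
    """Expected number of uncovered bases under the Poisson approximation.

    Assumes reads start uniformly at random along the genome (illustrative
    simplification, not the dissertation's model).
    """
    coverage_depth = num_reads * read_length / genome_size
    return genome_size * math.exp(-coverage_depth)


def required_sequenced_bases(genome_size: int, epsilon: float = 0.01) -> float:
    """Total bases to sequence so the expected uncovered bases is at most epsilon.

    Solves G * exp(-c) = epsilon for the coverage depth c, then returns G * c,
    which grows as G log G. The parameter epsilon is a hypothetical tolerance.
    """
    coverage_depth = math.log(genome_size / epsilon)
    return genome_size * coverage_depth


if __name__ == "__main__":
    # The per-base cost (coverage depth) grows logarithmically with genome size.
    for genome_size in (10**6, 10**8, 10**9):
        depth = required_sequenced_bases(genome_size) / genome_size
        print(f"G = {genome_size:>10}: coverage depth ~ {depth:.1f}x")
```

The per-base depth printed by the loop is the log G factor that, according to the abstract, the combined sequencing and computing process removes.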
