Human Genome Sequence Analysis Using Statistical and Machine Learning Methods
Alaei, Shervin | 2012
- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 43351 (19)
- University: Sharif University of Technology
- Department: Computer Engineering
- Advisor(s): Manzuri Shalmani, Mohammad Taghi
- Abstract:
- In recent decades, dramatic advances in genetics and molecular biology have provided scientists with enormous amounts of molecular and genomic information about different living organisms, from DNA sequences to complex 3D structures of proteins. This information is raw data whose analysis can lead to a better understanding of genome mechanisms, discrimination between healthy and tumor cells, prediction of disease type, drug design based on genome information, and many other applications. An important issue here is the unavoidable need for computer science and statistics in analyzing these data: given the vast amount of data, intelligent methods are required that yield the most accurate results in the least time. One of the important problems in analyzing genome data is the classification of cell types according to their gene expression data, which represent the activation levels of different genes under different conditions. In recent years, various supervised methods have been proposed that classify with good accuracy. However, since labeled data are not always easy to obtain, owing to high costs and other difficulties, semi-supervised classification methods are needed. In this thesis, a new clustering method is first proposed, and then two supervised and semi-supervised methods for classification are presented, which have advantages in running time over similar state-of-the-art methods while preserving high accuracy. These methods are in fact extensions of the Support Vector Clustering method, which has attracted much attention in the past decade because of special properties such as clustering at arbitrary levels, and whose running time has been the subject of many improvement efforts.
Applying the proposed methods to the available benchmark data verifies their relative superiority in performance and accuracy over other classification methods, especially when few labeled data are available.
- Keywords:
- Machine Learning ; Identification ; Clustering ; Bioinformatics ; Gene Expression Data ; Semi-Supervised Clustering ; Genome Analysis