Information theory of mixed population genome-wide association studies

Tahmasebi, B ; Sharif University of Technology | 2019

  1. Type of Document: Article
  2. DOI: 10.1109/ITW.2018.8613344
  3. Publisher: Institute of Electrical and Electronics Engineers Inc., 2019
  4. Abstract:
  5. A Genome-Wide Association Study (GWAS) addresses the problem of associating subsequences of individuals' genomes with observable characteristics called phenotypes. In a genome of length G, each characteristic is observed to depend only on a specific subsequence of length L, called the causal subsequence. The objective is to recover the causal subsequence from a dataset of N individuals' genomes and their observed characteristics. Recently, the problem was investigated from an information-theoretic point of view in [1], which characterized the capacity of the problem and showed that there is a threshold effect for reliable learning of the causal subsequence at Gh ( N L/G ), where h(·) denotes the binary entropy function. However, [1] assumes that the dataset is collected from a single population and does not consider mixed-population datasets, which arise in many practical settings. In this paper, we study the mixed-population version of GWAS, where the dataset is gathered from K subpopulations rather than one. Each subpopulation has its own causal subsequence for the observed characteristic, and the subpopulation origins of individuals are latent. The objective is to recover all the causal subsequences with high accuracy. We investigate the fundamental limits of mixed-population GWAS and characterize its capacity. For a special class of two subpopulations, the capacity is one-fourth of the capacity of the unmixed-population case with the same parameters. The capacity of this problem is also connected to the capacity region of the Multiple Access Channel (MAC). © 2018 IEEE Information Theory Workshop, ITW 2018. All rights reserved
  6. Keywords:
  7. Genome-wide association studies ; Multiple access channel ; Threshold effect ; DNA sequences ; Gene encoding ; Capacity regions ; DNA Sequencing ; Entropy function ; High-accuracy ; Multiple access channels ; Special class ; Information theory
  8. Source: 2018 IEEE Information Theory Workshop, ITW 2018, 25 November 2018 through 29 November 2018 ; 2019 ; 9781538635995 (ISBN)
  9. URL: https://ieeexplore.ieee.org/document/8613344