Recovery from random samples in a big data set

Molavipour, S ; Sharif University of Technology | 2015

  1. Type of Document: Article
  2. DOI: 10.1109/LCOMM.2015.2478815
  3. Publisher: Institute of Electrical and Electronics Engineers Inc., 2015
  4. Abstract: Consider a collection of files, each of which is a sequence of letters. One of these files is randomly chosen and a random subsequence of the file is revealed. This random subsequence can be the result of a random sampling of the file. The goal is to recover the identity of the file, assuming a simple greedy matching algorithm to search the file collection. We study the fundamental limits on the maximum size of the file collection for reliable recovery in terms of the length of the random subsequence. The sequence of each file is assumed to follow a hidden Markov model (HMM), which is a common model for many data structures such as voice or DNA sequences. The connection between this problem and coding over a deletion channel with greedy decoders is discussed.
  5. Keywords: Search; Algorithms; Channel coding; DNA sequences; Hidden Markov models; Markov processes; Recovery; Common models; Deletion channels; Greedy match; Matching algorithm; Random sample; Random sampling; Big data
  6. Source: IEEE Communications Letters; Volume 19, Issue 11, September 2015, Pages 1929-1932; 10897798 (ISSN)
  7. URL: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7268856
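
The recovery task described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact greedy matching rule, so the code below assumes one common reading: scan each file left to right and match every revealed letter to its earliest remaining occurrence. The names `is_subsequence_greedy` and `recover_file`, and the toy DNA-like collection, are hypothetical and are not taken from the article.

```python
from typing import List, Sequence


def is_subsequence_greedy(sample: str, file_seq: str) -> bool:
    """Return True if `sample` can be obtained from `file_seq` by deletions.

    Assumed greedy rule: each letter of the sample is matched to the
    earliest position in the file that lies after the previous match.
    """
    pos = 0
    for letter in sample:
        pos = file_seq.find(letter, pos)  # earliest occurrence at or after pos
        if pos == -1:
            return False  # this letter cannot be matched in the remainder
        pos += 1  # continue searching strictly after the matched position
    return True


def recover_file(sample: str, collection: Sequence[str]) -> List[int]:
    """Return the indices of all files consistent with the revealed sample."""
    return [i for i, f in enumerate(collection) if is_subsequence_greedy(sample, f)]


# Toy collection of three "files" (hypothetical data for illustration only).
collection = ["ACGTAC", "TTGACA", "GGCCAA"]
candidates = recover_file("CTA", collection)
print(candidates)
```

In this sketch, recovery succeeds when exactly one index is returned. As the collection grows relative to the sample length, more files tend to match the same sample, which is the trade-off the abstract says the article quantifies.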