Recovery from random samples in a big data set

Molavipour, S ; Sharif University of Technology | 2015

Type of Document: Article
DOI: 10.1109/LCOMM.2015.2478815
Publisher: Institute of Electrical and Electronics Engineers Inc., 2015
Abstract:
Consider a collection of files, each of which is a sequence of letters. One of these files is randomly chosen and a random subsequence of the file is revealed. This random subsequence can be the result of a random sampling of the file. The goal is to recover the identity of the file, assuming a simple greedy matching algorithm to search the file collection. We study the fundamental limits on the maximum size of the file collection for reliable recovery in terms of the length of the random subsequence. The sequence of each file is assumed to follow a hidden Markov model (HMM), which is a common model for many data structures such as voice or DNA sequences. The connection between this problem and coding over a deletion channel with greedy decoders is discussed.
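
The setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the matching rule in detail, so the left-to-right greedy scan below is one plausible reading. The function names (`is_greedy_match`, `random_subsequence`, `recover`), the four-letter alphabet, the file length, the collection size, and the keep probability are all hypothetical choices, and the files are drawn with i.i.d. letters rather than from a hidden Markov model to keep the example short.

```python
import random


def is_greedy_match(sample, file_seq):
    """Return True if `sample` is a subsequence of `file_seq`.

    Each sample letter is matched greedily to its earliest occurrence
    that lies after the previously matched position.
    """
    pos = 0
    for letter in sample:
        # Skip forward until the current letter is found or the file ends.
        while pos < len(file_seq) and file_seq[pos] != letter:
            pos += 1
        if pos == len(file_seq):
            return False  # Ran out of file before matching every letter.
        pos += 1  # Consume the matched position.
    return True


def random_subsequence(file_seq, keep_prob, rng):
    """Keep each letter independently with probability `keep_prob` (assumed sampling model)."""
    return [x for x in file_seq if rng.random() < keep_prob]


def recover(sample, files):
    """Return the indices of all files that are consistent with the sample."""
    return [i for i, f in enumerate(files) if is_greedy_match(sample, f)]


# Hypothetical parameters; i.i.d. letters stand in for the HMM of the paper.
rng = random.Random(0)
files = [[rng.choice("ACGT") for _ in range(200)] for _ in range(50)]
chosen = rng.randrange(len(files))
sample = random_subsequence(files[chosen], keep_prob=0.5, rng=rng)
candidates = recover(sample, files)
```

In this sketch the chosen file always appears in `candidates`, since the sample is a true subsequence of it. Recovery is unambiguous only when no other file also matches, which is the event whose probability depends on the size of the collection relative to the length of the sample.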
Keywords:
search ; Algorithms ; Channel coding ; DNA sequences ; Hidden Markov models ; Markov processes ; Recovery ; Common models ; Deletion channels ; Greedy match ; Matching algorithm ; Random sample ; Random sampling ; Big data
Source: IEEE Communications Letters ; Volume 19, Issue 11, September 2015, Pages 1929-1932 ; 10897798 (ISSN)
URL: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7268856