
Deep Learning for Multimodal Data

Rastegar, Sarah | 2015

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 47596 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Soleymani, Mahdieh
  7. Abstract:
  8. Recent advances in data recording have led to data in different modalities, such as text, image, audio, and video: images are annotated with text, and audio accompanies video. Because the statistical properties of each modality are distinct, shallow methods have been unsuccessful at finding a shared representation that preserves the most information about the different modalities. Recently, deep networks have been used to extract high-level representations for multimodal data. In previous methods, one modality-specific network was learned for each modality, and a high-level representation was extracted from it. Since these high-level representations differ less than the raw modalities do, a shared representation is then computed from them. The main problem with previous methods is that they do not consider lower-level interactions between the modalities. In addition, the final representation is dominated by the stronger modality, so when only a weak modality is present, the representation is not very informative. In this thesis, we extract a high-level representation for each modality using a modality-specific generalized stacked denoising autoencoder. We then keep the high-level representations separate instead of merging them, and reconstruct each level of each modality from the previous level of the other modality through cross edges. The proposed network learns these edges bottom-up in a deep manner. As we show theoretically, the cross edges preserve more inter-modality information. Furthermore, we propose a novel fine-tuning procedure for unsupervised multimodal deep networks that can exploit any amount of supervision information. In experiments, the proposed method outperforms state-of-the-art retrieval methods on the PASCAL-Sentence and SUN-Attribute datasets, shows promising results on an artificial multimodal dataset built from MNIST images, and outperforms state-of-the-art methods in a multilabel application on the Mediamill dataset.
  9. Keywords:
  10. Multi-Modal Data ; Deep Networks ; Cross Edges ; Stacked Denoising Autoencoder ; Unsupervised Fine Tuning
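The architecture the abstract describes (modality-specific denoising encoders joined by cross edges, where each modality's representation is reconstructed from the other modality's previous level) can be sketched roughly as follows. This is a minimal single-layer illustration, not the thesis's actual formulation: the dimensions, weight names, masking-noise corruption, and squared-error cross-reconstruction loss are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def corrupt(x, p=0.3):
    # masking noise, as in denoising autoencoders: zero out a fraction p of inputs
    return x * (rng.random(x.shape) > p)

# toy dimensions (hypothetical): 64-d image features, 32-d text features, 16-d hidden layer
d_img, d_txt, d_hid = 64, 32, 16
W_img = rng.normal(0, 0.1, (d_img, d_hid))      # image-specific encoder weights
W_txt = rng.normal(0, 0.1, (d_txt, d_hid))      # text-specific encoder weights
C_img2txt = rng.normal(0, 0.1, (d_hid, d_hid))  # cross edge: image level -> text level
C_txt2img = rng.normal(0, 0.1, (d_hid, d_hid))  # cross edge: text level -> image level

# a small batch of paired multimodal samples
x_img = rng.random((5, d_img))
x_txt = rng.random((5, d_txt))

# modality-specific representations, computed from corrupted inputs
h_img = sigmoid(corrupt(x_img) @ W_img)
h_txt = sigmoid(corrupt(x_txt) @ W_txt)

# cross-edge reconstructions: each modality's level predicted from the other's
h_txt_from_img = sigmoid(h_img @ C_img2txt)
h_img_from_txt = sigmoid(h_txt @ C_txt2img)

# the reconstruction error the cross edges would be trained to minimize,
# keeping the two representations separate rather than merging them
loss = (np.mean((h_txt_from_img - h_txt) ** 2)
        + np.mean((h_img_from_txt - h_img) ** 2))
```

In a full model, such a layer would be stacked per modality and the cross edges learned bottom-up, layer by layer, before the proposed fine-tuning stage.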
