Answering Questions about Image Contents by Deep Networks

Chavoshian, Mohammad; Soleymani Baghshah, Mahdieh

Chavoshian, Mohammad | 2018

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 50632 (19)
University: Sharif University of Technology
Department: Computer Engineering
Advisor(s): Soleymani Baghshah, Mahdieh
Abstract:
Owing to recent advances in learning from multimodal data, people increasingly use computer systems to solve more complex problems. One such problem is Visual Question Answering (VQA), where the goal is to find the answer to a question asked about the visual contents of a given image. VQA is an interdisciplinary problem spanning Computer Vision, Natural Language Processing, and Reasoning. Because Deep Neural Networks have recently achieved strong results in these areas, recent work has used them to address the VQA task. In this thesis, three methods are proposed, each of which can be added to existing VQA solutions to improve their results. The first method tries to extract as much information as possible from the input question. For instance, if the question asks about the number of fruits in a picture, the system can infer, without seeing the image, that the picture contains some fruit and that the answer should be a number. The proposed technique extracts this information by finding the set of possible answers to the question. The second method uses a Reinforcement Learning agent to crop out the parts of the image that are unrelated to the question. For example, given a picture of a forest containing a man in a red shirt and a question about the color of his shirt, cropping the trees out of the image may help the VQA system answer correctly, since it is no longer distracted by the green of the trees. Lastly, an alternative loss function is proposed for numeric questions. The loss function commonly used for this task is the classification loss, which cannot take the ordering of the classes into account. Because numeric answers are ordered, such a loss is not well suited to them. The third proposed method therefore uses a loss function based on Ordinal Regression to capture this ordering.
In the experiments, all the proposed methods are evaluated on the VQA v1 dataset. The first method suggested acceptable answer sets for the given questions, and the second was relatively good at finding the part of the image related to the question. Moreover, the third method was tested by adding it to two different model structures, and in both cases it improved accuracy on numeric questions.
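The difference between a classification loss and an ordinal loss for numeric answers can be illustrated with a small sketch. The abstract does not say which Ordinal Regression loss the thesis uses; the cumulative-threshold formulation below, the function names, and the toy logits are all assumptions chosen for illustration, not the author's implementation.

```python
import math


def cross_entropy_loss(logits, target):
    """Softmax cross-entropy over K count classes (ignores class order)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def ordinal_loss(threshold_logits, target):
    """Assumed ordinal formulation: K-1 binary tasks, one per threshold.

    threshold_logits[k] scores the event "answer > k". The loss is the
    sum of binary cross-entropies against the targets [target > k].
    """
    loss = 0.0
    for k, z in enumerate(threshold_logits):
        y = 1.0 if target > k else 0.0
        # Numerically stable binary cross-entropy with logits.
        loss += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return loss


def predict_count(threshold_logits):
    """Decode a count as the number of thresholds judged exceeded."""
    return sum(1 for z in threshold_logits if z > 0.0)


if __name__ == "__main__":
    true_count = 1
    # Toy outputs for counts 0..4: one model guesses 2, another guesses 4.
    near_cls, far_cls = [0, 0, 4, 0, 0], [0, 0, 0, 0, 4]
    near_ord, far_ord = [4.0, 4.0, -4.0, -4.0], [4.0, 4.0, 4.0, 4.0]
    print("classification:", cross_entropy_loss(near_cls, true_count),
          cross_entropy_loss(far_cls, true_count))
    print("ordinal:", ordinal_loss(near_ord, true_count),
          ordinal_loss(far_ord, true_count))
```

With a true answer of 1, the classification loss assigns the same penalty to a confident guess of 2 and a confident guess of 4, while this ordinal loss penalizes the guess of 4 more heavily because it crosses more thresholds incorrectly.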
Keywords:
Deep Networks; Reinforcement Learning; Multi-Modal Data; Visual Question Answering
