- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 50632 (19)
- University: Sharif University of Technology
- Department: Computer Engineering
- Advisor(s): Soleymani Baghshah, Mahdieh
- Abstract:
- Recent advances in learning from multimodal data have encouraged the use of computer systems for increasingly complex problems. One such problem is Visual Question Answering (VQA), where the goal is to answer a question about the visual content of a given image. The problem lies at the intersection of Computer Vision, Natural Language Processing, and Reasoning. Because Deep Neural Networks have recently achieved strong results in these areas, recent works have applied them to the VQA task. This thesis proposes three methods, each of which can be added to existing VQA solutions to improve their results. The first method tries to extract as much information as possible from the input question. For instance, if the question asks about the number of fruits in a picture, the system can infer, without seeing the image, that the picture contains some fruit and that the answer should be a number. The proposed technique extracts this information by finding the set of possible answers to the question. The second method uses a Reinforcement Learning agent to crop out the parts of the image that are unrelated to the question. For example, given a picture of a forest containing a man in a red shirt and a question about the color of his shirt, cropping the trees out of the image may help the VQA system answer correctly, since it is no longer distracted by the green of the trees. Lastly, an alternative loss function is suggested for numeric questions. The loss commonly used for this task is a classification loss, which cannot take the ordering of the classes into account. Because numeric answers are ordered, such a loss is not well suited to them. The third proposed solution therefore uses a loss function based on Ordinal Regression to capture this ordering.
In the experiments, all proposed methods are tested on the VQA v1 dataset. The first solution suggested acceptable answer sets for the given questions, and the second was relatively good at finding the part of the image related to the question. Moreover, the third method was tested by adding it to two different structures, and in both cases it improved accuracy on numeric questions.
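
The abstract does not say which Ordinal Regression loss the thesis uses. As an illustration only, the sketch below shows one common formulation, assumed here and not taken from the thesis: a count in `0..K-1` is encoded as `K-1` binary targets ("is the answer greater than k?"), and the loss is the sum of the binary cross-entropies. All function names are hypothetical. Unlike a plain classification loss, this loss grows with the distance between the predicted and the true count.

```python
import math


def ordinal_targets(label, num_classes):
    """Encode label y as K-1 binary targets: t_k = 1 if y > k, else 0."""
    return [1.0 if label > k else 0.0 for k in range(num_classes - 1)]


def ordinal_loss(logits, label):
    """Sum of binary cross-entropies over the K-1 threshold questions.

    `logits` holds K-1 raw scores; logits[k] scores "answer > k".
    This is an assumed formulation, not necessarily the thesis's loss.
    """
    targets = ordinal_targets(label, len(logits) + 1)
    loss = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable binary cross-entropy on a raw logit.
        loss += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return loss


def decode(logits):
    """Predicted count = number of thresholds with probability above 0.5."""
    return sum(1 for z in logits if z > 0)


# Hypothetical scores for 5 classes (counts 0..4), confident in "2".
example_logits = [4.0, 4.0, -4.0, -4.0]
losses = [ordinal_loss(example_logits, y) for y in (2, 3, 4)]
```

With these example scores, `losses` increases as the true count moves from 2 to 3 to 4, which is the ordering-aware behavior the abstract attributes to an ordinal loss.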
- Keywords:
- Deep Networks ; Reinforcement Learning ; Multi-Modal Data ; Visual Question Answering
- Contents
- List of Figures
- List of Tables
- Introduction
- Previous Methods
- Proposed Solution
- Experiments
- Conclusion and Future Work
- References