- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 55770 (02)
- University: Sharif University of Technology
- Department: Mathematical Sciences
- Advisor(s): Alishahi, Kasra
- Abstract:
- The multi-armed bandit is a simple framework for modeling sequential decision-making problems. At every time step a learner chooses among several arms and receives the reward of the chosen arm. Since the environment is unknown, the learner must balance staying with the option that gave the highest payoffs in the past against exploring new options that might give higher payoffs in the future, a tension known as the exploration vs. exploitation dilemma. The goal is to find a policy that minimizes regret, a performance measure of the learner's policy. One can make assumptions on how the rewards are generated, such as a stationary stochastic model; here we abandon almost all of them and consider the adversarial bandit model, in which an adversary chooses the rewards. In this thesis we prove a regret upper bound of Õ(d^2.5 √n) for the case where the loss functions are convex (equivalently, the rewards are concave), where d is the dimension of the action set and n is the time horizon. The convex bandit generalizes linear and finite-armed bandits, for which the known upper and lower bounds on regret are tight up to logarithmic factors in n and d. The best known lower bound for the convex bandit is Ω(√n), which holds even when the function class is restricted to linear functions. (A toy sketch of adversarial-bandit regret appears after the keyword list below.)
- Keywords:
- Regret Minimization ; Multi-Armed Bandit Problem ; Exploration/Exploitation Dilemma ; Bayesian Bandit ; Adversarial Bandit ; Convex Bandit
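To make the regret notion concrete, here is a minimal Python sketch of Exp3, a standard algorithm for the adversarial finite-armed bandit. It is not the convex-bandit method of the thesis, and the function name, step size, and toy loss matrix are all illustrative assumptions; regret is measured against the best fixed arm in hindsight.

```python
import numpy as np

def exp3(loss_matrix, eta, seed=0):
    """Exp3 on a fixed adversarial loss sequence.

    loss_matrix: (n, K) array of losses in [0, 1] chosen by the adversary.
    eta: learning rate; the standard tuning is sqrt(2 log K / (n K)).
    Returns realized regret against the best fixed arm in hindsight.
    """
    n, K = loss_matrix.shape
    log_w = np.zeros(K)                  # log-weights, kept in log space for stability
    rng = np.random.default_rng(seed)
    total_loss = 0.0
    for t in range(n):
        p = np.exp(log_w - log_w.max())  # exponential weights -> sampling distribution
        p /= p.sum()
        arm = rng.choice(K, p=p)         # randomization drives exploration
        loss = loss_matrix[t, arm]
        total_loss += loss
        est = np.zeros(K)                # importance-weighted estimate:
        est[arm] = loss / p[arm]         # unbiased for every arm's loss
        log_w -= eta * est
    best_fixed = loss_matrix.sum(axis=0).min()
    return total_loss - best_fixed

# Hypothetical toy run: 2 arms, arm 1 slightly better on average.
n, K = 10_000, 2
rng = np.random.default_rng(1)
losses = rng.uniform(size=(n, K))
losses[:, 1] *= 0.9
eta = np.sqrt(2 * np.log(K) / (n * K))
print(f"regret = {exp3(losses, eta):.1f}, "
      f"sqrt(2 n K log K) = {np.sqrt(2 * n * K * np.log(K)):.1f}")
```

The importance-weighted estimator is what lets the learner update every arm from bandit feedback on a single arm; with the stated step size, Exp3's expected regret is O(√(n K log K)), the finite-armed analogue of the √n dependence in the convex-bandit bound above.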