Abstract

Bandits constitute a classical framework for decision-making under uncertainty. Stochastic contextual bandits are a variant in which the decision-maker chooses among multiple arms, each with its own stochastic context. In this setting, the goal is to select the arm of highest reward given the contextual information, while the unknown reward parameter of each arm must be learned. To maximize cumulative reward, an adaptive policy is required to manage the delicate trade-off between learning the best (i.e., exploration) and earning the most (i.e., exploitation). The existing literature on this problem mostly considers perfectly observed contexts; the setting of partial context observations remains unexplored to date, despite being theoretically more general and practically more versatile. We therefore consider partial observations that are noisy linear functions of the unobserved context vectors. Another important issue for contextual bandits is finding the algorithm that best handles the exploration-exploitation trade-off under different reward structures. We consider two reward setups, one with a parameter shared across all arms and one with arm-specific parameters, and study two policies, the Greedy and Thompson sampling algorithms, in each. This study shows that the Greedy algorithm attains the optimal rate in the shared-parameter setup, while Thompson sampling successfully balances exploration and exploitation in the arm-specific setup. Specifically, we establish the following primary results: (i) the Greedy algorithm admits a high-probability upper bound on regret in the shared-parameter setup, and (ii) Thompson sampling admits poly-logarithmic high-probability upper bounds on regret in both setups. Extensive numerical experiments with both real and synthetic data are presented as well, corroborating the efficacy of Thompson sampling and the Greedy algorithm. To establish the results, we utilize martingale techniques and concentration inequalities for dependent data, and we develop novel probabilistic bounds for time-varying suboptimality gaps, among other tools. These techniques pave the way toward studying other decision-making problems with contextual information.

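To make the setting concrete, the sketch below illustrates Thompson sampling for a linear contextual bandit with arm-specific reward parameters, where each arm's context is seen only through a noisy linear observation. This is a generic illustration, not the authors' algorithm: the dimensions, the sensing matrix A, the Gaussian priors, and the noise scales are assumptions chosen for the example.

```python
# Minimal sketch (assumed model, not the paper's method): Thompson sampling
# with arm-specific reward parameters and partially observed contexts,
# where the observation is y = A x + noise for an unobserved context x.
import numpy as np

rng = np.random.default_rng(0)

d, m, K, T = 5, 5, 3, 2000           # context dim, observation dim, arms, horizon (assumed)
A = rng.normal(size=(m, d))          # known sensing matrix for the partial observations (assumed)
sigma_x, sigma_y, sigma_r = 1.0, 0.5, 0.5
true_mu = [rng.normal(size=d) for _ in range(K)]   # arm-specific reward parameters

# Per-arm Bayesian linear-regression posteriors with an N(0, I) prior.
B = [np.eye(d) for _ in range(K)]    # precision matrices
f = [np.zeros(d) for _ in range(K)]  # precision-weighted means

regret = 0.0
for t in range(T):
    # Latent contexts and their noisy partial observations.
    xs = [rng.normal(scale=sigma_x, size=d) for _ in range(K)]
    ys = [A @ x + rng.normal(scale=sigma_y, size=m) for x in xs]

    # Estimate each context from its observation (posterior mean under the
    # Gaussian model above), then run standard linear Thompson sampling.
    S = np.linalg.inv(A.T @ A / sigma_y**2 + np.eye(d) / sigma_x**2)
    x_hat = [S @ (A.T @ y) / sigma_y**2 for y in ys]

    sampled = []
    for k in range(K):
        cov = np.linalg.inv(B[k])
        theta = rng.multivariate_normal(cov @ f[k], sigma_r**2 * cov)
        sampled.append(theta @ x_hat[k])
    a = int(np.argmax(sampled))

    # Observe the reward for the chosen arm and update its posterior.
    r = true_mu[a] @ xs[a] + rng.normal(scale=sigma_r)
    B[a] += np.outer(x_hat[a], x_hat[a])
    f[a] += r * x_hat[a]
    regret += max(true_mu[k] @ xs[k] for k in range(K)) - true_mu[a] @ xs[a]

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Swapping the posterior sampling step for the posterior mean yields the corresponding Greedy policy; the paper's analysis concerns when each choice is preferable under the shared and arm-specific reward setups.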