Abstract

Bandits constitute a classical framework for decision-making under uncertainty. Stochastic contextual bandits are a variant in which the decision-maker chooses among multiple arms, each with its own stochastic context. In this setting, the goal is to select the arm of highest reward given the contextual information, while the unknown reward parameter of each arm must be learned. To maximize cumulative reward, an adaptive policy is required to manage the delicate trade-off between learning the best (i.e., exploration) and earning the most (i.e., exploitation). The existing literature on this problem mostly considers perfectly observed contexts; the setting of partial context observations remains unexplored to date, despite being theoretically more general and practically more versatile. We therefore consider partial observations that are noisy linear functions of the unobserved context vectors. Another important issue for contextual bandits is finding the algorithm that best handles the exploration-exploitation trade-off under different reward structures. We consider two reward setups, one with a parameter shared across all arms and one with arm-specific parameters, and study two policies, the Greedy and Thompson sampling algorithms, in each. This study shows that the Greedy algorithm attains the optimal rate in the shared-parameter setup, while Thompson sampling successfully balances exploration and exploitation in the arm-specific setup. Specifically, we establish the following primary results: (i) the Greedy algorithm admits a high-probability upper bound on regret in the shared-parameter setup, and (ii) Thompson sampling admits poly-logarithmic high-probability upper bounds on regret in both setups. Extensive numerical experiments with both real and synthetic data are presented as well, corroborating the efficacy of Thompson sampling and the Greedy algorithm. To establish the results, we utilize martingale techniques and concentration inequalities for dependent data, and we develop novel probabilistic bounds for time-varying suboptimality gaps, among other tools. These techniques pave the way toward studying other decision-making problems with contextual information.

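To make the setting concrete, the sketch below illustrates Thompson sampling for a linear contextual bandit with arm-specific reward parameters, where each arm's context is seen only through a noisy linear observation. This is a generic illustration, not the authors' algorithm: the dimensions, the sensing matrix A, the Gaussian priors, and the noise scales are assumptions chosen for the example.

```python
# Minimal sketch (assumed model, not the paper's method): Thompson sampling
# with arm-specific reward parameters and partially observed contexts,
# where the observation is y = A x + noise for an unobserved context x.
import numpy as np

rng = np.random.default_rng(0)

d, m, K, T = 5, 5, 3, 2000           # context dim, observation dim, arms, horizon (assumed)
A = rng.normal(size=(m, d))          # known sensing matrix for the partial observations (assumed)
sigma_x, sigma_y, sigma_r = 1.0, 0.5, 0.5
true_mu = [rng.normal(size=d) for _ in range(K)]   # arm-specific reward parameters

# Per-arm Bayesian linear-regression posteriors with an N(0, I) prior.
B = [np.eye(d) for _ in range(K)]    # precision matrices
f = [np.zeros(d) for _ in range(K)]  # precision-weighted means

regret = 0.0
for t in range(T):
    # Latent contexts and their noisy partial observations.
    xs = [rng.normal(scale=sigma_x, size=d) for _ in range(K)]
    ys = [A @ x + rng.normal(scale=sigma_y, size=m) for x in xs]

    # Estimate each context from its observation (posterior mean under the
    # Gaussian model above), then run standard linear Thompson sampling.
    S = np.linalg.inv(A.T @ A / sigma_y**2 + np.eye(d) / sigma_x**2)
    x_hat = [S @ (A.T @ y) / sigma_y**2 for y in ys]

    sampled = []
    for k in range(K):
        cov = np.linalg.inv(B[k])
        theta = rng.multivariate_normal(cov @ f[k], sigma_r**2 * cov)
        sampled.append(theta @ x_hat[k])
    a = int(np.argmax(sampled))

    # Observe the reward for the chosen arm and update its posterior.
    r = true_mu[a] @ xs[a] + rng.normal(scale=sigma_r)
    B[a] += np.outer(x_hat[a], x_hat[a])
    f[a] += r * x_hat[a]
    regret += max(true_mu[k] @ xs[k] for k in range(K)) - true_mu[a] @ xs[a]

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Swapping the posterior sampling step for the posterior mean yields the corresponding Greedy policy; the paper's analysis concerns when each choice is preferable under the shared and arm-specific reward setups.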