节点文献

双臂赌博机问题与相关极限定理

Two-Armed Bandit Problems and Related Limit Theorems

【作者】 张兆昂

【导师】 陈增敬;

【作者基本信息】 山东大学 , 金融数学与金融工程, 2023, 博士

【摘要】 双臂赌博机问题(Two-armed bandit,TAB)问题最初是由Thompson[55]在1933年为临床试验的应用而描述的,是一种序列设计问题的形式,通常用一台有两个臂的赌博机来描绘.决策者可以每次从两个臂中抽出一个,并根据某些分布获得随机回报.决策者的目标是考虑到之前所有的选择和结果并通过设计一个连续的决策规则来确定一个具体的策略,以使预期总收益最大化.这个经典问题简明地模拟了强化学习中探索(多次尝试每条手臂以进一步增加认知)和利用(拉动当前最优手臂以使得奖励最大化)之间的权衡.受现实世界中所面临的那些权衡的启发,如商业和工业,临床试验和工程中的学习和优化问题,TAB问题自提出以来受到广泛关注.该问题也有非常多的实际应用,关于其在广告投放、源路由、计算机游戏、推荐服务、等待问题和资源分配等问题中的重要性和适用性,可以参考[8],[37].与其他含有未知的问题类似,研究TAB问题时有两种主要的框架.第一种是所谓的频率论框架,此时两臂的成功概率是未知的确定的常数.由于策略的表现取决于未知的参数,此时一个明显的挑战是,鉴于没有任何策略在所有参数下都能比所有其他策略有更好的表现,所以该如何比较策略的优劣.另一种是贝叶斯框架,此时代表两个手臂的成功概率的未知参数是具有某给定先验分布的随机变量.随着试验的进行,相应的后验分布会不断更新.贝叶斯定理为自适应学习提供了一个强有力的数学方法,因而是序列设计问题的一个理想工具.然而,后验概率的随机性使得在这个框架中考虑具体问题极具挑战性.在这篇论文中,我们将分别采取这两种框架并研究不同的问题,如最优和渐近最优策略,并得到了一些相关的极限定理,如大数定律和中心极限定理.同时,在经典概率论中,概率测度的可加性起着关键作用.然而,这样的可加性假设在很多领域都被放弃了,因为无数的不确定现象不能用可加概率或线性期望来很好地建模.最近,受金融市场中上对冲定价及波动不确定性等现象的启发,Peng[43]引入了次线性期望的框架,为处理这些问题提供了非常有效的方法.Peng还提出了独立和同分布的随机变量的概念,并得到了一个新的弱大数定律.后来,Chen等人[18]得到了一个新的强大数定律.更多的相关结果可参见[13],[17],[30],[31],[32],和[42].请注意,这些结果大多是基于独立的假设.然而,在许多情况下,独立性通常不成立,相依性反而是一个自然而然的特征.在经典概率论中,相依性通常由一系列相依指标来刻画,例如,强混合,ρ混合和ψ混合条件(分别见[49],[35]和[5]).相比之下,在次线性期望的框架下,此类的相依性很难定义.这就衍生出了如下问题:此时有没有一种合适的方法来刻画依赖性呢?注意到在处理金融领域的实际问题时,线性过程通常被用来描述金融风险(见Peng等人[44]),所以我们自然会想到用它来刻画相依性.此外,在经典的概率论中,用线性过程来刻画相依性也是一种十分有效的方式,参见Wu[57].在此篇论文中,我们研究了次线性期望框架下线性过程的极限行为,并推导出了它们的大数定律.本论文共有三章,结构如下论文的第一章,我们在频率论框架下研究了TAB问题.我们首先提出了一个渐近最优策略,采用此策略时的平均收益将收敛到较大期望值.接下来,我们得到了一个推广的渐近最优策略,此时的平均收益率可以收敛于双臂最大最小期望中间中的任意一点.数值模拟的结果表明我们提出的策略表现良好.然后,我们推广了 Chen等人[15]的收敛到Bandit分布的中心极限定理,并将其应用于相似性检验.我们发现它比基于正态分布的传统方法具有更高的功效,数值模拟的结果也验证了我们的结论.论文的第二章,我们研究了一类贝叶斯TAB问题,此时给定了关于哪个手臂是更好的先验概率,随着试验的进行,相应的后验概率也在不断地更新.我们首先考虑当总收益的拱形奖励函数存在时的情况,因为这种函数在计算二阶矩、方差、区间概率等方面起着关键作用.我们提出了一种短视策略,并通过动态规划原理证明此策略在拱形奖励函数的结构下是最优的.然后,我们研究了不同策略下后验概率的性质,证明这些后验概率是随着试验次数的增加几乎处处趋向于0或1的鞅,并且具有相同的分布.基于这些性质,我们给出了另外两种短视策略,并得到了相应的大数定律和中心极限定理.论文的第三章,我们研究了次线性期望框架下线性过程的极限行为,并推导出了相应的大数定律,包括弱大数定律和强大数定律.我们发现它们是经典线性情况下的定理的自然延伸,也可以从我们的结果中推导出次线性期望下独立随机变量的相应大数定律.

【Abstract】 The two-armed bandit(TAB)problem,originally described by Thompson[55]in 1933 for the application of clinical trial,is a form of the sequential design problem and usually portrayed by a slot machine with two arms,X-and Y-arm.A decision-maker can draw one of two arms at a time and receive a random return based on certain distributions.The goal of the decision-maker is to determine a precise strategy to maximize the expected total return by designing a sequential decision rule,taking into account all the previous choices and results.This classic problem concisely models the trade-offs between exploration(attempting each arm several times to further increase knowledge)and exploitation(pulling the arm believed for reward maximization)in reinforcement learning.Motivated by those trade-offs faced in the real world,such as learning and optimization problems in business and industry,clinical trials,and engineering,the TAB problem has received much attention since it was proposed.Bandit problems also have practical applications.For further importance and applicability of this problem in Ad placement,source routing,computer game-playing,recommendation services,waiting problems,and resource allocation,one can refer to[8],[37].Similar to other problems with unknowns,there are two major approaches to formulating the TABs.One is the so-called frequentist point of view in which the probabilities of success for both arms are deterministic and unknown parameters.When taking this approach,since the effectiveness of a strategy is dependent on the unknown parameters,an obvious challenge is how to compare strategies given that no strategy dominates all other strategies for all parameter values.The other approach is the Bayesian point of view in which the unknown parameters of the probability of success for both arms are random variables with given prior distributions.As the trial progresses,the distributions are continuously updated.Bayes’s theorem provides a convenient mathematical formalism that allows for adaptive learning and is an ideal tool for sequential decision problems.However,the randomness of the posterior probabilities makes it extremely challenging to consider specific problems in this framework.In this dissertation,we take these two approaches separately and study different problems,such as optimal and asymptotically optimal strategies,and obtain some limit theorems,such as laws of large numbers and central limit theorems.Meanwhile,in classical probability theory,the additivity of probability measures plays a crucial role.However,such an additivity assumption has been abandoned in many areas because myriad uncertain phenomena cannot be well modeled by using additive probabilities or linear expectations.Recently,motivated by problems of model uncertainties in statistics,measures of risk,and super-hedging pricing in finance,Peng[43]introduced the framework of sublinear expectation,which provides very effective methods to handle those problems.Peng also initiated the notion of independently and identically distributed random variables and developed a new weak law of large numbers.Later,[18]obtained a new strong law of large numbers.More related results can be found in[13],[17],[30],[31],[32],and[42].Note that most of these results are based on the assumption of independence.However,in many situations,independence usually does not hold,and dependence is an intrinsic feature.In classical probability theory,dependence is often characterized by a series of dependence coefficients,for example,strong-mixing,ρ-mixing,and φ-mixing conditions(see[49],[35],and[5],respectively).By contrast,in the framework of sublinear expectation,it is difficult to define the corresponding conditions.Here comes the question:Is there a proper way to characterize dependence?When dealing with practical problems in finance,linear processes are usually used to describe financial risks(see Peng et al.[44]),so we naturally think of using them to characterize dependence.Besides,it is also an advantageous way in the classical case(see Wu[57]).In this dissertation,we investigate the limit behavior of linear processes in the framework of sublinear expectation and derive laws of large numbers for them.This dissertation has three chapters and is organized as follows:In Chapter 1,we study the TAB problems in the frequentist framework.We first present an asymptotically optimal strategy,which leads to a convergence of the average returns to the larger expectation.Next,we provide a generalized asymptotically optimal strategy such that the average return converges to any point between the maximum and minimum expectations of the two arms.Simulation studies pose supportive evidence that the newly proposed strategies perform well.Then we generalize the central limit theorem of Chen et al.[15],which converges to the Bandit distribution,and apply it to similarity test,which we find to be more powerful than the traditional method based on the normal distribution.The simulation results validate our conclusions.In Chapter 2,we study a type of Bayesian TAB problem,where a prior probability on which arm generated more expectation of the returns is given,and as the trial processes,the corresponding posterior probability is updated continuously.We first consider the existence of an arched reward function,since such a function plays a key role when calculating the second-order moments,variance,probability of intervals,and so on.We propose a myopic strategy and show it is optimal under the structure of the arched reward function by exploring the dynamic programming principle.Then we study the properties of posterior probabilities under different strategies,which are proved to be martingales that tend to 0 or 1 almost surely as the number of trials increases and have the same distribution.Based on these properties,we introduce two other myopic strategies and obtain corresponding laws of large numbers and central limit theorems.In Chapter 3,we investigate the limit behavior of linear processes in the framework of sublinear expectation and derive several laws of large numbers for them.It turns out that our theorems are natural extensions of the ones in the classical linear case,and we can derive the corresponding laws of large numbers for independent random variables under sublinear expectation from our result.

  • 【网络出版投稿人】 山东大学
  • 【网络出版年期】2024年 03期
  • 【分类号】TP18
  • 【下载频次】147
节点文献中: 

本文链接的文献网络图示:

本文的引文网络