Earning While Learning: An Adversarial Multi-Armed Bandit Based Real-Time Bidding Scheme in Deregulated Electricity Market

Yufeng Wang*, Bo Zhang, Jianhua Ma, Qun Jin


Research output: Article (peer-reviewed)


As a specific incarnation of cyber-physical-social systems, the deregulated electricity market is subject to market gaming behaviors that may significantly affect the cost of electricity delivered to the market. On the supply side, the primary goal of power generating companies (PGCs) is to develop strategic bids that maximize their profits in long-term trading under intrinsic uncertainty. In such repeated and dynamic settings, one fundamental challenge is that a PGC neither has prior knowledge of its unknown opponents' incentives nor observes their strategies and obtained profits. In the common setting, once an auction round has concluded, the PGC observes only the market clearing price (MCP) of that round and whether it won or lost. While it is typical to assume some perfect or bounded rationality model of the PGCs, their real behaviors do not follow such assumptions due to lack of complete information, computational intractability, or imperfect execution. We formulate the problem of sequentially optimizing a PGC's bids as an adversarial multi-armed bandit (MAB) problem. Specifically, at each round, a PGC plays against all other opponents by choosing from an infinite set of possible strategies, which is split into continuous intervals by the sequentially occurring MCPs. At the end of each round, the PGC observes the outcome of the auction, updates its estimate of the expected fitness of bids in each interval (i.e., how much expected profit the interval could achieve), and selects the bid for the next round using the proposed algorithm Exp3C (exponential-weight algorithm for exploration and exploitation with continuous values). Experimental results on a real dataset demonstrate that Exp3C performs better than other heuristic schemes, including pure greedy, ϵ-greedy, and MCP-prediction-based bidding schemes.
Moreover, we theoretically prove that the average per-round regret of Exp3C is upper-bounded by O(2/√T), where T is the total number of rounds. In summary, the proposed Exp3C has two distinguishing advantages. First, it is distributed, since a PGC's decisions depend only on its own past decisions and profits. Second, it is rational, since a PGC is given guarantees on its own accumulated profit regardless of other PGCs' behaviors.
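
The abstract does not give Exp3C's update rule, so the following is only a rough sketch of the general idea using the textbook Exp3 algorithm over a fixed grid of bid-price intervals. It is not the authors' Exp3C, which splits the bid space by the observed MCPs rather than by a fixed grid. The names `Exp3Bidder`, `toy_reward`, and `run_rounds`, the pay-as-bid profit model, and all parameter values are assumptions made for illustration.

```python
import math
import random


class Exp3Bidder:
    """Textbook Exp3 over fixed bid-price intervals (illustrative, not Exp3C)."""

    def __init__(self, price_min, price_max, n_intervals, gamma, seed=None):
        step = (price_max - price_min) / n_intervals
        # Interval i covers [edges[i], edges[i + 1]].
        self.edges = [price_min + i * step for i in range(n_intervals + 1)]
        self.k = n_intervals
        self.gamma = gamma  # exploration rate in (0, 1]
        self.weights = [1.0] * n_intervals
        self.rng = random.Random(seed)

    def probabilities(self):
        """Mix the normalized weights with uniform exploration."""
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def choose_bid(self):
        """Sample an interval, then a bid uniformly inside it."""
        probs = self.probabilities()
        arm = self.rng.choices(range(self.k), weights=probs)[0]
        bid = self.rng.uniform(self.edges[arm], self.edges[arm + 1])
        return arm, bid, probs[arm]

    def update(self, arm, prob, reward):
        """Exponential-weight update with an importance-weighted reward.

        `reward` must lie in [0, 1]; only the played interval is updated.
        """
        estimate = reward / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / self.k)
        # Rescale by the largest weight; probabilities are unchanged.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


def toy_reward(bid, mcp, cost, price_max):
    """Assumed pay-as-bid profit, normalized to [0, 1]; a losing bid earns 0."""
    if bid > mcp:
        return 0.0
    return max(0.0, (bid - cost) / (price_max - cost))


def run_rounds(bidder, mcps, cost, price_max):
    """Play one round per observed MCP and return the total normalized profit."""
    total = 0.0
    for mcp in mcps:
        arm, bid, prob = bidder.choose_bid()
        reward = toy_reward(bid, mcp, cost, price_max)
        bidder.update(arm, prob, reward)
        total += reward
    return total
```

A call such as `run_rounds(Exp3Bidder(10.0, 100.0, 9, 0.1, seed=0), [50.0] * 2000, 10.0, 100.0)` plays 2000 rounds against a constant MCP. The bidder uses only its own bids and rewards, which matches the "distributed" property described in the abstract, and intervals that never win keep their initial weight while winning intervals gain probability mass.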

Journal: IEEE Transactions on Network Science and Engineering
Publication status: Published - 2022

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Computer Science Applications
  • Computer Networks and Communications

