Pu Wang 王普

Pu Wang
Undergraduate student
Turing Class, College of Computer Science and Technology,
Zhejiang University
Email: puwang0508@gmail.com
GitHub / Notebook

About me

I am a third-year undergraduate student (2023.9 – Present) in the Turing Class, Chu Kochen Honors College, Zhejiang University, pursuing an honors B.E. in Artificial Intelligence. Since March 2025, I have been a research intern at the State Key Lab of CAD&CG, advised by Prof. Yao-Xiang Ding.

My research interests lie in the theory and algorithms of sequential decision-making, with a growing focus on large language models and language agents. In particular, I am interested in the algorithmic foundations of how language models and agents reason, plan, optimize their behavior, and make decisions in interactive environments. More broadly, I aim to connect ideas from reinforcement learning, imitation learning, and preference learning to modern problems in LLM post-training, RLHF, agentic AI, and alignment.

Research directions

  • Sequential decision-making. Theory and algorithms for bandits and reinforcement learning, e.g., unified frameworks for best arm identification and regret minimization in dueling bandits.

  • Imitation & preference learning. Learning from partial or heterogeneous supervision, with connections to RLHF and preference-based optimization.

  • LLMs & language agents. How language models and agents reason, plan, and make decisions in interactive environments, and how to align them.

News

  • 2026.06 – Our work on a unified framework for dueling bandits (TG-ITE) is available as a preprint on arXiv.

Selected Papers

Tree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits
Pu Wang, Yao-Xiang Ding
arXiv preprint, 2026
tl;dr: The first unified framework for N-armed dueling bandits that jointly handles best arm identification, weak regret, and strong regret under only the Condorcet-winner assumption, via a shared tree-based identification primitive paired with objective-specific exploitation.

For the full list, see Publications.