Pu Wang 王普

Pu Wang
Undergraduate student
Turing Class, College of Computer Science and Technology,
Zhejiang University
Email: puwang0508@gmail.com
GitHub / Notebook

About me

I am a third-year undergraduate student (Sept. 2023 to present) in the Turing Class, Chu Kochen Honors College, Zhejiang University, pursuing a B.E. with honors in Artificial Intelligence. Since March 2025, I have been a research intern at the State Key Lab of CAD&CG, advised by Prof. Yao-Xiang Ding.

My research interests lie in the theory and algorithms of sequential decision making, with a growing focus on large language models and language agents. I am interested in the algorithmic foundations of how language models and agents reason, plan, learn from other agents, and adapt their behavior in interactive environments. More broadly, I aim to connect ideas from reinforcement learning, imitation learning, and machine teaching to modern problems in reasoning-oriented post-training and self-improving language agents.

Research directions

Sequential decision making. Theory and algorithms for bandits and reinforcement learning, e.g., unified frameworks for best arm identification and regret minimization in dueling bandits.
Imitation learning & teaching. How agents can teach and learn behaviorally sufficient representations, especially when one agent learns from multiple heterogeneous or imperfect teachers.
LLMs & language agents. How language models and agents reason, plan, learn, and adapt in interactive environments, with an emphasis on verifiable feedback and self-improvement.

News

June 2026 — Our work on a unified framework for dueling bandits (TG-ITE) is available as a preprint on arXiv.

Selected Papers

Tree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits
Pu Wang, Yao-Xiang Ding
arXiv preprint, 2026
tl;dr: The first unified framework for N-armed dueling bandits that jointly handles best arm identification, weak regret, and strong regret under only the Condorcet-winner assumption, via a shared tree-based identification primitive paired with objective-specific exploitation.

For the full list, see Publications.