I'm a third-year Ph.D. candidate at MMLab, The Chinese University of Hong Kong, supervised by Prof. Dahua Lin. I work on scene-level visual creation and manipulation using structured representations and multimodal generative models.
In my research, I consistently leverage the inherent structure of visual data.
In video understanding, I exploit the natural temporal and semantic structure of videos to enable efficient perception.
In visual scene generation, I study how structured scene representations support reliable and controllable creation and manipulation of visual content.
Imagine360 lifts standard perspective video into 360-degree video with rich, structured motion, unlocking a dynamic scene experience across the full 360 degrees.
LayerPano3D generates a full-view, explorable panoramic 3D scene from a single text prompt.
Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection
Jing Tan, Yuhong Wang, Gangshan Wu, Limin Wang
T-PAMI, 2023
arXiv / code / blog
We present Temporal Perceiver (TP), a general architecture based on Transformer decoders as a unified solution to detect arbitrary generic boundaries, including shot-level, event-level and scene-level temporal boundaries.
PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points
Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, Limin Wang
NeurIPS, 2022
arXiv / code / blog
PointTAD effectively tackles multi-label TAD by introducing a set of learnable query points to represent the action keyframes.
A new dual-level, query-based TAD framework that precisely detects actions at both the instance level and the boundary level.
Experience
AWS AI Lab
Applied Scientist Intern, June 2025 – Nov. 2025, Bellevue, WA, US
Conducted research in collaboration with Prof. Zhuowen Tu and Prof. Jiajun Wu on reinforcement learning–guided spatial-aware image editing, with Yantao Shen as the internship manager.
Leveraged RLVR to enable precise geometric object transformations in images following text instructions.
Tencent, PCG
Research Intern, Dec. 2021 – Mar. 2023, Beijing, China
Worked on multi-label temporal action detection via learnable query points.
Developed a more general video action detection framework capable of localizing specific actions among multiple simultaneous actions.
Selected Honors & Awards
CSIG Master’s Thesis Incentive Program Awardee, China Society of Image and Graphics, 2025
Outstanding Master’s Thesis Award (3/226), Nanjing University, 2024
National Scholarship (6/226), Ministry of Education of China, 2022