Weili Xu

Seeking research opportunities in MLSys
Check out my CV here

I am a research intern at Together AI and a rising senior undergraduate in Computer Engineering. I’m currently pursuing a dual degree from the University of Illinois Urbana-Champaign and Zhejiang University.

I’m interested in various aspects of machine learning and computer systems:

Efficient sequence modeling algorithms with hardware-aware implementation
Heterogeneous runtime optimization for agentic workloads
Long-context modeling for multi-modal (text, video, audio, etc.) applications

My journey into MLSys research began with AuroraLong, a hybrid multimodal LLM I built that unlocked hour-long video understanding on consumer GPUs and led to a first-author paper accepted at ICCV 2025. This steered my focus toward system-driven modeling, where we co-design architecture and infrastructure to bridge the gap between fantastic algorithms and the rapid iteration that scales them.

news

Jun 29, 2026	ThunderAgent is accepted as a Spotlight paper at ICML 2026 and is integrated into NVIDIA Dynamo and SkyRL.
May 18, 2026	Started internship at Together AI, see you in SF!
Oct 20, 2025	Video-MMLU received the Outstanding Paper Award at the ICCV 2025 Workshop on Knowledge-Intensive Multimodal Reasoning!

latest posts

Feb 14, 2026	Understanding Activation Memory Dynamics in Pipeline Parallelism Variants
Feb 07, 2026	How Thread Block Swizzling boosts L2 Cache Hit Rate in Matrix Multiplication
Jan 30, 2026	Implementing Flash Attention: Backward Pass in Triton

selected publications

ThunderAgent

ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System

Hao Kang^*, Ziyang Li^*, Weili Xu^*, Xinyu Yang^*, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, and Simran Arora

In Proceedings of the International Conference on Machine Learning (ICML), 2026

ICML 2026 Spotlight. Integrated into NVIDIA Dynamo and SkyRL.

ThunderAgent is a simple, fast, and program-aware system for agentic inference. It introduces a program-aware scheduler to improve KV-cache reuse and balance memory usage across nodes, together with tool-call lifecycle management for long-running agentic rollouts. ThunderAgent has been integrated into NVIDIA Dynamo and SkyRL.

@inproceedings{kang2026thunderagent,
  title = {ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System},
  author = {Kang, Hao and Li, Ziyang and Xu, Weili and Yang, Xinyu and Chen, Yinfang and Wang, Junxiong and Chen, Beidi and Krishna, Tushar and Xu, Chenfeng and Arora, Simran},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year = {2026},
  url = {https://openreview.net/forum?id=kR4iOTaAOJ&referrer=%5Bthe%20profile%20of%20Beidi%20Chen%5D(%2Fprofile%3Fid%3D%7EBeidi_Chen1)},
  forum = {https://openreview.net/forum?id=kR4iOTaAOJ&referrer=%5Bthe%20profile%20of%20Beidi%20Chen%5D(%2Fprofile%3Fid%3D%7EBeidi_Chen1)},
  note = {ICML 2026 <strong>Spotlight</strong>. Integrated into <a href="https://docs.nvidia.com/dynamo/latest/user-guides/agents/thunder-agent-program-scheduler">NVIDIA Dynamo</a> and <a href="https://github.com/NovaSky-AI/SkyRL/tree/main/examples/train/thunder_agent">SkyRL</a>.},
}

AuroraLong
AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding

Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian Ye, and Gaoang Wang

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2025

The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequences of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. To the best of our knowledge, we are the first to use a linear RNN based LLM backbone in a LLaVA-like model for open-ended video understanding.
@inproceedings{xu2025auroralong,
  title = {AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding},
  author = {Xu, Weili and Song, Enxin and Chai, Wenhao and Wen, Xuexiang and Ye, Tian and Wang, Gaoang},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = oct,
  year = {2025},
  pages = {23453--23465},
}
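
The two ideas named in the AuroraLong abstract, a constant-size hidden state and merged visual tokens reordered by size in ascending order, can be illustrated with a toy sketch. This is not the AuroraLong implementation: the scalar `decay` recurrence, the function names `linear_rnn_scan` and `order_merged_tokens`, and the reading of "size" as the number of patches merged into a token are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def linear_rnn_scan(tokens: Sequence[Sequence[float]], decay: float = 0.9) -> List[float]:
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t (hypothetical, for illustration).

    Only a single state vector is kept, so memory stays constant
    no matter how many tokens are processed.
    """
    if not tokens:
        return []
    state = [0.0] * len(tokens[0])
    for x in tokens:
        # Overwrite the state in each step; no per-token history is stored.
        state = [decay * h + xi for h, xi in zip(state, x)]
    return state


def order_merged_tokens(
    tokens: Sequence[Sequence[float]], sizes: Sequence[int]
) -> Tuple[List[Sequence[float]], List[int]]:
    """Reorder merged tokens by size, ascending.

    Assumption: `sizes[i]` counts how many original patches were merged into tokens[i].
    """
    order = sorted(range(len(tokens)), key=lambda i: sizes[i])
    return [tokens[i] for i in order], [sizes[i] for i in order]


if __name__ == "__main__":
    merged = [[3.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    sizes = [4, 1, 2]
    ordered, ordered_sizes = order_merged_tokens(merged, sizes)
    final_state = linear_rnn_scan(ordered, decay=0.5)
    print(ordered_sizes, len(final_state))
```

The state returned by `linear_rnn_scan` has the same length whether the input holds three tokens or thousands, which is the property that contrasts with the quadratic attention cost described in the abstract.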
Video-MMLU
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark

Enxin Song, Wenhao Chai, Weili Xu, Jianwen Xie, Yuxuan Liu, and Gaoang Wang

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2025

Recent advancements in language multimodal models (LMMs) for video have demonstrated their potential for understanding video content, yet the task of comprehending multi-discipline lectures remains largely unexplored. We introduce Video-MMLU, a massive benchmark designed to evaluate the capabilities of LMMs in understanding Multi-Discipline Lectures. We evaluate over 90 open-source and proprietary models, ranging from 0.5B to 40B parameters. Our results highlight the limitations of current models in addressing the cognitive challenges presented by these lectures, especially in tasks requiring both perception and reasoning. Additionally, we explore how the number of visual tokens and the large language models influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.
@inproceedings{song2025videommlu,
  title = {Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark},
  author = {Song, Enxin and Chai, Wenhao and Xu, Weili and Xie, Jianwen and Liu, Yuxuan and Wang, Gaoang},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month = oct,
  year = {2025},
  pages = {6099--6113},
}