About
Daoyuan Chen (陈道源)
Hi, I'm doing research and development at the Data Analytics and Intelligence Lab, Alibaba Tongyi. I earned my Master's degree in Computer Application Technology from Peking University in June 2019, co-supervised by Ying Shen and Kai Lei (academic mentors) and Yaliang Li (industry mentor).
I've published over 30 technical papers, more than 10 of which I led as first author; they were presented at top-tier conferences such as ICML, NeurIPS, ICLR, SIGMOD, KDD, ACL, and SIGIR.
I’ve learned a lot from the open-source community and am glad to have the opportunity to deeply engage with several open-source projects:
- Data-Juicer: A one-stop data processing system for foundation models.
- AgentScope: A developer platform for LLM-empowered multi-agent applications. Committer; co-authored paper: [13].
- FederatedScope: A flexible federated learning platform for heterogeneity.
My interests broadly lie in insight- and theory-informed research, simple yet effective systems, and real-world applications related to:
- Large Language Models (LLMs)
- Multimodal LLMs
- Efficient Machine Learning (ML)
- Data- and Knowledge-Driven ML
- Human-Centric ML
- Federated Learning (FL)
More specifically, this includes but is not limited to:
- Data-model co-development, e.g., building dedicated infrastructure and exploring generalized feedback signals between data and models
- Algorithms for enhancing data quality, diversity, and usability
- Synthetic data for model training and evaluation
- Better human-computer interaction, e.g., empathetic dialog, multimodal AIGC, and personalized modeling
- On-device solutions using small models, and addressing privacy concerns with FL
Collaborations are welcome; we're currently hiring full-time researchers/developers and self-motivated interns!
Feel free to reach out if you are interested.
Selected Research
Full lists: Google Scholar and DBLP
Remark: # indicates equal contribution with the first author; ^ indicates serving as industry mentor to the first (student) author.
(Multimodal) LLMs, Data-Driven & Human-Centric ML
- (arXiv’25) MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions
- Zhe Xu, Daoyuan Chen^#, Zhenqing Ling, Yaliang Li, Ying Shen
- A new data synthesis method that encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency over SOTA baselines (a 16% gain on MathVision using only 400 samples).
- (arXiv’25) Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data
- Zhenqing Ling, Daoyuan Chen^#, Liuyi Yao, Yaliang Li, Ying Shen
- A theoretically informed method that treats diversity as a reward, achieving new SOTA average performance across 7 benchmarks when fine-tuning strong LLMs on domain-undetermined data.
- (CVPR’25) Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
- Qirui Jiao, Daoyuan Chen^, Yilun Huang, Yaliang Li, Ying Shen
- A new contrastive data synthesis method that creates a high-quality dataset describing object differences in fine-grained image regions.
- (arXiv’24) Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models
- Daoyuan Chen, Yilun Huang, Xuchen Pan, Nana Jiang, Haibin Wang, Ce Ge, Yushuo Chen, Wenhao Zhang, Zhijian Ma, Yilei Zhang, Jun Huang, Wei Lin, Yaliang Li, Bolin Ding, Jingren Zhou
- Offering efficient multimodal data processing with 100+ operators and seamless scalability for foundation models.
- (SIGMOD’24) Data-Juicer: A One-Stop Data Processing System for Large Language Models
- Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, Jingren Zhou
- Providing open, versatile data processing abilities to ease the creation and evaluation of diverse data recipes for LLMs (a minimal recipe-style sketch follows this list).
- (arXiv’24) Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development
- Daoyuan Chen, Haibin Wang, Yilun Huang, Ce Ge, Yaliang Li, Bolin Ding, Jingren Zhou
- A new middleware that links data and model feedback, enabling high performance at low cost, verified across broad tasks; ranked Top-1 on the VBench leaderboard.
- (KDD’24, tutorial) Multi-modal Data Processing for Foundation Models: Practical Guidances and Use Cases
- Daoyuan Chen, Yaliang Li, Bolin Ding, The Data-Juicer Team
- Discussing practical skills in multi-modal data processing to efficiently handle data variety, quality, and scale for foundation models.
- (arXiv’24, survey) The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
- Zhen Qin, Daoyuan Chen^#, Wenhao Zhang, Liuyi Yao, Yilun Huang, Bolin Ding, Yaliang Li, Shuiguang Deng
- Highlighting the current state and potential of the co-development between data and multi-modal LLMs from a dual perspective.
- (arXiv’24, benchmark) HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data
- Ting Zhou, Daoyuan Chen^, Qirui Jiao, Bolin Ding, Yaliang Li, Ying Shen
- A new benchmark evaluating MLLMs on inner emotion recognition and outer behavioral manifestations, advancing human-like understanding in video perception.
- (COLING’24) ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization
- Mengsha Liu, Daoyuan Chen^, Yaliang Li, Guian Fang, Ying Shen
- Optimizing chart summarization by combining contextual reasoning and deep analysis, achieving superior performance across multiple metrics.
- (arXiv’24) From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information
- Qirui Jiao, Daoyuan Chen^, Yilun Huang, Yaliang Li, Ying Shen
- Showing, via extensive experiments, how adaptively fine-tuning MLLMs with textual detection information can boost performance.
- (arXiv’24) A Bivariate Data Mixing Law for Language Model Pretraining
- Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
- A new bivariate data mixing law that models the joint scaling behavior of domain proportions and data quantity in LLM pretraining (an illustrative curve-fitting sketch follows this list).
- (arXiv’24) ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction
- Zhongjie Duan, Qianyi Zhao, Cen Chen, Daoyuan Chen, Wenmeng Zhou, Yaliang Li, Yingda Chen
- Enhancing text-to-image generation by iteratively integrating synthesis and understanding models, improving aesthetics and efficiency without extra inference cost.
- (arXiv’24) AgentScope: A Flexible yet Robust Multi-Agent Platform
- Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, Liuyi Yao, Hongyi Peng, Zeyu Zhang, Lin Zhu, Chen Cheng, Hongzhu Shi, Yaliang Li, Bolin Ding, Jingren Zhou
- Offering open and broad capabilities for building LLM-empowered multi-agent applications with ease.
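Data-Juicer's core abstraction is a data recipe: an ordered list of configurable operators (mappers that transform samples, filters that drop them) applied to a dataset. Below is a minimal, hypothetical Python sketch of that idea; the operator names, signatures, and thresholds are illustrative stand-ins, not the system's actual API.

```python
# Hypothetical sketch of a recipe-style data pipeline in the spirit of
# Data-Juicer; operator names and thresholds are illustrative stand-ins,
# not the system's actual API.
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

Sample = dict  # e.g., {"text": "..."}

@dataclass
class Op:
    name: str
    fn: Callable[[Sample], Optional[Sample]]  # None means "filter out"

    def __call__(self, sample: Sample) -> Optional[Sample]:
        return self.fn(sample)

def lowercase_mapper() -> Op:
    return Op("lowercase_mapper", lambda s: {**s, "text": s["text"].lower()})

def length_filter(min_len: int = 10, max_len: int = 10_000) -> Op:
    return Op("length_filter",
              lambda s: s if min_len <= len(s["text"]) <= max_len else None)

def run_recipe(samples: Iterable[Sample], recipe: list[Op]) -> list[Sample]:
    out = []
    for s in samples:
        for op in recipe:
            s = op(s)
            if s is None:          # dropped by a filter
                break
        if s is not None:
            out.append(s)
    return out

recipe = [lowercase_mapper(), length_filter(min_len=20)]
data = [{"text": "Too short."}, {"text": "A Longer Document About LLM Data."}]
print(run_recipe(data, recipe))    # only the second sample survives
```

Composing small, declarative operators this way is what makes recipes easy to create, evaluate, and reuse across datasets.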
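The bivariate mixing law above predicts validation loss from both a domain's mixing proportion and the training data quantity. As a hedged illustration of how such a law could be fit from a few pilot runs, here is a generic bivariate power-law fit; the functional form, the toy "pilot run" numbers, and the parameter names are assumptions for illustration, not the paper's exact formulation or data.

```python
# Hedged sketch: fitting a bivariate scaling law L(n, r) for one domain,
# with n = training tokens and r = the domain's mixing proportion. The
# power-law form and toy measurements are illustrative assumptions only.
import numpy as np
from scipy.optimize import curve_fit

def mixing_law(x, a, b, g, c):
    n, r = x
    return c + a * n ** (-b) * r ** (-g)

# Toy observations from hypothetical small pilot runs: (n, r) -> loss.
n = np.array([1e8, 1e8, 1e9, 1e9, 1e10, 1e10])
r = np.array([0.1, 0.5, 0.1, 0.5, 0.1, 0.5])
loss = np.array([3.9, 3.6, 3.3, 3.0, 2.9, 2.6])

params, _ = curve_fit(mixing_law, (n, r), loss,
                      p0=[20.0, 0.1, 0.1, 1.0], maxfev=50_000)
print(dict(zip(["a", "b", "g", "c"], params)))

# Query the fitted law to choose proportions before committing to a big run.
print(mixing_law((np.array([1e11]), np.array([0.3])), *params))
```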
FL+LLM, On-device & Personalized FL
- (NeurIPS’24) Federated Fine-tuning of Large Language Models under Heterogeneous Language Tasks and Client Resources
- Jiamu Bai, Daoyuan Chen^#, Bingchen Qian, Liuyi Yao, Yaliang Li
- An aggregation scheme for FL of LLMs that dynamically adjusts LoRA ranks to harness the full potential of diverse client resources, improving generalization; validated on thousands of tasks (see the SVD-redistribution sketch after this list).
- (ICML’24) Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes
- Zhen Qin, Daoyuan Chen^, Bingchen Qian, Bolin Ding, Yaliang Li, Shuiguang Deng
- A theory-informed method for federated full-parameter tuning of LLMs that incurs <18KB of communication per round for a 3B LLM while delivering SOTA accuracy (see the seed-based sketch after this list).
- (KDD’24) On the Convergence of Zeroth-Order Federated Tuning in Large Language Models
- Zhenqing Ling, Daoyuan Chen^, Liuyi Yao, Yaliang Li, Ying Shen
- FedMeZO, a memory-efficient method for FL of LLMs that achieves faster convergence and reduced GPU memory usage, backed by theoretical analysis and empirical validation.
- (KDD’24) FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning
- Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, Jingren Zhou
- A new package designed to address the challenges of fine-tuning LLMs in FL.
- (TKDE’24) Is Sharing Neighbor Generator in Federated Graph Learning Safe?
- Liuyi Yao, Zhen Wang, Yuexiang Xie, Yaliang Li, Weirui Kuang, Daoyuan Chen, Bolin Ding
- Sharing neighbor generators in graph FL poses privacy risks, as it enables data reconstruction attacks, highlighting the need for defense strategies.
- (ICML’23) Efficient Personalized Federated Learning via Sparse Model-Adaptation
- Daoyuan Chen, Liuyi Yao, Dawei Gao, Yaliang Li, Bolin Ding
- An efficient pFL method, with theoretical analysis, that adaptively learns sparse local models, achieving SOTA accuracy and improved efficiency simultaneously.
- (KDD’23) FS-Real: Towards Real-World Cross-Device Federated Learning
- Daoyuan Chen, Dawei Gao, Yuexiang Xie, Xuchen Pan, Zitao Li, Yaliang Li, Bolin Ding, Jingren Zhou
- An efficient, scalable system for real-world cross-device FL, addressing the challenges of heterogeneous devices and large-scale training.
- (VLDB’23) FS-Real: A Real-World Cross-Device Federated Learning Platform
- Dawei Gao, Daoyuan Chen#, Zitao Li, Yuexiang Xie, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
- Industrial use cases of FS-Real, covering phones and cars.
- (KDD’23) Revisiting Personalized Federated Learning: Robustness Against Backdoor Attacks
- Zeyu Qin, Liuyi Yao, Daoyuan Chen, Yaliang Li, Bolin Ding, Minhao Cheng
- Personalized FL with partial model-sharing boosts robustness against backdoor attacks, unlike full model-sharing, offering insights for improved defense strategies.
- (VLDB’23) FederatedScope: A Flexible Federated Learning Platform for Heterogeneity
- Yuexiang Xie, Zhen Wang, Dawei Gao, Daoyuan Chen, Liuyi Yao, Weirui Kuang, Yaliang Li, Bolin Ding, Jingren Zhou
- An open package for FL research and development.
- (NeurIPS’22, benchmark) pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning
- Daoyuan Chen, Dawei Gao, Weirui Kuang, Yaliang Li, Bolin Ding
- A benchmark for personalized FL covering more than 10 datasets and 20 pFL methods, with systematic evaluation highlighting the benefits and potential of pFL.
- (KDD’22, tutorial) A Practical Introduction to Federated Learning
- Yaliang Li, Bolin Ding, Zhen Wang, Yuexiang Xie, Dawei Gao, Liuyi Yao, Daoyuan Chen, Weirui Kuang, Hongzhu Shi, Jingren Zhou
- Hands-on lessons on FL and the FederatedScope package.
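For the NeurIPS'24 aggregation scheme above, one way to reconcile LoRA adapters of different ranks is to merge each client's low-rank product into a full-weight delta, average the deltas, and hand each client back a truncated-SVD factorization at its own rank. The sketch below illustrates that idea under assumed shapes, random adapters, and uniform weighting; the paper's actual weighting and details may differ.

```python
# Hedged sketch: aggregating LoRA adapters with heterogeneous ranks by
# merging each into a full-weight delta, averaging, then redistributing
# per-client via truncated SVD. Shapes and uniform weighting are
# illustrative assumptions.
import numpy as np

D_OUT, D_IN = 64, 32
rng = np.random.default_rng(0)

def random_lora(rank):
    """A stand-in for a client's locally trained LoRA pair (B, A)."""
    return (rng.standard_normal((D_OUT, rank)) * 0.1,
            rng.standard_normal((rank, D_IN)) * 0.1)

client_ranks = [4, 8, 16]                       # heterogeneous resources
adapters = [random_lora(r) for r in client_ranks]

# Server: merge each low-rank product into a full delta, then average.
delta = np.mean([B @ A for B, A in adapters], axis=0)

# Server: give each client a factorization truncated to its own rank.
U, S, Vt = np.linalg.svd(delta, full_matrices=False)
for r in client_ranks:
    B_new, A_new = U[:, :r] * S[:r], Vt[:r, :]  # best rank-r approximation
    err = np.linalg.norm(delta - B_new @ A_new) / np.linalg.norm(delta)
    print(f"rank {r:2d}: relative reconstruction error {err:.3f}")
```

Truncated SVD gives the best rank-r approximation of the averaged delta, so higher-capacity clients lose less information at redistribution time.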
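To make the ICML'24 entry's <18KB communication figure concrete: if clients estimate gradients along seeded random directions, each update can be transmitted as just a (seed, scalar) pair and replayed by anyone who shares the RNG. The sketch below is a simplified single-vector illustration of that trick, not the paper's full algorithm; the toy quadratic objective, dimensions, and step sizes are assumptions.

```python
# Hedged sketch: communicating model updates as (seed, scalar) pairs via
# zeroth-order estimates along seeded random directions. A toy quadratic
# objective stands in for an LLM's training loss.
import numpy as np

DIM, EPS, LR = 10, 1e-3, 2e-2
rng = np.random.default_rng(0)

def loss(w):
    """Stand-in objective; a client would use its local training loss."""
    return float(np.sum((w - 1.0) ** 2))

def direction(seed):
    """Deterministic random direction, reproducible from the seed alone."""
    return np.random.default_rng(seed).standard_normal(DIM)

def zo_grad_scalar(w, seed):
    """Two-point zeroth-order estimate of the directional derivative."""
    z = direction(seed)
    return (loss(w + EPS * z) - loss(w - EPS * z)) / (2 * EPS)

w = np.zeros(DIM)
print("initial loss:", loss(w))
for _ in range(200):
    seed = int(rng.integers(1 << 31))  # client draws a seed
    g = zo_grad_scalar(w, seed)        # client uploads only (seed, g)
    w = w - LR * g * direction(seed)   # server/peers replay the step
print("final loss:", loss(w))          # orders of magnitude smaller
```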
Efficient, Adaptive, & Knowledge-Driven ML
- (SIGIR’24) Dynamic Demonstration Retrieval and Cognitive Understanding for Emotional Support Conversation
- Zhe Xu, Daoyuan Chen^, Jiayi Kuang, Zihao Yi, Yaliang Li, Ying Shen
- Enhancing emotional support conversations by improving empathetic response generation and comprehending implicit mental states.
- (ICLR’23) Learned Index with Dynamic $\epsilon$
- Daoyuan Chen, Wuchao Li, Yaliang Li, Bolin Ding, Kai Zeng, Defu Lian, Jingren Zhou
- A mathematically grounded learned index framework that is efficient, builds on theoretically derived prediction error bounds, and is pluggable into several SOTA learned index methods (see the lookup sketch after this list).
- (TNNLS’23, IF 10.4) Knowledge-Based Reasoning Network for Relation Detection
- Ying Shen, Min Yang, Yaliang Li, Dong Wang, Haitao Zheng, Daoyuan Chen
- (ACL’20) Relabel the Noise: Joint Extraction of Entities and Relations via Cooperative Multiagents
- Daoyuan Chen, Yaliang Li, Kai Lei, Ying Shen
- A cooperative multiagent method that jointly extracts entities and relations by re-labeling noisy data, addressing shifted label distributions and improving extraction performance.
- (IJCAI’20) AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search
- Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, Jingren Zhou
- Using differentiable neural architecture search to compress BERT into task-adaptive small models, achieving faster inference and smaller size while maintaining performance (see the mixed-operation sketch after this list).
- (CIKM’20) An Adaptive Embedding Framework for Heterogeneous Information Networks
- Daoyuan Chen, Yaliang Li, Bolin Ding, Ying Shen
- (AAAI’20) Joint Learning of Answer Selection and Answer Summary Generation in Community Question Answering
- Yang Deng, Wai Lam, Yuexiang Xie, Daoyuan Chen, Yaliang Li, Min Yang, Ying Shen
- (CIKM’19) Knowledge-Aware Textual Entailment with Graph Attention Network
- Daoyuan Chen, Yaliang Li, Min Yang, Hai-Tao Zheng, Ying Shen
- (SIGIR’19) Answer-Enhanced Path-Aware Relation Detection over Knowledge Base
- Daoyuan Chen, Min Yang, Hai-Tao Zheng, Yaliang Li, Ying Shen
- (COLING’18, best paper nomination) Cooperative Denoising for Distantly Supervised Relation Extraction
- Kai Lei, Daoyuan Chen#, Yaliang Li, Nan Du, Min Yang, Wei Fan, Ying Shen
- (SIGIR’18) Ontology Evaluation with Path-Based Text-Aware Entropy Computation
- Ying Shen, Daoyuan Chen#, Min Yang, Yaliang Li, Nan Du, Kai Lei
- (Artificial Intelligence in Medicine, 2018, IF 7.5) An Ontology-Driven Clinical Decision Support System (IDDAP) for Infectious Disease Diagnosis and Antibiotic Prescription
- Ying Shen, Kaiqi Yuan, Daoyuan Chen, Joël Colloc, Min Yang, Yaliang Li, Kai Lei
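The ICLR'23 learned-index entry above rests on a simple contract: a model predicts a key's position in a sorted array, and a bounded local search within the model's error bound $\epsilon$ corrects the prediction. The sketch below illustrates that contract with a linear model and a single data-driven $\epsilon$ measured from the training keys; it echoes the idea of adapting $\epsilon$ to the data rather than fixing it, but it is not the paper's framework.

```python
# Hedged sketch of a learned index: a model predicts a key's position in a
# sorted array, and a local search within a measured error bound eps fixes
# the prediction. The linear model and single eps are illustrative.
import bisect
import numpy as np

keys = np.sort(np.random.default_rng(0).uniform(0, 1e6, 100_000))
pos = np.arange(len(keys))

# "Train" a linear model key -> position and record its max error as eps.
slope, intercept = np.polyfit(keys, pos, 1)
pred = slope * keys + intercept
eps = int(np.ceil(np.max(np.abs(pred - pos))))  # guaranteed error bound

def lookup(key):
    p = int(np.clip(slope * key + intercept, 0, len(keys) - 1))
    lo, hi = max(0, p - eps), min(len(keys), p + eps + 1)
    i = lo + bisect.bisect_left(keys[lo:hi], key)  # search 2*eps+1 slots only
    return i if i < len(keys) and keys[i] == key else -1

print(lookup(keys[12345]) == 12345, "eps =", eps)
```

A smaller eps means a shorter final search, which is why bounding (and adapting) the prediction error directly drives lookup cost.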
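For the AdaBERT entry above, the differentiable-NAS primitive is a "mixed" layer that outputs a softmax-weighted sum of candidate operations, so the architecture choice itself becomes a learnable parameter. Here is an illustrative PyTorch-style sketch with an assumed three-op candidate set; it shows the DARTS-style relaxation, not AdaBERT's actual search space.

```python
# Hedged sketch of the differentiable-NAS primitive behind task-adaptive
# compression: a mixed layer whose architecture weights (alpha) are learned
# jointly with the model weights. Candidate ops are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                        # skip connection
            nn.Linear(dim, dim),                  # cheap projection
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                          nn.Linear(dim, dim)),   # small FFN
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)          # relax the discrete choice
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

layer = MixedOp(dim=128)
x = torch.randn(4, 128)
out = layer(x)                                    # differentiable w.r.t. alpha
print(out.shape, F.softmax(layer.alpha, 0).detach())
# After search, keep only argmax(alpha) per layer to get the compact model.
```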
Work Experience
- 2023 - present, Data Analytics and Intelligence Lab, Alibaba Tongyi
- July 2019 - 2023, Data Analytics and Intelligence Lab, Alibaba DAMO Academy
- Research Intern, March 2018 - June 2018, Tencent Medical AI Lab
- Research Assistant, October 2016 - August 2017, Multimedia Software Engineering Research Center @ City University of Hong Kong
Professional Activities
Tutorial Organizer:
- KDD 2022
- KDD 2024
Competition Organizer: data leaderboards for (multimodal) LLMs
Competition Participant:
- KDD Cup, AutoML-Graph Track, ranked 4th of 149 teams, 2020 (our solution)
Conference Reviewer:
- NeurIPS/ICML/ICLR (2022-2025)
- CVPR/ICCV/ECCV (2023-2025)
- COLM (2024-2025)
- KDD (2021-2024)
- ACL/EMNLP/NAACL (2021-2024)
- IJCAI/CIKM (2021-2022)
Journal Reviewer:
- Expert Systems with Applications
- Neurocomputing
- Neural Networks
- Knowledge-Based Systems
- IEEE Transactions on Big Data
- Patterns
- Artificial Intelligence (AIJ)
- Artificial Intelligence In Medicine
Misc.
Creativity is intelligence having fun.
I enjoy learning new things, playing basketball and guitar, and listening to music (especially R&B and hip-hop).
Contact
Collaborations are welcome; we're currently hiring full-time researchers/developers and self-motivated interns!
Feel free to reach out if you’re interested: daoyuanchen.cdy@alibaba-inc.com, chendaoyuan@pku.edu.cn