About
Daoyuan Chen (陈道源)
Hi, I'm doing research and development at the Data Analytics and Intelligence Lab, Alibaba Tongyi. I earned my Master's degree in Computer Application Technology from Peking University in June 2019, co-supervised by Ying Shen and Kai Lei (academic mentors) and Yaliang Li (industry mentor).
I've published over 30 technical papers, more than 10 of which I led as first author; they were presented at top-tier conferences such as ICML, NeurIPS, ICLR, SIGMOD, KDD, ACL, and SIGIR.
I’ve learned a lot from the open-source community and am glad to have the opportunity to deeply engage with several open-source projects:
- Data-Juicer: A one-stop data processing system for foundation models.
- AgentScope: A developer platform for LLM-empowered multi-agent applications. Committer; co-authored paper: [13].
- FederatedScope: A flexible federated learning platform for heterogeneity.
My interests broadly lie in insight- and theory-informed research, simple yet effective systems, and real-world applications related to:
- Large Language Models (LLMs)
- Multimodal LLMs
- Efficient Machine Learning (ML)
- Data- and Knowledge-Driven ML
- Human-Centric ML
- Federated Learning (FL)
More specifically, this includes but is not limited to:
- Data-model co-development, e.g., building dedicated infrastructure and exploring generalized feedback signals between data and models
- Algorithms for enhancing data quality, diversity, and usability
- Synthetic data for model training and evaluation
- Better human-computer interaction, e.g., empathetic dialog, multimodal AIGC, and personalized modeling
- On-device solutions using small models, and addressing privacy concerns with FL
Collaborations are welcome; we're currently hiring full-time researchers/developers and self-motivated interns!
Feel free to reach out if you are interested.
Selected Research
Full lists: Google Scholar and DBLP
Remark: # indicates equal contribution with the first author; ^ indicates serving as industry mentor to the first (student) author.
(Multimodal) LLMs, Data-Driven & Human-Centric ML
- (arXiv’25) MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions
- Zhe Xu, Daoyuan Chen^#, Zhenqing Ling, Yaliang Li, Ying Shen
- A new data synthesis method that encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency over SOTA baselines (a 16% gain on MathVision using only 400 samples).
- (arXiv’25) Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data
- Zhenqing Ling, Daoyuan Chen^#, Liuyi Yao, Yaliang Li, Ying Shen
- A theoretically informed method that treats diversity as a reward, achieving new SOTA average performance across 7 benchmarks when fine-tuning strong LLMs on domain-undetermined data.
- (CVPR’25) Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
- Qirui Jiao, Daoyuan Chen^, Yilun Huang, Yaliang Li, Ying Shen
- A new contrastive data synthesis method that creates a high-quality dataset describing object differences in fine-grained image regions.
- (arXiv’24) Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models
- Daoyuan Chen, Yilun Huang, Xuchen Pan, Nana Jiang, Haibin Wang, Ce Ge, Yushuo Chen, Wenhao Zhang, Zhijian Ma, Yilei Zhang, Jun Huang, Wei Lin, Yaliang Li, Bolin Ding, Jingren Zhou
- Offering efficient multimodal data processing with 100+ operators and seamless scalability for foundation models.
- (SIGMOD’24) Data-Juicer: A One-Stop Data Processing System for Large Language Models
- Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, Jingren Zhou
- Providing open, versatile data processing abilities to ease the creation and evaluation of diverse data recipes for LLMs (a minimal recipe-style sketch follows this list).
- (arXiv’24) Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development
- Daoyuan Chen, Haibin Wang, Yilun Huang, Ce Ge, Yaliang Li, Bolin Ding, Jingren Zhou
- A new middleware that links data and model feedback, enabling high performance at low cost, verified across broad tasks; ranked Top-1 on the VBench leaderboard.
- (KDD’24, tutorial) Multi-modal Data Processing for Foundation Models: Practical Guidances and Use Cases
- Daoyuan Chen, Yaliang Li, Bolin Ding, The Data-Juicer Team
- Discussing practical skills in multi-modal data processing to efficiently handle data variety, quality, and scale for foundation models.
- (arXiv’24, survey) The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
- Zhen Qin, Daoyuan Chen^#, Wenhao Zhang, Liuyi Yao, Yilun Huang, Bolin Ding, Yaliang Li, Shuiguang Deng
- Highlighting the current state and potential of the co-development between data and multi-modal LLMs from a dual perspective.
- (arXiv’24, benchmark) HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data
- Ting Zhou, Daoyuan Chen^, Qirui Jiao, Bolin Ding, Yaliang Li, Ying Shen
- A new benchmark evaluating MLLMs on inner emotion recognition and outer behavioral manifestations, advancing human-like understanding in video perception.
- (COLING’24) ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization
- Mengsha Liu, Daoyuan Chen^, Yaliang Li, Guian Fang, Ying Shen
- Optimizing chart summarization by combining contextual reasoning and deep analysis, achieving superior performance across multiple metrics.
- (arXiv’24) From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information
- Qirui Jiao, Daoyuan Chen^, Yilun Huang, Yaliang Li, Ying Shen
- Showing, via extensive experiments, how adaptively fine-tuning MLLMs with textual detection information can boost performance.
- (arXiv’24) A Bivariate Data Mixing Law for Language Model Pretraining
- Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
- A new bivariate data mixing law that models the joint scaling behavior of domain proportions and data quantity in LLM pretraining (an illustrative curve-fitting sketch follows this list).
- (arXiv’24) ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction
- Zhongjie Duan, Qianyi Zhao, Cen Chen, Daoyuan Chen, Wenmeng Zhou, Yaliang Li, Yingda Chen
- Enhancing text-to-image generation by iteratively integrating synthesis and understanding models, improving aesthetics and efficiency without extra inference cost.
- (arXiv’24) AgentScope: A Flexible yet Robust Multi-Agent Platform
- Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, Liuyi Yao, Hongyi Peng, Zeyu Zhang, Lin Zhu, Chen Cheng, Hongzhu Shi, Yaliang Li, Bolin Ding, Jingren Zhou
- Offering open and broad capabilities for building LLM-empowered multi-agent applications with ease.
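Data-Juicer's core abstraction is a data recipe: an ordered list of configurable operators (mappers that transform samples, filters that drop them) applied to a dataset. Below is a minimal, hypothetical Python sketch of that idea; the operator names, signatures, and thresholds are illustrative stand-ins, not the system's actual API.

```python
# Hypothetical sketch of a recipe-style data pipeline in the spirit of
# Data-Juicer; operator names and thresholds are illustrative stand-ins,
# not the system's actual API.
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

Sample = dict  # e.g., {"text": "..."}

@dataclass
class Op:
    name: str
    fn: Callable[[Sample], Optional[Sample]]  # None means "filter out"

    def __call__(self, sample: Sample) -> Optional[Sample]:
        return self.fn(sample)

def lowercase_mapper() -> Op:
    return Op("lowercase_mapper", lambda s: {**s, "text": s["text"].lower()})

def length_filter(min_len: int = 10, max_len: int = 10_000) -> Op:
    return Op("length_filter",
              lambda s: s if min_len <= len(s["text"]) <= max_len else None)

def run_recipe(samples: Iterable[Sample], recipe: list[Op]) -> list[Sample]:
    out = []
    for s in samples:
        for op in recipe:
            s = op(s)
            if s is None:          # dropped by a filter
                break
        if s is not None:
            out.append(s)
    return out

recipe = [lowercase_mapper(), length_filter(min_len=20)]
data = [{"text": "Too short."}, {"text": "A Longer Document About LLM Data."}]
print(run_recipe(data, recipe))    # only the second sample survives
```

Composing small, declarative operators this way is what makes recipes easy to create, evaluate, and reuse across datasets.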
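The bivariate mixing law above predicts validation loss from both a domain's mixing proportion and the training data quantity. As a hedged illustration of how such a law could be fit from a few pilot runs, here is a generic bivariate power-law fit; the functional form, the toy "pilot run" numbers, and the parameter names are assumptions for illustration, not the paper's exact formulation or data.

```python
# Hedged sketch: fitting a bivariate scaling law L(n, r) for one domain,
# with n = training tokens and r = the domain's mixing proportion. The
# power-law form and toy measurements are illustrative assumptions only.
import numpy as np
from scipy.optimize import curve_fit

def mixing_law(x, a, b, g, c):
    n, r = x
    return c + a * n ** (-b) * r ** (-g)

# Toy observations from hypothetical small pilot runs: (n, r) -> loss.
n = np.array([1e8, 1e8, 1e9, 1e9, 1e10, 1e10])
r = np.array([0.1, 0.5, 0.1, 0.5, 0.1, 0.5])
loss = np.array([3.9, 3.6, 3.3, 3.0, 2.9, 2.6])

params, _ = curve_fit(mixing_law, (n, r), loss,
                      p0=[20.0, 0.1, 0.1, 1.0], maxfev=50_000)
print(dict(zip(["a", "b", "g", "c"], params)))

# Query the fitted law to choose proportions before committing to a big run.
print(mixing_law((np.array([1e11]), np.array([0.3])), *params))
```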
FL+LLM, On-device & Personalized FL
- (NeurIPS’24) Federated Fine-tuning of Large Language Models under Heterogeneous Language Tasks and Client Resources
- Jiamu Bai, Daoyuan Chen^#, Bingchen Qian, Liuyi Yao, Yaliang Li
- An aggregation scheme for FL of LLMs that dynamically adjusts LoRA ranks to harness the full potential of diverse client resources, improving generalization; validated on thousands of tasks (see the SVD-redistribution sketch after this list).
- (ICML’24) Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes
- Zhen Qin, Daoyuan Chen^, Bingchen Qian, Bolin Ding, Yaliang Li, Shuiguang Deng
- A theory-informed method for federated full-parameter tuning of LLMs that incurs <18KB of communication per round for a 3B LLM while delivering SOTA accuracy (see the seed-based sketch after this list).
- (KDD’24) On the Convergence of Zeroth-Order Federated Tuning in Large Language Models
- Zhenqing Ling, Daoyuan Chen^, Liuyi Yao, Yaliang Li, Ying Shen
- FedMeZO, a memory-efficient method for FL of LLMs that achieves faster convergence and reduced GPU memory usage, backed by theoretical analysis and empirical validation.
- (KDD’24) FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning
- Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, Jingren Zhou
- A new package designed to address the challenges of fine-tuning LLMs in FL.
- (TKDE’24) Is Sharing Neighbor Generator in Federated Graph Learning Safe?
- Liuyi Yao, Zhen Wang, Yuexiang Xie, Yaliang Li, Weirui Kuang, Daoyuan Chen, Bolin Ding
- Sharing neighbor generators in graph FL poses privacy risks, as it enables data reconstruction attacks, highlighting the need for defense strategies.
- (ICML’23) Efficient Personalized Federated Learning via Sparse Model-Adaptation
- Daoyuan Chen, Liuyi Yao, Dawei Gao, Yaliang Li, Bolin Ding
- An efficient pFL method, with theoretical analysis, that adaptively learns sparse local models, achieving SOTA accuracy and improved efficiency simultaneously.
- (KDD’23) FS-Real: Towards Real-World Cross-Device Federated Learning
- Daoyuan Chen, Dawei Gao, Yuexiang Xie, Xuchen Pan, Zitao Li, Yaliang Li, Bolin Ding, Jingren Zhou
- An efficient, scalable system for real-world cross-device FL, addressing the challenges of heterogeneous devices and large-scale training.
- (VLDB’23) FS-Real: A Real-World Cross-Device Federated Learning Platform
- Dawei Gao, Daoyuan Chen#, Zitao Li, Yuexiang Xie, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
- Industrial use cases of FS-Real, covering phones and cars.
- (KDD’23) Revisiting Personalized Federated Learning: Robustness Against Backdoor Attacks
- Zeyu Qin, Liuyi Yao, Daoyuan Chen, Yaliang Li, Bolin Ding, Minhao Cheng
- Personalized FL with partial model-sharing boosts robustness against backdoor attacks, unlike full model-sharing, offering insights for improved defense strategies.
- (VLDB’23) FederatedScope: A Flexible Federated Learning Platform for Heterogeneity
- Yuexiang Xie, Zhen Wang, Dawei Gao, Daoyuan Chen, Liuyi Yao, Weirui Kuang, Yaliang Li, Bolin Ding, Jingren Zhou
- An open package for FL research and development.
- (NeurIPS’22, benchmark) pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning
- Daoyuan Chen, Dawei Gao, Weirui Kuang, Yaliang Li, Bolin Ding
- A benchmark for personalized FL covering more than 10 datasets and 20 pFL methods, with systematic evaluation highlighting the benefits and potential of pFL.
- (KDD’22, tutorial) A Practical Introduction to Federated Learning
- Yaliang Li, Bolin Ding, Zhen Wang, Yuexiang Xie, Dawei Gao, Liuyi Yao, Daoyuan Chen, Weirui Kuang, Hongzhu Shi, Jingren Zhou
- Hands-on lessons on FL and the FederatedScope package.
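For the NeurIPS'24 aggregation scheme above, one way to reconcile LoRA adapters of different ranks is to merge each client's low-rank product into a full-weight delta, average the deltas, and hand each client back a truncated-SVD factorization at its own rank. The sketch below illustrates that idea under assumed shapes, random adapters, and uniform weighting; the paper's actual weighting and details may differ.

```python
# Hedged sketch: aggregating LoRA adapters with heterogeneous ranks by
# merging each into a full-weight delta, averaging, then redistributing
# per-client via truncated SVD. Shapes and uniform weighting are
# illustrative assumptions.
import numpy as np

D_OUT, D_IN = 64, 32
rng = np.random.default_rng(0)

def random_lora(rank):
    """A stand-in for a client's locally trained LoRA pair (B, A)."""
    return (rng.standard_normal((D_OUT, rank)) * 0.1,
            rng.standard_normal((rank, D_IN)) * 0.1)

client_ranks = [4, 8, 16]                       # heterogeneous resources
adapters = [random_lora(r) for r in client_ranks]

# Server: merge each low-rank product into a full delta, then average.
delta = np.mean([B @ A for B, A in adapters], axis=0)

# Server: give each client a factorization truncated to its own rank.
U, S, Vt = np.linalg.svd(delta, full_matrices=False)
for r in client_ranks:
    B_new, A_new = U[:, :r] * S[:r], Vt[:r, :]  # best rank-r approximation
    err = np.linalg.norm(delta - B_new @ A_new) / np.linalg.norm(delta)
    print(f"rank {r:2d}: relative reconstruction error {err:.3f}")
```

Truncated SVD gives the best rank-r approximation of the averaged delta, so higher-capacity clients lose less information at redistribution time.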
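To make the ICML'24 entry's <18KB communication figure concrete: if clients estimate gradients along seeded random directions, each update can be transmitted as just a (seed, scalar) pair and replayed by anyone who shares the RNG. The sketch below is a simplified single-vector illustration of that trick, not the paper's full algorithm; the toy quadratic objective, dimensions, and step sizes are assumptions.

```python
# Hedged sketch: communicating model updates as (seed, scalar) pairs via
# zeroth-order estimates along seeded random directions. A toy quadratic
# objective stands in for an LLM's training loss.
import numpy as np

DIM, EPS, LR = 10, 1e-3, 2e-2
rng = np.random.default_rng(0)

def loss(w):
    """Stand-in objective; a client would use its local training loss."""
    return float(np.sum((w - 1.0) ** 2))

def direction(seed):
    """Deterministic random direction, reproducible from the seed alone."""
    return np.random.default_rng(seed).standard_normal(DIM)

def zo_grad_scalar(w, seed):
    """Two-point zeroth-order estimate of the directional derivative."""
    z = direction(seed)
    return (loss(w + EPS * z) - loss(w - EPS * z)) / (2 * EPS)

w = np.zeros(DIM)
print("initial loss:", loss(w))
for _ in range(200):
    seed = int(rng.integers(1 << 31))  # client draws a seed
    g = zo_grad_scalar(w, seed)        # client uploads only (seed, g)
    w = w - LR * g * direction(seed)   # server/peers replay the step
print("final loss:", loss(w))          # orders of magnitude smaller
```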
Efficient, Adaptive, & Knowledge-Driven ML
- (SIGIR’24) Dynamic Demonstration Retrieval and Cognitive Understanding for Emotional Support Conversation
- Zhe Xu, Daoyuan Chen^, Jiayi Kuang, Zihao Yi, Yaliang Li, Ying Shen
- Enhancing emotional support conversations by improving empathetic response generation and comprehending implicit mental states.
- (ICLR’23) Learned Index with Dynamic $\epsilon$
- Daoyuan Chen, Wuchao Li, Yaliang Li, Bolin Ding, Kai Zeng, Defu Lian, Jingren Zhou
- A mathematically grounded learned index framework that is efficient, builds on theoretically derived prediction error bounds, and is pluggable into several SOTA learned index methods (see the lookup sketch after this list).
- (TNNLS’23, IF 10.4) Knowledge-Based Reasoning Network for Relation Detection
- Ying Shen, Min Yang, Yaliang Li, Dong Wang, Haitao Zheng, Daoyuan Chen
- (ACL’20) Relabel the Noise: Joint Extraction of Entities and Relations via Cooperative Multiagents
- Daoyuan Chen, Yaliang Li, Kai Lei, Ying Shen
- A cooperative multiagent method that jointly extracts entities and relations by re-labeling noisy data, addressing shifted label distributions and improving extraction performance.
- (IJCAI’20) AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search
- Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, Jingren Zhou
- Using differentiable neural architecture search to compress BERT into task-adaptive small models, achieving faster inference and smaller size while maintaining performance (see the mixed-operation sketch after this list).
- (CIKM’20) An Adaptive Embedding Framework for Heterogeneous Information Networks
- Daoyuan Chen, Yaliang Li, Bolin Ding, Ying Shen
- (AAAI’20) Joint Learning of Answer Selection and Answer Summary Generation in Community Question Answering
- Yang Deng, Wai Lam, Yuexiang Xie, Daoyuan Chen, Yaliang Li, Min Yang, Ying Shen
- (CIKM’19) Knowledge-Aware Textual Entailment with Graph Attention Network
- Daoyuan Chen, Yaliang Li, Min Yang, Hai-Tao Zheng, Ying Shen
- (SIGIR’19) Answer-Enhanced Path-Aware Relation Detection over Knowledge Base
- Daoyuan Chen, Min Yang, Hai-Tao Zheng, Yaliang Li, Ying Shen
- (COLING’18, best paper nomination) Cooperative Denoising for Distantly Supervised Relation Extraction
- Kai Lei, Daoyuan Chen#, Yaliang Li, Nan Du, Min Yang, Wei Fan, Ying Shen
- (SIGIR’18) Ontology Evaluation with Path-Based Text-Aware Entropy Computation
- Ying Shen, Daoyuan Chen#, Min Yang, Yaliang Li, Nan Du, Kai Lei
- (Artificial Intelligence in Medicine, 2018, IF 7.5) An Ontology-Driven Clinical Decision Support System (IDDAP) for Infectious Disease Diagnosis and Antibiotic Prescription
- Ying Shen, Kaiqi Yuan, Daoyuan Chen, Joël Colloc, Min Yang, Yaliang Li, Kai Lei
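The ICLR'23 learned-index entry above rests on a simple contract: a model predicts a key's position in a sorted array, and a bounded local search within the model's error bound $\epsilon$ corrects the prediction. The sketch below illustrates that contract with a linear model and a single data-driven $\epsilon$ measured from the training keys; it echoes the idea of adapting $\epsilon$ to the data rather than fixing it, but it is not the paper's framework.

```python
# Hedged sketch of a learned index: a model predicts a key's position in a
# sorted array, and a local search within a measured error bound eps fixes
# the prediction. The linear model and single eps are illustrative.
import bisect
import numpy as np

keys = np.sort(np.random.default_rng(0).uniform(0, 1e6, 100_000))
pos = np.arange(len(keys))

# "Train" a linear model key -> position and record its max error as eps.
slope, intercept = np.polyfit(keys, pos, 1)
pred = slope * keys + intercept
eps = int(np.ceil(np.max(np.abs(pred - pos))))  # guaranteed error bound

def lookup(key):
    p = int(np.clip(slope * key + intercept, 0, len(keys) - 1))
    lo, hi = max(0, p - eps), min(len(keys), p + eps + 1)
    i = lo + bisect.bisect_left(keys[lo:hi], key)  # search 2*eps+1 slots only
    return i if i < len(keys) and keys[i] == key else -1

print(lookup(keys[12345]) == 12345, "eps =", eps)
```

A smaller eps means a shorter final search, which is why bounding (and adapting) the prediction error directly drives lookup cost.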
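For the AdaBERT entry above, the differentiable-NAS primitive is a "mixed" layer that outputs a softmax-weighted sum of candidate operations, so the architecture choice itself becomes a learnable parameter. Here is an illustrative PyTorch-style sketch with an assumed three-op candidate set; it shows the DARTS-style relaxation, not AdaBERT's actual search space.

```python
# Hedged sketch of the differentiable-NAS primitive behind task-adaptive
# compression: a mixed layer whose architecture weights (alpha) are learned
# jointly with the model weights. Candidate ops are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                        # skip connection
            nn.Linear(dim, dim),                  # cheap projection
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                          nn.Linear(dim, dim)),   # small FFN
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)          # relax the discrete choice
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

layer = MixedOp(dim=128)
x = torch.randn(4, 128)
out = layer(x)                                    # differentiable w.r.t. alpha
print(out.shape, F.softmax(layer.alpha, 0).detach())
# After search, keep only argmax(alpha) per layer to get the compact model.
```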
Work Experience
- 2023 - present, Data Analytics and Intelligence Lab, Alibaba Tongyi
- July 2019 - 2023, Data Analytics and Intelligence Lab, Alibaba DAMO Academy
- Research Intern, March 2018 - June 2018, Tencent Medical AI Lab
- Research Assistant, October 2016 - August 2017, Multimedia Software Engineering Research Center @ City University of Hong Kong
Professional Activities
Tutorial Organizer:
- KDD 2022
- KDD 2024
Competition Organizer: data leaderboards for (multimodal) LLMs
Competition Participant:
- KDD Cup, AutoML-Graph Track, ranked 4th of 149 teams, 2020 (our solution)
Conference Reviewer:
- NeurIPS/ICML/ICLR (2022-2025)
- CVPR/ICCV/ECCV (2023-2025)
- COLM (2024-2025)
- KDD (2021-2024)
- ACL/EMNLP/NAACL (2021-2024)
- IJCAI/CIKM (2021-2022)
Journal Reviewer:
- Expert Systems with Applications
- Neurocomputing
- Neural Networks
- Knowledge-Based Systems
- IEEE Transactions on Big Data
- Patterns
- Artificial Intelligence (AIJ)
- Artificial Intelligence In Medicine
Misc.
Creativity is intelligence having fun.
I enjoy learning new things, playing basketball and guitar, and listening to music (especially R&B and hip-hop).
Contact
Collaborations are welcome; we're currently hiring full-time researchers/developers and self-motivated interns!
Feel free to reach out if you’re interested: daoyuanchen.cdy@alibaba-inc.com, chendaoyuan@pku.edu.cn