Bio & Contact

About

Hi, I'm a researcher and engineer at Alibaba Tongyi Lab, working on data intelligence for the era of large models and AI agents. I earned my Master's degree in Computer Application Technology in June 2019 from Peking University, advised by Ying Shen, Kai Lei, and Yaliang Li.


I've published over 40 technical papers, with more than 20 as first author at top-tier venues including ICML, NeurIPS, ICLR, SIGMOD, KDD, and TPAMI. I’ve served as Area Chair for NeurIPS, ICLR, and ICML, and organized multiple KDD tutorials and Tianchi competitions focused on large models and data-centric AI.


I’ve learned a lot from the open-source community and am glad to be deeply involved in the following usage-driven open-source projects.

As maintainer and technical lead:

  • Data-Juicer: A full-stack, multimodal data processing framework for foundation models, now evolving toward task-aware, agent-driven data curation.
  • Trinity-RFT: Reinforcement fine-tuning infrastructure enabling self-evolving LLMs via high-quality trajectory feedback.

As core contributor from the official team:

  • AgentScope: An agent-oriented programming framework for building composable, observable LLM applications.
  • FederatedScope: An easy-to-use federated learning (FL) platform supporting privacy-preserving on-device intelligence.

I'm interested in building simple, robust systems and algorithms for real-world machine learning (ML) applications, with a focus on data-centric foundations, sitting at the intersection of:

  • Large Language Models (LLMs)
  • Multimodal LLMs
  • Efficient / resource-aware ML
  • Data- and knowledge-driven ML
  • Human-centric ML & Human–AI interaction

I’m especially interested in rethinking data as a dynamic, intelligent infrastructure in the age of LLMs and agents, which co-evolves with models and humans.

  • Data–model co-evolution & context-aware quality
    Closing the loop: models help select, label, and synthesize data; data reshapes model behavior through continuous feedback, adaptive curation, and task-aware training pipelines. Quality is contextual: good for which task, user, or constraint?
  • Synthetic data & data-centric agents
    Using LLMs, agents, and simulations to generate, perturb, and organize data, creating “sandboxes” to probe generalization and robustness. Building language-powered agents that read, rewrite, route, and structure data across logs and multimodal streams.
  • Human-in-the-loop & privacy-aware data ecosystems
    Designing human–AI collaboration (dialogue, multimodal AIGC, personalization, RLHF/RLAIF), so users can shape data and policies over time. Enabling on-device / FL with locally meaningful, private data under real-world constraints.

Collaborations are welcome; we're currently hiring full-time researchers/developers and self-motivated interns! Feel free to reach out if you’re interested: daoyuanchen.cdy@alibaba-inc.com, chendaoyuan@pku.edu.cn

Selected Technical Works

Full paper lists: Google Scholar and DBLP

The number in parentheses ('n') indicates the size of each subgroup; '#' indicates co-first author; '^' indicates industrial mentor to the first student author.

Work Experience

  • 2023 - Present, Data Analytics and Intelligence Lab, Alibaba Tongyi
  • July 2019 - 2023, Data Analytics and Intelligence Lab, Alibaba DAMO Academy
  • March 2018 - June 2018, Research Intern, Tencent Medical AI Lab
  • October 2016 - August 2017, Research Assistant, Multimedia Software Engineering Research Center @ City University of Hong Kong

Professional Activities

Conference Area Chair:
  • ICML (2026)
  • ICLR (2026)
  • NeurIPS (2025)
Tutorial Organizer:
  • KDD 2022
  • KDD 2024
Competition Organizer: data leaderboards for (multimodal) LLMs
Competition Participant:
Conference Reviewer:
  • NeurIPS, ICML, ICLR (2022-2025)
  • CVPR, ICCV, ECCV (2023-2025)
  • COLM (2024-2025)
  • KDD (2021-2024)
  • ACL, EMNLP, NAACL (2021-2024)
  • IJCAI, CIKM (2021-2022)
Journal Reviewer:
  • Artificial Intelligence (AIJ)
  • Journal of Machine Learning Research (JMLR)
  • IEEE Transactions on Knowledge and Data Engineering (TKDE)
  • IEEE Transactions on Computers (TC)
  • Expert Systems with Applications (ESWA)
  • IEEE Transactions on Big Data (TBD)
  • Knowledge-Based Systems (KBS)
  • Artificial Intelligence in Medicine
  • Neural Networks
  • Neurocomputing
  • Patterns

Misc.

Creativity is intelligence having fun.

I enjoy learning new things, reading, playing basketball and guitar, and music (especially R&B and hip-hop).