MLZero

A Multi-Agent System for End-to-end Machine Learning Automation

AWS

MLZero: An end-to-end multi-agent system that integrates specialized perception agents with dual memory modules (semantic and episodic) to power iterative coding cycles, transforming raw data into ready-to-use models and prediction outputs with zero human intervention. The overall framework of MLZero. (1) \textbf{Perception} that interprets arbitrary data inputs and transforms them into structured context; (2) \textbf{Semantic Memory} that enriches the system with knowledge of the ML Library; (3) \textbf{Episodic Memory} that maintains chronological execution records for targeted debugging; and (4) \textbf{Iterative Coding} that implements a refinement process with feedback loops and augmented memory.

Introduction

Existing AutoML systems have advanced the automation of machine learning (ML); however, they still require substantial manual configuration and expert input, particularly when handling multimodal data. We introduce MLZero, a novel multi-agent framework powered by Large Language Models (LLMs) that enables end-to-end ML automation across diverse data modalities with minimal human intervention.

MLZero employs a cognitive perception module that transforms raw multimodal inputs into perceptual context, addressing key LLM limitations through semantic and episodic memory. Our system demonstrates superior performance on MLE-Bench Lite, securing six gold medals and outperforming competitors on our Multimodal AutoML Agent Benchmark with a success rate of 0.92 (+263.6%). MLZero maintains robust effectiveness even with a compact 8B LLM, outperforming full-size systems from existing solutions.

Our approach integrates specialized perception agents with dual memory modules for iterative code development and error correction. MLZero not only overcomes limitations of previous LLM-based approaches but also represents a truly end-to-end system with superior performance across diverse machine learning tasks. Our main contributions include:

A novel multi-agent system delivering high-quality end-to-end multimodal ML solutions with minimal human intervention
Superior performance on MLE-Bench Lite with more medal counts (6 gold) and higher success rate
A comprehensive benchmark suite evaluating challenging scenarios including multilingual, multitable, and zero-shot tasks
Empirical evidence showing MLZero outperforms existing ML agents across all metrics (+263.6% success rate)
Detailed ablation studies identifying key components driving performance gains

The MLZero Framework

We present MLZero, a multi-agent system that automates end-to-end solutions for multimodal ML tasks. Given input data $x$ and optional user inputs $U_{\text{opt}}$, the system produces solutions including predicted outputs $y$, code artifacts $C$, and execution logs $L$: $\mathcal{F}(x, U^{\text{opt}}) = (y, C, L)$.

Our ML model building process for various tasks is achieved by generating code employing different ML libraries and executing it. For supervised learning tasks, $x$ typically includes labeled training data, unlabeled test data, and a brief task description or instruction. For zero-shot tasks, $x$ would simply consist of unlabeled test data and the task description. Through this comprehensive system, MLZero effectively bridges the gap between noisy raw data inputs and sophisticated ML solutions, providing a truly end-to-end automated ML framework adaptive to any modalities.

Our system comprises four modules, where each module is a subsystem with one or more agents, and each agent is a specialized LLM augmented with utility functions: (1) Perception that interprets arbitrary data inputs and transforms them into structured context; (2) Semantic Memory that enriches the system with knowledge of the ML Library; (3) Episodic Memory that maintains chronological execution records for targeted debugging; and (4) Iterative Coding that implements a refinement process with feedback loops and augmented memory.

Perception Module

The Perception module $\mathcal{P}$ acts as the cognitive lens of the system, orchestrating the transformation of various data inputs into actionable ML workflow specifications: $\mathcal{P}(x, U^{\text{opt}}) = (P, M)$. This module consists of three agents: the File grouping and file perception agent performs structural analysis of raw data $x$, grouping similar files and interpreting file contents; the Task perception agent extracts semantic information from raw data, derived context, and user input $U_{\text{opt}}$ to identify objectives, constraints, and evaluation criteria; and the ML Library selection agent employs context-aware reasoning to match problem characteristics with the appropriate ML Library $M$.

Semantic Memory Module

The Semantic Memory Module $\mathcal{S}_t$ enhances the LLM's parametric knowledge with domain-specific information from external knowledge bases at each iteration $t$. These knowledge bases are constructed offline by two agents: the summarization agent compresses relevant knowledge into concise paragraphs serving as queryable indices, while the condensation agent transforms this knowledge into precise and streamlined guidance. At each iteration $t$, given the error context $R_t$, the Semantic Memory Module processes this information through its retrieval agent to query the knowledge base of the selected ML library $M$, extracting condensed information $G_t$: $\mathcal{S}_t(P, M, R_t) = G_t$.

Episodic Memory Module

The Episodic Memory module, $\mathcal{E}_t$, enhances the success rate of MLZero in ML model building by providing error context $R_t$ at each iteration $t$ leveraging its chronological record of the system execution history: $\mathcal{E}_t(P, C_{t-1}, L_{t-1}, G_{t-1}, R_{t-1}) = R_t$. This component is initialized with the perception context $P$ and progressively stores the interaction data at each iteration. When invoked during code generation, the error analyzer agent distills encountered issues and contexts into concise error summaries paired with fix suggestions, enabling subsequent coding agents to efficiently address specific problems without processing excessive contextual information.

Iterative Coding Module

With the support of components above, our system enters an iterative coding process $\mathcal{G}_t$, where at each iteration $t$ it refines the solution based on execution feedback: $\mathcal{G}_t(P, U^{\text{opt}}_t, R_t, G_t) = (y_t, C_t, L_t)$. For each iteration $t$, the system first combines the perceptual context $P$, optional user input $U^{\text{opt}}_t$, error context $R_t$, and the retrieved knowledge $G_t$ to guide the coder agent in producing executable code $C_t$. The system then executes the generated code in a configured environment, capturing logs $L_t$ and stores the model output $y_t$. The executer agent analyzes these results and logs to determine the next steps: finalizing output upon success or identifying errors and initiating the next coding iteration.

Experimental Results

Main Results: Comprehensive Evaluation

To evaluate the effectiveness of MLZero, we conducted extensive experiments across multiple benchmarks and datasets. Our evaluations span two primary benchmarks: MLE-bench Lite with 21 diverse Kaggle competitions and the Multimodal AutoML Agent Benchmark with 25 diverse datasets covering various modalities and ML tasks.

Performance comparison on MLEbench-Lite. MLZero demonstrates superior performance with an average rank of 1.43, significantly outperforming competing approaches including AIDE (2.36), ResearchAgent (MLAB) (3.29), and CodeActAgent (OpenHands) (2.93). MLZero achieves the highest success rate of 86% on MLEbench-Lite, compared to AIDE (81%), MLAB (62%), and OpenHands (71%), while also outperforming competitors in medal counts with six gold and two silver medals.

Performance comparison between MLZero and several state-of-the-art machine learning and coding agents on our Multimodal AutoML Agent Benchmark. MLZero with default configuration achieves a markedly higher success rate (92.0%) compared to all competing agents and delivers solutions of superior quality, as evidenced by its substantially lower average rank (2.42). The 8B configuration, which uses LLama 3.1 8B, maintains better performance (45.3% success rate, 5.14 rank average) than other agents despite a significantly reduced model size. In the 24-hour extended run-time experiment (24hrs), MLZero approaches expert-level performance given sufficient computational resources.

Implementation Details & Ablation Study

Each agent was assigned a 3-hour time limit per dataset to produce results. It is important to note that only MLZero and Codex CLI operate truly end-to-end, while other agents require varying degrees of preprocessing or postprocessing to function on our benchmark. For example, DS-Agent requires manual code execution, while results from AIDE and AutoKaggle needed manual extraction from working directories.

To assess our method without the advantage of its integrated external knowledge, the -ext configuration removes all access to external ML libraries. This configuration yields a 69.3% success rate and a 4.94 average rank, still outperforming all competitors and highlighting the efficiency of MLZero's other components. We also investigated the contribution of episodic memory through the -epi configuration, which removes episodic memory but retains the LLM's conversation history within the coder agent. This setup achieves an 86.7% success rate with a 2.86 average rank, demonstrating that while episodic memory provides significant benefits, maintaining a coherent conversational context still yields reasonable performance.

In contrast, we explored modifying AIDE to access external knowledge, indicated as +ext. While it shows improvement, MLZero continued to outperform it under these comparable conditions. This result underscores that the superior performance of MLZero stems from its overall system design, not merely from the inclusion of episodic memory or external knowledge in isolation.

BibTeX

@misc{fang2025mlzeromultiagentendtoendmachine,
        title={MLZero: A Multi-Agent System for End-to-end Machine Learning Automation}, 
        author={Haoyang Fang and Boran Han and Nick Erickson and Xiyuan Zhang and Su Zhou and Anirudh Dagar and Jiani Zhang and Ali Caner Turkmen and Cuixiong Hu and Huzefa Rangwala and Ying Nian Wu and Bernie Wang and George Karypis},
        year={2025},
        eprint={2505.13941},
        archivePrefix={arXiv},
        primaryClass={cs.MA},
        url={https://arxiv.org/abs/2505.13941}, 
  }

Table of Contents