ML Agents Train Prompt Injection Detector with 99% F1 Score
Everlier used an ML agent to build a browser-based prompt injection classifier. This approach delivered high performance while managing costs, revealing both the efficiency and limitations of agentic…
Everlier used an ML agent to build a browser-based prompt injection classifier. This approach delivered high performance while managing costs, revealing both the efficiency and limitations of agentic model training.
Everlier, operating as /u/Everlier on Reddit, trained a prompt injection classifier that achieved a 99% F1 score with a model size of approximately 65 MB. The model, a DistilBERT variant, runs directly in the browser using Transformers.js v3. This outcome demonstrates the potential of specialized ML agents like ml-intern when paired with powerful large language models such as DeepSeek v4 Flash for targeted machine learning tasks.
The project aimed to assess how a purpose-built ML agent compares to general-purpose coding agents for developing a prompt injection detection system. The process highlighted specific advantages in dataset procurement and iterative model refinement, alongside clear limitations when deviating from established architectures.
Agent-driven Dataset Sourcing
The initial phase of model training, often the most time-consuming, was significantly streamlined by the ml-intern agent. Everlier configured ml-intern with a Hugging Face token and pointed it to OpenRouter, leveraging OpenAI-compatible APIs. The agent autonomously identified and utilized two existing datasets for prompt injection: deepset/prompt-injections and Shomi28/prompt-injection-dataset.
This automated dataset discovery bypassed what Everlier noted is typically "95% of the work" in such tasks. For the second iteration (v2) of the model, the agent located and integrated a larger synthetic dataset, Bordair/bordair-multimodal, which contributed to the final performance metrics.
Iterative Model Selection and Training
The agent facilitated an iterative approach to model selection and training. For the first version (v1), Everlier targeted CPU inference and selected DistilBERT. After a series of parameter sweeps, the ml-intern agent launched a full training run, resulting in a model with an F1 score of 95.87%.
For v2, the DistilBERT model was retrained using the Bordair/bordair-multimodal dataset. This iteration yielded the final production model, which was quantized to ONNX int8, achieved an F1 score of 99%, and maintained a compact size of approximately 65 MB. This model is accessible via a live demonstration on Hugging Face Spaces at https://huggingface.co/spaces/av-codes/prompt-injection-detector.
Cost-Controlled LLM Orchestration
One notable aspect of Everlier's approach was the efficient management of LLM costs. The DeepSeek v4 Flash model, accessed via OpenRouter, incurred a total cost of under $5 for all agent runs throughout the project. This low operational cost underscores the economic viability of using specialized agents with cost-effective LLM APIs for iterative development and experimentation.
However, cost efficiency was not universal. A separate attempt to train an HRM-Text model on Hugging Face remote training, utilizing a T4 GPU, cost $20. This particular run failed after the first epoch due to the agent's inability to correctly implement the training routine and optimize parameters from the specified research paper.
Troubleshooting Agent Limitations
The project exposed specific limitations of the ml-intern agent when dealing with non-standard or novel architectures. The agent initially struggled with the HRM-Text model, misinterpreting the architecture and setting up a TRM run instead. When explicitly steered back to HRM with the correct paper (https://arxiv.org/abs/2605.20613), the training script proved unoptimized for the hardware.
The subsequent $20 remote training session on a T4 GPU failed because the agent did not follow the training routine outlined in the paper, leading to incorrect optimizer and parameter settings. This resulted in "params blowing up" after the first epoch, indicating a critical breakdown in the agent's ability to adapt to complex, research-driven training protocols without explicit, precise human guidance.
WHAT WE'D CHANGE
The reliance on synthetic datasets, while efficient for rapid prototyping, introduces a significant vulnerability. Everlier noted that the synthetic dataset used for the v2 model meant "the train/test splits might be too similar." For a production-grade prompt injection detector, a more diverse and real-world-representative dataset is essential. This would involve collecting actual adversarial prompts from various sources, potentially through red-teaming exercises or community contributions, to ensure the model generalizes effectively beyond its training distribution. Without this, the reported 99% F1 score, while impressive on the given data, may not reflect real-world robustness.
The agent's failure to correctly implement the HRM-Text model's training routine highlights a gap in its autonomous capabilities for advanced research. While ml-intern excelled on the "happy path" with established architectures like DistilBERT, it struggled with a non-standard model. Future iterations of such agentic tools, or their application, should incorporate more robust validation steps for complex training scripts. This could involve pre-flight checks against paper specifications, or a human-in-the-loop verification process for optimizer and parameter settings before committing to expensive compute resources.
The $20 spent on a failed remote training job, though a small sum, points to an area for improved cost control. For larger-scale or more frequent experimentation, such failures could accumulate. Implementing automated budget caps or conditional training halts that trigger upon early signs of performance degradation (e.g., exploding gradients, no loss reduction) could prevent unnecessary expenditure. This would allow founders to explore novel architectures more safely, even when an agent's understanding of the training protocol is incomplete.
LANDING
Everlier's experience demonstrates that ML agents can significantly accelerate the development of specialized models, particularly by automating labor-intensive steps like dataset discovery and iterative training. The project achieved a high-performing, browser-deployable prompt injection detector with minimal LLM costs. However, the agent's limitations with novel architectures and the inherent risks of synthetic datasets underscore that human oversight remains critical. The future of agent-driven ML development will likely involve a hybrid approach, where agents handle the routine while founders provide strategic guidance and validate complex experimental paths.
Pull quote: “”
Every claim ties to a primary source. See our methodology.