Defenders Unlearn Poisoned Fine-Tuning in Summarizers
Fine-tuning data poisoning quietly warps AI summaries.
Training-time data poisoning during fine-tuning poses a significant threat to LLMs deployed for abstractive text summarization, where small task-specific datasets exert outsized influence on model behavior. The paper presents a unified post-hoc defense framework for detecting and remediating fine-tuning-stage poisoning in summarization models across the ML supply chain. It details two detection modes: in white-box settings, poisoned document summary pairs exhibit abnormally high training influence, enabling detection via influence-function analysis with semantic consistency checks. In black-box settings, the team reports poisoned models show two to three times greater sensitivity to semantics-preserving perturbations, enabling behavioral auditing without access to training data.
Beyond existing poisoning formulations, the team reports novel attacks targeting factual distortion and representational bias, showing that poisoning alters summarization behavior without triggering conventional alarms. Across nine architectures and six benchmark datasets under adaptive attacks, the defenses achieve 85-92 percent detection precision, while gradient-ascent unlearning restores up to 96 percent of original behavior with minimal utility loss (less than 0.6 percent ROUGE degradation). These results indicate that fine-tuning time poisoning leaves persistent structural artifacts, enabling practical detection and post-deployment recovery without full retraining.
From the engineering standpoint, the claim is clear: you can build a defense that sits after fine-tuning and still clean up or revert poisoned behavior without scrapping an entire model. The numbers matter for product planning (85-92 percent precision gives you usable signal, and 96 percent restoration suggests a viable rollback path when signs of poisoning appear). The work also signals a practical blueprint for teams that worry about supply chain integrity, not just the immediate model, by focusing on both the data that enters fine-tuning and the model’s responses after deployment.
Practitioner insights
In short, the study reframes poisoning as a deviation you can detect and correct after the fact, rather than a catastrophe that forces a model rewrite. By anchoring defenses in concrete signals (training influence, semantic sensitivity, and recoverability), product teams gain a sensible, engineering driven playbook for safeguarding summary systems.
Sources
- Detect, Unlearn, Restore: Defending Text Summarization Models Against Data PoisoningarXiv LLM/Foundation Query / Primary source / Published JUN 24, 2026 / Accessed JUN 25, 2026