Prototype Guided Backdoor Defense via Activation Space Manipulation

ICCV 2025

¹CVIT, KCIS, IIIT Hyderabad   ²Amazon Research, India

Abstract

Deep learning models are susceptible to backdoor attacks, in which an adversary maliciously perturbs a subset of the training data with a trigger to force misclassification into a target class. Many trigger types have been used, including semantic triggers that are easily realizable in real-world settings. We present Prototype Guided Backdoor Defense (PGBD), a robust post-hoc defense that scales across trigger types, including previously unsolved semantic triggers. PGBD exploits displacements in the geometric space of activations to penalize movement and alignment towards the trigger, via a novel sanitization loss applied in a post-hoc fine-tuning step. This approach scales to all types of attacks and triggers and achieves better performance across settings. We also present the first defense against semantic attacks, evaluated on a new celebrity face image dataset. More broadly, activation spaces can provide rich clues for enhancing deep learning models in many ways.
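
To make the mechanism concrete, the sketch below shows one plausible form of such a sanitization term (a minimal illustration, not the paper's exact formulation): it penalizes the component of each sample's activation displacement that aligns with a direction towards the target class in feature space. The names sanitization_loss and pav, and the (B, D) shapes, are our assumptions.

    import torch
    import torch.nn.functional as F

    def sanitization_loss(features, ref_features, pav):
        """Penalize activation movement that aligns with a (hypothetical)
        prototype activation vector pav pointing towards the target class.

        features:     activations of the model being fine-tuned, shape (B, D)
        ref_features: activations of a frozen reference copy,    shape (B, D)
        pav:          direction towards the target prototype,    shape (D,)
        """
        displacement = features - ref_features    # how activations moved
        direction = F.normalize(pav, dim=0)       # unit direction
        alignment = displacement @ direction      # signed projection, shape (B,)
        return alignment.clamp(min=0).mean()      # penalize movement toward trigger

During post-hoc fine-tuning, this term would be combined with a standard cross-entropy loss on clean data, e.g. loss = ce_loss + lam * sanitization_loss(feats, ref_feats, pav).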



Our main contributions

A robust and scalable backdoor defense that achieves state-of-the-art results across all attack variations. PGBD shows consistent performance across different attack types (patch, functional, adaptive, dynamic, and semantic) and across previously difficult settings such as low poisoning ratios.

A defense configurable to all types of attack scenarios and defense settings, achieved by leveraging class-level geometric relations in the activation space of backdoored models. Prototype Activation Vectors (PAVs) are used to sanitize the model; depending on the extent of attack knowledge available, the type of PAV and how it is used for sanitization can be chosen accordingly (see the sketch below).
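
One plausible construction, sketched below under our own assumptions (the paper's exact definition may differ), forms class prototypes as per-class mean activations and takes a PAV as the normalized difference between two prototypes; src_class and tgt_class are illustrative.

    import torch

    def class_prototypes(features, labels, num_classes):
        """Mean activation per class; features (N, D), labels (N,)."""
        return torch.stack([
            features[labels == c].mean(dim=0) for c in range(num_classes)
        ])  # shape (num_classes, D)

    def prototype_activation_vector(protos, src_class, tgt_class):
        """Hypothetical PAV: unit direction from a source-class prototype
        towards the (suspected) target-class prototype."""
        v = protos[tgt_class] - protos[src_class]
        return v / v.norm()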

A new public semantic attack dataset consisting of attacks based on completely real images and attacks based on synthetically modified images. Our semantic attack features three variations: 1) diverse triggers from a single concept, where triggers are real-world objects already present in the original image (e.g., sunglasses); 2) diverse triggers from a single concept, where triggers are synthetically generated and added post-hoc at semantically valid locations in the image; and 3) a single unique trigger added post-hoc at semantically valid locations in the image (illustrated in the sketch below). PGBD is the first successful defense against such semantic attacks, and results show the best defense across all three trigger types.
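
For intuition on the post-hoc variations, adding a trigger amounts to compositing a trigger patch at a chosen, semantically valid pixel location. The sketch below uses PIL; the location is assumed to come from an external landmark or region detector, which we omit.

    from PIL import Image

    def add_trigger(image_path, trigger_path, location, out_path):
        """Paste a trigger patch (with alpha channel) onto an image at
        location, an (x, y) pixel coordinate assumed to be semantically
        valid (e.g., near the eyes for a sunglasses-style trigger)."""
        img = Image.open(image_path).convert("RGBA")
        trigger = Image.open(trigger_path).convert("RGBA")
        img.paste(trigger, location, mask=trigger)  # alpha-composite the patch
        img.convert("RGB").save(out_path)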


Results

Gradcam Visualization

Grad-CAM visualizations of clean and backdoored images before and after applying PGBD. Before PGBD, the model focuses on the trigger region in the backdoored image; after PGBD, it focuses on the relevant object features.
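
Visualizations like these can be produced with a standard Grad-CAM recipe; the sketch below uses the pytorch-grad-cam package, where the ResNet-18 backbone, layer choice, input file, and class index are illustrative stand-ins for the actual backdoored and sanitized models (input normalization is omitted for brevity).

    import numpy as np
    from PIL import Image
    from torchvision import models, transforms
    from pytorch_grad_cam import GradCAM
    from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
    from pytorch_grad_cam.utils.image import show_cam_on_image

    model = models.resnet18(weights="IMAGENET1K_V1").eval()  # placeholder classifier
    target_layers = [model.layer4[-1]]                       # last conv block

    img = Image.open("sample.jpg").convert("RGB").resize((224, 224))
    rgb = np.asarray(img, dtype=np.float32) / 255.0          # HWC float in [0, 1]
    x = transforms.functional.to_tensor(img).unsqueeze(0)    # (1, 3, 224, 224)

    with GradCAM(model=model, target_layers=target_layers) as cam:
        heat = cam(input_tensor=x, targets=[ClassifierOutputTarget(0)])[0]

    overlay = show_cam_on_image(rgb, heat, use_rgb=True)     # uint8 heatmap overlay
    Image.fromarray(overlay).save("gradcam.jpg")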

BibTeX

@article{amula2025prototype,
  title={Prototype Guided Backdoor Defense},
  author={Amula, Venkat Adithya and Samavedam, Sunayana and Saini, Saurabh and Gupta, Avani and others},
  journal={arXiv preprint arXiv:2503.20925},
  year={2025}
}