Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models
Abstract
Sim-CLIP is an unsupervised adversarial fine-tuning framework designed to enhance the robustness of vision encoders in Vision–Language Models (VLMs). The workflow begins by generating adversarial perturbations of input images using gradient-based methods such as projected gradient descent (PGD), creating paired clean and adversarial views. These two views are then passed through a shared-weight Siamese CLIP encoder to produce corresponding representations. Instead of relying on traditional ℓ2 alignment, Sim-CLIP optimizes a cosine similarity objective to enforce semantic consistency between clean and perturbed embeddings, preserving robustness without sacrificing high-level visual semantics. To stabilize training and prevent representation collapse, a symmetric stop-gradient mechanism is applied, allowing each branch to guide the other in an alternating manner. Once fine-tuned, the robust vision encoder can be integrated into downstream VLMs (e.g., LLaVA, OpenFlamingo) as a drop-in replacement, improving robustness against adversarial attacks without retraining the rest of the model.
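The training step described above can be sketched in a few lines of PyTorch. This is a minimal, illustrative sketch, not the authors' implementation: the encoder here is a hypothetical toy network standing in for the CLIP vision encoder, the PGD hyperparameters (`eps`, `alpha`, `steps`) are assumptions, and function names are made up for clarity.

```python
# Illustrative sketch of a Sim-CLIP-style training step (hypothetical names;
# the real method fine-tunes a CLIP vision encoder, not this toy network).
import torch
import torch.nn as nn
import torch.nn.functional as F

def neg_cosine(a, b):
    """Negative mean cosine similarity between two batches of embeddings."""
    return -F.cosine_similarity(a, b, dim=-1).mean()

def pgd_attack(encoder, images, eps=4 / 255, alpha=1 / 255, steps=3):
    """Craft an l_inf-bounded adversarial view by maximizing the
    dissimilarity between clean and perturbed embeddings."""
    clean_z = encoder(images).detach()
    adv = (images + torch.empty_like(images).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        # Ascend the gradient of the dissimilarity (= -cosine similarity).
        loss = neg_cosine(encoder(adv), clean_z)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = images + (adv - images).clamp(-eps, eps)  # project to eps-ball
        adv = adv.clamp(0, 1).detach()
    return adv

def sim_clip_loss(z_clean, z_adv):
    """Symmetric cosine objective with stop-gradient: each branch is
    aligned to a detached (gradient-free) copy of the other."""
    return 0.5 * neg_cosine(z_adv, z_clean.detach()) + \
           0.5 * neg_cosine(z_clean, z_adv.detach())

# Toy stand-in for the shared-weight Siamese vision encoder.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))

images = torch.rand(4, 3, 32, 32)                # clean view
adv_images = pgd_attack(encoder, images)         # adversarial view
loss = sim_clip_loss(encoder(images), encoder(adv_images))
loss.backward()                                  # update the shared encoder
```

Because both views pass through the same weights, a single backward pass updates the shared encoder; the stop-gradient on each target branch is what prevents the trivial collapsed solution where all embeddings become identical.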
Experimental Results
Robust Performance under Untargeted Attacks
Comparison of the robustness of different VLMs against untargeted attacks on image captioning and visual question answering (VQA) tasks. For captioning, we use the COCO and Flickr30k datasets and report CIDEr scores; for VQA, we use the VizWiz and OKVQA datasets and report VQA accuracy.
Qualitative Examples
Targeted ℓ∞ attacks at radius ε = 4/255 using the original and robust CLIP models as the vision encoder in LLaVA. Using the target strings from Table 2, we show the captions generated for the original (left) and imperceptibly perturbed adversarial (right) images: good captions, captions with mistakes, captions missing intricate details, and malicious target outputs.
Robustness against Targeted Attacks
Quantitative evaluation of targeted ℓ∞ attacks at radii ε = 2/255 and ε = 4/255. Sim-CLIP consistently achieves higher CIDEr scores under both radii.
Performance on Zero-Shot Classification
Funding
This material is partly based upon work supported by the U.S. National Science Foundation (NSF) under Grant No. CRII-IIS-RI-2553868. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
BibTeX
@article{hossain2024sim,
title={{Sim-CLIP}: Unsupervised siamese adversarial fine-tuning for robust and semantically-rich vision-language models},
author={Hossain, Md Zarif and Imteaj, Ahmed},
journal={arXiv preprint arXiv:2407.14971},
year={2024}
}