Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Secure Prediction, Edge AI, and Multimodal LLM (SPEED) Lab
Florida Atlantic University
IJCNN 2026 (Main Track)
t-SNE visualization

Sim-CLIP is an unsupervised adversarial fine-tuning framework designed to enhance the robustness of vision encoders in Vision–Language Models (VLMs). The workflow begins by generating adversarial perturbations of input images using gradient-based methods such as PGD, creating paired clean and adversarial views. These two views are then passed through a shared-weight Siamese CLIP encoder to produce corresponding representations. Instead of relying on traditional ℓ2 alignment, Sim-CLIP optimizes a cosine similarity objective to enforce semantic consistency between clean and perturbed embeddings, ensuring robustness without sacrificing high-level visual semantics. To stabilize training and prevent representation collapse, a symmetric stop-gradient mechanism is applied, allowing one branch to guide the other in an alternating manner. Once fine-tuned, the robust vision encoder is seamlessly integrated into downstream VLMs (e.g., LLaVA, OpenFlamingo). This plug-and-play pipeline enables improved robustness against adversarial attacks.
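The pipeline described above can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the authors' released implementation: the function names, hyperparameters, and the use of a plain linear projection-free encoder interface are our own assumptions; only the overall recipe (PGD-perturbed views, cosine alignment, symmetric stop-gradient) follows the description.

```python
import torch
import torch.nn.functional as F


def pgd_views(encoder, images, eps=4 / 255, alpha=1 / 255, steps=10):
    """Generate adversarial views with an l-inf PGD attack that
    minimizes cosine similarity to the clean embeddings (sketch)."""
    with torch.no_grad():
        z_clean = encoder(images)
    delta = torch.zeros_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        z_adv = encoder((images + delta).clamp(0, 1))
        # Attack objective: push the perturbed embedding away from the clean one.
        sim = F.cosine_similarity(z_adv, z_clean, dim=-1).mean()
        grad, = torch.autograd.grad(sim, delta)
        # Signed gradient descent on similarity, projected back into the eps-ball.
        delta = (delta - alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (images + delta).clamp(0, 1).detach()


def sim_clip_loss(z_clean, z_adv):
    """Symmetric negative cosine similarity with stop-gradient:
    each branch is aligned to the *detached* embedding of the other,
    which is what prevents representation collapse without negatives."""
    loss_a = -F.cosine_similarity(z_clean, z_adv.detach(), dim=-1).mean()
    loss_b = -F.cosine_similarity(z_adv, z_clean.detach(), dim=-1).mean()
    return 0.5 * (loss_a + loss_b)
```

In a training step, one would call `pgd_views` to build the adversarial view of each batch, encode both views with the shared-weight encoder, and minimize `sim_clip_loss`; note that no memory bank, momentum encoder, or large negative batch is needed, which is the source of the low computational overhead claimed above.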

Abstract

Vision–Language Models (VLMs) rely heavily on pretrained vision encoders to support downstream tasks such as image captioning, visual question answering, and zero-shot recognition. Despite their strong performance, these encoders remain highly vulnerable to imperceptible adversarial perturbations, which can severely degrade both robustness and semantic quality in multimodal reasoning. In this work, we introduce Sim-CLIP, an unsupervised adversarial fine-tuning framework that enhances the robustness of the CLIP vision encoder while preserving fine-grained semantic representations. Sim-CLIP adopts a Siamese training architecture with a cosine similarity objective and a symmetric stop-gradient mechanism to enforce semantic alignment between clean and adversarial views. This design avoids large-batch contrastive learning and auxiliary momentum encoders, enabling robust training with low computational overhead. We evaluate Sim-CLIP across multiple Vision–Language Models and tasks, including image captioning, visual question answering, and zero-shot classification, under both targeted and untargeted adversarial attacks. Experimental results demonstrate that Sim-CLIP consistently outperforms state-of-the-art robust CLIP variants, achieving stronger adversarial robustness while maintaining or improving semantic fidelity. These findings highlight the limitations of existing adversarial defenses and establish Sim-CLIP as an effective and scalable solution for robust vision–language representation learning.

Experimental Results

Robustness against untargeted attacks

Comparison of the robustness of different VLMs against untargeted attacks on image captioning and visual question answering (VQA) tasks. For captioning, we use the COCO and Flickr30k datasets and report CIDEr scores; for VQA, we use the VizWiz and OKVQA datasets and report VQA accuracy.

Main results comparison

Qualitative Examples

Targeted ℓ∞ attacks at radius ε = 4/255, using the original and robust CLIP models as the vision encoder in LLaVA. Using the target strings from Table 2, we present the generated captions (good captions, captions with mistakes, captions missing intricate details, and malicious target outputs) for the original (left) and imperceptibly perturbed adversarial (right) images.

K-shot evaluation results

Robustness against targeted attacks

Quantitative evaluation of ℓ∞ targeted attacks at radii ε = 2/255 and ε = 4/255. Sim-CLIP consistently achieves higher CIDEr scores under both radii.

t-SNE visualization

Performance on zero-shot classification

t-SNE visualization

Funding


This material is partly based upon work supported by the U.S. National Science Foundation (NSF) under Grant No. CRII-IIS-RI-2553868. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

BibTeX

@article{hossain2024sim,
  title={Sim-clip: Unsupervised siamese adversarial fine-tuning for robust and semantically-rich vision-language models},
  author={Hossain, Md Zarif and Imteaj, Ahmed},
  journal={arXiv preprint arXiv:2407.14971},
  year={2024}
}