Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the intermediate activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others.
To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli (the DACO-400K dataset) and summarizing their activations into per-concept directions. Second, we show that the curated dictionary can be directly used to intervene on activations via sparse coding. Third, we propose a new steering approach that uses the dictionary to initialize SAE training and to automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.
The DACO pipeline has three major components. (1) Dictionary Construction: Multimodal retrieval over caption-image stimuli produces a concept dictionary of 15,000 direction vectors organized into semantic groups. (2) Sparse Coding Steering: The dictionary enables lightweight, training-free concept intervention by decomposing MLLM activations via oblique projection. (3) SAE-based Steering: The dictionary initializes SAE training, grounding the learned features in known semantics. A concept lookup mechanism then annotates each SAE atom, enabling fine-grained, interpretable control at inference time.
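To make the sparse-coding steering component concrete, here is a minimal NumPy sketch of the general idea: decompose an activation over a concept dictionary and subtract the components attributed to undesirable concepts. This is an illustration under stated assumptions, not the paper's exact algorithm; it uses greedy matching pursuit in place of the oblique-projection decomposition described above, and all names (`sparse_code`, `steer`, `alpha`, the toy dictionary) are hypothetical.

```python
import numpy as np

def sparse_code(h, D, k=8):
    """Greedy matching pursuit: approximate h as a sparse combination
    of the rows of D (each row a unit-norm concept direction)."""
    residual = h.copy()
    coeffs = np.zeros(D.shape[0])
    for _ in range(k):
        scores = D @ residual               # correlation with each concept
        i = int(np.argmax(np.abs(scores)))  # best-matching atom
        coeffs[i] += scores[i]
        residual -= scores[i] * D[i]        # remove the explained component
    return coeffs

def steer(h, D, coeffs, undesirable, alpha=1.0):
    """Subtract the components attributed to undesirable concepts,
    leaving the rest of the activation untouched."""
    removal = np.zeros_like(h)
    for i in undesirable:
        removal += coeffs[i] * D[i]
    return h - alpha * removal

# Toy setup: 50 random unit-norm concept directions in a 64-dim space.
rng = np.random.default_rng(0)
D = rng.normal(size=(50, 64))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# An activation dominated by concept 3 (pretend it is "undesirable")
# with a smaller benign component along concept 10.
h = 2.0 * D[3] + 0.5 * D[10] + 0.05 * rng.normal(size=64)

coeffs = sparse_code(h, D, k=8)
h_safe = steer(h, D, coeffs, undesirable={3})
```

After steering, the activation's projection onto the suppressed concept direction shrinks while the rest of the representation is largely preserved, which is the behavior the training-free intervention relies on.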
DACO-400K contains over 400,000 caption-image stimuli covering 15,000 multimodal concepts. Each concept is annotated as undesirable (to be suppressed for safety) or benign (to be preserved). The broad concept coverage allows DACO to accurately decompose any input representation and precisely target harmful semantics without disrupting unrelated behaviors.
Example jailbreak attempts using multimodal adversarial inputs (typographic visual triggers with harmful text prompts). DACO steers the activations of a frozen Qwen2.5-VL-7B-Instruct to produce safe, meaningful responses — without any fine-tuning or repeated queries. Sensitive words are replaced with asterisks.
UMAP visualization of the DACO-400K concept vectors in the MLLM activation space. Concepts with semantically related meanings cluster together, confirming that the activation space possesses coherent geometric structure that DACO exploits for precise, targeted steering.
If you find our work helpful, please consider citing our paper:
@inproceedings{luo2026daco,
  title     = {Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs},
  author    = {Jinqi Luo and Jinyu Yang and Tal Neiman and Lei Fan and Bing Yin and Son Tran and Mubarak Shah and Ren{\'e} Vidal},
  booktitle = {CVPR},
  year      = {2026}
}