Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

CVPR 2026

1University of Pennsylvania   2Amazon   3University of Central Florida

Dictionary-Aligned Concept Control (DACO) is an inference-time steering framework that leverages a curated dictionary of 15,000 multimodal concepts and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. DACO significantly improves the safety of MLLMs while preserving their general-purpose capabilities.

DACO teaser figure

Abstract

Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the intermediate activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others.

To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli (the DACO-400K dataset) and summarizing their activations into per-concept directions. Second, we show that the curated dictionary can be directly used to intervene on activations via sparse coding. Third, we propose a new steering approach that uses the dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.

Framework

DACO framework pipeline

The DACO pipeline has three major components. (1) Dictionary Construction: Multimodal retrieval over caption-image stimuli produces a concept dictionary of 15,000 direction vectors organized into semantic groups. (2) Sparse Coding Steering: The dictionary enables lightweight, training-free concept intervention by decomposing MLLM activations via oblique projection. (3) SAE-based Steering: The dictionary initializes SAE training, grounding the learned features in known semantics. A concept lookup mechanism then annotates each SAE atom, enabling fine-grained, interpretable control at inference time.
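The training-free intervention in component (2) can be illustrated with a minimal sketch. Here we stand in orthogonal matching pursuit for the sparse-coding step (the paper's oblique-projection formulation is not reproduced here), and the dictionary `D`, the hypothetical `undesirable` index set, and all shapes are illustrative assumptions rather than the released implementation:

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

# Hypothetical sizes: d = hidden dimension, K = number of dictionary concepts.
d, K = 64, 200
rng = np.random.default_rng(0)

# D: concept dictionary with one unit-norm direction vector per column.
D = rng.normal(size=(d, K))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# Indices of concepts flagged as undesirable (illustrative placeholder set).
undesirable = {3, 17, 42}

def steer(h, alpha=1.0, n_nonzero=10):
    """Decompose activation h over the dictionary, damp the coefficients
    of undesirable concepts, then reconstruct the steered activation."""
    # Sparse code: h ~ D @ c with at most n_nonzero active concepts.
    c = orthogonal_mp(D, h, n_nonzero_coefs=n_nonzero)
    c_steered = c.copy()
    for k in undesirable:
        c_steered[k] *= (1.0 - alpha)  # alpha = 1 fully removes the concept
    # Keep the residual the dictionary does not explain, so benign
    # content outside the concept span is untouched.
    residual = h - D @ c
    return D @ c_steered + residual

h = rng.normal(size=d)
h_safe = steer(h)
```

With `alpha = 0` the function returns the activation unchanged, so the steering strength can be tuned continuously between no intervention and full concept removal.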

Visualization

DACO-400K Dataset Samples

DACO-400K contains over 400,000 caption-image stimuli covering 15,000 multimodal concepts. Each concept is annotated as undesirable (to be suppressed for safety) or benign (to be preserved). The broad concept coverage allows DACO to accurately decompose any input representation and precisely target harmful semantics without disrupting unrelated behaviors.

DACO-400K dataset concept examples

Safety Steering Examples

Example jailbreak attempts using multimodal adversarial inputs (typographic visual triggers with harmful text prompts). DACO steers the activations of a frozen Qwen2.5-VL-7B-Instruct to produce safe, meaningful responses — without any fine-tuning or repeated queries. Sensitive words are replaced with asterisks.
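Steering a frozen model at inference time, as above, amounts to editing hidden states as they flow through a layer. A minimal sketch with a PyTorch forward hook, assuming a single concept direction `v` to suppress (the hook factory, the `alpha` strength, and the tuple-unpacking convention are illustrative assumptions, not DACO's released code):

```python
import torch

def make_suppression_hook(v, alpha=1.0):
    """Return a forward hook that dampens the component of a layer's
    hidden states along the unit concept direction v."""
    v = v / v.norm()
    def hook(module, inputs, output):
        # Many HF decoder layers return a tuple; hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        coeff = hidden @ v                     # projection coefficients
        steered = hidden - alpha * coeff.unsqueeze(-1) * v
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook
```

Usage on a frozen model would look like `handle = layer.register_forward_hook(make_suppression_hook(v)); model.generate(...); handle.remove()`, so no weights change and the query is run only once.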

DACO safety steering examples showing safe vs. unsafe responses

Concept Space Structure

UMAP visualization of the DACO-400K concept vectors in the MLLM activation space. Concepts with semantically related meanings cluster together, confirming that the activation space possesses coherent geometric structure that DACO exploits for precise, targeted steering.

UMAP visualization of concept clusters in MLLM activation space

BibTeX

If you find our work helpful, please consider citing our paper:

@inproceedings{luo2026daco,
    title     = {Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs},
    author    = {Jinqi Luo and Jinyu Yang and Tal Neiman and Lei Fan and Bing Yin and Son Tran
                 and Mubarak Shah and Ren{\'e} Vidal},
    booktitle = {CVPR},
    year      = {2026}
}