Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output, including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods aim to reduce such undesirable output via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing to achieve alignment; and some remove benign concepts, degrading the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to model the concepts sufficiently, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Then, given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Finally, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, accurately representing each activation as a linear combination of benign and undesirable components. By removing the undesirable components from the activation, we reorient the behavior of the LLM towards the alignment goals. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revision, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
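The decomposition-and-removal step (sparse coding along the dictionary, then dropping the undesirable components) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name remove_undesirable and the sparsity weight alpha are assumptions, and scikit-learn's LASSO solver stands in for whatever sparse-coding routine PaCE actually uses.

import numpy as np
from sklearn.linear_model import Lasso

def remove_undesirable(x, D, undesirable_idx, alpha=0.05):
    # x: (d,) activation vector; D: (d, n) dictionary, one concept per column.
    # Solve min_c ||x - D c||^2 + alpha * ||c||_1 (sparse coding via LASSO).
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    lasso.fit(D, x)
    c = lasso.coef_.copy()
    residual = x - D @ c        # content the dictionary does not explain
    c[undesirable_idx] = 0.0    # drop the undesirable concept components
    return D @ c + residual     # benign components plus the residual

Keeping the residual preserves whatever the dictionary does not capture, so only the identified undesirable components are removed from the activation.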
The PaCE pipeline has three major steps: Step 1 collects concept vectors and constructs the concept dictionary; Step 2 decomposes the activation vector of the given input via sparse coding to obtain concept coefficients; and Step 3 edits the concept coefficients to reorient the response. See the paper for more details.
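For Step 1, here is a hedged sketch of how such a dictionary might be assembled. It assumes a concept_stimuli mapping from concept names to stimulus sentences and a get_activation helper that returns a hidden-state vector for a sentence (both hypothetical); averaging stimulus activations into one unit-norm atom per concept is an illustrative choice, not necessarily the paper's exact extraction scheme.

import numpy as np

def build_concept_dictionary(concept_stimuli, get_activation):
    # concept_stimuli: dict mapping concept name -> list of stimulus sentences.
    # get_activation: hypothetical helper returning a (d,) hidden-state vector,
    # e.g., a fixed layer's activation from the LLM for a given sentence.
    names, cols = [], []
    for name, stimuli in concept_stimuli.items():
        # Average the stimuli activations into a single concept vector.
        v = np.mean([get_activation(s) for s in stimuli], axis=0)
        cols.append(v / np.linalg.norm(v))  # normalize each dictionary atom
        names.append(name)
    # D is (d, n): one unit-norm column (atom) per concept.
    return np.stack(cols, axis=1), names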
To remove a concept direction `red' from the latent code `red apple' (left), prior works use either i) orthogonal projection (middle right), which may remove extra directions, or ii) vector addition (right), for which it is hard to choose the edit strength. Instead, PaCE explicitly models the concept dictionary in the latent space and uses oblique projection (middle left).
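The geometric difference between the two projections can be seen in a small numerical sketch with two non-orthogonal concept vectors standing in for `red' and `apple'; the vectors below are made up purely for illustration.

import numpy as np

# Two non-orthogonal concept atoms: the setting where the projections differ.
red   = np.array([1.0, 0.0])
apple = np.array([0.6, 0.8])           # unit norm, correlated with `red'
x = red + apple                        # latent code for `red apple'

# Orthogonal projection: removes everything along `red', including the
# part of `apple' that happens to correlate with it.
x_orth = x - (x @ red) / (red @ red) * red

# Oblique projection: decompose x in the dictionary D = [red, apple],
# zero the `red' coefficient, and keep `apple' fully intact.
D = np.stack([red, apple], axis=1)
c = np.linalg.solve(D.T @ D, D.T @ x)  # exact coefficients (D is square here)
c[0] = 0.0                             # drop the `red' component only
x_obl = D @ c

print(x_orth)  # [0.  0.8] -> the `apple' direction was clipped
print(x_obl)   # [0.6 0.8] -> `apple' preserved exactly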
Examples of concepts and their stimuli in the collected PaCE-1M dataset. Our broad collection of concepts enables PaCE to accurately decompose a task input and modify the representation towards the desired behavior. The undesirable annotation, shown here for the detoxification task, indicates that the concept will be removed from the representation; the benign annotation indicates that the concept will be preserved.
This demo shows an example of jailbreaking LLaMA2-7B-Chat and its detoxification by PaCE. PaCE successfully detoxifies the response while maintaining the model's instruction-following capability.
The Representation (Activation) Space of LLaMA2-13B-Chat with the first 10,000 concepts from PaCE-1M. See Appendix Figure 15 in the paper for more details. The visualization shows the first two UMAP dimensions of the concept vectors. We observe that concepts with similar semantics cluster together, indicating that the activation space has semantic structure.
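A plot like this can be reproduced with a short sketch, assuming the concept vectors are stacked in a NumPy array (the file name below is hypothetical); it uses the umap-learn and matplotlib packages in their standard form, not the exact script behind the figure.

import numpy as np
import umap
import matplotlib.pyplot as plt

# vectors: (n_concepts, d) array of concept vectors extracted from the LLM.
vectors = np.load("concept_vectors.npy")  # hypothetical file name

# Project to 2D with UMAP and scatter-plot the two dimensions.
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(vectors)
plt.scatter(embedding[:, 0], embedding[:, 1], s=2)
plt.title("UMAP of PaCE-1M concept vectors")
plt.show()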
If you find our work helpful, please consider citing our paper:
@article{luo2024pace,
  title={PaCE: Parsimonious Concept Engineering for Large Language Models},
  author={Luo, Jinqi and Ding, Tianjiao and Chan, Kwan Ho Ryan and Thaker, Darshan and Chattopadhyay, Aditya and Callison-Burch, Chris and Vidal, Ren{\'e}},
  journal={arXiv preprint arXiv:2406.04331},
  year={2024}
}
or use the BibTeX entry for the NeurIPS version:
@inproceedings{luo2024pace,
  title={PaCE: Parsimonious Concept Engineering for Large Language Models},
  author={Luo, Jinqi and Ding, Tianjiao and Chan, Kwan Ho Ryan and Thaker, Darshan and Chattopadhyay, Aditya and Callison-Burch, Chris and Vidal, Ren{\'e}},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}