Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

EIC Lab @ Georgia Institute of Technology

Abstract


Attention is a fundamental component behind the remarkable achievements of large language models (LLMs). However, our current understanding of the attention mechanism, especially regarding how attention distributions are established, remains limited. Inspired by recent studies that explore the presence of an attention sink in the initial token, which receives disproportionately large attention scores despite its lack of semantic importance, this work delves deeper into this phenomenon. We aim to provide a more profound understanding of the existence of attention sinks within LLMs and to uncover ways to enhance the achievable accuracy of LLMs by directly optimizing the attention distributions, without the need for weight finetuning. Specifically, this work begins with comprehensive visualizations of the attention distributions in LLMs during inference across various inputs and tasks. Based on these visualizations, to the best of our knowledge, we are the first to discover that (1) attention sinks occur not only at the start of sequences but also within later tokens of the input, and (2) not all attention sinks have a positive impact on the achievable accuracy of LLMs. Building upon our findings, we propose a training-free Attention Calibration Technique (ACT) that automatically optimizes the attention distributions on the fly during inference in an input-adaptive manner. Extensive experiments validate that ACT consistently enhances the accuracy of various LLMs across different applications. Specifically, ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B.

Background on the Attention Sink Phenomenon



Recent studies (e.g., StreamLLM [Xiao et al., ICLR 2024]) observe that certain tokens attract excessively high attention while carrying only limited semantic information. These studies observe the attention sink phenomenon only in the initial tokens and consider the attention sink to help LLMs by preserving the attention distribution.

In this paper, we aim to dive deeper into this phenomenon and answer the following three research questions:

  • Q1: Does an attention sink only exist in the initial token?
  • Q2: Will preserving attention sinks always benefit LLMs’ accuracy in different scenarios?
  • Q3: Can we enhance LLMs’ accuracy by solely manipulating attention sinks without any weight finetuning?

Q1: Does an attention sink only exist in the initial token?

Figure 1
Upper: Visualization of the averaged attention maps across all heads and layers of Llama2-7B-chat on different datasets. Lower: Visualization of the averaged attention maps across all heads in each layer when processing a sample from SST2 with Llama2-7B-chat. Identified attention sinks in the averaged attention map from SST2 are marked with green boxes.

Figure 2
Attention score distribution of the initial token (i.e., the attention sink observed in StreamLLM), non-initial high-attention tokens, and other tokens for classification tasks (top) and multiple-choice tasks (bottom).

Figure 3
Frequency of tokens that appear with significantly higher attention scores (i.e., are identified as attention sinks). Most of these tokens are special characters or punctuation with limited semantic information.
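
As a rough illustration of how such tokens can be flagged, the sketch below marks tokens whose average received attention far exceeds the uniform baseline of 1/seq_len. The averaging over heads and query positions and the ratio threshold are our assumptions for illustration, not the exact criterion used in the paper.

import torch

def find_attention_sinks(attn, ratio=5.0):
    """Flag tokens that receive far more attention than the uniform baseline.

    attn:  attention probabilities of shape [num_heads, seq_len, seq_len],
           where each row sums to 1 over the key dimension.
    ratio: how many times the uniform score 1/seq_len a token must receive,
           on average, to be treated as an attention sink (illustrative value).
    """
    num_heads, seq_len, _ = attn.shape
    # Average attention each key token receives, over heads and query positions.
    received = attn.mean(dim=0).mean(dim=0)   # shape: [seq_len]
    threshold = ratio / seq_len               # uniform baseline is 1/seq_len
    return torch.nonzero(received > threshold).flatten()

# Example on a random, row-normalized attention map; on real LLM attention maps,
# position 0 (the initial-token sink) and a few later tokens are typically flagged.
attn = torch.softmax(torch.randn(32, 128, 128), dim=-1)
print(find_attention_sinks(attn))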

Q2: Will preserving attention sinks always benefit LLMs’ accuracy in different scenarios?



Visualization of the accuracy improvement on the MMLU dataset (Hendrycks et al., 2020) achieved by reducing the attention score of later attention sinks, i.e., those in the middle of input sequences, for each individual head separately. It shows that the later attention sinks in most attention heads hurt the performance of LLMs, since removing them improves accuracy.
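
To make this per-head probing concrete, a minimal sketch of the sweep is shown below. The helper evaluate_accuracy(model, dataset, calibrated_head) is hypothetical and not part of the released code: with calibrated_head=None it evaluates the unmodified model, and otherwise it scales down the attention received by later sinks in the given (layer, head) pair only.

def probe_heads(model, dataset, num_layers, num_heads, evaluate_accuracy):
    """Measure, head by head, how accuracy changes when the attention received
    by later (non-initial) sinks is reduced in that head alone.
    evaluate_accuracy is a user-supplied, hypothetical evaluation callable.
    """
    baseline = evaluate_accuracy(model, dataset, calibrated_head=None)
    deltas = {}
    for layer in range(num_layers):
        for head in range(num_heads):
            acc = evaluate_accuracy(model, dataset, calibrated_head=(layer, head))
            # Positive delta: this head's later attention sinks were hurting accuracy.
            deltas[(layer, head)] = acc - baseline
    return deltas

Heads with a positive delta are the candidates that ACT (introduced next) selects and calibrates.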

Q3: Can we enhance LLMs’ accuracy by solely manipulating attention sinks without any weight finetuning?


To achieve this goal, we propose the Attention Calibration Technique (ACT), featuring the following three steps (sketched in code below the list):

  • Select a set of ineffective heads for each LLM
  • Reduce the attention received by attention sinks in the selected heads
  • Redistribute the reduced attention to the other tokens
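
Below is a minimal sketch of how steps 2 and 3 could be realized on a single layer's attention map. The scaling factor beta, the proportional redistribution rule, and the function interface are our simplifications for illustration rather than ACT's exact formulation; in ACT, the ineffective heads are selected offline and the calibration is applied on the fly during inference.

import torch

def calibrate_attention(attn, sink_positions, selected_heads, beta=0.4):
    """Reduce the attention received by (later) sink tokens in the selected heads
    and redistribute the freed attention to the remaining tokens.

    attn:            [num_heads, seq_len, seq_len] attention probabilities,
                     each row summing to 1 over the key dimension.
    sink_positions:  indices of the identified (later) attention sinks.
    selected_heads:  heads previously found to be hurt by these sinks.
    beta:            fraction of the sink attention that is kept (illustrative).
    """
    attn = attn.clone()
    for h in selected_heads:
        # Step 2: scale down the attention that sink tokens receive in this head.
        sink_mass = attn[h, :, sink_positions].sum(dim=-1, keepdim=True)  # [seq_len, 1]
        attn[h, :, sink_positions] *= beta
        freed = sink_mass * (1.0 - beta)  # attention mass taken away from the sinks

        # Step 3: give the freed attention to the remaining tokens, proportionally
        # to their current scores, so every row still sums to (roughly) 1.
        keep = torch.ones(attn.shape[-1], dtype=torch.bool)
        keep[sink_positions] = False
        weights = attn[h][:, keep]
        weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        attn[h][:, keep] += freed * weights
    return attn

# Example: calibrate heads 3 and 7, leaving the initial-token sink (position 0)
# untouched and only reducing two later sinks, as suggested by the Q2 finding.
attn = torch.softmax(torch.randn(32, 128, 128), dim=-1)
calibrated = calibrate_attention(attn, sink_positions=[42, 97], selected_heads=[3, 7])
print(calibrated[3].sum(dim=-1)[:5])  # rows remain (approximately) normalized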


Attention map visualization before and after applying our proposed ACT.

Evaluation Results


With our proposed ACT, we consistently improve the accuracy of LLMs across open-ended question answering, text classification, and domain-specific multiple-choice tasks, all without any weight finetuning:


ACT on open-ended question-answering datasets using Llama2-chat models of different sizes. Each result for SQuADv1/v2 is presented as exact match score / F1 score. On MT-Bench, ACT reduces the gap between the smaller and larger Llama2 models by 33%.


ACT on text classification datasets.


ACT on domain-specific multiple-choice datasets.

Citation


@article{yu2024unveiling,
title={Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration},
author={Yu, Zhongzhi and Wang, Zheng and Fu, Yonggan and Shi, Huihong and Shaikh, Khalid and Lin, Yingyan Celine},
journal={arXiv preprint arXiv:2406.15765},
year={2024}
}

Acknowledgements


This work is supported by the National Science Foundation (NSF) through the CCRI funding (Award number: 2016727) and CoCoSys, one of seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

The website template was borrowed from Instant Neural Graphics Primitives.