EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting

*EIC Lab @ Georgia Institute of Technology
^University of Minnesota, Twin Cities
#University of California, Santa Barbara

Abstract


Efficient adaptation of large language models (LLMs) on edge devices is essential for applications requiring continuous and privacy-preserving adaptation and inference. However, existing tuning techniques fall short because of their high computation and memory overheads. To this end, we introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices. Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique to reduce the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme to reduce the memory overhead by reducing the backpropagation depth, and (3) a complementary hardware scheduling strategy to handle the irregular computation patterns introduced by LUC and adaptive layer tuning, thereby achieving efficient computation and data movements. Extensive experiments demonstrate that Edge-LLM achieves a 2.92× speedup and an approximately 4× memory overhead reduction compared to vanilla tuning methods, while maintaining comparable task accuracy.

The Growing Demand for Tuning LLMs on Edge Devices


Figures: Example applications of on-device LLM adaptation, including Healthcare, Personal Assistant, and Security.

The Cumbersome Size of LLMs Hinders Tuning on the Edge

Figures: Tuning an LLM requires tens of A100 GPU hours, and its memory footprint exceeds the available memory of most edge devices (e.g., 8 GB to 12 GB).

Edge-LLM Overview



Our proposed Edge-LLM algorithm framework features two key enablers, each addressing one of the aforementioned overheads. Specifically, we develop (a) a layer-wise unified compression (LUC) technique that further compresses the LLM backbone with limited impact on performance, thereby reducing the computation overhead during tuning, and (b) an adaptive layer tuning and voting technique that reduces the memory overhead by decreasing the backpropagation depth required to update the intermediate layers of the target LLM.


A motivating observation behind LUC is that different layers exhibit varying sensitivities to different compression techniques. In the figure above, we visualize the average mean squared error (MSE) between each layer's output before and after compression under different compression techniques and parameters. Based on these varying sensitivities, LUC generates a layer-wise compression policy in which more sensitive layers receive a lower compression ratio.
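To make the sensitivity-driven policy concrete, below is a minimal Python sketch of how such profiling could work for the quantization side alone. All names, the candidate bit-widths, and the thirds-based assignment rule are illustrative assumptions rather than the paper's exact procedure, which jointly covers pruning sparsity as well.

import torch

CANDIDATE_BITS = [2, 3, 4]  # hypothetical candidate bit-widths

def fake_quantize(weight: torch.Tensor, bits: int) -> torch.Tensor:
    # Uniform symmetric fake quantization of a weight tensor.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max().clamp_min(1e-8) / qmax
    return (weight / scale).round().clamp(-qmax - 1, qmax) * scale

def quantization_mse(weight: torch.Tensor, x: torch.Tensor, bits: int) -> float:
    # MSE between a linear layer's output before and after quantization.
    return torch.mean((x @ weight.T - x @ fake_quantize(weight, bits).T) ** 2).item()

def luc_quantization_policy(weights: dict[str, torch.Tensor],
                            calib_x: torch.Tensor) -> dict[str, int]:
    # Rank layers by their MSE at the most aggressive bit-width, then keep
    # the most sensitive third at 4 bits, the middle third at 3 bits, and
    # the least sensitive third at 2 bits.
    sens = {name: quantization_mse(w, calib_x, min(CANDIDATE_BITS))
            for name, w in weights.items()}
    ranked = sorted(sens, key=sens.get, reverse=True)
    third = max(1, len(ranked) // 3)
    return {name: 4 if i < third else (3 if i < 2 * third else 2)
            for i, name in enumerate(ranked)}

For instance, luc_quantization_policy({"q_proj": torch.randn(64, 64)}, torch.randn(8, 64)) returns a per-layer bit-width dictionary that a quantized tuning backend could consume.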


To reduce the backpropagation depth, we propose adaptive layer tuning and voting. First, we add skip connections between selected intermediate layers and the final classification layer; in each tuning iteration, we randomly pick one skip connection and update it together with a few of its preceding layers. After tuning, since each intermediate layer with a skip connection can generate a reasonable output, we further design a voting mechanism that merges the outputs of these intermediate layers, producing a calibrated output by selecting the token with the highest output probability during inference.
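The minimal sketch below shows both halves under stated assumptions: exit_layers holds the indices of the skip-connected layers, window is the number of preceding layers updated alongside a chosen exit, and exit_logits holds one per-step logit tensor per exit. None of these names come from the paper.

import random
import torch

def sample_update_window(exit_layers: list[int], window: int) -> range:
    # Tuning: randomly pick one skip connection (early exit) and return the
    # indices of that exit layer plus the few layers preceding it to update;
    # all other layers stay frozen, so backpropagation stops early.
    exit_idx = random.choice(exit_layers)
    return range(max(0, exit_idx - window), exit_idx + 1)

def vote_next_token(exit_logits: list[torch.Tensor]) -> int:
    # Inference: softmax each exit's logits for the current decoding step and
    # emit the token with the single highest probability across all exits.
    best_prob, best_token = -1.0, 0
    for logits in exit_logits:
        probs = torch.softmax(logits, dim=-1)
        prob, token = probs.max(dim=-1)
        if prob.item() > best_prob:
            best_prob, best_token = prob.item(), int(token)
    return best_token

Other merge rules (e.g., averaging probabilities across exits) would slot into the same place; max-probability selection is one straightforward reading of "selecting the token with the highest output probability".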


LUC and adaptive layer tuning introduce irregular computation patterns, which hinder the theoretical reduction in computation overhead from translating into real-device efficiency improvements. To address this, we further develop a hardware scheduling module that searches for the best dataflow and offloading strategy, thereby converting the theoretical computation savings into actual on-device performance gains.
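As a loose illustration of what such a search can look like, the sketch below exhaustively scores a tiny hypothetical space of tile sizes and offloading choices against a toy latency model. The real module's search space and cost model are not specified here, so every name and constant is an assumption.

from itertools import product

TILE_SIZES = [16, 32, 64]                             # hypothetical tiling options
OFFLOAD_CHOICES = ("none", "weights", "activations")  # hypothetical offload targets

def estimated_latency(tile: int, offload: str, bits: int, sparsity: float) -> float:
    # Toy cost model: compute time scales with bit-width and density,
    # while data-movement time depends on what is offloaded to host memory.
    compute = bits * (1.0 - sparsity) / tile
    transfer = {"none": 0.0, "weights": 1.0, "activations": 0.5}[offload]
    return compute + transfer

def best_schedule(bits: int, sparsity: float) -> tuple[int, str]:
    # Enumerate every (tile, offload) pair and keep the cheapest one.
    return min(product(TILE_SIZES, OFFLOAD_CHOICES),
               key=lambda cfg: estimated_latency(cfg[0], cfg[1], bits, sparsity))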

Evaluation



Evaluation results on the MMLU dataset with the LLaMA-7B model show that our proposed Edge-LLM achieves better accuracy under a similar computation overhead, together with an approximately 4× reduction in memory and an overall speedup of 2.92× to 3.38× over the vanilla implementation.

Citation


@inproceedings{yu2024edgellm,
title={EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting},
author={Yu, Zhongzhi and Wang, Zheng and Li, Yuhan and You, Haoran and Gao, Ruijie and Zhou, Xiaoya and Bommu, Sreenidhi Reddy and Zhao, Yang (Katie) and Lin, Yingyan Celine},
booktitle={Design Automation Conference},
year={2024},
organization={ACM/IEEE}
}

Acknowledgements


This work was supported in part by CoCoSys, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, and the National Science Foundation (NSF) through the NSF CAREER funding (Award number: 2048183).

The website template was borrowed from Instant Neural Graphics Primitives.