EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting

*EIC Lab @ Georgia Institute of Technology
^University of Minnesota, Twin Cities
#University of California, Santa Barbara

Abstract


Efficient adaptation of large language models (LLMs) on edge devices is essential for applications requiring continuous and privacy-preserving adaptation and inference. However, existing tuning techniques fall short because of their high computation and memory overheads. To this end, we introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices. Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique to reduce the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme to reduce the memory overhead by reducing the backpropagation depth, and (3) a complementary hardware scheduling strategy to handle the irregular computation patterns introduced by LUC and adaptive layer tuning, thereby achieving efficient computation and data movements. Extensive experiments demonstrate that Edge-LLM achieves a 2.92× speedup and an approximately 4× memory overhead reduction compared to vanilla tuning methods, while maintaining comparable task accuracy.

The Growing Demand for Tuning LLMs on Edge Devices


Figures: Example applications of on-device LLM adaptation, including Healthcare, Personal Assistant, and Security.

The Cumbersome Size of LLMs Hinders Tuning on the Edge

Figures: Tuning an LLM requires tens of A100 GPU hours, and its memory footprint exceeds the available memory of most edge devices (e.g., 8 GB to 12 GB).

Edge-LLM Overview



Our proposed Edge-LLM algorithm framework features two key enablers, each addressing one of the aforementioned overheads. Specifically, we develop (a) a layer-wise unified compression (LUC) technique that further compresses the LLM backbone with limited impact on performance, thereby reducing the computation overhead during tuning, and (b) an adaptive layer tuning and voting technique that reduces the memory overhead by decreasing the backpropagation depth required to update the intermediate layers of the target LLM.


A motivating observation behind LUC is that different layers exhibit varying sensitivities to different compression techniques. In the figure above, we visualize the average mean squared error (MSE) between each layer's output before and after compression under different compression techniques and parameters. Based on these varying sensitivities, LUC generates a layer-wise compression policy in which more sensitive layers receive a lower compression ratio.
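To make the sensitivity-driven policy concrete, below is a minimal Python sketch of how such profiling could work for the quantization side alone. All names, the candidate bit-widths, and the thirds-based assignment rule are illustrative assumptions rather than the paper's exact procedure, which jointly covers pruning sparsity as well.

import torch

CANDIDATE_BITS = [2, 3, 4]  # hypothetical candidate bit-widths

def fake_quantize(weight: torch.Tensor, bits: int) -> torch.Tensor:
    # Uniform symmetric fake quantization of a weight tensor.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max().clamp_min(1e-8) / qmax
    return (weight / scale).round().clamp(-qmax - 1, qmax) * scale

def quantization_mse(weight: torch.Tensor, x: torch.Tensor, bits: int) -> float:
    # MSE between a linear layer's output before and after quantization.
    return torch.mean((x @ weight.T - x @ fake_quantize(weight, bits).T) ** 2).item()

def luc_quantization_policy(weights: dict[str, torch.Tensor],
                            calib_x: torch.Tensor) -> dict[str, int]:
    # Rank layers by their MSE at the most aggressive bit-width, then keep
    # the most sensitive third at 4 bits, the middle third at 3 bits, and
    # the least sensitive third at 2 bits.
    sens = {name: quantization_mse(w, calib_x, min(CANDIDATE_BITS))
            for name, w in weights.items()}
    ranked = sorted(sens, key=sens.get, reverse=True)
    third = max(1, len(ranked) // 3)
    return {name: 4 if i < third else (3 if i < 2 * third else 2)
            for i, name in enumerate(ranked)}

For instance, luc_quantization_policy({"q_proj": torch.randn(64, 64)}, torch.randn(8, 64)) returns a per-layer bit-width dictionary that a quantized tuning backend could consume.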


To reduce the backpropagation depth, we propose adaptive layer tuning and voting. First, we add skip connections between selected intermediate layers and the final classification layer; in each tuning iteration, we randomly pick one skip connection and update it together with a few of its preceding layers. After tuning, since each intermediate layer with a skip connection can generate a reasonable output, we further design a voting mechanism that merges the outputs of these intermediate layers, producing a calibrated output by selecting the token with the highest output probability during inference.
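The minimal sketch below shows both halves under stated assumptions: exit_layers holds the indices of the skip-connected layers, window is the number of preceding layers updated alongside a chosen exit, and exit_logits holds one per-step logit tensor per exit. None of these names come from the paper.

import random
import torch

def sample_update_window(exit_layers: list[int], window: int) -> range:
    # Tuning: randomly pick one skip connection (early exit) and return the
    # indices of that exit layer plus the few layers preceding it to update;
    # all other layers stay frozen, so backpropagation stops early.
    exit_idx = random.choice(exit_layers)
    return range(max(0, exit_idx - window), exit_idx + 1)

def vote_next_token(exit_logits: list[torch.Tensor]) -> int:
    # Inference: softmax each exit's logits for the current decoding step and
    # emit the token with the single highest probability across all exits.
    best_prob, best_token = -1.0, 0
    for logits in exit_logits:
        probs = torch.softmax(logits, dim=-1)
        prob, token = probs.max(dim=-1)
        if prob.item() > best_prob:
            best_prob, best_token = prob.item(), int(token)
    return best_token

Other merge rules (e.g., averaging probabilities across exits) would slot into the same place; max-probability selection is one straightforward reading of "selecting the token with the highest output probability".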


LUC and adaptive layer tuning introduce irregular computation patterns, which hinder the theoretical reduction in computation overhead from translating into real-device efficiency improvements. To address this, we further develop a hardware scheduling module that searches for the best dataflow and offloading strategy, thereby converting the theoretical computation savings into actual on-device performance gains.
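As a loose illustration of what such a search can look like, the sketch below exhaustively scores a tiny hypothetical space of tile sizes and offloading choices against a toy latency model. The real module's search space and cost model are not specified here, so every name and constant is an assumption.

from itertools import product

TILE_SIZES = [16, 32, 64]                             # hypothetical tiling options
OFFLOAD_CHOICES = ("none", "weights", "activations")  # hypothetical offload targets

def estimated_latency(tile: int, offload: str, bits: int, sparsity: float) -> float:
    # Toy cost model: compute time scales with bit-width and density,
    # while data-movement time depends on what is offloaded to host memory.
    compute = bits * (1.0 - sparsity) / tile
    transfer = {"none": 0.0, "weights": 1.0, "activations": 0.5}[offload]
    return compute + transfer

def best_schedule(bits: int, sparsity: float) -> tuple[int, str]:
    # Enumerate every (tile, offload) pair and keep the cheapest one.
    return min(product(TILE_SIZES, OFFLOAD_CHOICES),
               key=lambda cfg: estimated_latency(cfg[0], cfg[1], bits, sparsity))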

Evaluation



Evaluation results on the MMLU dataset with the LLaMA-7B model show that our proposed Edge-LLM achieves better accuracy under a similar computation overhead, together with an approximately 4× reduction in memory and an overall speedup of 2.92× to 3.38× over the vanilla implementation.

Citation


@inproceedings{yu2024edgellm,
title={EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting},
author={Yu, Zhongzhi and Wang, Zheng and Li, Yuhan and You, Haoran and Gao, Ruijie and Zhou, Xiaoya and Bommu, Sreenidhi Reddy and Zhao, Yang (Katie) and Lin, Yingyan Celine},
booktitle={Design Automation Conference},
year={2024},
organization={ACM/IEEE}
}

Acknowledgements


This work was supported in part by CoCoSys, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, and the National Science Foundation (NSF) through the NSF CAREER funding (Award number: 2048183).

The website template was borrowed from Instant Neural Graphics Primitives.