Hint-Aug: Drawing Hints from Foundation Vision Transformers towards Boosted Few-shot Parameter-Efficient Tuning

EIC Lab @ Georgia Institute of Technology

Abstract


Despite the growing demand for tuning foundation vision transformers (FViTs) on downstream tasks, fully unleashing FViTs' potential under data-limited scenarios (e.g., few-shot tuning) remains a challenge due to FViTs' data-hungry nature. Common data augmentation techniques fall short in this context due to the limited features contained in the few-shot tuning data. To tackle this challenge, we first identify an opportunity for FViTs in few-shot tuning: pretrained FViTs themselves have already learned highly representative features from large-scale pretraining data, which are fully preserved during widely used parameter-efficient tuning. We thus hypothesize that leveraging those learned features to augment the tuning data can boost the effectiveness of few-shot FViT tuning. To this end, we propose a framework called Hint-based Data Augmentation (Hint-Aug), which aims to boost FViTs in few-shot tuning by augmenting the over-fitted parts of tuning samples with the learned features of pretrained FViTs. Specifically, Hint-Aug integrates two key enablers: (1) an Attentive Over-fitting Detector (AOD) to detect over-confident patches of foundation ViTs for potentially alleviating their over-fitting on the few-shot tuning data and (2) a Confusion-based Feature Infusion (CFI) module to infuse easy-to-confuse features from the pretrained FViTs into the over-confident patches detected by the above AOD in order to enhance the feature diversity during tuning. Extensive experiments and ablation studies on five datasets and three parameter-efficient tuning techniques consistently validate Hint-Aug's effectiveness: 0.04% ~ 32.91% higher accuracy over the state-of-the-art (SOTA) data augmentation method under various low-shot settings. For example, on the Pet dataset, Hint-Aug achieves a 2.22% higher accuracy with 50% less training data over SOTA data augmentation methods.

Motivating Observations


Characteristics of Parameter-efficient Tuning



Overview of representative parameter-efficient tuning methods. Most existing parameter-efficient tuning methods apply learnable modules on top of the pretrained foundation vision transformer. By tuning only the added modules, foundation vision transformers can be adapted to new downstream tasks. This paradigm has two characteristics: (1) the pretrained backbone of the foundation vision transformer retains the generalizable features learned during pretraining, and (2) the original weights can be easily recovered by simply removing the added parameter-efficient tuning modules.
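
To make this paradigm concrete, below is a minimal PyTorch sketch of tuning a frozen pretrained block through a small added module; the Adapter and TunedBlock classes here are hypothetical illustrations, not the exact modules (e.g., Adapter, LoRA, or VPT) studied in the paper.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck module added on top of a frozen pretrained layer."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity so pretrained behavior is preserved
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class TunedBlock(nn.Module):
    """Wraps a frozen pretrained block with a learnable adapter."""
    def __init__(self, pretrained_block, dim):
        super().__init__()
        self.block = pretrained_block
        for p in self.block.parameters():
            p.requires_grad = False  # characteristic (1): the pretrained backbone stays frozen
        self.adapter = Adapter(dim)  # only these weights are tuned

    def forward(self, x):
        return self.adapter(self.block(x))

# Characteristic (2): the original weights are recovered by simply dropping
# the adapter and using tuned.block directly.
pretrained = nn.Linear(768, 768)  # stand-in for one block of a pretrained FViT
tuned = TunedBlock(pretrained, dim=768)
print([n for n, p in tuned.named_parameters() if p.requires_grad])  # only adapter.* parameters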

Attention Shift During Tuning




During tuning, the attention distribution shifts. In the earlier phase of tuning, the shift is marginal; in the later phase, however, the attention may shift to locations irrelevant to classification, and the reduced tuning accuracy at this stage further suggests that tuning is suffering from over-fitting. This motivates us to consider whether we can leverage the foundation vision transformer's response to the input as an indicator to guide the augmentation process.
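
As a rough illustration of how such an attention shift can be quantified, the snippet below compares the [CLS]-token attention of the pretrained foundation vision transformer with that of the tuned model on the same input; the KL-divergence metric is an assumed choice for illustration, not necessarily the exact measure used in the paper.

import torch

def attention_shift(attn_pretrained, attn_tuned, eps=1e-8):
    # Both inputs: (batch, num_patches) [CLS]-token attention over image patches,
    # already softmax-normalized. Returns the per-sample KL(pretrained || tuned);
    # larger values indicate that the tuned model's attention has drifted further
    # from the pretrained foundation vision transformer's.
    p = attn_pretrained.clamp_min(eps)
    q = attn_tuned.clamp_min(eps)
    return (p * (p / q).log()).sum(dim=-1)

# Toy example with random attention maps over a 14 x 14 patch grid.
pre = torch.softmax(torch.randn(4, 196), dim=-1)
post = torch.softmax(torch.randn(4, 196), dim=-1)
print(attention_shift(pre, post))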

The Proposed Hint-Aug Framework




Our proposed Hint-Aug framework. The core idea of Hint-Aug is to leverage the pretrained foundation vision transformer to guide the tuning process. It has two key enablers: (1) an Attentive Over-fitting Detector, which identifies potential over-fitting on the input sample by comparing the attention map generated by the pretrained foundation vision transformer with that of the parameter-efficiently tuned one, and (2) Confusion-based Feature Infusion, which uses an adversarial attack to infuse easy-to-confuse features into the selected patches of the input data.
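
A minimal sketch of how the two enablers fit together is given below, assuming a hypothetical helper get_cls_attn(model, x) that returns the [CLS]-token attention over patches; the patch-selection rule and the single-step FGSM-style perturbation are simplifications for illustration rather than the paper's exact AOD and CFI implementations.

import torch
import torch.nn.functional as F

def hint_aug_sketch(x, y, tuned_model, pretrained_model, get_cls_attn,
                    patch_size=16, epsilon=4 / 255):
    # --- (1) Attentive Over-fitting Detector (AOD) ---
    # Compare the pretrained FViT's attention with the tuned model's attention
    # and select the patch on which the two disagree the most.
    with torch.no_grad():
        attn_pre = get_cls_attn(pretrained_model, x)    # (B, num_patches)
        attn_tuned = get_cls_attn(tuned_model, x)       # (B, num_patches)
    patch_idx = (attn_tuned - attn_pre).abs().argmax(dim=-1)  # (B,)

    # --- (2) Confusion-based Feature Infusion (CFI) ---
    # Perturb only the selected patch toward an easy-to-confuse class using a
    # single FGSM-style step computed on the pretrained model.
    x_adv = x.clone().requires_grad_(True)
    logits = pretrained_model(x_adv)
    confusing = logits.detach().scatter(1, y[:, None], float("-inf")).argmax(dim=-1)
    loss = F.cross_entropy(logits, confusing)
    grad, = torch.autograd.grad(loss, x_adv)

    # Build a mask that is 1 only inside the selected patch of each image.
    B, _, H, W = x.shape
    mask = torch.zeros_like(x)
    patches_per_row = W // patch_size
    for b in range(B):
        r = (patch_idx[b].item() // patches_per_row) * patch_size
        c = (patch_idx[b].item() % patches_per_row) * patch_size
        mask[b, :, r:r + patch_size, c:c + patch_size] = 1.0

    # Descend the loss for the confusing class, i.e., push the selected patch's
    # features toward the easy-to-confuse class.
    return (x - epsilon * grad.sign() * mask).detach()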

Evaluation



We benchmark Hint-Aug against the NPS and no-augmentation baselines on five datasets, three tuning methods, and six different few-shot settings. We observe that Hint-Aug achieves +0.04% ~ +32.91% higher accuracy than the state-of-the-art baseline methods across different shots, tuning methods, and datasets.

Citation


@inproceedings{yu2023hint-aug,
    author    = {Zhongzhi Yu and Shang Wu and Yonggan Fu and Shunyao Zhang and Yingyan (Celine) Lin},
    title     = {Hint-Aug: Drawing Hints from Foundation Vision Transformers towards Boosted Few-shot Parameter-Efficient Tuning},
    booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2023},
    publisher = {IEEE/CVF},
    address   = {Vancouver, Canada}
}

Acknowledgements


This work was supported by the National Science Foundation (NSF) through the NSF CCF program (Award number: 2211815) and in part by CoCoSys, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

The website template was borrowed from Instant Neural Graphics Primitives.