EleViT: exploiting element-wise products for designing efficient and lightweight vision transformers
Uzair Shah, Jens Schneider, Giovanni Pintore, Enrico Gobbetti, Mahmood Alzubaidi, Mowafa Househ, and Marco Agus
2024
Abstract
We introduce EleViT, a novel vision transformer optimized for image processing tasks. In line with the trend towards sustainable computing, EleViT addresses the need for lightweight and fast models without compromising performance by redefining the multihead attention mechanism to primarily use element-wise products instead of traditional matrix multiplications. This modification preserves attention capabilities while enabling multiple multihead attention blocks within a convolutional projection framework, resulting in a model with fewer parameters and improved efficiency in training and inference, especially on moderately complex datasets. Benchmarks against state-of-the-art vision transformers showcase competitive performance on low-data-regime datasets such as CIFAR-10, CIFAR-100, and Tiny-ImageNet-200.
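To illustrate the efficiency argument in the abstract, the following is a minimal sketch contrasting standard scaled dot-product attention with one plausible element-wise alternative. This is not the authors' actual EleViT implementation; the gating formulation below is an assumption made purely to show why replacing the N×N matrix product with element-wise products reduces the cost of the score computation from O(N²d) to O(Nd).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 16          # number of tokens, head dimension
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Standard attention: builds a full N x N score matrix via matrix
# multiplication, then mixes the values -- O(N^2 d) score cost.
scores = Q @ K.T / np.sqrt(d)            # shape (N, N)
out_standard = softmax(scores) @ V        # shape (N, d)

# Hypothetical element-wise variant: a per-token, per-channel gate
# from the element-wise product Q * K modulates V directly -- O(N d)
# score cost, no N x N matrix ever materialized.
gate = softmax(Q * K / np.sqrt(d), axis=-1)   # shape (N, d)
out_elementwise = gate * V                     # shape (N, d)

print(out_standard.shape, out_elementwise.shape)  # (8, 16) (8, 16)
```

Both variants produce an output of the same shape as the input tokens, which is why such a substitution can be dropped into a transformer block; the paper itself should be consulted for the exact formulation EleViT uses.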
Reference and download information
Uzair Shah, Jens Schneider, Giovanni Pintore, Enrico Gobbetti, Mahmood Alzubaidi, Mowafa Househ, and Marco Agus. EleViT: exploiting element-wise products for designing efficient and lightweight vision transformers. In Proc. T4V - IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024. To appear.
Bibtex citation record
@inproceedings{Shah:2024:EEE,
  author    = {Uzair Shah and Jens Schneider and Giovanni Pintore and Enrico Gobbetti and Mahmood Alzubaidi and Mowafa Househ and Marco Agus},
  title     = {{EleViT}: exploiting element-wise products for designing efficient and lightweight vision transformers},
  booktitle = {Proc. T4V - IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year      = {2024},
  abstract  = {We introduce EleViT, a novel vision transformer optimized for image processing tasks. Aligning with the trend towards sustainable computing, EleViT addresses the need for lightweight and fast models without compromising performance by redefining the multihead attention mechanism by primarily using element-wise products instead of traditional matrix multiplication. This modification preserves attention capabilities, while enabling multiple multihead attention blocks within a convolutional projection framework, resulting in a model with fewer parameters and improved efficiency in training and inference, especially for moderately complex datasets. Benchmarks against state-of-the-art vision transformers showcase competitive performance on low-data regime datasets like CIFAR-10, CIFAR-100, and Tiny-ImageNet-200.},
  note      = {To appear},
  url       = {http://vic.crs4.it/vic/cgi-bin/bib-page.cgi?id='Shah:2024:EEE'},
}
The publications listed here are included as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Please contact the authors if you are willing to republish this work in a book, journal, on the Web or elsewhere. Thank you in advance.