Scaling White-Box Transformers for Vision

Jinrui Yang*1, Xianhang Li*1, Druv Pai2, Yuyin Zhou1, Yi Ma2, Yaodong Yu†2, Cihang Xie†1

* equal contribution, † equal advising

1UC Santa Cruz, 2UC Berkeley

Framework of CRATE-α

alt text

One layer of the CRATE-α model architecture. MSSA (Multi-head Subspace Self-Attention) represents the compression block, and ODL (Overcomplete Dictionary Learning) represents the sparse coding block.


CRATE, a white-box transformer architecture designed to learn compressed and sparse representations, offers an intriguing alternative to standard vision transformers (ViTs) due to its inherent mathematical interpretability. Despite extensive investigations into the scaling behaviors of language and vision transformers, the scalability of CRATE remains an open question which this paper aims to address. Specifically, we propose CRATE-α, featuring strategic yet minimal modifications to the sparse coding block in the CRATE architecture design, and a light training recipe designed to improve the scalability of CRATE. Through extensive experiments, we demonstrate that CRATE-α can effectively scale with larger model sizes and datasets. For example, our CRATE-α-B substantially outperforms the prior best CRATE-B model accuracy on ImageNet classification by 3.7%, achieving an accuracy of 83.2%. Meanwhile, when scaling further, our CRATE-α-L obtains an ImageNet classification accuracy of 85.1%. More notably, these model performance improvements are achieved while preserving, and potentially even enhancing the interpretability of learned CRATE models, as we demonstrate through showing that the learned token representations of increasingly larger trained CRATE-α models yield increasingly higher-quality unsupervised object segmentation of images. The project page is

Comparison of CRATE, CRATE-α, and ViT

alt text

Left: We demonstrate how modifications to the components enhance the performance of the CRATE model on ImageNet-1K. Right: We compare the FLOPs and accuracy on ImageNet-1K of our methods with ViT Dosovitskiy et al., 2020 and CRATE Yu et al., 2023. CRATE is trained only on ImageNet-1K, while ours and ViT are pre-trained on ImageNet-21K.

Visualize the Improvement of Semantic Interpretability of CRATE-α

alt text

Visualization of segmentation on COCO val2017 Lin et al., 2014 with MaskCut Wang et al., 2023. Top row: Supervised ours effectively identifies the main objects in the image. Compared with CRATE (Middle row), ours achieves better segmentation performance in terms of boundary. Bottom row: Supervised ViT fails to identify the main objects in most images. We mark failed images with Red Box.