Title: Harnessing Attention Weight Tables for Computationally Efficient Multiple Object Tracking with Transformers
Authors: Xing, Hexu; Braun, Torsten (ORCID: 0000-0001-5968-7108)
Dates: 2026-02-02; 2026-02-02
Year: 2026
URL: https://boris-portal.unibe.ch/handle/20.500.12422/230903
DOI: 10.48620/94261
Type: article
Language: en

Abstract: Transformer-based architectures have introduced end-to-end solutions for Multiple Object Tracking (MOT), seamlessly integrating object detection and association. However, their high computational demands, such as the need for feature map fusion across multiple frames, pose significant challenges to real-time deployment, limiting their practicality. In this paper, we present WT-MOT (Weight Table-based Multiple Object Tracking), a novel framework that addresses these limitations by leveraging the underutilized potential of attention weight tables for efficient object similarity evaluation. WT-MOT employs self- and cross-attention mechanisms to assess object similarity and directly assign identifications, integrating spatial, appearance, and temporal dimensions. By introducing the "frame embedding" concept, WT-MOT enhances the ability to distinguish objects across frames without relying on motion models or post-processing steps. Experimental results on the MOT17 and MOT20 benchmarks demonstrate the effectiveness of WT-MOT, achieving MOTA scores of 76.9% and 73.2%, respectively, setting new performance standards for Transformer-based MOT solutions. These findings highlight WT-MOT as a computationally efficient and robust tool for real-time MOT applications, paving the way for broader adoption of Transformer-based tracking methods in practical environments.
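
Illustrative sketch (not the authors' implementation): the abstract describes reading object similarity directly from an attention weight table and assigning identities from it. The snippet below shows one way that idea could look in its simplest form, assuming each object is already represented by an embedding vector. The function names `attention_weight_table` and `assign_ids`, the `min_weight` threshold, and the greedy matching rule are all hypothetical choices made for this example; the frame embedding mentioned in the abstract is not modeled here.

```python
import numpy as np


def attention_weight_table(queries: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention weights.

    queries: (num_detections, dim) embeddings of current-frame detections.
    keys:    (num_tracks, dim) embeddings of existing tracks (at least one).
    Returns a (num_detections, num_tracks) table whose rows sum to 1.
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


def assign_ids(weights: np.ndarray, track_ids: list, min_weight: float = 0.5) -> list:
    """Greedy ID assignment from the weight table (an assumed rule).

    Each detection takes the ID of its highest-weight track if that weight
    reaches min_weight and the track is still free; otherwise it gets a new ID.
    """
    next_id = max(track_ids, default=-1) + 1
    taken = set()
    assigned = []
    for row in weights:
        best = int(np.argmax(row))
        if row[best] >= min_weight and best not in taken:
            taken.add(best)
            assigned.append(track_ids[best])
        else:
            assigned.append(next_id)
            next_id += 1
    return assigned


# Toy data: two existing tracks (IDs 7 and 9) and three new detections.
track_embeddings = np.array([[4.0, 0.0, 0.0, 0.0],   # track 7
                             [0.0, 4.0, 0.0, 0.0]])  # track 9
detections = np.array([[0.0, 4.0, 0.0, 0.0],   # resembles track 9
                       [4.0, 0.0, 0.0, 0.0],   # resembles track 7
                       [0.0, 0.0, 4.0, 0.0]])  # resembles neither

table = attention_weight_table(detections, track_embeddings)
ids = assign_ids(table, [7, 9], min_weight=0.6)
print(ids)  # [9, 7, 10]
```

The third detection attends equally to both tracks (weight 0.5 each), falls below the assumed threshold, and therefore receives a fresh ID. A real tracker would need a more careful matching rule than this greedy pass, which depends on the order of the detections.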