BUOY posted on 2025-3-26 23:20:25

Learning Linguistic Association Towards Efficient Text-Video Retrieval
...llation strategy, which allows the student model to adaptively learn the knowledge from the teacher model. This strategy also suppresses the spurious relations introduced during the linguistic association. Extensive experiments demonstrate the effectiveness and efficiency of LINAS with various basel...

mediocrity posted on 2025-3-27 03:15:16

http://reply.papertrans.cn/24/2343/234269/234269_32.png

阴郁 posted on 2025-3-27 05:51:43

Learning Disentanglement with Decoupled Labels for Vision-Language Navigation
...ne-grained labels, we design a Disentangled Decoding Module to guide discriminative feature extraction and help align the modalities. To show the generality of our proposed method, we apply it to an LSTM-based model and two recent Transformer-based models. Extensive experiments on two VLN...

Encapsulate posted on 2025-3-27 10:02:39

Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
...nswering, image-text retrieval and referring expression comprehension experiments. Results confirm that, whereas alternative architectures including ViLBERT and UNITER may excel in particular tasks, Switch-BERT can consistently achieve performance better than or comparable to the current state-of-the-...

calumniate posted on 2025-3-27 16:39:31

http://reply.papertrans.cn/24/2343/234269/234269_35.png

过滤 posted on 2025-3-27 20:20:57

Video Question Answering with Iterative Video-Text Co-tokenization
...to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA, outperforming the previous state-of-the-art by large margins. Simultaneously, our model reduces the required GFLOPs from 150–360 to only 67, producing a highly efficient video question answering model (Code: .).

圆锥体 posted on 2025-3-27 22:45:16

Conference proceedings 2022
...ning; object recognition; image classification; image processing; object detection; semantic segmentation; human pose estimation; 3d reconstruction; stereo vision; computational photography; neural networks; image coding; image reconstruction; motion estimation.

变异 posted on 2025-3-28 04:21:58

http://reply.papertrans.cn/24/2343/234269/234269_38.png

DEFER posted on 2025-3-28 06:53:13

Studies in Contemporary Economics
This simple yet effective architecture of X-DETR shows good accuracy and fast speeds for multiple instance-wise vision-language tasks, e.g., 16.4 AP on LVIS detection of 1.2K categories at ~20 frames per second without using any LVIS annotation during training. The code is available at

欲望小妹 posted on 2025-3-28 14:06:03

X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
This simple yet effective architecture of X-DETR shows good accuracy and fast speeds for multiple instance-wise vision-language tasks, e.g., 16.4 AP on LVIS detection of 1.2K categories at ~20 frames per second without using any LVIS annotation during training. The code is available at
View full version: Titlebook: Computer Vision – ECCV 2022; 17th European Confer Shai Avidan, Gabriel Brostow, Tal Hassner Conference proceedings 2022 The Editor(s) (if app