突袭 posted on 2025-3-30 11:55:39
Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
while humans can say “I don't know” when they are uncertain (i.e., abstain from answering a question), such ability has been largely neglected in multimodal research, despite the importance of this problem to the usage of VQA in real settings. In this work, we promote a problem formulation for reliable VQA, where we prefer abste
初学者 posted on 2025-3-30 16:08:32
http://reply.papertrans.cn/24/2343/234269/234269_52.png
Flu表流动 posted on 2025-3-30 17:15:27
http://reply.papertrans.cn/24/2343/234269/234269_53.png
阻碍 posted on 2025-3-30 21:59:21
http://reply.papertrans.cn/24/2343/234269/234269_54.png
抛媚眼 posted on 2025-3-31 03:52:25
http://reply.papertrans.cn/24/2343/234269/234269_55.png
HEPA-filter posted on 2025-3-31 06:36:23
Contrastive Vision-Language Pre-training with Limited Resources
arning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevent researchers with limited resources from reproduction and further exploration. To this end, we propose a stack of novel methods, which significa
Root494 posted on 2025-3-31 10:44:18
http://reply.papertrans.cn/24/2343/234269/234269_57.png
MEET posted on 2025-3-31 14:34:33
http://reply.papertrans.cn/24/2343/234269/234269_58.png
凹处 posted on 2025-3-31 20:40:34
X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
d of the whole image. To address these tasks, we propose X-DETR, whose architecture has three major components: an object detector, a language encoder, and vision-language alignment. The vision and language streams are independent until the end and they are aligned using an efficient dot-product ope
Hemiplegia posted on 2025-3-31 22:52:18
Learning Disentanglement with Decoupled Labels for Vision-Language Navigation
rld navigation. Intuitively, we find that instruction disentanglement for each viewpoint along the agent’s path is critical for accurate navigation. However, most methods only utilize the whole complex instruction or inaccurate sub-instructions due to the lack of accurate disentanglement as an inter