外科医生 posted on 2025-4-1 04:16:19
BRAVE: Broadening the Visual Encoding of Vision-Language Models
... representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and ha...