强行引入 posted on 2025-3-25 12:46:46
nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding
…-Occ, a novel method that encodes occupancy data into a compact latent feature space using a VQ-VAE. This approach recasts semantic occupancy prediction as feature simulation in the VQ latent space, making it simpler and more memory-efficient. Our method enables direct generation of semantic occupancy…
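To make the idea concrete, here is a minimal sketch of compressing a semantic occupancy grid into a discrete VQ latent space with a 3D VQ-VAE. All module sizes, layer choices, and names are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch: 3D VQ-VAE over a one-hot semantic occupancy grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):  # z: (B, C, X, Y, Z)
        B, C, X, Y, Z = z.shape
        flat = z.permute(0, 2, 3, 4, 1).reshape(-1, C)        # (N, C)
        # Nearest codebook entry by squared L2 distance.
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                                  # (N,)
        zq = self.codebook(idx).view(B, X, Y, Z, C).permute(0, 4, 1, 2, 3)
        # Straight-through estimator so gradients reach the encoder.
        zq = z + (zq - z).detach()
        return zq, idx.view(B, X, Y, Z)

class OccVQVAE(nn.Module):
    def __init__(self, num_classes=17, code_dim=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv3d(num_classes, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, code_dim, 4, stride=2, padding=1),
        )
        self.vq = VectorQuantizer(code_dim=code_dim)
        self.dec = nn.Sequential(
            nn.ConvTranspose3d(code_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, occ_onehot):            # (B, num_classes, X, Y, Z)
        zq, codes = self.vq(self.enc(occ_onehot))
        return self.dec(zq), codes            # logits + discrete latent grid

occ = F.one_hot(torch.randint(0, 17, (1, 32, 32, 8)), 17)
occ = occ.permute(0, 4, 1, 2, 3).float()
logits, codes = OccVQVAE()(occ)
print(logits.shape, codes.shape)
```

Once such a codebook is trained, "prediction" can operate on the small grid of discrete codes rather than the full-resolution voxel volume, which is where the memory savings described above come from.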
火光在摇曳 posted on 2025-3-25 22:48:48
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
…multi-modal pre-training dataset PiTe-143k, which provides pixel-level moving trajectories for every object that both appears in the video and is mentioned in the caption, produced by our automatic annotation pipeline. Meanwhile, PiTe demonstrates astounding capabilities on a myriad of video-related m…
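As an illustration of what such an annotation might contain, here is a hedged sketch of a per-sample record: one pixel-level trajectory per object that appears in the video and is mentioned in the caption. The field names and layout are assumptions, not the released PiTe-143k schema.

```python
# Illustrative data schema only; not the actual PiTe-143k format.
from dataclasses import dataclass

@dataclass
class ObjectTrajectory:
    phrase: str                               # caption span naming the object
    char_span: tuple[int, int]                # phrase location in the caption
    points: list[tuple[int, float, float]]    # (frame_idx, x, y) per frame

@dataclass
class PiTeSample:
    video_path: str
    caption: str
    trajectories: list[ObjectTrajectory]

sample = PiTeSample(
    video_path="clip_000123.mp4",
    caption="A dog chases a ball across the lawn.",
    trajectories=[
        ObjectTrajectory("dog", (2, 5), [(0, 120.0, 88.0), (1, 131.5, 90.2)]),
        ObjectTrajectory("ball", (15, 19), [(0, 300.0, 96.0), (1, 288.4, 97.1)]),
    ],
)
print(len(sample.trajectories), "annotated objects")
```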
Gastric posted on 2025-3-26 05:49:20
FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models
…ency signals for editing. Leveraging this insight, we introduce a novel fine-tuning-free approach that employs progressive frequency truncation to refine the guidance of diffusion models for universal editing tasks (FreeDiff). Our method achieves results comparable to state-of-the-art methods across a variety…
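A minimal sketch of the core operation, frequency truncation of a guidance signal, is below: low-pass filtering in Fourier space with a cutoff that changes over denoising steps. The schedule values and where exactly the filter sits inside a real diffusion pipeline are assumptions for illustration, not FreeDiff's actual implementation.

```python
# Hedged sketch: progressive low-pass truncation of a guidance tensor.
import torch

def freq_truncate(x: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Zero out high-frequency components of x with shape (B, C, H, W)."""
    B, C, H, W = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    # Radial mask: keep frequencies within keep_ratio of the max radius.
    yy, xx = torch.meshgrid(
        torch.arange(H, device=x.device) - H // 2,
        torch.arange(W, device=x.device) - W // 2,
        indexing="ij",
    )
    radius = (yy.float() ** 2 + xx.float() ** 2).sqrt()
    mask = (radius <= keep_ratio * radius.max()).to(spec.dtype)
    spec = spec * mask
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

# Assumed progressive schedule: early (noisy) steps keep only coarse
# frequencies; later steps admit progressively finer detail.
guidance = torch.randn(1, 4, 64, 64)   # stand-in for a guidance latent
for step, keep in enumerate([0.2, 0.4, 0.6, 0.8, 1.0]):
    filtered = freq_truncate(guidance, keep)
    print(step, keep, filtered.abs().mean().item())
```

Because the filter only edits the guidance tensor, no fine-tuning of the diffusion model itself is required, consistent with the "fine-tuning-free" claim in the abstract.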
固执点好 posted on 2025-3-26 18:53:51
Text-Guided Video Masked Autoencoder
…tion, we next introduce a unified framework for joint MAE and masked video-text contrastive learning. We show that, across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE on a variety of video recognition tasks…
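A hedged sketch of what such a joint objective could look like is below: a masked-patch reconstruction loss combined with a symmetric InfoNCE loss between masked-video and caption embeddings. Encoders, dimensions, and the loss weighting are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch: joint MAE + masked video-text contrastive objective.
import torch
import torch.nn.functional as F

def joint_mae_contrastive_loss(
    recon, target, mask,            # MAE branch: predicted/true patch pixels
    video_emb, text_emb,            # contrastive branch: (B, D) embeddings
    temperature=0.07, lam=1.0,
):
    # 1) MAE loss: reconstruct only the masked video patches.
    mae = ((recon - target) ** 2).mean(dim=-1)         # (B, N) per-patch MSE
    mae = (mae * mask).sum() / mask.sum().clamp(min=1)

    # 2) Masked video-text contrastive loss (symmetric InfoNCE), where the
    #    video embedding is computed from the masked video.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                   # (B, B) similarities
    labels = torch.arange(v.size(0), device=v.device)
    nce = 0.5 * (F.cross_entropy(logits, labels)
                 + F.cross_entropy(logits.t(), labels))
    return mae + lam * nce

B, N, P, D = 4, 196, 768, 512
loss = joint_mae_contrastive_loss(
    torch.randn(B, N, P), torch.randn(B, N, P),
    (torch.rand(B, N) > 0.25).float(),                 # 1 = masked patch
    torch.randn(B, D), torch.randn(B, D),
)
print(loss.item())
```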