领带 posted on 2025-3-28 14:38:56
Glaci冰 posted on 2025-3-28 21:17:50
Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Mult…
…performance for SpMM is challenging due to the irregular distribution of non-zero elements and irregular memory access patterns. Therefore, several sparse matrix reordering algorithms have been developed to improve data locality for SpMM. However, existing approaches for reordering sparse matrices have not con…
CYT posted on 2025-3-29 02:30:30
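A note on the SpMM reordering excerpt above. Tensor cores execute small dense matrix multiply-accumulate operations, so one plausible proxy for what a reordering buys is the number of non-empty dense tiles the kernel must process. The Python sketch below counts such tiles and applies a simple lexicographic row sort. The names `count_nonzero_blocks` and `lexicographic_reorder`, the tile sizes, and the sorting rule are all hypothetical. This is not the paper's algorithm, which the excerpt does not describe.

```python
def count_nonzero_blocks(rows, row_window, col_block):
    """Count row_window x col_block tiles that contain at least one non-zero.

    rows: list of sets, where rows[i] holds the column indices of row i's non-zeros.
    """
    total = 0
    for start in range(0, len(rows), row_window):
        tiles = set()
        for cols in rows[start:start + row_window]:
            tiles.update(c // col_block for c in cols)  # index of the column tile
        total += len(tiles)
    return total


def lexicographic_reorder(rows):
    """Hypothetical reordering: sort rows by their sorted column pattern so that
    rows with identical or similar patterns become neighbours. Returns a permutation."""
    return sorted(range(len(rows)), key=lambda i: sorted(rows[i]))


rows = [{0, 1}, {8, 9}, {0, 1}, {8, 9}]
perm = lexicographic_reorder(rows)
reordered = [rows[i] for i in perm]
print(count_nonzero_blocks(rows, 2, 2), count_nonzero_blocks(reordered, 2, 2))  # 4 2
```

With rows alternating between two column patterns, grouping identical patterns halves the number of non-empty 2×2 tiles, from 4 to 2.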
Reduced-Precision and Reduced-Exponent Formats for Accelerating Adaptive Precision Sparse Matrix–Vec…
…adaptive precision algorithms dynamically adapt, at runtime, the precisions used for different variables or operations. For example, Graillat et al. (2023) proposed an adaptive precision sparse matrix–vector product (SpMV) that stores the matrix elements in a precision inversely proportional to…
graphy posted on 2025-3-29 04:27:08
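The adaptive precision SpMV excerpt above breaks off before stating the selection rule. The Python sketch below assumes a per-element criterion of the form |a_ij|·u ≤ ε·‖A‖, where u is the unit roundoff of a storage format and ‖A‖ is taken as the Frobenius norm. Under that rule, smaller elements land in lower precision. Several pieces are assumptions for illustration: the criterion itself, the choice of norm, the three formats, and the helper names. Storage in fp16 and fp32 is emulated by rounding through Python's `struct` module, and all products accumulate in fp64.

```python
import math
import struct

# Candidate storage formats, lowest precision first:
# (name, unit roundoff u, largest finite magnitude, rounding function).
FORMATS = [
    ("fp16", 2.0 ** -11, 65504.0, lambda v: struct.unpack("e", struct.pack("e", v))[0]),
    ("fp32", 2.0 ** -24, 3.4028234663852886e38, lambda v: struct.unpack("f", struct.pack("f", v))[0]),
]


def split_by_precision(entries, eps):
    """Assign each (row, col, value) entry to the lowest precision whose rounding
    error |value| * u stays within eps * ||A||_F (assumed criterion)."""
    norm = math.sqrt(sum(v * v for _, _, v in entries))
    buckets = {"fp16": [], "fp32": [], "fp64": []}
    for i, j, v in entries:
        for name, u, max_abs, rnd in FORMATS:
            if abs(v) * u <= eps * norm and abs(v) <= max_abs:
                buckets[name].append((i, j, rnd(v)))  # store the rounded value
                break
        else:
            buckets["fp64"].append((i, j, v))  # fallback: keep full double precision
    return buckets


def spmv(buckets, x, n_rows):
    """y = A x, reading every bucket and accumulating in double precision."""
    y = [0.0] * n_rows
    for triples in buckets.values():
        for i, j, v in triples:
            y[i] += v * x[j]
    return y


entries = [(0, 0, 1.0), (0, 1, 1e-3), (1, 1, 1e-6), (1, 0, 0.5)]
buckets = split_by_precision(entries, eps=1e-4)
print({name: len(t) for name, t in buckets.items()})  # {'fp16': 2, 'fp32': 2, 'fp64': 0}
```

Note that ε = 0 sends every non-zero element to fp64, which recovers the ordinary SpMV.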
Mixed Precision Randomized Low-Rank Approximation with GPU Tensor Cores
…investigate the design and development of such methods capable of exploiting recent mixed precision accelerators, such as GPUs equipped with tensor core units. We combine three new ideas to exploit mixed precision arithmetic in randomized LRA. The first is to perform the matrix multiplication with mixed pre…
亲属 posted on 2025-3-29 08:35:09
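For the mixed precision randomized LRA excerpt above, the standard randomized range-finder skeleton can be written in NumPy: sketch Y = AΩ, orthonormalize, project, then truncate. The excerpt is cut off before saying how the mixed precision product is organized, so this sketch makes its own choices:

- it rounds the inputs to fp16 and accumulates in float32, emulated in software;
- it keeps the QR and the small SVD in float64;
- all function names are hypothetical.

It is not the paper's GPU implementation.

```python
import numpy as np


def fp16_inputs(x):
    """Round to IEEE half precision, then widen to float32 so the product accumulates
    in fp32 (a software stand-in for fp16-input / fp32-accumulate products)."""
    return x.astype(np.float16).astype(np.float32)


def mixed_precision_rlra(A, rank, oversample=10, seed=0):
    """Randomized rank-`rank` approximation A ~= U @ diag(s) @ Vt (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((A.shape[1], rank + oversample))
    # 1) Sketch the range of A with a mixed precision product.
    Y = fp16_inputs(A) @ fp16_inputs(omega)
    # 2) Orthonormal basis of the sketch, computed in float64.
    Q, _ = np.linalg.qr(Y.astype(np.float64))
    # 3) Project A onto the basis, again with fp16-rounded inputs.
    B = (fp16_inputs(Q.T) @ fp16_inputs(A)).astype(np.float64)
    # 4) Truncate using the SVD of the small matrix B.
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :rank], s[:rank], Vt[:rank]


rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))  # rank 5 by construction
U, s, Vt = mixed_precision_rlra(A, rank=5)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print(f"relative error: {rel_err:.1e}")  # expected to be small, governed by fp16 rounding
```

Because A has exact rank 5, the remaining error comes from rounding rather than truncation.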
Glossy posted on 2025-3-29 13:12:25
Minimizing I/O in Toom-Cook Algorithms
…integer multiplication algorithms frequently used in many applications, particularly for small … sizes (2, 3, and 4). Previous studies focus on minimizing Toom-Cook’s arithmetic cost, sometimes at the expense of asymptotically higher communication costs and memory footprint. For many high-performance…
hypertension posted on 2025-3-29 18:36:30
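For the Toom-Cook excerpt above: Karatsuba's algorithm is the two-way (Toom-2) member of the Toom-Cook family. It splits each operand into two halves and effectively evaluates the resulting linear polynomials at 0, 1 and ∞. It then recombines three half-size products instead of four. The Python sketch below shows only this arithmetic structure, for non-negative integers. The `cutoff_bits` value is an arbitrary assumption, and the sketch does nothing to reduce I/O, which is the paper's actual concern.

```python
def karatsuba(x: int, y: int, cutoff_bits: int = 64) -> int:
    """Multiply non-negative integers with Toom-2 (Karatsuba): 3 half-size products, not 4."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y  # base case: native multiplication
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask  # x = x1 * 2**half + x0
    y1, y0 = y >> half, y & mask
    z0 = karatsuba(x0, y0, cutoff_bits)                      # product at t = 0
    z2 = karatsuba(x1, y1, cutoff_bits)                      # product at t = infinity
    z1 = karatsuba(x0 + x1, y0 + y1, cutoff_bits) - z0 - z2  # product at t = 1, minus the others
    # Recombine: x*y = z2 * 2**(2*half) + z1 * 2**half + z0
    return (z2 << (2 * half)) + (z1 << half) + z0


a, b = 3 ** 500, 7 ** 300
print(karatsuba(a, b) == a * b)  # True
```

Higher Toom-k variants split into k parts and evaluate at 2k - 1 points. The evaluation and interpolation steps are where the extra memory traffic comes from.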
GPU-Accelerated BFS for Dynamic Networks
…the electronic design automation (EDA) field to social network analysis. Many contemporary real-world networks are dynamic and evolve rapidly over time. In such cases, recomputing the BFS from scratch after each graph modification becomes impractical. While parallel solutions, particularly for GPUs,…
火光在摇曳 posted on 2025-3-29 21:28:55
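For the dynamic BFS excerpt above: inserting an edge u→v can only shorten distances, and only for vertices reached through v. This is why an incremental update can be cheaper than recomputing from scratch. The sequential Python sketch below repairs BFS levels after an insertion by relaxing outward from v. It handles insertions only, since deletions can lengthen distances and need a different repair. The helper names are hypothetical, and this is not the paper's GPU algorithm.

```python
from collections import deque

INF = float("inf")


def bfs(adj, source):
    """Plain BFS levels from source; adj maps vertex -> list of out-neighbours."""
    dist = {v: INF for v in adj}
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] == INF:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def insert_edge(adj, dist, u, v):
    """Add directed edge u->v and repair BFS levels in place (insertions only)."""
    adj[u].append(v)
    if dist[u] + 1 >= dist[v]:
        return  # the new edge shortens no path
    dist[v] = dist[u] + 1
    queue = deque([v])
    while queue:  # FIFO order assigns improved levels in nondecreasing order
        x = queue.popleft()
        for w in adj[x]:
            if dist[x] + 1 < dist[w]:
                dist[w] = dist[x] + 1
                queue.append(w)


adj = {0: [1], 1: [2], 2: [3], 3: [], 4: []}
dist = bfs(adj, 0)
insert_edge(adj, dist, 0, 3)  # shortcut: vertex 3 moves from level 3 to level 1
insert_edge(adj, dist, 3, 4)  # vertex 4 becomes reachable at level 2
print(dist == bfs(adj, 0))    # True
```

The repair touches only the vertices whose level actually improves, rather than the whole graph.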
QClique: Optimizing Performance and Accuracy in Maximum Weighted Clique
…search-based MWC algorithms and show that high-accuracy weighted cliques can be discovered in the early stages of execution if the combinatorial space is searched systematically. Based on this observation, we introduce QClique, an approximate MWC algorithm that processes the searc…
xanthelasma posted on 2025-3-30 01:03:14
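The QClique excerpt above is cut off before describing how the search is processed. As a generic illustration of its observation, the Python sketch below is an anytime branch-and-bound MWC search with three features:

- it expands the heaviest candidates first;
- it prunes when the remaining candidate weight cannot beat the incumbent;
- it can stop after a node `budget` while returning the best clique found so far.

The ordering, the bound, and the `budget` parameter are assumptions, not QClique's design.

```python
def max_weight_clique(adj, weight, budget=None):
    """Anytime branch-and-bound search for a maximum weighted clique.

    adj: dict vertex -> set of neighbours; weight: dict vertex -> non-negative weight.
    budget: maximum number of search nodes to expand (None means search to completion).
    Returns (best_weight, best_clique, finished); finished=False means possibly suboptimal.
    """
    best = [0, []]  # incumbent weight and clique
    expanded = 0
    finished = True

    def expand(clique, clique_w, candidates):
        nonlocal expanded, finished
        if clique_w > best[0]:
            best[0], best[1] = clique_w, list(clique)
        for v in sorted(candidates, key=weight.get, reverse=True):  # heaviest first
            if clique_w + sum(weight[u] for u in candidates) <= best[0]:
                return  # bound: remaining candidates cannot beat the incumbent
            if budget is not None and expanded >= budget:
                finished = False
                return  # budget exhausted: keep the best clique found so far
            expanded += 1
            clique.append(v)
            expand(clique, clique_w + weight[v], candidates & adj[v])
            clique.pop()
            candidates = candidates - {v}  # cliques containing v are done

    expand([], 0, set(adj))
    return best[0], best[1], finished


adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
weight = {"a": 1, "b": 1, "c": 1, "d": 5}
print(max_weight_clique(adj, weight))            # (6, ['d', 'a'], True)
print(max_weight_clique(adj, weight, budget=1))  # (5, ['d'], False)
```

Even with a budget of one expansion, heaviest-first ordering already finds a clique of weight 5 out of an optimum of 6.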
A Fast Wait-Free Solution to Read-Reclaim Races in Reference Counting
…major programming languages (e.g., Arc in Rust, shared_ptr and atomic<shared_ptr> in C++).
In concurrent reference counting, read-reclaim races, where a read of a mutable variable races with a write that deallocates the old value, require special handling: use-after-free errors occur if the object…
Project posted on 2025-3-30 05:57:33
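The read-reclaim race in the excerpt above works like this: a reader loads the pointer and then increments the count. In between, a writer can swap in a new value and drop the old one's last reference. The Python sketch below closes that window the simple, blocking way, by taking the reference while holding the same lock the writer uses to swap. It is a lock-based baseline for illustration only, with hypothetical class names. It is neither wait-free nor the paper's technique, and the `freed` flag merely stands in for deallocation.

```python
import threading


class RcObject:
    """A manually reference-counted object (hypothetical model of a control block)."""

    def __init__(self, value):
        self.value = value
        self._count = 1  # the reference held by whoever created it
        self._lock = threading.Lock()
        self.freed = False

    def acquire(self):
        with self._lock:
            if self.freed:
                raise RuntimeError("use-after-free: object already reclaimed")
            self._count += 1

    def release(self):
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self.freed = True  # stands in for deallocation

    @property
    def count(self):
        with self._lock:
            return self._count


class LockedRcSlot:
    """A mutable shared variable holding an RcObject (cf. atomic<shared_ptr>)."""

    def __init__(self, obj):
        self._obj = obj  # the slot owns one reference
        self._lock = threading.Lock()

    def load(self):
        # Read the pointer AND take a reference while writers are excluded, so the
        # object cannot be reclaimed between the read and the increment.
        with self._lock:
            obj = self._obj
            obj.acquire()
            return obj

    def store(self, new_obj):
        # new_obj's initial reference is transferred to the slot.
        with self._lock:
            old, self._obj = self._obj, new_obj
        old.release()  # drop the slot's reference to the old value


slot = LockedRcSlot(RcObject("a"))
r = slot.load()
slot.store(RcObject("b"))
print(r.value, r.count, r.freed)  # a 1 False
```

Releasing the old value outside the lock is safe here. Once the swap is done, no new reader can reach the old value, and readers that already hold a reference keep its count above zero.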
How to Relax Instantly: Elastic Relaxation of Concurrent Data Structures
…thus limiting scalability. Semantic relaxation has the potential to address this issue, increasing parallelism at the expense of weaker semantics. Although prior research has shown that relaxing concurrent data structure semantics can improve performance, there is no one-size-…
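To illustrate the kind of relaxed semantics the excerpt above refers to, the Python sketch below models a k-relaxed FIFO queue sequentially. Its `dequeue` may return any of the k oldest items. Its `set_relaxation` changes k while the queue is in use, loosely mirroring the idea of adjusting relaxation at runtime. The sketch models semantics only. Real relaxed queues gain parallelism from their concurrent implementation (for example, several sub-queues), which this sketch omits. All names are hypothetical.

```python
import random


class RelaxedQueue:
    """Sequential model of a k-relaxed FIFO queue: dequeue returns one of the k oldest items."""

    def __init__(self, k=1, seed=None):
        self._items = []
        self._rng = random.Random(seed)
        self.set_relaxation(k)

    def set_relaxation(self, k):
        # "Elastic": the relaxation bound may change while the queue is in use.
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k

    def enqueue(self, item):
        self._items.append(item)

    def dequeue(self):
        if not self._items:
            raise IndexError("dequeue from empty queue")
        window = min(self.k, len(self._items))
        return self._items.pop(self._rng.randrange(window))  # k = 1 gives strict FIFO


q = RelaxedQueue(k=3, seed=1)
for i in range(10):
    q.enqueue(i)
print([q.dequeue() for _ in range(5)])  # each item is among the 3 oldest remaining at that time
q.set_relaxation(1)                     # tighten to strict FIFO for the rest
```

Setting k = 1 recovers an ordinary FIFO queue. Larger k trades ordering guarantees for freedom that a concurrent implementation could exploit.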