Anyscale Enhances Ray Data with Joins and Hash-Shuffle for Improved Performance

Optimizing Zoom Transcriptions with Multichannel Audio Recording

Timothy Morano
May 20, 2025 04:25

Anyscale introduces a hash-based shuffle backend in Ray Data, enhancing joins and performance for repartitioning and aggregations. Discover the advancements in the Ray 2.46 release.

Anyscale has unveiled significant improvements to Ray Data, highlighted by the introduction of a hash-based shuffle backend, according to Anyscale. This new feature, part of the Ray 2.46 release, aims to enhance joins and improve performance for data repartitioning and aggregations, while also reducing memory pressure.

Enhancements in Ray Data

The latest release boasts several new features, including native join support via the ds.join() API, key-based repartitioning, and a simplified custom aggregation API named AggregateFnV2. Additionally, the performance of large-scale sorting has been improved, which enhances range partitioning shuffle.

The newly introduced hash-based shuffle backend addresses previous limitations of the range-based shuffle approach. In prior versions, shuffling relied on range-partitioning, which was resource-intensive and prone to bottlenecks. The new method partitions incoming data blocks based on key-value tuples, directing them to corresponding Aggregator actors for efficient processing.

Implementing Joins with Hash Shuffle

Ray 2.46 introduces support for various join types, including inner, left/right, and full outer joins. The hash-shuffle backend co-locates records with the same keys, optimizing performance. This approach utilizes Apache Arrow’s Acero engine through PyArrow’s native Table.join operation, although it can be memory-intensive.

Benchmarking Performance

Performance benchmarks demonstrate substantial improvements across multiple workloads. Tests conducted on a cluster with m7i.4xlarge and m7i.16xlarge instances reveal performance gains ranging from 3.3x to 5.6x when using the hash-based shuffle, compared to previous versions. Notably, the TPCH-Q1-SF1000 workload, which was previously unmanageable, is now feasible with the new backend.

Additional tests showed that range-partitioning shuffle has also improved, with runtime enhancements between 1.6x and 4.3x. Importantly, the hash shuffle backend significantly reduces peak memory usage, with improvements up to 3.9x.

Future Developments

Looking ahead, Anyscale plans to expand support for different join types and implement logical plan optimizations to reorder joins. Further enhancements to data preprocessors are also anticipated.

These advancements in Ray Data are set to empower developers with more efficient data processing capabilities. For more insights, visit the official Anyscale blog.

Image source: Shutterstock

Source link