Performance numbers Hive and Spark

Vinod Gundala

Jan 25, 2025, 6:53:55 AM

to Alluxio Users

Hi Team,

I have configured Alluxio with an HDFS cluster and am running the TPC-DS benchmark to compare HDFS against Alluxio on Hive and Spark. However, I don't see any performance gain when accessing data through Alluxio; in fact, some queries underperform with Alluxio on both Hive and Spark.

I am using Alluxio 2.9.5 and Hadoop 3.3.

I am testing this with a disaggregated setup, comparing plain Hadoop workers against Alluxio, as follows:

HDFS test:

Hadoop DataNodes and NodeManagers run on separate nodes.

Alluxio test:

Alluxio workers are co-located with the Hadoop NodeManagers (compute layer), and the Hadoop DataNodes are isolated from the NodeManagers.

As these are TPC-DS queries, they are mostly read-heavy jobs.

- Running with the 1 TB TPC-DS dataset

- Alluxio is configured with CACHE as the read type and ASYNC_THROUGH as the write type

- Tried clearing the OS buffer cache, as mentioned in other docs, but no luck

- Alluxio is configured with memory only as the cache, and about 30% of the cache is free at any given point

- As per the metrics, cache hits are happening as expected for subsequent queries

- Added the following properties to see if they make any difference, but no luck:

alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.DeterministicHashPolicy

alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards=3

alluxio.user.file.persistence.initial.wait.time=-1

alluxio.user.file.persist.on.rename=true

alluxio.master.persistence.blacklist=_temporary
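
For context, the read/write types and the memory-only cache described in the list above would roughly correspond to an alluxio-site.properties fragment like the one below. This is only a sketch: the ramdisk path and size are placeholder assumptions, not the actual values from this cluster.

```properties
# Read and write types as described above
alluxio.user.file.readtype.default=CACHE
alluxio.user.file.writetype.default=ASYNC_THROUGH

# Single memory tier used as the cache
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=MEM
# Placeholder path and size (assumed for illustration only)
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.ramdisk.size=64GB
```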

Am I missing anything here?

Regards

Vinod Gundala
