Hi Team,
I have configured Alluxio on top of an HDFS cluster and am running the TPC-DS benchmark to compare HDFS vs Alluxio on Hive and Spark. However, I don't see any performance gain when reading through Alluxio; in fact, some queries perform worse with Alluxio on both Hive and Spark.
I am using Alluxio 2.9.5 and Hadoop 3.3.
I am testing this with a disaggregated setup of Hadoop workers vs Alluxio, as follows:
HDFS test:
Hadoop DataNode and NodeManager are running on separate nodes
Vs
Alluxio test:
Alluxio workers are co-located with the Hadoop NodeManagers (compute layer), and the Hadoop DataNodes are isolated from the NodeManagers.
As these are TPC-DS queries, the jobs are mostly read-heavy.
- Running with 1TB TPC-DS dataset
- Alluxio is configured with CACHE as the read type and ASYNC_THROUGH as the write type.
- Tried clearing the OS buffer cache, as mentioned in other docs, but no luck.
- Alluxio is configured with memory as the only cache tier, and about 30% of the cache is free at any given point.
- As per the metrics, cache hits are happening as expected for subsequent queries.
- Added the following properties to see if they make any difference, but no luck:
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.DeterministicHashPolicy
alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards=3
alluxio.user.file.persistence.initial.wait.time=-1
alluxio.user.file.persist.on.rename=true
alluxio.master.persistence.blacklist=_temporary
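To frame the comparison, per-query runtimes from the two runs could be compared along these lines. This is only a minimal sketch: the function name `slowdown_report`, the query labels, and the timings are hypothetical placeholders, not measured results from this benchmark.

```python
def slowdown_report(hdfs_secs, alluxio_secs):
    """Return {query: alluxio_time / hdfs_time} for queries present in both runs.

    A ratio above 1.0 means the query was slower through Alluxio.
    """
    report = {}
    for query, hdfs_time in hdfs_secs.items():
        # Skip queries missing from the Alluxio run or with unusable timings.
        if query in alluxio_secs and hdfs_time > 0:
            report[query] = alluxio_secs[query] / hdfs_time
    return report


if __name__ == "__main__":
    # Placeholder timings in seconds (illustrative only, not real results).
    hdfs_run = {"q1": 10.0, "q2": 20.0}
    alluxio_run = {"q1": 12.0, "q2": 18.0}
    for query, ratio in sorted(slowdown_report(hdfs_run, alluxio_run).items()):
        label = "slower" if ratio > 1.0 else "not slower"
        print(f"{query}: {ratio:.2f}x ({label} on Alluxio)")
```

Listing the queries with the highest ratios makes it easier to see whether the regression is spread across the whole run or limited to a few queries.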
Am I missing anything here?
Regards
Vinod Gundala