Performance numbers Hive and Spark


Vinod Gundala

Jan 25, 2025, 6:53:55 AM
to Alluxio Users
Hi Team, 

I have configured Alluxio on top of an HDFS cluster and am running the TPC-DS benchmark to compare HDFS vs. Alluxio on Hive and Spark. I don't see any performance gain when accessing data through Alluxio; in fact, some queries perform worse with Alluxio on both Hive and Spark.

I am using Alluxio 2.9.5 and Hadoop 3.3.

I am testing a disaggregated Hadoop setup against Alluxio, as follows:

HDFS test:
Hadoop DataNodes and NodeManagers run on separate nodes.
 vs.
Alluxio test:
Alluxio workers are co-located with the Hadoop NodeManagers (compute layer), and the Hadoop DataNodes are isolated from the NodeManagers.

As these are TPC-DS queries, they are mostly read-heavy jobs.
- Running with the 1 TB TPC-DS dataset.
- Alluxio is configured with CACHE as the read type and ASYNC_THROUGH as the write type.
- Tried clearing the OS buffer cache, as mentioned in other docs, but no luck.
- Alluxio is configured with memory as the only cache tier, and about 30% of the cache is free at any given point.
- As per the metrics, cache hits are happening as expected for subsequent queries.

- Added the following properties to see if they make any difference, but no luck:
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.DeterministicHashPolicy

alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards=3

alluxio.user.file.persistence.initial.wait.time=-1

alluxio.user.file.persist.on.rename=true

alluxio.master.persistence.blacklist=_temporary
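
For context, my understanding of the DeterministicHashPolicy setting with shards=3 is that each UFS block is mapped to a fixed set of 3 workers, and the client reads through its local worker if that worker is in the set, otherwise through a random one of them. Below is a minimal Python sketch of that idea. It is illustrative only: the function names are hypothetical, and the hashing is an assumption, not Alluxio's actual implementation.

```python
import hashlib
import random


def candidate_workers(block_id, workers, shards):
    """Deterministically map a block to `shards` workers (illustrative only)."""
    ordered = sorted(workers)  # stable order so every client agrees
    if not ordered:
        raise ValueError("no workers available")
    # Assumption: any stable hash works for the sketch; MD5 is used here.
    digest = hashlib.md5(str(block_id).encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(ordered)
    count = min(shards, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(count)]


def pick_worker(block_id, workers, shards, local_worker=None):
    """Prefer the local worker if it is a candidate, else pick one at random."""
    candidates = candidate_workers(block_id, workers, shards)
    if local_worker in candidates:
        return local_worker
    return random.choice(candidates)
```

If this understanding is right, a block would be cached on at most 3 workers, so with many co-located workers a large share of reads would still go over the network to a remote Alluxio worker rather than being served locally.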


Am I missing anything here?


Regards

Vinod Gundala
