Hi Chin,
sorry to disappoint you, but we never know the ground truth :-(
As you noticed, every step involves configuring a set of parameters and selecting algorithms. But even if you knew all the details of those algorithms, it would still not be clear which one to use. It depends on your biological question, the wet-lab protocols used, ...
As a guideline, you might want to take a look at our platform QIITA:
https://qiita.ucsd.edu/
There, we try to address the problem you mentioned: how to combine different studies from different researchers. Our approach is to use a strictly defined pipeline for all datasets, with only a few parameters to tune. Maybe you want to adopt some of those best-practice decisions. But keep in mind that you might get much better results for your specific experiment if you use settings not listed in QIITA.
Does this help?