nlines=$( zcat ${fileName} | wc -l ) # Number of reads in the tagAlign file
nlines=$(( (nlines + 1) / 2 )) # half that number
zcat "${fileName}" | shuf | split -d -l ${nlines} - "${outputDir}/${outputStub}" # This will shuffle the lines in the file and split it into two parts
What I believe it should say is:
nreps=3 # Number of treatment replicates
nlines=$( zcat "${fileName}" | wc -l ) # Number of reads in the tagAlign file
nlines=$(( (nlines + nreps - 1) / nreps )) # that number divided by $nreps, rounded up so split does not produce an extra small file
zcat "${fileName}" | shuf | split -d -l ${nlines} - "${outputDir}/${outputStub}" # This will shuffle the lines in the file and split it into $nreps parts. Two of them will be your pooled pseudoreplicates.
Splitting the pool into two, rather than into the number of replicates (nreps=3 in my case), means you compare true replicates of size X against pooled pseudoreplicates of size 1.5X (or larger if nreps>3). The question you are then asking is: are the two replicates more than twice as different from each other as two random samples of the entire diversity that also contain half again as many reads? If you split by 3 instead and compare only the first two pooled pseudoreplicates, you ask the more logical question: are the two replicates more than twice as different from each other as two random samples of the entire diversity at the same depth? With only two replicates it makes sense to split the pool by 2, but with more than two, splitting by 2 no longer answers the original question.
If you worry that two of the three pooled pseudoreplicates do not capture all the variance of the full three replicates, you can repeat the 2-of-3 draw as a bootstrap to get a mean and variance.
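One way the repeated draw could look is sketched below. This is an assumption on my part, not something the pipeline provides: the function name resample_pseudoreps, its arguments, and the round${i}.pr file naming are all hypothetical. Strictly speaking it is repeated random subsampling (a fresh shuffle each round) rather than a with-replacement bootstrap. Whatever comparison you run on each pair would then be averaged over the rounds.

```bash
#!/usr/bin/env bash
# Sketch only: function name, arguments and file naming are hypothetical.
# Repeats the shuffle-and-split nrounds times and keeps two of the nreps
# parts from each round as one pair of pooled pseudoreplicates.
resample_pseudoreps() {
    local pooled=$1 outdir=$2 nreps=$3 nrounds=$4
    local nlines i part
    nlines=$( zcat "${pooled}" | wc -l )            # reads in the pooled file
    nlines=$(( (nlines + nreps - 1) / nreps ))      # lines per part, rounded up
    mkdir -p "${outdir}"
    for (( i = 1; i <= nrounds; i++ )); do
        # split -d names the parts round${i}.pr00, round${i}.pr01, ...
        zcat "${pooled}" | shuf | split -d -l "${nlines}" - "${outdir}/round${i}.pr"
        # Keep only the first two parts of this round; drop the rest.
        for part in "${outdir}/round${i}.pr"*; do
            case "${part}" in
                *.pr00|*.pr01) ;;
                *) rm -f "${part}" ;;
            esac
        done
    done
}

# Example call (paths are placeholders):
# resample_pseudoreps pooled.tagAlign.gz bootstrap_out 3 10
```

Each round leaves two files of roughly X reads each, so the downstream comparison can be run once per round and summarised.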
Could anyone correct me if I am wrong? If not, feel free to send me validation.