Hi Gary,
Thank you for your email. I uploaded the 150bp markers in response to a researcher who was working with metagenome reads with an average length of 150bp. The quick answer for their use on isolate genomes is that I expect results will be similar using either set of markers. I have provided a more detailed response below.
I do not anticipate a substantial difference in quantification results between the 101bp markers and the 150bp markers. The main differences that can arise when changing the expected read length in ShortBRED-Identify are that 1) the Quasi Markers (QM's) and Junction Markers (JM's) will have different lengths, and 2) it can affect how many JM's and QM's are produced.
Regarding #1, I set the QM and JM length to be 1/3 of the expected read length. (The 1/3 comes from converting nucleotide length to expected amino acid marker length.) Thus, the JM's and QM's in the 150bp set will be a bit longer than those in the 101bp set (50 AA's vs 34 AA's).
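To make the arithmetic concrete, here is a minimal sketch of that conversion. The function name is hypothetical and the rounding to the nearest whole amino acid is my assumption (it is what turns 101bp into 34 AA's); this is not the actual ShortBRED code.

```python
def marker_length_aa(read_length_bp: int) -> int:
    """Approximate JM/QM length in amino acids for a given read length.

    Three nucleotides encode one amino acid, so the marker length is
    one third of the read length. Rounding to the nearest integer is
    an assumption made for this illustration.
    """
    return round(read_length_bp / 3)


for bp in (101, 150):
    print(bp, "bp reads ->", marker_length_aa(bp), "AA markers")
```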
Regarding #2, the change in expected read length can result in more or fewer JM's. JM's are produced when ShortBRED cannot find a unique region long enough to build a valid True Marker (TM). We call them Junction Markers because each part of a JM overlaps with some other protein family, but the junction of these regions is unique to the family it represents. Having a longer expected read length can result in more JM's, because you can imagine instances where ShortBRED could not find a JM that is 34 AA long, but can find one that is 50 AA long.
There is also an effect that can result in fewer JM's, though. Early in the ShortBRED-Identify process, the program clusters sequences with CD-Hit, and then constructs consensus sequences for each family. When the AA's in a given column of the aligned sequences of a particular family are not at least 95% identical, ShortBRED represents the AA with an "X". When forming the JM's, ShortBRED first tries to find a region 40% of the expected read length that does not contain any X's. A longer expected read length means a longer X-free region is required, and that is harder to find. Thus, increasing the expected read length can reduce the number of JM's.
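The consensus and X-free window steps can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the sequences are already aligned to equal length, gaps are ignored, and the function names and the toy alignment are made up for the example. It is not how ShortBRED-Identify is actually implemented.

```python
from collections import Counter


def consensus(aligned_seqs, min_identity=0.95):
    """Build a consensus string, writing 'X' for any column whose most
    common residue makes up less than min_identity of the sequences."""
    out = []
    for column in zip(*aligned_seqs):
        residue, count = Counter(column).most_common(1)[0]
        out.append(residue if count / len(column) >= min_identity else "X")
    return "".join(out)


def has_x_free_window(seq, window):
    """True if seq contains a stretch of at least `window` residues
    with no 'X' in it."""
    return any(len(run) >= window for run in seq.split("X"))


# Toy family: the third column disagrees, so it becomes 'X'.
family = ["MKVLA", "MKVLA", "MKILA"]
cons = consensus(family)
print(cons)
print(has_x_free_window(cons, 2))  # a short window fits between the X's
print(has_x_free_window(cons, 3))  # a longer window does not
```

The last two lines show the effect described above: the same consensus sequence can satisfy a short window requirement but fail a longer one.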
Comparing the two sets of markers, I find that we have 48 JM's in the 101bp set, and 47 in the 150bp set. We have 6 QM's in the 101bp set, and 9 QM's in the 150bp set. (We build 3 QM's when we cannot find a JM, so this makes sense.) Both sets have 4,078 TM's, and I believe 4,072 of these are identical. (I am not sure why six are different - I may have used a different number for the maximum length of all markers in a family or changed some other parameter between versions of ShortBRED).
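As a quick sanity check on those counts, the short tally below uses only the numbers quoted above for the 101bp and 150bp sets. The dictionary layout is just for illustration.

```python
# Marker counts per set, as quoted in this email.
counts = {
    "101bp": {"TM": 4078, "JM": 48, "QM": 6},
    "150bp": {"TM": 4078, "JM": 47, "QM": 9},
}

# Total number of markers in each set.
totals = {name: sum(c.values()) for name, c in counts.items()}

# One fewer JM in the 150bp set should correspond to three more QM's.
jm_lost = counts["101bp"]["JM"] - counts["150bp"]["JM"]
qm_gained = counts["150bp"]["QM"] - counts["101bp"]["QM"]

print(totals)
print("JM's lost:", jm_lost, "QM's gained:", qm_gained)
```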
The length of these JM's and QM's could affect whether some isolate genome AA ORFs or nucleotide sequences are called as valid hits, which will then affect the final ShortBRED score for the associated protein families. When dealing with metagenomic data, we require an alignment of a read to a marker to be greater than or equal to 95% of the read, or contain the full marker (in addition to the identity requirements). When dealing with genomic data, we switch these, and require either 95% of the marker length, or the full ORF/nuc seq to align. Since markers will very likely be shorter than the sequences from the isolate genome, the limit in most cases will be the marker length.
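The two length rules can be sketched as below. This is a simplified illustration, not the ShortBRED-Quantify code: the function name and the mode strings are hypothetical, all lengths are assumed to be in the same units (amino acids), and the identity requirement is left out.

```python
def passes_length_rule(aln_len, marker_len, query_len, mode, min_frac=0.95):
    """Check only the alignment-length requirement for a hit.

    mode="metagenome": the alignment must cover >= 95% of the read
                       (query), or the full marker.
    mode="genome":     the alignment must cover >= 95% of the marker,
                       or the full ORF/nucleotide sequence (query).
    """
    if mode == "metagenome":
        return aln_len >= min_frac * query_len or aln_len >= marker_len
    if mode == "genome":
        return aln_len >= min_frac * marker_len or aln_len >= query_len
    raise ValueError(f"unknown mode: {mode}")


# Genome case: a 50 AA marker against a 300 AA ORF. The marker is the
# shorter of the two, so its length sets the limit.
print(passes_length_rule(48, 50, 300, "genome"))
print(passes_length_rule(47, 50, 300, "genome"))
```

With a 50 AA marker the genome threshold is 47.5 AA, so an alignment of 48 AA passes and one of 47 AA does not.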
The upshot of all this is that I expect the results to be largely the same. The distribution of JM's and QM's has changed very slightly, and they are a bit longer, but 4,072 of the 4,132 original 101bp markers are the same.