The basic reason repeating text helps Tesseract is that Tesseract
makes an initial assumption about what kind of letters it is looking
at: tall (digits, uppercase letters, tall lowercase) or short
lowercase letters. Only after it makes that guess will it try to
match the letters against the corresponding subset of letters in the
training set.
Consider these texts submitted on their own:
aroma
usa
In the first example Tesseract is fairly likely to get it wrong and
interpret the word as an all-uppercase word. The reason: long words
whose letters are all the same height are likely to be uppercase,
because lowercase words tend to have taller letters in the mix, as in
"lunch", "party" or "obscure". In the case of "usa" it may get it
right because the word is shorter, so it could be either lowercase or
uppercase.
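
To make the reasoning above concrete, here is a toy sketch in Python. It is not Tesseract's code and does not call Tesseract; it only flags lowercase words that fit the risky pattern described above (letters all the same height, and the word is long). The function names and the length threshold of 4 are my own assumptions, chosen so that "aroma" is flagged and "usa" is not.

```python
# Toy illustration only: NOT how Tesseract is implemented.
# Lowercase letters with ascenders or descenders break the uniform height.
# (The dot on "i" is ignored here to keep the sketch simple.)
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def uniform_height(word):
    """True if every letter of a lowercase word has plain x-height."""
    return not any(c in ASCENDERS or c in DESCENDERS for c in word)


def case_confusion_risk(word, min_len=4):
    """True if the word is long and all letters are the same height.

    min_len is a hypothetical threshold, not a Tesseract parameter.
    """
    return len(word) >= min_len and uniform_height(word)


for w in ["aroma", "usa", "lunch", "party", "obscure"]:
    print(w, case_confusion_risk(w))
```

Words such as "lunch" are not flagged because the "l" and "h" give the height guess something to work with.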
In the case of digits, submitting "32 32 32" may yield better results
than just "32", because in the first case Tesseract gets 6 characters
of the same height, which increases the likelihood that it treats them
as tall.
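
A minimal sketch of the repetition trick in Python, under some assumptions: the text is a single token such as "32", you render the repeated string to an image and run Tesseract on it yourself (that step is left out here), and you then reduce the recognized copies back to one value. The helper names and the majority vote are my own additions, not part of Tesseract.

```python
from collections import Counter


def repeat_for_ocr(text, times=3):
    """Build the string to render and submit, e.g. "32" -> "32 32 32"."""
    return " ".join([text] * times)


def collapse_repeats(ocr_output):
    """Reduce the recognized copies to one token by majority vote.

    Assumes the original text was a single whitespace-free token.
    Returns an empty string if nothing was recognized.
    """
    tokens = ocr_output.split()
    if not tokens:
        return ""
    return Counter(tokens).most_common(1)[0][0]


print(repeat_for_ocr("32"))
# If one copy is misread, the majority still wins:
print(collapse_repeats("32 32 3Z"))
```

The vote is optional; taking the first token would also work, but voting tolerates a single misread copy.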
One would hope that Tesseract had a feedback loop whereby a height
estimate is revisited and reversed if it produced suspicious results,
but I have not seen strong evidence that Tesseract has any such check.
Patrick