Periods before dialogue tags

Gabriel Corrado

Apr 23, 2022, 12:09:50 AM
to Standard Ebooks
I'm reading Kim and seeing the same error over and over:

> “None⁠—none.” said the lama earnestly.

The period should be a comma. This is probably an OCR thing, since I see it in the PG transcription, too. Is anyone checking for this? If not, it's probably rather common. The regex `\.” [a-z]` should turn up most cases—14 in this book alone.

Unfortunately, sometimes a proper name immediately follows the closing quote. I didn't find any examples in Kim, but I wonder if anything can be done about it, besides catching them manually, of course.
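
A minimal sketch of what the regex above does and does not catch, written in Python as an illustration only. The sample sentences and the `is_suspect` helper are made up for this example and are not taken from any ebook or tool.

```python
import re

# Pattern from the post: a period, a closing curly quote, a space,
# then a lowercase letter.
SUSPECT = re.compile(r"\.” [a-z]")

# Hypothetical sample lines, invented for illustration.
samples = [
    "“None.” said the lama earnestly.",  # flagged: lowercase dialogue tag follows
    "“None.” Kim said nothing more.",    # not flagged: a capital letter follows
    "“None,” said the lama earnestly.",  # not flagged: already a comma
]


def is_suspect(line: str) -> bool:
    """Return True if the line contains a period before a lowercase dialogue tag."""
    return SUSPECT.search(line) is not None


for line in samples:
    print(is_suspect(line), line)
```

The second sample shows the limitation mentioned above: when a proper name follows the closing quote, the pattern cannot tell a dialogue tag from a new sentence, so those cases still need a human eye.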

Nick

Alexander Yankov

Apr 24, 2022, 4:29:54 PM
to Standard Ebooks
Using your regex, I found only 11 cases of this in Kim. I verified them against the scans, and in each case it should be a comma. Opened PR 4.

Do we have examples of this error in other books?

Vince

Apr 24, 2022, 5:20:29 PM
to Standard Ebooks
A quick run through the corpus shows that there’s quite a bit of it. The regex matches on over 60 different books, including several I did. Not all of the matches are valid, of course, but a quick glance indicates that most of them are. I’ll undertake PRs for mine this week. The rest of the corpus will just have to be addressed over time.

That’s a useful regex to add to our review steps, though. Thanks!

David Grigg

Apr 24, 2022, 7:06:30 PM
to Standard Ebooks
Kim was my production. Sorry I missed these, though it sounds a very common thing to miss.

I’m reminded of Mark Twain’s observation: “God first made idiots. That was for practice. Then he made proofreaders.”
--
You received this message because you are subscribed to the Google Groups "Standard Ebooks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to standardebook...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/standardebooks/74A0FB6E-EF5A-4003-A277-51E7DE10F43F%40letterboxes.org.

Vince

Apr 24, 2022, 7:23:40 PM
to Standard Ebooks
Yeah, apparently a period in that situation just doesn’t raise a red flag in most of our eyes. Two of the Dumas books I did have a couple of dozen between them. Missing one or two is one thing, but jiminy…

Vince

Apr 24, 2022, 7:57:39 PM
to Standard Ebooks
I underestimated by a third. Here’s the list for anyone who has some time on their hands. If a bunch of people take a few, it shouldn’t take too long. It’s possible they could be fixed on the server all in one fell swoop, but someone would have to look at the grep output first to make sure 100% of the matches were valid, i.e., needed to be fixed. (I removed mine from the list; I’ve PR'd several, with three left to go.)

a-merritt_the-moon-pool
adam-smith_the-wealth-of-nations
aleksandr-kuprin_yama_bernard-guilbert-guerney
alexandre-dumas_the-black-tulip_p-f-collier-and-son
anthony-trollope_doctor-thorne
arthur-conan-doyle_the-valley-of-fear
banjo-paterson_an-outback-marriage
benjamin-disraeli_sybil
carey-rockwell_stand-by-for-mars
charles-dickens_a-christmas-carol
charles-dickens_great-expectations
charles-dickens_oliver-twist
charlotte-bronte_villette
d-h-lawrence_the-rainbow
e-m-forster_a-room-with-a-view
e-m-forster_howards-end
e-nesbit_the-enchanted-castle_h-r-millar
e-nesbit_the-phoenix-and-the-carpet
e-nesbit_the-story-of-the-amulet
edgar-allan-poe_short-fiction
eleanor-h-porter_pollyanna
eleanor-h-porter_pollyanna-grows-up
elizabeth-gaskell_cranford
ella-cheever-thayer_wired-love
epictetus_the-enchiridion_elizabeth-carter
ernest-hemingway_short-fiction
erskine-childers_the-riddle-of-the-sands
f-scott-fitzgerald_the-beautiful-and-damned
fergus-hume_the-mystery-of-a-hansom-cab
ford-madox-ford_no-more-parades
fyodor-dostoevsky_poor-folk_c-j-hogarth
g-k-chesterton_the-man-who-was-thursday
george-eliot_daniel-deronda
george-eliot_middlemarch
george-eliot_the-mill-on-the-floss
gustave-flaubert_madame-bovary_eleanor-marx-aveling
h-g-wells_short-fiction
h-g-wells_tono-bungay
henry-james_the-golden-bowl
henry-james_the-wings-of-the-dove
hermann-hesse_siddhartha_gunther-olesch_anke-dreher_amy-coulter_stefan-langer_semyon-chaichenets
hilaire-belloc_the-servile-state
ivan-turgenev_fathers-and-children_constance-garnett
j-s-fletcher_scarhaven-keep
j-s-fletcher_the-charing-cross-mystery
j-s-fletcher_the-middle-temple-murder
jacob-grimm_wilhelm-grimm_household-tales_margaret-hunt
james-joyce_dubliners
john-galsworthy_the-forsyte-saga
joseph-furphy_such-is-life
jules-verne_five-weeks-in-a-balloon_william-lackland
jules-verne_round-the-moon_ward-lock-co
jules-verne_the-mysterious-island_stephen-w-white
laozi_tao-te-ching_james-legge
leo-tolstoy_a-confession_aylmer-maude
leo-tolstoy_anna-karenina_constance-garnett
leo-tolstoy_hadji-murad_aylmer-maude
leo-tolstoy_short-fiction_louise-maude_aylmer-maude_nathan-haskell-dole_constance-garnett_j-d-duff_leo-weiner_r-s-townsend_hagberg-wright_benjamin-tucker_everymans-library_vladimir-chertkov_isabella-fyvie-mayo
lewis-carroll_alices-adventures-in-wonderland_john-tenniel
lewis-carroll_through-the-looking-glass_john-tenniel
ludovico-ariosto_orlando-furioso_william-stewart-rose
m-e-braddon_lady-audleys-secret
martin-andersen-nexo_pelle-the-conqueror_jessie-muir_bernard-miall
maurice-leblanc_813_alexander-teixeira-de-mattos
p-g-wodehouse_piccadilly-jim
p-g-wodehouse_right-ho-jeeves
rudyard-kipling_just-so-stories
rudyard-kipling_kim
samuel-pepys_the-diary
selma-lagerlof_the-story-of-gosta-berling_pauline-bancroft-flach
selma-lagerlof_the-wonderful-adventures-of-nils_velma-swanston-howard
sinclair-lewis_babbitt
thomas-hardy_far-from-the-madding-crowd
thomas-hardy_jude-the-obscure
thornton-w-burgess_green-meadow-stories
victor-hugo_les-miserables_isabel-f-hapgood
vladimir-korolenko_short-fiction_aline-delano_sergius-stepniak_william-westall_thomas-seltzer_the-russian-review_marian-fell_clarence-manning
w-e-b-du-bois_the-souls-of-black-folk
walter-de-la-mare_memoirs-of-a-midget
walter-scott_ivanhoe
wilkie-collins_man-and-wife
wilkie-collins_no-name
wilkie-collins_the-dead-secret
wilkie-collins_the-moonstone
william-shakespeare_coriolanus
zane-grey_betty-zane
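
A list like the one above could be produced with grep or with a short script. The following is only a sketch in Python, under the assumption that the corpus is a folder containing one directory per ebook with `.xhtml` text files somewhere inside each; that layout, the `count_suspects` name, and the `glob` parameter are all hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Same pattern as in the original post.
SUSPECT = re.compile(r"\.” [a-z]")


def count_suspects(corpus_root: Path, glob: str = "*.xhtml") -> Counter:
    """Count regex matches per top-level directory under corpus_root.

    Assumes (hypothetically) one directory per ebook under corpus_root.
    Only directories with at least one match appear in the result.
    """
    counts: Counter = Counter()
    for path in corpus_root.rglob(glob):
        text = path.read_text(encoding="utf-8")
        hits = len(SUSPECT.findall(text))
        if hits:
            # The first path component below the root is treated as the ebook name.
            book = path.relative_to(corpus_root).parts[0]
            counts[book] += hits
    return counts
```

Calling `count_suspects(Path("path/to/corpus"))` and printing the sorted keys would give a candidate list. Every match would still need to be checked by hand, since not all of them are real errors.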

Alex Cabal

Apr 24, 2022, 7:59:02 PM
to standar...@googlegroups.com
It will be easier for me to do this on my end, instead of reviewing 100
separate PRs. I'm not in front of a computer right now but I'll work on
this later in the week.

Vince

Apr 24, 2022, 8:24:01 PM
to Standard Ebooks
Even better, thanks, Alex. You can ignore my PRs, then. :)

Alex Cabal

Apr 25, 2022, 2:07:24 PM
to standar...@googlegroups.com
That's a great find, Nick, thanks.

It looks like this is a surprisingly common error that nobody spotted
before.

We can adapt a different regex we currently use to check for missing
punctuation before dialog tags. It would look something like this:
`\.” (said|[a-z]+ed)`
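
To make the difference between the two patterns in this thread concrete, here is a small Python comparison. It is a sketch only: the sample sentences and the `classify` helper are invented for illustration.

```python
import re

# Broad pattern from the first post: any lowercase letter after the quote.
BROAD = re.compile(r"\.” [a-z]")
# Narrower pattern from this message: "said" or a word ending in "ed".
NARROW = re.compile(r"\.” (said|[a-z]+ed)")

# Hypothetical sample lines.
samples = [
    "“Go.” said Kim.",          # both patterns match
    "“Go.” replied the lama.",  # both match: "replied" ends in "ed"
    "“Go.” he said.",           # only the broad pattern matches
]


def classify(line: str) -> tuple[bool, bool]:
    """Return (broad_match, narrow_match) for one line."""
    return (bool(BROAD.search(line)), bool(NARROW.search(line)))


for line in samples:
    print(classify(line), line)
```

The narrower pattern trades recall for precision: it skips tags that start with a pronoun, such as "he said", but it is less likely to flag lowercase text that is not a dialogue tag at all.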

Since this error is so common, I'm going to run this across the corpus
to see what I find. Then, we'll add the final regex to lint.

Alex, thanks for your PRs to fix them in some of these books. I'm going
to close them for now, because I want to test this regex against the
corpus and then update it that way. Once we're sure of a solution then
I'll apply the changes to the entire corpus all at once.
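
For illustration, a bulk fix along these lines could be sketched in Python as below. This is an assumption-laden example, not the actual change applied to the corpus: the `fix_tags` name is hypothetical, and the trailing `\b` word boundary is an addition that is not in the regex quoted above.

```python
import re

# Match a period plus closing quote only when a likely dialogue tag follows.
# The lookahead leaves the tag itself untouched; \b is an added assumption
# so that only whole words ending in "ed" count.
TAG = re.compile(r"\.”(?= (?:said|[a-z]+ed)\b)")


def fix_tags(text: str) -> str:
    """Replace the period before a likely dialogue tag with a comma."""
    return TAG.sub(",”", text)


print(fix_tags("“None.” said the lama earnestly."))
print(fix_tags("“None.” He turned away."))
```

A blind substitution like this would also rewrite lowercase words that merely end in "ed", such as "red", which is one reason to review the matches before applying anything across the corpus.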

Alex Cabal

Apr 25, 2022, 6:38:07 PM
to standar...@googlegroups.com
OK, I've finished adding a lint test for this. The final XPath ended up
being a little more sophisticated, and it caught 95% of the errors in
the corpus with only a few false positives for extreme edge cases.

The volume of these errors is shocking: we fixed 179 files across 89
ebooks. It's amazing that this kind of error has been totally overlooked
for years despite being very, very common. Great find!

Fixes should all be pushed later today.
