Feedback requested re. refining rules for whether or not to italicize non-English words.

Alex Cabal

Jan 7, 2022, 9:21:50 PM
to standar...@googlegroups.com
This discussion came up while David was working on _Wet Magic_ and I
thought I'd bring it to the list to get some feedback.

As all of you know, in older books some non-English words are italicized
that today have been assimilated into English. As such we remove italics
because it would read strangely to modern readers. For example, an old
book might italicize `menu` as a French import word, even though today
it's a regular English word.

Our rule is that if it's in the online Merriam-Webster dictionary, then
we remove italics. However, this has sometimes resulted in surprising
removals. For example, Vince unitalicized `panem et circenses` in a
book, because it was in M-W. In _Wet Magic_, David unitalicized `plage`
and `haute ecole` because they were also in M-W.

I found such removals surprising because they are certainly not common
words, even if they're in M-W.

So the question is, should we refine this rule in order to not catch
some of these rarer exceptions that still appear to be in M-W? If so, how?

One possibility is to also look at Google N-Gram to see if the word is
actually common or not. But we'd have to pick some threshold to decide
on that, and I don't know what that might look like.

Robin Whittleton

Jan 8, 2022, 2:12:41 AM
to standar...@googlegroups.com
I hit this in With Fire and Sword recently: it seemed pretty random whether an unusual Cossack word would be in MW or not. I’d be happy to add a judgment call to the manual, and I suspect this will mostly affect titles that are medium and up difficulty anyway.



David Grigg

Jan 8, 2022, 5:28:43 AM
to standar...@googlegroups.com
I think it has to be left to the common sense of the producer. 

In earlier versions of the manual, or the step by step guide I seem to recall it said something like: “if you think the reader won’t recognise it as a foreign word or phrase, leave it in italics even if it’s in MW”. 

That’s why I wanted ‘plage’ and ‘haute ecole’ italicised, because (at least in theory) the readers of “Wet Magic” are children, who almost certainly wouldn’t be able to recognise these phrases as not being English (particularly if unlike the children in the story, they don’t learn French at school). They may not understand the words, but at least the italics alert them that there’s something special about them.

B Keith

Jan 8, 2022, 10:24:51 AM
to Standard Ebooks
I think it's best to leave the MW as the guiding document. This is something that any editor wrestles with throughout their career. The “correct” decision is always going to be a moving target, sometimes dictated by changing usages and sometimes dictated by context. I really don’t think using N-Gram as a metric would make for a “better” decision in a lot of cases. I suppose we could build another tool like modernize-spelling that checks against a master list for exceptions, but that won’t take into account the choice to only italicize the first instance, etc.

I would also argue that the italics are there as a signal for the human reader, not the mechanical one, and the biggest problems are going to occur in the more complex books where we can more readily depend on the reader being able to discern the “foreignness” of the word regardless of what choice we make regarding italicization.

So… leave the MW rule and any exceptions to the producer? Not sure about putting the last bit in the Manual as it might lead to new producers trying too hard (I know I did back then…)
_________

Gaudeamus igitur iuvenes dum sumus

John Rambow

Jan 8, 2022, 11:17:48 AM
to standar...@googlegroups.com
I agree that there should be some latitude given to the producer to keep some words italicized, even if they are listed in the main M-W dictionary.

One thing worth considering is how common the word is in the book itself -- for instance if "panem et circenses" just appears once or twice, I'd be more likely to keep the italics. But if somehow the phrase came up over and over in the same book, it might be better to just make it roman.

Vince Rice

Jan 8, 2022, 12:49:51 PM
to standar...@googlegroups.com
Well, David’s examples were in a children’s book, so… :)

My concern (maybe too strong a word) with just leaving it up to the producer’s discretion is that then it becomes very subjective, which is going to make it harder for producers and reviewers, who have their own, maybe different, POV. Which will inevitably lead back to Alex, who has enough to do. :)


Alex Cabal

Jan 8, 2022, 1:05:28 PM
to standar...@googlegroups.com
Exactly, that's my concern too. Part of the impetus behind the original
rule is to just have something to point to and say "that's the rule, do
what it says" and it would hopefully be correct most of the time.
Leaving it entirely up to the producer isn't too helpful because they'll
inevitably contact me about it, and often beginner producers will have
questionable knowledge of what is or isn't common vocabulary in the
kinds of work we do.

I wonder if at least trying the n-gram supplemented idea for a while
would be worth it. What might a good threshold look like?


David Zitzelsberger

Jan 8, 2022, 1:19:57 PM
to standar...@googlegroups.com
Even using Merriam-Webster I think unitalicizing panem et circenses was a mistake. Merriam-Webster specifically calls it out as a Latin phrase (AKA not English) and further states that their example is from a Roman poet.

So according to Merriam-Webster it is neither English nor American.

Plage is an interesting question though.


Vince

Jan 8, 2022, 1:25:28 PM
to Standard Ebooks
All the words under discussion are foreign. The point is that they’re in M-W, and have therefore worked their way into English by being in an English dictionary. By our rule, it’s “in M-W,” and therefore should not be italicized.

Vince

Jan 8, 2022, 1:27:49 PM
to Standard Ebooks
I only use ngrams to try to determine the odd spelling issue, so I’ve never paid attention to the numbers. Here are some examples. (I’m only including the final #, as I assume that’s what we care about, i.e. how often is it used today). We would need a bigger sample, obviously, but from these, the “common” ones have four zeroes after the decimal, the others have more.

ad infinitum—0.0000300%
ad loc—0.0000200%
in extremis—0.0000140%
panem et circenses—0.00000070%
plage—0.0000060%
haute école—0.000000050% (this is surprising to me; I’ve heard of haute ecole, but I’ve never seen plage, and yet plage is 100 times more prevalent?).
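
A rough sketch of the "four zeroes after the decimal" idea, using only the percentages listed above. The function names and the `max_zeros` parameter are made up for illustration; this is not an existing SE tool, and the percentages are typed in by hand rather than fetched from Ngram.

```python
from decimal import Decimal

# Final-year Ngram percentages, copied by hand from the list above.
NGRAM_PERCENT = {
    "ad infinitum": "0.0000300",
    "ad loc": "0.0000200",
    "in extremis": "0.0000140",
    "panem et circenses": "0.00000070",
    "plage": "0.0000060",
    "haute école": "0.000000050",
}


def leading_zeros(percent: str) -> int:
    """Count the zeros between the decimal point and the first nonzero digit."""
    value = Decimal(percent)
    if not 0 < value < 1:
        raise ValueError(f"expected a percentage between 0 and 1, got {percent!r}")
    # adjusted() is the exponent of the most significant digit,
    # e.g. -5 for 0.0000300, which has four zeros after the decimal point.
    return -value.adjusted() - 1


def is_common(percent: str, max_zeros: int = 4) -> bool:
    """Hypothetical rule: 'common' means at most max_zeros leading zeros."""
    return leading_zeros(percent) <= max_zeros


for word, percent in NGRAM_PERCENT.items():
    verdict = "roman" if is_common(percent) else "italic"
    print(f"{word}: {leading_zeros(percent)} zeros -> {verdict}")
```

With this sample, the first three phrases fall on the "common" side and the last three keep their italics. Raising `max_zeros` to 5 would move plage to the roman side as well.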

John Rambow

Jan 8, 2022, 2:43:54 PM
to standar...@googlegroups.com
"Plage" may be more common because of the additional meaning M-W lists, for a "bright region on the sun." 

Lots of scientific lit has gone into Ngram. 


Lukas Bystricky

Jan 8, 2022, 4:46:08 PM
to Standard Ebooks

Just a thought, but maybe we could stick information on MW. There's a section that provides examples, for example "plage" was used in an article from 2017 in Vogue. For both "panem et circenses" and "haute ecole" MW doesn't list any recent examples, so perhaps that's enough to conclude that they're not in common use and should remain italicized. (Using this rule plage would have the italics removed.) 

Lukas Bystricky

Jan 8, 2022, 4:50:01 PM
to Standard Ebooks
*stick to information on MW

Alex Cabal

Jan 8, 2022, 7:18:43 PM
to standar...@googlegroups.com
Maybe but who knows how detailed M-W is in that info. And even so, doing
so would lead to the wrong result, because plage was used in 2017 but we
actually want to italicize it! (Now that I type it out, plage isn't even
included in my spell checker!)


Alex Cabal

Jan 8, 2022, 8:06:55 PM
to standar...@googlegroups.com
When using ngram you can select the "English Fiction 2019" corpus which
is probably what we should use. But even in that corpus, plage is still
100x more common than haute ecole!


Evan Hall

Jan 9, 2022, 1:13:24 AM
to Standard Ebooks
I compiled some data that I hope you will find useful. My goal was to expand on Vince's list and get some Ngram frequencies for a larger list of words.

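Here's a link to the Google sheet:
https://docs.google.com/spreadsheets/d/1XdOeYh1tJ5CwcVNbQsa_fVqjQfMK4-CGRRF3ChKEBz8/edit?usp=sharing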

I started by looking for words in SE books that were in italics at some point during a book's production. To do this, I just opened the commit histories of some recently updated books on GitHub and searched on the commit history page for "italic". Then I skimmed those commits individually, looking for relevant words. If the word was listed on M-W.com, I added it to this spreadsheet. I looked at about 50–60 books and found roughly 100 words. (Some of these words might have been pulled from early in the production process, so they might not follow the SE standardized spelling in my spreadsheet.)

The spreadsheet contains a few interesting data columns:
  • The word in question.
  • The part of speech as listed on M-W.com. This was to identify phrases that were specifically identified by M-W as "Latin phrase" or similar, like Lukas suggested upthread. I also noted a few cases where the exact word wasn't in M-W, but a similar phrase was.
  • The Google Ngram frequency on the "English Fiction (2019)" corpus.
  • In some cases, I checked Google Ngram for variations (like "tete-a-tete" for "tête-à-tête"). If a variation had a higher frequency, I recorded that higher frequency and made a note that it was for a variant, not the listed word.

Looking at this spreadsheet, I'm inclined to echo Vince's finding that "four zeros" in the Ngram frequency looks like a pretty sensible cut-off as a general guideline for deciding whether to remove or keep italics. This easily removes italics from all of the fully naturalized words like "menu", "boutique", and "elite", and keeps them on the less familiar foreign phrases like "haute école" and "panem et circenses".

The rest of this email has some more opinions on this topic, both as a reader and a (long ago) SE producer.

In my opinion, there are really three categories that are worth talking about distinctly:
  1. Fully naturalized words that are "normal English" to most readers. These include "menu", "personnel" and "alias", for example.
  2. Common foreign words and phrases that will be recognized and understood by many readers, but are still considered "not English", even by people who know what they mean. Personally, I might include "tête-à-tête", "ad infinitum" and "ipso facto" in this category.
  3. Uncommon foreign words and phrases that are unfamiliar to many readers and might not be understood without looking them up. I would put "panem et circenses", "haute école", and "Weltschmerz" in this category.

To me, the most important goal for Standard Ebooks is to make sure that the words in category 1 are not italicized. The second priority is to make sure the words in category 3 are italicized. After that, we get into category 2, where, in my opinion, it's not as critical one way or the other.

I would expect some healthy difference of opinion about which words are in category 1 vs 2, and which words are in category 2 vs 3, but my hope is that there are not many (any?) words that we might disagree about being in category 1 vs 3. That suggests that we should try to draw our italicization threshold somewhere in the middle of category 2, and not worry too much about which category 2 words end up on either side of the line.

As with most parts of SE production, I would expect experienced producers (including reviewers) to feel free to make judgement calls about which words should keep or lose their italics, but I feel like the "four zeros Ngram frequency" guideline could be a clear and useful addition to the manual for less experienced producers.


David Grigg

Jan 9, 2022, 1:26:17 AM
to Standard Ebooks
Thank you, Evan, for this helpful work and your sensible thoughts.
On 9 Jan 2022, 5:13 PM +1100, Evan Hall <eeh...@gmail.com>, wrote:
> Here's a link to the google sheet.
> https://docs.google.com/spreadsheets/d/1XdOeYh1tJ5CwcVNbQsa_fVqjQfMK4-CGRRF3ChKEBz8/edit?usp=sharing

Vince

Jan 9, 2022, 1:47:22 AM
to Standard Ebooks
Interesting, thanks so much for doing this, Evan! I love your categories. 

The data shows, to me, how difficult this is—of the five zero words, I would personally consider these in #2, and thus wouldn’t italicize them.
  • prima facie
  • alumnus
  • esprit de corps
  • au fait
  • ipso facto
  • s’il vous plait
  • sui generis
  • conquistadores
  • ne plus ultra
  • reductio ad absurdum
  • Fräulein
  • juntas
  • ex nihilo

For the sake of simplicity, we could either:
  • Keep the four zero cutoff, which would italicize the above, or
  • Make it a five zero cutoff, which would not italicize the above, but would also leave unitalicized a small # of words that are more obscure (e.g. estaminet, Weltschmerz, etc.).

Even though I would put the above words in category #2, I would still probably lean towards the four zero cutoff, because maybe erring on the side of having a few italicized that are “common” is better than not italicizing some that are unrecognizable to most readers.


Lukas Bystricky

Jan 9, 2022, 3:34:07 AM
to Standard Ebooks
I think no matter what method we use there will be edge cases. Regarding "plage", one could argue that in the Vogue article they used "à la plage" which isn't an English phrase, so technically MW didn't really find a valid example for our purposes. 

The ngram approach is nice, but I wonder if it's not a bit of an overkill since we'll still be using MW anyways as our language guide.  Going through Evan's spreadsheet and looking at MW examples:
  • italics on everything under 0.0000013898 
  • no italics on everything above 0.0000219819 
  • no italics on the following words:
    • ex nihilo
    • Weltschmerz
    • junta
    • cui bono
    • status quo
    • reductio ad absurdum
    • plage (* see above)
    • ne plus ultra
    • conquistador
    • andante
    • sui generis
    • largo
    • ipso facto
    • esprit de corps
    • alumnus
    • prima facie
    • sangfroid
    • ad nauseam
    • coup de grâce
    • garçon
    • ad infinitum
    • élite
    • quid pro quo
    • abattoir
    • wanderlust
    • in situ
  • keep italics on:
    • camion
    • pari passu
    • poilu
    • en avant
    • estaminet
    • Fräulein
    • tout ensemble
    • ma foi
    • s'il vous plait
    • au fait
    • table d'hôte
    • éclat
    • triste
    • merci (beaucoup)
    • tête-à-tête

It's possible to quibble with those results, but I think that's roughly how I would sort them. I like keeping italics on "s'il vous plait", "merci beaucoup" and "table d'hôte" for example because even if the reader recognizes the words, they're probably still meant to be read in a French accent.
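
The sorting above amounts to two frequency bounds with a manual check in between. A minimal sketch of that banding, assuming the two bounds are in the same units as the spreadsheet's Ngram column (not confirmed here); the function name, return labels, and the sample frequencies in the usage lines are invented for illustration:

```python
# Bounds quoted in the message above; units assumed to match the spreadsheet.
LOWER_BOUND = 0.0000013898  # everything under this keeps its italics
UPPER_BOUND = 0.0000219819  # everything above this loses its italics


def band(frequency: float) -> str:
    """Sort an Ngram frequency into one of three bands (hypothetical helper)."""
    if frequency < LOWER_BOUND:
        return "italic"
    if frequency > UPPER_BOUND:
        return "roman"
    # In the middle band the decision falls back to M-W's example sentences.
    return "check M-W examples"


# Made-up frequencies, one per band, just to exercise the function.
print(band(0.0000005))  # italic
print(band(0.000005))   # check M-W examples
print(band(0.00005))    # roman
```

Only the middle band needs a human look, which is where the two word lists above came from.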


Lukas Bystricky

Jan 9, 2022, 4:17:44 AM
to Standard Ebooks
A stricter criterion would be only allowing MW's own examples (i.e. not examples from the web). In that case, the following additional words would be italicized:
  • Weltschmerz
  • cui bono
  • reductio ad absurdum
  • plage 
  • conquistador
  • andante
  • largo
  • ad infinitum

I think this seems more robust. 

Evan Hall

Jan 9, 2022, 12:20:59 PM
to Standard Ebooks
Looking over the data in the morning, I came to another realization. In most cases from this list, I would prefer to see italics on phrases and not miss them on individual words. Cutting italics from a phrase is more likely to create a confusing string of text, and keeping italics on single words is more likely to be distracting to a reader (if the reader is me).

This suggests an alternative guideline that doesn't rely on Ngram frequency at all:
  1. Is the foreign text a phrase with multiple words? Keep the italics as in the original text.
  2. Otherwise, is the foreign word in M-W? Drop the italics.
  3. Otherwise, keep the italics as in the original text.

Note that this doesn't suggest adding italics anywhere that they don't exist in the source.
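
The three steps can be written down as a small function. This is only a sketch: the function name is invented, the M-W lookup is passed in as a plain boolean rather than queried, and "multiple words" is assumed to mean whitespace-separated, so a hyphenated word like "tête-à-tête" counts as a single word.

```python
def keep_italics(italicized_text: str, in_merriam_webster: bool) -> bool:
    """Decide whether text that is italicized in the source stays italicized.

    Hypothetical helper; it only applies to text already in italics,
    since the guideline never adds italics that the source lacks.
    """
    # Step 1: multi-word phrases keep their italics.
    if len(italicized_text.split()) > 1:
        return True
    # Step 2: a single word that is in M-W loses its italics.
    if in_merriam_webster:
        return False
    # Step 3: anything else keeps its italics.
    return True


print(keep_italics("quid pro quo", in_merriam_webster=True))  # True
print(keep_italics("menu", in_merriam_webster=True))          # False
print(keep_italics("Weltschmerz", in_merriam_webster=True))   # False
```

The last line shows the naive result for an esoteric single word that happens to be in M-W, which is where a producer's judgement call would come in.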

Following this guideline naively would drop italics on some esoteric words which are in M-W, like "domnei", "roturiers", "Weltschmerz", and "camion". But these are, at first glance, all coming from moderate or difficult productions, where the producer would be expected to have enough experience to make a judgement call where appropriate.

Even on very familiar phrases, like "quid pro quo", "a priori", or even "en masse", I personally don't find italics distracting, even though they aren't necessary either. I realize that not everyone will share this opinion. (It already conflicts with the lists offered by Vince and Lukas!) Going back to my three categories, I would consider these examples firmly in category 2: familiar phrases that are, nevertheless, not made of English words.

Anyway, that's something else to think about.

Evan.

Vince

Jan 9, 2022, 1:48:18 PM
to Standard Ebooks
I don’t think we want to be dependent on the original text for any determination. Our style guide is based on how we want SE productions to look.
We definitely add italics where they don’t exist in the source—as just the current example I’m working on, Gibbon doesn’t italicize any foreign (Latin, French, Italian, etc.) quotations, which are usually sentences long at least. We italicize those per our style guide, and we should italicize those, IMO. The same for individual words; whether they are italicized or not in the original, we should format them according to our style guide (whatever that ends up being).

I also don’t think I see any difference between a word or phrase; e.g. quid pro quo may be foreign in origin, but it’s generally accepted/used in English, so I don’t see a need to italicize it.

If we do use ngrams, I believe the goal would be to have a tool that did the work (se ngram or something); it would be no more difficult than some of the interactive replaces we do now. (I don’t know what that entails, and it might be a down the road thing, but I think that would be where we want to get to, if the decision was made to use them.)


Matt Chan

Jan 9, 2022, 4:34:10 PM
to standar...@googlegroups.com
I've thought about this a lot and I really don't have much to contribute in terms of a solution, but I think before we get into the weeds too much we should take a step back and think about what is the challenge we're trying to solve. It seems to me that we are trying to balance making consistent, readable, and beneficial (to the readers) ebook productions with regards to italics and potential non-English phrases, with the ease of production and consistency in maintaining a standard for the producers.

Keeping an easy-to-verify standard, that is, checking for entries in M-W, clearly benefits the latter (the producers), but what Alex brought up is that using this standard may not be the best for achieving the former (for the readers), as it might italicize more familiar phrases while leaving less familiar phrases untouched.

So qualitatively, the question is how much of the ease of production (i.e. ease of coming to a consistent standard, ease of verification for producers and reviewers) we want to sacrifice to improve the reader's experience? If we stay with the status quo (retain using M-W as the italicizing standard), how detrimental is that for the reader?

Lukas Bystricky

Jan 10, 2022, 12:02:53 PM
to Standard Ebooks

The point Evan brings up about words vs. phrases is a good one, and I think another reason why we should stick to MW. 

Consider bon mot, which I would consider to be a common enough expression (and MW helpfully provides an example) and thus probably shouldn't be italicized. The word bon by itself though doesn't include the definition "French for 'good'", and thus should be italicized if that's what it means in context. However, among the other definitions of bon, we find "broad bean", which MW tells us perhaps comes from the Dutch word "boon", but in any case it is a valid (though uncommon) English word in that context and should not be italicized. On the other hand if bon refers to the Japanese festival, it's a foreign word and should be italicized with the appropriate language tag*. Relying on a raw ngram score I think would miss all this context.

It's also worth considering that assigning a cutoff value is arbitrary, while the decision MW made on whether or not to include an example phrase isn't exactly arbitrary and was at least made on a case-by-case basis. I'm sure someone at MW could have thought of an example phrase containing "merci beaucoup", but they made a decision not to. 

*Under the proposed rule change; under current rules bon would be in roman type in this context too

Matt Chan

unread,
Jan 10, 2022, 12:10:42 PMJan 10
to standar...@googlegroups.com
This might be a long shot but I wonder if someone else (e.g. M-W itself) has encountered this problem before and has come up with criteria that we might be able to reference? Not sure if it's worth reaching out to M-W or some other publishing house about this.

B Keith

unread,
Jan 10, 2022, 12:44:18 PMJan 10
to Standard Ebooks
If you are asking what a traditional publishing house does then the answer is the editor decides on a case-by-case basis, taking into account the text, changes in language, house style and potential readership. There really is no way to make hard and fast rules about this stuff in a traditional milieu.

I think the issue here is that we have producers with varying levels of experience, as well as of knowledge about the finer points of editorial decision making. To maintain a standard, “use your best judgement” isn’t going to cut it. I think M-W is a fine metric for most of it, but I can see how more complex texts get increasingly problematic.

Bruce
_________

Guadeamus igitur iuvenes dum sumus

Lukas Bystricky

unread,
Jan 10, 2022, 1:06:56 PMJan 10
to Standard Ebooks
Yeah I don't think they have any hard rules, but I imagine there is some sort of internal consistency. 

In our case if we chose for instance a cutoff of 0.0001, then we would italicize "s'il vous plait", but not "merci beaucoup", which in my opinion would be inconsistent and potentially noticeable to the readers.

Alex Cabal

unread,
Jan 10, 2022, 1:34:13 PMJan 10
to standar...@googlegroups.com
This is a great conversation everyone, please keep discussing. I want to
get a lot of opinions in before I sit back and try to digest it all.

To clarify slightly, there is always room for editorial discretion in
everything we do, including this topic. But as Bruce pointed out, the
reason we have a rule at all is because our producers are of varying
skill levels and there's only one of me. There has to be some kind of
baseline for someone who's never done this before to go off of, without
having to ask me about every single thing; and even guidance for
experienced producers who might just not have come across a not-uncommon
word before.

B Keith

unread,
Jan 10, 2022, 1:34:47 PMJan 10
to Standard Ebooks
This is a bit OT but you are not really thinking about it correctly. As you say it is impossible to have hard and fast rules. As a result in a small house like the ones I work with, the house stylesheets are generally way less complex than the Standard Ebooks one and, other than stipulating something like Chicago as the primary style guide, the decisions are left up to the editor on a book by book basis. This works better than trying to mandate and then enforce an overly complicated set of guidelines, which they don’t have time to do anyway. The consistency comes from the exercise of common sense based on being a professional editor (something I am not, but I live with one).

The sort of work we are discussing here is more akin to a scholarly house or university press, and their experience with these kinds of decisions is what they are paid for; the arguments, opinions and personal preferences about this kind of minutiae are what you are forced to listen to if you are ever stuck having a coffee with a bunch of them. Turns out there are almost as many exceptions as there are rules and they have lots and lots of rules. But trust me, they are almost always making it (the style guide) up as they go along because the next project is not quite going to fit the same rule set.

I imagine that the Editor of Penguin Classics or Signet Editions has a more rigid set of guidelines about this sort of thing but it would be more like our “modernize spelling” list. Decisions get made and then added to the list to be used throughout the series. But the very issue we are having proves that the rules are constantly in flux.

 

Matt Chan

unread,
Jan 10, 2022, 1:38:34 PMJan 10
to standar...@googlegroups.com
I also keep coming back to the question: how bad is the reading experience for the reader if, say, haute ecole is italicized and plage is not? For me I think the M-W policy we have right now is great purely from a producer perspective, because it's very easy to adhere to and generally doesn't raise a lot of ambiguity. Is moving away from that worth it? (Note that I'm just putting this out there, I don't necessarily 100% endorse keeping the M-W policy as is).

Alex Cabal

unread,
Jan 10, 2022, 1:41:37 PMJan 10
to standar...@googlegroups.com
We're not moving away from it, merely refining it possibly.

A first-time producer who has no editing experience may not realize that
plage is indeed uncommon. They would see the rule and do what it says
because they don't yet know better. We want to see if we can refine the
rules to avoid that situation more often, so less experienced producers
have a better baseline to start from, while keeping in mind that many
times it's a case-by-case decision and that's OK too.



B Keith

unread,
Jan 10, 2022, 2:17:14 PMJan 10
to Standard Ebooks
Is maintaining some sort of list out of the question? Too cumbersome? I imagine after we build a basic list it would be like modernize spelling with the occasional suggestions on what to add. But making a black and white rule is really going to be hard. I was just discussing this with Leslie (a book editor) and when I stated "bon mot" was definitely not italicised, she quirked her head and said she probably would italicize it…. depending on whether you said it with an accent or not…

Failing a list I think we should stick with M-W, arbitrarily pick one of the Ngram occurrence suggestions and again, exercise the best judgement we can.

Vince

unread,
Jan 10, 2022, 2:53:34 PMJan 10
to standar...@googlegroups.com
But there isn’t any other way of pronouncing it (bon mot), is there? I mean, when I say it it certainly doesn’t sound French, but still…

Ignoring any SE standards, I would italicize any non-English phrase: bon mot, s’il vous plait, merci beaucoup, in extremis, all of them. They’re not English, they should be italicized. The fact they’re recognized and, very occasionally (compared to their English equivalents) used by English speakers doesn’t make them English. Neither does their appearance in an English dictionary, AFAIC. This isn’t the case for menu (since that was mentioned earlier) or naive; yes, they came from French, but they're for all intents and purposes also English at this point. There is no “English” equivalent (like there is for s’il vous plait, e.g.), they're the word we use in English. Most people wouldn’t even know they aren’t “English” (the same “most people” would all know the above phrases aren’t English).

But, all of that is a judgment call. I don’t think it’s a particularly hard judgment call, but it is a judgment call. So maybe the cleanest solution is to pick a point on ngrams, and keep an exception list like for modernize spelling for the exceptions to that point. E.g., we could pick five zeroes, and maintain an exception list for the ones we wanted to except from the rule (either four zeroes we want to italicize, or five+ zeroes we don’t). The se tools would take both into account, so se <toolname> <word/phrase> would just say yea or nay on whether to italicize (assuming it is in M-W; if it’s not in M-W, the current rule stands), and the mechanics of how it came to that conclusion is unimportant to the end user.
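As a rough illustration of the scheme described above (not an existing se tool), the lookup could be sketched in Python as follows. The threshold value, both exception sets and their entries, and the function name are all hypothetical, and the n-gram frequency is assumed to be looked up elsewhere and passed in by the caller.

```python
# Sketch only: the threshold, both exception sets, and the function name are
# hypothetical, and the n-gram frequency is supplied by the caller.
RARE_THRESHOLD = 1e-5  # "five zeroes": anything below 0.00001 counts as rare

ALWAYS_ITALICIZE = {"plage"}        # would pass the threshold, italicize anyway
NEVER_ITALICIZE = {"in extremis"}   # would fail the threshold, keep in roman


def should_italicize(phrase: str, in_mw: bool, ngram_frequency: float) -> bool:
    """Return True if a foreign word or phrase should be set in italics."""
    key = phrase.strip().lower()
    if not in_mw:
        # Current rule: anything M-W does not list is italicized.
        return True
    if key in ALWAYS_ITALICIZE:
        return True
    if key in NEVER_ITALICIZE:
        return False
    # Otherwise fall back to the frequency cutoff.
    return ngram_frequency < RARE_THRESHOLD
```

The exception sets are checked before the cutoff, so a ruling that has already been made always wins over the raw frequency. Matching here is by exact lowercase string, so spelling variants would need their own entries.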

The challenge with that is spelling—especially for non-English phrases in English works, there are a lot of spelling differences. That’s not as big a problem when looking up in M-W, because the M-W website gives a list of close hits, and what you want is almost always on that list if it was just a spelling difference. But for our own tool, I guess it would be quite a bit more difficult.

As Bruce said, this is hard. It’s why editors (should?) get paid so much. :)

B Keith

unread,
Jan 10, 2022, 3:30:50 PMJan 10
to Standard Ebooks
Heh. When Bertie Wooster pronounces it, it is most certainly an English phrase :-) (bon mot pronounced like ron pot). In 19th century literature there is a whole host of French (and Latin etc.) words that make their way into English, and by the 20s bon mot is most certainly an anglicized word imho. I certainly use it that way. I am with you on s’il vous plait but would argue with in extremis, which has no need of italics and is again imho pretty much standard English like et cetera and ad infinitum. As we acknowledge, it is all an exercise of judgement.

Is this not the actual point of this discussion? When does a word enter into common usage? 

But I am on board with your thinking. M-W… a list… Spelling should just be taken care of, just like &c vs. etc.

And as for editor’s pay, well let’s just say you should stay far far away from the cultural industries if you have a liking for the finer things in life :-)


Lukas Bystricky

unread,
Jan 10, 2022, 4:19:24 PMJan 10
to Standard Ebooks
Yeah I think for "bon mot" it depends on how you read it. I've certainly seen it written both ways. Ideally it, and all other foreign phrases, would be the producer's choice, but there should be some rule (on paper at least) to help out producers with less experience (such as myself). 

I suppose one advantage of ngram is that it could help with automation, but that's only partially true. Having a low ngram score means that word is a candidate for italics, but it remains to determine whether or not it's actually an obscure English word, or if it's not, what language it is. Setting up such a database would be difficult to automate (as far as I can tell). If we're going to require manual labour anyways to set up the DB then I don't see what advantages ngram has over the MW example requirement (as I've now dubbed it). From what I've seen so far the MW example requirement does essentially the same thing as ngram for the "obvious" cases and seems to handle the edge cases in a more consistent/less arbitrary way. Additionally it allows the producer to take into account phrases, spelling variants, and different contexts. Obviously I haven't looked extensively at this, so I expect there would be exceptions, but to me it seems much less problematic than ngram.

I'm not saying that this all couldn't or shouldn't be automated (it can and should), but just that ngram seems to be the wrong metric to use. 

Vince

unread,
Jan 10, 2022, 4:41:21 PMJan 10
to Standard Ebooks
The only words/phrases under discussion are foreign ones. An obscure English word isn’t italicized either under the current rules or under the new ones.

The presence or absence of an example in M-W doesn’t mean it’s common or uncommon, it just means M-W did or didn’t include an example. An example doesn’t mean it’s common, and the absence of one doesn’t mean it isn’t. Either way, it’s subjective.
The ngrams are objective, based on actual occurrence data.
If we chose to use an exception list in addition, or instead, that list would be objective in practice, as a ruling would already have been issued for the list.

Whatever we do should definitely be automated; we want a producer to be able to do what they’re doing now: look it up in M-W. Only if it’s 1) present, 2) foreign, and 3) still suspect as to “common” would they move on to whatever the se tool was.

Lukas Bystricky

unread,
Jan 10, 2022, 5:15:02 PMJan 10
to Standard Ebooks
Sure I understand the current rule, and that English words shouldn't be italicized. I suppose that if we already have a DB we wouldn't include those words, so that point can be forgotten, but in any case we still need to determine what language a word is and that would require manual work to set up. 

I disagree that the presence or absence of an example has no meaning. I showed earlier how it actually follows the ngram criteria perfectly for the "obvious" cases in Evan's spreadsheet, it's only in the cases that could go either way where there was a difference. Ngrams are objective*, sure, but choosing a cutoff certainly isn't, that's entirely arbitrary. At least with MW someone (or a team) has put some thought into whether or not to include an example on a case-by-case basis. I trust that they have some criteria to determine that. It might be (probably is) subjective, but it's probably internally consistent (for example I don't think it's a coincidence that neither "s'il vous plait" nor "merci beaucoup" has an example phrase, I assume they're part of the same "group" of phrases).

I agree that it should be automated, but I don't think that precludes using MW. The rule I proposed isn't ambiguous.

*even this is only true to an extent. For example the score for "bon" would include hits for its French, English, or Japanese definition, or as part of "bon mot", which are all different things and should be treated differently.  

Matt Chan

unread,
Jan 10, 2022, 5:25:41 PMJan 10
to standar...@googlegroups.com
I still think we are getting too much into the weeds... I think the average reader may not even know that foreign phrases are often italicized in books. The reason why we do so, I think (at least I think it is), is to alert the reader that the phrase they are reading isn't English, correct? Or is there some other reason? Does italicizing these words or phrases etc. make for a better reading experience for the readers? And if so, at what point, when they are done in maybe an inappropriate way (e.g. italicizing words/phrases that are very familiar to the average reader or failing to do so for obscure phrases), do they create a poor reading experience for the reader? That's what we're trying to accomplish and solve, right?


Vince

unread,
Jan 10, 2022, 5:31:20 PMJan 10
to Standard Ebooks
M-W tells us what the language is (in 99.99% of cases), so there’s nothing to be set up. Again, the only things under discussion are foreign words/phrases that are in M-W.
What you showed is the definition of “anecdotal,” which is the opposite of objective, and confusing the two leads to a whole mess of problems (see vaccination status in the US). :) It’s anecdotal because we know nothing about how M-W decides on examples and whether to include them. (Maybe they use ngrams to decide! That would be both ironic and slightly scary.)

But you and I can just agree to disagree on this one and move on.

B Keith

unread,
Jan 10, 2022, 6:10:24 PMJan 10
to Standard Ebooks
I really think we are getting a bit too off course. I think the M-W rule is fine. And if we want to refine it then specify a print edition of M-W.

Merriam-Webster and every other dictionary has a board, and they decide what words to include; the logic will be idiosyncratic to the board. In olden days that might mean years between editions. These days they work hard to keep up online. I have a print copy of the OED which was my “bible” but the last edition was printed in 1989; I doubt they will ever go ahead with the proposed 3rd edition… same thing with the M-W, which I believe was 2004 as Merriam-Webster's Collegiate Dictionary.

Watch Victoria Coren Mitchell’s Balderdash and Piffle if you want to see a light-hearted take on how meanings and words are included in a dictionary (https://youtu.be/oYFLDjmyJ-g)

Anyway this is the Chicago entry. I think it's clear enough and Standard is already doing pretty good…the rest is always going to be opinion:


Chicago 11.3

General Principles 
Words and Phrases from Other Languages 

Non-English words and phrases in an English context. Italics are used for isolated words and phrases from another language, especially if they are not listed in a standard English-language dictionary like Merriam-Webster's Collegiate (see 7.1) or are likely to be unfamiliar to readers (see also 7.54). (For proper nouns, see 11.4.) If such a word or phrase becomes familiar through repeated use throughout a work, it need be italicized only on its first occurrence. If it appears only rarely, however, italics may be retained.

Unless the term appears in a standard English-language dictionary and is being used as such, observe the capitalization conventions of the original language. In the following examples, the German word for computer (which is the same as the English word) is capitalized because it is a noun, and the French adjective française is lowercase even though it would be capitalized in English (as "French"). See also 11.18.


David Grigg

unread,
Jan 10, 2022, 6:48:27 PMJan 10
to Standard Ebooks
Quote: "Italics are used for isolated words and phrases from another language, _especially_ if they are not listed in a standard English-language dictionary like Merriam-Webster's Collegiate (see 7.1) or _are likely to be unfamiliar to readers_." (my emphases).

Says it all, really.

But I'm happy to go with whatever the final consensus is.

Alex Cabal

unread,
Jan 10, 2022, 6:51:21 PMJan 10
to standar...@googlegroups.com
We're not arguing that point, rather we want to discuss if the rule can
be refined in some way to assist producers who don't know what is or
isn't unfamiliar to readers.


Matt Chan

unread,
Jan 10, 2022, 6:55:41 PMJan 10
to standar...@googlegroups.com
To Alex's most recent point: In that case, maybe keep the M-W rules in general (for producers of all varying experience), but keep a list of exceptions that we can accumulate over time for reviewers (who tend to be more experienced)? Down the road an automatic tool can be built akin to modernized-spelling?


David Grigg

unread,
Jan 10, 2022, 7:01:32 PMJan 10
to standar...@googlegroups.com
Well, I'm circling back to the MW rule because of its simplicity. (Yes, I know I'm backtracking)

But what about this: 

1) If it's a foreign word or phrase which ISN'T in MW, italicise and semanticate it with xml:lang.
2) If it's a foreign word or phrase which IS in MW, remove italics but semanticate it in a <span> tag.
3) The reviewer checks both italics and semanticated span tags and makes a judgement call as to whether to convert any of the latter into italic.
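The three steps above can be sketched in Python as follows. This is only an illustration under assumptions: the function names are hypothetical, the markup is reduced to a bare `<i>` or `<span>` with an `xml:lang` attribute, and the regular expression only recognizes spans written in exactly that form.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Matches only the simple spans produced by mark_up_foreign() below.
SPAN_RE = re.compile(r'<span xml:lang="([^"]+)">(.*?)</span>')


def mark_up_foreign(phrase: str, lang: str, in_mw: bool) -> str:
    """Steps 1 and 2: <i> if the phrase is not in M-W, <span> if it is."""
    tag = "span" if in_mw else "i"
    return f"<{tag} xml:lang={quoteattr(lang)}>{escape(phrase)}</{tag}>"


def spans_for_review(xhtml: str) -> list:
    """Step 3: list (language, text) pairs for the reviewer to judge."""
    return SPAN_RE.findall(xhtml)


if __name__ == "__main__":
    line = (
        "She walked along the "
        + mark_up_foreign("plage", "fr", in_mw=True)
        + " and said "
        + mark_up_foreign("merci beaucoup", "fr", in_mw=False)
        + "."
    )
    print(line)
    print(spans_for_review(line))
```

Because the unitalicized phrases still carry a language tag, the reviewer can find every candidate mechanically and only has to make the judgment call, not hunt for the phrases.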

tadra...@gmail.com

unread,
Jan 10, 2022, 8:58:58 PMJan 10
to Standard Ebooks
As a reader, I think what's most important is internal consistency within each book--not consistency across books. I wouldn't expect that any more than I do for the various print books I read.

Lukas Bystricky

unread,
Jan 11, 2022, 1:46:19 AMJan 11
to Standard Ebooks
That's true, and I think for advanced producers that's enough of a guideline. What we want for now (as I understand it) is a rule "on paper" for beginners to follow and then more advanced producers could treat it more as a guideline and make their own exceptions to taste.

As for getting into the weeds, I didn't bring up "bon" or "bon mot" because I especially care either way what happens to them (that's not entirely true, I do have a preference, but I recognize that opinions differ). Those were meant to be illustrative of a more general problem, namely that the same word can have multiple meanings, some of which we might want to italicize and others not depending on context. Another good example is "gymnasium", which could be either English or German. How could we assign an ngram score to specifically the German case? Again I don't especially care about this particular word, but I think these are illustrative examples of things that need to be considered before changing the rule. I imagine that as we get started with this project examples like these might turn out to be not uncommon.

Asher Smith

unread,
Jan 11, 2022, 5:55:03 AMJan 11
to Standard Ebooks
I think my primary problem with relying on ngram is that it is dependent on the modern usage of the word/phrase, not the sum total of historical usage. Phrases that entered English, were common, and then declined should, I believe, be treated similarly to how we treat archaic English words: once they have attained a status where they are considered English enough to not be italicised, I don't think they should lose that. I can't imagine someone looking at our corpus a century from now and deciding that since some of the phrases we've left unitalicised have fallen out of use, they should have italics added back in. MW is good for a lot of reasons, one of which is that they don't remove words.

FWIW, I remember running into a large number of foreign words on one production I did where MW had definitions for them, but listed them as being in another language (e.g. 'Latin Phrase'), and the advice I was given at the time was that words considered by MW as being in another language could be considered by us as being in another language. I like a lot of the above ideas, but I think I'd suggest including the following nuances to the criterion of whether it's included in MW:
  • Where the text in question is a phrase, not a single word, italicising the whole thing helps the reader to consider it as a single unit, which is a good thing. It additionally avoids the situation in which you read a word or two that have English meanings, get to a non-English word, and then have to backtrack and retroactively consider the preceding words as part of a non-English phrase.
  • Where the text in question is listed in MW as being in another language (e.g. 'French phrase'), consider it to be in another language. When it is listed with a normal part of speech, consider it English.
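The second point could be sketched as a small check on the label M-W shows next to an entry. This is a minimal sketch under assumptions: the label is typed in by the producer (no M-W lookup is performed), and the two word lists and the function name are hypothetical and incomplete.

```python
from typing import Optional

# Hypothetical, incomplete lists for illustration only.
ENGLISH_PARTS_OF_SPEECH = {
    "noun", "verb", "adjective", "adverb",
    "preposition", "conjunction", "interjection", "pronoun",
}
LANGUAGE_NAMES = {"french", "latin", "german", "italian", "spanish"}


def treat_as_foreign(mw_label: Optional[str]) -> bool:
    """Decide from an M-W label whether to treat an entry as non-English.

    mw_label is None when the word or phrase is not in M-W at all.
    """
    if mw_label is None:
        # Current rule: not in M-W means italicize as foreign.
        return True
    label = mw_label.strip().lower()
    if label in ENGLISH_PARTS_OF_SPEECH:
        # A normal part of speech: consider it English.
        return False
    words = label.split()
    # Labels such as "French phrase" or "Latin quotation" start with a language.
    return bool(words) and words[0] in LANGUAGE_NAMES
```

With this check, a label like "French phrase" is treated as foreign, while an entry labeled simply "noun" is treated as English regardless of how rare it is.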

Vince

unread,
Jan 11, 2022, 2:17:31 PMJan 11
to Standard Ebooks
Ngram shows usage over time, by default from 1800 to today. We could pick any time we want, use the max, whatever. So using ngrams doesn’t restrict us to current usage.
However, “common usage” is by definition common today; just as we’re removing italics from words that were italicized 100 years ago, the opposite could be true 100 years from now.
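To illustrate the idea of picking a time window and using the maximum, here is a minimal Python sketch. It assumes the yearly frequencies have already been obtained and stored in a dictionary keyed by year; the function name is hypothetical and the numbers in the usage example are made up, not real n-gram data.

```python
from typing import Dict, Optional


def peak_frequency(
    series: Dict[int, float],
    start: int = 1800,
    end: Optional[int] = None,
) -> float:
    """Return the highest yearly frequency between start and end, inclusive."""
    values = [
        frequency
        for year, frequency in series.items()
        if year >= start and (end is None or year <= end)
    ]
    # An empty window yields 0.0 rather than raising an error.
    return max(values, default=0.0)


if __name__ == "__main__":
    # Made-up numbers for a phrase that was once common and then declined.
    made_up = {1850: 3.0, 1900: 8.0, 2000: 1.0}
    print(peak_frequency(made_up))              # whole range
    print(peak_frequency(made_up, start=1950))  # modern usage only
```

Comparing the peak to a cutoff would keep a once-common phrase unitalicized even after its usage declined, while restricting the window to recent years would reflect only current usage.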

Adding “if it isn’t listed as a non-English phrase” is an intriguing option. It would italicize merci beaucoup and s'il vous plait, which I personally agree with but I suspect others wouldn’t; it would handle panem et circenses (“Latin quotation”), and it would not italicize most of the list of five-zero ngrams I listed earlier that fall in category 2.

However, it would leave plage and haute école unitalicized. Which just shows that whatever method we use, we’re always going to have exceptions, it’s just a matter of how we want to handle those (case-by-case, keep a list for an se tool that’s accessible by producers, etc.). But adding that qualification might indeed help reduce the list of exceptions.

Lukas Bystricky

unread,
Jan 11, 2022, 2:56:04 PMJan 11
to Standard Ebooks

Agreed, I think it's almost a no-brainer to say that if MW explicitly tells us a phrase is foreign then we should treat it as such. It seems like Asher (and maybe others) were already told to do that anyways.

Vince is right to point out that unfortunately (as far as I can tell) MW only lists phrases as foreign so we would need some other criteria if we decide we want to italicize certain words. I think I've made my objections to ngram known already so I won't repeat myself ;-)