Question on sample size


Paul Evans

Jun 13, 2018, 8:16:54 PM
to computation...@googlegroups.com
All,
I've been advised that 3,000 words is the minimum sample size threshold for authorship attribution. I don't know if that's just a rule of thumb, or if there's a strong math-statistics argument for it. If any of you have looked into this question, I'd appreciate pointers to a relatively accessible discussion of the issue.

In any event, I note that the sample sizes in the case of the Federalist, the classic demonstration of statistical authorship attribution, fall well under 3,000 words, even taken as an average, and no one seems to question its validity. (There are 86 samples containing 194,989 words, so the average sample size is 2,267.3 words. The smallest individual sample -- Federalist 13 -- is 985 words.)
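The averages quoted above can be checked in a couple of lines of Python, using only the counts given in this message:

```python
# Federalist corpus figures as stated above.
total_words = 194_989
n_samples = 86

avg = total_words / n_samples
print(round(avg, 1))  # average sample size in words -> 2267.3
```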

Thanks,
Paul Evans

PS Every year I use stylo to do a demo for my students (and not infrequently faculty) replicating Mosteller and Wallace's attribution of the authorship of the disputed numbers of the Federalist. (See https://github.com/decretist/Federalist). It's a wonderful way to introduce the concept and show off the capabilities of stylo.

David L. Hoover

Jun 13, 2018, 9:23:29 PM
to computation...@googlegroups.com
Hi Paul,

Maciej should weigh in, but 3000 isn't magic or statistically required.
Lots of people use 2000-word samples, and Burrows worked with 1000-word
poems. I've had good results on texts as short as 500 words, and even
shorter. The rule of thumb is that results typically weaken as the
samples get smaller. The size answer is also different for different
languages.
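This rule of thumb can be illustrated with a toy simulation. The sketch below is not stylo and uses no real corpus: the two "authors", their function-word weights, and the simplified Delta-style distance are all invented for illustration. It only demonstrates the qualitative effect that attribution accuracy tends to fall as samples shrink.

```python
import random
from collections import Counter

random.seed(42)

# Two invented "authors" whose function-word habits differ slightly.
VOCAB = ["the", "of", "and", "to", "in", "a", "that", "is", "was", "it"]
WEIGHTS = {
    "A": [30, 20, 15, 10, 8, 6, 4, 3, 2, 2],
    "B": [25, 22, 12, 13, 7, 8, 5, 3, 3, 2],
}

def sample_text(author, n_words):
    """Draw n_words tokens according to the author's word weights."""
    return random.choices(VOCAB, weights=WEIGHTS[author], k=n_words)

def rel_freqs(words):
    """Relative frequency of each VOCAB word in a token list."""
    counts = Counter(words)
    return [counts[w] / len(words) for w in VOCAB]

def distance(x, y):
    """Mean absolute difference of relative frequencies (a crude
    stand-in for Burrows' Delta, with no z-scoring)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Author profiles estimated from long reference samples.
profiles = {a: rel_freqs(sample_text(a, 50_000)) for a in WEIGHTS}

def accuracy(chunk_size, trials=200):
    """Fraction of random chunks attributed to the right author."""
    correct = 0
    for _ in range(trials):
        true_author = random.choice(list(WEIGHTS))
        test_profile = rel_freqs(sample_text(true_author, chunk_size))
        guess = min(profiles, key=lambda a: distance(test_profile, profiles[a]))
        correct += guess == true_author
    return correct / trials

for size in (100, 500, 2000, 5000):
    print(size, accuracy(size))
```

With these invented weights, accuracy climbs toward 1.0 as the chunk size grows; making the two "authors" more similar shifts the whole curve toward larger samples, which matches the point that the required size differs from problem to problem.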

By the way, when I've tried the Federalist, the disputed #55 almost
always goes to Hamilton.

Best,

David Hoover



--
David L. Hoover, Professor of English, NYU
212-998-8832 244 Greene Street, Room 409
http://wp.nyu.edu/davidlhoover

While one who sings with his tongue on fire / Gargles in the rat race choir
Bent out of shape from society's pliers / Cares not to come up any higher
But rather get you down in the hole / That he's in
But I mean no harm nor put fault / On anyone that lives in a vault
But it's alright, Ma, if I can't please him --Bob Dylan, "It's All Right, Ma," '65

Maciej Eder

Jun 14, 2018, 12:41:21 AM
to computationalstylistics
Dear Paul,

David is right – there is no single answer to the question of sample size. In my previous study (https://doi.org/10.1093/llc/fqt066) I claimed that 5,000 words per sample is a safe amount for most languages, and 3,000 words for the most robust ones.

Here's a recent paper of mine in which I try to show that it hugely depends on your problem: https://dh2017.adho.org/program/abstracts/ (the paper itself is here: https://dh2017.adho.org/abstracts/341/341.pdf ). It turns out that for some authors, 100 words in a sample (!) is enough to get a very strong signal (e.g. William Morris), whereas in some other cases 10,000 is still not enough (e.g. Elizabeth Gaskell or Virginia Woolf). Moreover, the problem also depends on the number of candidates to be tested: to tell apart Arthur Conan Doyle and Agatha Christie, a few hundred words per sample should be enough, while to distinguish Conan Doyle from a pool of Christie, Chesterton, Chandler, and a few other crime-story authors, you'd need much longer samples.

I promised myself to provide a function in 'stylo' to test the robustness of an input sample with respect to sample size. I still have to brush it up – maybe in a couple of weeks.
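Until such a function exists, one way of probing sample-size robustness can be sketched as follows. This is a hypothetical Python illustration on synthetic data, not the planned stylo function: repeatedly subsample n tokens from a text and measure how far the subsample's most-frequent-word profile drifts from the full-text profile.

```python
import random
from collections import Counter

random.seed(1)

def mfw_profile(words, vocab):
    """Relative frequencies of the given vocabulary in a token list."""
    counts = Counter(words)
    return [counts[w] / len(words) for w in vocab]

def profile_stability(words, sample_size, vocab_size=50, trials=100):
    """Mean absolute deviation between random subsample profiles and
    the full-text most-frequent-word profile; smaller deviation means
    the sample size is 'safer' for this text."""
    vocab = [w for w, _ in Counter(words).most_common(vocab_size)]
    full = mfw_profile(words, vocab)
    deviations = []
    for _ in range(trials):
        sub = random.sample(words, sample_size)
        p = mfw_profile(sub, vocab)
        deviations.append(sum(abs(a - b) for a, b in zip(p, full)) / len(vocab))
    return sum(deviations) / trials

# Synthetic 20,000-token text with a Zipf-like vocabulary.
text = random.choices([f"w{i}" for i in range(200)],
                      weights=[1 / (i + 1) for i in range(200)],
                      k=20_000)

for n in (100, 1000, 5000):
    print(n, round(profile_stability(text, n), 4))
```

The deviation shrinks roughly with the square root of the sample size, so comparing it across sizes for a given text gives a rough, text-specific robustness curve.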

All the best,
Maciej


