Initial investigation of PDFs in the most recent Common Crawl crawl

Tim Allison

Sep 27, 2021, 4:23:23 PM
to Common Crawl
All,
  Apologies for the self-serving post, but I'm giving a talk at the PDF Association's PDF Days this Wednesday on an initial analysis of PDFs in the most recent crawl[0].  Registration is free.  Please join if you have an interest and the time.
  Many thanks for all you do!

  Cheers,

        Tim

Sebastian Nagel

Sep 27, 2021, 4:43:40 PM
to common...@googlegroups.com
Hi Tim,

thanks! Amazing! I've registered for the talk.

Sebastian

Greg Lindahl

Sep 27, 2021, 11:52:47 PM
to common...@googlegroups.com
On Mon, Sep 27, 2021 at 01:23:23PM -0700, Tim Allison wrote:

> Apologies for the self-serving post, but I'm giving a talk at the PDF
> Association's PDF Days this Wednesday on an initial analysis of PDFs in the
> most recent crawl[0]. Registration is free. Please join if you have an
> interest and the time.

Tim,

This sounds like a great talk, and if you eventually post the slides,
please also mention that here!

In 2010, Dan Kaminsky asked me for a web-wide list of Microsoft Office
documents. His goal was similar to what you're doing. He doesn't
appear to have published any results from the project, alas! He always
seemed to have 100 times as many great ideas as he had time to
finish.

-- greg


Aaron Kempf

Nov 2, 2021, 5:29:53 AM
to Common Crawl
Wait, this dataset lists all the PDFs on the internet? Holy crap, I need to download this data.