Can a MIME type error have an influence on WARC file content?


Vincent Boiteau-Robert

Aug 1, 2016, 1:31:19 PM
to Common Crawl

Hi,


I'm working on a Gradle script that uses WAT files to locate WARC records. The records are filtered by MIME type. I want to get PowerPoint .pptx files, whose MIME type is application/vnd.openxmlformats-officedocument.presentationml.presentation. The extracted files all open with the same black background and the same font: small, white, and pixelated.


While testing and developing, I noticed that the MIME type in the WAT JSON is actually application/vnd.openxmlformats-officedocument.presentationml.persentation: the e and r are reversed. I assume this shouldn't affect how the software and OS read the file, but I'm just making sure.

I also tested with .ppt files, and those open normally with what seem to be their normal styles.

Thanks for your help.


tl;dr
Does the Content-Type header influence how the crawler processes the response body? Should I worry if a Content-Type is misspelled?



Sebastian Nagel

Aug 1, 2016, 5:37:43 PM
to common...@googlegroups.com
Hi Vincent,


> I'm working on a Gradle script that use WAT file to get WARC item. The items are filtered by mime type.
Have a look at the Common Crawl index. If filtering WARC records by MIME type is all you want to do,
the index may be more efficient.
  http://index.commoncrawl.org/
Index files are available on AWS S3 at
  s3://commoncrawl/cc-index/collections/
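
To illustrate the index-based filtering, here is a minimal Python sketch. It assumes each index line has the form "urlkey timestamp {JSON}" and that the JSON object carries "mime", "filename", "offset" and "length" fields; check this against the actual index files before relying on it. The function names and the MISSPELLED_PPTX_MIME constant are made up for this example. Since the MIME type is whatever the server sent, the misspelled variant is matched as well.

```python
import json

PPTX_MIME = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
# The misspelled variant reported in this thread ("persentation").
MISSPELLED_PPTX_MIME = "application/vnd.openxmlformats-officedocument.presentationml.persentation"


def parse_cdx_line(line):
    """Split one index line into (urlkey, timestamp, fields).

    Assumed layout: "<urlkey> <timestamp> <JSON object>".
    """
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def filter_by_mime(lines, wanted):
    """Yield (filename, offset, length) for records whose 'mime' is in wanted."""
    wanted = {m.lower() for m in wanted}
    for line in lines:
        try:
            _, _, fields = parse_cdx_line(line)
        except ValueError:
            continue  # skip lines that do not match the assumed layout
        if fields.get("mime", "").lower() in wanted:
            # filename/offset/length locate the record inside a WARC file
            yield fields["filename"], int(fields["offset"]), int(fields["length"])
```

The yielded filename, offset and length are what a ranged request against the WARC file would need, so no WAT file has to be read for this kind of filter.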


> The files all open with the same background color, black and same font format, white little and pixelated.
> ... is actually application/vnd.openxmlformats-officedocument.presentationml.persentation, the e and r are reversed.
In the WARC file the "MIME type" is just what the server sends as "Content-Type". Afaik, it isn't changed
in the WAT files and the index.

The MIME type field contains even more noise and garbage than trivial spelling errors. Have a look at the output of the following command:
  aws s3 cp s3://commoncrawl/crawl-analysis/CC-MAIN-2016-26/stats/part-00000.gz - | gzip -cd | grep '^\["mimetype"'

Integrating reliable MIME type detection into the WARC/WAT/WET processing tools is a TODO,
among others such as language identification.
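
Until then, a client can check the payload itself instead of trusting the Content-Type. The following Python sketch is not what the Common Crawl tools do; it relies on the fact that a .pptx file is a ZIP archive and assumes the typical layout with "[Content_Types].xml" and "ppt/presentation.xml" entries. The function name looks_like_pptx is made up for this example.

```python
import io
import zipfile


def looks_like_pptx(payload: bytes) -> bool:
    """Guess from the bytes alone whether a response body is a .pptx file."""
    # ZIP archives normally start with the local file header signature.
    if not payload.startswith(b"PK\x03\x04"):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        # Not a readable ZIP, e.g. corrupted or cut short.
        return False
    # Assumed typical layout of a PowerPoint Open XML package.
    return "[Content_Types].xml" in names and "ppt/presentation.xml" in names
```

Note that a payload which was cut short fails this check, because the ZIP directory sits at the end of the file.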

Btw., a Google query for the exact misspelled "...ml.persentation" suggests that these documents are
sent by hacked servers.


> Is the header Content-Type influence how the crawler process response body?
This may only influence the WAT/WET generation. The WARC files contain the response payload as received
from the server. The generator is fairly robust, but it cannot be ruled out that a bad Content-Type
in the HTTP header has a negative impact. Check the code at
  https://github.com/commoncrawl/ia-hadoop-tools/
  https://github.com/commoncrawl/ia-web-commons/
WAT/WET files are generated from the WARC files by
  https://github.com/commoncrawl/ia-hadoop-tools/blob/master/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java

Best,
Sebastian



Vincent Boiteau-Robert

Aug 1, 2016, 7:03:04 PM
to Common Crawl
Thanks Sebastian for the helpful answer.

Vincent