detected mimes?


Tim Allison

Sep 3, 2021, 10:52:58 AM
to Common Crawl
I recently extracted info on ~pdfs from the CC-MAIN-2021-31 index files.

I found some odd "detected" mimes.  Did Tika actually return these values?  Several make sense, like "application/octet-stream", but others are odd...like "pdf" or "application/x-pdf".

This is a select on disagreements between http-mime and detected_mime where detected_mime <> 'application/pdf'

mime                     detected_mime                       count
"application/pdf" "application/octet-stream" 16514
"application/pdf" "pdf" 3049
"application/pdf" "application/download" 814
"application/pdf" "application/x-msdownload" 210
"application/pdf" "unk" 142
"application/pdf" "application/force-download" 117
"application/pdf" "application/octetstream" 85
"application/pdf" "application/postscript" 76
"application/pdf" "application/x-pdf" 70
"application/pdf" "image/ipeg" 59
"application/pdf" "file/unknown" 49
"application/pdf" "1" 46
"application/pdf" "application/save-as" 42
"application/pdf" "binary/octet" 28
"application/pdf" "application/save" 25
"application/pdf" "application/x-octet-stream" 17
"application/pdf" "applicatio" 13
"application/x-pdf" "application/octetstream" 11
"application/pdf" "application/x-download" 9
"application/pdf" "application%2fpdf" 9
"application/pdf" "text/html" 9
"application/pdf" "application/force-download,application/octet-stream,application/download,application/pdf" 8
"application/pdf" "applicaton/octet-stream" 8
"application/pdf" "undefined" 8
"application/pdf" "file" 7
"application/pdf" "pharmig_rare_diseases_covid_19_umfrage.pdf" 5
"application/pdf" "20210608-schnell-digital-und-barrierefrei-lehren-aus-der-covid-19-pandemie-fuer-die-versorgung-von-menschen-mit-seltenen-erkrankungen.pdf" 5
"application/pdf" "application/vnd.ms-excel" 4
"application/pdf" "application/*" 4
"application/pdf" "application/.pdf" 4
"pdf" "application/x-pdf" 4
"application/pdf" "application/x-unknown" 3
"application/pdf" "application/unknown" 3
"application/pdf" "image/jpeg" 3
"application/pdf" "''" 3
".pdf" "application/x-pdf" 2
"application/pdf" "application/zip" 2
"application/pdf" "binary/octet-stream" 2
"application/pdf" "content-type:" 2
"application/pdf" "application/x-troff-man" 2
"application/pdf" "octet/stream" 2
"application/pdf" "application/msword" 2
"application/pdf" "adobe/pdf" 2
"application/pdf" "{application/pdf}" 2
"image/pdf" "application/x-pdf" 2
"application/pdf" "pharmig-academy_phv_modul-5_folder_neu.pdf" 1
"application/pdf" "download-datei" 1
"application/pdf" "application-x/force-download" 1
"application/pdf" "text/plain" 1
"application/pdf" "ai_tdb_agentur-checkliste_v02.pdf" 1
"pdf" "application/octet-stream" 1
"application/pdf" "unknown" 1
"text/pdf" "force-download" 1
"application/pdf" "mime/type" 1
"application/pdf" "octet-stream" 1
"application/pdf" "image/jpg" 1
"application/pdf" "application/pdf,charset=utf-8" 1
"application/pdf" "pharmig-academy_parallelhandel_folder.pdf" 1
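
For readers who want to reproduce this kind of tally, the select described above might look roughly like the sketch below. It is not the query that was actually run: the table name `pdf_records`, the SQLite backend, and the sample rows are assumptions for illustration only; the `mime` and `detected_mime` column names follow the table header above.

```python
import sqlite3


def mime_disagreements(conn, min_count=1):
    """Count (mime, detected_mime) pairs where the two values disagree and
    detected_mime is not 'application/pdf', most frequent first."""
    sql = """
        SELECT mime, detected_mime, COUNT(*) AS cnt
        FROM pdf_records                      -- hypothetical table name
        WHERE mime <> detected_mime
          AND detected_mime <> 'application/pdf'
        GROUP BY mime, detected_mime
        HAVING COUNT(*) >= ?
        ORDER BY cnt DESC, mime, detected_mime
    """
    return conn.execute(sql, (min_count,)).fetchall()


def demo_connection():
    """Build a tiny in-memory table with made-up rows (not real index data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pdf_records (url TEXT, mime TEXT, detected_mime TEXT)"
    )
    rows = [
        ("u1", "application/pdf", "application/pdf"),
        ("u2", "application/octet-stream", "application/pdf"),
        ("u3", "application/pdf", "application/octet-stream"),
        ("u4", "application/pdf", "application/octet-stream"),
        ("u5", "application/pdf", "pdf"),
    ]
    conn.executemany("INSERT INTO pdf_records VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    for mime, detected, cnt in mime_disagreements(demo_connection()):
        print(mime, detected, cnt)
```

Raising `min_count` keeps only the more frequent pairs, which helps when the long tail of one-off values is not of interest.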

Tom Morris

Sep 3, 2021, 12:06:50 PM
to common...@googlegroups.com
Could the columns be swapped? That would make more sense.

Tom


Tim Allison

Sep 3, 2021, 12:28:40 PM
to Common Crawl
I thought about that, and it is possible I've botched something!
These are the results (occurrences > 4) when I remove the condition detected <> 'application/pdf':


http-mime detected-mime cnt
application/pdf application/pdf 8134909
application/octet-stream application/pdf 145722
text/html application/pdf 22901
application/pdf application/octet-stream 16514
application/download application/pdf 14011
application/force-download application/pdf 12740
unk application/pdf 11460
content-type: application/pdf 7153
pdf application/pdf 7109
application/x-download application/pdf 6078
application/pdf pdf 3049
binary/octet-stream application/pdf 2166
application/x-pdf application/pdf 2077
application/unknown application/pdf 1692
application/postscript application/pdf 1459
application/binary application/pdf 1417
file/unknown application/pdf 1400
application/x-msdownload application/pdf 1252
application/octetstream application/pdf 1223
('application/pdf', application/pdf 1172
text/plain application/pdf 1014
file application/pdf 834
application/pdf application/download 814
application/x-unknown application/pdf 724
force-download application/pdf 596
null application/pdf 524
application/save application/pdf 479
application application/pdf 475
application/save-as application/pdf 428
application/application-pdf application/pdf 402
application/.pdf application/pdf 375
multipart/form-data application/pdf 361
application/octet-stream,charset=utf-8 application/pdf 348
application/pdf: application/pdf 339
application/force-download,application/octet-stream application/pdf 337
octet/stream application/pdf 317
x-type/subtype application/pdf 259
application/octet application/pdf 253
application/pdf' application/pdf 229
.pdf application/pdf 219
application/pdf application/x-msdownload 210
application/ application/pdf 204
application/adobe application/pdf 183
application/application/pdf application/pdf 151
application/zip application/pdf 143
application/pdf unk 142
application/illustrator application/pdf 140
application/octer-stream application/pdf 120
application/pdf application/force-download 117
application/force-download,application/octet-stream,application/download,application/pdf application/pdf 110
text/pdf application/pdf 106
application/vnd.oasis.opendocument.text application/pdf 99
applicatoin/pdf application/pdf 98
$mimetype application/pdf 93
0 application/pdf 91
applicaton/octet-stream application/pdf 90
application/pdf application/octetstream 85
application/octect-stream application/pdf 84
application/x-forcedownload application/pdf 81
mime/type application/pdf 81
image/pdf application/pdf 78
application/pdf application/postscript 76
{$post->post_mime_type} application/pdf 76
application/x-pkcs7-mime application/pdf 71
application/x-force-download application/pdf 70
application/pdf application/x-pdf 70
application/x-octetstream application/pdf 69
dasdarius.de-supercontent application/pdf 69
application/acrobat application/pdf 67
misc application/pdf 60
application/pkcs7-mime application/pdf 59
application/pdf image/ipeg 59
octet-stream application/pdf 59
pb application/pdf 57
application/x-octet-stream application/pdf 52
application/pdf file/unknown 49
a application/pdf 48
application-download application/pdf 48
application/pdf 1 46
application/octet-stream+ application/pdf 43
application/pdf application/save-as 42
application/ai application/pdf 41
application-x/force-download application/pdf 38
application/stream application/pdf 38
datenblatt_01 application/pdf 37
application/x-some-explanation application/pdf 37
application/forcedownload application/pdf 35
application-msdownload application/pdf 35
text application/pdf 35
application/files application/pdf 33
application/x-extension-pdf application/pdf 31
system.net.mime.mediatypenames.application.octet application/pdf 31
application/* application/pdf 31
' application/pdf 30
image/jpg application/pdf 30
application/pdf binary/octet 28
application/octet-pdf application/pdf 28
application/x-zip-compressed application/pdf 26
application/msword application/pdf 26
doesn/matter application/pdf 26
application/pdf application/save 25
applicationpdf application/pdf 25
'application/octet-stream' application/pdf 24
application/pdf,application/pdf application/pdf 24
application/hypershop application/pdf 23
document/pdf application/pdf 23
application/oct-stream application/pdf 22
application/csv application/pdf 21
.* application/pdf 20
application/mdb application/pdf 20
application/pdf,charset=utf-8 application/pdf 20
application/x-unknown-content-type application/pdf 19
application/docx application/pdf 19
application/octet-binary application/pdf 19
image/jpeg application/pdf 18
application/download\n application/pdf 18
ms_excel application/pdf 18
appliction/octet-stream application/pdf 18
system.io.fileinfo application/pdf 18
application/pdf application/x-octet-stream 17
betriebsanleitung_01 application/pdf 16
force-download/liveserver application/pdf 15
$tipoarchivo application/pdf 15
application/octec-stream application/pdf 14
application/oc-stream application/pdf 14
application/smime application/pdf 14
'' application/pdf 13
application/pdf applicatio 13
$contenttype application/pdf 13
file/download application/pdf 13
text/x-pascal application/pdf 12
unknown/unknown application/pdf 12
mime application/pdf 12
application/x-www-form-urlencoded application/pdf 11
application/proses application/pdf 11
application/vnd.ms-excel application/pdf 11
application/pdfapplication/pdf application/pdf 11
application/pdf, application/pdf 11
directory application/pdf 11
application/x-pdf application/octetstream 11
application/rfc822 application/pdf 10
application/xls application/pdf 10
array application/pdf 10
applicaiton/download application/pdf 10
video/mp4 application/pdf 10
/ application/pdf 9
application/pdf application/x-download 9
xxx/xxx application/pdf 9
application/ms-download application/pdf 9
$filetipus application/pdf 9
unknown application/pdf 9
application/pdf text/html 9
document/unknown application/pdf 9
appliation/octet-stream application/pdf 9
application/x-troff-man application/pdf 9
application/pdf application%2fpdf 9
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet application/pdf 9
application/pdf 8
application/asp-unknown application/pdf 8
application/image application/pdf 8
application/multipart application/pdf 8
application/pdf application/force-download,application/octet-stream,application/download,application/pdf 8
application/pdf applicaton/octet-stream 8
application/pdf undefined 8
application/vnd.openxmlformats-officedocument.wordprocessingml.document application/pdf 8
application/x-ms-download application/pdf 8
application/x-unknown-file application/pdf 8
download/application/pdf application/pdf 8
filecontenttype application/pdf 8
image/gif application/pdf 8
x-application/pdf application/pdf 8
binary application/pdf 7
download application/pdf 7
application/x-empty application/pdf 7
application/file application/pdf 7
application\octet-stream application/pdf 7
application/vnd.adobe.pdf application/pdf 7
application/unkown application/pdf 7
application/smnet application/pdf 7
application/x-unknown-application-octet-stream application/pdf 7
contenttype application/pdf 7
application/doc application/pdf 7
binary/octet application/pdf 7
httpd/unix-directory application/pdf 7
1 application/pdf 7
application/pdf file 7
application/ofx application/pdf 7
'.application/pdf.' application/pdf 6
application/text-plain:formatted application/pdf 6
application/x-tgif application/pdf 6
image application/pdf 6
application/vnd.dvb.ait application/pdf 6
konformitätserklärung application/pdf 6
application/pda application/pdf 6
applicaton/pdf application/pdf 6
text/mixed application/pdf 6
applicationx/pdf application/pdf 6
application/x application/pdf 6
application/timestamped-data application/pdf 5
content/pdf application/pdf 5
charset=utf-8 application/pdf 5
application/x-unknown-application-pdf application/pdf 5
application/application/octet-stream application/pdf 5
application/pdf pharmig_rare_diseases_covid_19_umfrage.pdf 5
produktbroschüre_01 application/pdf 5
pdf/application application/pdf 5
application/pdf 20210608-schnell-digital-und-barrierefrei-lehren-aus-der-covid-19-pandemie-fuer-die-versorgung-von-menschen-mit-seltenen-erkrankungen.pdf 5
adobe/pdf application/pdf 5
application/pdf/force-download application/pdf 5
image/* application/pdf 5
file/pdf application/pdf 5
application%2fpdf application/pdf 5

Tim Allison

Sep 3, 2021, 12:31:44 PM
to Common Crawl
Maybe a race condition in my ingest code?  Let me look at the actual indexed records...

Tim Allison

Sep 3, 2021, 1:51:43 PM
to Common Crawl
User error... my SQL call was joining on the wrong table... time to put down the keyboard for the weekend...

Sorry and thank you!

Kaleem Ullah

Sep 21, 2021, 3:41:56 PM
to Common Crawl
Hello all members, how can I integrate Common Crawl into my website?