IOError: Expected '\r\n', found 'WARC/1.0\r\n'
Basically, I am trying to iterate over the records of a news WARC file to extract the HTML content and process it. I am using the Python warc package.
Snippet to read the WARC file:
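import warc

# Open the WARC file and iterate over its records
f = warc.open("CC-NEWS-20161001224340-00008.warc")
for record in f:
    # Only HTTP response records carry the fetched page
    if record['Content-Type'] == 'application/http; msgtype=response':
        payload = record.payload.read()
        # Split the HTTP headers from the body at the first blank line
        headers, body = payload.split('\r\n\r\n', 1)
        if 'Content-Type: text/html' in headers:
            pass  # do my processing with html content (body)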
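For reference, the header/body split used in the loop can be isolated and checked without any WARC file. This is only a minimal sketch, assuming the payload is the raw HTTP response as a byte string; `split_http_payload` and `is_html` are hypothetical helper names for illustration, not part of the warc package.

```python
def split_http_payload(payload):
    """Split a raw HTTP response into (headers, body) at the first blank line."""
    head, sep, body = payload.partition(b"\r\n\r\n")
    if not sep:
        # No blank line found: treat the whole payload as headers.
        return payload, b""
    return head, body


def is_html(headers):
    """Return True if the HTTP headers declare a text/html Content-Type."""
    for line in headers.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-type":
            return value.strip().lower().startswith(b"text/html")
    return False


# Hand-written sample response (assumption: not taken from a real WARC record)
sample = (b"HTTP/1.1 200 OK\r\n"
          b"Content-Type: text/html; charset=utf-8\r\n"
          b"\r\n"
          b"<html><body>hello</body></html>")
headers, body = split_http_payload(sample)
```

Compared with a plain substring test on the headers, matching the header name case-insensitively also catches servers that send `content-type:` in lower case, and `partition` does not raise when a payload has no blank line.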
But when I run the snippet that reads the WARC file, I get this error:
Traceback (most recent call last):
  File "warc_process.py", line 69, in <module>
    read_entire_warc("CC-NEWS-20160926211809-00000.warc")
  File "warc_process.py", line 54, in read_entire_warc
    for record in f:
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 360, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found 'WARC/1.0\r\n'
Sample WARC files I am facing this issue with:
CC-NEWS-20160926211809-00000.warc
CC-NEWS-20161001122244-00007.warc.gz
CC-NEWS-20161001224340-00008.warc.gz
CC-NEWS-20161002224346-00009.warc.gz
CC-NEWS-20161003130443-00010.warc.gz
CC-NEWS-20161004130444-00011.warc.gz
CC-NEWS-20161005130450-00012.warc.gz
CC-NEWS-20161005152607-00013.warc.gz
CC-NEWS-20161006152607-00014.warc.gz
CC-NEWS-20161006191324-00015.warc.gz
CC-NEWS-20161007191326-00016.warc.gz
CC-NEWS-20161008015559-00017.warc.gz
CC-NEWS-20161009015614-00018.warc.gz
CC-NEWS-20161010001731-00019.warc.gz