The above-mentioned error has been solved. The problem was with the latest versions of the libraries used in news-please. I installed all the Python packages listed in requirements.txt, and it is working fine now.
I have another question now.
I want to download the WARC files of a specific time interval only, e.g.:
start_date = '2020-03-01 00:00:00'
end_date = '2020-03-20 06:00:00'
This time interval contains only 2 or 3 WARC files.
I want only the news data that was crawled in these 6 hours.
Here are the changes I made.
In commoncrawl_crawler.py, I replaced
warc_dates = __iterate_by_month(warc_files_start_date, datetime.datetime.today())
with
warc_dates = __iterate_by_month(warc_files_start_date, datetime.datetime(2020, 3, 21))
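For reference, here is a minimal sketch of what a month-wise iterator like `__iterate_by_month` presumably does (this is an assumption about its behavior, not the actual news-please implementation): it yields the first day of each month up to the end date, so capping the end date at 2020-03-21 should restrict the iteration to March 2020 only.

```python
import datetime

def iterate_by_month(start_date, end_date):
    """Yield the first day of each month in [start_date, end_date].
    Hypothetical stand-in for news-please's __iterate_by_month."""
    current = datetime.datetime(start_date.year, start_date.month, 1)
    while current <= end_date:
        yield current
        # advance to the first day of the next month
        if current.month == 12:
            current = datetime.datetime(current.year + 1, 1, 1)
        else:
            current = datetime.datetime(current.year, current.month + 1, 1)

# capping the end date at 2020-03-21 yields only March 2020
months = list(iterate_by_month(datetime.datetime(2020, 3, 1),
                               datetime.datetime(2020, 3, 21)))
print(months)  # [datetime.datetime(2020, 3, 1, 0, 0)]
```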
In both config.cfg and config_lib.cfg, I set:
start_date = '2020-03-01 00:00:00'
end_date = '2020-03-20 06:00:00'
In commoncrawl_extractor.py, I changed
__filter_start_date = None
to
__filter_start_date = datetime.datetime(2020, 3, 1)
and, below the comment "# end date (if None, any date is OK as end date)", changed
__filter_end_date = None
to
__filter_end_date = datetime.datetime(2020, 3, 20)
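Those two module-level dates presumably drive a per-article check along these lines (a sketch under that assumption, not the extractor's actual code; the function name is made up). Note that a check like this runs on each article after its WARC file is already on disk:

```python
import datetime

FILTER_START_DATE = datetime.datetime(2020, 3, 1)
FILTER_END_DATE = datetime.datetime(2020, 3, 20)

def article_passes_date_filter(publish_date):
    """Keep an article only if its publish date falls inside the window.
    A None bound means 'no limit on that side'."""
    if publish_date is None:
        return False  # strict mode: undated articles are discarded
    if FILTER_START_DATE is not None and publish_date < FILTER_START_DATE:
        return False
    if FILTER_END_DATE is not None and publish_date > FILTER_END_DATE:
        return False
    return True

print(article_passes_date_filter(datetime.datetime(2020, 3, 10)))  # True
print(article_passes_date_filter(datetime.datetime(2020, 4, 1)))   # False
```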
In commoncrawl.py, below the comment "# if date filtering is strict and news-please could not detect the date of an article, the article will be discarded", I set
my_warc_files_start_date = datetime.datetime(2020, 3, 1)
After making these changes, it still tries to download all 441 WARC files of March 2020, which results in a "no space left on device" error.
These are all the changes I made; I think most of them affect only the article-filtering step, not which WARC files get downloaded.
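If the date filters really do apply only per article after a WARC file has been downloaded, then the list of files to download would itself need pruning. CC-NEWS file names embed the crawl timestamp (paths like crawl-data/CC-NEWS/2020/03/CC-NEWS-20200310120000-00315.warc.gz), so a helper like the following could skip files outside the window before any download starts. This is a sketch: the helper name is invented, and the exact path layout is an assumption to verify against the actual file listing.

```python
import datetime
import re

START = datetime.datetime(2020, 3, 1, 0, 0, 0)
END = datetime.datetime(2020, 3, 20, 6, 0, 0)

def warc_in_window(warc_path, start, end):
    """Extract the YYYYMMDDhhmmss timestamp from a CC-NEWS WARC path
    and test whether it falls inside [start, end]."""
    match = re.search(r'CC-NEWS-(\d{14})-\d+\.warc\.gz$', warc_path)
    if not match:
        return False  # unexpected name: skip rather than download blindly
    crawled = datetime.datetime.strptime(match.group(1), '%Y%m%d%H%M%S')
    return start <= crawled <= end

paths = [
    'crawl-data/CC-NEWS/2020/02/CC-NEWS-20200229231501-00310.warc.gz',
    'crawl-data/CC-NEWS/2020/03/CC-NEWS-20200310120000-00315.warc.gz',
    'crawl-data/CC-NEWS/2020/03/CC-NEWS-20200320074916-00323.warc.gz',
]
wanted = [p for p in paths if warc_in_window(p, START, END)]
print(wanted)  # only the file crawled on 2020-03-10 remains
```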