Errors related to news-please installation

Thao Nguyen

Mar 19, 2022, 11:19:28 AM
to Common Crawl
I have installed Python 3.7, but when I run this command:
pip install news-please

I see this error:
ERROR: elastic-transport 8.1.0 has requirement urllib3<2,>=1.26.2, but you'll have urllib3 1.25.11 which is incompatible.

I have little coding experience and just followed the instructions at https://github.com/fhamborg/news-please#run-the-crawler-via-the-cli

Can anyone help me with this? Thanks in advance.

Sebastian Nagel

unread,
Mar 21, 2022, 5:30:52 AM3/21/22
to common...@googlegroups.com
Hi,

there are two ways to get around the error:

(a) run instead
pip install -U news-please
With the flag -U / --upgrade, pip upgrades news-please to the newest
available version and also upgrades dependencies that no longer
satisfy its requirements (here urllib3).

(b) use a virtual environment
https://docs.python.org/3/tutorial/venv.html
I'd strongly recommend this approach: it avoids package conflicts
and keeps forced dependency upgrades from breaking the dependencies
of some other package.
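
For anyone who would rather script option (b), here is a minimal sketch using Python's standard-library venv module. The directory name news-env and both helper names are placeholders made up for this example; nothing in news-please requires them.

```python
import os
import subprocess
import venv
from pathlib import Path


def env_python(env_dir):
    """Return the path of the interpreter inside the virtual environment."""
    if os.name == "nt":
        return Path(env_dir) / "Scripts" / "python.exe"
    return Path(env_dir) / "bin" / "python"


def create_env_and_install(env_dir="news-env", package="news-please"):
    """Create an isolated environment and install the package into it."""
    venv.create(env_dir, with_pip=True)
    # Calling pip through the environment's own interpreter keeps the
    # installation out of the system-wide site-packages.
    subprocess.run(
        [str(env_python(env_dir)), "-m", "pip", "install", package],
        check=True,
    )


# Example (downloads packages, so it is left commented out):
# create_env_and_install()
```

Scripts that use news-please then have to be run with the interpreter returned by env_python(), or from a shell in which the environment has been activated.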

Best,
Sebastian

Thao Nguyen

Mar 24, 2022, 4:24:10 AM
to Common Crawl
Thank you so much, Sebastian.
 
I also tried removing the number 3 after pip, and it works.
Is it OK to run the command without the number 3?

Best, 
Thao

Sebastian Nagel

Mar 24, 2022, 7:00:40 AM
to common...@googlegroups.com
Hi,

> I also tried removing the number 3 after pip, and it works.
> Is it OK to run the command without the number 3?

you mean, running
pip install ...
instead of
pip3 install ...

If both Python 2 and 3 are installed on your system,
"pip" and "pip3" (maybe also "pip2", "pip2.7", etc.) each
install modules for one specific installed Python version.

Of course, you should use the pip command corresponding
to the Python version used to run your processing code.
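
One way to be sure that pip matches the interpreter is to call pip as a module of that interpreter. A small sketch of the idea follows; the helper name pip_install_command is made up for this example, and the code only builds and prints the command without running it.

```python
import sys


def pip_install_command(package, upgrade=False):
    """Build a pip command bound to the interpreter running this script."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        cmd.append("--upgrade")
    cmd.append(package)
    return cmd


# Show which Python version is in use and the matching install command.
print(sys.version_info[:2])
print(" ".join(pip_install_command("news-please", upgrade=True)))
```

On the command line the same idea is "python3 -m pip install news-please", which installs into whatever interpreter "python3" refers to.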

Best,
Sebastian

Thao Nguyen

Mar 24, 2022, 9:11:03 AM
to Common Crawl
Thank you Sebastian for your help. 

By the way, I would like to ask how to filter articles by date using news-please.
I have edited start_date and end_date in the config file as follows:

[DateFilter]

start_date = '2019-12-30 00:00:00'
end_date = '2019-12-31 00:00:00'

# If 'True' articles without a publishing date are dropped.
strict_mode = False

However, the retrieved results include articles whose publishing dates are outside the given time range (in 2011, 2020, or even 1996). Maybe I have made a mistake that I have not spotted yet.
I hope you can help me with this problem.

Thanks, 
Thao.

Sebastian Nagel

Mar 24, 2022, 6:40:16 PM
to common...@googlegroups.com
Hi Thao,

news-please uses the CC-NEWS WARC file timestamps to filter by time
range. These indicate when the data was fetched and are not tied to
the publication dates of the articles.

The news crawler uses the publication dates in feeds and news sitemaps
to skip over old articles (older than 30 days). But if there is no
pubdate, or if it is wrong, an outdated article may still be fetched.
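
Because the configured time range applies to the fetch time, one possible workaround is to filter on the extracted publication date yourself after extraction. The sketch below is only an illustration: it assumes each article object has a date_publish attribute holding a naive datetime or None (please verify this against your news-please version), and the function name, the drop_undated parameter and the sample articles are made up for the example.

```python
from datetime import datetime
from types import SimpleNamespace


def filter_by_publish_date(articles, start, end, drop_undated=False):
    """Keep articles published in [start, end).

    Articles without a publication date are kept unless drop_undated
    is True, mirroring what strict_mode is described to do in the config.
    """
    kept = []
    for article in articles:
        published = getattr(article, "date_publish", None)
        if published is None:
            if not drop_undated:
                kept.append(article)
            continue
        if start <= published < end:
            kept.append(article)
    return kept


# Stand-in objects; real input would be the extracted articles.
sample = [
    SimpleNamespace(title="in range", date_publish=datetime(2019, 12, 30, 8, 15)),
    SimpleNamespace(title="too old", date_publish=datetime(2011, 5, 2)),
    SimpleNamespace(title="too new", date_publish=datetime(2020, 1, 3)),
    SimpleNamespace(title="no date", date_publish=None),
]
start = datetime(2019, 12, 30)
end = datetime(2019, 12, 31)

kept = filter_by_publish_date(sample, start, end, drop_undated=True)
print([a.title for a in kept])  # ['in range']
```

If the publication dates carry time zone information, start and end need to be time zone aware as well, since Python refuses to compare naive and aware datetimes.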

You might also want to look at these discussions:
https://groups.google.com/g/common-crawl/c/1OjM4sJ18dE/m/qC1rgV7zCgAJ
https://groups.google.com/g/common-crawl/c/SkGNdov1Mh4/m/G-NF8cxHDwAJ

Best,
Sebastian

Thao Nguyen

Apr 5, 2022, 4:44:06 AM
to Common Crawl
Thank you so much, Sebastian. 
