Release 1.4?

Julien Nioche

May 24, 2023, 3:05:18 AM
to crawler-commons
Hi, 

We have loads of improvements and bug fixes committed since 1.3 (see CHANGES.txt) and it would be great to have a new release.

There is 1 PR in process; should we wait for it or go ahead with the release?

Thanks

Julien

Richard Zowalla

May 24, 2023, 3:24:24 AM
to crawler...@googlegroups.com
+1 for doing 1.4

Sebastian Nagel

May 24, 2023, 3:43:18 AM
to crawler...@googlegroups.com
Hi Julien,

> and it would be great to have a new release.

Yes, definitely.

> There is 1 PR in process
> <https://github.com/crawler-commons/crawler-commons/pull/360>, should we wait
> for it or go ahead with the release?

I would opt to wait for it. If we get it done, SimpleRobotRulesParser will pass
all unit tests of Google's robots.txt parser, which is also the reference
implementation of RFC 9309.

#360 is blocked by #390 (or #114): one unit test fails because
SimpleRobotRulesParser closes a rule block when a Crawl-delay instruction is
observed. The RFC says that instructions not specified in it should not change
the behavior of the parser; they may be ignored or followed without side
effects. But it's mostly a decision about how the parser should behave.
Please have a look!
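
To make the question concrete, here is a minimal sketch (mine, not the example
from #390; the robots.txt content and the agent name "mybot" are made up, and
the exact parseContent signature may vary between versions) that probes the
behavior with SimpleRobotRulesParser:

    import java.nio.charset.StandardCharsets;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class CrawlDelayBlockProbe {
        public static void main(String[] args) {
            // A Crawl-delay line sits inside the group for '*'. The question
            // in #390: does it close the rule block, so that the Allow line
            // below no longer belongs to any group and is silently dropped?
            String robotsTxt = String.join("\n",
                    "User-agent: *",
                    "Disallow: /private/",
                    "Crawl-delay: 5",
                    "Allow: /private/public.html");

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                    "https://example.com/robots.txt",
                    robotsTxt.getBytes(StandardCharsets.UTF_8),
                    "text/plain",
                    "mybot"); // hypothetical agent name

            // If Crawl-delay does not close the block (the RFC 9309 reading),
            // the Allow rule still applies and this prints 'true'.
            System.out.println(
                    rules.isAllowed("https://example.com/private/public.html"));
            // crawler-commons reports the delay in milliseconds.
            System.out.println(rules.getCrawlDelay());
        }
    }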

I should be able to work on #360 over the next few days. Because there are quite a
few changes in the robots.txt parser, I would also like to test the current
parser against some real-world robots.txt files. If the release can wait for
another week...

Best,
Sebastian


Julien Nioche

May 24, 2023, 3:57:20 AM
to crawler...@googlegroups.com
Thanks guys, let's wait then! All the recent improvements are fantastic; thanks, everyone.

Sebastian Nagel

Jul 10, 2023, 9:44:20 AM
to crawler...@googlegroups.com
Hi everybody,

There are a couple of pull requests open that address all remaining
issues needed to make SimpleRobotRulesParser comply with RFC 9309. It would be
great if someone could have a look and review the PRs. Thanks!

A closer look is recommended for #390/#114/#430: the "do not close rule blocks
on statements not defined in RFC 9309" principle (which also covers the
Crawl-delay statement) seems to contradict the intent of some robots.txt files.
See the example in #390.

I've run SimpleRobotRulesParser on the robots.txt WARC files from Common Crawl
(see the link in #123) and compared the rules parsed by 1.3 with those from a
test branch based on the current master with all open PRs merged into it. Except
for changes related to #390/#114/#430, the differences in parsed rules look good,
generally in favor of 1.4. Thanks also, and especially, to Eduardo Jimenez for
implementing #351 (merging of rule groups)!
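
For reference, here is a rough sketch of this kind of check (not the actual
script I used; it assumes the jwarc library for WARC reading, and "mybot" is a
placeholder agent name). Running it once with 1.3 and once with the test branch
on the classpath, then diffing the output, approximates the comparison:

    import java.nio.file.Path;

    import org.netpreserve.jwarc.WarcReader;
    import org.netpreserve.jwarc.WarcRecord;
    import org.netpreserve.jwarc.WarcResponse;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsWarcCheck {
        public static void main(String[] args) throws Exception {
            Path warc = Path.of(args[0]); // a robots.txt WARC, e.g. from Common Crawl
            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

            try (WarcReader reader = new WarcReader(warc)) {
                for (WarcRecord record : reader) {
                    if (!(record instanceof WarcResponse)) {
                        continue; // skip request and metadata records
                    }
                    WarcResponse response = (WarcResponse) record;
                    byte[] body = response.http().body().stream().readAllBytes();
                    BaseRobotRules rules = parser.parseContent(
                            response.target(), body, "text/plain", "mybot");
                    // One line per robots.txt; diff this output between versions.
                    System.out.println(response.target() + "\t" + rules.isAllowAll()
                            + "\t" + rules.isAllowNone() + "\t" + rules.getCrawlDelay());
                }
            }
        }
    }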

Best,
Sebastian

Sebastian Nagel

Jul 12, 2023, 11:23:22 AM
to crawler...@googlegroups.com
Hi,

Thanks for the many reviews! Everything is done for the 1.4 release!

If there are no objections, I'll start preparing a release candidate...
It should be ready tomorrow.

Best,
Sebastian

Avi Hayun

Jul 13, 2023, 2:14:08 AM
to crawler...@googlegroups.com
Thank you so much Sebastian!
