Exception while parsing a robots.txt

Altruist

Feb 1, 2013, 12:20:16 AM

to ldsp...@googlegroups.com

Hello All,

When LDSpider parses a robots.txt file, it does not seem to handle certain escape characters properly, as the exception stack trace below shows. Could you please advise how to overcome this issue? Please note that I have masked certain data.

Thanks.

INFO: ERROR: URLDecoder: Incomplete trailing escape (%) pattern: http://www.xxxxxx.com/robots.txt
1359695278 266 127.0.0.1 TCP_MISS/200 -1 GET http://www.xxxxxxx.com/robots.txt - NONE/- text/plain
Exception in thread "LT-2:http://www.xxxxx.com/xxxxxx-x--x--x---x-x-/xxx-x--x--x--x-x--x-xxxxx" java.lang.NullPointerException
    at org.osjava.norbert.NoRobotClient.isUrlAllowed(Unknown Source)
    at com.ontologycentral.ldspider.http.robot.Robot.isUrlAllowed(Unknown Source)
    at com.ontologycentral.ldspider.http.robot.Robots.accessOk(Unknown Source)
    at com.ontologycentral.ldspider.http.LookupThread.run(Unknown Source)

Andreas Harth

Feb 2, 2013, 4:03:11 PM

to ldsp...@googlegroups.com

Hi,

we're using the Norbert robots.txt parser as-is, so I don't know
off the top of my head what's going on. It would help to have an
example file that causes the problem for diagnosis. There is
always the possibility that the problem is on the target server's
side.

Best regards,
Andreas.

Meraj Khan

Feb 2, 2013, 4:05:11 PM

to ldsp...@googlegroups.com

Thank you for the reply, Andreas. It's the www.sears.com/robots.txt.

Thanks.

Andreas Harth

Feb 2, 2013, 4:35:21 PM

to ldsp...@googlegroups.com

Hi,

On 02/02/13 22:05, Meraj Khan wrote:
> Thank you for the reply, Andreas. It's the www.sears.com/robots.txt.

OK, Norbert chokes on the "%7", which is not a valid escape code (a percent escape needs two hexadecimal digits).

I'm loath to make changes in the Norbert part of the code, but
if you want to just skip the faulty line add a try/catch block
in NoRobotClient around line 160.
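
A minimal sketch of that kind of guard, for illustration only. It assumes the parser decodes each rule path with java.net.URLDecoder, which the "URLDecoder: Incomplete trailing escape (%) pattern" message in the log suggests. The class LenientPathDecoder and the method decodePaths are hypothetical names and are not part of Norbert or LDSpider; the actual code around line 160 of NoRobotClient may look different.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Norbert or LDSpider.
class LenientPathDecoder {

    // Decodes each raw path and skips entries whose percent-escapes are malformed,
    // instead of letting one bad line abort the whole robots.txt parse.
    static List<String> decodePaths(List<String> rawPaths) {
        List<String> decoded = new ArrayList<String>();
        for (String raw : rawPaths) {
            try {
                decoded.add(URLDecoder.decode(raw, "UTF-8"));
            } catch (IllegalArgumentException e) {
                // URLDecoder throws this for escapes such as a trailing "%7".
                System.err.println("Skipping malformed path: " + raw);
            } catch (UnsupportedEncodingException e) {
                // UTF-8 is required on every JVM, so this should not happen.
                throw new IllegalStateException(e);
            }
        }
        return decoded;
    }

    public static void main(String[] args) {
        // "/bad%7" is an invented example of a faulty rule path.
        List<String> paths = Arrays.asList("/a%20b", "/bad%7", "/ok");
        System.out.println(decodePaths(paths));
    }
}
```

With this approach the faulty rule is dropped and logged while the remaining rules are still applied, so a skipped Disallow line means the crawler may fetch a path the site intended to block.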

Best regards,
Andreas.

Meraj Khan

Feb 2, 2013, 4:54:55 PM

to ldsp...@googlegroups.com

Thanks a lot, Andreas. I will make that change.

So is this an issue with the Norbert code, or is the robots.txt invalid? Could you please let me know?

Thanks again.
