Questions about crawled links


Nick Gilmour

Dec 24, 2017, 11:36:18 AM
to pyspider-users
Hi all,
I have the following questions about crawled links:
1. How can I distinguish new incoming links from old ones in the results, i.e. which links came from which crawl?
2. If there are duplicate links, how can I flag the duplicates in the results?
3. How can I mark links that no longer exist in the results?

Any ideas?

Regards,
Nick

Roy Binux

Dec 24, 2017, 2:34:09 PM
to Nick Gilmour, pyspider-users

1. You could use 'save' to pass information through from the previous request.
2. You can't; duplicates are discarded implicitly.
3. What do you mean by "links that do not exist"?
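(For context, a minimal sketch of the 'save' suggestion. In pyspider, the dict passed as `save` to `self.crawl(url, callback=..., save={...})` is handed back to the callback as `response.save`. The snippet below simulates that round-trip with plain dicts so it runs without pyspider installed; the `crawl_id` tagging scheme is a hypothetical example, not a pyspider built-in.)

```python
import uuid

# Hypothetical tag identifying one crawl run (an assumption, not a pyspider feature).
CRAWL_ID = uuid.uuid4().hex

def crawl(url, save):
    """Stand-in for pyspider's self.crawl(url, callback=..., save=save):
    the fetcher would download `url` and echo `save` back on the response."""
    return {"url": url, "save": save}

def detail_page(response):
    """Stand-in callback: response['save'] is the dict given to crawl()."""
    return {
        "url": response["url"],
        "crawl_id": response["save"]["crawl_id"],  # which crawl run found this link
        "found_on": response["save"]["found_on"],  # which page linked to it
    }

task = crawl("http://example.com/page",
             save={"crawl_id": CRAWL_ID, "found_on": "http://example.com/"})
result = detail_page(task)
```

Writing the crawl id into each result this way lets you group results by crawl run afterwards.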


--
You received this message because you are subscribed to the Google Groups "pyspider-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pyspider-user...@googlegroups.com.
To post to this group, send email to pyspide...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pyspider-users/CAH-drozAr4oPGdJUwNO0AxseCcjAqEFqWzGsP3iFS3DxN4adXA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Nick Gilmour

Dec 24, 2017, 3:28:15 PM
to Roy Binux, pyspider-users
Hi Roy,

Thanks for the quick response!

1. What information do you mean? Is there something pyspider-specific I could save, e.g. a crawl ID?
3. Assume a list of links, some of which are removed over time. I would like to know which of them have been removed and save this information in the results.


Roy Binux

Dec 24, 2017, 4:19:56 PM
to Nick Gilmour, pyspider-users

1. Whatever information you want, such as the URL in your case.
3. Do it in your own database. pyspider's result DB is designed for its own use. If you have any other data logic, build your own store and feed it from pyspider through the result worker.
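(A sketch of the bookkeeping this implies for question 3, assuming you keep your own record of the URL set seen in each crawl; the function name and storage shape are hypothetical. A link present in the previous crawl but missing from the current one is marked removed, which also answers question 1's "which links are new".)

```python
def link_status(previous_links, current_links):
    """Compare the URL sets of two crawl runs and label each known link.

    previous_links, current_links: sets of URLs loaded from your own store.
    """
    status = {}
    for url in previous_links | current_links:
        if url not in previous_links:
            status[url] = "new"            # appeared in the current crawl
        elif url not in current_links:
            status[url] = "removed"        # vanished since the previous crawl
        else:
            status[url] = "still present"  # seen in both crawls
    return status

prev = {"http://example.com/a", "http://example.com/b"}
curr = {"http://example.com/a", "http://example.com/c"}
statuses = link_status(prev, curr)
```

A result-worker-fed table keyed by (crawl_id, url) would give you the two sets to diff.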



Nick Gilmour

Dec 24, 2017, 4:26:31 PM
to Roy Binux, pyspider-users
OK, many thanks. I'll try...
