Time for re-indexing, finding new content uploaded


Iain Watson Smith

Nov 6, 2015, 7:30:16 PM
to DataparkSearch Engine
I'd like to know how dpsearch discovers new content on the sites given in the Server list.
If ten new articles go up each day on a site that is itself massive, how is it going to recognise the new content?


Maxim Zakharov

Nov 9, 2015, 4:31:54 PM
to DataparkSearch Engine

Dpsearch discovers new content by reindexing all pages already in its database. The reindexing period is controlled by the Period and PeriodByHops commands, which may be set on a per-Server basis.

Dpsearch also supports sitemaps, introduced by Google a while ago, to speed up reindexing of massive sites. This is enabled with the “sitemaps yes” command (and you need to have robots.txt support enabled as well).
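
To illustrate why a sitemap helps on a massive site, here is a minimal Python sketch of the general idea: read the `<lastmod>` values from a sitemap and pick out only the URLs changed since the last crawl. It follows the public sitemaps.org format and is not dpsearch's own code; the function name `urls_modified_since`, the sample URLs, and the `last_crawl` value are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_modified_since(sitemap_xml, last_crawl):
    """Return the <loc> of every entry whose <lastmod> is newer than last_crawl.

    Hypothetical helper for illustration only; entries without <lastmod>
    are skipped here, although a real crawler would likely still visit them.
    """
    root = ET.fromstring(sitemap_xml)
    fresh = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None or lastmod is None:
            continue
        # Handles "YYYY-MM-DD" and "YYYY-MM-DDThh:mm:ss+hh:mm";
        # a trailing "Z" needs Python 3.11 or later.
        modified = datetime.fromisoformat(lastmod.strip())
        if modified.tzinfo is None:
            # Assumption: treat dates without a zone as UTC.
            modified = modified.replace(tzinfo=timezone.utc)
        if modified > last_crawl:
            fresh.append(loc.strip())
    return fresh


# Made-up sample sitemap with one old and one new article.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/news/old</loc><lastmod>2013-01-02</lastmod></url>
  <url><loc>http://example.com/news/new</loc><lastmod>2015-11-06</lastmod></url>
</urlset>"""

if __name__ == "__main__":
    last_crawl = datetime(2015, 11, 1, tzinfo=timezone.utc)
    print(urls_modified_since(SAMPLE, last_crawl))
```

With a sitemap like this, only the ten or so new articles per day need to be fetched, rather than every page already in the database.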

Best regards,
Maxim Zakharov



Iain Watson Smith

Nov 18, 2015, 4:19:27 AM
to DataparkSearch Engine
Thanks Maxim,

I have taken your advice and introduced the sitemaps.
Also, when I use the search.cgi interface and search on "last-modified-date", I get results ordered by crawl date, not by when the pages were actually last modified.
Is it possible to alter the CGI script to return the most recently modified pages first?

For example, I have one news article posted on 2015-11-06 and another posted on 2013-01-02. The older one was crawled after the newer one, so the results are returned like this:

2013-01-02
2015-11-06

Thanks again,
Iain

Maxim Zakharov

Nov 24, 2015, 5:46:38 PM
to DataparkSearch Engine

Hi Iain,
The value for last-modified-date is taken from the Last-Modified HTTP header returned by the remote server. If this header is not present, the Date header is used instead, which usually gives the time of indexing.
Most “modern” site-driving systems neglect to set a correct Last-Modified header, so the value usually does not reflect the actual last modification date.
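
The fallback described above can be sketched in a few lines of Python. This is only an illustration of the rule (Last-Modified if present, otherwise Date), not dpsearch's implementation; the function name `last_modified_date` and the sample header values are assumptions, and the headers are assumed to arrive as a plain dict with canonical names.

```python
from email.utils import parsedate_to_datetime


def last_modified_date(headers):
    """Return the timestamp to store: Last-Modified if present, else Date.

    Hypothetical helper for illustration; returns None when neither
    header is available.
    """
    for name in ("Last-Modified", "Date"):
        value = headers.get(name)
        if value:
            # Parses standard HTTP dates such as "Wed, 02 Jan 2013 10:00:00 GMT".
            return parsedate_to_datetime(value)
    return None


if __name__ == "__main__":
    # Server that sets Last-Modified: the real modification date is kept.
    with_header = {
        "Last-Modified": "Wed, 02 Jan 2013 10:00:00 GMT",
        "Date": "Wed, 18 Nov 2015 04:19:27 GMT",
    }
    # Server that omits it: the Date header (time of the fetch) is used.
    without_header = {"Date": "Wed, 18 Nov 2015 04:19:27 GMT"}

    print(last_modified_date(with_header))
    print(last_modified_date(without_header))
```

In the second case the stored date is simply the crawl time, which is why results sorted on last-modified-date can come back in crawl order.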
