Problem crawling with http connector

tbaumann

Aug 22, 2012, 10:51:18 AM

to const...@googlegroups.com

I am having an issue with the HTTP connector in Constellio 1.3 that I have not been able to find a workaround for. When I start crawling a site, I get roughly 65 to 85 documents in and then the crawl stops, sometimes for 20 minutes or an hour, before it restarts, gets maybe 30 or so more documents in, and halts again. When I look in the logs, I see a fairly generic error that some people have reported before:

"SEVERE: Push Exception during traversal.

com.google.enterprise.connector.pusher.PushException:
at com.google.enterprise.connector.pusher.DocPusher.submitFeed(DocPusher.java:579)
at com.google.enterprise.connector.pusher.DocPusher.access$000(DocPusher.java:57)
at com.google.enterprise.connector.pusher.DocPusher$1.call(DocPusher.java:503)
at com.google.enterprise.connector.pusher.DocPusher$1.call(DocPusher.java:499)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)"

I have tried increasing -Xmx from 1024 to 2048, with no change in behavior. I am not sure where to go from here.

Any help would be appreciated.

Thanks

Nicolas Bélisle

Aug 22, 2012, 1:44:52 PM

to const...@googlegroups.com

Are you using MySQL?

Also, the crawler tries to discard duplicated pages. That could explain the behaviour you're seeing.

What website are you trying to index?

Regards,

Nicolas

--
You are receiving this message because you are subscribed to the Google Groups group Constellio.
This discussion can be read on the web at https://groups.google.com/d/msg/constellio/-/dPRnlPQYc04J.
To post a message to this group, send an email to const...@googlegroups.com.
To unsubscribe from this group, send an email to constellio+...@googlegroups.com.
For more options, visit this group's page: http://groups.google.com/group/constellio?hl=fr

tbaumann

Aug 22, 2012, 1:49:59 PM

to const...@googlegroups.com

Hi Nicolas,

Yes, I am using MySQL as the backend database.

I am attempting to index an internal site that is just an Apache web page with directory index listings of my department's documents. I am talking about maybe 10,000 documents in total, and it is taking roughly 24 hours to crawl them all (with a 0:00 - 0:00 traversal schedule). If I take a smaller subset of documents and just point it at what is listed in my home directory, it still does the same thing. When it pauses, if I go in and click on restart traversal, it continues for a short time and then pauses again.

Nicolas Bélisle

Aug 22, 2012, 1:53:47 PM

to const...@googlegroups.com

Use a greater depth for your crawl.

Also, be aware that the crawler is limited to roughly one document per host every 1-2 seconds. This is meant to protect the crawled host against denial of service.
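As a rough sanity check on that limit, the following Python sketch estimates the minimum crawl time for a single host. The 10,000-document figure and the 1-2 second delay are taken from this thread; the function name is only illustrative, and the calculation ignores fetch and indexing time.

```python
# Back-of-the-envelope crawl time under a per-host politeness delay.
# Assumes every document lives on one host and the delay is the only cost.

def crawl_hours(documents: int, seconds_per_document: float) -> float:
    """Minimum wall-clock hours to fetch `documents` pages from one host."""
    return documents * seconds_per_document / 3600


for delay in (1.0, 2.0):
    print(f"{delay:.0f} s/doc -> {crawl_hours(10_000, delay):.1f} hours")
```

Under these assumptions the throttle accounts for roughly 3 to 6 hours, so it may not by itself explain a 24-hour crawl with long pauses.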

Regards

Anshul Tiwari

Mar 26, 2013, 2:17:45 AM

to const...@googlegroups.com

Hi Nicolas,

I have the same problem, but in my case Constellio is crawling duplicate links.
For example:
http://msi-twiki.metricstream.com/twiki/bin/view/Knowledge/ForumKnowledgeMisc00113

http://msi-twiki.metricstream.com/twiki/bin/view/Knowledge/ForumKnowledgeMisc00113?cover=print

http://msi-twiki.metricstream.com/twiki/bin/view/Knowledge/ForumKnowledgeMisc00113?sortcol=0;table=3;up=0

These links all point to the same topic (there are 1400+ such duplicates).
Could you please guide me on how to remove the duplicates, or how to make the crawl exclude them?

Thanks for helping me here.

Anshul

Anshul Tiwari

Apr 12, 2013, 2:22:32 AM

to const...@googlegroups.com

Hi Guys,

Can someone please help me with this?

Thanks

Anshul

Nicolas Bélisle

Apr 12, 2013, 11:51:07 AM

to const...@googlegroups.com

Hi,

Constellio detects duplicates based on the text content of a page.

In your case, I suggest using regular expressions to exclude the URL patterns of the near-duplicate pages. Some examples:

cover=print$

sortcol=
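To see what these two patterns would filter out, here is a minimal Python sketch run against the three example URLs posted earlier in this thread. It assumes the patterns are applied with "find anywhere in the URL" semantics; how Constellio itself applies exclusion patterns is not stated here, and the helper name `is_excluded` is hypothetical.

```python
import re

# Exclusion patterns quoted in the reply above.
EXCLUDE_PATTERNS = [re.compile(r"cover=print$"), re.compile(r"sortcol=")]

# The three example URLs from earlier in the thread.
BASE = "http://msi-twiki.metricstream.com/twiki/bin/view/Knowledge/ForumKnowledgeMisc00113"
URLS = [
    BASE,
    BASE + "?cover=print",
    BASE + "?sortcol=0;table=3;up=0",
]


def is_excluded(url: str) -> bool:
    """True if any exclusion pattern is found anywhere in the URL."""
    return any(p.search(url) for p in EXCLUDE_PATTERNS)


kept = [u for u in URLS if not is_excluded(u)]
print(kept)  # only the bare topic URL remains
```

If the tool requires a pattern to match the whole URL instead, each pattern would need a leading `.*` (and `sortcol=` a trailing `.*`) to behave the same way.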

Regards,

Nicolas

Anshul Tiwari

Apr 15, 2013, 8:22:29 AM

to const...@googlegroups.com

Hi Nicolas,

Thanks a ton for helping.

I was trying to escape the URL and match the pattern as shown below:

+^http://([a-z0-9]*\.)*\?cover=print*

I wrote many regular expressions, but none of them worked.

Thanks for your help; this was a roadblock for me.
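For anyone hitting the same wall, this short Python sketch shows one likely reason a pattern of that shape does not match the example URLs. It assumes the leading `+` is a filter-rule prefix rather than part of the regular expression, so it is dropped here; that reading is an assumption, not something confirmed in this thread.

```python
import re

# Pattern from the post above, without the leading "+"
# (assumed to be a rule prefix, not regex syntax).
ATTEMPT = re.compile(r"^http://([a-z0-9]*\.)*\?cover=print*")

URL = ("http://msi-twiki.metricstream.com/twiki/bin/view/"
       "Knowledge/ForumKnowledgeMisc00113?cover=print")

# The group only accepts [a-z0-9] labels followed by "." and then demands
# "?" immediately. The hyphen in "msi-twiki" and the "/twiki/..." path are
# never accounted for, so the match fails.
print(ATTEMPT.search(URL))  # None

# The unanchored pattern suggested earlier in the thread does match.
SUGGESTED = re.compile(r"cover=print$")
print(bool(SUGGESTED.search(URL)))  # True
```

Note also that `print*` means "prin" followed by zero or more "t" characters, which is probably not what was intended.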

Thanks

Anshul
