Solr and Stormcrawler - [WARN] Found data point value of class class java.util.HashMap

38 views
Skip to first unread message

Suman Mallela

unread,
Feb 28, 2018, 4:51:41 PM2/28/18
to digita...@googlegroups.com
Hello,

I am trying to integrate solr and stormcrawler and the crawler log just halts at below log:

mer Thread-27 [WARN] Found data point value of class class java.util.HashMap
2018-02-28 16:34:38.899 c.d.c.s.m.MetricsConsumer Thread-27 [WARN] Found data point value of class class java.util.HashMap

Attached is the entire log. Can you please guide me?

Thanks.
worker.log

DigitalPebble

unread,
Mar 1, 2018, 11:36:24 AM3/1/18
to DigitalPebble
It is a warning, not an error. If you had googled the message, you would have found https://github.com/DigitalPebble/storm-crawler/issues/526



--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebble+unsubscribe@googlegroups.com.
To post to this group, send email to digita...@googlegroups.com.
Visit this group at https://groups.google.com/group/digitalpebble.
For more options, visit https://groups.google.com/d/optout.



--

msum...@gmail.com

unread,
Mar 8, 2018, 9:55:23 AM3/8/18
to DigitalPebble
Hello Julien,

I understand that WARN is not an error. I have used the same URL that you pasted to correct the MetricsConsumer.java file for External resource Solr code from log.error to log.warn. I have been running the topology for more than 15 hrs and still I don't see any emissions or transfers from Spout. Can you please comment on this?

Here is the screenshot:


Thanks,
Suman

DigitalPebble

unread,
Mar 8, 2018, 10:08:16 AM3/8/18
to DigitalPebble
Hi 

Have you checked that there are URLs in the status index?

J

--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebble+unsubscribe@googlegroups.com.
To post to this group, send email to digita...@googlegroups.com.
Visit this group at https://groups.google.com/group/digitalpebble.
For more options, visit https://groups.google.com/d/optout.

msum...@gmail.com

unread,
Mar 8, 2018, 10:20:06 AM3/8/18
to DigitalPebble
Yes, I have checked and it shows the URL that I used in seeds.txt. Here is the screenshot:



On Wednesday, February 28, 2018 at 4:51:41 PM UTC-5, Suman Mallela wrote:

msum...@gmail.com

unread,
Mar 9, 2018, 4:44:30 PM3/9/18
to DigitalPebble
Hello Julie,

Do you have any thoughts on no url's are getting indexed ? Do you want to see any of my config files?

Thanks 


On Wednesday, February 28, 2018 at 4:51:41 PM UTC-5, Suman Mallela wrote:

DigitalPebble

unread,
Mar 10, 2018, 5:18:52 AM3/10/18
to DigitalPebble
Anything in the  solr logs?

--

Suman Mallela

unread,
Mar 10, 2018, 3:37:22 PM3/10/18
to digita...@googlegroups.com
No errors in solr log. Attached is the solr.log for your reference.

Thanks

--
You received this message because you are subscribed to a topic in the Google Groups "DigitalPebble" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/digitalpebble/rP5ikOva6lg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to digitalpebble+unsubscribe@googlegroups.com.
solr.log

Suman Mallela

unread,
Mar 10, 2018, 7:09:37 PM3/10/18
to digita...@googlegroups.com
Only the metrics core is filling up and the indexer bolt is not getting executed at all. 

Thanks

DigitalPebble

unread,
Mar 11, 2018, 1:34:47 AM3/11/18
to DigitalPebble
That's normal given that the spout is not getting anything from SOLR. Look at the solr logs =>

2018-03-10 20:36:45.468 INFO  (qtp205125520-19) [   x:status] o.a.s.c.S.Request [status]  webapp=/solr path=/select params={q=*:*&expand=true&expand.rows=100&start=0&fq=nextFetchDate:[*+TO+NOW]&fq={!collapse+field%3Dhost}&rows=100&wt=javabin&version=2} hits=0 status=0 QTime=0

this shows that 1) the communication between the spout and SOLR does happen 2) no docs are returned.

the query looks correct. Try running this query yourself using the SOLR UI and modify elements of it until you are getting documents returned.




Suman Mallela

unread,
Mar 12, 2018, 4:18:13 PM3/12/18
to digita...@googlegroups.com
Thanks, Julien.

I had to update the type (from string to tdate) for nextFetchDate in the status (schema.xml) for the query to work in Solr UI.
    <field name="nextFetchDate" type="tdate" stored="true" indexed="true"/>

Here is some of the additional logs in the solr.log but still no docs are indexed in the eclkc_solr641 core(solr.indexer.url: "http://localhost:8983/solr/eclkc_solr641")

2018-03-12 20:00:50.032 INFO  (qtp205125520-17) [   x:eclkc_solr641] o.a.s.u.p.LogUpdateProcessorFactory [eclkc_solr641]  webapp=/solr path=/update params={wt=javabin&version=2}{add=[https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program (1594763352507154432)]} 0 0

But nothing after this for the eclkc_solr641 core. Attached is the full solr.log.

Thanks,
Suman

 
solr.log

DigitalPebble

unread,
Mar 12, 2018, 4:48:24 PM3/12/18
to DigitalPebble
I am a bit surprised that you had to modify the schema, I used the SOLR module not long ago and it worked without me having to do anything to it.

looks like it is the same URL being fetched over and over (https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program); your log shows attempts at updating and indexing it
what do you get when you query the status index for that URL? 
anything in the storm logs?

Suman Mallela

unread,
Mar 12, 2018, 6:00:03 PM3/12/18
to digita...@googlegroups.com


Storm logs looks fine. No Errors. Looks like it's going into IndexerBolt.java,  printing all the writeln statements. Attached is the worker.log (storm log) for your reference.

Thanks.

worker.log
worker.log

DigitalPebble

unread,
Mar 13, 2018, 3:08:50 AM3/13/18
to DigitalPebble
As shown in the screenshot, the status of the URL has not been updated and is still marked as DISCOVERED with a fetch date set in the past. That's why this URL is being fetched over and over. what you need to work out is why the status updater bolt is not successful in updating the URL

Julien Nioche

unread,
Mar 13, 2018, 6:46:19 AM3/13/18
to DigitalPebble
I tried with your URL and indeed had to define the date field as you did below. This has been fixed in https://github.com/DigitalPebble/storm-crawler/issues/544, thanks!

There was also an issue with the definition of the docs index -> https://github.com/DigitalPebble/storm-crawler/issues/545

Having said that, I can see the URL status getting updated, new URLs being added etc... so I don't know what the problem is with your setup

Suman Mallela

unread,
Mar 13, 2018, 3:04:53 PM3/13/18
to digita...@googlegroups.com
Can you look at the settings in the crawler-conf.yaml. Attached is the file. Anything you see configured incorrectly?

Can you share you config files to compare?

Thanks!
crawler-conf.yaml

DigitalPebble

unread,
Mar 13, 2018, 3:16:06 PM3/13/18
to DigitalPebble
I used the example solr config file in the repo.

On 13 Mar 2018 19:04, "Suman Mallela" <msum...@gmail.com> wrote:
Can you look at the settings in the crawler-conf.yaml. Attached is the file. Anything you see configured incorrectly?

Can you share you config files to compare?

Thanks!

Suman Mallela

unread,
Mar 13, 2018, 3:56:12 PM3/13/18
to digita...@googlegroups.com
This is my Streams config within solr-crawler.flux. I suppose the flow is wrong here. I am using flux to inject and crawl. Can you confirm?

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
      
  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"


On Tue, Mar 13, 2018 at 3:16 PM, DigitalPebble <jul...@digitalpebble.com> wrote:
I used the example solr config file in the repo.
On 13 Mar 2018 19:04, "Suman Mallela" <msum...@gmail.com> wrote:
Can you look at the settings in the crawler-conf.yaml. Attached is the file. Anything you see configured incorrectly?

Can you share you config files to compare?

Thanks!

Suman Mallela

unread,
Mar 13, 2018, 5:49:28 PM3/13/18
to digita...@googlegroups.com
Hello Julien,

I put in some debug statements in the StatusUpdaterBolt.java file and I see the URL's being discovered but not recursively fetched except for the seed URL. I understand that the DISCOVERED status is not changing for the seed url and so it's getting fetched repeatedly. I am not sure why the status stream is not getting updated. Here is more log that may help you guide me. Attached is the full log, I had to cancel the crawl since it's fetching the same url.

6303 [Thread-24-fetcher-executor[3 3]] INFO  c.d.s.b.FetcherBolt - Using queue mode : byHost
6686 [Thread-24-fetcher-executor[3 3]] INFO  o.a.s.d.executor - Prepared bolt fetcher:(3)
6736 [Thread-24-fetcher-executor[3 3]] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Threads : 0 queues : 0 in_queues : 0
7472 [FetcherThread #126] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program with status 200 in msec 60
7495 [Thread-34-parse-executor[5 5]] INFO  c.d.s.b.JSoupParserBolt - Parsing : starting https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program
8076 [Thread-34-parse-executor[5 5]] INFO  c.d.s.b.JSoupParserBolt - Parsed https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program in 539 msec
 DOC: url https://eclkc.ohs.acf.hhs.gov/job-center - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/human-resources - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/school-readiness - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/physical-health - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/transportation - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/cas-login - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/curriculum - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/oral-health - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/policy/pi - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/facilities - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/programs - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/topics - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/about - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/family-community - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/upcoming-events - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/program-planning - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/parenting - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/state-systems - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/mental-health - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/subscribe - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/about-us - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/health - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/safety-practices - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/policy/im - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/ - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/disclaimers - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/privacy - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/culture-language - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/nutrition - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/policy - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/transitions - Status DISCOVERED
 DOC: url https://eclkc.ohs.acf.hhs.gov/center-locator - Status DISCOVERED
8860 [Thread-24-fetcher-executor[3 3]] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Threads : 0 queues : 0 in_queues : 0
8940 [FetcherThread #175] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program with status 200 in msec 58
8962 [Thread-34-parse-executor[5 5]] INFO  c.d.s.b.JSoupParserBolt - Parsing : starting https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program
9024 [Thread-34-parse-executor[5 5]] INFO  c.d.s.b.JSoupParserBolt - Parsed https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program in 57 msec
9119 [Thread-24-fetcher-executor[3 3]] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Threads : 0 queues : 0 in_queues : 0
9188 [FetcherThread #175] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program with status 200 in msec 44
9195 [Thread-34-parse-executor[5 5]] INFO  c.d.s.b.JSoupParserBolt - Parsing : starting https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program
9251 [Thread-34-parse-executor[5 5]] INFO  c.d.s.b.JSoupParserBolt - Parsed https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program in 50 msec
9448 [Thread-24-fetcher-executor[3 3]] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Threads : 0 queues : 0 in_queues : 0
9536 [FetcherThread #83] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched 


Thanks
crawler.log

Suman Mallela

unread,
Mar 13, 2018, 6:02:43 PM3/13/18
to digita...@googlegroups.com
And below is the screenshot for storm ui which shows that status is not emitted or transferred (always 0). 



DigitalPebble

unread,
Mar 14, 2018, 2:45:58 AM3/14/18
to DigitalPebble
I understand that the DISCOVERED status is not changing for the seed url and so it's getting fetched repeatedly.

That's not the case. From your log
its status should have been changed to fetched. The other URLs should have been added as well.

https://github.com/DigitalPebble/stormcrawlerfight/tree/solr contains a basic setup with SOLR, check that there are no major differences with your project in the flux and conf

Suman Mallela

unread,
Mar 14, 2018, 12:12:26 PM3/14/18
to digita...@googlegroups.com
The setup looks the same for flux and conf. 

its status should have been changed to fetched. The other URLs should have been added as well.

Are you suggesting if there might be a problem with solr config not updating the status database (status core write operations issue)? 

I ran the crawler overnight  and in the morning I saw 1 document indexed (https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program)
in the indexer core (eclkc_solr641) and a lots of documents in metrics core but still the status core didn't show all the discovered URL's nor the status of the URL  (https://eclkc.ohs.acf.hhs.gov/publication/choosing-parenting-curriculum-your-program) didn't change to FETCHED.

I can only think of re-installing solr and create a new brand new StormCrawler-based project based on mvn archetype.

Any thoughts?

DigitalPebble

unread,
Mar 14, 2018, 2:11:15 PM3/14/18
to DigitalPebble
Hi Suman

Yes, a fresh setup from the archetype would be a good idea. I'll release 1.8 soon but in the meantime you can download the master from Github.

Make sure you use the configs from the core dir, should be a matter of linking them to the right place in your SOLR setup (can't remember where from top of my head)

Julien

Suman Mallela

unread,
Mar 15, 2018, 11:50:30 AM3/15/18
to digita...@googlegroups.com
I did the following steps:

2) mvn clean install

Am not sure what to do after this to build a SOLR setup. Can you send me the steps where to update the pom.xml and config files.

Thanks

Suman Mallela

unread,
Mar 15, 2018, 11:59:12 AM3/15/18
to digita...@googlegroups.com
/opt/storm-crawler/archetype/src/main/resources/archetype-resources. Looks like I have to update the files in this location. Can you confirm?

Thanks

DigitalPebble

unread,
Mar 15, 2018, 12:03:59 PM3/15/18
to DigitalPebble
Suman

Instead of posting every 10 minutes, why don't you try to work things out by yourself a bit and look at the tutorials, WIKI and videos available?

Thanks
Reply all
Reply to author
Forward
0 new messages