Storm crawler - SOLR


andym...@yahoo.fr

Aug 15, 2015, 1:13:23 PM
to DigitalPebble
Hi, 

Could you please show me how to launch storm-crawler with Solr?
I have already read the README file in the solr repository, but it didn't work.

Thanks
Andy

Jorge Luis Betancourt Gonzalez

Aug 15, 2015, 4:23:05 PM
to digita...@googlegroups.com
Would you care to elaborate a little more on what happened? Can you share the relevant portions of the config file? Do you have your own Solr collections?

Regards,


andym...@yahoo.fr

Aug 16, 2015, 4:21:47 PM
to DigitalPebble
Hi again,

I have my own Solr instance with a collection called "alfred".
I have already modified solr-conf.yaml to include this collection:

# Solr indexer bolt
solr.indexer.threads: 10
solr.indexer.queue.size: 10000
solr.indexer.commit.size: 1

I have also added the dependency in the pom.xml inside of the core repository:
<dependency>
  <groupId>com.digitalpebble</groupId>
  <artifactId>storm-crawler-solr</artifactId>
  <version>0.6</version>
</dependency>

When I launched storm-crawler, I got this error message:
$ mvn clean compile exec:java -Dstorm.topology=com.digitalpebble.storm.crawler.CrawlTopology -Dexec.args="-conf crawler-conf.yaml -local"
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building storm-crawler-core 0.6-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[WARNING] The POM for com.digitalpebble:storm-crawler-solr:jar:0.6 is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 6.423 s
[INFO] Finished at: 2015-08-16T11:30:48+02:00
[INFO] Final Memory: 11M/123M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project storm-crawler-core: Could not resolve dependencies for project com.digitalpebble:storm-crawler-core:jar:0.6-SNAPSHOT: Could not find artifact com.digitalpebble:storm-crawler-solr:jar:0.6 in central (https://repo.maven.apache.org/maven2) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:


Am I doing something wrong?

Regards,
Andy

DigitalPebble

Aug 17, 2015, 6:26:42 AM
to digita...@googlegroups.com
Hi Andy

> I have also added the dependency in the pom.xml inside of the core repository
> <dependency>
>   <groupId>com.digitalpebble</groupId>
>   <artifactId>storm-crawler-solr</artifactId>
>   <version>0.6</version>
> </dependency>

That won't work: 0.6 hasn't been published yet, and there should be no dependency from core to Solr. Instead, run mvn clean install from the root of storm-crawler, then cd core and call mvn exec etc.

Alternatively, cd external/solr, run mvn clean install, then mvn exec with SolrCrawlTopology.
HTH

Julien
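
For reference, the two routes described above could be scripted roughly as follows. This is only a sketch: it assumes the current directory is the root of a storm-crawler checkout containing the core and external/solr modules, the function names (build_all, run_core_topology, build_solr_module) are made up for illustration, and the exec command is the one quoted earlier in this thread.

```shell
#!/bin/sh
# Sketch only: assumes the current directory is the root of a storm-crawler
# checkout with core/ and external/solr/ modules. Function names are hypothetical.

# Route 1, step 1: build every module and install it into the local Maven
# repository, so the storm-crawler modules can be resolved from there.
build_all() {
    mvn clean install
}

# Route 1, step 2: run the crawl topology from the core module.
# The subshell keeps the caller's working directory unchanged.
run_core_topology() {
    (
        cd core || exit 1
        mvn clean compile exec:java \
            -Dstorm.topology=com.digitalpebble.storm.crawler.CrawlTopology \
            -Dexec.args="-conf crawler-conf.yaml -local"
    )
}

# Route 2: build and install only the Solr module; SolrCrawlTopology is then
# launched with mvn exec from inside external/solr.
build_solr_module() {
    ( cd external/solr && mvn clean install )
}
```

Calling build_all followed by run_core_topology corresponds to the first route. Neither route needs a storm-crawler-solr dependency in core's pom.xml.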


andym...@yahoo.fr

Aug 17, 2015, 5:52:41 PM
to DigitalPebble, jul...@digitalpebble.com
Hi again,

It still doesn't work with Maven (I'm originally a C# developer and new to Maven); I'll look into it later.
In the end, I decided to go through Eclipse directly and ran SolrCrawlTopology from there.
I've created 3 cores in Solr (Status, Metrics and Alfred) and decided to move from Solr 4.10.2 to 5.1.
When I ran SolrCrawlTopology, I got this error message:

21903 [Thread-21-spout] ERROR com.digitalpebble.storm.crawler.solr.persistence.SolrSpout - Can't query Solr: {}
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/status: Index: 0, Size: 0
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:556) ~[solr-solrj-5.1.0.jar:5.1.0 1672403 - timpotter - 2015-04-09 10:37:56]
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:233) ~[solr-solrj-5.1.0.jar:5.1.0 1672403 - timpotter - 2015-04-09 10:37:56]
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:225) ~[solr-solrj-5.1.0.jar:5.1.0 1672403 - timpotter - 2015-04-09 10:37:56]
at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(ConcurrentUpdateSolrClient.java:296) ~[solr-solrj-5.1.0.jar:5.1.0 1672403 - timpotter - 2015-04-09 10:37:56]
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135) ~[solr-solrj-5.1.0.jar:5.1.0 1672403 - timpotter - 2015-04-09 10:37:56]
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:943) ~[solr-solrj-5.1.0.jar:5.1.0 1672403 - timpotter - 2015-04-09 10:37:56]
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:958) ~[solr-solrj-5.1.0.jar:5.1.0 1672403 - timpotter - 2015-04-09 10:37:56]
at com.digitalpebble.storm.crawler.solr.persistence.SolrSpout.populateBuffer(SolrSpout.java:172) [classes/:na]
at com.digitalpebble.storm.crawler.solr.persistence.SolrSpout.nextTuple(SolrSpout.java:154) [classes/:na]
at backtype.storm.daemon.executor$fn__3371$fn__3386$fn__3415.invoke(executor.clj:565) [storm-core-0.9.5.jar:0.9.5]
at backtype.storm.util$async_loop$fn__460.invoke(util.clj:463) [storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]


Have you seen this message before?
Have I missed something?

Regards,
Andy

jorge.dig...@gmail.com

Aug 21, 2015, 1:36:54 PM
to digita...@googlegroups.com
Did you check that Solr is running at http://localhost:8983/solr/status? Does a sample query like http://localhost:8983/solr/status/select?q=*:* return any results? Did you adjust the configuration file for storm-crawler? I think the artifacts for version 0.6 have not been published yet.
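
The sample query can also be run from a shell. A minimal sketch, assuming curl is installed: the helper names select_url and check_core are hypothetical, and rows=0 and wt=json are standard Solr query parameters added here only to keep the response small.

```shell
#!/bin/sh
# Build the match-all select URL for a core. rows=0 asks Solr for the hit
# count only, without returning any documents.
select_url() {
    printf '%s/select?q=*:*&rows=0&wt=json\n' "$1"
}

# Query a core and print Solr's response.
# -f: exit with a non-zero status on an HTTP error
# -sS: hide the progress bar but still print error messages
check_core() {
    curl -fsS "$(select_url "$1")"
}
```

For example, check_core http://localhost:8983/solr/status. A connection failure or HTTP error suggests that Solr is not running or that the core does not exist under that name, while a response with "numFound":0 means the core is reachable but empty.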




msum...@gmail.com

Feb 7, 2018, 4:34:14 PM
to DigitalPebble

msum...@gmail.com

Feb 7, 2018, 4:39:56 PM
to DigitalPebble
Hi,

I get a NullPointerException when I run this command:

root@searchvm:/opt/stormcrawler# mvn clean compile exec:java -Dexec.mainClass=com.digitalpebble.stormcrawler.solr.SolrCrawlTopology -Dexec.args="-conf solr-conf.yaml -conf crawler-conf.yaml -local" > test4.log

fetch-executor[4 4]] INFO  o.a.s.d.executor - Prepared bolt fetch:(4)
15504 [Thread-30-spout-executor[9 9]] ERROR c.d.s.s.p.SolrSpout - Can't query Solr: {}
java.lang.NullPointerException: null
at com.digitalpebble.stormcrawler.solr.persistence.SolrSpout.populateBuffer(SolrSpout.java:201) [storm-crawler-solr-1.7.jar:?]
at com.digitalpebble.stormcrawler.solr.persistence.SolrSpout.nextTuple(SolrSpout.java:171) [storm-crawler-solr-1.7.jar:?]
at org.apache.storm.daemon.executor$fn__4976$fn__4991$fn__5022.invoke(executor.clj:644) [storm-core-1.1.0.jar:1.1.0]
at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:484) [storm-core-1.1.0.jar:1.1.0]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
15525 [Thread-30-spout-executor[9 9]] ERROR c.d.s.s.p.SolrSpout - Can't query Solr: {}
java.lang.NullPointerException: null
at com.digitalpebble.stormcrawler.solr.persistence.SolrSpout.populateBuffer(SolrSpout.java:201) [storm-crawler-solr-1.7.jar:?]
at com.digitalpebble.stormcrawler.solr.persistence.SolrSpout.nextTuple(SolrSpout.java:171) [storm-crawler-solr-1.7.jar:?]
at org.apache.storm.daemon.executor$fn__4976$fn__4991$fn__5022.invoke(executor.clj:644) [storm-core-1.1.0.jar:1.1.0]
at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:484) [storm-core-1.1.0.jar:1.1.0]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
15535 [Thread-30-spout-executor[9 9]] ERROR c.d.s.s.p.SolrSpout - Can't query Solr: {}
java.lang.NullPointerException: null
at com.digitalpebble.stormcrawler.solr.persistence.SolrSpout.populateBuffer(SolrSpout.java:201) [storm-crawler-solr-1.7.jar:?]
at com.digitalpebble.stormcrawler.solr.persistence.SolrSpout.nextTuple(SolrSpout.java)

solr-conf.yaml

# configuration for SOLR resources
  
config:
  solr.indexer.threads: 10
  solr.indexer.queue.size: 10000
  solr.indexer.commit.size: 1
  # Solr spout and persistence bolt
  solr.status.url: "http://localhost:8983/solr/status"
  solr.status.bucket.field: host
  solr.status.bucket.maxsize: 100
  solr.status.metadata.prefix: metadata
  
  # Solr MetricsConsumer
  solr.metrics.url: "http://localhost:8983/solr/metrics"
  # solr.metrics.ttl.field: '__ttl__'
  # solr.metrics.ttl: '1HOUR'

  # For SolrCloud, use this settings instead of solr.indexer.url
  #
  #   solr.indexer.zkhost: "http://localhost:9983/"
  #   solr.indexer.collection: docs
  #
  # the same applies for the spout/persistence bolt and the metricsconsumer

  topology.metrics.consumer.register:
       - class: "com.digitalpebble.stormcrawler.solr.metrics.MetricsConsumer"
         parallelism.hint: 1

Thanks

