StormCrawler StdOutIndexer shows a character count for content, but no text is visible

jbri...@gmail.com

Sep 5, 2019, 2:20:15 PM
to DigitalPebble
Hi,

I am new to StormCrawler and have successfully run the "Getting Started" tutorial. I am using StormCrawler 1.14 and Storm 1.2.2. The output of the crawl looks like this:

content 2382 chars
domain  apache.org
title   Apache Storm
http://storm.apache.org/releases/2.0.0/index.html       DISCOVERED      Thu Sep 05 00:29:42 UTC 2019
        url.path: http://storm.apache.org/
        depth: 1

http://storm.apache.org/about/integrates.html   DISCOVERED      Thu Sep 05 00:29:42 UTC 2019
        url.path: http://storm.apache.org/
        depth: 1
.....


Having read this blog post: http://digitalpebble.blogspot.com/2017/04/crawl-dynamic-content-with-selenium-and.html, I am guessing that this is a successful run and that StdOutIndexer is performing as specified. My question, as a newcomer, is: how do I see the actual text rather than what appears to be a count of the characters in the content? Eventually I will need to parse both text and images, but to start with I need to see the text. Does this capability come out of the box, or does some other bolt or filter need to be applied?

Thanks,

John

DigitalPebble

Sep 6, 2019, 2:43:45 AM
to DigitalPebble
Hi John,

Thanks for getting in touch. Best to ask questions like this on StackOverflow, where you'll get a larger audience.

It's up to you to decide what you want to do with the content. StdOutIndexer is mainly for demo and debugging purposes; a real application would send the content to an indexer like Elasticsearch or to a database. If you haven't done so yet, have a look at the Elasticsearch tutorial at https://www.youtube.com/watch?v=KTerugU12TY&t=973s
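
The `content 2382 chars` line in the output quoted earlier in this thread is consistent with a debug indexer that abbreviates long field values instead of printing them in full. As a rough sketch only, and not StormCrawler's actual code, the plain-Java helper below shows that kind of abbreviation next to a variant that prints the whole text. The class `FieldPreview`, its methods, and the 100-character cut-off are all hypothetical names and values chosen for illustration.

```java
// Hypothetical helper, NOT part of StormCrawler: it mimics the kind of
// abbreviation that would turn a long field value into "2382 chars".
final class FieldPreview {

    // Assumed cut-off for illustration; a real indexer may use another value.
    static final int MAX_PREVIEW_CHARS = 100;

    // Short values are returned unchanged; long ones are replaced by their length.
    static String abbreviate(String value) {
        if (value.length() > MAX_PREVIEW_CHARS) {
            return value.length() + " chars";
        }
        return value;
    }

    // Formats one output line as "<field>\t<value>", in full or abbreviated.
    static String formatField(String name, String value, boolean fullText) {
        return name + "\t" + (fullText ? value : abbreviate(value));
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from a fetched page.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2382; i++) {
            sb.append('x');
        }
        String text = sb.toString();

        System.out.println(formatField("content", text, false)); // abbreviated form
        System.out.println(formatField("title", "Apache Storm", false)); // short, printed as-is
        System.out.println(formatField("content", text, true)); // full text
    }
}
```

Printing the full text this way is only practical for debugging a handful of pages; for anything larger, sending the documents to a real indexing backend remains the more sensible route.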

Hope this helps

Julien

jbri...@gmail.com

Sep 6, 2019, 4:04:07 PM
to DigitalPebble
Hi Julien,

Thanks for the response. Next time I will post questions like this on StackOverflow. I suspected I might have to add Elasticsearch to the deployment in order to surface the text, but I wasn't sure. It is clearer now.

Thanks again,

John