Tika processing - Boilerpipe library

kiran

Feb 1, 2013, 12:20:05 PM

to digita...@googlegroups.com

Hi,

I am working with HTML documents and I want to clean the data before processing. Tika can be very helpful for extracting text from HTML, as well as the different tags.

I am wondering how I can use Boilerpipe with Tika and also extract all the tags into the metadata.

What is the underlying parser if I specify the mimeType?

Thank you,

Kiran

DigitalPebble

Feb 1, 2013, 12:51:31 PM

to digita...@googlegroups.com

Hi Kiran,

> I am working with HTML documents and I want to clean the data before processing. Tika can be very helpful for extracting text from HTML, as well as the different tags.

> I am wondering how I can use Boilerpipe with Tika and also extract all the tags into the metadata.

It's more a question for the Tika mailing list, but assuming you want to use it with Behemoth, you would need to modify the code in the tika module so that it uses the BoilerpipeContentHandler. I don't think it would affect the extraction of the metadata, but that could easily be checked.

> What is the underlying parser if I specify the mimeType?

Not sure I understand your question. The mapping between the mimeType and the parser is done by Tika itself, with each parser implementation listing the mime types it supports.
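
To illustrate that last point, here is a minimal sketch in plain Java of how such a lookup can be built from the types each parser declares. It does not use Tika: SketchParser, buildRegistry, and parserFor are hypothetical names made up for the example, and the real Tika parsers expose their types through getSupportedTypes rather than through this interface.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MimeRegistrySketch {

    /** Hypothetical stand-in for a parser that declares the types it handles. */
    public interface SketchParser {
        String name();
        Set<String> supportedTypes();
    }

    /** Builds the lookup table by asking each parser which types it supports. */
    public static Map<String, SketchParser> buildRegistry(List<SketchParser> parsers) {
        Map<String, SketchParser> registry = new HashMap<>();
        for (SketchParser parser : parsers) {
            for (String type : parser.supportedTypes()) {
                // If two parsers claim the same type, the later one wins here.
                registry.put(type, parser);
            }
        }
        return registry;
    }

    /** Small factory so the example parsers can be declared in one line each. */
    public static SketchParser parser(String name, String... types) {
        Set<String> supported = Set.of(types);
        return new SketchParser() {
            public String name() { return name; }
            public Set<String> supportedTypes() { return supported; }
        };
    }

    /** Returns the name of the parser registered for the type, or "none". */
    public static String parserFor(String mimeType) {
        Map<String, SketchParser> registry = buildRegistry(List.of(
                parser("html", "text/html", "application/xhtml+xml"),
                parser("pdf", "application/pdf")));
        SketchParser match = registry.get(mimeType);
        return match == null ? "none" : match.name();
    }

    public static void main(String[] args) {
        System.out.println(parserFor("text/html"));       // html
        System.out.println(parserFor("application/pdf")); // pdf
        System.out.println(parserFor("image/png"));       // none
    }
}
```

The point of the shape is that the caller never names a parser directly: it only supplies a mime type, and the registry decides.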

HTH

Julien

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com
http://www.digitalpebble.com

kiran

Feb 1, 2013, 1:09:43 PM

to digita...@googlegroups.com, jul...@digitalpebble.com

Thank you Julien!

I will play around with the Tika parsers and code to check what is more suitable for my needs.

Is it possible to test Behemoth in standalone mode?

Patricia Gorla

Feb 1, 2013, 1:35:55 PM

to digita...@googlegroups.com

Hi Kiran,

Tika uses the AutoDetectParser, which picks the parser for whatever mime type it detects, unless you specify your own.

--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebbl...@googlegroups.com.
To post to this group, send an email to digita...@googlegroups.com.
Visit this group at http://groups.google.com/group/digitalpebble?hl=en-GB.
For more options, visit https://groups.google.com/groups/opt_out.

--

Patricia Gorla

kiran

Mar 21, 2013, 2:54:44 AM

to digita...@googlegroups.com

Thank you all. I was able to get Boilerpipe working with the Tika component in Behemoth by modifying the TikaProcessor class. It only takes 3-4 changed lines to swap the content handler.

I guess the Tika annotations won't work if we use Boilerpipe to process the text. I don't need this for my use case right now though.

Julien, I noticed previously that you commented on a blog post that the annotations can be collected and then sent to the Boilerpipe library.

Would that be double processing in our case? Do you think it is possible to do this in the TikaProcessor class in Behemoth?

Thanks,

Kiran.

DigitalPebble

Mar 21, 2013, 4:37:50 AM

to digita...@googlegroups.com

Hi Kiran

> Thank you all. I was able to get Boilerpipe working with the Tika component in Behemoth by modifying the TikaProcessor class. It only takes 3-4 changed lines to swap the content handler.

You can also implement your own version of the TikaProcessor and specify it with the -t argument, but it is pretty much the same thing.

> I guess the Tika annotations won't work if we use Boilerpipe to process the text. I don't need this for my use case right now though.

Unless you somehow use Tika's TeeContentHandler for the parsing and have BoilerPipe on the one hand and the 'normal' parsing on the other. Just a thought; I am not quite sure how this would be done in practice.

> Julien, I noticed previously that you commented on a blog post that the annotations can be collected and then sent to the Boilerpipe library.

Did I? Can't remember that one but I trust you ;-) What I probably meant is that Boilerpipe did its own parsing and could instead have received the Tika annotations from the underlying parser.

> Would that be double processing in our case?

The double processing is in the way Boilerpipe worked or at least that's the way it was. It might have changed since.

> Do you think it is possible to do this in the TikaProcessor class in Behemoth?

Well, nothing prevents you from having a bespoke TikaProcessor and calling the Tika parsers twice, with and without Boilerpipe, or better, using the TeeContentHandler so that you call Tika only once.
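
As a rough illustration of the single-pass idea, the sketch below fans one stream of SAX events out to two handlers using only the JDK. It is not Tika code: Tee is a hand-written stand-in for what TeeContentHandler does, TagCollector plays the role of the 'normal' parsing, and ParagraphText is a crude placeholder that keeps only the text inside p elements, which is nowhere near what Boilerpipe actually does. All class and method names are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TeeSketch {

    /** Forwards each SAX event to every delegate, so one parse feeds them all. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] delegates;

        Tee(DefaultHandler... delegates) { this.delegates = delegates; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler d : delegates) d.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            for (DefaultHandler d : delegates) d.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler d : delegates) d.characters(ch, start, length);
        }
    }

    /** Records every element name: stands in for the 'normal' markup branch. */
    static class TagCollector extends DefaultHandler {
        final List<String> tags = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            tags.add(qName);
        }
    }

    /** Keeps only text inside p elements: a placeholder filter, NOT Boilerpipe. */
    static class ParagraphText extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private int depth = 0;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("p".equals(qName)) depth++;
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("p".equals(qName)) depth--;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) text.append(ch, start, length);
        }
    }

    /** Parses the input once and returns {comma-separated tags, filtered text}. */
    public static String[] parseOnce(String xhtml) {
        TagCollector tags = new TagCollector();
        ParagraphText main = new ParagraphText();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new Tee(tags, main));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return new String[] { String.join(",", tags.tags), main.text.toString() };
    }

    public static void main(String[] args) {
        String[] out = parseOnce("<html><body><div>menu</div><p>Main text.</p></body></html>");
        System.out.println(out[0]); // html,body,div,p
        System.out.println(out[1]); // Main text.
    }
}
```

The document is parsed once, yet both the tag list and the filtered text are available afterwards. In a bespoke TikaProcessor the same shape would presumably apply, with Tika's own handlers in place of these stand-ins.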

Julien

kiran

Mar 21, 2013, 1:41:34 PM

to digita...@googlegroups.com, jul...@digitalpebble.com

Thank you Julien for your helpful comments. I have noticed that TikaGUI also used TeeContentHandler for the Tika app.

How are the Behemoth dependencies managed? I am guessing Behemoth is using version 1.3 of Tika. I want to update BoilerPipe to the latest trunk, rather than their last release (almost 2 years old).

Is this something I have to raise with the Tika community, so that Tika picks up the latest trunk of BoilerPipe, or can I update BoilerPipe to the latest version within Behemoth?

Please let me know your suggestions.

Thanks,

Kiran.

DigitalPebble

Mar 22, 2013, 4:50:37 AM

to digita...@googlegroups.com

I suppose BoilerPipe will get updated in Tika at some point (I should probably take care of this as I am a Tika committer) but you won't be able to use that until a new version of Tika is published.

The dependencies in Behemoth are managed with Maven. You could add the latest available version of BP to the dependencies of Behemoth-Tika, but it looks like the latest version on Maven is http://mvnrepository.com/artifact/de.l3s.boilerpipe/boilerpipe/1.1.0 which is the same as the one used by Tika.

Alternatively, add the jar manually and specify it with 'mvn install:install-file ...' before building the tika module.

Julien

kiran chitturi

Mar 22, 2013, 11:12:16 AM

to digita...@googlegroups.com

Thank you Julien. I will add the latest BoilerPipe trunk manually. Their last version (1.2.0) was released in 2011.

--

Kiran Chitturi

kiran chitturi

Mar 22, 2013, 11:29:22 AM

to digita...@googlegroups.com

BoilerPipe has a local maven repository at http://boilerpipe.googlecode.com/svn/repo/ and they have a 'boilerpipe.pom' file [1] which I renamed to pom.xml and 'mvn install' worked :)

[1] - http://boilerpipe.googlecode.com/svn/repo/de/l3s/boilerpipe/boilerpipe/1.2.0/

--

Kiran Chitturi
