I am working with HTML documents and I want to clean the data before processing. Tika can be very helpful for extracting text from HTML, as well as the different tags. I am wondering how I can use Boilerpipe with Tika and also extract all the tags into the metadata.
What is the underlying parser if I specify the mimeType?
--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebbl...@googlegroups.com.
To post to this group, send an email to digita...@googlegroups.com.
Visit this group at http://groups.google.com/group/digitalpebble?hl=en-GB.
For more options, visit https://groups.google.com/groups/opt_out.
Thank you all. I was able to get Boilerpipe working with the Tika component in Behemoth by modifying the TikaProcessor class. It only took 3-4 lines to change the content handler.
I guess the Tika annotations won't work if we use Boilerpipe to process the text. I don't need them for my use case right now, though.
Julien, I noticed that you commented on a blog post that the annotations can be collected first and then sent to the Boilerpipe library.
Would that mean double processing in our case?
Do you think it is possible to do this in the TikaProcessor class in Behemoth?
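
To make the single-pass idea concrete, here is a minimal sketch of what I have in mind. It is not Behemoth's or Tika's actual code: it uses only the JDK's SAX classes, and the names SinglePassSketch, TagCollector, TextCollector and Tee are hypothetical. TagCollector stands in for the annotation collection and TextCollector stands in for the handler that would feed Boilerpipe. Tika itself ships a TeeContentHandler in org.apache.tika.sax that plays the role of Tee here; whether it can be wired into TikaProcessor this way is exactly what I am asking.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SinglePassSketch {

    /** Hypothetical stand-in for annotation collection: records each element name. */
    static class TagCollector extends DefaultHandler {
        final List<String> tags = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            tags.add(qName);
        }
    }

    /** Hypothetical stand-in for the text side: accumulates character data. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Forwards element and character events to two handlers, so the input is parsed once. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler first;
        private final DefaultHandler second;

        Tee(DefaultHandler first, DefaultHandler second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            first.startElement(uri, localName, qName, atts);
            second.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            first.endElement(uri, localName, qName);
            second.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            first.characters(ch, start, length);
            second.characters(ch, start, length);
        }
    }

    /** Runs a single SAX parse over well-formed XHTML and feeds the given handler. */
    static void parseOnce(String xhtml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
    }

    public static void main(String[] args) throws Exception {
        TagCollector tags = new TagCollector();
        TextCollector text = new TextCollector();
        parseOnce("<html><body><p>Hello</p></body></html>", new Tee(tags, text));
        System.out.println(tags.tags + " / " + text.text);
    }
}
```

With a fan-out like this, both collectors see the same event stream from one parse, so nothing is processed twice. The open question is whether the Boilerpipe handler can sit on one branch without losing the offsets the annotations need.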