[DuraSpace JIRA] (FCREPO-1010) Use Apache Tika for extraction


A. Soroka (Created) (DuraSpace JIRA)

Oct 12, 2011, 10:23:04 AM
to fcrepo-...@googlegroups.com
Use Apache Tika for extraction
------------------------------

Key: FCREPO-1010
URL: https://jira.duraspace.org/browse/FCREPO-1010
Project: Fedora Repository Project
Issue Type: New Feature
Components: GSearch
Reporter: A. Soroka


Apache Tika is a toolkit that can extract text and metadata from a wide variety of file formats, identified by MIME type (including PDF, via PDFBox). Employing Tika as an extraction engine in GSearch would immediately and greatly expand the range of material that GSearch could index, and going forward, GSearch would benefit from new and better-performing parsers created by the Tika project.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://jira.duraspace.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


Chris Wilper (Commented) (DuraSpace JIRA)

Oct 18, 2011, 11:35:04 AM
to fcrepo-...@googlegroups.com

[ https://jira.duraspace.org/browse/FCREPO-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=22848#comment-22848 ]

Chris Wilper commented on FCREPO-1010:
--------------------------------------

Gert, is this feature fair game for GSearch? If so, can you "Open" it?



> Use Apache Tika for extraction
> ------------------------------
>
> Key: FCREPO-1010
> URL: https://jira.duraspace.org/browse/FCREPO-1010
> Project: Fedora Repository Project
> Issue Type: New Feature
> Components: GSearch
> Reporter: A. Soroka
> Labels: gsearch, indexing

Chris Wilper (Assigned) (DuraSpace JIRA)

Oct 18, 2011, 11:41:04 AM
to fcrepo-...@googlegroups.com

[ https://jira.duraspace.org/browse/FCREPO-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Wilper reassigned FCREPO-1010:
------------------------------------

Assignee: Gert Schmeltz Pedersen

Assigning Gert just so he gets on the email list for this issue. Gert, our normal process for FCREPO issues is to review them and get consensus among committers if needed. If the issue is "fair game" for Fedora, we then put it into the Open state so people know that it can be worked on. I figured you'd be the best judge of this issue submitted by Adam.



> Use Apache Tika for extraction
> ------------------------------
>
> Key: FCREPO-1010
> URL: https://jira.duraspace.org/browse/FCREPO-1010
> Project: Fedora Repository Project
> Issue Type: New Feature
> Components: GSearch
> Reporter: A. Soroka
> Assignee: Gert Schmeltz Pedersen
> Labels: gsearch, indexing

A. Soroka (Commented) (DuraSpace JIRA)

Oct 18, 2011, 12:59:04 PM
to fcrepo-...@googlegroups.com

[ https://jira.duraspace.org/browse/FCREPO-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=22853#comment-22853 ]

A. Soroka commented on FCREPO-1010:
-----------------------------------

Some useful info:

Here:

https://tika.apache.org/0.10/parser.html

is the crucial API portion. As you can see at that link, Tika expects a SAX handler to receive information that it retrieves about an input. That seems to me to jibe quite well with the current GSearch architecture and shouldn't be difficult to integrate with any future architecture.
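
To make the handler side concrete, here is a minimal sketch (not GSearch code) of a SAX handler that collects plain text. Tika emits its extracted content as XHTML SAX events, so the sketch feeds a small hand-written XHTML string through the JDK's own SAX parser as a stand-in for Tika. The class name TextCollector and the helper extractText are hypothetical, chosen only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates the character data of every SAX event it receives.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node; appending keeps the pieces in order.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Put each paragraph on its own line.
        if ("p".equals(qName)) {
            text.append('\n');
        }
    }

    public String getText() {
        return text.toString();
    }

    // Stand-in for Tika: parse an XHTML string with the JDK SAX parser.
    // With Tika on the classpath, the same handler could instead be passed to
    // Parser.parse(InputStream, ContentHandler, Metadata, ParseContext).
    static String extractText(String xhtml) throws Exception {
        TextCollector handler = new TextCollector();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extractText("<html><body><p>Hello</p><p>Tika</p></body></html>"));
    }
}
```

Because the handler depends only on the standard SAX interfaces, the same class would work whether the events come from an XML parser or from a Tika parser, which is the property that makes the integration look straightforward.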

Also, take a look at this format list:

https://tika.apache.org/0.10/formats.html

A real cornucopia!



Gert Schmeltz Pedersen (Updated) (DuraSpace JIRA)

Oct 19, 2011, 8:34:03 AM
to fcrepo-...@googlegroups.com

[ https://jira.duraspace.org/browse/FCREPO-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gert Schmeltz Pedersen updated FCREPO-1010:
-------------------------------------------

Status: Open (was: Received)



> Use Apache Tika for extraction
> ------------------------------
>
> Key: FCREPO-1010
> URL: https://jira.duraspace.org/browse/FCREPO-1010
> Project: Fedora Repository Project
> Issue Type: Story
> Components: GSearch
> Reporter: A. Soroka
> Assignee: Gert Schmeltz Pedersen
> Labels: gsearch, indexing

Gert Schmeltz Pedersen (Updated) (DuraSpace JIRA)

Oct 19, 2011, 8:36:03 AM
to fcrepo-...@googlegroups.com

[ https://jira.duraspace.org/browse/FCREPO-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gert Schmeltz Pedersen updated FCREPO-1010:
-------------------------------------------

Fix Version/s: GSearch 2.4



> Use Apache Tika for extraction
> ------------------------------
>
> Key: FCREPO-1010
> URL: https://jira.duraspace.org/browse/FCREPO-1010
> Project: Fedora Repository Project
> Issue Type: Story
> Components: GSearch
> Reporter: A. Soroka
> Assignee: Gert Schmeltz Pedersen
> Labels: gsearch, indexing
> Fix For: GSearch 2.4

Gert Schmeltz Pedersen (Commented) (DuraSpace JIRA)

Oct 19, 2011, 9:40:03 AM
to fcrepo-...@googlegroups.com

[ https://jira.duraspace.org/browse/FCREPO-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=22867#comment-22867 ]

Gert Schmeltz Pedersen commented on FCREPO-1010:
------------------------------------------------

I just published the branch fcrepo-1010 on GitHub to get feedback.
It uses Tika 0.10; Tika 1.0 is expected in November 2011.
It is a big jar (24 MB), so PermGen space has to be raised,
and it doubles the size of fedoragsearch.war.
The branch adds two functions to GenericOperationsImpl:
- getDatastreamFromTika: retrieves the text only
- getDatastreamFromTikaWithMetadata: retrieves the text and the metadata
The branch comes with a test suite in gsearch.test.fgs24_1010,
where the two functions are tested on both Lucene and Solr.
The tests use docx, doc, and pdf datastreams,
but potentially all the Tika formats are available,
since the branch uses Tika's AutoDetectParser.


