Ian Boston logged work on KERN-754:
-----------------------------------
Author: Ian Boston
Created on: 27-May-2010 06:44
Start Date: 27-May-2010 06:43
Worklog Time Spent: 4 hours
Work Description: Logging work done investigating this. Tika has configuration and Jackrabbit configures Tika, but all attempts to override the config have failed. Suggestions on the list amounted to "can't be done", leaving 8h of work on this one.
Issue Time Tracking
-------------------
Remaining Estimate: 1 day
Time Spent: 4 hours
> Tika exceptions when starting up nakamura
> -----------------------------------------
>
> Key: KERN-754
> URL: http://jira.sakaiproject.org/browse/KERN-754
> Project: Nakamura
> Issue Type: Bug
> Components: System - other
> Affects Versions: 0.4
> Reporter: Christian Vuerings
> Priority: Minor
> Fix For: 0.6
>
> Time Spent: 4 hours
> Remaining Estimate: 1 day
Simon Gaeremynck reassigned KERN-754:
-------------------------------------
Assignee: Simon Gaeremynck
> Tika exceptions when starting up nakamura
> -----------------------------------------
>
> Key: KERN-754
> URL: http://jira.sakaiproject.org/browse/KERN-754
> Project: Nakamura
> Issue Type: Bug
> Components: System - other
> Affects Versions: 0.4
> Reporter: Christian Vuerings
> Assignee: Simon Gaeremynck
> Priority: Minor
> Fix For: 0.6
>
> Time Spent: 4 hours
> Remaining Estimate: 1 day
>
Simon Gaeremynck commented on KERN-754:
---------------------------------------
The files are:
* myfriends.html
* helloworld.html
* sendmessage.html
* .project files
* helloworld.html
* jcap.js
Running file --mime-type on these returns text/html most of the time, so I don't really know why Tika tries to parse them as XML.
Ian Boston commented on KERN-754:
---------------------------------
I think a comment I made got lost somewhere.
I asked the JR list
http://markmail.org/thread/6nfty43dcxs6p2bv
The problem is that the HTML is perfectly good HTML but is being classified as XML, because the settings in tika-config.xml incorrectly treat namespace-less HTML as XML when it is really HTML.
To fix this we need to find a way of modifying tika-config.xml and getting it loaded in preference to the one that comes with the jackrabbit core jar.
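One way to check which copy of a config file wins is to list every copy a class loader can see. The sketch below is only a diagnostic aid, not a fix: the class name ResourceOrder and the default resource name are hypothetical, the real resource path inside the jackrabbit core jar is not stated here and has to be supplied, and in an OSGi container each bundle has its own class loader, so the result depends on which loader is passed in.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists every copy of a classpath resource a loader can see.
final class ResourceOrder {

    /** Returns all copies of the named resource, in the order the loader reports them. */
    static List<URL> copiesOf(ClassLoader loader, String resourceName) {
        try {
            return new ArrayList<>(Collections.list(loader.getResources(resourceName)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed default name; pass the real resource path as the first argument.
        String name = args.length > 0 ? args[0] : "tika-config.xml";
        for (URL url : copiesOf(ClassLoader.getSystemClassLoader(), name)) {
            System.out.println(url);
        }
    }
}
```

If more than one URL is printed, the first entry is normally the copy a plain getResource lookup on the same loader would return.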
Simon Gaeremynck commented on KERN-754:
---------------------------------------
I'm not so sure if the tika-config.xml file is the culprit. That just maps mimetypes to a Parser.
Doing:
java -jar tika-app.0.6.jar -t helloworld.html
gives the same exception.
I think it's the tika-mimetypes.xml file whose patterns don't suit our HTML fragments.
Ian Boston commented on KERN-754:
---------------------------------
If
java -jar tika-app.0.6.jar -t helloworld.html
gives the same exception (i.e. TIKA-237: Illegal SAXException)
then the SAX parser is being used, which indicates that the XML parser is being used rather than the loose HTML parser. That is exactly what is happening with Nakamura.
If you can force Tika to use the HTML parser and you still get an exception (one that does not mention SAX, since IIRC the HTML parser doesn't use SAX) then our fragments are not parseable by the HTML parser, and we need to find an alternative route.
Simon Gaeremynck commented on KERN-754:
---------------------------------------
The HTMLParser doesn't use SAX and doing a HTMLParser.parse(....) with our fragments works just fine.
When using the AutoDetectParser, it detects that the file is application/xml.
It then tries to determine whether the file is HTML or XML.
It does this by looking at the first bytes and checking whether there is an html, link, head, title, ... tag in there.
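The first-bytes check described above can be illustrated roughly as follows. This is a simplified sketch and not Tika's actual code: the class name HtmlSniffer, the marker list (limited to the four tags named above) and the 1024-byte window are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration of a "look for telltale tags in the first bytes" check.
final class HtmlSniffer {

    // Assumed markers, taken from the tags named in the comment above.
    private static final String[] MARKERS = {"<html", "<head", "<title", "<link"};

    // Assumed window size; the real detector's window is not stated in this thread.
    private static final int PREFIX_LENGTH = 1024;

    /** Returns true if the start of the content contains one of the marker tags. */
    static boolean looksLikeHtmlDocument(byte[] content) {
        int length = Math.min(content.length, PREFIX_LENGTH);
        // ISO-8859-1 maps every byte to a char, so decoding never fails.
        String prefix = new String(content, 0, length, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (prefix.contains(marker)) {
                return true;
            }
        }
        return false;
    }
}
```

With a check like this, a widget fragment that starts with a div and contains none of the marker tags is not recognised as an HTML document, which matches the behaviour seen with our fragments.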
The new widget spec states that all the widgets will be full-blown HTML pages and not fragments anymore.
I asked one of our UI guys and he said that they want to move to that spec sooner rather than later.
Which makes me feel that we shouldn't spend too much time on it?
Ian Boston commented on KERN-754:
---------------------------------
My point exactly,
AutoDetectParser detects that HTML is application/xml for the purposes of parsing, which means the file must be valid XML.
Only XHTML is guaranteed to be valid XML; most HTML is not. TIKA-237 will be reported wherever a fragment is stored in the content system, not just for widgets, so application/xml should really only be selected where the file is guaranteed to be valid XML.
I agree that making widgets full HTML pages and forcing them to be fully valid XML will help, but the fundamental problem remains: HTML should use the HTMLParser, not the XMLParser. To fix that we need to fix the tika-config.xml so that it correctly identifies HTML as text/html and not application/xml.
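To make the "most HTML is not valid XML" point concrete, here is a minimal sketch using only the JDK's own SAX parser, not Tika. The class name WellFormedCheck and the sample markup are made up for illustration. It shows the kind of markup an XML parser rejects and a loose HTML parser would normally accept.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether markup is well-formed XML according to the JDK SAX parser.
final class WellFormedCheck {

    static boolean isWellFormedXml(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Unclosed tags, multiple root elements and similar problems end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Ordinary HTML: <br> and <p> are left open, which XML does not allow.
        System.out.println(isWellFormedXml("<div><p>Hello<br></div>"));
        // XHTML-style markup with every element closed.
        System.out.println(isWellFormedXml("<div><p>Hello<br/></p></div>"));
    }
}
```

A fragment with more than one top-level element, such as two sibling p elements, fails the same check, even though it is perfectly acceptable as an HTML fragment.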
Simon Gaeremynck commented on KERN-754:
---------------------------------------
I've committed a bundle that unpacks Tika and has a custom tika-mimetypes.xml file.
If I do a java -jar tika-app.jar -m on our fragments, it correctly identifies them as text/html.
When I load the content in the server, however, it tries to parse it as application/xml.
When I attach a debugger to check it, it correctly uses text/html again.
I'm fairly sure that the jackrabbit-core jar is using Tika with our tika-mimetypes file, but it looks like there is some threading issue going on.
Ian Boston resolved KERN-754.
-----------------------------
Resolution: Fixed
This is now fixed.
Some of the files starting with <!-- were categorized as HTML and then re-categorized as XML.
> Tika exceptions when starting up nakamura
> -----------------------------------------
>
> Key: KERN-754
> URL: http://jira.sakaiproject.org/browse/KERN-754
> Project: Nakamura
> Issue Type: Bug
> Components: System - other
> Affects Versions: 0.4
> Reporter: Christian Vuerings
> Assignee: Simon Gaeremynck
> Priority: Minor
> Fix For: 0.7
>
> Time Spent: 4 hours
> Remaining Estimate: 1 day
>