indexing pdf files with Tibetan Unicode in xtf

Gerry Wiener

Nov 21, 2021, 11:25:13 PM
to xtf-...@googlegroups.com
Lately, we have been looking into adding PDF files containing Unicode Tibetan to our digital library, http://nitarthadigitallibrary.org/xtf/search. We have noticed that PDF files generated by LibreOffice using "Export Directly as PDF" are indexed correctly by XTF, and subsequent searches work properly. If we use Pages to generate the PDF file, the XTF indexer crashes. If we use InDesign to generate the PDF, the XTF indexer succeeds, but searching for any of the Tibetan text in the PDF file always fails.

Does anyone know of a remedy to this issue? It would be nice to be able to use the InDesign PDF files in XTF owing to the quality formatting supported by InDesign.

Thank you very much!

-Gerry




dan haig

May 2, 2022, 10:22:16 PM
to xtf-...@googlegroups.com
Hey guys, if anybody's still out here,

I find I need to edit our stopwords list, which is kept in /conf/textIndexer.conf:

        <stopwords list="a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was will with"/>

... and easily changed there, but that alone doesn't do anything except break my textIndexer.

The same list also occurs a few times under /WEB-INF/, but I'm not sure which, if any, are relevant to indexing.

Anyone remember doing this? I did this once, about 10 years ago, but for some reason it's not working the way I thought it did.

Thanks,
Dan

Bridger Dyson-Smith

May 3, 2022, 11:28:21 AM
to XTF Users List
Hi Dan -

I have vague memories of tinkering with this but ... it's been a bit. Did you use the <stopwords path="..."/> option previously?
Given a bit of time I can do some digging to see what I may have changed.

Best,
Bridger


Steven D. Majewski

May 3, 2022, 11:49:46 AM
to xtf-user@googlegroups.com List

I did it a while back to remove "will" (it turns out a number of finding aids refer to people's estates and wills, and searches for those were failing). I don't recall any problems other than having to reindex everything.

-Steve


commit af47e1e69e19993c2cc184d2cbade78ba74bfdbd
Author: Steve Majewski <sd...@virginia.edu>
Date:   Mon Feb 22 18:33:51 2021 -0500

    remote "will" from stopwords

diff --git a/conf/textIndexer.conf b/conf/textIndexer.conf
index 9141ff10..c3d9cbad 100644
--- a/conf/textIndexer.conf
+++ b/conf/textIndexer.conf
@@ -18,7 +18,7 @@
         <!-- End of expert version -->
         <chunk size="200" overlap="20"/>
         <docselector path="./style/textIndexer/VIVAdocSelector.xsl"/>
-        <stopwords list="a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was will with"/>
+        <stopwords list="a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was with"/>
         <pluralmap path="./conf/pluralFolding/pluralMap.txt.gz"/>
         <accentmap path="./conf/accentFolding/accentMap.txt"/>
         <spellcheck createDict="yes"/>
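
For anyone who would rather script the same edit than change the line by hand, here is a minimal Python sketch. It assumes the list lives in a single <stopwords list="..."/> attribute as in the diff above; the function name remove_stopword and the shortened sample line are made up for illustration and are not part of XTF. As noted in this thread, the existing index still has to be rebuilt after the config changes.

```python
import re

def remove_stopword(conf_text: str, word: str) -> str:
    """Return conf_text with `word` dropped from any <stopwords list="..."/> attribute."""
    pattern = re.compile(r'(<stopwords\s+list=")([^"]*)(")')

    def rewrite(match):
        # Keep every word in the space-separated list except the one being removed.
        kept = [w for w in match.group(2).split() if w != word]
        return match.group(1) + " ".join(kept) + match.group(3)

    return pattern.sub(rewrite, conf_text)

# Shortened, hypothetical sample line; the real list is longer.
sample = '<stopwords list="a an and the was will with"/>'
print(remove_stopword(sample, "will"))
```

The sketch works on the text rather than parsing the XML, so comments and indentation elsewhere in textIndexer.conf are left untouched.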




dan haig

May 3, 2022, 12:13:18 PM
to xtf-...@googlegroups.com
Steve, thank you. You have reminded me of a long-forgotten maxim of working with XTF: when having trouble with search, throw out /index entirely and start fresh. That is what I did, and voilà, it works. And yes, "will" was the offending word for our user as well; for people digging in a trove of philosophy texts, the concept of Will is not trivial.

Bridger, thanks to you also; our only instance of <stopwords path="foo"/> was for a test index we never used. I was just running "default" here, as ever.

For posterity, here's the error I got before I threw out the existing /index and started fresh. I'd do that more readily, but even on our brand-new, fast servers our collection still takes 6 hours to index.


  Purging Incomplete Documents From Indexes:
    Index: [/Users/danhaig/IntelexRepo/past_masters/trunk/xtf/apache-tomcat-8.0.30/webapps/xtf-3.1/index/]
    No Incomplete Documents Found.
  Done.

  Indexing New/Updated Documents:
    Index: "default"

*** Error: class java.lang.RuntimeException
java.lang.RuntimeException: Index stop words (a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was will with) doesn't match config (a an and are as at be but by for if in into is it no not of on or s t that the their then there these they this to was with)
    at org.cdlib.xtf.textIndexer.XMLTextProcessor.open(XMLTextProcessor.java:591)
    at org.cdlib.xtf.textIndexer.SrcTreeProcessor.open(SrcTreeProcessor.java:142)
    at org.cdlib.xtf.textIndexer.TextIndexer.doIndexing(TextIndexer.java:474)
    at org.cdlib.xtf.textIndexer.TextIndexer.main(TextIndexer.java:339)
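
If you hit the same exception and want to see exactly which words differ between the existing index and the config, a small Python sketch like the following may help. The two strings are copied from the error message above; the function name stopword_diff is made up for illustration and is not part of XTF.

```python
def stopword_diff(index_words: str, config_words: str):
    """Compare two space-separated stopword lists.

    Returns (words only in the index, words only in the config), each sorted.
    """
    index_set = set(index_words.split())
    config_set = set(config_words.split())
    return sorted(index_set - config_set), sorted(config_set - index_set)

# Lists copied from the RuntimeException message.
index_words = ("a an and are as at be but by for if in into is it no not of on or "
               "s such t that the their then there these they this to was will with")
config_words = ("a an and are as at be but by for if in into is it no not of on or "
                "s t that the their then there these they this to was with")

only_in_index, only_in_config = stopword_diff(index_words, config_words)
print("only in index: ", only_in_index)   # ['such', 'will']
print("only in config:", only_in_config)  # []
```

Any difference at all triggers the mismatch error, so the old index has to be discarded and rebuilt whenever the list changes.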


Cheers,
Dan


