Indexing PDF files with Tibetan Unicode in XTF

Gerry Wiener

Nov 21, 2021, 11:25:13 PM

to xtf-...@googlegroups.com

Lately, we have been looking into adding PDF files containing Unicode Tibetan to our digital library, http://nitarthadigitallibrary.org/xtf/search. We have noticed that PDF files generated by LibreOffice using "Export Directly as PDF" are indexed correctly by XTF, and subsequent searches work. If we use Pages to generate the PDF file, the XTF indexer crashes. If we use InDesign to generate the PDF, the indexer succeeds, but searches for any of the Tibetan text in the PDF file always fail.

Does anyone know of a remedy to this issue? It would be nice to be able to use the InDesign PDF files in XTF owing to the quality formatting supported by InDesign.

Thank you very much!

-Gerry

dan haig

May 2, 2022, 10:22:16 PM

to xtf-...@googlegroups.com

Hey guys, if anybody's still out here,

I find I need to edit our stopwords list, which is kept in /conf/textIndexer.conf:

... and easily changed there, but that alone doesn't do anything except break my textIndexer.

There's also the same list occurring a few times in /WEB-INF/ but I'm not sure which if any are relevant to the function of the indexing.

Anyone remember doing this? I did this once, like 10 years ago, but for some reason it's not working the way I thought it did.

Thanks,

Dan

Bridger Dyson-Smith

May 3, 2022, 11:28:21 AM

to XTF Users List

Hi Dan -

I have vague memories of tinkering with this but ... it's been a bit. Did you use the <stopwords path="..."/> option previously?

Given a bit of time I can do some digging to see what I may have changed.

Best,

Bridger


Steven D. Majewski

May 3, 2022, 11:49:46 AM

to xtf-user@googlegroups.com List

I did it a while back to remove “will” (it turns out a number of finding aids refer to people’s estates and wills, and searches for those were failing). I don’t recall any problems other than having to reindex everything. - Steve.

commit af47e1e69e19993c2cc184d2cbade78ba74bfdbd
Author: Steve Majewski <sd...@virginia.edu>
Date: Mon Feb 22 18:33:51 2021 -0500

    remote "will" from stopwords

diff --git a/conf/textIndexer.conf b/conf/textIndexer.conf
index 9141ff10..c3d9cbad 100644
--- a/conf/textIndexer.conf
+++ b/conf/textIndexer.conf
@@ -18,7 +18,7 @@
- <stopwords list="a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was will with"/>
+ <stopwords list="a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was with"/>
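
To check which words a given config actually treats as stop words, the `list` attribute can be read programmatically. The following is a minimal Python sketch, not part of XTF. The `<stopwords list="..."/>` element is taken from the diff above; the surrounding `<textIndexer-config>` and `<index name="default">` wrapper elements and the helper name `stopwords_from_conf` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample config. Only the <stopwords> element is copied from the
# diff above; the wrapper elements are assumed for the sake of the example.
SAMPLE_CONF = """
<textIndexer-config>
  <index name="default">
    <stopwords list="a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was with"/>
  </index>
</textIndexer-config>
"""


def stopwords_from_conf(xml_text):
    """Return {index name: set of stop words} for each <index> with a stopwords list."""
    root = ET.fromstring(xml_text)
    result = {}
    for index in root.iter("index"):
        sw = index.find("stopwords")
        # Skip indexes that have no inline list (e.g. ones using a path instead).
        if sw is not None and sw.get("list"):
            result[index.get("name")] = set(sw.get("list").split())
    return result


words = stopwords_from_conf(SAMPLE_CONF)["default"]
print("will" in words)  # False: "will" is now searchable
print("the" in words)   # True: still a stop word
```

Against a real file, the same function could be fed the contents of conf/textIndexer.conf, assuming the file follows the layout sketched here.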

dan haig

May 3, 2022, 12:13:18 PM

to xtf-...@googlegroups.com

Steve, thank you, you have reminded me of a long-forgotten maxim of working with XTF: when having trouble with Search, throw out /index entirely and start fresh. This is what I have done, and voila, it works. And yes, "will" was the offending word for our user as well: for people digging in a trove of philosophy texts, the concept of Will is not trivial.

Bridger, thanks to you also; our only instance of <stopwords path="foo"/> was for a test index we never used. I was just running "default" here, as ever.

For posterity, here's the error I got before I threw out the existing /index entirely and started fresh. I'd do it more readily, but even on our brand spanking new, blazing fast servers our stuff still takes 6 hours to index.

Purging Incomplete Documents From Indexes:
Index: [/Users/danhaig/IntelexRepo/past_masters/trunk/xtf/apache-tomcat-8.0.30/webapps/xtf-3.1/index/]
No Incomplete Documents Found.
Done.

Indexing New/Updated Documents:
Index: "default"

*** Error: class java.lang.RuntimeException
java.lang.RuntimeException: Index stop words (a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was will with) doesn't match config (a an and are as at be but by for if in into is it no not of on or s t that the their then there these they this to was with)
    at org.cdlib.xtf.textIndexer.XMLTextProcessor.open(XMLTextProcessor.java:591)
    at org.cdlib.xtf.textIndexer.SrcTreeProcessor.open(SrcTreeProcessor.java:142)
    at org.cdlib.xtf.textIndexer.TextIndexer.doIndexing(TextIndexer.java:474)
    at org.cdlib.xtf.textIndexer.TextIndexer.main(TextIndexer.java:339)
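
The two lists in that exception are long enough that the difference is hard to spot by eye. A minimal Python sketch for pulling them apart, assuming the message keeps the exact wording shown in the trace; the function name `stopword_mismatch` is hypothetical and not part of XTF:

```python
import re

# The exception message, copied from the trace above.
ERROR = (
    "Index stop words (a an and are as at be but by for if in into is it no "
    "not of on or s such t that the their then there these they this to was "
    "will with) doesn't match config (a an and are as at be but by for if in "
    "into is it no not of on or s t that the their then there these they this "
    "to was with)"
)


def stopword_mismatch(message):
    """Return (words only in the existing index, words only in the config)."""
    m = re.search(
        r"Index stop words \((.*?)\) doesn't match config \((.*?)\)", message
    )
    if m is None:
        raise ValueError("not a stop-word mismatch message")
    index_words = m.group(1).split()
    config_words = m.group(2).split()
    only_index = [w for w in index_words if w not in config_words]
    only_config = [w for w in config_words if w not in index_words]
    return only_index, only_config


print(stopword_mismatch(ERROR))  # (['such', 'will'], [])
```

Run against the message above, this reports both "such" and "will" as present in the existing index but absent from the edited config, so the config in use here had dropped "such" as well as "will".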

Cheers,

Dan
