Issues using annif-tutorial steps to load OCLC FAST vocabulary


MSH

Dec 12, 2022, 6:11:55 AM
to Annif Users
Hi :^)

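I am writing from the National Library of Scotland, where we have been doing some initial experimentation with ANNIF. We have been trying to load the OCLC FAST vocabulary to generate headings for some sets of descriptive text, and have been following the annif-tutorial exercises (https://github.com/NatLibFi/Annif-tutorial) within the Docker-based install of ANNIF. As well as this, we have had great help from some of the previous FAST (& LCSH) discussions on this forum (https://groups.google.com/g/annif-users/c/lofDo9_byIg/m/5B8D6iZ8AwAJ, https://groups.google.com/g/annif-users/c/anlZYDUZXJQ/m/RE7ktCU8AwAJ, https://groups.google.com/g/annif-users/c/vGVUP90GUyQ/m/5XIseddbBwAJ, https://groups.google.com/g/annif-users/c/8XAT03f4j_o/m/gsYX1pr8CAAJ), & from some users at other institutions who've already ironed out some of the formatting issues of the un-skosified .nt FAST files on the OCLC site.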

So far, my colleague Agnieszka Kurzeja & I have followed the path suggested in the tutorial, based on loading sample economic data (stw-econbiz-small.tsv.gz, stw-skos.ttl), with a brief specification in the projects.cfg file. I was hoping to switch out the paths and load our edited FAST vocabulary and text corpus using the same method, but we are both receiving an error message suggesting the files can't be found:
  • Error: Invalid value for 'SUBJECTFILE': File 'FASTAll_fixed.nt.gz' does not exist.
I have tried loading the vocab file as a zipped or unzipped file, & have tried loading our corpus as a compressed or uncompressed .tsv file:

[Attached images: f2.png, f1.png]

I've made edits to our configuration file to describe the data but wondered whether I'm forgetting about a specification in the tutorial setup, & whether ANNIF is only looking for tutorial data (otherwise I am maybe making a clumsy syntax error!, apologies but I am not very familiar with Docker/Bash).

My apologies for a very general beginner's question - it has been great to read more about Parthasarathi's recent experiments & we are really excited to play with ANNIF and see how it copes with our text corpus. I think we're almost there but any pointers about getting set up would be much appreciated. No worries if there are no obvious answers, and I will bump this thread as we make progress :^)

All the best,
Mark Simon Haydn
National Library of Scotland

Lehtinen, Mona

Dec 12, 2022, 8:04:43 AM
to MSH, Annif Users

Hi Mark!

Some first questions: did you manage to do the tutorial exercises with Docker? Are you on Windows, and are you using the named volumes approach?

Here's some more info on using Annif with Docker; maybe following it will help: https://github.com/NatLibFi/Annif/wiki/Usage-with-Docker (note especially the part about file permission errors, as these can prevent the file from being read).
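
Because a relative file name is resolved against the working directory inside the container, it can help to check what the container itself can see. Below is a minimal Python sketch of such a check. It is not part of Annif: `describe_path` is a hypothetical helper, and `FASTAll_fixed.nt.gz` is simply the file name from the error message above. Run from the same directory inside the container where the annif command is run, it reports whether the path exists and is readable, and lists the files visible next to it.

```python
import os
from pathlib import Path


def describe_path(name: str) -> dict:
    """Report how a (possibly relative) path looks from the current process."""
    # Relative names are resolved against the current working directory.
    resolved = (Path.cwd() / name).resolve()
    parent = resolved.parent
    return {
        "cwd": str(Path.cwd()),
        "resolved": str(resolved),
        "exists": resolved.exists(),
        "readable": os.access(resolved, os.R_OK),
        # Files actually visible in the directory the path points into.
        "siblings": sorted(p.name for p in parent.iterdir()) if parent.is_dir() else [],
    }


if __name__ == "__main__":
    # Hypothetical usage: the file name is taken from the error message.
    for key, value in describe_path("FASTAll_fixed.nt.gz").items():
        print(f"{key}: {value}")
```

If `exists` is false and the file is missing from `siblings`, the file is probably not inside the mounted volume at all; if it exists but `readable` is false, a permission problem of the kind described on the wiki page is the more likely cause.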

 

Best (& hope this helps),

Mona L.

 

 

--
You received this message because you are subscribed to the Google Groups "Annif Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
annif-users...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/annif-users/23c4b696-93b1-43d0-a2d0-62abfc0baa66n%40googlegroups.com.

MSH

Dec 14, 2022, 6:19:27 AM
to Annif Users
Hi Mona !, thank you for getting back to us. Just dropping a quick placeholder message here while we try out using named volumes instead - we are using Windows so I think it's possible we will hit a wall when we try to make the user IDs the same in the container & host. It makes sense that it's a permissions error stopping ANNIF from finding our vocab files.

Following the Docker instructions now & will let you know if we make any progress!, thank you for thinking this through for us.

Mark

MSH

Jan 20, 2023, 9:51:09 AM
to Annif Users
Hi Mona, & to anyone else following along: just posting a quick update on the above, still extremely glad to have found the ANNIF community & get occasional updates on similar projects. We've made some progress with the tutorial exercises but are still running into the same (permissions?) errors quoted above, with our vocab files inaccessible, & it does seem like there are some basic issues meaning we would need to name our volumes or change the workflow so that we're not referring to locations we cannot access - this has not worked so far 🤷.

Before trying more troubleshooting I wanted to ask a couple of very general questions to rule things out before taking another approach:

1. When repeating commands that are successful in the training exercises but changing file paths to point to our own vocab files, contained in the same directories, we have received the same invalid value for [PATHS] error shown above; the config file doesn't seem to go into great detail about what filetypes ANNIF can work with, and I wanted to check that our file should be acceptable: it's an N-Triples file that has been lightly edited (but could probably be further Skosified if it might help), & we've tried loading it as an .nt file or as a compressed .gz file.

2. The other question was very general, about the best setup for us to use as Librarians using ANNIF on managed Windows 10 PCs, having initially landed on running ANNIF via Docker as the easiest solution. We have Docker & ANNIF running OK but I wonder if setting up a Virtual Machine for a Linux instance would be more productive? Apologies if this is not an easy question to answer, I'm just struggling with the directory architecture problems we're seeing in Docker & am not 100% sure whether it's the file/our setup/the commands we're using/etc.
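
One way to see what the lightly edited N-Triples file from question 1 actually contains is to count its predicates and the language tags on its literals. The following is a minimal sketch using only the Python standard library. It is not an Annif feature and it is a line-based approximation rather than a full N-Triples parser; `open_text` and `summarize_ntriples` are hypothetical helper names. It accepts either a plain .nt file or a gzip-compressed one, deciding by the file's first two bytes rather than its extension.

```python
import gzip
import re
from collections import Counter

# Matches a language tag at the end of a literal, e.g. "Health"@en .
LANG_TAG = re.compile(r'"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)\s*\.\s*$')


def open_text(path):
    """Open a plain or gzip-compressed file as UTF-8 text, based on its magic bytes."""
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    return opener(path, "rt", encoding="utf-8")


def summarize_ntriples(path):
    """Count predicates and literal language tags in an N-Triples file."""
    predicates, languages = Counter(), Counter()
    with open_text(path) as lines:
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 2)  # subject, predicate, rest of line
            if len(parts) < 3:
                continue  # incomplete triple; ignored in this sketch
            predicates[parts[1]] += 1
            if parts[2].startswith('"'):  # only literals can carry a language tag
                match = LANG_TAG.search(parts[2])
                languages[match.group(1) if match else "(no language tag)"] += 1
    return predicates, languages
```

Calling `summarize_ntriples` on the vocabulary file and printing the two counters shows at a glance which SKOS properties are present and whether the labels carry language tags, which may help when comparing the file against the fixes discussed in the earlier FAST threads.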

Thank you for your help & patience with us on this!, & apologies for some beginner's questions. I am hoping to try a fresh install in Linux soon & will keep the thread updated if we make any progress. All the best from the UK,

Mark 

Jim Hahn

Jan 23, 2023, 9:16:18 AM
to Annif Users
Hi Mark,

For your first question: the FAST ontology needs a couple of tweaks in order to be used in Annif. If you are using the RDF version of FAST, there was a prior thread on that topic in this group.

However, you could also train on a TSV file of FAST. Here is a public link to curated TSV FAST data: https://upenn.box.com/s/wmu8cdzltmr148wet56jx8qe81f3kb6y. Of course there are limitations: there are no language tags here, and any algorithms that look for SKOS elements would not be able to make use of features that traverse them. Maybe there are other limits, too...

If you want to inspect a working FAST Annif, there is one here: https://hub.docker.com/r/jimfhahn/annif059 (it is the jimfhahn/annif059:fast image)...
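
Before loading a TSV vocabulary, a quick structural check can catch malformed lines. The sketch below is only an illustration and makes an assumption that should be confirmed against the Annif wiki page on subject vocabulary formats for the version in use: that each line holds a URI in angle brackets, a tab, and then the label. `check_vocab_tsv` is a hypothetical helper, not an Annif function.

```python
import re

# A URI wrapped in angle brackets with no whitespace inside.
URI_IN_BRACKETS = re.compile(r"^<[^<>\s]+>$")


def check_vocab_tsv(lines):
    """Return (line_number, problem) pairs for lines that are not '<URI> TAB label'."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # blank lines are ignored in this sketch
        fields = line.split("\t")
        if len(fields) < 2:
            problems.append((number, "no tab separator"))
        elif not URI_IN_BRACKETS.match(fields[0]):
            problems.append((number, "first column is not a <URI>"))
        elif not fields[1].strip():
            problems.append((number, "empty label"))
    return problems
```

With a real file this could be called as `check_vocab_tsv(open(path, encoding="utf-8"))`, where `path` is whatever the vocabulary file is named locally; an empty result means every line matched the assumed layout.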

For question number 2: I've found Docker Desktop the most reliable prototyping environment, though if you are new to Docker, a locally run Linux environment might make it a little easier to get at files directly on the system rather than in a container. It depends on what you are most comfortable using...

Though once you have some experience with Docker containers they really are a good option.

Best wishes,

Jim


MSH

Feb 3, 2023, 1:52:10 PM
to Annif Users
Thank you for this Jim!, it has been fantastic looking at your work, especially experimenting with the annif.info demo; we have been trying to get ANNIF working with FAST for a few months now & it's great to skip to the end & see how it handles different blocks of text. We are working on a project to assign some of FAST's health & medicine subject headings to either website text or uncontrolled descriptive metadata tags: I feel like I'm slowly getting the hang of what produces the best results, great to have the opportunity to try it out using the demo, & I would at some point love to progress to using some more community-generated vocabs to see how they parse the material too. 

Thank you, also, for sending over the links to the FAST files; we have a similar file we've been working with, lightly modifying the OCLC NT FAST download, but will try this too. We are still struggling through our Docker setup but I have set aside some time to try with Linux next week: neither is in my comfort zone so lots of slow trial & error!, very helpful having examples to follow from the group. We will inspect the working ANNIF build you linked to - I had hoped it might be possible to check we weren't doing anything wrong with the projects config file we're using, it would be great to rule this out as one of the causes of the problem we have loading files.

A couple of general updates for anyone following along - we are getting variations of the following error, invalid start byte or invalid continuation byte, while trying to load our subject vocab file, which maybe just means we need to reformat our TSV file (if there are any obvious formatting standards not covered on the help page, https://github.com/NatLibFi/Annif/wiki/Document-corpus-formats, we are open to advice):
  File "/Annif/annif/corpus/document.py", line 68, in documents
    for line in tsvfile:
  File "/usr/local/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.8/encodings/utf_8_sig.py", line 69, in _buffer_decode
    return codecs.utf_8_decode(input, errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 23: invalid continuation byte
root@68f7cd394e87:/Annif-tutorial# annif list-projects
Still a few things to try (VirtualBox rather than Docker), will post here with any updates. TY again for support so far & great to get a look at interesting projects elsewhere.

Mark

Osma Suominen

Feb 6, 2023, 10:47:15 AM
to annif...@googlegroups.com
Hi Mark,

the error you're getting seems to indicate a problem with UTF-8 encoding:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 23:
invalid continuation byte

Make sure that the file you're trying to load is valid UTF-8. If it's
in some other encoding (e.g. ISO 8859-1 or Windows-1252), convert it to
UTF-8 first. Annif expects that all .txt and .tsv files are UTF-8
encoded, as explained on the wiki pages on corpus formats.
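
A minimal sketch of that check and conversion in Python, independent of Annif. The function names are hypothetical, and `cp1252` (Windows-1252) is only an assumed source encoding; the actual encoding of the file has to be known or worked out first, since decoding with the wrong one silently produces wrong characters.

```python
from pathlib import Path


def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "cp1252") -> None:
    """Re-encode a file from an assumed source encoding to UTF-8."""
    # Working on bytes avoids any newline translation by the platform.
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))
```

For example, `first_invalid_utf8(Path("corpus.tsv").read_bytes())` (with `corpus.tsv` standing in for the real file name) returns `None` for a valid file and otherwise points at the offending byte, after which `convert_to_utf8` can write a UTF-8 copy under a new name so the original stays untouched.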

-Osma


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi