Hi Mark,
the error you're getting seems to indicate a problem with UTF-8 encoding:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 23:
invalid continuation byte
Make sure that the file you're trying to load is valid UTF-8. If it's
some other encoding (e.g. ISO 8859-1 or Windows-1252), convert it to
UTF-8 first. Annif expects that all .txt and .tsv files are UTF-8
encoded, as explained on the wiki pages on corpus formats.
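In case it helps, here is a minimal Python sketch for checking a file and
re-encoding it. It assumes the file is really Windows-1252 (cp1252), which
is only a guess on my part; the function names and the file names in the
usage note are placeholders, not part of Annif:

```python
from pathlib import Path


def find_invalid_utf8(path):
    """Return (offset, byte value) of the first byte that breaks UTF-8
    decoding, or None if the whole file is valid UTF-8."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset where decoding failed
        return err.start, data[err.start]
    return None


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Read src as source_encoding (an assumption, adjust as needed)
    and write the same text to dst encoded as UTF-8."""
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

For example, find_invalid_utf8("my-corpus.tsv") returns None for a clean
file, or the offset and value of the first bad byte. If it reports a
problem, convert_to_utf8("my-corpus.tsv", "my-corpus-utf8.tsv") writes a
converted copy. It is worth checking a few non-ASCII characters in the
result, since decoding with the wrong source encoding can succeed silently
and produce wrong characters.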
-Osma
MSH wrote on 3.2.2023 at 20.52:
> Thank you for this Jim!, it has been fantastic looking at your work,
> especially experimenting with the annif.info demo; we have been trying
> to get ANNIF working with FAST for a few months now & it's great to skip
> to the end & see how it handles different blocks of text. We are working
> on a project to assign some of FAST's health & medicine subject headings
> to either website text or uncontrolled descriptive metadata tags: I feel
> like I'm slowly getting the hang of what produces the best results,
> great to have the opportunity to try it out using the demo, & I would at
> some point love to progress to using some more community-generated
> vocabs to see how they parse the material too.
>
> Thank you, also, for sending over the links to the FAST files; we have a
> similar file we've been working with, lightly modifying the OCLC NT FAST
> download, but will try this too. We are still struggling through our
> Docker setup but I have set aside some time to try with Linux next week:
> neither is in my comfort zone so lots of slow trial & error!, very
> helpful having examples to follow from the group. We will inspect the
> working ANNIF build you linked to - I had hoped it might be possible to
> check we weren't doing anything wrong with the projects config file
> we're using, it would be great to rule this out as one of the causes of
> the problem we have loading files.
>
> A couple of general updates for anyone following along - we are getting
> variations of the following error, /invalid start byte/ or /invalid
> continuation byte/, while trying to load our subject vocab file, which
> maybe just means we need to reformat our TSV file (if there are any
> obvious formatting standards not covered on the help page
> <https://github.com/NatLibFi/Annif/wiki/Document-corpus-formats> we are
> open to advice):
>
> * File "/Annif/annif/corpus/document.py", line 68, in documents
> * for line in tsvfile:
> * File "/usr/local/lib/python3.8/codecs.py", line 322, in decode
> * (result, consumed) = self._buffer_decode(data, self.errors, final)
> * File "/usr/local/lib/python3.8/encodings/utf_8_sig.py", line 69,
> in _buffer_decode
> * return codecs.utf_8_decode(input, errors, final)
> * UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position
> 23: invalid continuation byte
> * root@68f7cd394e87:/Annif-tutorial# annif list-projects
> same /invalid value for [PATHS]/ error shown above; the config
> file doesn't seem to go into great detail about what filetypes
> ANNIF can work with, and I wanted to check that our file /should/
> be acceptable: it's an N-Triples file that has been lightly
> edited (but could probably be further Skosified if it might
> help), & we've tried loading it as an .nt file or as a
> compressed .gz file.
>
> 2. The other question was very general, about the best setup for
> us to use as Librarians using ANNIF on managed Windows 10 PCs,
> having initially landed on running ANNIF via Docker as the
> easiest solution. We have Docker & ANNIF running OK but I wonder
> if setting up a Virtual Machine for a Linux instance would be
> more productive? Apologies if this is not an easy question to
> answer, I'm just struggling with the directory architecture
> problems we're seeing in Docker & am not 100% sure whether it's the
> file/our setup/the commands we're using/etc.
>
> Thank you for your help & patience with us on this!, & apologies
> for some beginner's questions. I am hoping to try a fresh
> install in Linux soon & will keep the thread updated if we make
> any progress. All the best from the UK,
>
> Mark
> On Wednesday, December 14, 2022 at 11:19:27 AM UTC MSH wrote:
>
> Hi Mona !, thank you for getting back to us. Just dropping a
> quick placeholder message here while we try out using named
> volumes instead - we /are/ using Windows so I think it's
> possible we will hit a wall when we try to make the user IDs
> the same in the container & host. It makes sense that it's a
> permissions error stopping ANNIF from finding our vocab files.
>
> Following the Docker instructions now & will let you know if
> we make any progress!, thank you for thinking this through
> for us.
>
> Mark
>
> On Monday, December 12, 2022 at 1:04:43 PM UTC mona.l...@helsinki.fi wrote:
>
> Hi Mark!
>
> Some first questions: so did you manage to do the
> tutorial exercises with Docker? Are you on Windows and
> using the named volumes approach?
> <https://github.com/NatLibFi/Annif/wiki/Usage-with-Docker>
> (note esp. the part about file permission errors, as these
> might prevent the file from being read).
>
> Best (& hope this helps),
>
> Mona L.
>
> *From:* annif...@googlegroups.com
> <annif...@googlegroups.com> *On behalf of* MSH
> *Sent:* Monday, 12 December 2022 13.12
> *To:* Annif Users <annif...@googlegroups.com>
> *Subject:* Issues using annif-tutorial steps to load OCLC
> FAST vocabulary
>
> Hi :^)
>
> I am writing from the National Library of Scotland,
> where we have been doing some initial experimentation
> with ANNIF. We have been trying to load the OCLC FAST
> vocabulary to generate headings for some sets of
> descriptive text, and have been following the
> annif-tutorial exercises
> <https://github.com/NatLibFi/Annif-tutorial> within the
> Docker-based install of ANNIF. As well as this, we have
> had great help from some
> <https://groups.google.com/g/annif-users/c/lofDo9_byIg/m/5B8D6iZ8AwAJ>
> of
> <https://groups.google.com/g/annif-users/c/anlZYDUZXJQ/m/RE7ktCU8AwAJ>
> the
> <https://groups.google.com/g/annif-users/c/vGVUP90GUyQ/m/5XIseddbBwAJ>
> previous
> <https://groups.google.com/g/annif-users/c/8XAT03f4j_o/m/gsYX1pr8CAAJ>
> FAST (& LCSH) discussions on this forum, & some users at
> other institutions who've already ironed out some of the
> formatting issues of the un-skosified .nt FAST files on
> the OCLC site.
>
> So far, my colleague Agnieszka Kurzeja & I have followed
> the path suggested in the tutorial, based on loading
> sample economic data (stw-econbiz-small.tsv.gz,
> stw-skos.ttl), with a brief specification in the
> projects.cfg file. I was hoping to switch out the paths
> and load our edited FAST vocabulary and text corpus
> using the same method, but we are both receiving an
> error message suggesting the files can't be found:
>
> * Error: Invalid value for 'SUBJECTFILE': File
> 'FASTAll_fixed.nt.gz' does not exist.
>
> I have tried loading the vocab file as a zipped or
> unzipped file, & have tried loading our corpus as a
> compressed or uncompressed .tsv file:
>
> I've made edits to our configuration file to describe
> the data but wondered whether I'm forgetting about a
> specification in the tutorial setup, & whether ANNIF is
> only looking for tutorial data (otherwise I am maybe
> making a clumsy syntax error!, apologies but I am not
> very familiar with Docker/Bash).
>
> My apologies for a very general beginner's question - it
> has been great to read more about Parthasarathi's recent
> experiments & we are really excited to play with ANNIF
> and see how it copes with our text corpus. I think we're
> almost there but any pointers about getting set up would
> be much appreciated. No worries if no obvious answers
> and I will bump this thread as we make progress :^)
>
> All the best,
> Mark Simon Haydn
> National Library of Scotland
>
> --
> You received this message because you are subscribed to
> the Google Groups "Annif Users" group.
> To unsubscribe from this group and stop receiving emails
> from it, send an email to annif-users...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/annif-users/23c4b696-93b1-43d0-a2d0-62abfc0baa66n%40googlegroups.com.
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi