Large vocabularies in SKOSMOS


opo...@gmail.com

May 24, 2022, 10:41:25 AM
to Skosmos Users
I have just tried to load a large dataset of geographical places into Skosmos: more than 1 million SKOS concepts, most of them fairly lean. The Turtle (.ttl) vocabulary file is about 430 MB in total.
Apparently, this vocabulary is far too big for Skosmos to handle. The alphabetical index never shows up (it loads forever). I tried setting showAlphabeticalIndex to false, but then the hierarchical pane also disappeared, so I had to set it back.
The vocabulary has just a few top concepts, which are listed in the hierarchical pane. However, when I start expanding the hierarchy, the whole left pane disappears.

I assume the reason for this is a lack of resources in our Skosmos infrastructure, and I am interested to hear about your experience with big datasets, i.e. how large a vocabulary Skosmos can realistically handle, and whether any smart tricks can be applied, e.g. on the Fuseki server, to make it scale.

Best regards, 
Oddrun Pauline Ohren

Mika Juhani

May 30, 2022, 5:26:02 AM
to Skosmos Users
Hi Oddrun!

It is quite possible that Skosmos becomes rather slow with bigger vocabularies, but that should not make the panels disappear entirely.

When you set showAlphabeticalIndex to false, did you configure some other tab as the default?

https://github.com/NatLibFi/Skosmos/wiki/Configuration
-> skosmos:defaultSidebarView

With kind regards,
Mika Vaara

joeli....@gmail.com

Jun 6, 2022, 6:52:34 AM
to Skosmos Users
Hi Oddrun,

Your dataset is definitely on the bigger side. Our larger vocabularies are all to the tune of 40k SKOS concepts with 100 MB of vocabulary data when serialized in Turtle, and that's enough to cause problems with the vocabulary statistics https://api.dev.finto.fi/doc/#!/Vocabulary-specific_methods/get_vocid_vocabularyStatistics and label statistics https://api.dev.finto.fi/doc/#!/Vocabulary-specific_methods/get_vocid_labelStatistics . Those are set to load dynamically and can be disabled in the configuration, which might be a good idea in your case. All in all, we have close to 2M concepts in the same Fuseki dataset for Skosmos, which shows that querying for concepts should work even with a larger set of data.

As for tips on tuning Fuseki: you should use some kind of HTTP cache in front, such as Varnish or Nginx. Here are some links to the relevant Skosmos documentation on using Varnish:  https://github.com/NatLibFi/Skosmos/wiki/InstallTutorial#optimizing-performance
https://github.com/NatLibFi/Skosmos/wiki/ReverseProxy

You also pretty much need to set up your Fuseki with a text index like Lucene or Elasticsearch. We use Lucene:
https://github.com/NatLibFi/Skosmos/wiki/TextAnalysisConfiguration
There's also in-depth documentation on setting up the text index in Fuseki:
https://jena.apache.org/documentation/query/text-query.html#configuring-an-analyzer

I hope this gives you an idea of how to get started.

______________
Joeli Takala

opo...@gmail.com

Jun 16, 2022, 2:38:51 AM
to Skosmos Users
Hi Mika, 
Thanks for your answer!  
In our Skosmos instance, setting showAlphabeticalIndex to false has the effect that no sidebar at all comes up when the vocabulary in question is first opened, whether or not defaultSidebarView is set. However, as soon as a search is performed, the sidebar shows up with the defaultSidebarView tab on top.
I assume this is a bug?
Oddrun

opo...@gmail.com

Jun 16, 2022, 3:00:43 AM
to Skosmos Users
Hi Joeli,
Thanks for the tips about the Fuseki configuration, all of which I've forwarded to our technicians.
However, I think 1 million entities (in this case places/place names registered by the Norwegian Mapping Authority) is too large for Skosmos whatever we do, and we may not need all of them anyway. As a test I've pruned the vocabulary down to 25,000 entities, which works fine, except that it's still necessary to turn off the statistics, which is a pity: most users find statistical information about a vocabulary very useful. Could there be a more persistent way of handling the results of the statistical calculations? After all, it's not as if Skosmos itself changes the number of concepts in a vocabulary.
Oddrun

joeli....@gmail.com

Jun 23, 2022, 4:49:47 AM
to Skosmos Users
Hi Oddrun,

The vocabulary statistics and label statistics are on the heavier side. One way to display such data statically would be to turn the components off and add the information to the front page separately. Nothing stops you from doing this yourself, for example by including the information in the vocabulary description through a daily update query run on the triple store. But I think there's a better way:

The 25,000 entities in your example sound like they are on the smaller side, and Skosmos can be set up to serve such queries via a reverse proxy server such as Varnish (see the links to our documentation in my previous message). That way the results are essentially served as a static resource would be. For example, the vocabulary YSO has 650,000 triples for 32,000 concepts. As the query response is fairly static (it changes a few times per week), most of the time it is served by Varnish in a fraction of a second. But the first person to query newly updated data would wait 17 seconds or so.

Try it out against the live service (I get results in 0.077 s and 0.072 s):
https://api.finto.fi/rest/v1/yso/vocabularyStatistics?lang=en
https://api.finto.fi/rest/v1/yso/labelStatistics?lang=en
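
If you want to compare cached and uncached response times on your own instance, a minimal Python sketch along these lines could be used. The two URLs are the ones above; the helper names (`http_get`, `time_request`, `report`) are made up for illustration and are not part of Skosmos.

```python
import time
import urllib.request

# The two statistics URLs quoted in this message.
STATS_URLS = [
    "https://api.finto.fi/rest/v1/yso/vocabularyStatistics?lang=en",
    "https://api.finto.fi/rest/v1/yso/labelStatistics?lang=en",
]


def http_get(url, timeout=60):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def time_request(url, fetch=http_get):
    """Return (elapsed_seconds, body_size_in_bytes) for one request."""
    start = time.perf_counter()
    body = fetch(url)
    elapsed = time.perf_counter() - start
    return elapsed, len(body)


def report(urls=STATS_URLS, fetch=http_get):
    """Time each URL once; return a list of (url, seconds, size) tuples."""
    return [(url, *time_request(url, fetch)) for url in urls]
```

Calling `report()` twice in a row should show whether the second round is answered from a cache. The `fetch` parameter is only there so the timing logic can be exercised without network access.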

To avoid the situation where a human user is the first to request the result and gets it fresh rather than from the cache, we run a script that warms up the Varnish cache by requesting web pages and API methods just after a vocabulary update, usually in the middle of the night.
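
A cache warm-up of this kind could look roughly like the Python sketch below. This is not our actual script: the function names and the `api_base` parameter are assumptions for illustration, and the URL pattern is simply taken from the two statistics URLs above.

```python
import urllib.error
import urllib.request

# REST methods whose responses are expensive to compute (from this thread).
STAT_METHODS = ("vocabularyStatistics", "labelStatistics")


def http_status(url, timeout=120):
    """Request a URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def build_urls(api_base, vocab_ids, languages):
    """Build statistics URLs following the pattern <base>/<vocid>/<method>?lang=<lang>."""
    urls = []
    for vocid in vocab_ids:
        for method in STAT_METHODS:
            for lang in languages:
                urls.append(f"{api_base}/{vocid}/{method}?lang={lang}")
    return urls


def warm_cache(urls, fetch=http_status):
    """Request every URL once so the proxy stores the response.

    Returns a dict mapping each URL to its status code, or to a
    'failed: ...' string if the request could not be completed.
    """
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except OSError as err:  # covers URLError and socket timeouts
            results[url] = f"failed: {err}"
    return results
```

For example, `warm_cache(build_urls("https://api.finto.fi/rest/v1", ["yso"], ["en"]))` would request the two YSO statistics URLs once each. Such a script would typically be run from cron or a similar scheduler right after the nightly vocabulary update.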

Hope this was of use!

______________
Joeli Takala