Dear all,
The background to these questions is as follows. After going through an array of steps, we have curated a set of 3 lakhs (0.3 million) bibliographic records from the Medline database (available under ODbL). It is a short-text corpus with title, abstract, and MeSH term URIs in the Annif-compatible format. Medline records are indexed with the MeSH vocabulary by trained LIS professionals. Another 0.1 million records will be ready by this weekend, making a total of 0.4 million records for training.
We want to create four projects (all using the same MeSH vocabulary) with four regular backends, namely TF-IDF, fastText, Omikuji-Bonsai, and Omikuji-Parabel, and then a final project with the NN-Ensemble fusion backend that combines the regular backends with weights like TF-IDF:1, fastText:2, Omikuji-Bonsai:3, and Omikuji-Parabel:3. We have already tested this process with a small set of 10,000 records, and it works fine. But before we begin training with all 0.3 million records, we would like to know the answers to the following questions:
1. Which approach is better: 1.1) training each regular backend on all 0.3 million records, or 1.2) dividing the dataset into four equal sets of records and using one set per regular backend (set 1 for TF-IDF, set 2 for fastText, and so on)?
2. Assuming we follow option 1.1, which approach is better for training the NN-Ensemble: 2.1) using a subset of the 0.3 million record set (say 0.1 million records selected at random from it), or 2.2) using a completely new training set (the 0.1 million records currently in the pipeline)?
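To make the options concrete, here is a minimal Python sketch of the three ways of selecting records described above, using toy record IDs in place of the real corpus. The function names (`partition_equal`, `sample_overlapping`, `take_disjoint`) are hypothetical helpers written for this illustration and are not part of Annif; the sketch only shows how the sets relate to each other, not which option performs better.

```python
import random


def partition_equal(records, k, seed=0):
    """Option 1.2: shuffle and split the records into k disjoint, near-equal sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Slice with a stride so that every record lands in exactly one set.
    return [shuffled[i::k] for i in range(k)]


def sample_overlapping(base_records, n, seed=0):
    """Option 2.1: the ensemble set is a random subset of the records
    that the regular backends were already trained on."""
    rng = random.Random(seed)
    return rng.sample(list(base_records), n)


def take_disjoint(base_records, new_records, n, seed=0):
    """Option 2.2: the ensemble set is drawn only from new records
    that the regular backends have never seen."""
    seen = set(base_records)
    unseen = [r for r in new_records if r not in seen]
    rng = random.Random(seed)
    return rng.sample(unseen, n)


# Toy stand-ins: 300 "existing" records and 100 "pipeline" records.
base = [f"rec{i}" for i in range(300)]
new = [f"rec{i}" for i in range(300, 400)]

four_sets = partition_equal(base, 4)
overlap_set = sample_overlapping(base, 100)
disjoint_set = take_disjoint(base, new, 100)

print([len(s) for s in four_sets])          # [75, 75, 75, 75]
print(len(set(overlap_set) & set(base)))    # 100
print(len(set(disjoint_set) & set(base)))   # 0
```

In the toy run, the option 2.1 set lies entirely inside the records used for the regular backends, while the option 2.2 set shares no records with them.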
Best regards
To unsubscribe from this group and stop receiving emails from it, send an email to annif-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/annif-users/2341ee23-3943-4049-ad87-72d61a82f11b%40helsinki.fi.