Hello everyone,
Thanks, Jakub, for the suggestion; I have sent the folks at Cantus DB a message.
Thanks, Fr. Anthony, for the attempts; I have reproduced what you did
and could not get any further. Good thinking on using the json option.
In the meantime, I have done a first test run by fetching the web URLs
and scraping the HTML. Here is what I did, along with some
results.
I have selected the gregobase entries of the following office-parts:
an, hy, in, gr, tr, of, co.
I have not included alleluias because they are hard to search, and the
other ones because the correspondence with Cantus genres is not
obvious (remember this is a test run).
I then discarded entries that have no incipit, no mode, or no gabc,
or that are otherwise malformed.
I was left with 9768 Gregobase entries (out of ~14000), which is a good number.
I then merged the entries that have the same incipit and office-part
in order to avoid unnecessary requests to Cantus.
I was left with 5350 (incipit, office-part) pairs, pointing back to
the 9768 Gregobase entries.
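
As an illustration, the filtering and merging steps could be sketched as
below. This is only a sketch, not the script I actually ran: the
dictionary keys ("id", "incipit", "mode", "gabc", "office_part") are
hypothetical names, and the "otherwise malformed" check is left out.

```python
from collections import defaultdict

# Office-parts selected for the test run.
OFFICE_PARTS = {"an", "hy", "in", "gr", "tr", "of", "co"}


def is_usable(entry):
    """True if the entry is in a selected office-part and has an incipit, a mode and gabc.

    The field names are assumptions made for this sketch.
    """
    return (
        entry.get("office_part") in OFFICE_PARTS
        and all(entry.get(field) for field in ("incipit", "mode", "gabc"))
    )


def group_by_search_key(entries):
    """Map each (incipit, office_part) key to the ids of the usable entries sharing it."""
    groups = defaultdict(list)
    for entry in entries:
        if is_usable(entry):
            groups[(entry["incipit"], entry["office_part"])].append(entry["id"])
    return dict(groups)
```

Each key then needs only one request to Cantus, and the list of ids
attached to it is what lets a result be mapped back to every Gregobase
entry that shares the key.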
I then fetched the Cantus search URL, passing only the incipit (as
"starts with" text search) and genre, and grabbed the Cantus ID(s) of
the results.
Out of 5350 searches,
- 2096 returned no results, corresponding back to 2961 Gregobase entries.
- 2284 returned results sharing the same single Cantus ID,
corresponding back to 4917 Gregobase entries. Those are our successes.
- 970 returned results having several Cantus IDs between them,
corresponding back to 1890 Gregobase entries. Most of these only have
2 or 3 potential Cantus IDs and should be manually checked or
potentially disambiguated by mode?
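
The three-way split above can be sketched as follows. Again this is
only an illustration under assumptions: `groups` is a hypothetical
mapping from each search key to the Gregobase entry ids behind it, and
`results` a hypothetical mapping from each search key to the list of
Cantus IDs scraped for it.

```python
from collections import Counter


def classify(cantus_ids):
    """Label one search by the number of distinct Cantus IDs it returned."""
    distinct = set(cantus_ids)
    if not distinct:
        return "no_result"
    return "unique" if len(distinct) == 1 else "ambiguous"


def tally(groups, results):
    """Count searches and the Gregobase entries behind them, per outcome."""
    searches = Counter()
    entries = Counter()
    for key, entry_ids in groups.items():
        outcome = classify(results.get(key, []))
        searches[outcome] += 1
        entries[outcome] += len(entry_ids)
    return searches, entries
```

As a sanity check on the figures above: 2096 + 2284 + 970 = 5350
searches, and 2961 + 4917 + 1890 = 9768 entries, so every search and
every entry falls into exactly one of the three cases.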
A significant weakness of this search, besides the limited scope, is
that I only fetch the first page of results. There might be a few
cases where there are more than 100 results and the first 100 share
the same Cantus ID, so the search seems to be a success, but a
different Cantus ID appears in the results beyond the first 100, i.e.
on the second page.
However, with relatively little work we appear to be able to attribute a
Cantus ID to 4917 Gregobase entries, which is not too bad!
I will add more checks, fix the aforementioned problem with results
beyond 100, etc. before importing those into Gregobase. But I think we
are on the right track.
Matthias