Cleaning Gregobase: cantus IDs


Rob Leduc

Feb 12, 2021, 10:32:32 AM
to Gregorio Users

Hi all,

I was originally very pessimistic about assigning cantus IDs, but I am possibly changing my mind.

Originally, my only experience was with antiphons from a psalter, and I think properly identifying antiphons will still be a very hit-or-miss task, so my pessimism there is largely unchanged.

After moving on to more Responsories, however, I am finding it much easier to find exact matches.  I expect this may also be true for mass chants, but haven't tried many examples.

My method is to:
1.  Go to the Cantus database at https://cantus.uwaterloo.ca/ and enter the first two or three words of the text in the search box.  I don't enter too many so I can cast a wide net. 

2. Ignore all the suggestions and hit return, or click at the bottom of the search pop-up window on the "view all results for ... " entry.

3.  This usually brings up an unhelpfully long list.  At the top of the list, enter the genre and mode for the score in question and click Apply. 

So far, for Responsories at least, this provides a shorter list of particular examples all with the same Cantus ID.  I check one of them to see that the melody is pretty much the same as the score I'm looking at, just to be sure.

I'm pretty confident using that process if I get it down to 1 or 2 IDs.  So far, it's been a unique ID.

Rob


Matthias Bry

Feb 12, 2021, 3:17:44 PM
to gregori...@googlegroups.com
Hello Rob,

What you describe sounds very much scriptable and we could imagine
generating a large number of Cantus IDs automatically:

fetch https://cantus.uwaterloo.ca/search?op=starts&t=Here+The+Incipit+Words&genre=143&cid=&mode=1&feast=&volpiano=All
(for instance, genre=143 is the code for introit; mode=1 can be replaced
by any other mode)
if the list has the same Cantus ID for all results, this is the Cantus
ID we are looking for
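
The fetch step above could be sketched in Python roughly as follows. This is only an illustration: the parameter names are copied from the URL in this message, the function names build_search_url and single_cantus_id are made up here, and extracting the IDs from the returned page is left out because the page layout is not described in this thread.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://cantus.uwaterloo.ca/search"


def build_search_url(incipit, genre, mode=""):
    """Build a "starts with" text search URL like the one quoted above."""
    params = {
        "op": "starts",
        "t": incipit,       # first words of the chant text
        "genre": genre,     # e.g. 143 for introit, per the message above
        "cid": "",
        "mode": mode,
        "feast": "",
        "volpiano": "All",
    }
    return SEARCH_URL + "?" + urlencode(params)


def single_cantus_id(cantus_ids):
    """Return the Cantus ID if every result shares it, otherwise None."""
    distinct = set(cantus_ids)
    return distinct.pop() if len(distinct) == 1 else None


url = build_search_url("Here The Incipit Words", genre=143, mode=1)
```

Keeping the URL construction separate from the "all results agree" check means the second function can be tried on hand-collected IDs before any page fetching is written.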

Does anyone know if the Cantus DB has an API or if I need to extract
the list of results from the web pages?

Matthias
> --
> Gregorio homepage: http://gregorio-project.github.io
> Archives for the old mailing list: http://www.mail-archive.com/gregori...@gna.org/
> To report a bug, please post to: https://github.com/gregorio-project/gregorio/issues
> ---
> You received this message because you are subscribed to the Google Groups "Gregorio Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to gregorio-user...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/gregorio-users/e693ff74-1723-410f-ab13-57c0385ce81cn%40googlegroups.com.

Jakub Pavlík

Feb 12, 2021, 6:32:07 PM
to gregori...@googlegroups.com
Yes, there is some sort of an API


Regards,
Jakub

On Fri, Feb 12, 2021 at 21:18, Matthias Bry <matthi...@gmail.com> wrote:

rled...@gmail.com

Feb 13, 2021, 11:34:46 PM
to gregori...@googlegroups.com

I think it would be great if someone could write code to generate a list of unique IDs for a large subset of chants based on some straightforward rules. My experience with electronic data capture from e-medical files suggests we should verify them by eye rather than just directly import the results wholesale.

That said, a clever script could make this easy by actually grabbing the melody/melodies attached to the ID and presenting them side by side with the Gregobase image, or a link to the page at least. Such a tool would make it a lot easier to verify the results.

Anyway, I think careful thinking about this idea could be time well spent.

R
You received this message because you are subscribed to a topic in the Google Groups "Gregorio Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gregorio-users/KMWfeKGrBwQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gregorio-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gregorio-users/CAOYNFtBNqwys7boz4GudtDznLxM45gCPAGfgDPnJ6Hbw0iDY6A%40mail.gmail.com.

Jakub Pavlík

Feb 14, 2021, 5:20:04 AM
to gregori...@googlegroups.com
Please note that a CANTUS ID is an identifier of a unique liturgical text in a given genre of liturgical plainchant ("antiphon with text T" or "introit with text T"). If a given text has multiple independent melodies, all of them still share the same CANTUS ID. There is therefore no need to check melodies when assigning CANTUS IDs to GregoBase pieces.

Regards,
Jakub


On Sun, Feb 14, 2021 at 5:34, rled...@gmail.com <rled...@gmail.com> wrote:

Rob Leduc

Feb 14, 2021, 12:26:14 PM
to Gregorio Users
Excellent!  I was not aware, obviously not having looked too closely.

This would explain why a method like the above is more successful with a longer piece of text such as a Responsory instead of an Antiphon as well.  A short search prompt to cast a wide net would catch too many things in the world of Antiphons.  Still, giving the full text of a Responsory to search for would not be a good idea because sometimes the verse is different between sources.  So it may take some fussing to get a program optimized.

Rob

Matthias Bry

Feb 14, 2021, 4:49:00 PM
to gregori...@googlegroups.com
Hello everyone,

I am having a miserable time using the API, I might fall back to
parsing HTML pages...
In https://cantus-api.readthedocs.io/en/latest/searching.html the
examples provided seem to use a non-standard HTTP verb, SEARCH, do I
read this correctly? Weird that it is not POST as REST APIs usually do
but fair enough;

I am using the python requests library, trying to mimic the first
example given on the aforementioned page.

>>> import requests
>>> data = {"query": "+source_id:123614 +folio:042r", "sort": "sequence,asc"}
>>> URL = 'https://abbot.uwaterloo.ca:8888/chants/'
>>> r = requests.request('SEARCH', url=URL, headers={"X-Cantus-Per-Page": "50"}, data=data, verify=False)
[[[ here a warning that I should verify HTTPS, but the endpoint
apparently has a bad certificate ]]]
>>> r
<Response [400]>  # bad request

I have tried using the standard verbs instead of SEARCH: GET just
ignores the search query, and POST times out.
Any ideas?

Matthias

Rob Leduc

Feb 15, 2021, 8:54:02 AM
to Gregorio Users
Alas, you are a far better man than I.  This is well out of my wheelhouse. 

R

Fr. Anthony VanBerkum, O.P.

Feb 15, 2021, 11:41:06 AM
to gregori...@googlegroups.com
Hello Matthias,

I spent some time playing with the API, and I haven't been able to get any results out of it, but I did figure out a way where it at least looks like it's trying to run the search query. I found enabling logging as given at [1] to be very helpful. According to the logs, the body somehow got dropped when following the 301 redirect, so I changed the url to include the trailing slash. I also changed from data=data to json=data to get the requests library to send the data dict as json rather than as a query string. I also dropped 'sort' for now since it was complaining. This gets past the 'empty body' error message and instead gets a '404 search found nothing', which is maybe a helpful improvement? I also tried a bunch of intuitive variations, but got nowhere. See attached file for details.
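
For anyone wanting to reproduce the adjusted request, here is a minimal sketch of it. It uses only the Python standard library (urllib) rather than requests, so the JSON body is built by hand, which is what json=data does in requests. The URL, header name and query string are the ones quoted earlier in this thread; the function name build_search_request is made up for the example. Nothing is sent here, and whether the server answers usefully is a separate question, as described above.

```python
import json
import urllib.request

# Endpoint as quoted earlier in the thread, trailing slash included.
API_URL = "https://abbot.uwaterloo.ca:8888/chants/"


def build_search_request(query, per_page=50):
    """Prepare (but do not send) a SEARCH request with a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")  # 'sort' left out
    return urllib.request.Request(
        API_URL,
        data=body,
        method="SEARCH",  # non-standard verb used in the API examples
        headers={
            "Content-Type": "application/json",
            "X-Cantus-Per-Page": str(per_page),
        },
    )


req = build_search_request("+source_id:123614 +folio:042r")
```

The prepared request can be inspected (req.get_method(), req.data) before handing it to urllib.request.urlopen, which makes it easier to see exactly what body goes out.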

Fr Anthony


ipython_log.py

Jakub Pavlík

Feb 15, 2021, 12:01:59 PM
to gregori...@googlegroups.com
I'm not sure how in/official and un/maintained this particular API instance currently is. Maybe it's just broken. I would suggest getting in touch with the people behind the CANTUS project (e.g. using the contact form https://cantus.uwaterloo.ca/contact ) and asking them directly.

Regards,
Jakub


On Mon, Feb 15, 2021 at 17:41, Fr. Anthony VanBerkum, O.P. <antho...@gmail.com> wrote:

Matthias Bry

Feb 16, 2021, 1:53:29 AM
to gregori...@googlegroups.com
Hello everyone,

Thanks Jakub for the suggestion, I have sent the folks at Cantus DB a message.

Thanks Fr. Anthony for the attempts, I have reproduced what you did
and could not go any further. Good thinking on using the json option.

In the meantime I have done a first test run by fetching the web urls
and scraping the HTML code. Here is what I did along with some
results.

I have selected the gregobase entries of the following office-parts:
an, hy, in, gr, tr, of, co.
I have not included alleluias because they are hard to search, and the
other ones because the correspondence with Cantus genres is not
obvious (remember this is a test run).
I have then discarded entries that have no incipit, no mode, or no gabc,
or that are otherwise malformed.
I was left with 9768 Gregobase entries (out of ~14000) which is a good number.
I then merged those entries that have the same incipit and office-part
in order to avoid unnecessary requests to Cantus.
I was left with 5350 (incipit, office-part) pairs, pointing back to
the 9768 Gregobase entries.
I then fetched the Cantus search URL, passing only the incipit (as
"starts with" text search) and genre, and grabbed the Cantus ID(s) of
the results.
Out of 5350 searches,
- 2096 returned no results, corresponding back to 2961 Gregobase entries.
- 2284 returned results sharing the same single Cantus ID,
corresponding back to 4917 Gregobase entries. Those are our successes.
- 970 returned results having several Cantus IDs between them,
corresponding back to 1890 Gregobase entries. Most of these only have
2 or 3 potential Cantus IDs and should be reviewed manually or
potentially disambiguated by mode?
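
The merge-then-classify steps above can be sketched like this. It is a simplified illustration, not the script that produced the numbers: the field names (id, incipit, office_part), the function names and the sample data are all assumptions, and the search results are placeholders rather than real Cantus IDs.

```python
from collections import Counter, defaultdict


def group_by_search_key(entries):
    """Merge entries sharing (incipit, office-part) so each is searched once."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry["incipit"], entry["office_part"])].append(entry["id"])
    return dict(groups)


def classify(cantus_ids):
    """Sort one search result into 'none', 'single' or 'several'."""
    distinct = set(cantus_ids)
    if not distinct:
        return "none"
    return "single" if len(distinct) == 1 else "several"


# Made-up entries, only to show the shape of the data.
entries = [
    {"id": 1, "incipit": "Ad te levavi", "office_part": "in"},
    {"id": 2, "incipit": "Ad te levavi", "office_part": "in"},
    {"id": 3, "incipit": "Ad te levavi", "office_part": "of"},
]
groups = group_by_search_key(entries)

# Placeholder search results keyed the same way as the groups.
results = {
    ("Ad te levavi", "in"): ["A", "A"],
    ("Ad te levavi", "of"): [],
}
tally = Counter(classify(results[key]) for key in groups)
```

Because each group keeps the list of entry numbers it came from, a result for one search can be written back to every entry that shares the same incipit and office-part.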

A significant weakness of this search, besides the limited scope, is
that I only fetch the first page of results. There might be a few
cases where there are more than 100 results, and the first 100 share
the same Cantus ID, so the search seems to be a success, but a
different Cantus ID appears in results beyond 100 i.e. on the second
page.
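
One possible way to close that gap is to keep requesting pages until a short page comes back. The sketch below assumes a page size of 100, as mentioned above, and takes the page-fetching step as a caller-supplied function, because how the search pages are numbered in the URL is not stated in this thread. The name collect_all_ids and the fake pages are illustrative only.

```python
def collect_all_ids(fetch_page, page_size=100, max_pages=50):
    """Gather Cantus IDs from every result page, not just the first.

    fetch_page(n) must return the list of Cantus IDs found on page n
    (0-based) and an empty list once the results run out.
    """
    ids = []
    for page in range(max_pages):  # hard cap as a safety net
        page_ids = fetch_page(page)
        ids.extend(page_ids)
        if len(page_ids) < page_size:  # a short page is the last page
            break
    return ids


# Fake pages: the first 100 results agree, the second page does not.
fake_pages = [["X"] * 100, ["Y"] * 3]


def fake_fetch(n):
    return fake_pages[n] if n < len(fake_pages) else []


all_ids = collect_all_ids(fake_fetch)
distinct_ids = set(all_ids)
```

With the fake data, looking at the first page alone would suggest a single ID, while the full collection shows two.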

However, after not much work we appear to be able to attribute a
Cantus ID to 4917 Gregobase entries, which is not too bad!
I will add more checks, fix the aforementioned problem with results
beyond 100, etc. before importing those into Gregobase. But I think we
are on the right track.

Matthias

rled...@gmail.com

Feb 16, 2021, 12:55:03 PM
to gregori...@googlegroups.com

Matthias,

That’s really encouraging! I would suggest a manual check of a random subset of successes, maybe 50 or so, before merging into Gregobase. I’d be happy to do that.
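
Drawing such a subset reproducibly is a one-liner with a seeded random generator. A small sketch, where the entry numbers 1 to 4917 merely stand in for the list of successes and the function name and seed are arbitrary:

```python
import random


def sample_for_review(entry_ids, k=50, seed=2021):
    """Pick k entries at random; the fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    ids = list(entry_ids)
    return rng.sample(ids, min(k, len(ids)))


# Stand-in for the 4917 successfully matched entries.
picked = sample_for_review(range(1, 4918))
```

Sharing the seed along with the list lets two people regenerate the same sample and split the checking between them.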

Yes, I think the 900+ partial matches could be checked and resolved by hand. I could help with that.

Did you try Responsories? I’ve had a lot of luck with those by hand using some portion of the text prior to the verse.

I also found a typo this morning by checking for a Cantus ID. A word in my score had been split after its first syllable: “Noe de ambularet” instead of “Noe deambularet”. This caused my “hand” search to fail at Cantus. So a typo in Gregobase could also impact the results. That’s to be expected with keying complicated text and lack of proofreading in many cases. So even some of the misses may be recoverable with a sophisticated search for partial matches.
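
As a rough idea of what a more forgiving comparison could look like, the sketch below ignores case, spaces and punctuation before comparing, and falls back on a similarity score from Python's difflib for other kinds of typos. The function names are made up, and any threshold for "close enough" would have to be tuned on real data.

```python
import difflib


def squash(text):
    """Lowercase and keep letters only, so stray word breaks are ignored."""
    return "".join(ch for ch in text.lower() if ch.isalpha())


def similarity(a, b):
    """Score between 0.0 and 1.0 for how alike two texts are once squashed."""
    return difflib.SequenceMatcher(None, squash(a), squash(b)).ratio()


# The split word from the example above no longer matters.
same = squash("Noe de ambularet") == squash("Noe deambularet")
```

This only helps once candidate texts have been retrieved; a "starts with" search on the site itself would still miss the misspelled incipit, so a shorter search prefix may be needed first.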

Rob

Matthias Bry

Apr 2, 2021, 5:19:10 PM
to Gregorio Users
Hello Rob, hello everyone,

It has been seven weeks, in which I moved, had no internet for a month, dealt with Ms.' pregnancy issues (all is well), and various other delays, but I think I have a reasonable thing going on.
I have coded a slightly more robust algorithm for matching Gregobase entries with the Cantus DB.

Here are the results: there are currently in Gregobase:
14948 chants that are not duplicates
13371 chants that are not duplicates and with no cantus ID
10578 chants that are not duplicates, have no cantus ID, with genres an, in, of, gr, hy, co, tr
6316 different (incipit, genre) pairs, each pointing back to 1-12 chants for a checked total of 10578 chants

Hence I made 6316 requests to Cantus DB.
2912 returning no Cantus ID (either the request fails, or the search returns nothing): I have yet to compile statistics on the various error messages.
2438 returning a single Cantus ID, pointing back to 4976 Gregobase entries
602 returning two Cantus IDs
209 returning three Cantus IDs
155 returning four or more.
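
As a quick consistency check on the figures above, the five buckets should add up to the number of requests made. A tiny sketch using only the numbers from this message (the dictionary keys are just labels):

```python
requests_made = 6316

# Outcome of each request, as listed above.
buckets = {
    "no Cantus ID": 2912,
    "one Cantus ID": 2438,
    "two Cantus IDs": 602,
    "three Cantus IDs": 209,
    "four or more": 155,
}

total = sum(buckets.values())
share_single = buckets["one Cantus ID"] / total  # fraction resolved outright
needs_manual_choice = total - buckets["no Cantus ID"] - buckets["one Cantus ID"]
```

The same tally can be recomputed after each rerun to see whether changes to the search move requests between buckets.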

I agree with the idea of checking a sufficient subset of the 4976 Gregobase entries that supposedly have a single match in the Cantus DB; and the list of the other ones should be a useful tool for making manual Cantus ID attribution way faster.

A list of the "single match entries" (with helpful links!) is to be found here: http://marteo.fr/gregobase_entries_with_one_cantus_id.html
A list of the Gregobase entries with several matches (and the same links!): http://marteo.fr/gregobase_entries_with_several_cantus_ids.html

I will set out to start checking the first list and manually resolving the second one, all contributions are welcome :-)

Matthias

Rob Leduc

Apr 4, 2021, 9:57:45 AM
to Gregorio Users

Congratulations on the baby-to-be and glad to know all is well!

I had hoped you had still found time to work on this.  That is impressive progress.

I will look at the links; hopefully there is a way to record our progress there to avoid duplication of effort. If not, maybe we can set up a google doc or some such to log things that have been checked.

I will start by looking at a random subset of the single match entries, even though you are starting there as well.  Using a random subset will give us an idea of how correct the entries are; at some point, Olivier may be satisfied to just add them all programmatically, or add them programmatically to a new (hidden) field for automated Cantus IDs.  This would be the sort of search that could be rerun in the future as both databases grow.

After that, I'll have a look at the multiple matches.  We should be able to resolve those by hand, in time.

Best wishes and happy Easter!

Rob

Rob Leduc

Apr 4, 2021, 11:56:45 AM
to Gregorio Users

OK - so I've gone through the first 50 of my random list of size 200, drawn from the several thousand with exactly one match.  I've found the following two problems:

For Cantus ID 205777, the chant begins with "signo crucis", but the text afterwards doesn't seem to match at all.  I don't think there is a match for this chant (Gregobase chant number 5671) in the Cantus database.

For Cantus ID 002543, giving chant 7580, I think the proper Cantus ID should be 002542 instead. 

I can supply my 50 in terms of row numbers in your HTML file, but it is hard for me to automatically extract the cantus IDs. 

Based on that random sample, we're looking at an error rate of roughly 4% of Cantus IDs in the list (2 problems in 50), although with my subsample of only 50 there is a reasonable level of uncertainty in this estimate - I couldn't rule out an error rate as high as 10% yet, although I can keep checking my random list to narrow that down.

But it seems likely that with ~2500 prospective matches with 1 Cantus ID, a 4% error rate would yield about 100 errors.  So we may just have to check all 2500 by hand.  What do you think?
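
One rough way to put bounds on an error rate estimated from a small sample is a Wilson score interval. The sketch below is only an illustration of that standard formula, applied to the two problem entries found among the 50 checked; the function name is made up, and z=1.96 corresponds to an approximate 95% interval.

```python
import math


def wilson_interval(errors, sample_size, z=1.96):
    """Approximate confidence interval for an error rate (Wilson score)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Two problems found in the first 50 entries checked.
low, high = wilson_interval(2, 50)

# Rough range of wrong IDs to expect among ~2500 single-match results.
expected_low, expected_high = low * 2500, high * 2500
```

Rerunning it as more of the random list is checked shows how quickly the interval narrows, which may help decide whether checking all 2500 by hand is really necessary.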

Rob