As in the task of tabla bol transcription, timbre-based classification of the four categories can potentially be achieved with supervised learning from tabla audio labelled with the bol category and onset times. As in bol transcription, generalization to unseen instruments is a challenge arising from the variation in physical properties and tuning across tabla sets. We build on previous work that investigated deep learning techniques for four-way tabla transcription based on the distinctiveness of timbres across the four categories (Rohit et al., 2021). The resonant bass (RB) class was especially challenging to detect, attributable to its inadequate representation in the dataset. Such data-scarce contexts have been addressed in a number of ways, with transfer learning being among the more popular: it aims to reuse network parameters by transferring knowledge between domains (Choi et al., 2017). Motivated by this, Rohit et al. (2021) took an available pretrained Western drum stroke classification network and fine-tuned it so that its hi-hat, snare and kick drum output nodes predict the tabla stroke categories D, RT and RB respectively. Transfer learning produced a marked improvement in performance for the RB class. In the current work, we propose new methods that more closely exploit the correspondence between the tabla bol categories and Western drums, in pursuit of further overall gains in task performance. We use available drum datasets to pretrain models, and investigate data augmentation techniques for the tabla audio as well as for the drum-pretraining datasets, to determine ways to improve the acoustic match between the two culturally distinct percussion instruments.
We find that drum-pretraining results in better tabla stroke classification performance than a randomly initialised model, pointing to the promise of drum data for the tabla stroke classification task. Subsequent fine-tuning of the drum-pretrained model on tabla data leads to large improvements in performance, as expected in this transfer learning setting. However, our experiments show that the performance of the fine-tuned model does not generally surpass that of the corresponding model trained from scratch on the tabla data, except for one stroke category, the resonant treble.
Building further, Chordia (2005) made use of a large set of acoustic features, inspired by timbre recognition tasks such as instrument classification, to train neural network classifiers on a larger and more realistic dataset. Classification performance was heavily influenced by the particular tabla set that the model was tested on and tended to be low on instruments not seen during training. More recent studies along similar lines make use of classifiers such as decision trees and support vector machines trained on common low-level acoustic features (Sarkar et al., 2018; Shete and Deshmukh, 2021). A main drawback of these models is that they are trained on small datasets, often on individual recordings of strokes from a single tabla rather than on realistic continuous playing where strokes overlap in time.
Extending the bol transcription task to a pattern retrieval application, Gupta et al. (2015) designed a system to transcribe tabla playing with the goal of identifying instances of common bol sequence patterns. For the transcription, a GMM-HMM system was trained on frame-wise mel-frequency cepstral coefficients (MFCCs). Although the training dataset was realistic in that it contained harmonium accompaniment in the background, the recordings were made on a single tabla and the system is therefore unlikely to perform well on new instruments.
Both the random forest and the CNNs were tested on realistic tabla accompaniment recorded in isolation but, owing to the expense of creating such a dataset, were trained on more easily available tabla solo data. With both methods, a considerable gap between train and test classification performance was observed, possibly due to a lack of sufficient training data and a mismatch in the playing styles of the two formats. Interestingly, the use of a pretrained and fine-tuned drum model was found to particularly benefit test set performance for the tabla resonant bass category, which also had the least training data available. Motivated by the success of transfer learning and data augmentation when each was used independently, we investigate the synergistic benefit of applying them together in the current work. We also revise the tabla-to-drum stroke mappings to better exploit the available training data.
A summary of the different sources for the tabla and drum training and testing datasets used in this work appears in Table 2. All audio files (of both tabla and drums) were formatted to single-channel, 16 kHz sampling rate and 16-bit depth. We provide a brief review of the relevant aspects of the datasets in this section.
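The standardisation above (mono, 16 kHz, 16-bit) can be sketched as follows. This is a minimal numpy-only illustration: the linear-interpolation resampler stands in for whatever resampling tool was actually used, which a production pipeline would replace with a polyphase or windowed-sinc resampler.

```python
import numpy as np

def format_audio(samples, sr, target_sr=16000):
    """Convert an audio array to single-channel, 16 kHz, 16-bit PCM.

    `samples` is a float array of shape (n,) or (n, channels) in [-1, 1].
    Linear-interpolation resampling is used here purely for illustration.
    """
    # Downmix to mono by averaging channels.
    if samples.ndim == 2:
        samples = samples.mean(axis=1)
    # Resample by linear interpolation onto the target time grid.
    if sr != target_sr:
        n_out = int(round(len(samples) * target_sr / sr))
        t_in = np.arange(len(samples)) / sr
        t_out = np.arange(n_out) / target_sr
        samples = np.interp(t_out, t_in, samples)
    # Quantise to signed 16-bit integers.
    return np.clip(samples * 32767, -32768, 32767).astype(np.int16)
```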
The tabla instrument diversity (in terms of number of distinct instruments recorded and the range of their tuning pitch) represented in recent work (Rohit and Rao, 2021; Rohit et al., 2021) improved considerably upon that reported in any similar previous literature, and is therefore adopted for the present work. With the primary goal of classifying strokes played in tabla accompaniment (as opposed to tabla solo), the test set comprises recordings of expressive tabla played as accompaniment to vocals but recorded in perfect isolation to limit the complexity of our task. Since most public concert audio, even if available in multi-track format, contains bleed from other instruments, these recordings were created specially for the task. For this, recordings of solo singing were first obtained from expert singers. Then, tabla artists played the corresponding accompaniment while listening to the vocals over headphones, and their playing was recorded. There are a total of ten tracks spanning a net duration of twenty minutes and yielding about 4500 strokes. The expressive nature of tabla accompaniment (lacking a fixed score and containing extempore fillers) makes the annotation of this audio challenging. The audio was labelled by first running an automatic onset detector based on the high-frequency content algorithm (Brossier et al., 2004) to obtain stroke onsets, followed by manually assigning the four-way labels by listening to the audio and visually inspecting the spectrogram. Annotation was carried out by a tabla artist and cross-checked by one of the authors of the paper.
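The high-frequency-content (HFC) novelty function underlying the onset detector weights each spectral bin by its frequency index, so broadband percussive attacks produce sharp peaks. The sketch below shows the idea with numpy only; the frame sizes, weighting, and peak-picking threshold are illustrative choices, not the settings of the detector actually used (Brossier et al., 2004).

```python
import numpy as np

def hfc_onsets(x, sr=16000, n_fft=1024, hop=256, threshold=2.0):
    """Detect onsets from a high-frequency-content novelty function.

    Parameter values are illustrative, not those used for annotation.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    weights = np.arange(1, n_fft // 2 + 2)   # linear frequency weighting
    hfc = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop:i * hop + n_fft] * window
        mag = np.abs(np.fft.rfft(frame))
        hfc[i] = np.sum(weights * mag ** 2)
    # Normalise by the mean and pick local maxima above a threshold.
    hfc /= hfc.mean() + 1e-12
    peaks = [i for i in range(1, n_frames - 1)
             if hfc[i] > hfc[i - 1] and hfc[i] >= hfc[i + 1]
             and hfc[i] > threshold]
    return np.array(peaks) * hop / sr        # onset times in seconds
```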
Due to the intensive nature of obtaining and annotating such tabla accompaniment recordings, 76 minutes of the more easily obtained tabla solo playing is used as training and validation data. This dataset comprises about 26,600 strokes rendered in solo compositions recorded from across 10 distinct tabla sets. To ensure adequate diversity of training data, instruments of sufficiently different tuning were chosen, and a variety of compositions played over a wide tempo range was included. Audio files were annotated by first running an automatic onset detector, then automatically aligning the corresponding bol sequence supplied by the players (as composition scores) with the onsets, and finally replacing bols with four-way labels following the assignment shown in Table 1. Although the tabla bol-to-stroke mapping is largely one-to-one, there are exceptions: the same bol (e.g., Na) can sometimes refer to strokes of very different types (resonant treble or damped). Hence, additional manual verification was required to assign the correct four-way label based on the actual sound production.
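The bol-replacement step amounts to a lookup from score symbols to the four categories, with unmapped or ambiguous bols routed to manual verification. The entries below are only an approximation based on common tabla conventions; the actual assignment used in this work is the one given in Table 1.

```python
# Illustrative bol-to-category mapping. The authoritative assignment is in
# Table 1 of the paper; treat these entries as hypothetical examples.
BOL_TO_CLASS = {
    "ta": "D", "tete": "D", "ke": "D", "kat": "D",
    "na": "RT", "tin": "RT", "tun": "RT",
    "ge": "RB", "ghe": "RB",
    "dha": "B", "dhin": "B",   # composite treble + bass strokes
}

def labels_for_score(bols):
    """Map an aligned bol sequence from a composition score to four-way
    labels; unknown or ambiguous bols are flagged for manual verification."""
    return [BOL_TO_CLASS.get(b.lower(), "VERIFY") for b in bols]
```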
While the accompaniment-style test dataset is relatively small in size (20 minutes), the availability of the larger tabla solo dataset serves us well as a training dataset while also facilitating the more reliable testing of within-style classification in cross-validation mode.
Rohit et al. (2021) reported an improvement in tabla stroke classification performance with the use of tabla-specific data augmentation methods, along with standard pitch-shifting and time-scaling. These methods included spectral filtering and stroke remixing. Spectral filtering involved modifying the overall balance between low and high frequencies to emulate variations in instrument-related acoustic characteristics as validated by a tabla instrument classification experiment. Stroke remixing used non-negative matrix factorization (NMF) to decompose tabla audio into the three atomic strokes (damped, treble, bass) followed by recombination with different weightings to simulate expressive variations and different artistic and playing styles. We continue to use the same tabla data augmentation in the present work and additionally investigate a new method specific to RB strokes.
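The spectral-filtering idea can be illustrated with a simple frequency-domain tilt that re-balances low against high frequencies, mimicking instrument-to-instrument acoustic variation. This is a simplified stand-in for the actual augmentation filters; the tilt amount and reference frequency below are assumptions, not values from the cited work.

```python
import numpy as np

def spectral_tilt(x, sr=16000, tilt_db_per_octave=1.5):
    """Apply a gentle spectral tilt (in dB/octave about a reference
    frequency) to re-balance low vs. high frequencies. Parameter values
    here are illustrative only."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    ref = 200.0                                   # reference frequency, Hz
    octaves = np.log2(np.maximum(freqs, 1.0) / ref)
    gain = 10.0 ** (tilt_db_per_octave * octaves / 20.0)
    return np.fft.irfft(spectrum * gain, n=len(x))
```

Drawing the tilt randomly per training example (positive or negative) yields brighter- or duller-sounding variants of the same stroke, which is the effect the augmentation is after.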
Schematic of the four-way classification system using three one-way CNN models to predict the presence of atomic strokes (D, RT-any and RB-any) in a given audio frame. If both RT and RB onsets are detected, the onset is marked B. Any D that co-occurs with RT or RB is ignored.
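The fusion rules in the schematic reduce to a short decision function over the three one-way outputs for a frame. The 0.5 decision threshold below is a placeholder assumption; the priority order (B before RT/RB before D) follows the rules stated in the caption.

```python
def combine_predictions(p_d, p_rt, p_rb, threshold=0.5):
    """Fuse the three one-way model outputs for one frame into a four-way
    decision: co-occurring RT and RB onsets merge into B, and a D that
    co-occurs with a resonant onset is ignored. Threshold is illustrative."""
    d, rt, rb = p_d > threshold, p_rt > threshold, p_rb > threshold
    if rt and rb:
        return "B"
    if rt:
        return "RT"
    if rb:
        return "RB"
    if d:
        return "D"
    return None   # no onset detected in this frame
```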
The classification models for each stroke are CNNs that operate on short excerpts of the log-scaled mel spectrogram to produce a value between 0 and 1 indicating the probability of an onset in the center frame of the input. The general architecture is highlighted in Figure 3, and the hyperparameter values related to the model architecture for each stroke are detailed in Table 4. These values are based on previously suggested model architectures for tabla and drum transcription tasks (Rohit et al., 2021; Jacques and Röbel, 2018). The RB-any model consists of two convolutional layers with 16 and 32 filters respectively and a penultimate dense layer with 128 units. For RT-any, the number of filters in the two convolutional layers is doubled to 32 and 64 respectively. For D, the number of units in the dense layer is doubled to 256.
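A sketch of such a one-way model, using the RB-any filter counts (16 and 32) and 128 dense units, might look as follows in PyTorch. The kernel sizes, pooling, input dimensions (mel bands by context frames) and the absence of regularisation layers are assumptions on our part, since those hyperparameters are the ones detailed in Table 4.

```python
import torch
import torch.nn as nn

class OneWayOnsetCNN(nn.Module):
    """One-way onset classifier in the spirit of the RB-any model: two
    convolutional layers (16 and 32 filters), a 128-unit dense layer and a
    sigmoid onset probability for the centre frame. Kernel sizes, pooling
    and input shape are illustrative assumptions, not values from Table 4."""

    def __init__(self, n_mels=80, n_frames=15, filters=(16, 32), dense=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, filters[0], kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),          # pool along frequency only
            nn.Conv2d(filters[0], filters[1], kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        flat = filters[1] * (n_mels // 4) * n_frames
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, dense),
            nn.ReLU(),
            nn.Linear(dense, 1),
            nn.Sigmoid(),                  # onset probability, centre frame
        )

    def forward(self, x):                  # x: (batch, 1, n_mels, n_frames)
        return self.classifier(self.features(x))
```

The RT-any and D variants then follow by passing `filters=(32, 64)` or `dense=256` respectively, matching the doublings described above.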