t-SNE, UMAP, dbscan

K. P. Law

Jun 5, 2023, 11:20:31 AM

to Cardinal MSI Help

Dear colleagues,

I have several questions regarding the use of Cardinal 3.

1. I would like to use t-SNE, UMAP, and dbscan (via the corresponding R packages) to analyze MSI data. I understand they are not directly supported, but I have seen a report of a ROC in which the authors managed to use t-SNE and UMAP with Cardinal in their analysis of DESI-MSI data.

I believe most users would love to see these methods supported in Cardinal 3. For now, could an experienced user point me in the right direction? Thanks.

2. Cardinal 3 is very slow (on Windows), even for a PCA with setCardinalBPPARAM(SnowParam()) [indeed, this may even slow it down]. On my 20-CPU server, only 7 cores are used, and it often causes memory problems.

It is unclear which functions can use parallelization, and for which it is not recommended.

May I suggest that the choice of backend be changed? It would also be better to have control over the number of threads used for the calculation.

Many Thanks

Kai

kbemis

Jun 5, 2023, 12:23:29 PM

to Cardinal MSI Help

1. You can always extract the (p x n) data matrix with the spectra() method and pass it to any R function that expects a data matrix. You may need to call as.matrix() on it first if it is a 'matter_mat' or 'sparse_mat' object. You may also need to transpose it, as many traditional statistical functions expect an n x p data matrix (one row per observation), unlike the p x n standard in bioinformatics.
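
As a rough illustration of the steps above, here is a minimal sketch. A small random matrix stands in for the output of spectra() so that it runs without any imaging data. The choice of the Rtsne, uwot, and dbscan packages and every parameter value (perplexity, n_neighbors, eps, minPts) are assumptions made for the example, not recommendations from the Cardinal authors.

```r
# Sketch only: the random matrix below stands in for a real dataset.
library(Rtsne)   # t-SNE
library(uwot)    # UMAP
library(dbscan)  # density-based clustering

set.seed(1)
p <- 50   # number of m/z features
n <- 200  # number of pixels (spectra)
x <- matrix(rexp(p * n), nrow = p, ncol = n)  # p x n, as in Cardinal
# With real data, following the reply above (mse is a hypothetical object name):
#   x <- as.matrix(spectra(mse))

X <- t(x)  # n x p: one row per pixel, as most statistical functions expect

# t-SNE embedding of the pixels (perplexity is an arbitrary choice here)
tsne <- Rtsne(X, dims = 2, perplexity = 30, check_duplicates = FALSE)
emb_tsne <- tsne$Y  # n x 2 matrix

# UMAP embedding of the pixels (n_neighbors is an arbitrary choice here)
emb_umap <- umap(X, n_neighbors = 15, n_components = 2)  # n x 2 matrix

# DBSCAN on the UMAP coordinates; eps and minPts must be tuned per dataset
cl <- dbscan(emb_umap, eps = 0.5, minPts = 5)
clusters <- cl$cluster  # one label per pixel, 0 means noise
```

Each row of the embeddings and each entry of clusters corresponds to one pixel, in the same order as the columns of the original p x n matrix, so the results can be matched back to pixel positions.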

2. Unfortunately, SnowParam() is not the most efficient backend, as it must start new parallel R sessions and then pass the data to them, unlike MulticoreParam(), which is available on macOS and Linux. This means a method must be very long-running for parallelization to be worth the overhead. Parallelization will also use more memory, as each task runs on a separate chunk of data. If the dataset is not in memory, the slowest part will always be reading the data chunks into memory from storage, so an SSD is strongly recommended. Any method with a "BPPARAM" parameter in its signature can make use of parallelization. Whether it is worth it depends on many factors, including the size of the dataset and how much memory you have available. Please see the BiocParallel package documentation for how to set SnowParam() parameters such as the number of workers.
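
To make the last point concrete, here is a small sketch of limiting the number of workers with BiocParallel. The worker count of 4 is an arbitrary assumption. The Cardinal call is left commented out, so the example runs with BiocParallel alone.

```r
library(BiocParallel)

# Cap the number of worker R sessions explicitly (4 is an arbitrary choice).
bp <- SnowParam(workers = 4)
bpworkers(bp)  # number of workers this backend will start

# Quick check that the backend works, independent of Cardinal.
res <- bplapply(1:8, function(i) i^2, BPPARAM = bp)

# To make this the default backend for Cardinal, as in the original question:
#   library(Cardinal)
#   setCardinalBPPARAM(bp)
# To turn parallelization off again and run serially:
#   setCardinalBPPARAM(SerialParam())
```

Because each SnowParam worker is a separate R session holding its own chunk of data, fewer workers can also help on a machine where memory is the limit rather than CPU.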