Samples and projects present in SRAdb but not in MetaSRA


Steven Foltz

Dec 10, 2020, 3:33:01 PM
to MetaSRA Users
Hi, MetaSRA team! Thank you for this wonderful resource; it's very helpful for us to have easy access to the metadata labels generated by your group!

In the course of our study, we noticed that some samples we expected to be present in MetaSRA are not there, even though they are found in SRAdb and refine.bio. Of course, it's totally justifiable for there to be filtering and exclusions; we just want to understand why some would be missing.

Overall, our study includes ~60K refine.bio samples, and about 37% of those were not found in MetaSRA. Nearly all projects are either entirely present or not present in MetaSRA. For a few projects, there is a mix of present and not present samples.
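The per-project breakdown described above can be sketched roughly as follows. This is only an illustration: the accessions are made up, and `classify_projects` is a hypothetical helper, not code from our analysis.

```python
from collections import defaultdict


def classify_projects(sample_to_project, metasra_samples):
    """Label each project by how many of its samples appear in MetaSRA."""
    counts = defaultdict(lambda: [0, 0])  # project -> [present, total]
    for sample, project in sample_to_project.items():
        counts[project][1] += 1
        if sample in metasra_samples:
            counts[project][0] += 1

    status = {}
    for project, (present, total) in counts.items():
        if present == total:
            status[project] = "all present"
        elif present == 0:
            status[project] = "all absent"
        else:
            status[project] = "mixed"
    return status


# Hypothetical toy data; real inputs would come from refine.bio and MetaSRA.
sample_to_project = {
    "SRS_TOY_1": "SRP_TOY_A",
    "SRS_TOY_2": "SRP_TOY_A",
    "SRS_TOY_3": "SRP_TOY_B",
    "SRS_TOY_4": "SRP_TOY_C",
    "SRS_TOY_5": "SRP_TOY_C",
}
metasra_samples = {"SRS_TOY_1", "SRS_TOY_2", "SRS_TOY_4"}

status = classify_projects(sample_to_project, metasra_samples)
print(status)
```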

For some, we may know why. For example: not having ILLUMINA as the platform.

For others, we were unable to discern. Perhaps some study descriptions just don't yield significant biological terms?

Attached, please find a markdown file containing some of our exploratory findings. We would greatly appreciate any follow-up wisdom you could share to help us understand why we are missing some samples.

Thank you!
Steven Foltz
preliminary_look_at_missing_samples.html
preliminary_look_at_missing_samples.md

bernie...@gmail.com

Dec 14, 2020, 11:17:04 AM
to MetaSRA Users
Hi Steven,

Thank you so much for your interest in the MetaSRA and for looking so thoroughly into these issues! 

I have started to dig into this and have a couple of points/conclusions. First, yes, currently the MetaSRA only indexes Illumina samples (i.e. where the "platform" field equals "ILLUMINA"); however, it may be helpful for a future release to annotate all samples regardless of platform. Second, regarding the mystery samples, I looked into a few of these, and it appears that they are missing from the MetaSRA because the SRAdb is missing their sample attributes, which the MetaSRA uses to label each sample. For example, for sample SRS1771380, the following query within the SRAdb's SQLite file returns the empty string:

select sample_attribute from sample where sample_accession='SRS1771380';

This means that this sample's attributes appear to be missing from the SRAdb despite the fact that on the website there do appear to be attributes (https://www.ncbi.nlm.nih.gov/sra/?term=SRS1771380).
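To run the same check over a whole list of accessions, something like the following might work. It is a minimal sketch that uses an in-memory table with made-up rows in place of the real SRAdb SQLite file; only the "sample" table and its "sample_accession" and "sample_attribute" columns come from the query above, and `samples_missing_attributes` is a hypothetical helper.

```python
import sqlite3


def samples_missing_attributes(conn, accessions):
    """Return accessions that exist in `sample` but have no attribute text."""
    missing = []
    for acc in accessions:
        row = conn.execute(
            "SELECT sample_attribute FROM sample WHERE sample_accession = ?",
            (acc,),
        ).fetchone()
        # Skip accessions with no row at all; flag NULL or blank attributes.
        if row is not None and (row[0] is None or row[0].strip() == ""):
            missing.append(acc)
    return missing


# Toy stand-in for the SRAdb SQLite file (hypothetical rows, not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (sample_accession TEXT, sample_attribute TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [
        ("SRS_TOY_1", "tissue: liver"),
        ("SRS_TOY_2", ""),
        ("SRS_TOY_3", None),
    ],
)

result = samples_missing_attributes(
    conn, ["SRS_TOY_1", "SRS_TOY_2", "SRS_TOY_3", "SRS_TOY_4"]
)
print(result)
```

Against the real database, the connection would instead point at the downloaded SQLite file, and the accession list would be the samples absent from the MetaSRA.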

So, for at least some of the samples you have pointed out, it appears that the SRAdb is simply missing the attributes that the MetaSRA uses as input. Any sample for which this is the case would be absent from the MetaSRA.

Please let me know if anything I said doesn't make sense, or if you find any other issues. In the meantime, I will keep digging to make sure there aren't other issues causing these samples to be missing.

Best,
Matt  

bernie...@gmail.com

Dec 15, 2020, 7:57:23 AM
to MetaSRA Users
Hi Steven, 

To follow up with some more details after looking further into this: it is indeed disturbing that only about 60% of those refine.bio samples are in the MetaSRA. It should be much higher. As I described in my last message, at first I thought that the cause was some samples missing their sample attributes in the SRAdb; however, this does not seem to be the case for the vast majority of samples. Some samples are indeed missing their sample attributes, but not nearly enough to explain the discrepancy between refine.bio and the MetaSRA.

Looking deeper, I noticed something strange about many of the samples that you pointed out: they are missing species information. It appears that a vast number of entries in the SRAdb (nearly 4 million) are missing species information (i.e. have no values in the "scientific_name" or "common_name" fields in the "sample" table) and are thus not being picked up by the MetaSRA's query:

SELECT * FROM experiment JOIN sample USING (sample_accession) WHERE library_strategy = 'RNA-Seq' AND scientific_name = 'Homo sapiens' AND platform = 'ILLUMINA'

This missing data is indeed a problem, as clearly there are many human samples among these entries, as evidenced by the fact that refine.bio and the MetaSRA differ by so many samples. We will need to address this somehow.
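As a rough way to gauge the size of the gap, one could count the Illumina RNA-Seq entries that the query above skips because both species fields are blank. The sketch below runs against a toy in-memory database with made-up rows. It assumes "no values" means NULL or an empty string in both "scientific_name" and "common_name"; the "experiment_accession" column and all accessions are included only for illustration.

```python
import sqlite3

# Count of rows the MetaSRA-style query would select.
SELECTED_QUERY = """
SELECT COUNT(*) FROM experiment JOIN sample USING (sample_accession)
WHERE library_strategy = 'RNA-Seq'
  AND scientific_name = 'Homo sapiens'
  AND platform = 'ILLUMINA'
"""

# Count of Illumina RNA-Seq rows with no species information at all
# (assumption: both fields NULL or blank).
MISSING_SPECIES_QUERY = """
SELECT COUNT(*) FROM experiment JOIN sample USING (sample_accession)
WHERE library_strategy = 'RNA-Seq'
  AND platform = 'ILLUMINA'
  AND COALESCE(TRIM(scientific_name), '') = ''
  AND COALESCE(TRIM(common_name), '') = ''
"""

# Toy stand-in for the SRAdb SQLite file (hypothetical rows, not real data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment (experiment_accession TEXT, sample_accession TEXT, "
    "library_strategy TEXT, platform TEXT)"
)
conn.execute(
    "CREATE TABLE sample (sample_accession TEXT, scientific_name TEXT, "
    "common_name TEXT)"
)
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?, ?, ?)",
    [
        ("SRX_TOY_1", "SRS_TOY_1", "RNA-Seq", "ILLUMINA"),
        ("SRX_TOY_2", "SRS_TOY_2", "RNA-Seq", "ILLUMINA"),
        ("SRX_TOY_3", "SRS_TOY_3", "RNA-Seq", "ILLUMINA"),
        ("SRX_TOY_4", "SRS_TOY_4", "WGS", "ILLUMINA"),
    ],
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [
        ("SRS_TOY_1", "Homo sapiens", "human"),
        ("SRS_TOY_2", None, None),
        ("SRS_TOY_3", "", ""),
        ("SRS_TOY_4", "Homo sapiens", "human"),
    ],
)

selected = conn.execute(SELECTED_QUERY).fetchone()[0]
missing_species = conn.execute(MISSING_SPECIES_QUERY).fetchone()[0]
print(selected, missing_species)
```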

Best,

Matt

Steven Foltz

Dec 15, 2020, 4:16:49 PM
to MetaSRA Users
Matt,

Thank you for following up. I appreciate you taking the time to look into this issue. If there is any way for me to be helpful, please let me know!

--Steven
