Re: CARD version and marker to sequence/ARO mapping

Eric Franzosa

Aug 28, 2017, 11:13:57 AM

to Aram Avila-Herrera, shortbred-users

Hi Aram,

Sorry for the delayed reply. In answer to your questions: the "2017 version" of CARD was based on their database v1.1.8. We concatenated the four FASTA files corresponding to CARD's four models of ABR protein selection, removed sequences shorter than 50 AAs, and provided the result as input to shortbred_identify (default settings). We used UniRef90 as the background "protein universe."
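The pre-processing described above (concatenate the FASTA files, drop short sequences) could be reproduced roughly as follows. This is a minimal sketch and not the script that was actually used: the function names and file paths are hypothetical, and it assumes "<50 AAs" means sequences with fewer than 50 residues were discarded.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short(records, min_len=50):
    """Keep only records with at least min_len residues (assumed cutoff)."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq


def concat_and_filter(paths, out_path, min_len=50):
    """Concatenate several FASTA files into one, dropping short sequences.

    Returns the number of sequences written.
    """
    kept = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for header, seq in filter_short(read_fasta(handle), min_len):
                    out.write(f">{header}\n{seq}\n")
                    kept += 1
    return kept
```

The combined file would then be passed to shortbred_identify together with the UniRef90 reference, as described in the message.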

You should be able to associate identifiers from the marker file with individual sequences in CARD. The identifiers that ShortBRED selected will be present in the FASTA headers from CARD v1.1.8.
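One way to do that lookup is sketched below. It rests on assumptions that should be checked against the actual files: that marker names end in a suffix such as `_TM_#01`, `_JM_#01` or `_QM_#01`, that the part before the suffix equals the first whitespace-delimited token of a CARD header, and that AROs appear in headers as `ARO:<digits>`. The example identifiers in the usage note are made up.

```python
import re

# Assumed marker suffix: _TM_/_JM_/_QM_ followed by a marker number.
MARKER_SUFFIX = re.compile(r"_(TM|JM|QM)_#\d+.*$")
ARO_PATTERN = re.compile(r"ARO:\d+")


def family_id(marker_header: str) -> str:
    """Strip the assumed marker suffix to recover the family identifier."""
    name = marker_header.lstrip(">").split()[0]
    return MARKER_SUFFIX.sub("", name)


def index_card_headers(headers):
    """Map the first token of each CARD header to the full header text."""
    index = {}
    for header in headers:
        text = header.lstrip(">").strip()
        index[text.split()[0]] = text
    return index


def lookup_marker(marker_header, card_index):
    """Return (family id, matching CARD header or None, AROs in the marker header)."""
    fam = family_id(marker_header)
    return fam, card_index.get(fam), ARO_PATTERN.findall(marker_header)
```

For example, with a hypothetical CARD header `>gb|AAA00001.1|ARO:9000001|geneA [Some organism]`, the marker `>gb|AAA00001.1|ARO:9000001|geneA_TM_#01` would resolve to that header and report `ARO:9000001`.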

When True Markers or Junction Markers are reported, they are associated with a single family among the input sequences, and are identified by one of the headers for that family. When Quasi Markers are reported, it's because ShortBRED couldn't find unique markers for a family among the input proteins. When the non-uniqueness results from indistinguishable families among the input proteins, ShortBRED collapses (clusters) them as a single, larger family. I believe this is why you're seeing multiple identifiers in the headers for Quasi Markers.

To your last question, if you have broader ABR categories of interest, then I would recommend summing the abundance of individual proteins/families within those categories, following the "is_a" logic of something like GO. I'm not sure if there are accessory files in CARD to facilitate this.

Thanks,

Eric

On Tue, Aug 15, 2017 at 8:58 PM, Aram Avila-Herrera <avilah...@llnl.gov> wrote:

Hello,

I have a couple of questions about the latest precomputed markers.

Which CARD release was used for the Antibiotic Resistance Factors linked on the homepage (https://bitbucket.org/biobakery/shortbred/downloads/ShortBRED_CARD_2017_markers.faa.gz)? Was there any filtering or pre-processing done prior to running shortbred_identify?

Is the cluster/marker mapping available, e.g., to map markers to individual CARD sequences?

How was the CARD annotation (ARO numbers) transferred to the markers? For example, does each True Marker (CD-HIT cluster) get all the lowest-level AROs of the member sequences, or is a higher-level ARO that covers all sequences in the cluster assigned to the marker?

What about Junction and Quasi Markers? I noticed some markers have multiple ARO numbers in the marker sequence header.

And a question perhaps better suited for folks familiar with CARD: If one wanted to map ShortBRED markers to a small set of high-level AROs (like GO-Slim), would you recommend "walking up the graph" from the AROs in the marker headers?

Thanks,
Aram

--
You received this message because you are subscribed to the Google Groups "shortbred-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to shortbred-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Aram Avila-Herrera

Aug 30, 2017, 6:14:57 PM

to shortbred-users, avilah...@llnl.gov

Hi Eric,

Thanks! We should be able to match the identifiers from the marker files to the individual CARD sequences. I see now that the CARD FASTA headers include one ARO.

> When True Markers or Junction Markers are reported, they are associated with a single family among the input sequences, and are identified by one of the headers for that family.

Is it possible that a single family can include sequences with different AROs? Perhaps unlikely, since sequences with different antibiotic resistance functions would have to be mostly similar in sequence space to be grouped together (e.g., long enzymes with different active sites?). Did you happen to see any such cases when generating markers?

> To your last question, if you have broader ABR categories of interest, then I would recommend summing the abundance of individual proteins/families within those categories, following the "is_a" logic of something like GO. I'm not sure if there are accessory files in CARD to facilitate this?

I think ARO.obo or ARO.owl contain the necessary "is_a" or "subClassOf" relationships.
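A rough sketch of that roll-up, assuming the standard OBO flat-file layout (`[Term]` stanzas with `id:` and `is_a:` lines) and a per-ARO abundance table that has already been built from the ShortBRED output. The function names and the abundance dictionary are hypothetical, and other relationship types in the ontology are ignored here.

```python
from collections import defaultdict


def parse_is_a(obo_lines):
    """Return {term id: set of direct is_a parents} from OBO-formatted lines."""
    parents = defaultdict(set)
    current, in_term = None, False
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id:"):
            current = line[3:].strip()
        elif in_term and current and line.startswith("is_a:"):
            # Drop the trailing "! parent name" comment.
            parents[current].add(line[5:].split("!")[0].strip())
    return parents


def ancestors(term, parents):
    """All terms reachable from `term` by following is_a links upward."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def roll_up(abundance, parents, categories):
    """Sum each term's abundance into every chosen category it falls under."""
    wanted = set(categories)
    totals = {c: 0.0 for c in categories}
    for term, value in abundance.items():
        lineage = ancestors(term, parents) | {term}
        for category in lineage & wanted:
            totals[category] += value
    return totals
```

Each term is counted once per category even when several is_a paths lead to it. If the chosen categories are nested, a term contributes to each of them, so the category totals should not be added together.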

Best,
Aram

Eric Franzosa

Aug 31, 2017, 11:39:52 AM

to shortbred-users

Hi Aram,

I haven't looked closely at the individual clusters myself. ShortBRED's default is to produce fairly tight clusters (90% identity). However, if two functions were to differ only by an important amino acid in an active site, then they could definitely wind up in the same cluster.
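Checking for such cases is straightforward once cluster membership is available. The sketch below assumes a list of (cluster id, member header) pairs has been obtained somehow (for example from the clustering output, which is not described in this thread) and that each member header carries its ARO as `ARO:<digits>`. The function name is hypothetical.

```python
import re
from collections import defaultdict

ARO_PATTERN = re.compile(r"ARO:\d+")


def mixed_aro_clusters(membership):
    """Return {cluster id: sorted AROs} for clusters holding more than one ARO.

    `membership` is an iterable of (cluster_id, member_header) pairs.
    """
    aros = defaultdict(set)
    for cluster_id, header in membership:
        aros[cluster_id].update(ARO_PATTERN.findall(header))
    return {cid: sorted(found) for cid, found in aros.items() if len(found) > 1}
```

An empty result would suggest that, at the clustering threshold used, no family merged sequences annotated with different AROs.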

Thanks,

Eric
