Hi All-
biocommons.seqrepo 0.5.5 was released yesterday and there's a new data release, 2020-04-13, as of today.
This release fixes several bugs in FASTA header parsing that made some sequences appear to be missing from seqrepo.
The new release has these changes:
- Some seqrepo aliases included the sequence description from the fasta source. These were fixed.
- Ensembl releases <= 84 were dropped.
- The remaining Ensembl-nn namespaces were collapsed into the Ensembl namespace.
- The gi and genbank namespaces were dropped.
- RefSeq updates since Jan 2019 were loaded/reloaded.
- Ensembl sequences from releases 90-99 were loaded (into the Ensembl namespace).
- Japanese Reference Genome v1 and v2 were added.
A byproduct of removing sequence descriptions and redundant aliases is that the aliases database shrank by about 1.5 GB (~13% of the release size).
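
If you have code or data that stores namespace-qualified aliases from older data releases, the namespace changes above may require a small migration. Here is a minimal sketch of one way to do that. It is not part of seqrepo: the helper name migrate_alias and the "namespace:alias" string format (e.g. "Ensembl-85:ENST00000288602.10") are assumptions for illustration, so adjust them to match how your aliases are stored.

```python
import re

# Assumption: older aliases were written as "<namespace>:<alias>", with
# release-specific Ensembl namespaces such as "Ensembl-85".
_ENSEMBL_VERSIONED = re.compile(r"^Ensembl-\d+$")

# Namespaces dropped in the 2020-04-13 data release.
_DROPPED_NAMESPACES = {"gi", "genbank"}


def migrate_alias(namespaced_alias):
    """Map an old 'namespace:alias' string to its 2020-04-13 equivalent.

    Returns None when the namespace was dropped. Hypothetical helper,
    not a seqrepo function.
    """
    namespace, sep, alias = namespaced_alias.partition(":")
    if not sep or not alias:
        raise ValueError("expected 'namespace:alias', got %r" % namespaced_alias)

    if namespace in _DROPPED_NAMESPACES:
        return None

    if _ENSEMBL_VERSIONED.match(namespace):
        # All Ensembl-nn namespaces now live under "Ensembl". Sequences that
        # only existed in releases <= 84 may still fail to resolve, since
        # those releases were dropped.
        return "Ensembl:" + alias

    # Other namespaces (e.g. RefSeq) are passed through unchanged.
    return namespaced_alias


if __name__ == "__main__":
    for old in ["Ensembl-85:ENST00000288602.10", "gi:12345", "RefSeq:NM_000551.3"]:
        print(old, "->", migrate_alias(old))
```

A None result means the alias has no counterpart in the new release and needs to be replaced with a RefSeq or Ensembl accession by other means.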
-Reece