About the sort of GTF file by chromosome

wayj86 wayj86

Jan 29, 2018, 11:09:46 AM
to gen...@soe.ucsc.edu
Dear Sir or Madam,

I am using STAR to align my reads to the human genome. The FASTA files (chr1, ... chrX, chrY, chrM) were downloaded from UCSC and merged into one file. I then generated the GTF file following the UCSC wiki:

1. Download your gene set of interest for hg19. For this example, I'll use the refGene table, but you can choose other gene sets, such as the knownGene table from the "UCSC Genes" track.

rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz ./

2. Unzip

gzip -d refGene.txt.gz

3. Remove the first "bin" column:

cut -f 2- refGene.txt > refGene.input

4. Convert to gtf:

genePredToGtf file refGene.input hg19refGene.gtf

5. Sort output by chromosome and coordinate

cat hg19refGene.gtf  | sort -k1,1 -k4,4 > hg19refGene.gtf.sorted

Example output for hg19refGene.gtf.sorted:

$ head hg19refGene.gtf.sorted
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316973"; exon_number "7"; exon_id "NM_001316973.7"; gene_name "LZIC";
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316975"; exon_number "7"; exon_id "NM_001316975.7"; gene_name "LZIC";
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316976"; exon_number "5"; exon_id "NM_001316976.5"; gene_name "LZIC";
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_032368"; exon_number "7"; exon_id "NM_032368.7"; gene_name "LZIC";
chr1 refGene.input CDS 10002739 10002793 . - 0 gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
chr1 refGene.input exon 10002739 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
chr1 refGene.input start_codon 10002791 10002793 . - 0 gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
chr1 refGene.input exon 10002981 10003083 . + . gene_id "NMNAT1"; transcript_id "NM_001297778"; exon_number "1"; exon_id "NM_001297778.1"; gene_name "NMNAT1";
chr1 refGene.input transcript 10002981 10045556 . + . gene_id "NMNAT1"; transcript_id "NM_001297778";  gene_name "NMNAT1";
chr1 refGene.input exon 10003307 10003485 . - . gene_id "LZIC"; transcript_id "NM_032368"; exon_number "8"; exon_id "NM_032368.8"; gene_name "LZIC";

However, step 5 (sorting the output by chromosome and coordinate) leaves the chromosomes in this order:

cat hg19refGene.gtf.sorted | cut -f1 | sort -u

chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr20
chr21
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY

This does not follow the order chr1, chr2, chr3, ... chr22, chrM, chrX, chrY. I also tried sort -nk1.4 hg19refGene.gtf.sorted, but that did not work either. So my questions are:

1. Is there a better way to sort the file into that order? I apologize if this is not an appropriate question for this list.

2. When generating genome indexes, it is not a problem if the chromosome order in the GTF differs from that in the genome FASTA file, is it?

Best regards,
Stanley

Christopher Lee

Jan 29, 2018, 3:51:49 PM
to wayj86 wayj86, UCSC Genome Browser Discussion List
Greetings Stanley,

This mailing list is not a general scientific advice forum, and is
intended only for questions regarding the data or tools contained
within the UCSC Genome Browser website. Please direct your questions
to a more general advice forum like BioStars, or to the STAR support
forum directly:
BioStars: https://www.biostars.org/
STAR: https://github.com/alexdobin/STAR and
https://groups.google.com/d/forum/rna-star

If you have questions regarding the UCSC Genome Browser in the future,
feel free to send a message to one of our mailing lists:

- General questions: gen...@soe.ucsc.edu
- Questions involving private data: genom...@soe.ucsc.edu
- Questions involving mirror sites: genome...@soe.ucsc.edu

Thanks,

Christopher Lee
UCSC Genomics Institute
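
For question 1 above, one possible approach is sketched here. It is not part of the original replies and assumes GNU sort (coreutils), whose V key modifier compares embedded numbers by value so that chr2 sorts before chr10. The function name sort_gtf_natural is hypothetical. The n modifier on the start column is also an addition: with a plain -k4,4, coordinates are compared as text, which is why 10002682 can appear ahead of smaller start positions.

```shell
# Hypothetical helper: sort a GTF by natural chromosome order, then by start.
# Assumes GNU sort; the V ("version sort") modifier is a GNU extension.
sort_gtf_natural() {
    # -t <TAB> : GTF fields are tab-separated
    # -k1,1V   : chromosome name in version order (chr1, chr2, ... chr10)
    # -k4,4n   : start coordinate compared numerically, not as text
    LC_ALL=C sort -t "$(printf '\t')" -k1,1V -k4,4n "$1"
}
```

It could be used as sort_gtf_natural hg19refGene.gtf > hg19refGene.gtf.sorted, and the resulting chromosome order checked with cut -f1 hg19refGene.gtf.sorted | uniq. With this ordering the lettered chromosomes (chrM, chrX, chrY) come after the numbered ones.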