parsing the data

Kirthi Pulakanti

May 12, 2014, 4:03:12 PM
to unix-and-perl-...@googlegroups.com
Hello,

I have data in the following format:

CHR TSS-25bp TSS+25bp count tss Ensembl transcript refgene strand
chr8 68141773 68141823 1 68141798 ENSMUST00000152320 1-Mar +
chr8 68141882 68141932 3 68141907 ENSMUST00000110258 1-Mar +
chr8 68141898 68141948 3 68141923 ENSMUST00000110256 1-Mar +
chr8 68141910 68141960 3 68141935 ENSMUST00000155804 1-Mar +
chr8 68141959 68142009 2 68141984 ENSMUST00000110255 1-Mar +
chr8 68910167 68910217 2 68910192 ENSMUST00000039540 1-Mar +
chr8 68910174 68910224 2 68910199 ENSMUST00000110253 1-Mar +
chr17 33822631 33822681 2 33822656 ENSMUST00000066121 2-Mar -
chr17 33828434 33828484 2 33828459 ENSMUST00000172767 2-Mar -
chr17 33828758 33828808 1 33828783 ENSMUST00000173454 2-Mar -
chr17 33840058 33840108 1 33840083 ENSMUST00000173392 2-Mar -
chr18 56963297 56963347 1 56963322 ENSMUST00000153044 3-Mar -
chr19 37282007 37282057 4 37282032 ENSMUST00000024078 5-Mar +
chr19 37282032 37282082 5 37282057 ENSMUST00000112391 5-Mar +
chr19 37282040 37282090 4 37282065 ENSMUST00000148105 5-Mar +
chr15 31385628 31385678 2 31385653 ENSMUST00000090227 6-Mar -
chr15 31387011 31387061 1 31387036 ENSMUST00000043826 6-Mar -
For every ref gene (column 7), I would like to keep one row per distinct count value (column 4), chosen on the basis of the strand (column 8) and TSS (column 5) information. For example, there are 7 rows for 1-Mar, with counts 1, 2 and 3. If several rows for a gene have the same count, I would like to keep the one with the highest TSS if the gene is on the positive strand; likewise, if it is on the negative strand, keep the one with the lowest TSS (starting from the 3' end). For genes 1-Mar and 2-Mar, I would expect:

chr8 68141773 68141823 1 68141798 ENSMUST00000152320 1-Mar +
chr8 68141910 68141960 3 68141935 ENSMUST00000155804 1-Mar +
chr8 68910174 68910224 2 68910199 ENSMUST00000110253 1-Mar +
chr17 33822631 33822681 2 33822656 ENSMUST00000066121 2-Mar -
chr17 33828758 33828808 1 33828783 ENSMUST00000173454 2-Mar -
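
A minimal sketch of this selection rule, written in Python rather than as a shell script purely for illustration. It assumes the input is whitespace-delimited with the eight columns shown above, that the header line starts with CHR, and that "same count" means one row is kept per (refgene, count) pair. The names `pick_rows` and `SAMPLE` are hypothetical and do not come from any library.

```python
# Sample input: the 1-Mar and 2-Mar rows from the data above (assumed layout).
SAMPLE = """\
CHR TSS-25bp TSS+25bp count tss Ensembl transcript refgene strand
chr8 68141773 68141823 1 68141798 ENSMUST00000152320 1-Mar +
chr8 68141882 68141932 3 68141907 ENSMUST00000110258 1-Mar +
chr8 68141898 68141948 3 68141923 ENSMUST00000110256 1-Mar +
chr8 68141910 68141960 3 68141935 ENSMUST00000155804 1-Mar +
chr8 68141959 68142009 2 68141984 ENSMUST00000110255 1-Mar +
chr8 68910167 68910217 2 68910192 ENSMUST00000039540 1-Mar +
chr8 68910174 68910224 2 68910199 ENSMUST00000110253 1-Mar +
chr17 33822631 33822681 2 33822656 ENSMUST00000066121 2-Mar -
chr17 33828434 33828484 2 33828459 ENSMUST00000172767 2-Mar -
chr17 33828758 33828808 1 33828783 ENSMUST00000173454 2-Mar -
chr17 33840058 33840108 1 33840083 ENSMUST00000173392 2-Mar -
"""


def pick_rows(lines):
    """Keep one row per (refgene, count) pair.

    '+' strand: keep the row with the highest TSS.
    '-' strand: keep the row with the lowest TSS.
    """
    best = {}  # (refgene, count) -> fields of the row kept so far
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header (assumption: it starts with "CHR").
        if len(fields) != 8 or fields[0] == "CHR":
            continue
        count, tss = fields[3], int(fields[4])
        refgene, strand = fields[6], fields[7]
        key = (refgene, count)
        kept = best.get(key)
        if kept is None:
            best[key] = fields
            continue
        kept_tss = int(kept[4])
        if (strand == "+" and tss > kept_tss) or (strand == "-" and tss < kept_tss):
            best[key] = fields
    # Dicts preserve insertion order, so rows come out in first-seen order.
    return [" ".join(row) for row in best.values()]


if __name__ == "__main__":
    for row in pick_rows(SAMPLE.splitlines()):
        print(row)
```

On the sample rows this keeps the five rows listed above. Because the rows are grouped in a dictionary, the input does not have to be sorted by ref gene.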

Sorry for the typo, the genes should be March1 and March2.

Could you please give me some ideas on how to start writing a script for this in Unix?

Thanks for your time. 


Keith Bradnam

May 13, 2014, 2:32:32 PM
to unix-and-perl-...@googlegroups.com
Some more information might help.

  1. How many rows are in your input file?
  2. Does every row contain a unique ID in the Ensembl transcript column?
  3. Is the input file sorted by ref gene, or could you expect to see '1-Mar' further down the file?
Have you tried any approaches yourself yet?

Regards,

Keith