Incorrect output?


Mike Busch

Dec 29, 2015, 10:36:24 PM
to Disco-development
I just started trying to switch a research project from Hadoop to Disco and I don't understand what I'm doing wrong. My input data is FASTQ format genome data: 

@SOLEXA-1GA-2_2_FC30DNN:1:2:349:1752
GAAGCTGAGCGACATCGACACGGTGATCGAGTTCTT
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@SOLEXA-1GA-2_2_FC30DNN:1:2:109:1790
AAACATATGCATCTCGTTTGTGGAAAAGCTATCGAC
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

To learn Disco, all I'm doing at first is trying to read the data file, separate out the DNA data and output it back out:

from disco.core import Job, result_iterator
import collections, sys


def fastq_input_stream(stream, size, url, params):
    while True:
        seq_id = stream.next().strip()
        seq = str(stream.next().strip())
        strand = stream.next().strip()
        qual = stream.next().strip()
        yield seq_id, repr(seq), strand, qual


def map(line, params):
    seq_id, seq, strand, qual = line
#    print seq
    yield seq, 1


def reduce(dna_iter, params):
    from disco.util import kvgroup
    for seq, count in kvgroup(dna_iter):
        yield seq, sum(count)


if __name__ == '__main__':
    job = Job().run(input=["tag://test:test"],
        map_reader=fastq_input_stream,
        map=map,
        reduce=reduce)
    for line, count in result_iterator(job.wait(show=True)):
        print (line,count)

But my output is some weird hex numbers:

("'\\x00\\x00\\x00\\x00\\x00\\x00x^\\xb5\\x9bO\\x8f+E\\x12\\xc4\\xc5\\x1e\\xf9\\x1a\\xbb\\x9c\\x10\\xc8]\\xfd\\xbf\\xb9\\xace\\xc0 \\xc1{\\x07\\xfc\\x04\\xec\\x01\\x10\\xd2'", 1)
('\'\\x87H\\xaa\\x04\\xe3\\x97\\xcdUC \\xc6T\\x8a\\xb8f\\xcf\\x12\\xc3\\xc6$\\x11\\xd8/\\xf8i\\xad\\x82g\\x0f&Z|j\\xcc\\xabO\\x11 \\xee\\xc9C\\x86\\x8b]\\xa1\\xb0\\\'\\xae\\xd3l\\x9f^\\xdc\\xb3\\xfc\\xb0\\xfaT\\xb5\\xa5\\xeaz\\x15\\xf1)\\xf7\\xf9\\x1a\\x864\\x0c)\\x84Ky\\x95C\\x7fGx\\xf4\\xaaD\\xe3\\xcd`\\x84W\\xd61`SC\\x0e\\x06*\\x88\\xf5&\\xb9\\xa0\\x980\\x7f6\\xc8\\xb0\\x08\\xe1\\x10Q/a\\xf8F\\xb0\\x1e\\xfbu\\xc7V\\xfa\\xdc\\xdcm c\\xce\\xd6\\x05iI\\\\(\\xe6\\x92\\xc6Z\\xd5+D\\xfc\\xfc\\xfb\\x9bL<\\x06\\xd1\\xb4\\xdaJ\\x9f\\xcb]?\\x03 #\\xbc\\xda\\x92>%J\\xbeIN\\x04"Z^`Br\\xb8^\\x7f{\\x0c\\xa2=H\\x11\\xfda]\\xf2\\xc0<^\\t\\xf5\\x88\\x10.\\xbcAe@\\xa8ek\\xe5\\xaa\\xbbD\\xab\\x07\\x14\\xfaC\\xb1KR\\x00\\x97\\xd4\\x80\\x0e\\x82]\\x03\\xbb\'', 1)
('"\\xa9?PM\\x8e\\xbeW\\t-\\x18\\xdcN:\\x15t\\xe4\\x88\\x18\\x03\\x07z\\xe3\\xd2k\\xbd\\x04p\\x1ec\\xe8\\x9ay\\x19\\xfab\\x966J\\xb21\\x1a*\\xc4\\x81A\\xd9,\\xed\\xd9\\x1by\\xa8\'(\\xcd\\xe5\\xdc\\x97\\x02\\xeaji\\x150\\xaeO(f\\xb9\\x9f\\xd9\'7Z\\x1e3\\xbc\\xdd\\x8b6\\xe9\\xd1F\\xe0\\xdb\\x11\\x142C\\x87\\xce\\xd2\\xad\\x8a\\xa82\\xd1\\xf0\\xb8\\xc8\\x98W\\xa0x\\xa3(p\\x01\'\\x00\\xb0\\xc1\\xc8X6G\\xa8b\\x9b,\\xab RO{\\x99\\xb7\\x03\\xdc6!\\x02r\\x831\\x01\\xe0\\xce;b\\x83wm\\x03Q\\xcd\\x8dAO\\x10\\xf5e\\t\\xea\\xa2\\r\\x05\\xb0@X,J4b\\x12\\xe8\\xaf\\xb6a\\xa8\\x13q\\xe0\\xc9\\xd3\\xdb\\xa2\\xc7}\\x84\\xd7\\xbc9\\x1a\\x1c\\xa0R\\xbe\\xe0d\\x0f\\x10\\xa3L\\xa2)w@\\x991\\x1d.=\\xa5\\xa7\\x8fY\\xad\\x88\\xec\\x1e[\\x9f\\x16\\xa9F\\xa3\\xa7\\x8c\\x97\\xe6\\x129z\\x1bD\\x15Cq\\x08N8\\x08u\\xe0_;\\x10\\xd1\\xd2\\xe1\\x8e}\\xb9\\xef\\x17\\xbe\\x88G\\xba\\xa5\\xb2y\\xd3\\xedqLz\\xec\\x93\\xa0\\x07@\\xccy\\xab\\x8c0\\xd8\\x974\\xaa\\xc6u\\x9eF\\xb2(i\\x03\\xd5\\x0eLt\\xba\\xef\\xd7L\\xe5I\\x8dXz\\x86\\x0e\\\\B\\xddK\\x81C\\xc9\\xe2\\x00\\xfd]r\\xbc\\xbd%Q"', 1)

That gibberish above is the output from 400 records. Using the two records I posted above as reference, I expect to see:


GAAGCTGAGCGACATCGACACGGTGATCGAGTTCTT 1
AAACATATGCATCTCGTTTGTGGAAAAGCTATCGAC 1

What am I doing wrong?

Thanks.

Mike Busch

Dec 30, 2015, 12:23:11 PM
to Disco-development
OK, I figured this one out: the documentation doesn't make it clear that when using DDFS, one should use chain_reader. It does say so, buried way down in the API docs, but for someone coming from Hadoop, where everything MUST come from HDFS, it isn't obvious.
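
For anyone who wants to check the expected counts without a Disco cluster, here is a minimal plain-Python sketch of the record grouping the job above is aiming for. It does not use Disco or DDFS at all. The names fastq_records and count_sequences are hypothetical helpers made up for this illustration, and the sample lines are the two records from the first message. It assumes the text has already been decoded into ordinary lines, which is the part chain_reader is reported to take care of when the input lives in DDFS.

```python
from collections import Counter
from itertools import islice


def fastq_records(lines):
    """Yield (seq_id, seq, strand, qual) tuples from an iterable of FASTQ lines.

    Hypothetical helper for illustration only; not part of Disco.
    """
    it = iter(lines)
    while True:
        # Take the next four lines, which make up one FASTQ record.
        chunk = [line.strip() for line in islice(it, 4)]
        if not chunk:
            return  # input exhausted on a record boundary
        if len(chunk) < 4:
            raise ValueError("truncated FASTQ record: %r" % (chunk,))
        yield tuple(chunk)


def count_sequences(lines):
    """Count how often each DNA sequence occurs (the map + reduce step)."""
    return Counter(seq for _id, seq, _strand, _qual in fastq_records(lines))


# The two records posted in the question.
sample = [
    "@SOLEXA-1GA-2_2_FC30DNN:1:2:349:1752\n",
    "GAAGCTGAGCGACATCGACACGGTGATCGAGTTCTT\n",
    "+\n",
    "hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh\n",
    "@SOLEXA-1GA-2_2_FC30DNN:1:2:109:1790\n",
    "AAACATATGCATCTCGTTTGTGGAAAAGCTATCGAC\n",
    "+\n",
    "hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh\n",
]

counts = count_sequences(sample)
for seq, n in counts.items():
    print(seq, n)
```

Note that the sketch yields the sequence itself rather than repr(seq) as in the original reader. repr() wraps the string in quote characters, so it would show up in the output keys even once the input is decoded correctly.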