I recently started moving a research project from Hadoop to Disco, and I can't figure out what I'm doing wrong. My input is genome data in FASTQ format:
@SOLEXA-1GA-2_2_FC30DNN:1:2:349:1752
GAAGCTGAGCGACATCGACACGGTGATCGAGTTCTT
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@SOLEXA-1GA-2_2_FC30DNN:1:2:109:1790
AAACATATGCATCTCGTTTGTGGAAAAGCTATCGAC
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
To learn Disco, I started with a job that just reads the data file, extracts the DNA sequence from each record, and writes it back out:
from disco.core import Job, result_iterator
import collections, sys

def fastq_input_stream(stream, size, url, params):
    while True:
        seq_id = stream.next().strip()
        seq = str(stream.next().strip())
        strand = stream.next().strip()
        qual = stream.next().strip()
        yield seq_id, repr(seq), strand, qual

def map(line, params):
    seq_id, seq, strand, qual = line
    # print seq
    yield seq, 1

def reduce(dna_iter, params):
    from disco.util import kvgroup
    for seq, count in kvgroup(dna_iter):
        yield seq, sum(count)

if __name__ == '__main__':
    job = Job().run(input=["tag://test:test"],
                    map_reader=fastq_input_stream,
                    map=map,
                    reduce=reduce)
    for line, count in result_iterator(job.wait(show=True)):
        print (line, count)
But instead of DNA sequences, my output looks like escaped binary data:
("'\\x00\\x00\\x00\\x00\\x00\\x00x^\\xb5\\x9bO\\x8f+E\\x12\\xc4\\xc5\\x1e\\xf9\\x1a\\xbb\\x9c\\x10\\xc8]\\xfd\\xbf\\xb9\\xace\\xc0 \\xc1{\\x07\\xfc\\x04\\xec\\x01\\x10\\xd2'", 1)
('\'\\x87H\\xaa\\x04\\xe3\\x97\\xcdUC \\xc6T\\x8a\\xb8f\\xcf\\x12\\xc3\\xc6$\\x11\\xd8/\\xf8i\\xad\\x82g\\x0f&Z|j\\xcc\\xabO\\x11 \\xee\\xc9C\\x86\\x8b]\\xa1\\xb0\\\'\\xae\\xd3l\\x9f^\\xdc\\xb3\\xfc\\xb0\\xfaT\\xb5\\xa5\\xeaz\\x15\\xf1)\\xf7\\xf9\\x1a\\x864\\x0c)\\x84Ky\\x95C\\x7fGx\\xf4\\xaaD\\xe3\\xcd`\\x84W\\xd61`SC\\x0e\\x06*\\x88\\xf5&\\xb9\\xa0\\x980\\x7f6\\xc8\\xb0\\x08\\xe1\\x10Q/a\\xf8F\\xb0\\x1e\\xfbu\\xc7V\\xfa\\xdc\\xdcm c\\xce\\xd6\\x05iI\\\\(\\xe6\\x92\\xc6Z\\xd5+D\\xfc\\xfc\\xfb\\x9bL<\\x06\\xd1\\xb4\\xdaJ\\x9f\\xcb]?\\x03 #\\xbc\\xda\\x92>%J\\xbeIN\\x04"Z^`Br\\xb8^\\x7f{\\x0c\\xa2=H\\x11\\xfda]\\xf2\\xc0<^\\t\\xf5\\x88\\x10.\\xbcAe@\\xa8ek\\xe5\\xaa\\xbbD\\xab\\x07\\x14\\xfaC\\xb1KR\\x00\\x97\\xd4\\x80\\x0e\\x82]\\x03\\xbb\'', 1)
('"\\xa9?PM\\x8e\\xbeW\\t-\\x18\\xdcN:\\x15t\\xe4\\x88\\x18\\x03\\x07z\\xe3\\xd2k\\xbd\\x04p\\x1ec\\xe8\\x9ay\\x19\\xfab\\x966J\\xb21\\x1a*\\xc4\\x81A\\xd9,\\xed\\xd9\\x1by\\xa8\'(\\xcd\\xe5\\xdc\\x97\\x02\\xeaji\\x150\\xaeO(f\\xb9\\x9f\\xd9\'7Z\\x1e3\\xbc\\xdd\\x8b6\\xe9\\xd1F\\xe0\\xdb\\x11\\x142C\\x87\\xce\\xd2\\xad\\x8a\\xa82\\xd1\\xf0\\xb8\\xc8\\x98W\\xa0x\\xa3(p\\x01\'\\x00\\xb0\\xc1\\xc8X6G\\xa8b\\x9b,\\xab RO{\\x99\\xb7\\x03\\xdc6!\\x02r\\x831\\x01\\xe0\\xce;b\\x83wm\\x03Q\\xcd\\x8dAO\\x10\\xf5e\\t\\xea\\xa2\\r\\x05\\xb0@X,J4b\\x12\\xe8\\xaf\\xb6a\\xa8\\x13q\\xe0\\xc9\\xd3\\xdb\\xa2\\xc7}\\x84\\xd7\\xbc9\\x1a\\x1c\\xa0R\\xbe\\xe0d\\x0f\\x10\\xa3L\\xa2)w@\\x991\\x1d.=\\xa5\\xa7\\x8fY\\xad\\x88\\xec\\x1e[\\x9f\\x16\\xa9F\\xa3\\xa7\\x8c\\x97\\xe6\\x129z\\x1bD\\x15Cq\\x08N8\\x08u\\xe0_;\\x10\\xd1\\xd2\\xe1\\x8e}\\xb9\\xef\\x17\\xbe\\x88G\\xba\\xa5\\xb2y\\xd3\\xedqLz\\xec\\x93\\xa0\\x07@\\xccy\\xab\\x8c0\\xd8\\x974\\xaa\\xc6u\\x9eF\\xb2(i\\x03\\xd5\\x0eLt\\xba\\xef\\xd7L\\xe5I\\x8dXz\\x86\\x0e\\\\B\\xddK\\x81C\\xc9\\xe2\\x00\\xfd]r\\xbc\\xbd%Q"', 1)
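One detail I noticed: the first value starts with six NUL bytes followed by `x^` (0x78 0x5e). That byte pair happens to be a valid zlib stream header, so I suspect, without having confirmed it, that my reader is being handed compressed data rather than plain text. The sketch below only checks the header bytes as described in RFC 1950; `looks_like_zlib_header` is a name I made up for this illustration, and it says nothing about how Disco actually stores the data.

```python
def looks_like_zlib_header(two_bytes):
    """Return True if two bytes form a valid zlib (RFC 1950) stream header."""
    if len(two_bytes) != 2:
        return False
    cmf, flg = bytearray(two_bytes)
    uses_deflate = (cmf & 0x0F) == 8           # compression method 8 = deflate
    window_ok = (cmf >> 4) <= 7                # window size of at most 32 KiB
    checksum_ok = (cmf * 256 + flg) % 31 == 0  # header check bits
    return uses_deflate and window_ok and checksum_ok


print(looks_like_zlib_header(b"x^"))  # the bytes seen after the leading zeros
print(looks_like_zlib_header(b"GA"))  # the start of a plain-text DNA line
```

A match here is only a hint, since two arbitrary bytes can pass the check by coincidence.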
The binary output above came from a run on 400 records. Using the two records I posted as a reference, I expected to see:
GAAGCTGAGCGACATCGACACGGTGATCGAGTTCTT 1
AAACATATGCATCTCGTTTGTGGAAAAGCTATCGAC 1
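For comparison, here is a minimal sketch of what I want the job to compute, written as plain Python with no Disco involved. `parse_fastq` and `count_sequences` are names I made up for this illustration, and the code assumes well-formed four-line records and Python 3's `print`.

```python
from collections import Counter
from itertools import islice


def parse_fastq(lines):
    """Yield (seq_id, seq, strand, qual) tuples from an iterable of text lines."""
    it = iter(lines)
    while True:
        record = [line.strip() for line in islice(it, 4)]
        if len(record) < 4:  # end of input, or a truncated final record
            return
        yield tuple(record)


def count_sequences(lines):
    """Count how many times each DNA sequence occurs."""
    return Counter(seq for _, seq, _, _ in parse_fastq(lines))


sample = """\
@SOLEXA-1GA-2_2_FC30DNN:1:2:349:1752
GAAGCTGAGCGACATCGACACGGTGATCGAGTTCTT
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@SOLEXA-1GA-2_2_FC30DNN:1:2:109:1790
AAACATATGCATCTCGTTTGTGGAAAAGCTATCGAC
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
"""

counts = count_sequences(sample.splitlines())
for seq, count in counts.items():
    print(seq, count)
```

Run on a plain-text copy of the input, this gives one line per distinct sequence with its count, which is what I want the Disco job to produce.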
What am I doing wrong?
Thanks.