I have an input text file like the one below:
<a><b><c>val1</c></b></a>||<a><b><c>val2</c></b></a>||<a><b>
<c>val3</c></b></a>||<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>
If you look at the input carefully, the third XML record (the one after the second '||') is split across two lines.
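To make the layout concrete, here is a minimal Python sketch of how the four records relate to the two physical lines. It is only an illustration under the assumption that the whole input fits in memory and that '||' never occurs inside a record; the helper name `split_records` is hypothetical and is not part of Hadoop Streaming.

```python
def split_records(text, delimiter="||"):
    """Join the physical lines, then split on the record delimiter."""
    # Assumption: a line break inside a record carries no meaning,
    # so it can simply be dropped before splitting.
    joined = "".join(text.splitlines())
    return [rec for rec in joined.split(delimiter) if rec]


# The two input lines from the sample above.
sample = (
    "<a><b><c>val1</c></b></a>||<a><b><c>val2</c></b></a>||<a><b>\n"
    "<c>val3</c></b></a>||<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>"
)

records = split_records(sample)
print(len(records))  # 4
print(records[2])    # <a><b><c>val3</c></b></a>
```

Joining before splitting is what puts the two halves of the third record back together; a reader that treats each physical line as a unit would see `<a><b>` and `<c>val3</c></b></a>` separately.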
I want to use the StreamXmlRecordReader of Hadoop Streaming to parse this file, with the following option:
-inputreader "org.apache.hadoop.streaming.StreamXmlRecordReader,begin=<a>,end=</a>,slowmatch=true"
With this reader I am unable to parse the 3rd record.
I am getting the error below:
Traceback (most recent call last):
  File "/home/rsome/test/code/m1.py", line 13, in <module>
    root = ET.fromstring(xml_str.getvalue())
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 964, in XML
    return parser.close()
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 1254, in close
    self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 18478
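The error is consistent with the mapper receiving only part of a record, such as the `<a><b>` fragment listed under rec::3 in the job output further down. The sketch below is a hedged illustration of that behaviour, not the code from m1.py; `try_parse` is a hypothetical helper, and the broad `except` is used only so the same snippet runs on Python 2.6 (ExpatError) and on later versions (ParseError).

```python
import xml.etree.ElementTree as ET


def try_parse(fragment):
    """Return (root, None) on success or (None, error message) on failure."""
    try:
        return ET.fromstring(fragment), None
    except Exception as exc:  # ExpatError on Python 2.6, ParseError on 2.7+
        return None, str(exc)


# A complete record parses; a record cut off at the line break does not.
whole, whole_err = try_parse("<a><b><c>val3</c></b></a>")
half, half_err = try_parse("<a><b>")

print(whole.tag)             # a
print(half_err is not None)  # True
```

Returning the error message instead of raising makes it possible to log the offending fragment from inside the mapper and see exactly where a record was cut.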
I have set slowmatch=true as well, but still no luck.
Below is the output of running the streaming job:
$ hdfs dfs -text /user/rsome/poc/testout001/part-*
rec::1
<a><b><c>val1</c></b></a>
rec::2
<a><b><c>val2</c></b></a>
rec::3
<a><b>
rec::4
<c>val3</c></b></a>
rec::1
<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>
My expected output is:
$ hdfs dfs -text /user/rsome/poc/testout001/part-*
rec::1::mapper1
<a><b><c>val1</c></b></a>
rec::2::mapper1
<a><b><c>val2</c></b></a>
rec::3::mapper1
<a><b><c>val3</c></b></a>
rec::1::mapper2
<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>
Any help on this would be greatly appreciated.