XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

hongy...@gmail.com

unread,

Sep 29, 2021, 4:22:13 AM9/29/21

to

I tried to convert a xls file into csv with the following command, but failed:

$ in2csv --sheet 'Sheet1' 2021-2022-1.xls
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

The above testing file is located at here [1].

[1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls

Any hints for fixing this problem?

Regards,
HZ

J.O. Aho

unread,

Sep 29, 2021, 5:40:58 AM9/29/21

to

You need to delete the 13 first lines in the file or you see to that
your code does first trim the data before start xml parse it.

--

//Aho

hongy...@gmail.com

unread,

Sep 29, 2021, 7:10:13 AM9/29/21

to

On Wednesday, September 29, 2021 at 5:40:58 PM UTC+8, J.O. Aho wrote:
> On 29/09/2021 10.22, hongy...@gmail.com wrote:
> > I tried to convert a xls file into csv with the following command, but failed:
> >
> > $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> > XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> >
> > The above testing file is located at here [1].
> >
> > [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> >
> > Any hints for fixing this problem?
> You need to delete the 13 first lines in the file

Yes. After deleting the top 3 lines, the problem has been fixed.

> or you see to that your code does first trim the data before start xml parse it.

Yes. I really want to do this trick programmatically, but how do I do it without manually editing the file?

HZ

J.O. Aho

unread,

Sep 29, 2021, 8:12:08 AM9/29/21

to

You could do something like loading the XML into a string (myxmlstr) and
then find the fist < in that string

xmlstart = myxmlstr.find('<')

xmlstr = myxmlstr[xmlstart:]

then use the xmlstr in the xml parser, sure not as convenient as loading
the file directly to the xml parser.

I don't say this is the best way of doing it, I'm sure some python wiz
here would have a smarter solution.

--

//Aho

hongy...@gmail.com

unread,

Sep 29, 2021, 9:22:39 AM9/29/21

to

On Wednesday, September 29, 2021 at 8:12:08 PM UTC+8, J.O. Aho wrote:
> On 29/09/2021 13.10, hongy...@gmail.com wrote:
> > On Wednesday, September 29, 2021 at 5:40:58 PM UTC+8, J.O. Aho wrote:
> >> On 29/09/2021 10.22, hongy...@gmail.com wrote:
> >>> I tried to convert a xls file into csv with the following command, but failed:
> >>>
> >>> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> >>> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> >>>
> >>> The above testing file is located at here [1].
> >>>
> >>> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> >>>
> >>> Any hints for fixing this problem?
> >> You need to delete the 13 first lines in the file
> >
> > Yes. After deleting the top 3 lines, the problem has been fixed.
> >
> >> or you see to that your code does first trim the data before start xml parse it.
> >
> > Yes. I really want to do this trick programmatically, but how do I do it without manually editing the file?
> You could do something like loading the XML into a string (myxmlstr)

How to do this operation? As you have seen, the file refused to be loaded at all.

> and then find the fist < in that string
>
> xmlstart = myxmlstr.find('<')
>
> xmlstr = myxmlstr[xmlstart:]
>
> then use the xmlstr in the xml parser, sure not as convenient as loading
> the file directly to the xml parser.
>
> I don't say this is the best way of doing it, I'm sure some python wiz
> here would have a smarter solution.

Another very strange thing: I trimmed the first 3 lines in the original file and saved it into a new one named as 2021-2022-1-trimmed-top-3-lines.xls. [1]

Then I read the file with the following python script named as pandas-excel.py:

------
import pandas as pd

excel_file='2021-2022-1-trimmed-top-3-lines.xls'

#print(pd.ExcelFile(excel_file).sheet_names)

newpd=pd.read_excel(excel_file, sheet_name='Sheet1')

for i in newpd.index:
if i >1:
for j in newpd.columns:
if int(j.split()[1]) > 2:
if not pd.isnull(newpd.loc[i][j]):
print(newpd.loc[i][j])
------

$ python pandas-excel.py | sort -u
汽车实用英语 [1-8]周 1-4节 38 汽车楼413基础电气实训室II 汽修1932
汽车车载网络系统的检测与修复 [1-12]周 1-4节 38 汽车楼416安全、舒适系统实训室汽修1932

OTOH, I also tried to read the file with in2csv as follows:

$ in2csv --sheet Sheet1 2021-2022-1-trimmed-top-3-lines.xls 2>/dev/null |tr ',' '\n' | \
sed -re '/^$/d' | sort -u | awk '{print length($0),$0}' | sort -k1n | tail -3 | cut -d ' ' -f2-
汽车实用英语 [1-8]周 1-4节 38 汽车楼413基础电气实训室II 汽修1932
智能网联汽车概论 [1-8]周 6-9节 45 汽车楼511汽车营销策划实训室汽销1931
汽车车载网络系统的检测与修复 [1-12]周 1-4节 38 汽车楼416安全、舒适系统实训室汽修1932

As you can see, the above two methods give different results. I'm very puzzled by this phenomenon. Any hints/tips/comments will be greatly appreciated.

[1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1-trimmed-top-3-lines.xls

Regards,
HZ

Peter J. Holzer

unread,

Sep 29, 2021, 5:20:04 PM9/29/21

to

On 2021-09-29 01:22:03 -0700, hongy...@gmail.com wrote:
> I tried to convert a xls file into csv with the following command, but failed:
>
> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
>
> The above testing file is located at here [1].
>
> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls

signature.asc

hongy...@gmail.com

unread,

Sep 29, 2021, 9:20:37 PM9/29/21

to

On Thursday, September 30, 2021 at 5:20:04 AM UTC+8, Peter J. Holzer wrote:
> On 2021-09-29 01:22:03 -0700, hongy...@gmail.com wrote:
> > I tried to convert a xls file into csv with the following command, but failed:
> >
> > $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> > XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> >
> > The above testing file is located at here [1].
> >
> > [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> Why is that file name .xls when it's obviously an HTML file?

Good catch! Thank you for pointing this out. This file is automatically exported from my university's teaching management system, and it was assigned the .xls extension by default.

HZ

hongy...@gmail.com

unread,

Sep 29, 2021, 11:53:54 PM9/29/21

to

According to the above comment, after I change the extension to html, the following python code will do the trick:

import sys
import pandas as pd

if len(sys.argv) != 2:
print('Usage: ' + sys.argv[0] + ' input-file')
exit(1)

myhtml_pd = pd.read_html(sys.argv[1])
#In [25]: len(myhtml_pd)
#Out[25]: 3

for i in myhtml_pd[2].index:
if i > 0:
for j in myhtml_pd[2].columns:
if j >1 and not pd.isnull(myhtml_pd[2].loc[i][j]):
print(myhtml_pd[2].loc[i][j])

HZ