Reading fixed width files

110 views
Skip to first unread message

Joseph T Bradley

unread,
Aug 20, 2018, 5:18:41 PM8/20/18
to python-etl
Hi all,
I have been using petl for a little bit, and had the need to read fixed width files. I didn’t want to add in pandas for this, so wrote a package I called petl-fwf which adds 2 functions for reading fixed width files. The basic structure is much like the fromtext function in petl, except it adds some additional parameters specific to fixed width files. It doesn’t include the ability to write files, but that could be added later on. Would appreciate any feedback, issues, or PRs you may have. Thanks!

https://pypi.org/project/petl-fwf/

Brad Maggard

unread,
Aug 20, 2018, 8:53:12 PM8/20/18
to python-etl
I personally would not install a separate package for this.  Perhaps instead add a 'fixed_width_fields' parameter to `fromtext`?

In [1]: import petl as etl

In [2]: fixed_width_text = '  18 5 2\n2018 5 2\n20180502'

In [3]: with open('fixed_width.txt', 'w') as f:
   ...:     f.write(fixed_width_text)
   ...:

In [4]: lines_table = etl.fromtext('fixed_width.txt', strip='\n')

In [5]: lines_table
Out[5]:
+------------+
| lines      |
+============+
| '  18 5 2' |
+------------+
| '2018 5 2' |
+------------+
| '20180502' |
+------------+

In [6]: from collections import OrderedDict

In [7]: fixed_width_fields = OrderedDict()

In [8]: fixed_width_fields['year']    = lambda r: r.lines[0:4]

In [9]: fixed_width_fields['month']   = lambda r: r.lines[4:6]

In [10]: fixed_width_fields['day']     = lambda r: r.lines[6:8]

In [11]: result_table = etl.fieldmap(lines_table, fixed_width_fields)

In [12]: result_table
Out[12]:
+--------+-------+------+
| year   | month | day  |
+========+=======+======+
| '  18' | ' 5'  | ' 2' |
+--------+-------+------+
| '2018' | ' 5'  | ' 2' |
+--------+-------+------+
| '2018' | '05'  | '02' |
+--------+-------+------+

Joseph T Bradley

unread,
Aug 20, 2018, 9:37:22 PM8/20/18
to python-etl
I can appreciate that: I posted it here just in case it might be useful, but I did create it because of my own needs, so it may not be. 

My goal was to have an easily re-usable setup for any number of of fixed width layouts. The code you've outlined above seems a bit more verbose than my setup, and is also more brittle when it comes to any changes to the layout, especially if columns get added anywhere but at the end of the file; you'll need to update the slices for each subsequent column. Many of the fixed width files I typically work with have headers and footers that don't resemble the rest of the file and aren't as useful, so the skiprows and skipfooter options for my package will skip n rows at the top or bottom of the file (respectively), but still properly return the header like a petl Table would.

It's totally true that you could just re-implement the methods I have to accomplish all of this, but that's the point of sharing packages, right :)  

Brad Maggard

unread,
Aug 21, 2018, 12:27:32 AM8/21/18
to python-etl
Firstly, let me say that I very much appreciate your contribution.  I'm always excited to see folks using and contributing to petl.

My goal was to have an easily re-usable setup for any number of of fixed width layouts.  The code you've outlined above seems a bit more verbose than my setup, and is also more brittle when it comes to any changes to the layout, especially if columns get added anywhere but at the end of the file; you'll need to update the slices for each subsequent column.

I agree.  My code was put together hastily, but I think your concerns would be addressed by functionalizing the creation of `fixed_width_fields`fromfwf.ipynb

Many of the fixed width files I typically work with have headers and footers that don't resemble the rest of the file and aren't as useful, so the skiprows and skipfooter options for my package will skip n rows at the top or bottom of the file (respectively), but still properly return the header like a petl Table would.

I have not (yet) solved for that in this feedback.

It's totally true that you could just re-implement the methods I have to accomplish all of this, but that's the point of sharing packages, right :)  

Yes, but please just contribute to petl directly, or petlx if you feel it is fringe.



 



Brad Maggard

unread,
Aug 21, 2018, 12:30:05 AM8/21/18
to python-etl
Also: `field_widths` is most likely described in external metadata, so ...

Joseph T Bradley

unread,
Aug 21, 2018, 12:35:05 AM8/21/18
to python-etl
Since this only solves for reading fixed width files (and not writing), I didn’t consider it ready for a PR for petlx or petl, which is the only reason it’s separate, honestly. Do you disagree?

Brad Maggard

unread,
Aug 21, 2018, 1:25:31 AM8/21/18
to pytho...@googlegroups.com
I do disagree.  Several other I/O functions are one-way.

On Tue, Aug 21, 2018 at 12:35 AM Joseph T Bradley <jtbr...@gmail.com> wrote:
Since this only solves for reading fixed width files (and not writing), I didn’t consider it ready for a PR for petlx or petl, which is the only reason it’s separate, honestly. Do you disagree?

--
You received this message because you are subscribed to the Google Groups "python-etl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-etl+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brad Maggard

unread,
Aug 21, 2018, 10:39:54 AM8/21/18
to pytho...@googlegroups.com
Many of the fixed width files I typically work with have headers and footers that don't resemble the rest of the file and aren't as useful, so the skiprows and skipfooter options for my package will skip n rows at the top or bottom of the file (respectively), but still properly return the header like a petl Table would.

Perhaps `skiplast` could be implemented in petl.head() when n< 0; ala GNU `head`?

       -n, --lines=[-]K
              print the first K lines instead of the first 10; with the leading '-', print all but the last K lines of each file

Right now, `table.head(-1).nrows()` raises:

ValueError: Stop argument for islice() must be None or an integer: 0 <= x <= sys.maxsize.


Brad Maggard

unread,
Aug 21, 2018, 11:08:37 AM8/21/18
to pytho...@googlegroups.com
Likewise for 'skiprows': could be worked into tail(), although I'd stick with '-' instead of what would have to be done to implement it as '+' , which may be more semantically sound, but how would you even do that?.

`man tail` (again GNU, BSD versions don't document support for this):

      -n, --lines=K
              output the last K lines, instead of the last 10; or use -n +K to output starting with the Kth

Currently, `petl.tail(-n)` returns no rows:  negative n in tail.ipynb

timbledum

unread,
Aug 22, 2018, 3:26:23 AM8/22/18
to python-etl
I would absolutely be keen for a fixed width loader (and possibly saver) to be included into petl core.

We currently have skip() – how does skiprows() differ? I would be pro having a skipfooters function – petl is nothing but a bunch of convenience functions IMO after all, although this case is obviously easy to do with len and rowslice.

Cheers
Marcus

Brad Maggard

unread,
Aug 22, 2018, 7:57:16 AM8/22/18
to pytho...@googlegroups.com
I would absolutely be keen for a fixed width loader (and possibly saver) to be included into petl core.

I don't currently see the need for a new "saver".  I would like to better understand why `totext()`with formatting is insufficient? 
 
We currently have skip() – how does skiprows() differ? 

skip: Skip n rows, including the header row, such that:
`
etl.dummytable().skip(99)
`
produces the last 2 rows of the table
,
and
`
etl.dummytable().skip(99).nrows()
`
returns 1


`
skiprows(99)
`
 
or perhaps
:
 
`
tail(n=99)
`
or, how about
something like
)
:
 
`
skip(keep_header=True)
`

Skip n rows, the header row would be produced.
On the same table, the header + 1 row would be produced, and nrows would still return 1.
 
 this case is obviously easy to do with len and rowslice.

len() would materialize the table, which is unacceptable.  I believe when I looked at it, Joseph's solution was to use deque(), which does not.

Brad Maggard

unread,
Aug 22, 2018, 7:58:50 AM8/22/18
to python-etl
I goofed that last one.  Should have been:

 or perhaps
:
 
`
tail(n=-99)
`

Joseph T Bradley

unread,
Aug 22, 2018, 8:29:08 AM8/22/18
to python-etl
I really like overloading head and tail: it seems simple and straightforward. I’ll submit a PR to do that later this week, if you agree. I’ll make one for overloading head and tail for petl, but I think that fromfwf might need a bit more work, before going into petl.

Joseph T Bradley

unread,
Aug 22, 2018, 9:16:46 AM8/22/18
to python-etl
Re:just using from/totext: the primary benefit would be to use a single config to govern both input and output. Also, something that I haven’t implemented, but that having a single config would be most helpful for, is properly handling padding (in and out). I only needed to read values, and handled padding with my conversion functions, so I didn’t do this part. I also didn’t need to read header/footers with different layouts than the body (and from each other), but that is also pretty common for fixed width files.

Just some stream-of-conscious thinking here, that I went through when making my functions:
* Pros/cons of overloading split to accept widths and padding params. The next step would be to output fixed-width string from a Table, with the same params. Join is already used for joining tables, so jointext might work? I didn’t want to modify existing functions so just added the splitting functionality to fromfwf.
* I still think the fromfwf function is useful for discovery, given fromcsv vs fromtext, etc, but if we had the above functions, then from/tofwf would just be a wrapper for fromtext().split(widths, padding) or jointext(widths, padding).totext(). I really like the idea of using the built-in text I/O, but the need to skip headers or footers and still keep the provided header is why I used my own reader.

Brad, you keep raising the point of using your own functions and relying on from/totext. Since I am probably newer to petl than everyone else here, my question is what criteria should be used to when deciding if something should just be handled on your own vs incorporated into petl. The contribution docs don’t seem to have any overall guideline for this, but the overall philosophy seems to be very Zen of Python-ic, which I really like. Obviously I see the benefit of a separate function for these type of files, since I work with them quite often, their formatting can be quite complex at times, and verbosity is preferred over complexity. I would also point out that pandas has a separate reader for them, as validation. But the beauty of petl is its simplicity and functionality, so you raise a good point.

I’ll submit fromfwf to petlx for now, either way, and maybe we can see if there’s any usage/call for additional functionality?

Brad Maggard

unread,
Aug 22, 2018, 10:07:26 AM8/22/18
to pytho...@googlegroups.com
I only have time to respond to this, atm:

what criteria should be used to when deciding if something should just be handled on your own vs incorporated into petl. The contribution docs don’t seem to have any overall guideline for this, but the overall philosophy seems to be very Zen of Python-ic, which I really like.

should lead you towards the package maintainers, who are better equipped to answer your question.




Brad Maggard

unread,
Aug 22, 2018, 2:41:30 PM8/22/18
to python-etl
Here's a full out/in example:  fixed_width_io.ipynb

I believe that your header/footer on output could be addressed with petl.totext(prologue='...', epilogue='...')

Joseph T. Bradley

unread,
Aug 22, 2018, 6:58:54 PM8/22/18
to pytho...@googlegroups.com
Any chance you could post only the code someplace? I don’t have/use Jupyter, and would really like to see just the code.

Thanks!
On Wed, Aug 22, 2018 at 1:41 PM Brad Maggard <magg...@gmail.com> wrote:
Here's a full out/in example:  fixed_width_io.ipynb

I believe that your header/footer on output could be addressed with petl.totext(prologue='...', epilogue='...')

--
You received this message because you are subscribed to a topic in the Google Groups "python-etl" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/python-etl/y81jlUx59TU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to python-etl+...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
--
Joseph T. Bradley

Brad Maggard

unread,
Aug 22, 2018, 7:55:49 PM8/22/18
to pytho...@googlegroups.com
You don't need Jupyter to see both the code and the rendered output on github, but OK.


# coding: utf-8

# In[1]:


from collections import OrderedDict


# In[2]:


# IRL, read from extrernal metadata / "single config"
field_widths = OrderedDict()
field_widths['year']    = 4
field_widths['month']   = 2
field_widths['day']     = 2


# In[3]:


import petl as etl


# In[4]:


date_parts_table = [['year', 'month', 'day'],
    ['18',   ' 5', ' 2'],
    ['2018', ' 5', ' 2'],
    ['2018', '05', '02']]


# In[5]:


etl.wrap(date_parts_table)


# In[6]:


# `width` could be more than an int, like an object specifying attributes, e.g. left/right alignment
def make_fixed_width_template(**widths):
    template = ''
    for field_name, width in widths.items():
        template += '{' + f'{field_name}:>{width}' + '}'
    template += '\n'
    return template 


# In[7]:


fixed_width_template = make_fixed_width_template(**field_widths)


# In[8]:


fixed_width_template


# In[9]:


from pathlib import Path


# In[10]:


date_parts_file = Path('date_parts_table.txt')


# In[11]:


etl.totext(date_parts_table, date_parts_file , template=fixed_width_template)


# In[12]:


print(date_parts_file.read_text())


# In[13]:


def fixed_width_field(start, end):
    return lambda r: r.lines[start:end]


# In[14]:


def make_fixed_width_fields(**widths):
    fields = OrderedDict()
    start, end = 0, 0
    for field_name, width in widths.items():
        end += width
        fields[field_name] = fixed_width_field(start, end)
        start = end
    return fields


# In[15]:


fixed_width_fields = make_fixed_width_fields(**field_widths)


# In[16]:


lines_table = etl.fromtext(date_parts_file, strip='\n')


# In[17]:


result_table = etl.fieldmap(lines_table, fixed_width_fields)


# In[18]:


result_table


You received this message because you are subscribed to the Google Groups "python-etl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-etl+...@googlegroups.com.

Alistair Miles

unread,
Aug 30, 2018, 5:06:49 PM8/30/18
to pytho...@googlegroups.com
Hi all,

Apologies for coming late to the party, just catching up after vacation.

FWIW I'd echo the sentiments that fromfwf() could live inside petl. Fixed width files are common, and so having some dedicated functionality at least to read would seem useful, even if it is just convenience.

I also like that the proposed API (widths and header arguments) is similar to pandas, so should be less to remember for people (like me) who often mix petl and pandas.

Re skiplast(), sounds useful to be able to omit the last N rows from a table, worth adding this functionality in some form to petl.

The suggestion to do this by supporting negative integers in head() is neat, and I like that it mirrors the unix head command. 

Do I also understand right that there is a requirement to omit the first N data rows from a table but keep the header row? 

Cheers,
Alistair

To unsubscribe from this group and all its topics, send an email to python-etl+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
--
Joseph T. Bradley

--
You received this message because you are subscribed to the Google Groups "python-etl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-etl+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "python-etl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-etl+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.


--
If I do not respond to an email within a few days, please feel free to resend your email and/or contact me by other means.

Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health
Big Data Institute
Li Ka Shing Centre for Health Information and Discovery
Old Road Campus
Headington
Oxford
OX3 7LF
United Kingdom

Joseph T Bradley

unread,
Aug 31, 2018, 9:20:09 PM8/31/18
to python-etl
Awesome! Yes, for the fixed width files I work with (fairly often), the header contains data that is not in the same format as the rest of the file, so I just provide a header, but want to skip the first line of the actual file.

I looked at adding support for negative values to head and tail, however both of those functions currently produce a table, whereas the functionality I have should produce an iterator. Perhaps it would make more sense to modify skip to both support negative values and to accept an include_header parameter which would apply skip only to the data rows? To skip both header and footer rows you would just chain skip functions.

Alistair Miles

unread,
Sep 3, 2018, 4:18:56 AM9/3/18
to pytho...@googlegroups.com
On Saturday, 1 September 2018, Joseph T Bradley <jtbr...@gmail.com> wrote:
Awesome! Yes, for the fixed width files I work with (fairly often), the header contains data that is not in the same format as the rest of the file, so I just provide a header, but want to skip the first line of the actual file.

I looked at adding support for negative values to head and tail, however both of those functions currently produce a table, whereas the functionality I have should produce an iterator. Perhaps it would make more sense to modify skip to both support negative values and to accept an include_header parameter which would apply skip only to the data rows? To skip both header and footer rows you would just chain skip functions.

That sounds sensible. So, e.g., for your use case, you would do something like...

tbl = etl.fromfwf('example.txt', header=['foo', 'bar'], widths=[7, 19]).skip(5, include_header=True).skip(-10)

...to skip the first 5 and last 10 data rows?

Joseph T Bradley

unread,
Sep 3, 2018, 1:09:26 PM9/3/18
to python-etl
Exactly.

Brad Maggard

unread,
Sep 3, 2018, 1:41:40 PM9/3/18
to pytho...@googlegroups.com
'fromfwf' is a bit obscure.  Also: how about using positional args for widths?  Here's my suggestion, borrowing 'fromtext's signature:

petl.io.text.fromfixedwidth(source=None, *widths, encoding=None, errors='strict', strip=None, header=None)




On Mon, Sep 3, 2018 at 1:09 PM Joseph T Bradley <jtbr...@gmail.com> wrote:
Exactly.

--
You received this message because you are subscribed to the Google Groups "python-etl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-etl+...@googlegroups.com.

Alistair Miles

unread,
Sep 3, 2018, 1:41:52 PM9/3/18
to pytho...@googlegroups.com
Cool. No objections to PR along those lines. 

Maybe worth considering skiprows() or skipdata() as convenient shorthand for skip(include_header=True) just to save some typing.

On Mon, 3 Sep 2018, 18:09 Joseph T Bradley, <jtbr...@gmail.com> wrote:
Exactly.

--
You received this message because you are subscribed to the Google Groups "python-etl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-etl+...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages