Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Merge/append CSV files with different headers

609 views
Skip to first unread message

Vincent Davis

unread,
Mar 24, 2014, 1:50:07 PM3/24/14
to pytho...@python.org
I have several csv file I need to append (vertically). They have different but overlapping headers. For example;
file1 headers ['a', 'b', 'c']
file2 headers ['d', 'e']
file3 headers ['c', 'd']

Is there a better way than this
import csv
def merge_csv(fileList, newFileName):
    allHeaders = set([])
    for afile in fileList:
        with open(afile, 'rb') as csvfilesin:
            eachheader = csv.reader(csvfilesin, delimiter=',').next()
            allHeaders.update(eachheader)
    print(allHeaders)
    with open(newFileName, 'wb') as csvfileout:
        outfile = csv.DictWriter(csvfileout, allHeaders)
        outfile.writeheader()
        for afile in fileList:
            print('***'+afile)
            with open(afile, 'rb') as csvfilesin:
                rows = csv.DictReader(csvfilesin, delimiter=',')
                for r in rows:
                    print(allHeaders.issuperset(r.keys()))
                    outfile.writerow(r)

Vincent Davis

Mark Lawrence

unread,
Mar 24, 2014, 2:24:10 PM3/24/14
to pytho...@python.org
I haven't looked too hard at your code but guess you could simplify it
by using the fieldnames parameter to the DictReader class or making use
of the Sniffer class. See
http://docs.python.org/3/library/csv.html#module-csv


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

---
This email is free from viruses and malware because avast! Antivirus protection is active.
http://www.avast.com


Chris Angelico

unread,
Mar 24, 2014, 3:44:39 PM3/24/14
to pytho...@python.org
On Tue, Mar 25, 2014 at 4:50 AM, Vincent Davis <vin...@vincentdavis.net> wrote:
> I have several csv file I need to append (vertically). They have different
> but overlapping headers. For example;
> file1 headers ['a', 'b', 'c']
> file2 headers ['d', 'e']
> file3 headers ['c', 'd']
>
> Is there a better way than this

Summary of your code:

1) Build up a set of all headers used, by opening each file and
reading the headers.
2) Go through each file a second time and write them out.

That seems like the best approach, broadly. You might be able to
improve it a bit (it might be tidier to open each file once, but since
you're using two different CSV readers, it'd probably not be), but by
and large, I'd say you have the right technique. Your processing time
here is going to be dominated by the actual work of copying.

The only thing you might want to consider is order. The headers all
have a set order to them, and it'd make sense to have the output come
out as ['a', 'b', 'c', 'd', 'e'] - the first three from the first
file, then adding in everything from subsequent files in the order
they were found. Could be done easily enough by using 'in' and
.append() on a list, rather than using a set. But if that doesn't
matter to you, or if something simple like "sort the headers
alphabetically" will do, then I think you basically have what you
want.

ChrisA

Vincent Davis

unread,
Mar 24, 2014, 5:20:34 PM3/24/14
to pytho...@python.org
Thanks for the feedback.

Vincent Davis


0 new messages