Merge CSV files based on Column Key from each file


Josh Crosby

Sep 6, 2013, 7:03:00 PM
to golan...@googlegroups.com
Hello all,
I'm new to Go and learning quite a bit.

I have a dilemma that I would normally solve with awk: joining two CSV files on a column and writing the result out to a consolidated file.

i.e. awk 'BEGIN {FS="\t"} {if(NR==FNR) {a[$1]=$2} else if ($4 in a) {print $0, a[$4]}}' network_map.txt c3_ct.txt > combined.txt
This script reads both files (network_map.txt and c3_ct.txt):
NR==FNR is true only while the first of the two files is still being read.
{a[$1]=$2} stores each key from the first file in the array a, with field 2 as the value.
$4 in a {print $0, a[$4]} checks whether the fourth column of the second file is a key present in the first. If it is, the entire row from file 2 is printed, followed by the value from file 1.

So here's the problem,
One file is 9 GB and the other is about 300 MB; doing this in awk would take forever.

I've seen some examples that read a CSV file with bufio, but they don't really help me with mapping the rows into structs (not sure if that's the right path), with joining or merging the files on a specific column from each file, or with how to pass the files in (os.Stdin).

Any guidance is greatly appreciated

Kyle Lemons

Sep 6, 2013, 8:12:58 PM
to Josh Crosby, golang-nuts
Are they sorted?  If so, your job is really easy (you only need one line from each file in memory).  If not, I would probably sort them first :).
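
A minimal sketch of that sorted case, assuming tab-separated files already sorted on their key columns, with the key in column 1 of the first file and column 4 of the second (file names borrowed from the awk example, error checks skipped):

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    f1, _ := os.Open("network_map.txt") // assumed sorted on column 1
    f2, _ := os.Open("c3_ct.txt")       // assumed sorted on column 4
    defer f1.Close()
    defer f2.Close()

    s1, s2 := bufio.NewScanner(f1), bufio.NewScanner(f2)
    ok1, ok2 := s1.Scan(), s2.Scan()
    for ok1 && ok2 {
        a := strings.Split(s1.Text(), "\t")
        b := strings.Split(s2.Text(), "\t")
        // Keys compare as strings, so both files should be sorted the
        // same way (e.g. LC_ALL=C sort).
        switch {
        case a[0] < b[3]: // file 1 is behind the join key: advance it
            ok1 = s1.Scan()
        case a[0] > b[3]: // file 2 is behind: advance it
            ok2 = s2.Scan()
        default: // keys match: print the file-2 row plus the file-1 value
            fmt.Printf("%s\t%s\n", s2.Text(), a[1])
            ok2 = s2.Scan() // keep the file-1 line; its key may repeat in file 2
        }
    }
}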

One slightly more scalable solution would be to scan the file one line at a time and build an index of {byteOffset, lineLength, columnKey} for each file, sort those, and then use them to rebuild the output file.
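
A sketch of that index idea for a single file; entry, buildOffsetIndex, and the ReadAt lookup are illustrative names, and tab-separated fields with Unix ('\n') line endings are assumed:

package main

import (
    "bufio"
    "fmt"
    "os"
    "sort"
    "strings"
)

// entry records where one line lives in its file, plus its key, so the
// line itself never has to stay in memory.
type entry struct {
    off int64  // byte offset of the start of the line
    n   int    // line length in bytes, excluding the newline
    key string // value of the key column
}

// buildOffsetIndex scans f once and returns an index sorted by key.
func buildOffsetIndex(f *os.File, keyCol int) []entry {
    var idx []entry
    var off int64
    s := bufio.NewScanner(f)
    for s.Scan() {
        ln := s.Text()
        idx = append(idx, entry{off, len(ln), strings.Split(ln, "\t")[keyCol]})
        off += int64(len(ln)) + 1 // +1 for the '\n' the scanner stripped
    }
    sort.Slice(idx, func(i, j int) bool { return idx[i].key < idx[j].key })
    return idx
}

func main() {
    f, _ := os.Open(os.Args[1])
    defer f.Close()
    idx := buildOffsetIndex(f, 0)
    // Re-read one indexed line with ReadAt to show the lookup; walking
    // two such sorted indexes in tandem (as in the merge join above)
    // rebuilds the joined output without holding either file in memory.
    buf := make([]byte, idx[0].n)
    f.ReadAt(buf, idx[0].off)
    fmt.Printf("%s -> %q\n", idx[0].key, buf)
}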



philippe...@gmail.com

Sep 6, 2013, 9:48:28 PM
to golan...@googlegroups.com
Skipping error checking for brevity (ok, maybe a bit of laziness too ;-))...
Using the same 'logic' as your awk program, you could do something like the following in Go:

(Please note that I am also a go beginner, so any comments or suggestions from the 'pros' are welcome, thx)

You can run it with: go run mergeCSV.go file1.csv file2.csv > merged.csv
(file1.csv being the large reference file used to build the index)

I hope this can be helpful

-------

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// buildIndex reads the reference file and maps column 1 to column 2,
// mirroring a[$1]=$2 in the awk version.
func buildIndex(filename1 string) map[string]string {
    index := make(map[string]string)

    file, _ := os.Open(filename1)
    scanner := bufio.NewScanner(file)
    defer file.Close()

    for scanner.Scan() {
        ln := scanner.Text()
        fields := strings.Split(ln, "\t")
        index[fields[0]] = fields[1]
    }

    return index
}

// doLookup streams the second file and, for each row whose fourth
// column is a known key, prints the row followed by the mapped value.
func doLookup(filename2 string, index map[string]string) {
    file, _ := os.Open(filename2)
    scanner := bufio.NewScanner(file)
    defer file.Close()

    for scanner.Scan() {
        ln := scanner.Text()
        fields := strings.Split(ln, "\t")
        key := fields[3]
        if foundValue, ok := index[key]; ok {
            fmt.Printf("%s\t%s\n", ln, foundValue)
        }
    }
}

func main() {
    index := buildIndex(os.Args[1])

    doLookup(os.Args[2], index)
}

Kyle Lemons

Sep 7, 2013, 1:57:14 AM
to Josh Crosby, golang-nuts
Actually, my roommate posed a better idea: as long as one of the files can fit in memory, load that file and store its lines in a map, then stream through the other file, doing the join and writing to disk. This is O(N+M), where N and M are the numbers of lines in the two files, and it is much nicer to the cache than my solution :).
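
That is essentially what the program posted above already does; one addition that helps when 9 GB streams through is buffering the output writes instead of printing each row directly. A drop-in variant of that program's doLookup, as a sketch (same imports, still no error checks):

func doLookupBuffered(filename2 string, index map[string]string) {
    file, _ := os.Open(filename2)
    defer file.Close()

    out := bufio.NewWriter(os.Stdout)
    defer out.Flush() // flush whatever is left in the buffer at the end

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        ln := scanner.Text()
        fields := strings.Split(ln, "\t")
        if foundValue, ok := index[fields[3]]; ok {
            // buffered write instead of a direct fmt.Printf per row
            fmt.Fprintf(out, "%s\t%s\n", ln, foundValue)
        }
    }
}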

Josh Crosby

Sep 7, 2013, 5:54:40 PM
to Kyle Lemons, golang-nuts
Thank you for the idea, Kyle (and roommate). I'll have to read up on how to do that.


Miki Tebeka

Sep 7, 2013, 11:36:44 PM
to golan...@googlegroups.com
> I have a dilemma that I would normally solve with awk: joining two CSV files on a column and writing the result out to a consolidated file.

Not a Go-specific solution, but you can load both files as tables in an SQLite database, join them in SQL, and export the result. IMO this should be fast and very little code to write (remember to index, though :)
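
A rough sketch of that route in Go, using the github.com/mattn/go-sqlite3 driver; the table and column names are invented for illustration, and error checks are skipped as in the program above:

package main

import (
    "bufio"
    "database/sql"
    "fmt"
    "os"
    "strings"

    _ "github.com/mattn/go-sqlite3" // SQLite driver
)

func main() {
    db, _ := sql.Open("sqlite3", "join.db")
    defer db.Close()

    db.Exec(`CREATE TABLE refmap (k TEXT, v TEXT)`)
    db.Exec(`CREATE TABLE data (c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT)`)

    load(db, os.Args[1], `INSERT INTO refmap VALUES (?, ?)`, 2)
    load(db, os.Args[2], `INSERT INTO data VALUES (?, ?, ?, ?)`, 4)

    // The index is what keeps the join from being a full table scan.
    db.Exec(`CREATE INDEX refmap_k ON refmap(k)`)

    rows, _ := db.Query(`SELECT d.c1, d.c2, d.c3, d.c4, r.v
                         FROM data d JOIN refmap r ON r.k = d.c4`)
    defer rows.Close()
    for rows.Next() {
        var c1, c2, c3, c4, v string
        rows.Scan(&c1, &c2, &c3, &c4, &v)
        fmt.Printf("%s\t%s\t%s\t%s\t%s\n", c1, c2, c3, c4, v)
    }
}

// load bulk-inserts the first n tab-separated fields of each line.
func load(db *sql.DB, filename, insert string, n int) {
    tx, _ := db.Begin() // one big transaction; per-row commits would crawl
    stmt, _ := tx.Prepare(insert)
    f, _ := os.Open(filename)
    defer f.Close()
    s := bufio.NewScanner(f)
    for s.Scan() {
        fields := strings.Split(s.Text(), "\t")
        args := make([]interface{}, n)
        for i := range args {
            args[i] = fields[i]
        }
        stmt.Exec(args...)
    }
    stmt.Close()
    tx.Commit()
}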

Silvan Jegen

Sep 8, 2013, 3:45:55 AM
to golan...@googlegroups.com, Josh Crosby

On Saturday, September 7, 2013 2:12:58 AM UTC+2, Kyle Lemons wrote:
> Are they sorted? If so, your job is really easy (you only need one line from each
> file in memory). If not, I would probably sort them first :).

If the files are sorted on the key column, you should be able to use the text utility 'join'
(man join) for this task. If you are interested in the learning experience, however, you
can just ignore this reply :-)
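
For the columns in the awk example, the invocation might look something like this (untested; join wants both inputs sorted on their join fields, and by default it prints the join field first rather than the whole file-2 row):

join -t "$(printf '\t')" -1 1 -2 4 network_map.txt c3_ct.txt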





