How would you replace a field in a CSV file?
Newsgroups: fa.haskell
From: Pete Kazmier <pete-expires-20060...@kazmier.com>
Date: Sun, 01 Oct 2006 18:55:40 UTC
Local: Sun, Oct 1 2006 2:55 pm
Subject: [Haskell-cafe] How would you replace a field in a CSV file?
The other day at work an opportunity arose where I was hoping to sneak
some Haskell into the pipeline of tools used to process call detail
records (CDRs).  In the telecommunications industry, CDRs are used for
billing.  Each CDR is a single line record of 30 comma-separated
values.  Each line is approximately 240 characters in length.  The
task at hand is to replace field number 10 if a new value can be found
in a hashmap which is keyed using the contents of the field.

My colleague was going to write a C program (that's all he knows), but
I whipped up a trivial Python program instead.  I was curious if a
Haskell version could be faster and more elegant, but I have not been
able to beat my Python version in either case.  So, I'm curious as to
how you would go about this task in Haskell.  The input files are
generally 300-400MB, and the hashmap will contain perhaps 20-30 items.

For those that know python, here is a very simple implementation that
happens to be very fast compared to my Haskell version and very short:

    for line in sys.stdin:
        fields = line.split(',')
        fields[9] = tgmap.get(fields[9], fields[9])
        print ",".join(fields),

For each line in standard input:

  - Splits the string on the comma: "field0,field1,...,field29" =>
    ["field0", "field1", ..., "field29"] to obtain a list of strings.

  - Looks up field9 as a key in tgmap; if the key does not exist, get()
    returns the supplied default, which is the original value.  I.e.,
    if it's not in the map, then don't replace the field.

  - Joins the list of fields with a comma to yield a string again
    which is printed out to standard output.  The join method on the
    string is a bit odd: ",".join([1,2,3]) => "1,2,3"
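
The loop above uses the Python 2 print statement. A minimal Python 3 sketch of the same three steps follows; the function name replace_field, its index parameter, and the length guard are illustrative assumptions, not part of the original script.

```python
def replace_field(lines, mapping, index=9):
    """Yield each line with the field at `index` swapped for its mapped value.

    Lines keep their trailing newline, exactly as they arrive from the input.
    Fields that are not keys in `mapping` are left unchanged.
    """
    for line in lines:
        fields = line.split(',')
        # Assumption: skip short lines instead of raising IndexError.
        if index < len(fields):
            fields[index] = mapping.get(fields[index], fields[index])
        yield ','.join(fields)

# Typical use as a filter (not executed here):
#   import sys
#   sys.stdout.writelines(replace_field(sys.stdin, tgmap))
```

Note that when the target field is the last one on the line, it still carries the newline, so it will not match a key unless the line is stripped first.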

Here is my first Haskell attempt:

import Data.ByteString.Lazy.Char8 as B hiding (map,foldr)
import Data.List (map)
import Data.Map as M hiding (map)

-- This is just a placeholder until I actually populate the map
tgmap = M.singleton (B.pack "Pete") (B.pack "Kazmier")

main = B.interact $ B.unlines . map doline . B.lines
    where doline    = B.join comma . mapIndex fixup . B.split ','
          fixup i s = if i==9 then M.findWithDefault s s tgmap else s
          comma     = B.pack ","

-- f is supplied the index of the current element being processed
mapIndex f xs = m f 0 xs
    where m f i [] = []
          m f i (x:xs') = f i x : m f (i+1) xs'

After talking with dons on #haskell, he cleaned my version up and
produced this version, which gets rid of the 'if' statement and makes
mapIndex stricter:

import Data.ByteString.Lazy.Char8 as B hiding (map,foldr)
import Data.List (map)
import Data.Map as M hiding (map)

-- This will be populated from a file
dict = M.singleton (B.pack "Pete") (B.pack "Kazmier")

main = B.interact $ B.unlines . map doline . B.lines
    where doline    = B.join comma . mapIndex fixup . B.split ','
          comma     = B.singleton ','
          fixup 9 s = M.findWithDefault s s dict
          fixup n s = s

-- f is supplied the index of the current element being processed
mapIndex :: (Int -> ByteString -> ByteString) -> [ByteString]
         -> [ByteString]
mapIndex f xs = m xs 0
    where m []      _ = []
          m (x:xs') i = f i x : (m xs' $! i+1)

That helped things a bit, but I must confess I don't understand how
the strictness improved things as I had assumed things were going to
be evaluated in a reasonable amount of time due to the printing of
output.  I thought IO was interlaced with the execution and thus I
wasn't going to have to concern myself over laziness.  In addition,
the function is able to generate new elements of the list on demand so
I thought it was a good citizen in the lazy world.  Could anyone help
explain?

And then he came up with another version to avoid the 'unlines', but
that did not really speed things up significantly.  So, with all
that said, is there a better approach to this problem?  Perhaps a more
elegant Haskell solution?

Thanks,
Pete

_______________________________________________
Haskell-Cafe mailing list
Haskell-C...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


 
Newsgroups: fa.haskell
From: tpled...@ihug.co.nz
Date: Sun, 01 Oct 2006 23:11:53 UTC
Local: Sun, Oct 1 2006 7:11 pm
Subject: [Haskell-cafe] How would you replace a field in a CSV file?
Hi Pete.

For such a small self-contained task, I don't think Haskell
is any better than Python.

Haskell would come into its own if you wanted some assurance
about type safety, and/or were taking on a task large enough
to warrant the use of records (and hence record update
notation).

Regards,
Tom


 
Newsgroups: fa.haskell
From: Pete Kazmier <pete-expires-20061...@kazmier.com>
Date: Mon, 02 Oct 2006 00:05:26 UTC
Local: Sun, Oct 1 2006 8:05 pm
Subject: [Haskell-cafe] Re: How would you replace a field in a CSV file?

tpled...@ihug.co.nz writes:
> For such a small self-contained task, I don't think Haskell
> is any better than Python.

I figured as much, but I thought with the new FPS lazy bytestrings it
might have a chance in terms of raw speed.  On the other side of the
coin, in terms of elegance, I thought I'd ask as haskellers always
amaze me with their one-liners :-)



 
Newsgroups: fa.haskell
From: Bulat Ziganshin <bulat.zigans...@gmail.com>
Date: Mon, 02 Oct 2006 10:02:24 UTC
Local: Mon, Oct 2 2006 6:02 am
Subject: Re: [Haskell-cafe] How would you replace a field in a CSV file?
Hello tpledger,

Monday, October 2, 2006, 3:11:29 AM, you wrote:

> For such a small self-contained task, I don't think Haskell
> is any better than Python.

I disagree. While it's hard to beat the Python version in number of
lines, a Haskell version may have the same length and better performance.

For this particular task, using a list as the intermediate data structure
makes the program both long and inefficient. If ByteString included
array-based splitting and joining, things would become much better:

main = B.interact $ B.unlines . map doline . B.lines
    where doline    = B.joinArray comma . mapElem 9 fixup . B.splitArray ','
          fixup s   = M.findWithDefault s s dict
          comma     = B.pack ","

mapElem n func arr = arr//[(n,func (arr!n))]

If mapElem, splitArray, and joinArray were library functions (I think
they are good candidates), this program would be no longer than the
Python one.
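
For comparison, the mapElem idea (update one indexed element, leave the rest alone) can be sketched in Python, where lists are already array-backed. The name map_elem is a hypothetical counterpart to the proposed Haskell function, not an existing library call.

```python
def map_elem(n, func, fields):
    """Return a copy of `fields` with `func` applied to element `n` only.

    Mirrors the Haskell expression  arr // [(n, func (arr ! n))]:
    the input sequence is not modified.
    """
    updated = list(fields)
    updated[n] = func(updated[n])
    return updated
```

With a mapping called tgmap, one line of the CSV task would then read `','.join(map_elem(9, lambda s: tgmap.get(s, s), line.split(',')))`.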

--
Best regards,
 Bulat                            mailto:Bulat.Zigans...@gmail.com



 
Newsgroups: fa.haskell
From: apfel...@quantentunnel.de
Date: Mon, 02 Oct 2006 16:46:04 UTC
Local: Mon, Oct 2 2006 12:46 pm
Subject: [Haskell-cafe] Re: How would you replace a field in a CSV file?

How about

import Data.ByteString.Lazy.Char8 as B hiding (map,foldr)
import Data.List (map)
import Data.Map as M hiding (map)

dict = M.singleton (B.pack "Pete") (B.pack "Kazmier")

main = B.interact $ B.unlines . map doline . B.lines
    where
    doline  = B.join comma . zipWith ($) fixup9 . B.split ','
    fixup9  = fixup 9
    fixup n = replicate n id
              ++ [\s -> M.findWithDefault s s dict] ++ repeat id

Note that fixup9 is shared intentionally across different invocations of
doline. The index n starts at 0.
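
The same trick, zipping the fields against a list of per-position functions, can be sketched in Python with itertools. The names fixup and doline follow the Haskell above; the mapping parameter stands in for dict and is an assumption of this sketch.

```python
from itertools import chain, repeat

def fixup(n, mapping):
    """Endless iterable of functions: identity everywhere except position n."""
    identity = lambda s: s
    lookup = lambda s: mapping.get(s, s)
    return chain(repeat(identity, n), [lookup], repeat(identity))

def doline(line, n, mapping):
    """Apply the n-th function to the n-th field; zip stops at the last field."""
    fields = line.split(',')
    return ','.join(f(s) for f, s in zip(fixup(n, mapping), fields))
```

Unlike the lazy Haskell list, a Python iterator is consumed as it is read, so this sketch rebuilds it for every line instead of sharing one fixup9 value.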

Also note that because (compare :: (Byte)String -> ..) takes time
proportional to the string length, the use of Map will inevitably
introduce a constant factor. But I'm still happy you didn't use arrays
or hash tables (urgh!) :)

In any case, tries are *the* natural data structure for (in)finite maps
in functional languages, see also

Ralf Hinze. Generalizing generalized tries. Journal of Functional
Programming, 10(4):327-351, July 2000
http://www.informatik.uni-bonn.de/~ralf/publications/GGTries.ps.gz
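
To make the idea concrete, here is a minimal sketch of a plain character trie in Python. It is not the generalized construction from the paper, only an illustration of a lookup that inspects each key character once; the class and method names are assumptions.

```python
class Trie:
    """A plain character trie mapping string keys to values."""

    def __init__(self):
        self.children = {}      # next character -> child node
        self.value = None
        self.has_value = False  # distinguishes "no entry" from a stored None

    def insert(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value = value
        node.has_value = True

    def get(self, key, default=None):
        """Walk the key one character at a time; stop at the first mismatch."""
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return default
        return node.value if node.has_value else default
```

A lookup that misses can stop at the first character with no matching child, without comparing against whole stored keys.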

Regards,
apfelmus



 
Newsgroups: fa.haskell
From: Grauenw...@gmail.com
Date: 2 Oct 2006 23:01:11 -0700
Local: Tues, Oct 3 2006 2:01 am
Subject: Re: How would you replace a field in a CSV file?
I noticed that the original python version was flawed. In CSV, commas
can be inside a quoted field. Therefore, a simple split isn't good
enough unless you are dead certain that fields are not quoted.

So for a general solution, you need to...

1. Break only on commas outside of quotes
2. Throw away the outer quotes around each field that has them
3. Turn any pairs of quotes inside a quoted string into a quote
4. On output, quote each field that needs it.
5. While processing, ensure the quotes are balanced and that each row
has the correct number of fields.
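
A rough Python sketch of steps 1 through 4, plus the quote-balance half of step 5, might look like this. It assumes one record per line (no newlines inside quoted fields) and leaves the field-count check to the caller; all function names are illustrative.

```python
def split_csv_line(line):
    """Split one record on commas outside quotes and unquote the fields."""
    fields, current, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    current.append('"')   # doubled quote -> one literal quote
                    i += 1
                else:
                    in_quotes = False     # closing quote
            else:
                current.append(ch)
        elif ch == '"':
            in_quotes = True              # opening quote is dropped
        elif ch == ',':
            fields.append(''.join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unbalanced quotes")
    fields.append(''.join(current))
    return fields

def quote_field(field):
    """Quote a field on output only if it contains a comma, quote, or newline."""
    if any(c in field for c in ',"\n'):
        return '"' + field.replace('"', '""') + '"'
    return field

def join_csv_line(fields):
    return ','.join(quote_field(f) for f in fields)
```

Checking `len(split_csv_line(line))` against the expected field count (30 for the CDRs in the original post) would cover the rest of step 5.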

Jonathan Allen

"A complete solution in BASIC beats a partial solution in any other
language."


 
Newsgroups: fa.haskell
From: "Masklinn" <maskl...@gmail.com>
Date: 3 Oct 2006 03:05:40 -0700
Local: Tues, Oct 3 2006 6:05 am
Subject: Re: How would you replace a field in a CSV file?

Grauenw...@gmail.com wrote:
> I noticed that the original python version was flawed. In CSV, commas
> can be inside a quoted field. Therefore, a simple split isn't good
> enough unless you are dead certain that fields are not quoted.

Which is why Python happens to have a csv module as part of its
standard library since 2.3.

I guess Pete didn't realize that there was one.

Makes his script a line shorter too (since the line.split() is not
required anymore)
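
A minimal Python 3 sketch of the field replacement using the standard csv module; the function name, the index parameter, and the choice of '\n' as line terminator are assumptions of this example.

```python
import csv

def replace_field_csv(infile, outfile, mapping, index=9):
    """Copy CSV rows from infile to outfile, replacing one field via mapping."""
    reader = csv.reader(infile)
    # The writer quotes only fields that need it (the default QUOTE_MINIMAL).
    writer = csv.writer(outfile, lineterminator='\n')
    for row in reader:
        if index < len(row):
            row[index] = mapping.get(row[index], row[index])
        writer.writerow(row)

# Typical use as a filter (not executed here):
#   import sys
#   replace_field_csv(sys.stdin, sys.stdout, tgmap)
```

The reader handles quoted commas and doubled quotes, at the cost of more work per line than a bare split.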


 
Newsgroups: fa.haskell
From: John Goerzen <jgoer...@complete.org>
Date: Tue, 10 Oct 2006 03:13:42 UTC
Local: Mon, Oct 9 2006 11:13 pm
Subject: [Haskell-cafe] Re: How would you replace a field in a CSV file?
On 2006-10-01, Pete Kazmier <pete-expires-20060...@kazmier.com> wrote:

> For those that know python, here is a very simple implementation that
> happens to be very fast compared to my Haskell version and very short:

>     for line in sys.stdin:
>         fields = line.split(',')

Of course, this doesn't handle quoted values that contain commas, which
are among the many joys* of parsing CSV files.

I might just point out the existence of MissingH.Str.CSV to you, which
can read and write CSV files.  See
http://gopher.quux.org:70/devel/missingh/html/MissingH-Str-CSV.html for
details.

A one-line call gives you a [[String]].

This uses Parsec, so it will not be a top-notch performer, but it is
elegant ;-)

* I mean "joy" sarcastically, in case it wasn't obvious.



 