[Haskell-cafe] Parse CSV / TSV file in Haskell - Unicode Characters

26 views

Skip to first unread message

Volker Strobel

unread,

Oct 23, 2014, 9:35:59 PM10/23/14

to haskel...@haskell.org

Hi,

I'm trying to parse a tab-delimited file using cassava/Data.Csv in Haskell. However, I get problems if there are "strange" (Unicode) characters in my CSV file. I'll get a parse error (endOfInput) then.

According to the command-line tool "file", my file has a "UTF-8 Unicode text" decoding. My Haskell code looks like this:

{-# LANGUAGE ScopedTypeVariables #-}

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as C

import qualified System.IO.UTF8 as U

import qualified Data.ByteString.UTF8 as UB

import qualified Data.ByteString.Lazy.Char8 as DL

import qualified Codec.Binary.UTF8.String as US

import qualified Data.Text.Lazy.Encoding as EL

import qualified Data.ByteString.Lazy as L

import Data.Text.Encoding as E

-- Handle CSV / TSV files with ...

import Data.Csv

import qualified Data.Vector as V

import Data.Char -- ord

csvFile :: FilePath

csvFile = "myFile.txt"

-- Set delimiter to \t (tabulator)

myOptions = defaultDecodeOptions {

decDelimiter = fromIntegral (ord '\t')

}

main :: IO ()

main = do

csvData <- L.readFile csvFile

case EL.decodeUtf8' csvData of

Left err -> print err

Right dat ->

case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of

Left err -> putStrLn err

Right v -> V.forM_ v $ \ (category :: String ,

user :: String ,

date :: String,

time :: String,

message :: String) -> do

print message

I tried using decodingUtf8', preprocessing (filtering) the input with predicates from Data.Char, and much more. However the endOfFile error persists.

My CSV-file looks like this:

a - - - RT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel

a - - - Uhm .. wat dan ook ????!!!! 👋

Or more literally:

a\t-\t-\t-\tRT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel

a\t-\t-\t-\tUhm .. wat dan ook ????!!!! 👋

The problem chars are the 👋 and • (and in my complete file, there are many more of similar characters). What can I do, so that cassava / Data.Csv can read my file properly?

I've also posted this question at StackOverflow a few days ago: http://stackoverflow.com/questions/26499831/parse-csv-tsv-file-in-haskell-unicode-characters

Best,

Volker

Johan Tibell

unread,

Oct 24, 2014, 1:59:38 AM10/24/14

to Volker Strobel, haskell-cafe

I just replied to your question on SO.

_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply all

Reply to author

Forward

0 new messages