Parsing fixed-layout, UTF-8 encoded files. Can we use binary pattern matching?


Mattias Gyllsdorff

Jul 31, 2015, 5:14:45 PM
to elixir-lang-talk
I often have to parse files/binaries with fixed-width parts. The files generally consist of parts where the first two characters identify the layout of the following table. The content itself is always text, generally encoded with a Windows codepage or ISO-8859-X, but some files are encoded using UTF-8.

Now, as long as they use a one byte -> one character encoding, I can use binary pattern matches to parse the content and then convert the data to UTF-8 strings, but that doesn't seem to work when the content starts out as UTF-8. I know we can match a single UTF-8 codepoint using the utf8 type, but is there any way to match several UTF-8 characters as a single string?
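
One workaround in plain pattern matching, since the utf8 modifier only takes one codepoint at a time: walk the codepoints recursively, count the bytes they occupy, and split once at the byte boundary. A minimal sketch (the Utf8Take module name is made up):

    defmodule Utf8Take do
      # Take `n` UTF-8 characters off the front of `bin`, returning
      # {chars, rest}: walk one codepoint at a time, counting the bytes
      # it occupies, then split once at the byte boundary.
      def take_utf8(bin, n), do: take_utf8(bin, n, 0)

      defp take_utf8(bin, 0, bytes) do
        <<chars::binary-size(bytes), rest::binary>> = bin
        {chars, rest}
      end

      defp take_utf8(bin, n, bytes) do
        <<_::binary-size(bytes), c::utf8, _::binary>> = bin
        take_utf8(bin, n - 1, bytes + byte_size(<<c::utf8>>))
      end
    end

    Utf8Take.take_utf8("josé rest", 4)
    #=> {"josé", " rest"}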

I would prefer to use Elixir's/Erlang's binary pattern matching, but since some of my data is UTF-8 and we want a single parsing standard, we haven't found any way to parse the files like that.

Right now I re-encode the content to UTF-8 and use StringIO.open (for smaller files) or File.open (for the large files) to parse the file like this, but I suspect this is not the best way to do it:
  @spec read_part(number, {pid, %Root{}}) :: %Root{}
  defp read_part(20 = code, io) do
    {%Payment{}, io}
    |> put(10,  :sender_bankgiro, cast: :integer)
    |> put(25,  :reference, trim: [])
    |> put(18,  :value, cast: :decimal)
    |> put(1,   :reference_code, cast: :integer)
    |> drop(1,  :dev_code, should_match: "P")
    |> put(1,   :payment_channel, cast: :integer)
    |> put(12,  :sequence, cast: :integer)
    |> put(1,   :have_image, cast: :boolean)
    |> drop(10, :blank)
    |> done(code, fn payment, root -> %{root | payments: [payment | root.payments]} end)
  end

  @spec start(pid) :: %Root{}
  def start(io) do
    part_code = read(io, 2, cast: :integer)
    read_part(part_code, {io, %Root{}})
  end

  @spec done({%{}, {pid, %Root{}}}, number, (%{}, %Root{} -> %Root{})) :: %Root{}
  defp done({part, {_, root} = io}, part_code, put_data) do
    root = put_data.(part, root)
    Logger.debug("Done with part #{part_code}")

    case read(io, 1) do
      :eof -> root
      @end_of_part_marker -> read_part(read(io, 2, cast: :integer), io)
    end
  end


I am away from my computer, so I wrote that from memory; it might not be valid code. :/
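
For readers following the pipeline above: put/4, drop/3, read/3 and done/3 are Mattias's own helpers and are not shown in the thread. A minimal sketch of what a character-based put/4 could look like, assuming the device is already in UTF-8 mode so IO.read counts characters (:eof handling and the cast:/trim: options are left out, and the whole shape is hypothetical):

    # Hypothetical helper, not the actual implementation: read `width`
    # characters from the device and store them under `key` in the struct.
    defp put({struct, {device, _root} = io}, width, key, _opts \\ []) do
      value = IO.read(device, width)
      {Map.put(struct, key, value), io}
    end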

José Valim

Aug 1, 2015, 2:33:26 AM
to elixir-l...@googlegroups.com
There are multiple questions here and, because the answer to which approach is better depends on many factors like code readability and performance, I will just provide some bullet points for you to work with.

1. Elixir's binary syntax allows you to also match on individual codepoints:

iex(1)> <<j::utf8, o::utf8, s::utf8, e::utf8>> = "josé"
"josé"
iex(2)> [j, o, s, e]
[106, 111, 115, 233]

2. Unfortunately you cannot pass a size for the utf8/utf16/utf32 modifiers. However, you can use String.slice/3 or String.split_at/2 to get exactly what you want.
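
For example, String.split_at/2 on a fixed 10-character column (the sample data is made up):

iex(3)> {column, rest} = String.split_at("josé aaaaa" <> "remainder", 10)
{"josé aaaaa", "remainder"}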

3. The functions in the IO module can also work with codepoints, as long as the IO device is in UTF-8 mode. StringIO is always in UTF-8; for file devices, you need to pass the encoding as an option when you open the file. Use :io.getopts(device) and :io.setopts(device) to read and change those settings.
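
For instance (the file name is made up):

    # StringIO devices always count in characters:
    {:ok, device} = StringIO.open("josé and more")
    IO.read(device, 4)
    #=> "josé"

    # File devices need the encoding passed at open time:
    {:ok, file} = File.open("data.txt", [:read, {:encoding, :utf8}])
    IO.read(file, 10)  # 10 characters, not 10 bytes
    :io.getopts(file)  # inspect the device's current settings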

4. Typically, loading the whole file into memory and doing the operations in memory is the fastest approach. It is often the cleanest approach too. So unless you have large files (or a small memory budget), that would be my vote. You can quickly estimate it based on the machine resources too: suppose each file takes 1MB and you have 1GB of memory... are you processing more than a thousand of those at a time?

I hope this helps.

José Valim
Skype: jv.ptec
Founder and Director of R&D


Mattias Gyllsdorff

Aug 1, 2015, 6:09:43 AM
to elixir-lang-talk, jose....@plataformatec.com.br
Thank you for your help.

1: So binary pattern matching is generally not suited for content that is only text (unless it uses a single-byte encoding), but it is perfect for mixed data like something BER-encoded?

2: Let's say I have a 10 MB string in memory where the columns are always 10 characters long and I consume it by repeatedly doing
{column, rest} = String.split_at(rest, 10)
Would the runtime have to create a new copy of the rest string each time, or would it just use an offset into the original string?

I think I will have to create a benchmark and test it myself. 
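
A minimal shape for such a micro-benchmark, using :timer.tc (the sizes and iteration count are picked arbitrarily):

    # Build a ~10 MB string of 10-character columns, then time how long
    # it takes to peel off 100_000 columns with String.split_at/2.
    input = String.duplicate("0123456789", 1_000_000)

    {usecs, _rest} =
      :timer.tc(fn ->
        Enum.reduce(1..100_000, input, fn _, rest ->
          {_column, rest} = String.split_at(rest, 10)
          rest
        end)
      end)

    IO.puts("100_000 splits took #{usecs} µs")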

4: The files are mostly around 2 MB, but some of them are over 1 GB; I know my coworker received a few 40 GB files recently.

Right now I use the StringIO/FileIO approach since I don't know if the file will fit in memory; this way I can always read the data using IO.read without having to think about where/how the content is stored.

José Valim

Aug 1, 2015, 6:30:52 AM
to Mattias Gyllsdorff, elixir-lang-talk
1: So binary pattern matching is generally not suited for content that is only text (unless it uses a single-byte encoding), but it is perfect for mixed data like something BER-encoded?

Well, it depends. The issue is that your binary stream gives the length in characters, and that is variable (and honestly, a bit weird, because you can no longer split things based on arithmetic operations; you need to traverse the whole thing!). If it gave the length in bytes, it would work just fine.
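
To make the bytes-versus-characters point concrete, a small made-up example: when the width prefix counts bytes, plain binary-size matching works even for UTF-8 content:

    name = "josé"  # 4 characters, but 5 bytes in UTF-8
    payload = <<byte_size(name)::16, name::binary, "tail">>

    <<len::16, field::binary-size(len), rest::binary>> = payload
    #=> field = "josé", rest = "tail"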

2: Let's say I have a 10 MB string in memory where the columns are always 10 characters long and I consume it by repeatedly doing String.split_at

String.slice and String.split_at will give you references into the larger binary. String.split_at could be improved though; I will push improvements to master.
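
The sharing can be observed from the shell with Erlang's :binary module (sizes arbitrary; the exact result depends on the Elixir version):

    big = String.duplicate("a", 10_000)
    {_column, rest} = String.split_at(big, 10)

    byte_size(rest)                     #=> 9990
    :binary.referenced_byte_size(rest)  #=> typically 10000: still a
                                        #   reference into `big`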

Right now I use the StringIO/FileIO approach since I don't know if the file will fit in memory; this way I can always read the data using IO.read without having to think about where/how the content is stored.

That sounds like the way to go. StringIO will always have the whole thing in memory though, and it will add a lot of overhead as it uses an extra process. If that becomes a problem, you may want to keep it FileIO-based and use String.slice for the in-memory ones.
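
A sketch of that split, choosing a strategy by file size (the threshold and the open_source name are made up):

    # Hypothetical dispatcher: small files are loaded whole and sliced
    # in memory; big files stay behind a UTF-8 file device.
    def open_source(path, threshold \\ 50_000_000) do
      %File.Stat{size: size} = File.stat!(path)

      if size < threshold do
        {:in_memory, File.read!(path)}
      else
        {:device, File.open!(path, [:read, {:encoding, :utf8}])}
      end
    end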




José Valim
Skype: jv.ptec
Founder and Director of R&D

Ed W

Aug 3, 2015, 8:55:01 AM
to elixir-l...@googlegroups.com
On 31/07/2015 22:14, Mattias Gyllsdorff wrote:
I often have to parse files/binaries with fixed-width parts. The files generally consist of parts where the first two characters identify the layout of the following table. The content itself is always text, generally encoded with a Windows codepage or ISO-8859-X, but some files are encoded using UTF-8.

Now, as long as they use a one byte -> one character encoding, I can use binary pattern matches to parse the content and then convert the data to UTF-8 strings, but that doesn't seem to work when the content starts out as UTF-8. I know we can match a single UTF-8 codepoint using the utf8 type, but is there any way to match several UTF-8 characters as a single string?

I would prefer to use Elixir's/Erlang's binary pattern matching, but since some of my data is UTF-8 and we want a single parsing standard, we haven't found any way to parse the files like that.


I have a similar problem parsing some fixed-length files (TAP files used in the telecoms industry). I found binary pattern matching to be vastly faster than the other options, but as you say, there are many limits on the matching options.

So I have lots of functions named "parse", one of which will match and nibble something from the input (technically every record is 160 octets in my problem, but we don't make use of that).

Example function:
    https://gist.github.com/ewildgoose/9793fff12e4092e75383
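
For readers who don't follow the link: the general shape of such clauses is a set of function heads, each matching one record type and recursing on the rest. A made-up illustration, not the contents of the gist:

    # Hypothetical record layout: a 2-byte type tag, two fixed-width
    # byte fields, then the next record.
    defp parse(<<"10", sender::binary-size(10), reference::binary-size(25),
                 rest::binary>>, acc) do
      parse(rest, [%{type: "10", sender: sender, reference: reference} | acc])
    end

    defp parse(<<>>, acc), do: Enum.reverse(acc)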

The reason for the repetitive formulation is that I want to rewrite this using a macro to generate the code from a specification, something more like how Ecto defines a table format. The macros aren't done, but here are some things I observed in the research process:

- It seems to be the same speed to use nested pattern matches as flat patterns.
- So, for example, you could use a macro to generate the repetitive <<j::utf8, o::utf8, s::utf8, e::utf8>> piece to match on.
- It then seems feasible to compose the whole match using further macros:
    << <<j::utf8, o::utf8, s::utf8, e::utf8>>, .... >>

- So in your case you could take the pain away from matching long UTF-8 strings by using a macro to compose the function heads? (You would end up with a lot of extra variables, which you would glue back together to get the desired string; a rough sketch follows.)
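
An untested sketch of how such a macro might compose the repetitive utf8 segments and glue the variables back together (the FixedUtf8 module and take/2 are hypothetical, and n must be a compile-time integer):

    defmodule FixedUtf8 do
      # Expands to: <<c1::utf8, ..., cN::utf8, rest::binary>> = input
      # and returns {string_of_n_chars, rest}.
      defmacro take(input, n) do
        vars = Enum.map(1..n, fn i -> Macro.var(:"c#{i}", __MODULE__) end)
        char_segments = Enum.map(vars, fn v -> quote(do: unquote(v) :: utf8) end)
        pattern = {:<<>>, [], char_segments ++ [quote(do: rest :: binary)]}
        rebuilt = {:<<>>, [], char_segments}

        quote do
          unquote(pattern) = unquote(input)
          {unquote(rebuilt), rest}
        end
      end
    end

    # Usage:
    # require FixedUtf8
    # {name, rest} = FixedUtf8.take(line, 20)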


I guess it would be nice to raise a feature request upstream to be able to do something more like:
 << name::utf8-size(20) >>
Is this likely to be accepted?

Ed W
