[erlang-questions] Strings and Text Processing

Steve Davis

unread,

Dec 29, 2012, 9:08:56 AM12/29/12

to Erlang Questions

Disclaimer :-) All the below is prefixed by a big IMHO

Erlang has been correctly criticized for the difficulty of handling "strings".

There are two reasons for this (fundamental decisions that were taken way-back-when):
1) "strings" are "just lists of integers"
2) "strings" are by default latin-1 representations

This introduces major inconveniences, some of which are not resolvable:
When faced with any list during pattern matching, it is not at all easy to determine whether that list is a "string".
Further, since strings are "only" a subset of the set of lists of integers, it can be impossible to determine programmatically whether the list is a list of integers or is meant to represent a string. Determining whether a particular list even qualifies as a string in a program requires non-trivial processing of the entire list.

It's rather unfortunate that Erlang has earned this reputation, since the truth is that Erlang is truly excellent at text processing. However, to benefit from this excellence, you need to do two things:
1) Represent and process text as binaries.
2) Assume that the text binary is UTF-8 encoded, unless otherwise stated (meaning, e.g. #text{encoding = cstring, value = <<116,101,120,116,0>>}).

Suddenly, thanks to binary syntax and pattern matching, processing text in your programs becomes deterministic and easy. (Note that part of the reason for this is that binaries are "expected" to be opaque, whereas general list processing is fundamental to writing any program in Erlang).

There's a couple of minor drawbacks, both of which are the result of the initial decisions about "strings":
1) The code is littered with additional angle brackets <<"string">> (annoying, but definitely worth the inconvenience)
2) The standard Erlang/OTP library functions require textual arguments as lists (requiring overuse of binary_to_list)

And there are further benefits:
1) Parsing/transcoding different charset encodings is far more straightforward
2) Internationalization/localization is far more straightfoward

I wonder if, had the current binary pattern matching/comprehensions been available "way-back-when", whether the decision about "string" representation in Erlang may have been different. (i.e. <<116,101,120,116>> = "text").

Finally, here's my two questions:
1) Is there any benefit at all to the "list representation" of strings above binary text?
2) If not, I wonder if there's any way to change our minds about "strings" as we enter 2013?

regs,
/s

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Dmitry Kolesnikov

unread,

Dec 29, 2012, 9:20:44 AM12/29/12

to Steve Davis, Erlang Questions

Hello Steve,

You have raised a good point here.
One more reason for binary is memory consumption and IPC overhead.
On another hand list allows to represent a code point per element.
iolists are also very handy to dynamically compose a complex strings.

I am afraid that this is an application specific questions… However, I tend to use binary for strings...

- Dmitry

Masklinn

unread,

Dec 29, 2012, 9:34:18 AM12/29/12

to Erlang Questions

On 2012-12-29, at 15:08 , Steve Davis wrote:
> Erlang is truly excellent at text processing. However, to benefit from this excellence, you need to do two things:
> 1) Represent and process text as binaries.
> 2) Assume that the text binary is UTF-8 encoded, unless otherwise stated (meaning, e.g. #text{encoding = cstring, value = <<116,101,120,116,0>>}).

I don't think the former and the latter match. Erlang/OTP can be nice at
string processing where "string" is understood as "sequence of bytes",
but it remains rather ungood at *text* processing: *as far as I know*,
aside from encoding and decoding UTFs it has very limited support for
it[0]: no support (note: by "support" I mean "support built into the
core distribution", it's always possible to call into ICU) for
UnicodeData queries (codepoint meta-information), unicode case folding,
grapheme cluster handling, the important text-processing annexes (UAX 14
"line breaking algorithm", UAX 15 "normalization forms", UAX 29 "text
segmentation") or standards (UTS 10 "collation algorithm" and UTS 18
"regular expressions" as well — for other parts of the system but also
part of unicode itself — UTS 35 "LDML" and the its data-formatting and
data-parsing components), …

In fact Dmitry's email demonstrates it rather well when he notes that

> On another hand list allows to represent a code point per element.

this can provide interesting properties (or not) but it's pretty much
irrelevant when it comes to text processing, it's just an implementation
detail.

[0] http://www.erlang.org/doc/apps/stdlib/unicode_usage.html

Joe Armstrong

unread,

Dec 29, 2012, 10:30:37 AM12/29/12

to Dmitry Kolesnikov, Steve Davis, Erlang Questions

On Sat, Dec 29, 2012 at 3:20 PM, Dmitry Kolesnikov <dmkole...@gmail.com> wrote:

Hello Steve,

You have raised a good point here.
One more reason for binary is memory consumption and IPC overhead.

The point about memory consumption is raised *many* times - on a modern

machine this is not a problem.

Example: I am working on a text file of 84KB - in a 32 bit Erlang we use

8 bytes/character - so I use 0.6 MB - I have 4GB memory - so I use 0.015% of

memory - ie no problem.

My strategy is to keep large strings as binaries when I'm not working on them,

turn them into lists in order to work on them, and turn them back into binaries

when I'm done. Just because a string starts off in a binary does not mean

that it has to stay as a binary as you work on it.

Imagine I have a lot of text files, say each of 50KB, I can store 20 per/MB or

20,000 files per GB. Assume I have a quad core. I can only work on four things

at the same time - so having (say) 20,000 files (at 50K) and work on four of them

(unpacked) at a time is another 1.6 Meg.

Gigabyte memories mean (among other things) what saving the odd byte here are there is hardly relevant.

On another hand list allows to represent a code point per element.

yes - the convenience of having one character per list element far outweighs

the space saving of storing strings in binaries

iolists are also very handy to dynamically compose a complex strings.

I am afraid that this is an application specific questions… However, I tend to use binary for strings...

My strings change form depending on what I'm doing. Sometimes they are

binaries, sometimes lists, sometimes trees, ...

Cheers

/Joe

Thomas Lindgren

unread,

Dec 29, 2012, 11:15:28 AM12/29/12

to Steve Davis, Erlang Questions

>________________________________
> From: Steve Davis <steven.cha...@gmail.com>
...

>Finally, here's my two questions:
>1) Is there any benefit at all to the "list representation" of strings above binary text?
>2) If not, I wonder if there's any way to change our minds about "strings" as we enter 2013?

1) Don't forget the good old I/O-list, which some might consider the "real" string representation.

2) The groundswell seems to be to use binaries instead of lists, even if the syntax is a bit of a pain. So keep building and backporting support for that, IMO.

Alternatively, it might be worth considering a higher-level datastructure that takes encoding and such into account too. Common Lisp took the route of making characters a separate, opaque datatype if memory serves. Strings as builtin CL-style "compact arrays of characters" (suitably updated to handle unicode!) could perhaps replace the use of binaries.

Best,

Thomas

Dmitry Kolesnikov

unread,

Dec 29, 2012, 11:21:56 AM12/29/12

to Joe Armstrong, Steve Davis, Erlang Questions

Hello Joe,

I do re-call you strategy "keep large strings as binaries when I'm not working on them". I was trying to reference it in my previous email :-) but did not find a link to it.

One thing, what I am not fully agree with you is a statement about memory on modern machines…

Let me give you a simple example about EC2 cloud:

* M1 Small Instance (Default) 1.7 GiB of memory

* M1 Medium Instance 3.75 GiB of memory

* M1 Large Instance 7.5 GiB of memory

So I am using EC2 to run my service for consumers, I have to keep 4KB of json objects per registered user.

Using lists, "Each character consumes 8 bytes of memory on a 32 bit machine…". 4KB data blows up 8x times.

Taking into assumption that active users are kept in memory, each node handles

* M1 Small Instance (Default) 1.7 GiB of memory --> 53K active users

* M1 Medium Instance 3.75 GiB of memory --> 117K active users

* M1 Large Instance 7.5 GiB of memory --> 234K active users

Using binaries same metrics could be improved 8x times.

Note: the following calculation is artificial. It just demonstrates that memory metrics is really depends on use-case.

BTW, memory utilisation was a reason why I moved from mochijson to jsx, it parses json as binary. The drop in performance was acceptable but memory consumption was improved 2-3 times...

All in-all, I do agree that string are changes during processing, use-case, etc.

- Dmitry

Masklinn

unread,

Dec 29, 2012, 11:26:59 AM12/29/12

to Erlang Questions

On 2012-12-29, at 17:15 , Thomas Lindgren wrote:
> Alternatively, it might be worth considering a higher-level datastructure that takes encoding and such into account too. Common Lisp took the route of making characters a separate, opaque datatype if memory serves. Strings as builtin CL-style "compact arrays of characters" (suitably updated to handle unicode!) could perhaps replace the use of binaries.
>

My experience is that a "character" datatype has very limited use for
text processing, and tends to drive users (of the API) towards the wrong
patterns, especially when trying to build a good unicode-based
text-processing API: whatever you pick for a "character" (usually a
byte, a code unit or a unicode codepoint) will be the wrong thing more
often than not at a higher level.

On the other hand, defining erlang strings as iolists (or an opaque
datatype implemented through iolists) could work nicely. And it'd be
backwards-compatible: existing strings are already valid iolists
(although raw binaries are not and have to be wrapped in a list). It
would also go a long way towards fixing issues such as
http://prog21.dadgum.com/70.html:

> Ideally filenames would be IO lists, but for compatibility reasons
> there's still the need to support atoms in filenames. That brings up an
> interesting idea: why not allow atoms as part of the general IO list
> specification?
> […]
> I find I'm often calling atom_to_list before sending data to external
> ports, and that would no longer be necessary.

Thomas Lindgren

unread,

Dec 29, 2012, 11:59:08 AM12/29/12

to Masklinn, Erlang Questions

----- Original Message -----
> From: Masklinn <mask...@masklinn.net>
...
>
> I don't think the former and the latter match. Erlang/OTP can be nice at
> string processing where "string" is understood as "sequence of
> bytes",
> but it remains rather ungood at *text* processing: *as far as I know*,
> aside from encoding and decoding UTFs it has very limited support for
> it[0]: no support (note: by "support" I mean "support built into
> the
> core distribution", it's always possible to call into ICU) for
> UnicodeData queries (codepoint meta-information), unicode case folding,
> grapheme cluster handling, the important text-processing annexes (UAX 14
> "line breaking algorithm", UAX 15 "normalization forms", UAX
> 29 "text
> segmentation") or standards (UTS 10 "collation algorithm" and UTS
> 18
> "regular expressions" as well — for other parts of the system but also
> part of unicode itself — UTS 35 "LDML" and the its data-formatting and
> data-parsing components), …

Good point. A strong erlang unicode library implementing the above would be very nice.

(I'm not a great fan of drivers myself.)

Best regards to Arcturus,
Thomas

Masklinn

unread,

Dec 29, 2012, 5:41:06 PM12/29/12

to Erlang Questions

On 2012-12-29, at 17:59 , Thomas Lindgren wrote:
> Good point. A strong erlang unicode library implementing the above would be very nice.

If anybody wants to give it a shot (beware, it's not for the faint of
heart) I would strongly recommend reading Daniel Ehrenberg's
unicode-tagged posts[0], especially his 6 "Unicode implementer's guide"
(but the rest as well, the whole tag is a goldmine) before diving into
the Unicode Standard itself.

[0] http://useless-factor.blogspot.fr/search/label/unicode

Matthew Evans

unread,

Dec 31, 2012, 3:11:01 PM12/31/12

to mask...@masklinn.net, erlang-q...@erlang.org

Really there are advantages for using both lists or binaries, but I think one of the "issues" with string handling is the need for consistent library support between the two. What would be nice is to have consistent support with the Erlang libraries to handle both types (e.g. string, re, regexp etc.).

Just my 2c...

J K

unread,

Jan 2, 2013, 6:25:22 PM1/2/13

to Joe Armstrong, Dmitry Kolesnikov, Steve Davis, Erlang Questions

Hi,

I'm not sure I understand your 20.000 files example. Are you suggesting that the user should limit the number of erlang processes to the number of cores or are you suggesting that the VM compresses the erlang process data when not running? That would be really nice if it can be done with only a small performance penalty, say 10-20%.

In my case I start 50.000 to 100.000 processes , one per file (it's an (map reduce like) application to do feature extraction for some machine learning algorithms) . One erlang node uses about 7GB of memory. I can probably tune it a bit (a lot?) more by using binaries but it would be nice to have an option to compress process data when not running, for people that are lazy/not an erlang expert/does not have that much time (my case)/or just as an indication of how much memory that could be saved by using binaries.

JK

From: Joe Armstrong <erl...@gmail.com>
To: Dmitry Kolesnikov <dmkole...@gmail.com>
Cc: Steve Davis <steven.cha...@gmail.com>; Erlang Questions <erlang-q...@erlang.org>
Sent: Saturday, December 29, 2012 4:30 PM

Subject: Re: [erlang-questions] Strings and Text Processing

Masklinn

unread,

Jan 2, 2013, 6:40:48 PM1/2/13

to Erlang Questions

On 2012-12-31, at 21:11 , Matthew Evans wrote:
> Really there are advantages for using both lists or binaries, but I think one of the "issues" with string handling is the need for consistent library support between the two. What would be nice is to have consistent support with the Erlang libraries to handle both types (e.g. string, re, regexp etc.).

> Just my 2c…

Which would work nicely if the understanding of "strings" (and the
corresponding string-using functions) was extended to iolists (even more
so if bare binaries were included in iolists as well, but it's easy
enough to wrap them in a list), as I previously noted.

Although there's an issue when using binaries for strings: what's
the encoding?

Joe Armstrong

unread,

Jan 3, 2013, 4:34:53 AM1/3/13

to J K, Erlang Questions, Steve Davis

On Thu, Jan 3, 2013 at 12:25 AM, J K <jmaka...@yahoo.com> wrote:

Hi,

I'm not sure I understand your 20.000 files example.

This was just to give an estimate of the total memory size

Are you suggesting that the user should limit the number of erlang processes to the number of cores or are you suggesting that the VM compresses the erlang process data when not running?

You could experiment with limiting the number of active processes and

compressing data when it's not being used (you'd have to do this yourself as

part of the application).

To first approximation one process per parallel activity is a good rule of thumb (and you let the Erlang scheduler figure out who is to run where) - the alternative is

that you limit the number of parallel processes and decide when and where they execute - you are basically saying "because I know a lot about the specific details of my application I can do a better job of process management than the Erlang VM" - this is pretty tricky.

That would be really nice if it can be done with only a small performance penalty, say 10-20%.

In my case I start 50.000 to 100.000 processes , one per file (it's an (map reduce like) application to do feature extraction for some machine learning algorithms) .

How big is each file? how much processing is needed per-file to do feature extraction?

One erlang node uses about 7GB of memory. I can probably tune it a bit (a lot?) more by using binaries but it would be nice to have an option to compress process data when not running, for people that are lazy/not an erlang expert/does not have that much time (my case)/or just as an indication of how much memory that could be saved by using binaries.

look up the manual entry for hibernate/3

http://erlang.org/doc/man/erlang.html#hibernate-3

The hibernate BIF minimise the size of a process before putting it to sleep. It won't

compress any process data - but it does trim the stack and heap before suspending a process. If you have a large number of processes that sleep for

long times this might be a good idea. If your processes sleep for short times and wake up at random then it probably won't help.

Performance tuning is a black art. Unfortunately it's *very* system dependent. If you change the number of cores, or operating system, or amount of memory you have to start again.

If in doubt measure !

/Joe

Jesper Louis Andersen

unread,

Jan 3, 2013, 4:36:53 AM1/3/13

to Steve Davis, Erlang Questions

On Dec 29, 2012, at 3:08 PM, Steve Davis <steven.cha...@gmail.com> wrote:

> 2) If not, I wonder if there's any way to change our minds about "strings" as we enter 2013?

There is :)

A String is often a notoriously bad way to represent data. Think about it: a string is really just a sequence of bytes with some interpretation (utf8 is common). The problem is that people tend to keep their data represented as strings where other data structures are better for the purpose. A Comma-separated file for instance is better represented as a list of tuples or a generator of tuples (and/or records).

The reason you don't usually need a lot of syntactic sugar for manipulating strings are simply that you shouldn't be manipulating strings too much in the first case. A language like Perl is good for processing directly on text, but the weakness is that it has a harder time at structuring data like you should do in Erlang (Or ML or Haskell or …). I have a hunch that most of the critique is really due to the fact that people are trying to write in the style of a different language in Erlang.

Jesper Louis Andersen
Erlang Solutions Ltd., Copenhagen

Steve Davis

unread,

Jan 3, 2013, 8:11:36 AM1/3/13

to Jesper Louis Andersen, Erlang Questions

Hi Jesper,

I do most certainly agree with your comments. The requirement for text processing is almost entirely a boundary consideration - made quite common owing to the number of text-based protocols and file formats out there. The provision of codecs that transform from string to the internal data representation, and then from the internal representation out to a string is a not-infrequent requirement of the code. Decoders are frequently far harder to write than encoders - and as I noted, made harder still when using erlang strings than when using binaries.

Regards,
Steve

J K

unread,

Jan 9, 2013, 5:44:34 PM1/9/13

to Joe Armstrong, Erlang Questions

From: Joe Armstrong <erl...@gmail.com>

To: J K <jmaka...@yahoo.com>
Cc: Dmitry Kolesnikov <dmkole...@gmail.com>; Steve Davis <steven.cha...@gmail.com>; Erlang Questions <erlang-q...@erlang.org>
Sent: Thursday, January 3, 2013 10:34 AM

Subject: Re: [erlang-questions] Strings and Text Processing

On Thu, Jan 3, 2013 at 12:25 AM, J K <jmaka...@yahoo.com> wrote:

Hi,

I'm not sure I understand your 20.000 files example.

This was just to give an estimate of the total memory size

Are you suggesting that the user should limit the number of erlang processes to the number of cores or are you suggesting that the VM compresses the erlang process data when not running?

You could experiment with limiting the number of active processes and

compressing data when it's not being used (you'd have to do this yourself as

part of the application).

To first approximation one process per parallel activity is a good rule of thumb (and you let the Erlang scheduler figure out who is to run where) - the alternative is

that you limit the number of parallel processes and decide when and where they execute - you are basically saying "because I know a lot about the specific details of my application I can do a better job of process management than the Erlang VM" - this is pretty tricky.

---

Yes, it would probably take some time to do that. I would rather let the VM limit the number of processes to a number set by the user.

That would be really nice if it can be done with only a small performance penalty, say 10-20%.

In my case I start 50.000 to 100.000 processes , one per file (it's an (map reduce like) application to do feature extraction for some machine learning algorithms) .

How big is each file? how much processing is needed per-file to do feature extraction?

---

I would guess most of them are about 200-300KB, but probably vary between 0.1-1MB.

This is the parsing part of one of the feature extraction algorithms:

Definitions.

Rules.

{T}{T} : {token,TokenChars,pushback(TokenChars)}.

{T} : skip_token.

Erlang code.

pushback([_,X]) -> [X].

It's the character bigram algorithm:

http://en.wikipedia.org/wiki/Bigram

A special case of n-gram:

http://en.wikipedia.org/wiki/N-gram

However I just realized I can do this:

{T}{T} : {token,list_to_binary(TokenChars),pushback(TokenChars)}.

One erlang node uses about 7GB of memory. I can probably tune it a bit (a lot?) more by using binaries but it would be nice to have an option to compress process data when not running, for people that are lazy/not an erlang expert/does not have that much time (my case)/or just as an indication of how much memory that could be saved by using binaries.

look up the manual entry for hibernate/3

http://erlang.org/doc/man/erlang.html#hibernate-3

The hibernate BIF minimise the size of a process before putting it to sleep. It won't

compress any process data - but it does trim the stack and heap before suspending a process. If you have a large number of processes that sleep for

long times this might be a good idea. If your processes sleep for short times and wake up at random then it probably won't help.

Performance tuning is a black art. Unfortunately it's *very* system dependent. If you change the number of cores, or operating system, or amount of memory you have to start again.

If in doubt measure !

---

I actually do explicit garbage collection already. I made some measurement of peak memory usage using different combinations of GC, binary, list and fullsweep_after. I might try hibernate later but I'm not sure if it would make any big difference, after all I have 7GB available on each host :) And I can easily add another host if needed.

I don't have the files available right now so I copied some include files totaling about 27MB (as reported by du -m).

Peak mem usage (using this: http://www.linuxquestions.org/questions/linux-software-2/how-to-measure-a-programs-peak-memory-consumption-336376/ ):

Using: {T}{T} : {token,TokenChars,pushback(TokenChars)}.

- No explicit GC, no fullsweep_after:

Min 1318, Max1578, Individual Measurements: 1346, 1318, 1342, 1321, 1587, 1343 MB

- No explicit GC, fullsweep_after 0:

Min 1359, Max 1414, Meas: 1361, 1363, 1359, 1414, 1365 MB

- Explicit GC, no fullsweep_after:

Min 1009, Max 1093, Meas: 1010, 1015, 1009, 1093, 1015, 1022 MB

- Explicit GC, fullsweep_after 0

Min 840, Max 1193, Meas: 864, 844, 847, 840, 868, 1193, 860, 859 MB,

Using: {T}{T} : {token,list_to_binary(TokenChars),pushback(TokenChars)}.

- No explicit GC, no fullsweep_after:

1215-1232. 1232, 1215, 1225, 1222 MB

- No explicit GC, fullsweep_after 0

1191-1304. 1205, 1239, 1304, 1267, 1238, 1191, 1220 MB

- Explicit GC, no fullsweep_after:

750-955. 890, 813, 885, 822, 874, 955, 846, 890, 750 MB

- Explicit GC, fullsweep_after 0

813-822, 822, 820, 813, 815, 821, 817 MB

In summary (for my case): using binaries is a good idea but is not critical if using explicit GC and fullsweep_after 0. Also worth mentioning is that I create an ets table for each file to save the parsed result (working with lists is too slow). This is when the memory usage peaks.

It would also be good if the fullsweep_after option was mentioned in the GC chapter: