Add boolean methods for different unicode character groups (String.alphanumeric?, etc)


w.m.w...@student.rug.nl

May 3, 2016, 3:31:44 PM
to elixir-lang-core
I have seen multiple people (in the Elixir Slack group, on Reddit) during the last couple of days requiring something that checks whether a (possibly long) string contains, e.g., only alphanumeric characters.

It is possible to do this using regular expressions right now:
~r/[^[:alnum:]]/u

but this is very slow.
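
For reference, a minimal sketch of how that regular expression can be turned into a boolean check. The module and function names are hypothetical; only `Regex.match?/2` and the `~r` sigil come from Elixir itself. The pattern looks for any character outside `[:alnum:]`, so the result is negated.

```elixir
defmodule AlnumRegexSketch do
  # Hypothetical helper, not part of the String module.
  # Returns true when the regex finds no character outside [:alnum:].
  def alphanumeric?(string) when is_binary(string) do
    not Regex.match?(~r/[^[:alnum:]]/u, string)
  end
end
```

Note that with this approach an empty string also counts as alphanumeric, since there is no offending character to find.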

My proposal is to add the following boolean functions to the String module:

  •  alphabetic?
  •  numeric?
  •  alphanumeric?
  •  whitespace?
  •  uppercase? 
  •  lowercase?
  •  control_character?

Function heads for these functions can probably be best generated by using compile-time macros similar to what other unicode-based functions already use.
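
A rough sketch of what compile-time generation of function heads could look like. The module name, the function name and the five-character table are made up for illustration; a real implementation would build the table from the Unicode data files.

```elixir
defmodule CompileTimeHeadsSketch do
  # Tiny stand-in table (an assumption); the real data would come from the Unicode files.
  lowercase = ["a", "b", "c", "é", "ß"]

  # One function clause is generated per character while the module compiles.
  for char <- lowercase do
    def lowercase_char?(unquote(char)), do: true
  end

  # Fallback clause for every binary that is not in the table.
  def lowercase_char?(char) when is_binary(char), do: false
end
```

At runtime the check is then a plain pattern match on the function head, with no table lookup.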

eksperimental

May 3, 2016, 5:20:00 PM
to elixir-l...@googlegroups.com
and yesterday on stackoverflow
http://stackoverflow.com/questions/36966980/find-if-codepoint-is-upper-case-in-elixir

On Tue, 3 May 2016 12:31:44 -0700 (PDT)
w.m.w...@student.rug.nl wrote:

> I have seen multiple people (In the Elixir Slack group
> <https://elixir-lang.slack.com/archives/general/p1462294660007855>,
> on Reddit
> <https://www.reddit.com/r/elixir/comments/4h4y4e/whats_missing_from_the_elixir_ecosystem/d2nvbwd>)
> during the last couple of days requiring something that checks if a
> (possibly long) string contains e.g. only alphanumeric characters.
>
> It is possible to do this using regular expressions right now:
> ~r/[^[:alnum:]]/u
>
> but this is very slow.
>
> My proposal is to add the following boolean functions to the String
> module:
>
>
> - alphabetic?
> - numeric?
> - alphanumeric?
> - whitespace?
> - uppercase?
> - lowercase?
> - control_character?

eksperimental

May 3, 2016, 5:29:16 PM
to elixir-l...@googlegroups.com
I'm not sure we should add that many functions. There could end up being
too many of them, and the set would not be easy to extend.
But how about a Unicode.info/1 function that returns a tuple with
information about the character? For example:
iex> Unicode.info("A")
{:alphanumeric, :uppercase, :ascii}

It would be easy to improve as we find more information to add,
such as ISO types and other groups (especially for encodings we are not
familiar with).

Additionally, we could have check?/2 (or probably some better name!):
iex> Unicode.check?("A", :uppercase)
true
iex> Unicode.check?("A", :numeric)
false
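
To make the shape of this proposal concrete, here is an ASCII-only sketch. Everything in it is an assumption: the module name `UnicodeInfoSketch` is hypothetical, the set of tags is just the handful mentioned in this thread, and the ranges cover ASCII only rather than real Unicode data.

```elixir
defmodule UnicodeInfoSketch do
  # Hypothetical, ASCII-only stand-in for the proposed info/1 and check?/2.
  # Takes a binary holding exactly one codepoint and returns a tuple of tags.
  def info(<<cp::utf8>>) do
    checks = [
      alphanumeric: cp in ?a..?z or cp in ?A..?Z or cp in ?0..?9,
      numeric: cp in ?0..?9,
      uppercase: cp in ?A..?Z,
      lowercase: cp in ?a..?z,
      ascii: cp <= 127
    ]

    # Keep only the tags whose check evaluated to true.
    tags = for {tag, true} <- checks, do: tag
    List.to_tuple(tags)
  end

  # True when the given property is among the tags returned by info/1.
  def check?(char, property) when is_atom(property) do
    property in Tuple.to_list(info(char))
  end
end
```

Adding a new property here means adding one entry to the keyword list, which is the extensibility argument made above.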



Eric Meadows-Jönsson

May 3, 2016, 6:24:33 PM
to elixir-l...@googlegroups.com
The problem is that the Unicode module is already big; its .beam file is one of the largest in Elixir. There are also issues compiling this file on systems with 512 MB of memory. idna, an Erlang library for Unicode, has similar issues on systems with low memory. Adding more functions that need a large number of function clauses will make the issue worse and make the compiled Elixir we distribute larger.

I think it's better to have this functionality in a library until we can solve the memory issue and only have the bare necessities for unicode support in stdlib. If we later can move it into stdlib it would be good to have the API figured out and bugs fixed in another library that can iterate faster.

--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/20160504042910.57fd86e0.eksperimental%40autistici.org.
For more options, visit https://groups.google.com/d/optout.



--
Eric Meadows-Jönsson

Peter Marreck

May 4, 2016, 12:05:21 PM
to elixir-lang-core
As an example of what would need to be done by necessity for proper compliance with Unicode spec, check out the "Derived Property: Alphabetic" codepoint list section of this doc:

ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

"Total code points: 110943"

And that's just for the "is_alphabetic?" function! (Sure, this would be macroed out, but as Eric said, it would definitely increase the binary size further...)

I still think this is useful functionality (and it would likely be many orders of magnitude faster than relying on Regex to determine these things, thanks to Elixir/Erlang's fast function-head pattern matching).

--
Peter Marreck

jisaa...@gmail.com

May 4, 2016, 3:36:51 PM
to elixir-lang-core
I have done some exploratory work on this sort of thing (implementing python's UnicodeData) and a _correct_ implementation is both large and difficult.

A full unicode properties file I created was 615k, which I think is about as large as the entire elixir standard library.

And that is without the east asian character set.

I think it would make a good 3rd party lib though.

I never finished the work, but I can throw it up on GitHub if someone else wants to finish it.

eksperimental

May 4, 2016, 5:29:19 PM
to elixir-l...@googlegroups.com
Hi jisaacstone,
I would be really interested in seeing what you have,
especially that 600 KB file.

thank you

eksperimental

May 4, 2016, 5:37:17 PM
to elixir-l...@googlegroups.com
> I would be really interested in seeing what you have,
> specially that 600kb file
oh shoot... that didn't sound right now that I read it on my screen,
no double meaning, I swear! :-P

José Valim

May 5, 2016, 3:43:10 AM
to elixir-l...@googlegroups.com
I think there are a couple things that could be done to reduce the file size:

* Break the properties into multiple modules. One module for alphanumeric, another module for a couple others, etc, etc

* Work on integer codepoints instead of binaries, i.e. support UnicodeData.alphanumeric?(?é) instead of UnicodeData.alphanumeric?("é"). If the latter is given, you can convert it to the former. The reasoning is that if you work with integers, you can support ranges. So instead of:

    alphanumeric?("a")
    alphanumeric?("b")
    ...
    alphanumeric?("z")

You can write:

    alphanumeric?(x) when x in ?a..?z

Ranges are slightly slower but I have pushed patches to optimize them on Erlang 19.

Both techniques are used in Elixir's Unicode files IIRC. The ranges would have to be extracted from the Unicode file. 
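
A small sketch combining the two ideas: guarded clauses over integer codepoint ranges, plus a clause that converts a one-character binary to its codepoint. The module name and the short range table (ASCII plus Latin-1 letters) are assumptions for illustration, not the full Unicode data.

```elixir
defmodule RangeGuardSketch do
  # Illustrative ranges only (an assumption): ASCII digits and letters plus Latin-1 letters.
  ranges = [
    {?0, ?9},
    {?A, ?Z},
    {?a, ?z},
    {0x00C0, 0x00D6},
    {0x00D8, 0x00F6},
    {0x00F8, 0x00FF}
  ]

  # One guarded clause per range, working on integer codepoints.
  for {first, last} <- ranges do
    def alphanumeric?(cp) when cp in unquote(first)..unquote(last), do: true
  end

  # A binary holding a single codepoint is converted to its integer form.
  def alphanumeric?(<<cp::utf8>>), do: alphanumeric?(cp)

  # Any other integer is not in the table.
  def alphanumeric?(cp) when is_integer(cp), do: false
end
```

Six clauses cover 190 codepoints here, which is where the size reduction would come from. A binary with more than one character matches no clause in this sketch and raises a FunctionClauseError.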

José Valim
Skype: jv.ptec
Founder and Director of R&D

w.m.w...@student.rug.nl

May 5, 2016, 12:44:07 PM
to elixir-lang-core
From what I gather so far it seems that the general consensus is that:

- this is useful functionality
- it is nontrivial to implement, which makes it even more important that there is a standardized solution.
- when included in the stdlib, compiling/using Elixir on embedded devices becomes impossible because of the increased file size.

I therefore propose that we put this functionality in a separate library that, either from the get-go or once it has reached a certain point of stability/completion, is managed under the 'elixir-lang' name.

How does that sound?

~Qqwy/Wiebe-Marten

Tallak Tveide

May 5, 2016, 2:07:17 PM
to elixir-lang-core
I did such an optimization in my library, but interestingly there was no improvement in performance or compiled size. I am guessing the compiler does a better job by default than my optimizations. :-P I believe the optimizations are still in the version history of codepagex.

w.m.w...@student.rug.nl

May 6, 2016, 10:50:44 AM
to elixir-lang-core
The DerivedCoreProperties.txt file that was linked to already internally uses ranges to keep the file manageable.

For at least the basic functionality described above, this would be enough information to create proper implementations.
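
As a sketch of how those ranges could be read, the following parses single data lines of the form `0041..005A    ; Alphabetic # ...` into `{property, first, last}` tuples. The module and function names are hypothetical. Comment lines, blank lines, and lines with more than two `;`-separated fields yield `nil` in this version.

```elixir
defmodule DerivedPropsParserSketch do
  # Parses one line into {property, first, last}, or nil when it is not a two-field data line.
  def parse_line(line) do
    # Drop the trailing "# ..." comment, then split "range ; property".
    [data | _] = String.split(line, "#", parts: 2)

    case String.split(data, ";") |> Enum.map(&String.trim/1) do
      [range, property] when range != "" ->
        {first, last} = parse_range(range)
        {property, first, last}

      _ ->
        nil
    end
  end

  # "0041..005A" is a range; a bare "00AA" is treated as a one-element range.
  defp parse_range(range) do
    case String.split(range, "..") do
      [single] -> {String.to_integer(single, 16), String.to_integer(single, 16)}
      [first, last] -> {String.to_integer(first, 16), String.to_integer(last, 16)}
    end
  end
end
```

The resulting tuples could then feed compile-time generation of guarded function clauses, one per range.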

I have read through Chapter 4 (Character Properties) of the Unicode 3.2 standard, and it seems that some checks are more complicated. (For example, because some graphemes represent ligatures, there is not just Lowercase and Uppercase but also the Titlecase property. Also, some characters are used differently depending on context.) These, however, also seem more specific and less often useful in programming.

I will, unless someone voices an objection, start on a Unicode package this weekend, in the hope that others will join and that it will become stable enough over time to at some point become 'officially' supported.


Have a wonderful day,

~Qqwy/Wiebe-Marten

jisaa...@gmail.com

May 6, 2016, 3:28:41 PM
to elixir-lang-core, eksper...@autistici.org
Hey sorry for the late reply

just pushed the code to github.

The code to create the file is here:

https://github.com/jisaacstone/ex_unicodedata/blob/master/lib/data/codepoint.ex

Could probably reduce the size somewhat by using tuples instead of structs.

 -isaac

w.m.w...@student.rug.nl

May 8, 2016, 5:21:18 AM
to elixir-lang-core
I have created a Hex package called `unicode` that contains part of the functionality: Right now it reads DerivedCoreProperties.txt and infers the different Derived Core Properties from that, as well as implementing some of the Regexp Compatibility Properties defined at http://www.unicode.org/reports/tr18/#Compatibility_Properties that only depend on the Derived Core Properties (digit, alphabetic, alphanumeric, lower, upper).

The properties in the Regexp Compatibility Properties table are the ones that seem the most useful, as these are often the checks someone wants to do on strings where Regular Expressions are too slow.

I haven't yet touched the properties for which the General Category of the character has to be known (maybe we could adapt Isaac's code for this? (-: ). Right now, the compiled .beam file when using the Derived Core Properties (Math, Alphabetic, Lowercase, Uppercase) is about 53 KB; it becomes about 280 KB when defining function clauses for the other Derived Core Properties as well.


Have a wonderful day,

~Wiebe-Marten

Ed W

May 9, 2016, 12:36:45 PM
to elixir-l...@googlegroups.com
On 05/05/2016 17:44, w.m.w...@student.rug.nl wrote:

> - when included in the stdlib, compiling/using Elixir on embedded devices becomes impossible because of the increased file size.

Head above the parapet, but I would be interested in using Elixir in
some embedded stuff. Object size and runtime size is important. I guess
I'm not really "embedded" as my linux image sizes are usually 10MB+
(including kernel), but I still want to keep things small when I can.
Having the ability to strip down unicode support may be useful for some
requirements...

Thanks for consideration...

Ed W



Ben Wilson

May 9, 2016, 8:17:12 PM
to elixir-lang-core

Ed W

May 11, 2016, 1:34:03 PM
to elixir-l...@googlegroups.com
Hi, I'm not totally sure what Nerves does for me?  My understanding is that Nerves is something like a mini distro builder, eg openbake (or whatever it's called?).  The target audience I think is where you want an embedded setup which is TOTALLY erlang only?  I did look at it, but I'm not using it as I need a full linux distro and Elixir would really be used only as the glue.

I currently build my own embedded distro (starts about 2MB and goes up from there - full base distro with lots of toys is about 10MB).  For sure the proper embedded guys think that 2MB is crazy big, hence my caveat.  However, given that the average user probably thinks 0.5GB install is a basic install, I am going out on a limb and calling my 10MB distro an "embedded" system.

Point though is when I add perl, I have a heavily stripped down unicode setup (ie I'm stripping many MB off the size, from memory the unicode DB is about 60MB in perl). 

My appeal is that it would be useful to keep sight of such uses of Elixir (i.e. Nerves and the like), where the target is a total install size of a few MB, and to be aware that there are some use cases (driving Lego robots) where a full Unicode DB isn't needed.

Thanks for listening!

Ed W