Proposal: String.longer?

59 views
Skip to first unread message

Alvin Lindstam

unread,
Jun 25, 2016, 10:54:26 AM6/25/16
to elixir-lang-core
Checking the length of a String requires traversing the whole string. I fairly common use case is checking if the String is longer or shorter than a given value. I might not be interested if the string is 123 456 graphemes long or 654 321, if all I want to know if it is longer than 50 000 characters, but still has to calculate the full length to know if it is.

I'd propose adding String.longer? which calculates the length up to a given limit, and returns a boolean indicating if the string is longer than the limit. It could of course also be user to see if the string is shorter, longer-or-equal-to or shorter-or-equal-to:

String.length(string) > 10 == String.longer?(string, 10)
String.length(string) >= 10 == String.longer?(string, 10 - 1)
String.length(string) <= 10 == !String.longer?(string, 10)
String.length(string) < 10 == !String.longer?(string, 10 - 1)

I'm not sure about the naming of the function, please suggest a different one if you'd like.

The proposal is implemented here: https://github.com/elixir-lang/elixir/compare/master...alvinlindstam:string-longer. I'd be happy to send a PR if the proposal is accepted.

Alternatives
There is nothing stopping a user from implementing this themselves, since next_grapheme_size is public, but I'd guess it's such a hassle that few would do it.

There is no way to use this to check if a string's length is equal to a certain limit, or within a given range. We could use a more verbose api or more functions if that is desired.

One alternative is to add an optional limit paramter to String.length, which always returns the string's length if below the limit, but returns the limit or some atom if it's longer. It would be slightly more verbose to check the length, but enables checks for a given value or range (while still preventing unnecessary calculations).

Benchmarks
With my implementation, I get the following output from a simple benchmark. String.longer? seems to be few percent slower than String.length for short strings when the length is not above the limit, but only grows linerarly up to the given limit. The tests are names bu function, string length and limit.

Warning: The function you are trying to benchmark is super fast, making time measures unreliable!

Benchee won't measure individual runs but rather run it a couple of times and report the average back. Measures will still be correct, but the overhead of running it n times goes into the measurement. Also statistical results aren't as good, as they are based on averages now. If possible, increase the input size so that an individual run takes more than 10μs


Name                                    ips        average    deviation         median

string.length, 10                1012798.94         0.99μs   (±305.69%)         0.90μs

string.longer?, 10, 10           1005594.61         0.99μs   (±184.19%)         0.90μs

string.longer?, 10000, 10         896158.77         1.12μs   (±364.52%)         1.00μs

string.longer?, 10000, 5000         2298.79       435.01μs    (±44.71%)       390.00μs

string.length, 10000                1241.32       805.59μs    (±25.77%)       761.00μs


Comparison: 

string.length, 10                1012798.94

string.longer?, 10, 10           1005594.61 - 1.01x slower

string.longer?, 10000, 10         896158.77 - 1.13x slower

string.longer?, 10000, 5000         2298.79 - 440.58x slower

string.length, 10000                1241.32 - 815.90x slower

Further optimizations

I considered checking the byte_size of the string, hoping to find conditions when we could say for sure what the results would be.


I planned to return false if the byte_size was below the limit, since that would mean that there are less codepoints than the limit. But I'm not sure there are no situations where a codepoint could produce more than one grapheme.

I also planned to return true if byte_size was more than 4 times the limit, since each codepoint uses at most four bytes. But since a grapheme could use multiple codepoints it could also use more than four bytes, and I'm not sure what the upper limit is (if there is any).

John W Higgins

unread,
Jun 25, 2016, 2:14:10 PM6/25/16
to elixir-l...@googlegroups.com

On Sat, Jun 25, 2016 at 7:54 AM, Alvin Lindstam <alvin.l...@gmail.com> wrote:
Checking the length of a String requires traversing the whole string. I fairly common use case is checking if the String is longer or shorter than a given value. I might not be interested if the string is 123 456 graphemes long or 654 321, if all I want to know if it is longer than 50 000 characters, but still has to calculate the full length to know if it is.

What about String.at("bob", 4) != nil would test if "bob" was greater then or equal to 4 chars in a single pass.

John

Alvin Lindstam

unread,
Jun 26, 2016, 10:33:40 AM6/26/16
to elixir-l...@googlegroups.com
Yes, I didn't think if using String.at/2 which would work and doesn't traverse the whole string.

If we think that checking a Strings length up to a certain limit is interesting, I'd still say that using `String.at/2` feels a little unintuitive. Also, it seems to be about 20% slower than my implementation of String.longer? for the same input (I did a benchmark with a 10 000 character string, with 5000 as limit).

If there is any possible guarantees between byte size for graphemes in a valid utf-8 string, there is also room for the optimizations I mentioned in a longer? function, which is not possible to include in the at function. If we know that there is always at least as bytes as graphemes in a string, we could return false without traversing up to the limit in most strings when the string is shorter than the limit. The same goes for eagerly returning true for long strings if we know that no grapheme uses more than x bytes (even if x is 16 or some other fairly high value).

I guess it breaks down to how verbose the stdlib should be, if there should be a minimal set of functions that works for most use cases or if there is room for specialized functions (and if this particular example is deemed useful enought to add the maintainability and compatability cost).

--
You received this message because you are subscribed to a topic in the Google Groups "elixir-lang-core" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/elixir-lang-core/eY_YkhopJRA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/CAPhAwGwM0aN_Mx2a-_mwe9qi_UNR91KMDH_BpkskSaXHS-D0%3DA%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.

Filip Haglund

unread,
Jun 26, 2016, 12:03:12 PM6/26/16
to elixir-lang-core
Could this be done automatically by the compiler? Replacing `String.length str >= 13` with `String.at(str, 13) != nil`?

Michał Muskała

unread,
Jun 26, 2016, 12:19:28 PM6/26/16
to elixir-l...@googlegroups.com

On 26 Jun 2016, at 18:03, Filip Haglund <fille....@gmail.com> wrote:

Could this be done automatically by the compiler? Replacing `String.length str >= 13` with `String.at(str, 13) != nil`?

I don’t think this could be done as it has different semantics. String.length(binary) will fail if the string is not valid utf8, the proposed version won’t fail if the invalid sequence is past the searched index. This is similar to regular length/1 for lists with improper lists.

Michał.
signature.asc
Reply all
Reply to author
Forward
0 new messages