Proposal: String.index(string, substring)

299 views
Skip to first unread message

Qertoip

unread,
Feb 23, 2016, 10:47:15 AM2/23/16
to elixir-lang-core
We have a :binary.match/2 but it works on a byte level and is unwieldy with its tuple output and truthy :nomatch.

Proposal:
String.index(string, substring)         # => Integer or nil

Examples:
String.index("I am a text", "kk")       # => nil
String.index("I am a text", "a")        # => 2
String.index("I am a text", "text")     # => 7
String.index("I am a text", ~r/\s/)     # => 1

Alternative names to consider:
find_index/2 - like Enum.find_index/2
find/2 - present in Python and akin to Enum.find/2 although the latter does not return the index so a bit confusing

I would go with index/2 to invite Ruby, Java, C#, JavaScript and Perl programmers (its either "index" or "indexOf" in all of them).

I could probably follow-up with a PR if this turns out to be a welcome addition.

Ben Wilson

unread,
Feb 23, 2016, 11:28:57 AM2/23/16
to elixir-lang-core
I think we run into the general issues with indices and utf8 strings here don't we? What does 2 mean? is it the 2nd code point of the 2nd grapheme?

Qertoip

unread,
Feb 23, 2016, 11:58:01 AM2/23/16
to elixir-lang-core
W dniu wtorek, 23 lutego 2016 17:28:57 UTC+1 użytkownik Ben Wilson napisał:
I think we run into the general issues with indices and utf8 strings here don't we? What does 2 mean? is it the 2nd code point of the 2nd grapheme?

I believe it is pretty much established users by default want graphemes - like in String length, at, split_at.

José Valim

unread,
Feb 23, 2016, 12:04:30 PM2/23/16
to elixir-l...@googlegroups.com
I think it is more complicated than that because, depending on what you want to do, using the grapheme index is going to be much slower than working at the byte level. For example, if you want to split based on the index, doing a straight split operation or working with bytes is going to perform much better than doing two grapheme based operations (which are linear). So unless you have a use case that won't work with binary:match nor any of the other String functions, I am not convinced an index function is useful.



José Valim
Skype: jv.ptec
Founder and Director of R&D

--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/2d1ba237-1755-4fd0-82f6-7f086a1c2b72%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Peter Hamilton

unread,
Feb 23, 2016, 12:11:12 PM2/23/16
to elixir-l...@googlegroups.com

José: String.split_at already exists and works at the grapheme level. I'm not sure what you are suggesting.

I think it makes perfect sense to have index such that

String.at(str, String.index(str, chr)) == chr


Qertoip

unread,
Feb 23, 2016, 12:32:13 PM2/23/16
to elixir-lang-core, jose....@plataformatec.com.br
Thanks for quick response!

W dniu wtorek, 23 lutego 2016 18:04:30 UTC+1 użytkownik José Valim napisał:
I think it is more complicated than that because, depending on what you want to do, using the grapheme index is going to be much slower than working at the byte level.

I virtually never work on a byte level. Almost all real-world strings I need to process come in as UTF-8 and do indeed contain multi-byte graphemes.

Please lets constrain our discussion to UTF-8 strings.

For example, if you want to split based on the index, doing a straight split operation or working with bytes is going to perform much better than doing two grapheme based operations (which are linear). So unless you have a use case that won't work with binary:match nor any of the other String functions, I am not convinced an index function is useful.

I can't recall any language without String.index equivalent so it seems pretty justified. From C++ to Java to JS to Python to Ruby they all have it. We are talking simple string processing here. It will be hard for me to explain to every new programmer I convert to Elixir <3 that String.index is not present in Elixir because... slow or not idiomatic.

José Valim

unread,
Feb 23, 2016, 1:11:10 PM2/23/16
to Qertoip, elixir-lang-core
I virtually never work on a byte level. Almost all real-world strings I need to process come in as UTF-8 and do indeed contain multi-byte graphemes.

You are proposing a feature to Elixir. We need to consider many aspects, including usage and performance. So even though you are not working at the byte-level directly, a comparison with the byte-level operation must be part of the discussion if we want to understand the properties of the operation you are proposing.

From C++ to Java to JS to Python to Ruby they all have it. We are talking simple string processing here.

That's a bad argument because none of the languages above provide grapheme level operations (at most, they are codepoint based). They also provide plenty of features that are not part of Elixir (and won't ever be).
 
It will be hard for me to explain to every new programmer I convert to Elixir <3 that String.index is not present in Elixir because... slow or not idiomatic.

If every new programmer learning Elixir *needs* String.index then I would love to hear their use cases. :) I would love to hear yours as well.

Because it seems I was not clear in my previous e-mail, I will try to rephrase it. Using String.index can be a bad idea if we are using its result as an input for another String computation. For example, this would be super inefficient:

    String.split_at(String.index("foo", "o"))

This will perform much better:

    String.split("foo", "o", parts: 2)

If you need exactly the same result as the split call:

    [left, right] = String.split("foo", "o", parts: 2)
    {left, "o" <> right}
  
That's because String.split can work at the byte-level since it is not returning indexes. On the other hand, String.split_at + String.index will force us to do a binary search, then convert the byte index to a grapheme index and then traverse the string again looking for that grapheme index. It will be obviously expensive for large strings.

So unless we have a lot of use cases that can only be represented with String.index, I don't see the appeal in adding a very inefficient operation to the language. You will be better served by :binary.match. You can also easily implement the function as:

    case String.split("foo", "o", parts: 2) do
      [left, _] -> String.length(left)
      [_] -> nil
    end




jisaa...@gmail.com

unread,
Feb 23, 2016, 1:18:36 PM2/23/16
to elixir-lang-core, jose....@plataformatec.com.br
Because other languages have it does not make it a good addition.

Also a tuple output is not unweildy, it is idiomatic.

My concern is we will ruin a perfectly good language if we prioritize the expectations of new users.

Qertoip

unread,
Feb 23, 2016, 2:30:21 PM2/23/16
to elixir-lang-core, qer...@gmail.com, jose....@plataformatec.com.br
Thanks so much for taking time to explain this! I think I'm much closer to the truth now :)

The remaining thing I would love to understand is what is the major cause for the slowness. Is it graphemes support or limitations of the underlying VM? If we worked on a codepoint level like most languages - would it help with performance?

As for my specific use case I have rewritten it to use String.split and it seems good enough.

PS I intend to document it on SO for future reference as soon as I will feel competent ;)

José Valim

unread,
Feb 23, 2016, 3:39:19 PM2/23/16
to Qertoip, elixir-lang-core
The remaining thing I would love to understand is what is the major cause for the slowness. Is it graphemes support or limitations of the underlying VM? If we worked on a codepoint level like most languages - would it help with performance?

Take my name twice in a string:

"joséjosé"

That's represented as:

   j    o    s    é         j    o    s    é
<<106, 111, 115, 195, 169, 106, 111, 115, 195, 169>>

Because binaries are byte arrays, accessing the 6th byte is as simple as going straight to the 6th byte. It is O(1). However accessing the 5th grapheme requires going byte by byte counting the graphemes.

It is hard to answer if this is a VM limitation. We could have an read-optimized data-structure where we would have an array superposed to the byte array so we could access graphemes quickly but that would use more storage and make all operations like concatenation more expensive.

In any case, to sum up: my hesitation in adding index is because working on graphemes indexes will likely be slow and I would rather optimize the cases people would use the index for (like splitting) than lead developers to work with indexes in the first place.

Onorio Catenacci

unread,
Feb 23, 2016, 3:50:22 PM2/23/16
to elixir-l...@googlegroups.com
Just on the off chance that I'm not the only one not quite clear on the distinction between byte array, grapheme and code point:


TL; DR A grapheme can consist of multiple code points (e. g. the é in José's name) and I would gather that a code point doesn't necessarily correspond one to one with a byte in a byte array (because of double byte character sets).  In other words we can't just assume that a given byte will be the same as a given grapheme.  

Please correct me if I've misunderstood.  I was finding the conversation a little bit tricky to follow so I thought I'd share this in case I wasn't the only one. 

--
Onorio


--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Qertoip

unread,
Feb 23, 2016, 3:58:41 PM2/23/16
to elixir-lang-core
I have documented this thread findings to the best of my understanding here:
http://stackoverflow.com/a/35587943/407952

Thanks again!

José Valim

unread,
Feb 23, 2016, 4:09:09 PM2/23/16
to elixir-l...@googlegroups.com
A grapheme is made of one or more codepoints.
A codepoint is made of one or more bytes (in the UTF-8 encoding).
A byte is 8 bits. :)

For example, the letter é can be written in multiple ways:

* It can be written as a single codepoint "é". This means the grapheme is "é", the codepoint is "é" and it has 2 bytes.

* It can be written as two codepoints: "e" followed by the acute sign "´". This means the grapheme is "é" (it is *perceived* as a single entity, the same as before), the codepoints are "e" and "´" and it has 3 bytes altogether.
Reply all
Reply to author
Forward
0 new messages