Erlang binary longest common substring

I Gusti Ngurah Oka Prinarjaya

unread,

Mar 30, 2020, 6:18:50 AM3/30/20

to Erlang

Hi,

I need longest common substring erlang implementation. Please help me

I've tried https://rosettacode.org/wiki/Longest_common_subsequence#Erlang but very very slow T_T

and i've tried binary:longest_common_prefix([<<"i love you 2020-03-28">>,<<"i love you 2020-03-01">>]) not effective / not valid

Thank you

Valentin Micic

unread,

Mar 30, 2020, 7:23:34 AM3/30/20

to I Gusti Ngurah Oka Prinarjaya, Erlang

It may be faster to do this with lists, but this should do the trick:

lcs( <<A/binary>>, <<B/binary>>, Off )

->

case {A, B} of

{<<_:Off/binary-unit:8, C:8, _/binary>>, <<_:Off/binary-unit:8, C:8, _/binary>>} -> lcs( A, B, Off+1 );

{<<Common:Off/binary-unit:8, _/binary>>, _} -> Common

end

.

Fro example, a call to lcs( <<“Hello World”>>, <<"Hellao World”>>, 0 )will return <<“Hell”>>.

NOTE (and just in case): you’d need to put this function in your own module.

V/

Marc Worrell

unread,

Mar 30, 2020, 7:41:45 AM3/30/20

to Valentin Micic, Erlang

Maybe you would like to take UTF8 strings into account.

As copying is sometimes better, and actually quite fast, you can also build a new binary.

common_prefix(A, B) ->
common_prefix(A, B, <<>>).

common_prefix(<<C/utf8, A/binary>>, <<C/utf8, B/binary>>, Acc) ->
common_prefix(A, B, <<Acc/binary, C/utf8>>);
common_prefix(_A, _B, Acc) ->
Acc.

Or use the offset method and return a sub-binary:

common_prefix(A, B) ->
common_prefix(A, B, 0).

common_prefix(A, B, Offs) ->
case {A, B} of
{<<_:Offs/binary, C/utf8, _/binary>>, <<_:Offs/binary, C/utf8, _/binary>>} ->
common_prefix(A, B, Offs + size(<<C/utf8>>));
{<<Prefix:Offs/binary, _/binary>>, _} ->
Prefix
end.

Cheers, Marc

Richard O'Keefe

unread,

Mar 30, 2020, 8:10:19 AM3/30/20

to I Gusti Ngurah Oka Prinarjaya, Erlang

You've mentioned three different things:
longest common prefix,
longest common substring,
longest common subsequence.
What do you actually want and why do you want it?
Also, do these binaries represent text?
In that case there's another distinction:
- code points?
- base character + combining characters?
- grapheme cluster?
What I'm getting at is that you *might* want
(O, combining overline) and (O, combining hook above)
to have (O) as a common whatever, but on the other
hand you might *not*.

Ben Murphy

unread,

Mar 30, 2020, 9:08:28 AM3/30/20

to I Gusti Ngurah Oka Prinarjaya, Erlang

Hi,

I think the implementation on rosetta code is slightly broken from a
performance point of view. Usually, if you are memoising in this
problem you would use the offsets for where you are in the string. So
if you were writing in an imperative language you would just have a N
x M array. Obviously, you can't do this in Erlang so they are storing
stuff in a dictionary. However, for the keys it looks like they are
using the string/list for the key instead of the offset. This means
the dict code will have to iterate over the remaining bit of the
string for every lookup. So instead of being O(NxM) you end up with an
algorithm that is O(NxMx(N + M)). So instead of being quadratic it
gives you cubic performance. This might explain why it is so slow.
Note: I haven't included the log() term that is introduced by the
dictionary lookup that can't be avoided in Erlang.

I Gusti Ngurah Oka Prinarjaya

unread,

Mar 30, 2020, 9:45:52 AM3/30/20

to Ben Murphy, Valentin Micic, Erlang

Hi All,

Thank your very much for the answers,

@Valentin Micic , Your lcs function skipping the "World" word. I need "o" and "World" also count.

@Marc Worrel, Your function also similar with Valentine Micic's function.

@Richard O'Keefe , I suppose substring or subsequence is same thing, but i think i am wrong, the point is, i need the function to also count

"o" from <<"Hellaao World">>, and "World" from <<"Hello World">>
from example: lcs(<<"Hello World">>, <<"Hellaao World">>)

I think i need to make a NIF .

I will try to make NIF based on this https://github.com/wollmers/c-lcs-bv wonderful example ANSI C code.

I've try Pyrlang https://pyrlang.github.io/Pyrlang/ in the hope i that i can use / communicate with https://pypi.org/project/pylcs/ through:

rpc:call('p...@127.0.0.1', 'pylcs', 'lcs', [<<"Hello World">>,<<"Hellaao World">>]) in a python-process node but Pyrlang not help much in my case. still very slow

because it run over the network.

Marc Worrell

unread,

Mar 30, 2020, 10:04:39 AM3/30/20

to I Gusti Ngurah Oka Prinarjaya, Erlang

On 30 Mar 2020, at 15:45, I Gusti Ngurah Oka Prinarjaya <okapri...@gmail.com> wrote:

Hi All,

Thank your very much for the answers,

@Valentin Micic , Your lcs function skipping the "World" word. I need "o" and "World" also count.

@Marc Worrel, Your function also similar with Valentine Micic's function.

@Richard O'Keefe , I suppose substring or subsequence is same thing, but i think i am wrong, the point is, i need the function to also count
"o" from <<"Hellaao World">>, and "World" from <<"Hello World">>
from example: lcs(<<"Hello World">>, <<"Hellaao World">>)

Ah, you needed really a substring function.

Then best is, indeed, use the C code or add memoization to the Erlang version.

- Marc

I Gusti Ngurah Oka Prinarjaya

unread,

Mar 30, 2020, 10:16:57 AM3/30/20

to Marc Worrell, Erlang

Hi,

Hi @Marc Worrel, I've try using memoize version from https://rosettacode.org/wiki/Longest_common_subsequence#Erlang
and the result also very very slow.

Valentin Micic

unread,

Mar 30, 2020, 11:24:19 AM3/30/20

to I Gusti Ngurah Oka Prinarjaya, Erlang

On 30 Mar 2020, at 15:45, I Gusti Ngurah Oka Prinarjaya <okapri...@gmail.com> wrote:

Hi All,

Thank your very much for the answers,

@Valentin Micic , Your lcs function skipping the "World" word. I need "o" and "World" also count.

Maybe I misunderstood what you meant by "longest common prefix”.

Presuming that A and B are arbitrary binary “strings”, then:

- by common, I presumed something that is common for both A and B

- by prefix, I presumed something that precedes something else. In this context, this would be everything that A and B have in common, from left to right, before they start to differ.

This lead me to believe that <<“Hello World”>> and <<“Hellao World”>> have only <<“Hell”>> that satisfy the above criteria.

My apologies for being presumptuous.

Kind regards

V/

Richard O'Keefe

unread,

Mar 31, 2020, 12:13:45 AM3/31/20

to I Gusti Ngurah Oka Prinarjaya, Erlang

"substring" is a special case of "subsequence".
"OAR" is a subsequence of "cOronA viRus" but not a substring of it.
"OAR" is a substring of "bOARs" but not a prefix of it.

I am still in the dark about whether you want to view these things
as sequences of BYTEs, as encoded sequences of CODEPOINTS,
as encoded sequences of base-character+combining characters,
as sequence of grapheme clusters, or quite possibly as sequences
of tokens.

A lesson I learned back in 1980 is that "STRINGS ARE WRONG".
As Alan Perlis said, "The string is a stark data structure and
everywhere it is passed there is much duplication of process.
It is a perfect vehicle for hiding information." Since then experience
has only reinforced this: if you are looking inside a string, then you
probably should not be using a string.

Take a step backwards.
What do these strings (binaries) *represent*?
What is the logical structure of the things you are representing?
Why do you want an operation that is completely insensitive
to that logical structure?
Can two different strings represent the same abstract concept?
What, if anything, does *part* of a string represent?

On Tue, 31 Mar 2020 at 02:45, I Gusti Ngurah Oka Prinarjaya

I Gusti Ngurah Oka Prinarjaya

unread,

Mar 31, 2020, 2:03:13 AM3/31/20

to Richard O'Keefe, Erlang

Hi,

@Richard O'Keefe, Thanks a lot for your attention, shared knowledge and enlightenment.

>> I am still in the dark about whether you want to view these things
>> as sequences of BYTEs, as encoded sequences of CODEPOINTS,
>> as encoded sequences of base-character+combining characters,
>> as sequence of grapheme clusters, or quite possibly as sequences
>> of tokens

Honestly, my knowledge for LCS is not deep. I just observe the output of each libraries that i test, if the output suite my need, then i choose that library.

https://pypi.org/project/pylcs/ already provide functions to solve LCSubsequence and LCSubstring. But unfortunately all of the integration options not suited my case except NIF. I've try os:cmd() , Pyrlang and i'm not interested with erlang port because similarr with Pyrlang there are a terms serialization handling.

So I decided to learn to write a NIF based on this https://github.com/wollmers/c-lcs-bv/blob/master/lcstest.c . I need your help if someday i facing lots os error when run my very first NIF. And i do really still welcome for any Erlang's BIF based solution (if any). So i don't need to write a NIF that risk crashing the VM.

>> Take a step backwards.
>> What do these strings (binaries) *represent*?
>> What is the logical structure of the things you are representing?
>> Why do you want an operation that is completely insensitive
>> to that logical structure?
>> Can two different strings represent the same abstract concept?
>> What, if anything, does *part* of a string represent?

Honestly, for now i don't really understand some of your guidance above, for now i will just act as a house builder using Ikea's solutions (analogy).

Someday i will spent my free time for learning LCS more deeply act as a house building architect (analogy).

Reply all

Reply to author

Forward