Question: Adding `lcount` to `Base.string.jl`


julia.u...@gmail.com

Mar 10, 2015, 8:48:55 AM
to juli...@googlegroups.com

When working with text or text files, one often needs the number of leading characters (e.g. spaces, for text indentation), for example in various config files, YAML, Markdown, Python, etc.

`Base.string.jl` currently contains `lstrip`; would it be possible to add something like `lcount`?
It is also much faster than, for instance, `lstrip`.

@doc """### lcount(s::AbstractString, char_::Char=' ')

Return `number` of leading `char` in `string`.
""" ->
function lcount(s::AbstractString, char_::Char=' ')
    i = start(s)
    while !done(s,i)
        c, j = next(s,i)
        if c != char_
            return i - 1
        end
        i = j
    end
    return i - 1
end



I'm aware that one can just write a function like this in each case, but since it is a common task with strings
(and different from `lstrip`), I thought I would ask.

I could send a pull request with entries for documentation, test cases, etc.:
  • @help  - helpdb.jl
  • doc/stdlib/strings.rst
  • test/strings
  • etc.
Just a question...


Mauro

Mar 10, 2015, 9:35:44 AM
to juli...@googlegroups.com
I don't handle strings much so I cannot comment on the merits. However,
I found that soliciting feedback for such things is best done with a
pull request.


Steven G. Johnson

Mar 10, 2015, 1:01:41 PM
to juli...@googlegroups.com
On Tuesday, March 10, 2015 at 8:48:55 AM UTC-4, julia.u...@gmail.com wrote:
> Working with text/text files one might need more than once to get the number of leading char (e.g. space - text indentation)

Before you submit a pull request, the first question to ask is: is such a function provided in other languages, particularly general-purpose languages like Python or Ruby that extensively support text processing? What is it called in those languages, and what semantics does it support? We want to copy other successful designs as much as possible.

In this particular case, even Python does not appear to have such a function (http://stackoverflow.com/questions/13648813/what-is-the-pythonic-way-to-count-the-leading-spaces-in-a-string), which leads me to doubt whether it is generally useful enough to merit inclusion in the Julia standard library.  This is particularly the case since, unlike Python, writing a trivial loop implementation in Julia is just as fast as anything in the standard library would be, so there is not quite as much pressure to put things like this in Base.

Jameson Nash

Mar 10, 2015, 7:14:43 PM
to juli...@googlegroups.com

my one-liner solution:

first(search("   asdf",r"[^ ]|$")) ÷ sizeof(" ") - 1

(Note that the OP's solution returns the wrong answer for multibyte characters, since it is missing the division by the size of the character representation in the original string type. Preferably, just keep count of the characters rather than returning the final byte offset.)

I suspect that most languages don't have this because (a) lstrip is more generally useful, since it usually does what you wanted and potentially avoids lots of slicing later, and (b) lstrip isn't actually much of a contributor to program speed for most real-world tasks (e.g. reading a file off the hard drive is typically assumed to be several orders of magnitude slower than copying a string), and if it were, we could just change it to return a SubString.
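
The suggestion above to keep count of the characters, rather than return a byte offset, might look like the following. This is only a sketch: it assumes a current Julia 1.x release (where strings are iterated directly with `for`) rather than the 0.3/0.4 releases discussed in this thread, and the name `lcount_chars` is made up for illustration.

```julia
# Hypothetical helper (not part of Base): count the leading characters equal
# to `ch`. It counts characters, so multibyte characters are handled.
function lcount_chars(s::AbstractString, ch::AbstractChar=' ')
    n = 0
    for c in s              # iterates over characters, not bytes
        c == ch || break    # stop at the first character that differs
        n += 1
    end
    return n
end

println(lcount_chars("   asdf"))                          # 3
# '\u00e9' (é) occupies 2 bytes in UTF-8; a byte offset would report 6 here.
println(lcount_chars("\u00e9\u00e9\u00e9 asdf", '\u00e9'))  # 3
```

The result is a count, not an index, so it is suited to measuring indentation depth rather than slicing the string.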

julia.u...@gmail.com

Mar 11, 2015, 5:33:19 AM
to juli...@googlegroups.com

Thanks for the feedback; that was the reason I asked first.


PS @Jameson:

The one-liner seems to be about 3.4 times slower (roughly 1.77 s versus 0.52 s per run).


>>> Suggested lcount
elapsed time: 0.52221899 seconds (152 MB allocated, 0.44% gc time in 7 pauses with 0 full sweep)
elapsed time: 0.524616396 seconds (152 MB allocated, 0.80% gc time in 7 pauses with 0 full sweep)
elapsed time: 0.521950371 seconds (152 MB allocated, 0.45% gc time in 7 pauses with 0 full sweep)
>>> Oneliner
elapsed time: 1.767340364 seconds (152 MB allocated, 0.13% gc time in 7 pauses with 0 full sweep)
elapsed time: 1.764453694 seconds (152 MB allocated, 0.14% gc time in 7 pauses with 0 full sweep)
elapsed time: 1.77071118 seconds (152 MB allocated, 0.13% gc time in 7 pauses with 0 full sweep)
>>> Suggested lcount
elapsed time: 0.550863079 seconds (152 MB allocated, 0.45% gc time in 7 pauses with 0 full sweep)
elapsed time: 0.520632211 seconds (152 MB allocated, 0.48% gc time in 7 pauses with 0 full sweep)
elapsed time: 0.521652503 seconds (152 MB allocated, 0.70% gc time in 7 pauses with 0 full sweep)
>>> Oneliner
elapsed time: 1.767211962 seconds (152 MB allocated, 0.14% gc time in 7 pauses with 0 full sweep)
elapsed time: 1.768012599 seconds (152 MB allocated, 0.14% gc time in 7 pauses with 0 full sweep)
elapsed time: 1.767895431 seconds (152 MB allocated, 0.14% gc time in 7 pauses with 0 full sweep)

Running 1,000,000 times over:

const text = """0 Indent
 1 Indent
  2 Indent
0 Indent
          10 Indent
      6 Indent
  2 Indent
0 Indent
 1 Indent
    4 Indent
    1 Tab indent
   

        8 Indent"""
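
The timing harness itself was not posted, so the following is only a guess at how such a comparison could be set up on a current Julia 1.x release. All names (`lcount_loop`, `lcount_regex`, `run_many`) are hypothetical, the sample lines are abbreviated from the text above, and `findfirst` stands in for the `search` function of the older releases used in this thread.

```julia
# A few sample lines modelled on the text above (abbreviated).
lines = ["0 Indent", " 1 Indent", "  2 Indent", "          10 Indent",
         "\t1 Tab indent", ""]

# Loop-based count of leading `ch` characters.
function lcount_loop(s::AbstractString, ch::AbstractChar=' ')
    n = 0
    for c in s
        c == ch || break
        n += 1
    end
    return n
end

# Regex-based variant in the spirit of the one-liner. The pattern always
# matches (at worst at the end of the string), and a space is one byte,
# so the byte index minus one equals the number of leading spaces.
lcount_regex(s::Union{String,SubString{String}}) = first(findfirst(r"[^ ]|$", s)) - 1

# Apply `f` to every line `reps` times; the sum is returned so that the
# calls have an observable result.
function run_many(f, lines, reps::Integer)
    total = 0
    for _ in 1:reps, line in lines
        total += f(line)
    end
    return total
end

run_many(lcount_loop, lines, 1)    # warm-up calls keep compilation
run_many(lcount_regex, lines, 1)   # out of the measured time
t_loop  = @elapsed run_many(lcount_loop, lines, 100_000)
t_regex = @elapsed run_many(lcount_regex, lines, 100_000)
println("loop:  ", t_loop, " s")
println("regex: ", t_regex, " s")
```

Timings from a sketch like this depend on the Julia version and machine, so they are not directly comparable with the numbers posted above.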



Tomas Lycken

Mar 13, 2015, 2:48:22 AM
to juli...@googlegroups.com
Is there a reason not to use

lcount(str) = length(str) - length(lstrip(str))

here? It seems a lot clearer, and is probably fast enough if disk I/O is the bottleneck anyway.

// T
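
One difference worth noting for this approach: by default `lstrip` removes any leading whitespace, not only a single given character, so the result can differ from the originally proposed `lcount` on tab-indented lines. A small sketch, assuming Julia 1.x (where `lstrip` accepts the characters to strip as a second argument); the function names are hypothetical:

```julia
# Strips only leading `ch` characters before taking the difference.
lcount_strip(s::AbstractString, ch::AbstractChar=' ') =
    length(s) - length(lstrip(s, ch))

# The one-liner as posted: default `lstrip` removes all leading whitespace.
lcount_ws(s::AbstractString) = length(s) - length(lstrip(s))

println(lcount_strip("  \tkey: value"))  # 2 (stops at the tab)
println(lcount_ws("  \tkey: value"))     # 3 (the tab counts as whitespace)
```

Both variants use `length`, so they return character counts rather than byte offsets.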

Scott Jones

May 7, 2015, 8:25:41 AM
to juli...@googlegroups.com
Instead of a specific function like this, I think it would be better to have a function that would find the index
of the first character that is *not* of some class... and that very well may already be available (I'll start looking for it, as I need that sort of functionality myself).

Scott



Scott Jones

May 7, 2015, 8:26:48 AM
to juli...@googlegroups.com
Just the cost of creating the potentially large object just to count its length, and then throwing it away...

Páll Haraldsson

May 12, 2015, 5:13:08 AM
to juli...@googlegroups.com
It depends. The function lstrip returns s[i:end] (or ""). If that were a subarray object (only possible in 0.4, and probably not yet for strings), you would only allocate that small subarray object.

If Python is clever that way, then I understand the missing function. Anyway, I'm looking into implementing better string handling that could do this, and am probably duplicating your work. If there are issue numbers I should know about, please tell me (even by private e-mail).

The small subarray objects with the "pointers" or indexes would be fixed size, and I assume they could reside on the stack (escape analysis, again only in 0.4).

Palli.
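
For readers on a later release: in Julia 1.x, `lstrip` does return a `SubString` that refers to the original string, so the stripped text is not copied. A minimal sketch under that assumption (the helper name is hypothetical):

```julia
# Hypothetical helper: number of bytes of leading whitespace, computed from
# the view returned by `lstrip` without copying the remaining text.
function leading_whitespace_bytes(s::AbstractString)
    rest = lstrip(s)                       # a SubString into `s` on Julia 1.x
    return ncodeunits(s) - ncodeunits(rest)
end

rest = lstrip("        8 Indent")
println(rest isa SubString)                            # true
println(leading_whitespace_bytes("        8 Indent"))  # 8
```

Because `ncodeunits` counts bytes, the result here is a byte offset, which is what indexing into the string needs; use `length` instead for a character count.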

Steven G. Johnson

May 15, 2015, 5:05:42 PM
to juli...@googlegroups.com


On Thursday, May 7, 2015 at 8:26:48 AM UTC-4, Scott Jones wrote:
> Just the cost of creating the potentially large object just to count its length, and then throwing it away...

It is trivial to implement a loop to compute this in Julia efficiently.  I think the one-liner here was optimized for brevity (which is perfectly appropriate for some uses).  But not all trivial subroutines need to be in base (vs. just telling people to write their own loops), especially in a language like Julia where not-built-in != slow.

See my post above.  Is there any evidence that this is an important operation to include in a standard library?  Do other text-processing-oriented languages include it?  Python?  Perl?

Scott Jones

May 15, 2015, 5:12:38 PM
to juli...@googlegroups.com
I do think that finding the first occurrence of a character that is *not* in a particular class (like whitespace) after a given location (defaulting to the start) is important... we did it constantly in our own code, and so did our customers.
I haven't looked into Julia's regex capabilities or their speed... I would think that might do the trick. (I know the ICU library would make this easy, but it wants UTF-16, which Julia is still rather slow in converting to/from UTF-8 ;-) )
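
As a sketch of the operation described here, on a current Julia 1.x release (not the 0.3/0.4 releases of this thread) `findnext` with a predicate covers it. The wrapper name `first_not_in_class` is hypothetical:

```julia
# Hypothetical helper: index of the first character at or after `start` for
# which `inclass` returns false, or `nothing` if every character is in the class.
function first_not_in_class(inclass, s::AbstractString,
                            start::Integer=firstindex(s))
    return findnext(c -> !inclass(c), s, start)
end

println(first_not_in_class(isspace, "  \tkey: value"))   # 4
println(first_not_in_class(isspace, "key:   value", 5))  # 8
println(first_not_in_class(isspace, "    "))             # nothing
```

The returned value is a string index (a byte index for `String`), so it can be used directly for slicing, e.g. `s[i:end]`.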

ele...@gmail.com

May 15, 2015, 7:42:19 PM
to juli...@googlegroups.com
In most things I do, it's also more common to want leading whitespace, i.e. a class of characters, not a single char. So the OP doesn't add much in my case. Also, as pointed out above, the OP returns a code point count for UTF* strings, so it's no use for indexing to find the non-leading char anyway.

Steven G. Johnson

May 16, 2015, 8:36:20 AM
to juli...@googlegroups.com


On Friday, May 15, 2015 at 7:42:19 PM UTC-4, ele...@gmail.com wrote:
> In most things I do its also more common to want leading whitespace, ie a class of characters, not a single char.  So the OP doesn't add much in my case.  Also as pointed out above, the OP returns a code point count for UTF*strings, so its no use for indexing to find the non-leading char anyway.

If you want an index, you can always use a regex...
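
A regex-based index lookup could be sketched as follows, assuming Julia 1.x, where `findfirst` accepts a `Regex` and returns the index range of the match or `nothing`. The helper name is made up for illustration:

```julia
# Hypothetical helper: index of the first non-whitespace character,
# or `nothing` if the string is empty or all whitespace.
function first_nonspace_index(s::Union{String,SubString{String}})
    m = findfirst(r"\S", s)     # range of the first match, or `nothing`
    return m === nothing ? nothing : first(m)
end

line = "    4 Indent"
i = first_nonspace_index(line)
println(i)             # 5
println(line[i:end])   # 4 Indent
```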

ele...@gmail.com

May 16, 2015, 8:59:03 AM
to juli...@googlegroups.com
Yes, so lcount() as defined doesn't add anything. 