
Any alternative for substr() function


kavita kulkarni

Apr 10, 2013, 6:16:00 AM
to begi...@perl.org
Hi All,

I want to extract certain character ranges (e.g. chars 4 to 9, chars 11 to 14, etc.)
from each line of my file and store them in separate variables for further
processing.
My file size can vary from thousands to millions of lines.
If I use Perl's built-in substr() function for the extraction, it has a huge
impact on performance.
Is there any alternative to this?

Thanks in advance.

Cheers,
Kavita

Chankey Pathak

Apr 10, 2013, 6:28:20 AM
to kavita kulkarni, begi...@perl.org
Hi Kavita,

You may try unpack (http://perldoc.perl.org/functions/unpack.html)

Also read these: http://www.perlmonks.org/?node_id=308607,
http://stackoverflow.com/questions/1083269/is-perls-unpack-ever-faster-than-substr


--
Regards,
Chankey Pathak <http://www.linuxstall.com>

timothy adigun

Apr 10, 2013, 6:35:38 AM
to Chankey Pathak, Perl Beginners, kavita kulkarni
Hi,
On 10 Apr 2013 11:30, "Chankey Pathak" <chank...@gmail.com> wrote:
>
> Hi Kavita,
>
> You may try unpack (http://perldoc.perl.org/functions/unpack.html)
>
unpack would not work if the OP has varying length of lines.

> Also read these: http://www.perlmonks.org/?node_id=308607,
>
http://stackoverflow.com/questions/1083269/is-perls-unpack-ever-faster-than-substr
>
>
> On Wed, Apr 10, 2013 at 3:46 PM, kavita kulkarni
> <kavitah...@gmail.com>wrote:
>
> > Hi All,
> >
> > I want to extract certain char sets (e.g. char 4 to 9, char 11 to 14 etc.)
> > from each line of my file and store them in separate variables for further
> > processing.
> > My file size can vary from thousand to millions of lines.
> > If I use perl in-built function substr() to data extraction, it has huge
> > impact on performance.
> > Is there any alternative for this?

Could you please give an example of what you want done and show how you are
going about it?

Jenda Krynicky

Apr 10, 2013, 7:02:16 AM
to Perl Beginners
From: timothy adigun <2tee...@gmail.com>
> On 10 Apr 2013 11:30, "Chankey Pathak" <chank...@gmail.com> wrote:
> >
> > Hi Kavita,
> >
> > You may try unpack (http://perldoc.perl.org/functions/unpack.html)
> >
> unpack would not work if the OP has varying length of lines.

Nope. It would work just fine as long as the bits the OP is interested in
are of fixed length and at fixed positions. The length of the
uninteresting trailing stuff is irrelevant.
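
A minimal sketch of this point, assuming the two example fields from the original question (chars 4 to 9 and chars 11 to 14, counted from 1) and made-up sample lines. The helper name fields() is hypothetical. In the template, @ moves to an absolute offset and a reads a fixed number of characters, so each line only has to be long enough to reach the last @ position.

```perl
use strict;
use warnings;

# Made-up sample lines of different lengths; only the leading fields matter.
my @lines = (
    "abcdefghijklmnopqrstuvwxyz",
    "0123456789ABCDE",
);

# Hypothetical helper, not from the thread.
# @3 a6  : move to offset 3, read 6 chars (chars 4 to 9)
# @10 a4 : move to offset 10, read 4 chars (chars 11 to 14)
sub fields {
    my ($line) = @_;
    return unpack('@3 a6 @10 a4', $line);
}

for my $line (@lines) {
    my ($first, $second) = fields($line);
    print "$first $second\n";
}
```

A line shorter than 10 characters would make the @10 step fail, so truly short lines would need a length check first.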

Jenda
===== Je...@Krynicky.cz === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery

kavita kulkarni

Apr 10, 2013, 8:02:02 AM
to Jenda Krynicky, begi...@perl.org
The data file has lines of the same length, and the field positions I am
interested in are fixed too (so unpack works for me).
I tried "unpack" as well, and it takes almost the same time as substr().

Here is sample code:

The function below is called for each line of the input data file.

sub extractFieldValue {
    my $self = shift;
    my $data = shift;    # Line from data file
    my $start = shift;
    my $length = shift;
    my $startPos = $start - 1;
    my $val = unpack("a$length", $data);
    $val =~ s/\|/ /g;
    return $val;
}

$val is then used as key of a Hash for further processing.

Cheers,
Kavita

Regards,
Kavita :-)

Bob goolsby

Apr 10, 2013, 11:45:40 AM
to Perl Beginners
G'Mornin' Kavita --

Before you go off on a goose chase, how do you know that substr() is going
to be a problem for you? Have you bench-marked it? If your file is as
large as you say, I strongly suspect that your bottleneck is going to be
I/O and any differences between unpack() and substr() will be lost in the
noise band below 1%.
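
One way to check this in isolation is the core Benchmark module. The sketch below is only an assumption about what such a comparison might look like: the sample line, the helper names via_substr() and via_unpack(), and the field positions (chars 4 to 9 and 11 to 14 from the original question) are all made up for illustration.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up 80-character line: "ABCDEFGHIJKLMNOPQRSTUVWXYZABC..."
my $line = join '', map { chr(65 + $_ % 26) } 0 .. 79;

# Both helpers extract chars 4 to 9 and chars 11 to 14 (1-based).
sub via_substr { return (substr($line, 3, 6), substr($line, 10, 4)) }
sub via_unpack { return unpack('@3 a6 @10 a4', $line) }

# Run each variant for about one CPU second and print a comparison table.
cmpthese(-1, {
    'substr' => sub { my @f = via_substr() },
    'unpack' => sub { my @f = via_unpack() },
});
```

Whatever the table shows, it only measures the extraction step; it says nothing about I/O or the rest of the script.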


B


--

Bob Goolsby
bob.g...@gmail.com

David Precious

Apr 10, 2013, 11:49:28 AM
to begi...@perl.org
On Wed, 10 Apr 2013 08:45:40 -0700
Bob goolsby <bob.g...@gmail.com> wrote:

> Before you go off on a goose chase, how do you know that substr() is
> going to be a problem for you? Have you bench-marked it? If your
> file is as large as you say, I strongly suspect that your bottleneck
> is going to be I/O and any differences between unpack() and substr()
> will be lost in the noise band below 1%.

^^ This.

Kavita, I'd strongly advise profiling your script with Devel::NYTProf
to see exactly where the time is being spent; it can be a surprise
sometimes.

Once you've got some profiling data showing you where the most time is
spent, you know where to focus your efforts.

Ken Slater

Apr 10, 2013, 12:02:41 PM
to kavita kulkarni, Jenda Krynicky, begi...@perl.org
As others have mentioned, you should profile the program to get an idea of
what code is taking up time. That said, here are a couple of comments:

When you have that many arguments, it is usually preferable (and may be a
little faster) to use the following (rather than shifts):

my ($self, $data, $start, $length) = @_;

That said, if you can, get rid of arguments you aren't using. Also, you are
setting a variable named $startPos, and never using it.

Also, for your regular expression, utilize the qr operator:

my $regexp = qr#\|#;

sub extractFieldValue {
    ...
    ...
    $val =~ s/$regexp/ /g;
    ...
}
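
Putting these suggestions together might look like the sketch below. It is not the poster's actual code: the name extract_field_value is hypothetical, it is written as a plain function rather than a method, $start is assumed to be 1-based as in the original question, and tr/// is swapped in for the substitution because only a single character is being replaced.

```perl
use strict;
use warnings;

# Hypothetical rewrite of the extractor, for illustration only.
sub extract_field_value {
    my ($data, $start, $length) = @_;    # one list assignment instead of shifts

    # $start is assumed to be 1-based, so convert it to a 0-based offset.
    my $val = substr($data, $start - 1, $length);

    # Replace every pipe with a space without involving the regex engine.
    $val =~ tr/|/ /;

    return $val;
}

print extract_field_value("abc|ef|hij", 3, 5), "\n";
```

Unlike the posted version, this one actually uses the start position, so it returns the field at that position rather than the first $length characters of the line.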

HTH, Ken



Rob Dixon

Apr 10, 2013, 2:31:20 PM
to Perl Beginners, Bob goolsby, kavitah...@gmail.com
On 10/04/2013 16:45, Bob goolsby wrote:
> G'Mornin' Kavita --
>
> Before you go off on a goose chase, how do you know that substr() is going
> to be a problem for you? Have you bench-marked it? If your file is as
> large as you say, I strongly suspect that your bottleneck is going to be
> I/O and any differences between unpack() and substr() will be lost in the
> noise band below 1%.

Hint:

Try adding

return "";

as the first statement of `extractFieldValue` (i.e. so that it does
nothing and returns a null string) then run your program and see how
long it takes just doing the IO.
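
The same comparison can also be timed directly with the core Time::HiRes module. The following is a rough sketch under stated assumptions: the data is a made-up in-memory stand-in for the real file (50,000 fixed-width lines), and time_pass() is a hypothetical helper that reads every line and hands it to a callback.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Made-up stand-in for the real data file: 50,000 fixed-width lines.
my $data = join '', map { sprintf("%-40d\n", $_) } 1 .. 50_000;

# Hypothetical helper: read all lines, call $work on each one,
# and return the line count plus the elapsed wall-clock seconds.
sub time_pass {
    my ($work) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my $t0    = time();
    my $lines = 0;
    while (my $line = <$fh>) {
        $work->($line);
        $lines++;
    }
    close $fh;
    return ($lines, time() - $t0);
}

my ($n_io, $io_only) = time_pass(sub { });                               # reading only
my ($n_ex, $with_ex) = time_pass(sub { my $f = substr($_[0], 3, 6) });   # reading + extraction

printf "%d lines, read only: %.3fs, read + substr: %.3fs\n",
    $n_io, $io_only, $with_ex;
```

With a real file, replacing the in-memory handle by the actual path would also include disk I/O in both measurements.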

Rob

kavita kulkarni

Apr 12, 2013, 7:23:52 AM
to Rob Dixon, Perl Beginners, Bob goolsby
Thanks all, got many ideas from you..

My script took ~7 min to run on a data file of ~50,000 lines with
substr()/unpack() enabled, and the same script took ~2 min after disabling
substr()/unpack().
That led me to the conclusion that substr/unpack is taking most of my
time (and that this is what I should reduce).

I have made changes like using "@_" instead of "shift", with a very small
performance improvement.

For "Devel::NYTProf", I need to check with my SA whether he will allow me to
install it.

Thanks again for your responses.

Regards,
Kavita :-)

Rob Dixon

Apr 12, 2013, 7:55:27 AM
to Perl Beginners, kavita kulkarni
On 12/04/2013 12:23, kavita kulkarni wrote:
>
> Thanks all, got many ideas from you..
>
> My script took ~7 min to run with data file of ~50,000 lines with
> substr()/unpack() enabled and same script took ~2 min after disabling
> substr()/unpack().
> That led me to the conclusion that substr/unpack is taking maximum of my
> time (and that I should reduce).
>
> I have updated changes like using "@_" instead of "shift", with very small
> performance improvement.
>
> For "Devel::NYTProf", I need to check if my SA if he will allow me to
> install.

I am worried about your figures. Two minutes for 50,000 lines is 2.4ms
per line, and that is a *huge* amount of time.

I think you should show us your data and code, as it sounds like
something is going wrong.

Rob

Paul Johnson

Apr 10, 2013, 6:32:37 AM
to kavita kulkarni, begi...@perl.org
On Wed, Apr 10, 2013 at 03:46:00PM +0530, kavita kulkarni wrote:
> Hi All,

Hello.

> I want to extract certain char sets (e.g. char 4 to 9, char 11 to 14 etc.)
> from each line of my file and store them in separate variables for further
> processing.
> My file size can vary from thousand to millions of lines.
> If I use perl in-built function substr() to data extraction, it has huge
> impact on performance.

Compared to what?

> Is there any alternative for this?

Perhaps unpack() or regular expressions, but I doubt either would be
much faster, if at all.

--
Paul Johnson - pa...@pjcj.net
http://www.pjcj.net

Michael Rasmussen

Apr 12, 2013, 10:52:13 PM
to kavita kulkarni, Rob Dixon, Perl Beginners, Bob goolsby
On Fri, Apr 12, 2013 at 04:53:52PM +0530, kavita kulkarni wrote:
> Thanks all, got many ideas from you..
>
> My script took ~7 min to run with data file of ~50,000 lines with
> substr()/unpack() enabled and same script took ~2 min after disabling
> substr()/unpack().

No one has asked what kind of hardware you're running this on, so I will.

Reading the thread, I created a very simplistic test:

michael@post:~$ wc -l /var/log/mail.info
973819 /var/log/mail.info
michael@post:~$ time perl -ne '$t = substr $_, 4, 9; $s = substr $_, 11, 15; print $t,$s,$/;' /var/log/mail.info > /dev/null

real 0m2.253s
user 0m2.104s
sys 0m0.148s
michael@post:~$

Over 970,000 lines processed with substr, extracting two substrings from positions described in an
earlier email of yours. Total processing time less than 3 seconds.
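
A note on the arguments: substr takes a zero-based offset and a length, not a start and end position, so the one-liner above reads 9 and 15 characters starting at offsets 4 and 11. The timing conclusion is unaffected, but to pull out exactly chars 4 to 9 and chars 11 to 14 (counted from 1) the calls would look like this sketch, which uses a made-up sample line:

```perl
use strict;
use warnings;

my $line = "0123456789ABCDEFGHIJ";    # made-up sample line

# As in the one-liner: 9 characters starting at offset 4.
my $as_tested = substr $line, 4, 9;

# Chars 4 to 9 (1-based): offset 3, length 6.
my $chars_4_to_9 = substr $line, 3, 6;

# Chars 11 to 14 (1-based): offset 10, length 4.
my $chars_11_to_14 = substr $line, 10, 4;

print "$as_tested $chars_4_to_9 $chars_11_to_14\n";
```

For this sample line the three values are "456789ABC", "345678" and "ABCD".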

I don't believe substr extracting strings is your bottleneck. We really could use some sample data and code to assist.


--
Michael Rasmussen, Portland Oregon
Be Appropriate && Follow Your Curiosity
Other Adventures: http://www.jamhome.us/ or http://gplus.to/MichaelRpdx
A special random fortune cookie fortune:
In general, they do what you want, unless you want consistency.
~ Larry Wall

Charles DeRykus

Apr 13, 2013, 2:49:18 AM
to kavita kulkarni, Rob Dixon, Perl Beginners, Bob goolsby
On Fri, Apr 12, 2013 at 4:23 AM, kavita kulkarni
<kavitah...@gmail.com> wrote:

> Thanks all, got many ideas from you..
>
> My script took ~7 min to run with data file of ~50,000 lines with
> substr()/unpack() enabled and same script took ~2 min after disabling
> substr()/unpack().
> ...
>
> For "Devel::NYTProf", I need to check if my SA if he will allow me to
> install


You can usually just install the module under your own directory...

See: perldoc -q "How do I keep my own module/library directory?"

--
Charles DeRykus