Need a little help


Rider

May 14, 2021, 3:43:22 AM
Hi experts,

I need a little Perl help. Here is the requirement (at a Unix shell).

I have a text file, entries.txt, with lines like the following (there are around 100 entries in the actual file). Each line has an email ID followed by a user ID, separated by a tab. Here are the first three lines:
========================
a...@google.com abc1
cd...@yahoo.com cde
x...@gmail.com xyz2
=========================
Now the Perl script should parse a big dump of data (a file called text.xml) and replace each email with its user ID (for example, every a...@google.com in the dump should be replaced by abc1, and so on). Can someone help me with the code?

The Perl script should work like this:

read the entries.txt file;
split each line into two entries
loop through the dump below (whatever is below __DATA__)
Replace each email entry with the corresponding user ID
Write all the updated data to a new file, updated.xml

__DATA__ (the below dump is in fact a file text.xml)
Hello world a...@google.com this is line 1
This is the second line with a lot text cd...@yahoo.com and much more
Here is the third line x...@gmail.com and lot of stuff here
One more line with a...@google.com


Now the output file, updated.xml should contain the following dump:
============
Hello world abc1 this is line 1
This is the second line with a lot text cde and much more
Here is the third line xyz2 and lot of stuff here
One more line with abc2
=============

Thanks in advance..
Ryder


Ben Bacarisse

May 14, 2021, 7:44:16 AM
Rider <clear...@yahoo.com> writes:

> Hi experts,
>
> I need a little perl help. Here is the requirement (at unix shell).
>
> I have a text file, entries.txt with the following lines (just giving
> a few lines here, but there are around 100 entries in the actual
> file). Each line has one email id' followed by a user id (both
> separated by a tab). I am just giving first three lines here.
> ========================
> a...@google.com abc1
> cd...@yahoo.com cde
> x...@gmail.com xyz2
> =========================
> Now the perl script should parse through a big dump of data (a file
> called text.xml) and replace the first email with the second entry
> (example: all a...@google.com entries in the dump should be replaced by
> abc1 and so on and so forth). Can someone help me with the code?

What have you tried? What bits are causing you trouble? If you just
want someone to write it for you, you may get lucky, but most people
prefer to help with learning rather than coding for free.

> Now the Perl script should be like this:
>
> read entries.txt file;
> separate each line (split) in to two entries
> loop through the below dump (whatever is below __DATA__)
> Replace the first email entry with the second user id
> Write all the updated data to a new file, updated.xml

Why must it be done like that? This very narrow prescription makes it
sound like coursework.

> __DATA__ (the below dump is in fact a file text.xml)
> Hello world a...@google.com this is line 1
> This is the second line with a lot text cd...@yahoo.com and much more
> Here is the third line x...@gmail.com and lot of stuff here
> One more line with a...@google.com
>
>
> Now the output file, updated.xml should contain the following dump:
> ============
> Hello world abc1 this is line 1
> This is the second line with a lot text cde and much more
> Here is the third line xyz2 and lot of stuff here
> One more line with abc2
> =============

I think that last abc2 should be abc1.

--
Ben.

Otto J. Makela

May 14, 2021, 12:24:39 PM
Ben Bacarisse <ben.u...@bsb.me.uk> wrote:

> What have you tried? What bits are causing you trouble? If you just
> want someone to write it for you, you may get lucky, but most people
> prefer to help with learning rather than coding for free.

Indeed. The quick-and-dirty approach I've used for this kind of thing is
to collect the strings and replacements into a hash, then build a regexp

$r='\\b('.join('|',map {...} sort {...} keys %myhash).')\\b';

(with appropriate regexp quoting for the individual keys with map {},
selecting the sort {} to put long strings first), then do something like

s/$r/$myhash{$1}/goe

on our whole target string.

I'm not terribly fond of using clunky string operations to build
regexps, and then there's the question of getting the regexp quoting
right. Is there some more elegant method people can think of?
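
A minimal sketch of the single-regexp approach described above, under some assumptions: the addresses and user IDs are hypothetical stand-ins for entries.txt, quotemeta is used for the map {} step, and the sort {} step orders keys by descending length. Using qr// here is a swapped-in choice that makes the /o flag unnecessary; it is one possible way to fill in the elided parts, not necessarily what the original code did.

```perl
use strict;
use warnings;

# Hypothetical mapping; in the thread this would be read from entries.txt.
my %myhash = (
    'alice@example.com'    => 'abc1',
    'alice@example.com.au' => 'abc9',
    'bob@example.org'      => 'cde',
);

# Longest keys first, so a longer address wins over one that is its prefix.
# quotemeta escapes regexp metacharacters such as "." and "@".
my $alternation = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %myhash;
my $r = qr/\b($alternation)\b/;

my $text = "Mail alice\@example.com or alice\@example.com.au, not bob\@example.org.\n";

# One pass over the text; each match is looked up in the hash.
(my $updated = $text) =~ s/$r/$myhash{$1}/g;
print $updated;
```

Because the alternation is tried left to right, putting long keys first is what keeps alice@example.com from matching inside alice@example.com.au.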
--
/* * * Otto J. Makela <o...@iki.fi> * * * * * * * * * */
/* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
/* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
/* * * Computers Rule 01001111 01001011 * * * * * * */

Wasell

May 14, 2021, 2:22:46 PM
Can I assume this is homework? Maybe we can have some fun with
it...

I'm not much of a Perl golfer, but here's an attempt:

#!/usr/bin/perl
{local$/;$s=<DATA>};@ARGV='entries.txt';for(map{[split' ']}<>){$s
=~s/\Q$_->[0]/$_->[1]/g};open$g,'>updated.xml';print$g $s;
__DATA__
Hello world a...@google.com this is line 1
This is the second line with a lot text cd...@yahoo.com and much more
Here is the third line x...@gmail.com and lot of stuff here
One more line with a...@google.com

It seems to follow the specification.

Rainer Weikusat

May 14, 2021, 3:02:40 PM
Wasell <usene...@wasell.eu> writes:
>> Hi experts,
>>
>> I need a little perl help. Here is the requirement (at unix shell).
>>
>> I have a text file, entries.txt with the following lines (just giving a
>> few lines here, but there are around 100 entries in the actual file).
>> Each line has one email id' followed by a user id (both separated by a
>> tab). I am just giving first three lines here.
>> ========================
>> a...@google.com abc1
>> cd...@yahoo.com cde
>> x...@gmail.com xyz2
>> =========================
>> Now the perl script should parse through a big dump of data (a file
>> called text.xml) and replace the first email with the second entry
>> (example: all a...@google.com entries in the dump should be replaced by
>> abc1 and so on and so forth). Can someone help me with the code?

[...]

> Can I assume this is homework? Maybe we can have some fun with
> it...
>
> I'm not much of a Perl golfer, but here's an attempt:
>
> #!/usr/bin/perl
> {local$/;$s=<DATA>};@ARGV='entries.txt';for(map{[split' ']}<>){$s
> =~s/\Q$_->[0]/$_->[1]/g};open$g,'>updated.xml';print$g $s;
> __DATA__
> Hello world a...@google.com this is line 1
> This is the second line with a lot text cd...@yahoo.com and much
> more
> Here is the third line x...@gmail.com and lot of stuff here
> One more line with a...@google.com
>
> It seems to follow the specification.

%m=map{split}`cat entries.txt`;
for(<DATA>){/\G\S+/gc&&(print($m{$&}//$&),redo);/\G\s+/gc&&(print($&),redo)}
__DATA__
Hello world a...@google.com this is line 1
This is the second line with a lot text cd...@yahoo.com and much more
Here is the third line x...@gmail.com and lot of stuff here
One more line with a...@google.com

:-)

Ben Bacarisse

May 14, 2021, 3:08:53 PM
o...@iki.fi (Otto J. Makela) writes:

> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>
>> What have you tried? What bits are causing you trouble? If you just
>> want someone to write it for you, you may get lucky, but most people
>> prefer to help with learning rather than coding for free.
>
> Indeed. The quick-and-dirty approach I've done in this kind of stuff is
> to collect the strings & replacements into a hash, then make a regexp
>
> $r='\\b('.join('|',map {...} sort {...} keys %myhash).')\\b';
>
> (with appropriate regexp quoting for the individual keys with map {},
> selecting the sort {} to put long strings first), then do something like
>
> s/$r/$myhash{$1}/goe
>
> on our whole target string.
>
> I'm not terribly fond of using clunky string operations to build
> regexps, and then there's the question of getting the regexp quoting
> right. Is there some more elegant method people can think of?

Not elegant, no, but I think I'd slurp the input and then loop over the
substitutions:

while (<$subs>) {
chomp;
my ($k, $s) = split /\t/;
$content =~ s/\b\Q$k\E\b/$s/g;
}

\Q and \E ensure the quoting is correct. And I stole your \b...\b
because I'd forgotten about that! I expect the OP wants it.
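
For reference, a self-contained sketch of the loop above. The in-memory strings and the addresses in them are hypothetical stand-ins for entries.txt and text.xml, and the "next unless defined" guard is an added assumption about how to treat malformed lines.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for entries.txt and the slurped text.xml.
my $entries = "alice\@example.com\tabc1\nbob\@example.org\tcde\n";
my $content = "Hello world alice\@example.com this is line 1\n"
            . "A second line with bob\@example.org and much more\n";

# Read the entries through an in-memory filehandle.
open(my $subs, '<', \$entries) or die "open: $!";
while (<$subs>) {
    chomp;
    my ($k, $s) = split /\t/;
    next unless defined $s;          # skip blank or malformed lines
    $content =~ s/\b\Q$k\E\b/$s/g;   # one full pass over $content per entry
}
close $subs;

print $content;
```

With real files, $content would typically be slurped by reading the handle with $/ set to undef (local $/;), and the result printed to a handle opened on updated.xml.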
--
Ben.

Rainer Weikusat

May 14, 2021, 3:10:10 PM
Actually, that's much too complicated:

%m=map{split}`cat entries.txt`;
s|\S+|$m{$&}//$&|ge,print for<DATA>;

Rainer Weikusat

May 14, 2021, 4:56:36 PM
Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:

[...]

> Not elegant, no, but I think I'd slurp the input and then loop over the
> substitutions:
>
> while (<$subs>) {
> chomp;
> my ($k, $s) = split /\t/;
> $content =~ s/\b\Q$k\E\b/$s/g;
> }

A pretty awful algorithm: the runtime will be proportional to the number
of substitutions times the length of the text, i.e., quadratic.

More defensively written alternate suggestion:

--------
my %subs;

{
my $fh;
open($fh, '<', 'entries.txt') or die("open: $!");
%subs = map { split } <$fh>;
}

for (<DATA>) {
s|\S+|$subs{$&} // $&|ge;
print;
}

__DATA__
Hello world a...@google.com this is line 1
This is the second line with a lot text cd...@yahoo.com and much more
Here is the third line x...@gmail.com and lot of stuff here
One more line with a...@google.com
-------

That's a linear algorithm as it makes just one pass through the input
data.

NB: I didn't benchmark this and the O-difference doesn't necessarily
mean it'll be faster in practice for realistic amounts of input data. It
also won't replace results of prior replacements which may or may not be
desired.
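
The last point can be illustrated with a small sketch. The entries are hypothetical and deliberately chosen so that one replacement value is itself a search key; the looped variant stands in for the substitution-per-entry approach.

```perl
use strict;
use warnings;

# Hypothetical entries where one replacement is itself a search key.
my @pairs = (
    [ 'old@example.com', 'new@example.com' ],
    [ 'new@example.com', 'user7' ],
);
my %subs = map { @$_ } @pairs;

my $line = "Contact old\@example.com today\n";

# One pass: each token is looked up once, so the result is not rescanned.
(my $one_pass = $line) =~ s|\S+|$subs{$&} // $&|ge;

# One substitution per entry: a later entry can rewrite an earlier result.
my $looped = $line;
for my $pair (@pairs) {
    my ($k, $s) = @$pair;
    $looped =~ s/\Q$k\E/$s/g;
}

print $one_pass;   # Contact new@example.com today
print $looped;     # Contact user7 today
```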

Ben Bacarisse

May 14, 2021, 6:29:53 PM
Rainer Weikusat <rwei...@talktalk.net> writes:

> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>
> [...]
>
>> Not elegant, no, but I think I'd slurp the input and then loop over the
>> substitutions:
>>
>> while (<$subs>) {
>> chomp;
>> my ($k, $s) = split /\t/;
>> $content =~ s/\b\Q$k\E\b/$s/g;
>> }
>
> A pretty awful algorithm: The runtime will be proportional to the number
> of substitutions times the length of the text, ie, quadratic.

I don't think that's technically quadratic, but I know what you mean.
It's pretty awful. This looked like a throw-away task, so I didn't care
about the O(mn) complexity.

> More defensively written alternate suggestion:
>
> --------
> my %subs;
>
> {
> my $fh;
> open($fh, '<', 'entries.txt') or die("open: $!");
> %subs = map { split } <$fh>;

(The OP had tab separated pairs)

> }
>
> for (<DATA>) {
> s|\S+|$subs{$&} // $&|ge;

This is likely to miss some expected cases in XML data since, say, <addr
mail="a...@google.com"> won't match a...@google.com.

> print;
> }
>
> __DATA__
> Hello world a...@google.com this is line 1
> This is the second line with a lot text cd...@yahoo.com and much more
> Here is the third line x...@gmail.com and lot of stuff here
> One more line with a...@google.com
> -------
>
> That's a linear algorithm as it makes just one pass through the input
> data.
>
> NB: I didn't benchmark this and the O-difference doesn't necessarily
> mean it'll be faster in practice for realistic amounts of input data. It
> also won't replace results of prior replacements which may or may not be
> desired.

Yup. What's fast, or fast enough, is going to depend on a lot of
details. But, sure, as the number of search strings grows, looping over
them will eventually kill the performance.

--
Ben.

Rainer Weikusat

May 14, 2021, 7:13:25 PM
Ben Bacarisse <ben.u...@bsb.me.uk> writes:
> Rainer Weikusat <rwei...@talktalk.net> writes:
>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>>> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>>
>> [...]
>>
>>> Not elegant, no, but I think I'd slurp the input and then loop over the
>>> substitutions:
>>>
>>> while (<$subs>) {
>>> chomp;
>>> my ($k, $s) = split /\t/;
>>> $content =~ s/\b\Q$k\E\b/$s/g;
>>> }
>>
>> A pretty awful algorithm: The runtime will be proportional to the number
>> of substitutions times the length of the text, ie, quadratic.
>
> I don't think that's technically quadratic, but I know what you mean.
> It's pretty awful. This looked like a throw-away task, so I didn't care
> about the O(mn) complexity.

The first time in my life I can do an actual mathematical proof: There
are two sets involved here with lengths n and m. The total running time
is proportional to n * m. There are two cases here:

1. n == m. In this case n * m = n * n which is obviously quadratic.

2. n < m or m < n; without loss of generality, n < m is assumed. In this
case, n * m = n * n * (m / n) [, m / n > 1 because n * m > n * n]. Hence,
it's quadratic as well.

:-))

>> More defensively written alternate suggestion:
>>
>> --------
>> my %subs;
>>
>> {
>> my $fh;
>> open($fh, '<', 'entries.txt') or die("open: $!");
>> %subs = map { split } <$fh>;
>
> (The OP had tab separated pairs)

split without arguments splits $_ on \s+. That's going to cover
tab-separated text.

>> }
>>
>> for (<DATA>) {
>> s|\S+|$subs{$&} // $&|ge;
>
> This is likely to miss some expected cases in XML data since, say, <addr
> mail="a...@goole.com"> won't match a...@goole.com.

It's supposed to work for the provided example. It's also going to miss
addresses at the end of a sentence, eg

His email address was woo...@chewbacca.com.
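
Both misses can be reproduced with a short sketch. The address and user ID are hypothetical, and the literal \Q...\E match is shown only for contrast with the whole-token lookup.

```perl
use strict;
use warnings;

my $email = 'alice@example.com';   # hypothetical address
my %subs  = ($email => 'abc1');

my @lines = (
    "plain $email here\n",
    "<addr mail=\"$email\">\n",
    "Her address was $email.\n",
);

my (@by_token, @by_literal);
for my $line (@lines) {
    # Whole-token lookup: the token must equal a hash key exactly.
    (my $t = $line) =~ s|\S+|$subs{$&} // $&|ge;
    # Literal match anywhere in the line.
    (my $l = $line) =~ s/\Q$email\E/$subs{$email}/g;
    push @by_token,   $t;
    push @by_literal, $l;
}

print "token:   $_" for @by_token;
print "literal: $_" for @by_literal;
```

Only the first line changes under the token lookup, because in the other two the address is glued to quotes, an angle bracket or a full stop; the literal match rewrites all three.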

Martin Vaeth

May 15, 2021, 2:25:51 AM
Rainer Weikusat <rwei...@talktalk.net> wrote:
> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>> Rainer Weikusat <rwei...@talktalk.net> writes:
>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>>>
>>> [...]
>>>
>>>> Not elegant, no, but I think I'd slurp the input and then loop over the
>>>> substitutions:
>>>>
>>>> while (<$subs>) {
>>>> chomp;
>>>> my ($k, $s) = split /\t/;
>>>> $content =~ s/\b\Q$k\E\b/$s/g;
>>>> }
>>>
>>> A pretty awful algorithm: The runtime will be proportional to the number
>>> of substitutions times the length of the text, ie, quadratic.
>>
>> I don't think that's technically quadratic, but I know what you mean.
>> It's pretty awful. This looked like a throw-away task, so I didn't care
>> about the O(mn) complexity.
>
> The first time in my life I can do an actual mathematical proof: There
> are two sets involved here with lengths n and m. The total running time
> is proportional to n * m. There are two cases here:
>
> 1. n == m. In this case n * m = n * n which is obviously quadratic.

So the worst case running time is quadratic in the input length,
and you are already done. (Usually, O(.) refers to the worst case.)

This is simultaneously the "average case" running time if one defines
the averaging in a natural way, but this is a bit harder to see.
(And one can argue about which definition of averaging is really natural
in this example; that is, it depends on the planned use case.)

> 2. n < m or m < n; without loss of generality, n < m is assumed. In this
> case, n * m = n * n * (m / n) [, m / n > 1 because n * m > n * n]. Hence,
> it's quadratic as well.

If you mean to say here that even in the "best case" you have quadratic
running time, you are wrong:
In the "best data" case you have n=1 or m=1 (or at least one of them
bounded by a constant), even though the input length n+m can be
arbitrarily long.
So the "best case" is only linear running time.
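
To put numbers on this, here is a small sketch that assumes the n * m cost model used in the thread and a hypothetical combined input size of 1000; it only tabulates the product for a few ways of dividing that total.

```perl
use strict;
use warnings;

my $total = 1000;   # hypothetical combined input size n + m

my %work;
for my $n (1, 10, 100, 500) {
    my $m = $total - $n;
    $work{$n} = $n * $m;   # cost under the n * m model
    printf "n=%4d m=%4d n*m=%7d\n", $n, $m, $work{$n};
}
```

With n=1 the product is 999, roughly the total itself, while the even split n=m=500 gives 250000, a quarter of the total squared.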

Otto J. Makela

May 15, 2021, 9:58:21 AM
Ben Bacarisse <ben.u...@bsb.me.uk> wrote:

> Not elegant, no, but I think I'd slurp the input and then loop over
> the substitutions:
>
> while (<$subs>) {
> chomp;
> my ($k, $s) = split /\t/;
> $content =~ s/\b\Q$k\E\b/$s/g;
> }
>
> \Q and \E ensure the quoting is correct. And I stole your \b...\b
> because I'd forgotten about that! I expect the OP wants it.

I believe your algorithm might fail if the replaced strings can be
substrings of each other, depending on the order they are presented?

OP's question of course didn't have any such cases, but since we're
talking algorithms here, it'd be nice if the edge cases also worked.
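
A sketch of the kind of failure meant here, using two hypothetical addresses where one is a prefix of the other. The helper replace_in_order is an illustrative name, not something from the thread; it applies the \b\Q...\E\b substitution once per entry in the order given.

```perl
use strict;
use warnings;

# Hypothetical entries where one address is a prefix of another.
my @entries = (
    [ 'alice@example.com',    'abc1' ],
    [ 'alice@example.com.au', 'abc9' ],
);

# Apply one substitution per entry, in the order given.
sub replace_in_order {
    my ($content, @pairs) = @_;
    for my $pair (@pairs) {
        my ($k, $s) = @$pair;
        $content =~ s/\b\Q$k\E\b/$s/g;
    }
    return $content;
}

my $text = "Write to alice\@example.com.au\n";

my $short_first = replace_in_order($text, @entries);
my $long_first  = replace_in_order($text, reverse @entries);

print $short_first;   # Write to abc1.au
print $long_first;    # Write to abc9
```

The \b after "com" still matches before the following ".", so the shorter key fires inside the longer address unless the longer one has already been replaced.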

Ben Bacarisse

May 15, 2021, 11:07:35 AM
o...@iki.fi (Otto J. Makela) writes:

> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>
>> Not elegant, no, but I think I'd slurp the input and then loop over
>> the substitutions:
>>
>> while (<$subs>) {
>> chomp;
>> my ($k, $s) = split /\t/;
>> $content =~ s/\b\Q$k\E\b/$s/g;
>> }
>>
>> \Q and \E ensure the quoting is correct. And I stole your \b...\b
>> because I'd forgotten about that! I expect the OP wants it.
>
> I believe your algorithm might fail if the replaced strings can be
> substrings of each other, depending on the order they are presented?

Yes, but we don't even know if the \b...\b is correct so I think that's
too fine a point for the specific case.

> OP's question of course didn't have any such cases, but since we're
> talking algorithms here, it'd be nice if also the edge cases worked.

Sure, but as already pointed out, if we are talking algorithms I don't
think you'd want to do it this way. Mine was a quick-and-dirty,
get-it-done-now solution.

--
Ben.

gamo

May 16, 2021, 5:54:00 PM
On 14/5/21 at 20:22, Wasell wrote:

> #!/usr/bin/perl
> {local$/;$s=<DATA>};@ARGV='entries.txt';for(map{[split' ']}<>){$s
> =~s/\Q$_->[0]/$_->[1]/g};open$g,'>updated.xml';print$g $s;
> __DATA__

It's a mystery to me why you use split' '
instead of the more golf-like split"\t". Could you explain?
Thanks!

--
http://gamo.sdf-eu.org/
perl -E 'say "[U]ndo or [c]ontinue? (y/N) ";'

Randal L. Schwartz

May 16, 2021, 9:43:10 PM
>>>>> "gamo" == gamo <ga...@telecable.es> writes:

gamo> It's a mystery to me why you use split' '
gamo> instead of the more golf-like split"\t". Could you explain?

One char in the string instead of two?
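
Beyond the character count, the two forms also behave differently on a line that still carries its newline, which may matter in a script that never calls chomp. A small sketch with a hypothetical entry line:

```perl
use strict;
use warnings;

my $line = "alice\@example.com\tabc1\n";   # hypothetical entry, newline included

my @by_space = split ' ', $line;    # awk-style: any whitespace run, newline dropped
my @by_tab   = split "\t", $line;   # tab only: the newline stays on the last field

printf "split ' '   gives a user id of length %d\n", length $by_space[1];
printf "split \"\\t\" gives a user id of length %d\n", length $by_tab[1];
```

With split ' ' the second field is abc1 (length 4); with split "\t" it is abc1 followed by the newline (length 5), which would then be written into the output as part of the replacement.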

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<mer...@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Dart/Flutter consulting, Technical writing, Comedy, etc. etc.
Still trying to think of something clever for the fourth line of this .sig

gamo

May 16, 2021, 9:50:10 PM
On 17/5/21 at 3:42, Randal L. Schwartz wrote:
>>>>>> "gamo" == gamo <ga...@telecable.es> writes:
>
> gamo> It's a mystery to me why you use split' '
> gamo> instead of the more golf-like split"\t". Could you explain?
>
> One char in the string instead of two?
>

Oh, yes, sorry. I didn't know if the quest was
about spacing or typing. Anyway, I think that
obfuscation can be done in any language,
and the possibility of being concise is not
a fault of the language, as I read it.

--
http://gamo.sdf-eu.org/
perl -E 'say "[W]ant a [m]isunderstood? (X/y) ";'