Tool for sequence searches

Guillaume Dargaud

Jun 5, 2012, 6:01:59 AM

Hello all,
What generic command-line tools can I use to search for missteps
in a sequence?
Let me explain better with some examples. Say I have a simple sequence of
numbers and some may be missing:
1
2
3
6
7
8
11
...
I'd like to obtain the gap limits, so here 3, 6, 8, 11...

My first thought is to write a loop in bash that increments a number and simply
checks whether it's present in the expected position, but I'm sure you guys can
come up with a better way.
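
For reference, the loop described above might look like this in plain shell. It is only a minimal sketch, assuming one integer per line, sorted in ascending order, with the sequence starting at 1. The function name `find_gaps` is made up for illustration.

```shell
# Hypothetical helper: reads one integer per line from stdin and prints
# "last_seen next_seen" for every gap. Assumes sorted input starting at 1.
find_gaps() {
    expected=1
    while read -r n; do
        if [ "$n" -gt "$expected" ]; then
            # The numbers expected..n-1 are missing; report both limits.
            echo "$((expected - 1)) $n"
        fi
        expected=$((n + 1))
    done
}

printf '%s\n' 1 2 3 6 7 8 11 | find_gaps
```

On the sample sequence this reports the pairs 3 6 and 8 11, one pair per line.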

Note that the real problem is more complicated (I should also detect
repetitions) so I'm open to various suggestions.

--
Guillaume Dargaud
http://www.gdargaud.net/

Janis Papanagnou

Jun 5, 2012, 6:46:53 AM

On 05.06.2012 13:01, Guillaume Dargaud wrote:
> Hello all,
> What generic command-line tools can I use to search for missteps
> in a sequence?
> Let me explain better with some examples. Say I have a simple sequence of
> numbers and some may be missing:
> 1
> 2
> 3
> 6
> 7
> 8
> 11
> ...
> I'd like to obtain the gap limits, so here 3, 6, 8, 11...

Here's one solution (where I output the corresponding two bounds
in one line which I think is clearer, but you may also write two
print statements instead)...

awk '++c!=$1 {print c-1,$1; c=$1}'
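
Spelled out over several lines, the one-liner reads roughly as follows. This is only a commented restatement of the same logic, wrapped in a function so it can be reused. The function name `gap_limits` and the renaming of `c` to `expected` are illustrative choices, and the script still assumes that the sequence starts at 1.

```shell
# Hypothetical wrapper around the one-liner above, with comments.
gap_limits() {
    awk '
        # Before the increment, expected holds the value of the previous
        # line (0 at the start); ++expected is what this line should be.
        ++expected != $1 {
            print expected - 1, $1   # last value before the break, first value after
            expected = $1            # resynchronize with the input
        }
    '
}

printf '%s\n' 1 2 3 6 7 8 11 | gap_limits
```

For the sample sequence this prints the pairs 3 6 and 8 11.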

>
> My first thought is to write a loop in bash that increments a number and simply
> checks whether it's present in the expected position, but I'm sure you guys can
> with a better way.
>
> Note that the real problem is more complicated (I should also detect
> repetitions) so I'm open to various suggestions.

The program that I suggested will show repetitions as lines with
two identical values, so an extension is straightforward, e.g.

awk '++c!=$1 {print (c-1!=$1 ? "gap:" : "rep:"),c-1,$1; c=$1}'
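
A quick way to see the labels in action is to feed it a sequence containing both a repetition and a gap. The sketch below restates the same logic with the label computed in a separate statement. The function name `gaps_and_reps` is hypothetical. Note that any mismatch other than an exact repeat of the previous value, including a value that goes backwards, ends up labeled "gap:".

```shell
# Hypothetical wrapper: same logic as the one-liner above, label split out.
gaps_and_reps() {
    awk '
        ++c != $1 {
            # c-1 is the previous value; equal to $1 means a repetition.
            label = (c - 1 != $1) ? "gap:" : "rep:"
            print label, c - 1, $1
            c = $1
        }
    '
}

printf '%s\n' 1 2 3 3 4 7 | gaps_and_reps
```

With this input the repeated 3 is reported as "rep: 3 3" and the jump from 4 to 7 as "gap: 4 7".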

Janis


Guillaume Dargaud

Jun 5, 2012, 8:33:37 AM

> The program that I suggested will show repetitions as lines with
> two identical values, so an extension is straightforward, e.g.

Thanks. I always have a hard time understanding awk scripts, but I'll look at
it more closely.

For simple repetitions, I usually use 'uniq -d', but in my case the lines
can repeat like this:
SomeRandomChars-123-SomeOtherChars
SomeOtherRandomChars-123-SomeOtherChars
So I can't simply use 'uniq'.

I'll see if I can adapt your awk script. Thanks

Janis Papanagnou

Jun 5, 2012, 8:49:24 AM

On 05.06.2012 15:33, Guillaume Dargaud wrote:
>> The program that I suggested will show repetitions as lines with
>> two identical values, so an extension is straightforward, e.g.
>
> Thanks. I always have a hard time understanding awk scripts, but I'll look at
> it more closely.

If you have concrete questions, feel free to ask.

>
> For simple repetitions, I usually use 'uniq -d', but in my case the lines
> can repeat like this:
> SomeRandomChars-123-SomeOtherChars
> SomeOtherRandomChars-123-SomeOtherChars
> So I can't simply use 'uniq'.

IIUC, repetitions are defined by the suffix?
Then the awk program could be extended in two steps; define the FS="-"
so that you can access the three parts individually as $1, $2, and $3,
and then compare only the relevant parts of the whole line, in your case
probably the concatenation $2 $3 (or maybe only the mid part $2 ?).
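
As a rough illustration of those two steps, here is a sketch that splits on "-" and compares only the middle part. It assumes that the number of interest is always the second "-"-separated field, that the prefix never contains a "-" itself, and that the lines are already in ascending order of that number. The function name `check_ids` is made up. Unlike the earlier one-liner, it compares each line with the previous one, so the sequence does not need to start at 1.

```shell
# Hypothetical helper: report repetitions and gaps in the middle field
# of lines shaped like "prefix-NUMBER-suffix".
check_ids() {
    awk -F'-' '
        NR > 1 {
            id = $2 + 0                       # force a numeric comparison
            if (id == prev)          print "rep:", prev, id
            else if (id != prev + 1) print "gap:", prev, id
        }
        { prev = $2 + 0 }                     # remember this number for the next line
    '
}

printf '%s\n' \
    'SomeChars-122-SomeOtherChars' \
    'SomeRandomChars-123-SomeOtherChars' \
    'SomeOtherRandomChars-123-SomeOtherChars' \
    'YetMoreChars-126-SomeOtherChars' | check_ids
```

Here the two lines carrying 123 are reported as "rep: 123 123" and the jump to 126 as "gap: 123 126". To treat the suffix as part of the key instead, one would compare the concatenation $2 $3 as a string.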

Janis

Ed Morton

Jun 5, 2012, 11:05:08 AM

If you want a real solution, post the real problem. Often the solutions to
simple problems aren't extensible to more complex problems and then everyone
gets frustrated that you didn't just post the real problem in the first place
and so wasted everyone's time helping you solve a non-existent problem.

Ed.

Thomas 'PointedEars' Lahn

Jun 5, 2012, 2:54:20 PM

Guillaume Dargaud wrote:

> What generic command-line tools can I use to search for missteps
> in a sequence?

For example the bash debugger.

--
PointedEars

Please do not Cc: me.

bsh

Jun 5, 2012, 10:50:32 PM

On Jun 5, 3:01 am, Guillaume Dargaud
<use_the_contact_f...@www.gdargaud.net> wrote:
> What generic command-line tools can I use to search for missteps
> in a sequence?

> ...
> My first thought is to write a loop in bash that increments a number and simply
> checks whether it's present in the expected position, but I'm sure you guys can
> with a better way.

Does the following help? It finds the _first_ missing number of the
sequence, not all of them, but perhaps it can be inserted into a
loop to iteratively capture all of them in turn, after processing.

http://groups.google.com/group/comp.unix.shell/browse_thread/thread/d8bb72ff1e936aa1/2b484f1e13d4c8a5?q=%22find+lowest+alarm+number%22+group:comp.unix.shell

=Brian

Ed Morton

Jun 6, 2012, 12:40:54 PM

Seems kinda wordy compared to:

awk '$0 != ++expected{print expected; exit}'
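
For illustration, the one-liner can be wrapped in a function and run on the sample sequence from the original post. The second function is not from this thread: it is a sketch of one way to list every missing number in a single pass instead of looping, assuming sorted integer input, one per line. Both function names are made up.

```shell
# The one-liner above: print the first number missing from a sequence
# that is expected to start at 1, then stop reading.
first_missing() {
    awk '$0 != ++expected { print expected; exit }'
}

# Sketch (assumption, not from the thread): print every missing number
# by walking from the previous value up to the current one.
all_missing() {
    awk '{ for (n = prev + 1; n < $1; n++) print n; prev = $1 }'
}

printf '%s\n' 1 2 3 6 7 8 11 | first_missing
printf '%s\n' 1 2 3 6 7 8 11 | all_missing
```

On that input, `first_missing` prints 4, and `all_missing` prints 4, 5, 9 and 10 on separate lines.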

Regards,

Ed.


bsh

Jun 6, 2012, 8:03:42 PM

On Jun 6, 9:40 am, "Ed Morton" <mortons...@gmail.com> wrote:

> bsh <brian_hi...@rocketmail.com> wrote:
> > On Jun 5, 3:01 am, Guillaume Dargaud
> > <use_the_contact_f...@www.gdargaud.net> wrote:

> http://groups.google.com/group/comp.unix.shell/browse_thread/thread/d...

> Seems kinda wordy compared to:
> awk '$0 != ++expected{print expected; exit}'

Hmmm. I anticipated this comment, but frankly, I thought
that it had been sufficiently hashed out several years ago....

Wordy? Well, yes and no. Reminds me of the amusing
programmer humor that purports to program "Hello World"
as a newbie, intermediate, graduate student, and professional
programmer. As the latter, his "Hello World" is a 200-line
C++ class....

Scriptarians will presumably preternaturally elevate the
criterion of _character count_ to quantize elegance, er,
wordiness. Well, suum cuique pulchrum est....

(I would have myself made an argument around the criterion
of elegance, not wordiness).

And inasmuch as the "wordiness" is very much there, but
well hidden by use of the abstraction of the VHLL awk
interpreter... _and_ my code is two orders of magnitude more
efficient for small data samples, my solution cannot not be
no not-small exercise in not-wordiness, now can it?

(Or I could have just been snarky, and have said, "Take it to
comp.lang.awk!" -- but I wouldn't do that: the one-liner _is_
kinda nice, but then I knew that, my provided solution coming
after having seen that one).

=Brian

Ed Morton

Jun 7, 2012, 2:50:41 AM

On 6/6/2012 7:03 PM, bsh wrote:
> On Jun 6, 9:40 am, "Ed Morton"<mortons...@gmail.com> wrote:
>> bsh<brian_hi...@rocketmail.com> wrote:
>>> On Jun 5, 3:01 am, Guillaume Dargaud
>>> <use_the_contact_f...@www.gdargaud.net> wrote:
>> http://groups.google.com/group/comp.unix.shell/browse_thread/thread/d...
>
>> Seems kinda wordy compared to:
>> awk '$0 != ++expected{print expected; exit}'
>
> Hmmm. I anticipated this comment, but frankly, I thought
> that it had been sufficiently hashed out several years ago....
>
> Wordy? Well, yes and no. Reminds me of the amusing
> programmer humor that purports to program "Hello World"
> as a newbie, intermediate, graduate student, and professional
> programmer. As the latter, his "Hello World" is a 200-line
> C++ class....

...and he is fired for not getting his product to market before Microsoft, and
his job is outsourced to a newbie who will do it next time in one line and in a
fraction of the development time. I get what you're trying to say, but
spending the time to write the most robust, extensible, efficient code possible
isn't always the best idea.

> Scriptarians will presumably preternaturally elevate the
> criterion of _character count_ to quantize elegance, er,
> wordiness. Well, suum cuique pulchrum est....
>
> (I would have myself made an argument around the criterion
> of elegance, not wordiness).
>
> And inasmuch as the "wordiness" is very much there, but
> well hidden by use of the abstraction of the VHLL awk
> interpreter... _and_ my code is two orders of magnitude more
> efficient for small data samples, my solution cannot not be
> no not-small exercise in not-wordiness, now can it?
>
> (Or I could have just been snarky, and have said, "Take it to
> comp.lang.awk!" -- but I wouldn't do that: the one-liner _is_
> kinda nice, but then I knew that, my provided solution coming
> after having seen that one).
>
> =Brian

So you knew the OP could output the first missing number with a trivial
one-liner and you suggested he do it with a fairly lengthy script instead. Would
you mind sharing why that'd be the preferred approach?

Ed.