Hello all,
What generic command-line tools can I use to search for missteps in a sequence?
Let me explain better with some examples. Say I have a simple sequence of numbers and some may be missing:
1
2
3
6
7
8
11
...
I'd like to obtain the gap limits, so here 3, 6, 8, 11...
My first thought is a loop in bash that increments a counter and simply checks whether it's present at the expected position, but I'm sure you guys can come up with a better way.
Note that the real problem is more complicated (I should also detect repetitions) so I'm open to various suggestions.
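For reference, the loop idea could look roughly like the sketch below. It assumes one integer per line and a sequence that starts at 1; the function name find_gaps is made up for illustration.

```shell
# Plain-shell sketch of the "increment and compare" idea.
# Assumes one integer per line, starting at 1.
find_gaps() {
    expected=1
    while read -r n; do
        if [ "$n" -ne "$expected" ]; then
            # Print the last value before the break and the value found.
            echo "$((expected - 1)) $n"
        fi
        expected=$((n + 1))
    done
}

printf '%s\n' 1 2 3 6 7 8 11 | find_gaps
# prints:
# 3 6
# 8 11
```

A repeated value shows up as a pair of identical numbers (e.g. "2 2"), since the line read is one less than expected.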
> Hello all,
> what generic command line tools can I use when dealing with searching
> sequences missteps ?
> Let me explain better with some examples. Say I have a simple sequence of
> numbers and some may be missing:
> 1
> 2
> 3
> 6
> 7
> 8
> 11
> ...
> I'd like to obtain the gap limits, so here 3, 6, 8, 11...
Here's one solution (where I output the corresponding two bounds
in one line which I think is clearer, but you may also write two
print statements instead)...
awk '++c!=$1 {print c-1,$1; c=$1}'
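Spelled out with comments, the same program might read as follows. This is only an expanded sketch of the one-liner above; the wrapper name show_gaps is hypothetical, and it assumes the sequence starts at 1 (the counter c starts at 0).

```shell
# Expanded form of the one-liner; show_gaps is an illustrative name.
show_gaps() {
    awk '
        {
            c++                  # value expected on this line
            if (c != $1) {       # mismatch: a gap or a repetition
                print c - 1, $1  # previous value, then the value found
                c = $1           # resynchronise the counter
            }
        }
    '
}

printf '%s\n' 1 2 3 6 7 8 11 | show_gaps
# prints:
# 3 6
# 8 11
```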
> My first thought is to do a loop in bash that inc a number and simply check
> if it's present in the expected position, but I'm sure you guys can come up
> with a better way.
> Note that the real problem is more complicated (I should also detect
> repetitions) so I'm open to various suggestions.
The program that I suggested will show repetitions as lines with
two identical values, so an extension is straightforward, e.g.
> The program that I suggested will show repetitions as lines with
> two identical values, so an extension is straightforward, e.g.
Thanks. I always have a hard time understanding awk scripts, but I'll take a closer look at it.
For simple repetitions, I usually use 'uniq -d', but in my case the lines can repeat like this:
SomeRandomChars-123-SomeOtherChars
SomeOtherRandomChars-123-SomeOtherChars
So I can't simply use 'uniq'.
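One way 'uniq -d' could still be used is to cut out the key first. A minimal sketch, assuming the key is the second dash-separated field and the surrounding parts contain no dashes themselves; the function name and the third sample line are made up:

```shell
# Extract the middle field, then let uniq report adjacent repeats.
repeated_keys() {
    cut -d- -f2 | uniq -d
}

printf '%s\n' \
    'SomeRandomChars-123-SomeOtherChars' \
    'SomeOtherRandomChars-123-SomeOtherChars' \
    'YetAnother-124-SomeOtherChars' | repeated_keys
# prints:
# 123
```

Note that uniq only compares adjacent lines, so this relies on the input already being in sequence order.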
>> The program that I suggested will show repetitions as lines with
>> two identical values, so an extension is straightforward, e.g.
> Thanks. I always have a hard time understanding awk scripts but I'll look at
> it better.
If you have concrete questions, feel free to ask.
> For simple repetitions, I usually use 'uniq -d', but in my case the lines
> can repeat like this:
> SomeRandomChars-123-SomeOtherChars
> SomeOtherRandomChars-123-SomeOtherChars
> So I can't simply use 'uniq'.
IIUC, repetitions are defined by the suffix?
Then the awk program could be extended in two steps; define the FS="-"
so that you can access the three parts individually as $1, $2, and $3,
and then compare only the relevant parts of the whole line, in your case
probably the concatenation $2 $3 (or maybe only the mid part $2 ?).
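A rough sketch of that extension, assuming only the middle part $2 matters and that it is numeric (the function name check_ids and the sample data are hypothetical):

```shell
# Split on "-" and compare only the middle field of consecutive lines.
check_ids() {
    awk -F- '
        { id = $2 + 0 }                          # force a numeric value
        NR > 1 && id == prev    { print "repeat", id }
        NR > 1 && id > prev + 1 { print "gap", prev, id }
        { prev = id }
    '
}

printf '%s\n' a-1-x b-2-x c-2-x d-5-x | check_ids
# prints:
# repeat 2
# gap 2 5
```

If the suffix should count as well, the comparison could use the concatenation $2 $3 for the repeat test instead.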
> Hello all,
> what generic command line tools can I use when dealing with searching
> sequences missteps ?
> Let me explain better with some examples. Say I have a simple sequence of
> numbers and some may be missing:
> 1
> 2
> 3
> 6
> 7
> 8
> 11
> ...
> I'd like to obtain the gap limits, so here 3, 6, 8, 11...
> My first thought is to do a loop in bash that inc a number and simply check
> if it's present in the expected position, but I'm sure you guys can come up
> with a better way.
> Note that the real problem is more complicated (I should also detect
> repetitions) so I'm open to various suggestions.
If you want a real solution, post the real problem. Solutions to simple problems often don't extend to more complex ones, and then everyone gets frustrated that you didn't post the real problem in the first place and so wasted everyone's time solving a non-existent one.
<use_the_contact_f...@www.gdargaud.net> wrote:
> What generic command line tools can I use when dealing with searching
> sequences missteps ?
> ...
> My first thought is to do a loop in bash that inc a number and simply check
> if it's present in the expected position, but I'm sure you guys can come up
> with a better way.
Does the following help? It finds the _first_ missing digit of the
sequence, not all of them, but perhaps it can be inserted into a
loop to iteratively capture all of them in turn, after processing.
bsh <brian_hi...@rocketmail.com> wrote:
> On Jun 5, 3:01 am, Guillaume Dargaud
> <use_the_contact_f...@www.gdargaud.net> wrote:
> > What generic command line tools can I use when dealing with searching
> > sequences missteps ?
> > ...
> > My first thought is to do a loop in bash that inc a number and simply check
> > if it's present in the expected position, but I'm sure you guys can come up
> > with a better way.
> Does the following help? It finds the _first_ missing digit of the
> sequence, not all of them, but perhaps it can be inserted into a
> loop to iteratively capture all of them in turn, after processing.
Hmmm. I anticipated this comment, but frankly, I thought
that it had been sufficiently hashed out several years ago....
Wordy? Well, yes and no. Reminds me of the amusing
programmer humor that purports to program "Hello World"
as a newbie, intermediate, graduate student, and professional
programmer. As the latter, his "Hello World" is a 200-line
C++ class....
Scriptarians will presumably preternaturally elevate the
criterion of _character count_ to quantify elegance, er,
wordiness. Well, suum cuique pulchrum est....
(I would have myself made an argument around the criterion
of elegance, not wordiness).
And inasmuch as the "wordiness" is very much there, but
well hidden by use of the abstraction of the VHLL awk
interpreter... _and_ my code is two orders of magnitude more
efficient for small data samples, my solution cannot not be
no not-small exercise in not-wordiness, now can it?
(Or I could have just been snarky, and have said, "Take it to
comp.lang.awk!" -- but I wouldn't do that: the one-liner _is_
kinda nice, but then I knew that, my provided solution coming
after having seen that one).
> Hmmm. I anticipated this comment, but frankly, I thought
> that it had been sufficiently hashed out several years ago....
> Wordy? Well, yes and no. Reminds me of the amusing
> programmer humor that purports to program "Hello World"
> as a newbie, intermediate, graduate student, and professional
> programmer. As the latter, his "Hello World" is a 200-line
> C++ class....
...and he is fired for not getting his product to market before Microsoft, and his job is outsourced to a newbie who will do it next time in one line and a fraction of the development time. I get what you're trying to say, but spending the time to write the most robust, extensible, efficient code possible isn't always the best idea.
> Scriptarians will presumably preternaturally elevate the
> criterion of _character count_ to quantify elegance, er,
> wordiness. Well, suum cuique pulchrum est....
> (I would have myself made an argument around the criterion
> of elegance, not wordiness).
> And inasmuch as the "wordiness" is very much there, but
> well hidden by use of the abstraction of the VHLL awk
> interpreter... _and_ my code is two orders of magnitude more
> efficient for small data samples, my solution cannot not be
> no not-small exercise in not-wordiness, now can it?
> (Or I could have just been snarky, and have said, "Take it to
> comp.lang.awk!" -- but I wouldn't do that: the one-liner _is_
> kinda nice, but then I knew that, my provided solution coming
> after having seen that one).
> =Brian
So you knew the OP could output the first missing number with a trivial one-liner and you suggested he do it with a fairly lengthy script instead. Would you mind sharing why that'd be the preferred approach?
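For illustration, such a one-liner might be a small variation on the awk program posted earlier in the thread. This is a sketch under assumptions, not the code referred to above: it expects a strictly increasing sequence of integers starting at 1, and first_missing is a made-up name.

```shell
# Print the first expected number that does not appear, then stop.
first_missing() {
    awk '++c != $1 { print c; exit }'
}

printf '%s\n' 1 2 3 6 7 8 11 | first_missing
# prints:
# 4
```

It prints nothing when the sequence has no gap.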