
Is There a Way to Search for Characters Separated by an Unknown Number of Characters in Between?


Alexander Smith

Aug 16, 2011, 4:37:01 PM

I am searching my 16000-by-9 array of characters (no spaces). That works
well for 3-letter sequences, but now I need to search the rows of the
array for 3-letter sequences where the 3 letters are not next to each other.

I don't have RegExp in this APL, so I must do it in APL2.

For example, searching for 'ACG' in this row

'ATTTUCUTUG' would return the indices of 'ACG' as 1 6 9


Thanks,
Alexi

Rav

Aug 16, 2011, 5:06:35 PM

First, in your example, it should return 1 6 10 (in origin 1, at any
rate); the string in your example has 10 characters in it, not 9.

I'd be glad to help here but I'm not sure what both arguments are and
exactly what result you want. You are searching in a 16000 by 9 array
of characters, that's clear. But are you searching for JUST ONE three
element character vector, or an Nx3 matrix of characters? If searching
for each row in that Nx3 matrix, do you want to know ALL the rows where
ALL THREE characters appear SOMEwhere in those rows? What if only SOME
of the three characters appear in a row? How do you want the result to
look? For example, what if the first row of the Nx3 appears in 5 rows of
your 16000x9, and the second row doesn't appear in any of the rows, and
the third row appears in 2 of the rows? Perhaps I'm making this overly
complex, so help us out by providing a larger example of both arguments,
and what the result would be for those arguments.

Phil Last

Aug 16, 2011, 5:07:01 PM
On Aug 16, 8:37 pm, Alexander Smith <doesonsm...@googlemail.com>
wrote:

Question: In the row:
'ATTTGCTUG' would the index be 1 6 5; 1 6 9; or would it be
disqualified or impossible?

Paul H

Aug 16, 2011, 5:47:35 PM
"Alexander Smith" <doeso...@googlemail.com> wrote in message
news:192c44a9-e16c-41e9...@g8g2000prn.googlegroups.com...


Brute force: with ⎕IO of zero, A == character array, B == 3 char
string to search for.
Returns a binary vector with 1's where a match is found in a row.

∇ R ← A S B
R ← 0, ∨\ 0 ¯1 ↓ A=B[0]
R ← 0, ∨\ 0 ¯1 ↓ R ∧ A=B[1]
R ← ∨/ R ∧ A=B[2]
∇

...Paul
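
The row-wise yes/no test above can also be read as a plain "letters occur in order, gaps allowed" check. A minimal Python sketch of that idea follows, assuming each row is an ordinary string; the name `contains_in_order` is invented here for illustration and is not part of the APL function.

```python
def contains_in_order(row, pattern):
    """Return True if the letters of pattern occur in row, in order, gaps allowed."""
    pos = 0
    for letter in pattern:
        # Look for this letter at or after the position following the previous hit.
        pos = row.find(letter, pos)
        if pos < 0:
            return False
        pos += 1
    return True


rows = ["ATTUCUTUG", "GGGGGGGGG"]
# One flag per row, like the binary result vector described above.
hits = [contains_in_order(row, "ACG") for row in rows]
```

Like the APL version, this only reports whether a row matches, not where the letters were found.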


Alexander Smith

Aug 16, 2011, 6:53:45 PM
On Aug 16, 9:06 pm, Rav <Pa...@cais.com> wrote:
> On 8/16/2011 4:37 PM, Alexander Smith wrote:
>
>
>
> > I am searching my 16000-by-9 array of characters (no spaces). That works
> > well for 3-letter sequences, but now I need to search the rows of the
> > array for 3-letter sequences where the 3 letters are not next to each other.
>
> > I don't have RegExp in this APL, so I must do it in APL2.
>
> > For example, searching for 'ACG' in this row
>
> > 'ATTTUCUTUG' would return the indices of 'ACG' as 1 6 9
>
> > Thanks,
> > Alexi
>
> First, in your example, it should return 1 6 10 (in origin 1, at any
> rate); the string in your example has 10 characters in it, not 9.

My counting was wrong!

Use 'ATTUCUTUG' instead.

>
> I'd be glad to help here but I'm not sure what both arguments are and
> exactly what result you want.  You are searching in a 16000 by 9 array
> of characters, that's clear.  But are you searching for JUST ONE three
> element character vector, or an Nx3 matrix of characters?  

I have another array, of shape 800 3, with 3-letter sequences in it.

I need to search the big 16000-by-9 array with each row of the
800-by-3 array.

So I am searching for the 3-letter sequences one at a time, wherever
they may be. There may be more than one in each 9-character row, or
none. But I am only doing one at a time. If you have a way to search the
whole 16000-by-9 array for any of the 800-by-3 sequences, that is good.
But I always just want to find the 3-letter matches, wherever they may
be in the row, even if other characters are in between.

Thanks,
Alexi

Graham

Aug 17, 2011, 4:39:25 AM

"Alexander Smith" <doeso...@googlemail.com> wrote in message news:e94ea8bc-e302-448c...@p19g2000yqa.googlegroups.com...

On Aug 16, 9:06pm, Rav <Pa...@cais.com> wrote:
> On 8/16/2011 4:37 PM, Alexander Smith wrote:

[Snip]

> I have another array, of shape 800 3, with 3-letter sequences in it.

> I need to search the big 16000-by-9 array with each row of the
> 800-by-3 array.

> But I always just want to find the 3-letter matches, wherever they may
> be in the row, even if other characters are in between.

There is still not enough information. Do the letters have to occur in the same order? What happens if letters are duplicated in the three-letter sequence? What happens if some, but possibly not all, of the letters are duplicated in the 9-letter sequence?

What, for example, would you expect from 'ATTUCUTUG' index of 'TTU'?

As other posters have requested, please provide examples showing, say, three or more rows of each of your A and B matrices and your expected result. Please select the rows in these examples to demonstrate the possibilities you wish to cover, as I have tried to do above.

Graham.

Alexander Smith

Aug 17, 2011, 7:03:31 AM

1 6 9

The search is always for just the 3 letters, in the order given.
Assume only one search per row. There are no spaces anywhere. We know
how to find the characters when they are next to one another; the
question is how to search when other characters are in between. In the
example, we should get 1 6 9 back as the indices.

Thanks,
Alexi

Alexander Smith

Aug 17, 2011, 7:32:54 AM
On Aug 17, 8:39 am, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>
wrote:
> "Alexander Smith" <doesonsm...@googlemail.com> wrote in messagenews:e94ea8bc-e302-448c...@p19g2000yqa.googlegroups.com...

Example for Rav, Graham, Phil and everyone


CTGGTTGAT
GTAGTCATA
CATGTCTAA
TCGAAAGTT
CCGGAGAAG

OK, in the example above we have some intron snippets. Let's say I am
looking for 'TAG'.
I get a hit in the second row, at indices 2 3 4.
I get a hit in the 4th row, at indices 1 4 7. Notice that with repeated
characters the hit is only on the first one found: the first T found,
then the first A found and the first G found trigger the hit, and we are
done for that row. So if we see TTTAGAAGG, the indices for 'TAG'
should come back as 1 4 5. Later on I would like to pursue the other
hits, but that is too complicated to try now and I am just learning.
Probably easy stuff for you gurus!

No reverse-order search is needed, just a forward search as shown. Is
that good?

Thanks,
Alexi
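
The first-hit rule described in this post can be sketched in Python as a single forward scan per row. This is only an illustration under the assumption that each row is a plain string and that indices are reported in origin 1; the function name `find_in_order` is made up for the sketch.

```python
def find_in_order(row, pattern):
    """Return the 1-based positions of the first in-order match, or None if there is none."""
    positions = []
    start = 0
    for letter in pattern:
        # First occurrence of this letter after the previous letter's hit.
        idx = row.find(letter, start)
        if idx < 0:
            return None
        positions.append(idx + 1)  # convert to origin 1
        start = idx + 1
    return positions


rows = ["CTGGTTGAT", "GTAGTCATA", "CATGTCTAA", "TCGAAAGTT", "CCGGAGAAG"]
for number, row in enumerate(rows, start=1):
    print(number, find_in_order(row, "TAG"))
```

On the five example rows this reports hits only for rows 2 and 4, at 2 3 4 and 1 4 7, and TTTAGAAGG gives 1 4 5.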

Phil Last

Aug 17, 2011, 8:28:44 AM
On Aug 17, 11:32 am, Alexander Smith <doesonsm...@googlemail.com>
wrote:

So let's say the entire search is restricted to the above five rows of
A and the triples (2 3⍴'TAGGAC') for B.
The required result must be either a very sparse 5 by 2 by 3 array (or
some transposition thereof) or a more relational format: a five-column
array, say, whose first two columns are the indices of the row hits in
A and B and whose final three columns are the indices of the hits
themselves. The latter would probably be very much smaller.
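
The relational format suggested here could look like the following Python sketch, which applies the forward first-hit rule from the earlier example post to every pair of rows. The helper names `find_in_order` and `relational_hits` are hypothetical, and the rows are assumed to be plain strings with origin-1 indices.

```python
def find_in_order(row, pattern):
    """Return the 1-based positions of the first in-order match, or None."""
    positions = []
    start = 0
    for letter in pattern:
        idx = row.find(letter, start)
        if idx < 0:
            return None
        positions.append(idx + 1)
        start = idx + 1
    return positions


def relational_hits(a_rows, b_rows):
    """One tuple per hit: (row in A, row in B, then the hit positions)."""
    table = []
    for i, row in enumerate(a_rows, start=1):
        for j, pattern in enumerate(b_rows, start=1):
            positions = find_in_order(row, pattern)
            if positions is not None:
                table.append((i, j, *positions))
    return table


a_rows = ["CTGGTTGAT", "GTAGTCATA", "CATGTCTAA", "TCGAAAGTT", "CCGGAGAAG"]
b_rows = ["TAG", "GAC"]
for hit in relational_hits(a_rows, b_rows):
    print(*hit)
```

Only pairs that actually match produce a row, which is why this form stays small compared with the full 5 by 2 by 3 array.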

Phil Last

Aug 17, 2011, 9:25:48 AM

In another thread on a similar subject someone pointed out the limited
domain of the right argument. In this case, appearing as it does that
the domain of a single triple is restricted to 'ACGT' and possibly
'U', we are absolutely restricted to either 64:
∘.,/3⍴⊂'ACGT'
AAA AAC AAG AAT
ACA ACC ACG ACT
AGA AGC AGG AGT
ATA ATC ATG ATT

CAA CAC CAG CAT
CCA CCC CCG CCT
CGA CGC CGG CGT
CTA CTC CTG CTT

GAA GAC GAG GAT
GCA GCC GCG GCT
GGA GGC GGG GGT
GTA GTC GTG GTT

TAA TAC TAG TAT
TCA TCC TCG TCT
TGA TGC TGG TGT
TTA TTC TTG TTT
or 125
∘.,/3⍴⊂'ACGTU'
AAA AAC AAG AAT AAU
ACA ACC ACG ACT ACU
AGA AGC AGG AGT AGU
ATA ATC ATG ATT ATU
AUA AUC AUG AUT AUU

CAA CAC CAG CAT CAU
CCA CCC CCG CCT CCU
CGA CGC CGG CGT CGU
CTA CTC CTG CTT CTU
CUA CUC CUG CUT CUU

GAA GAC GAG GAT GAU
GCA GCC GCG GCT GCU
GGA GGC GGG GGT GGU
GTA GTC GTG GTT GTU
GUA GUC GUG GUT GUU

TAA TAC TAG TAT TAU
TCA TCC TCG TCT TCU
TGA TGC TGG TGT TGU
TTA TTC TTG TTT TTU
TUA TUC TUG TUT TUU

UAA UAC UAG UAT UAU
UCA UCC UCG UCT UCU
UGA UGC UGG UGT UGU
UTA UTC UTG UTT UTU
UUA UUC UUG UUT UUU

so it's probably worth removing duplicates from the 800 to begin with.
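
The two counts are simply 4×4×4 = 64 and 5×5×5 = 125. A short Python sketch that enumerates the triples and removes duplicates from a list of search rows while keeping their order; `unique_rows` is a name invented for this illustration.

```python
from itertools import product

# Every ordered triple over each alphabet.
triples_acgt = ["".join(t) for t in product("ACGT", repeat=3)]
triples_acgtu = ["".join(t) for t in product("ACGTU", repeat=3)]
print(len(triples_acgt), len(triples_acgtu))


def unique_rows(rows):
    """Drop repeated rows, keeping the first occurrence of each."""
    return list(dict.fromkeys(rows))


print(unique_rows(["TAG", "GAC", "TAG"]))
```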

Alexander Smith

Aug 17, 2011, 9:55:22 AM
On Aug 17, 12:28 pm, Phil Last <phil.l...@ntlworld.com> wrote:

That is good, but not good enough. In future I will use many letters of
the alphabet to save work and keep my eyes from falling out (maybe I
will make 'X' stand for certain sequences, for example). So it is not
just a few letters; assume all letters are possible in the 3-character
and 9-character arrays.

Alexi

Alexander Smith

Aug 17, 2011, 9:59:40 AM

Duplicates were already removed. Genetic sequences usually use just a
set of a few characters, but later I have to add many letters to stand
for different sequences, so I need to assume all 26 letters of the
alphabet. That is why in my first postings I just said array of names
instead of going into details. It gets more complicated the deeper in
you get!

Thanks,
Alexi

Alexander Smith

Aug 17, 2011, 11:31:12 AM
On Aug 17, 11:32 am, Alexander Smith <doesonsm...@googlemail.com>
wrote:

> So if we see TTTAGAAGG, the indices for 'TAG'
> should come back as 1 4 5.

It is 1 4 7, not 1 4 5.

Alexi

Graham

Aug 17, 2011, 12:34:15 PM

"Alexander Smith" <doeso...@googlemail.com> wrote in message news:57d9c97b-8cfe-4e8d...@t29g2000vby.googlegroups.com...
On Aug 17, 11:32am, Alexander Smith <doesonsm...@googlemail.com>
wrote:
> On Aug 17, 8:39am, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>

Now I am confused:

TTTAGAAGG
T**AG****
123456789

Why is it not 1 4 5 as you originally stated above????

Graham.

Alexander Smith

Aug 17, 2011, 1:22:16 PM
On Aug 17, 4:34 pm, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>
wrote:
> "Alexander Smith" <doesonsm...@googlemail.com> wrote in messagenews:57d9c97b-8cfe-4e8d...@t29g2000vby.googlegroups.com...

It is 1 4 5, you are correct! I have not had enough sleep, I have got to rest.

Thanks,
Alexi

Graham

Aug 17, 2011, 3:28:24 PM

"Alexander Smith" <doeso...@googlemail.com> wrote in message news:699a7908-2cf9-40c3...@w18g2000yqc.googlegroups.com...
On Aug 17, 4:34pm, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>
wrote:
[Snip]

> It is 1 4 5, you are correct! I have not had enough sleep, I have got to rest.

OK, try this. It is a crude looping example which can be extended via an outer loop to do all rows in your 800-row matrix. Obviously it can also be extended to cope with all 26 letters of the alphabet. Clearly, in a working function you would make your two matrices the left and right arguments to the function.

r←Join2;sc;a;b;na;nb;m;i;ind

⍝Set up sort criteria
sc←'ACGTU'

⍝Generate example data
a←10 3⍴sc[1+5|30?30]
b←30 9⍴sc[1+5|270?270]

⍝Convert letters to numbers
na←sc⍳a
nb←sc⍳b

⍝Setup matrix of indices of b
m←(⍴b)⍴⍳¯1↑⍴b

⍝Initialise the index matrix
r←((↑⍴b),(¯1↑⍴a))⍴0

:for i :in ⍳¯1↑⍴a

⍝Get indices of the ith letter in sequence in b
ind←na[1;i]=nb
ind←(<\ind)

⍝Set first occurrence of ith number to zero
nb←nb×~ind

⍝Record the indices of the ith letter in b
r[;i]←+/ind×m

:endfor

⍝Remove rows that do not contain a three letter sequence
b←((¯1↑⍴a)=(+/r≠0))⌿b
r←((¯1↑⍴a)=(+/r≠0))⌿r

⍝Filter out rows that do not contain three letters in the correct sequence
b←(~×+/(-2-/r)<0)⌿b
r←(~×+/(-2-/r)<0)⌿r

⍝Output the original sequence in a, the sequence in b containing the sequence in a and the index positions
r←(((↑⍴b),(¯1↑⍴a))⍴a[1;]),' ',b,' ',r

Graham.
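
Reading the steps above, each letter is located at its first occurrence anywhere in the row, and rows whose recorded indices are not ascending are then dropped. The Python sketch below mirrors that per-row logic, assuming this reading of the APL is right (it has not been run in an APL interpreter here); the function name is invented for illustration. It suggests that some rows containing the letters in order would be discarded.

```python
def first_occurrence_match(row, pattern):
    """First occurrence of each letter anywhere in the row, then an ordering filter."""
    cells = list(row)
    positions = []
    for letter in pattern:
        if letter not in cells:
            return None               # a letter is missing, so the row is dropped
        idx = cells.index(letter)
        cells[idx] = None             # blank the used cell, like nb←nb×~ind
        positions.append(idx + 1)     # origin-1 index
    pairs = zip(positions, positions[1:])
    if any(later < earlier for earlier, later in pairs):
        return None                   # indices out of order, so the row is dropped
    return positions


print(first_occurrence_match("TTTAGAAGG", "TAG"))
print(first_occurrence_match("TCGAAAGTT", "TAG"))
```

Under these assumptions TTTAGAAGG gives 1 4 5 as wanted, but TCGAAAGTT yields no hit, because its first G (position 3) comes before its first A (position 4), whereas the example post expects 1 4 7 for that row. Searching for each letter only to the right of the previous hit would avoid this.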

Stefano "WildHeart"

Aug 18, 2011, 10:09:48 AM
On Aug 17, 1:32 pm, Alexander Smith <doesonsm...@googlemail.com>
wrote:

> Example for rav, grahem, phil and everyone
>
> CTGGTTGAT
> GTAGTCATA
> CATGTCTAA
> TCGAAAGTT
> CCGGAGAAG
>

Let m be your matrix.

r←m find x;c;l;b;i
i←~l←r←(⍴m)⍴1

:For c :In x
b←c=m
r←∨\r∧l∧b
i∨←<\r
l∧←~<\l∧b
:EndFor
b←(⍴x)=+/i
r←b/⍳⍴b
i←i×(⍴i)⍴⍳¯1↑⍴i
i←(,i[r;])~0
r←r,(((⍴i)÷⍴x),⍴x)⍴i

m find 'TAG'
2 2 3 4
4 1 4 7
(which means: hit in row 2 at position 2 3 4, and in row 4 at position
1 4 7)

m find 'CTGT'
1 1 2 3 5
3 1 3 4 5
(which means: hit in row 1 at position 1 2 3 5, and in row 3 at
position 1 3 4 5)
m find 'GT'
1 3 5
2 1 2
3 4 5
4 3 8
(you get the gist...)

Have fun!
--
Stefano

Alexander Smith

Aug 19, 2011, 2:45:24 AM
On Aug 18, 2:09 pm, "Stefano \"WildHeart\""

Thanks, but it stops on the very first line with a "Value Error" pointing at l.

What does "b←c=m" do? In APL2 DOS it gives "Value Error".

Is there some kind of auto-initialization in your APL?

It looks powerful, but will the FIND function work in APL2 DOS?

Thanks,
Alexi

Alexander Smith

Aug 19, 2011, 2:49:23 AM
On Aug 17, 7:28 pm, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>
wrote:
> "Alexander Smith" <doesonsm...@googlemail.com> wrote in messagenews:699a7908-2cf9-40c3...@w18g2000yqc.googlegroups.com...

It is interesting, but it would be better to show it as workspace
commands instead of a function. I see a loop label :for but no place
where it goes to. I only have APL2 DOS.

Alexi.

Phil Last

Aug 19, 2011, 3:58:11 AM
On Aug 19, 6:49 am, Alexander Smith <doesonsm...@googlemail.com>
wrote:

I believe control structures are not implemented in APL2.

Graham

Aug 19, 2011, 5:37:45 AM

"Alexander Smith" <doeso...@googlemail.com> wrote in message news:0af6d1ca-52a2-49c4...@en1g2000vbb.googlegroups.com...

On Aug 17, 7:28 pm, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>
wrote:
> "Alexander Smith" <doesonsm...@googlemail.com> wrote in messagenews:699a7908-2cf9-40c3...@w18g2000yqc.googlegroups.com...
>
> On Aug 17, 4:34pm, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>
> wrote:
> [Snip]

> It is interesting, but it would be better to show it as workspace
> commands instead of a function. I see a loop label :for but no place
> where it goes to. I only have APL2 DOS.

Sorry Alexi, but you cannot do this outside a function except one search at a time. Since Phil has said that APL2 might not support the control structures I used for the looping, I have recast the function in the pre-control-structure style, which should work in APL2.

I have also included the outer loop to test all three letter combinations in your 800 row matrix. It was also easy to add a couple of lines to the function to identify when the match was on the first three letters of your B matrix so it does both your last two jobs at the same time. Clearly you can separate the roles by using those two lines as a filter rather than as an indicator.

Remember this is a crude demonstration function (there might be more elegant pure APL forms!) which creates its own random letter sequences for validation and testing using restricted sized matrices and letter sets. If it does what you want you will have to modify it to create a production function:

r←Join3;sc;a;b;na;nb;m;i;ind;j;bm;z;s

⍝Generate vector of search letters
sc←'ACGTU'

⍝Generate test data

a←10 3⍴sc[1+5|30?30]
b←30 9⍴sc[1+5|270?270]

⍝Convert letters to numbers
na←sc⍳a

⍝Setup matrix of indices of b
m←(⍴b)⍴⍳¯1↑⍴b

⍝Initialise the results matrix
r←(0,2+(2×1+¯1↑⍴a)+¯1↑⍴b)⍴0

⍝Initialise outer loop counter
j←1
outer:

⍝Get b matrix
bm←b
nb←sc⍳b

⍝Initialise the index matrix
z←((↑⍴bm),(¯1↑⍴a))⍴0

⍝Initialise inner loop counter
i←1
inner:



⍝Get indices of the ith letter in sequence in b

ind←na[j;i]=nb
ind←(<\ind)

⍝Set first occurrence of ith number to zero

nb←nb×~ind

⍝Record the indices of the ith letter in b

z[;i]←+/ind×m

⍝Increment inner loop counter and exit loop on completion
i←i+1
→(i≤¯1↑⍴a)/inner



⍝Remove rows that do not contain a three letter sequence

bm←((¯1↑⍴a)=(+/z≠0))⌿bm
z←((¯1↑⍴a)=(+/z≠0))⌿z



⍝Filter out rows that do not contain three letters in the correct sequence

bm←(~×+/(-2-/z)<0)⌿bm
z←(~×+/(-2-/z)<0)⌿z

⍝Identify rows of b where first three letters are a match
s←(↑⍴z)⍴' '
s[(6=+/z)/⍳⍴s]←'*'

⍝Concatenate the results of this search
z←((((↑⍴bm),(¯1↑⍴a))⍴a[j;]),' ',bm,' ',z,' '),s

⍝Concatenate to previous search results
r←r⍪' '⍪z

⍝Increment outer loop counter and exit loop on completion
j←j+1
→(j≤↑⍴a)/outer

Graham.

Alexander Smith

Aug 19, 2011, 8:35:52 AM
On Aug 19, 9:37 am, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>
wrote:
> "Alexander Smith" <doesonsm...@googlemail.com> wrote in messagenews:0af6d1ca-52a2-49c4...@en1g2000vbb.googlegroups.com...

Thanks,
Alexi

Graham

Aug 19, 2011, 11:23:03 AM

"Alexander Smith" <doeso...@googlemail.com> wrote in message news:b621dd8b-54b3-44fb...@c8g2000prn.googlegroups.com...

> On Aug 19, 9:37 am, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>
> wrote:
>> "Alexander Smith" <doesonsm...@googlemail.com> wrote in messagenews:0af6d1ca-52a2-49c4...@en1g2000vbb.googlegroups.com...
>>
>> On Aug 17, 7:28 pm, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>
>> wrote:
>>
>> > "Alexander Smith" <doesonsm...@googlemail.com> wrote in messagenews:699a7908-2cf9-40c3...@w18g2000yqc.googlegroups.com...
>>
>> > On Aug 17, 4:34pm, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>
>> > wrote:
>> > [Snip]

>> Sorry Alexi, but you cannot do this outside a function except one search at a time. Since Phil has said that APL2 might not support the control structures I used for the looping, I have recast the function in the pre-control-structure style, which should work in APL2.

[Snip]

> Thanks,
> Alexi

You are welcome. Did the function work in APL2 and more importantly did it do what you wanted it to do?

Graham.

Ric

Aug 21, 2011, 4:38:22 PM
On Aug 17, 11:32 pm, Alexander Smith <doesonsm...@googlemail.com>
wrote:


You may find this thread on the J Programming forum of interest:
http://old.nabble.com/forum/ViewPost.jtp?post=32287342&framed=y&skin=24193

Lou

Aug 23, 2011, 10:27:23 AM
Alexander Smith <doeso...@googlemail.com>:
> On Aug 17, 7:28 pm, "Graham" <h2gt2g42-micenewgro...@yahoo.co.uk>
> It is interesting, but it would be better to show it as workspace
> commands instead of a function. I see a loop label :for but no place
> where it goes to. I only have APL2 DOS.
>
> Alexi.

The following should work in any APL implementation; this is the "old"
way of implementing loops. Just modify the aforementioned code
with this technique and hopefully it will do the trick.


Simple loop from 1 to 10. I added the #: prefixes for reference; if you
are using APL2 you should see those somehow in your interpreter.


1: i ← 1
2: → (i > 10) / 6
3: 'i = ', i
4: i ← i + 1
5: → 2
6: 'done'


Lou
