Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

More Efficient Method

2 views
Skip to first unread message

blamm64

unread,
Nov 20, 2009, 6:38:47 AM11/20/09
to
I have a couple of functions designed to poke a single hole, and to
poke multiple holes, in a one-level list:

We define a function which, given the imported pressure data, finds
the subset of that pressure data excluding the pressure data points
between "targetL " and "targetU".

In[5]:= findsubset[data_?VectorQ,targetL_?NumericQ,targetU_?
NumericQ] := Select[data,(#<=targetL || #>=targetU &)]

This function will pluck out multiple holes in the data list.

In[6]:= subsets[data_?VectorQ,tarList_?ListQ]:=Module[{tmp,tmp1},
tmp=data;
Do[tmp1=findsubset[tmp,tarList[[i,1]],tarList[[i,2]]];tmp=tmp1,
{i,Dimensions[tarList][[1]]}];
tmp
]

The following works fine (big holes chosen not to give large result):

In[7]:= datalist=Range[11,3411,10];

In[12]:= targetlist={{40, 1500},{1600,3300}};

In[13]:= resultdata=subsets[datalist,targetlist]

Out[13]=
{11,21,31,1501,1511,1521,1531,1541,1551,1561,1571,1581,1591,3301,3311,3321,3331,3341,3351,3361,3371,3381,3391,3401,3411}

But if "datalist" happens to be very large, surely there is a (much)
more efficient method?

I tried unsuccessfully to use pure functions with Select, but have a
somewhat nebulous feeling there's a pure function way of doing this
effectively much more efficiently.

I know, I know: the above have no consistency checking. I also know
"subsets" could be used in place of "findsubset" just by replacing the
call of "findsubset" with the code of "findsubset" in "subsets".

>From what I've seen on this forum there are some really experienced
people who might provide an efficient way of implementing the above.

-Brian L.

Daniel Lichtblau

unread,
Nov 21, 2009, 3:33:45 AM11/21/09
to

If you are working with integers then the method below should be fine.
Otherwise you may need to "fuzzify" a bit differently. I use
IntervalMemberQ to determine which elements in the data list to omit,
and then does the selection using Select (I tried Pick, and it was
perhaps a half a hair slower).

subsets2[data_?VectorQ,tarList_?ListQ] := Module[
{intv=Apply[Interval,Map[#+{.5,-.5}&,tarList]]},
Select[data, !IntervalMemberQ[intv,#]&]]

Here is a quick but slightly large test.

datalist = RandomInteger[11000,100000];
targetlist = Table[{n,n+20}, {n,100,10000,100}];

In[47]:= Timing[resultdata = subsets[datalist,targetlist];]
Out[47]= {14.4878, Null}

In[48]:= Timing[resultdata2 = subsets2[datalist,targetlist];]
Out[48]= {0.179973, Null}

In[49]:= resultdata === resultdata2
Out[49]= True

In[50]:= Length[resultdata2]
Out[50]= 82596

Daniel Lichtblau
Wolfram Research

Raffy

unread,
Nov 21, 2009, 3:36:42 AM11/21/09
to
On Nov 20, 3:38 am, blamm64 <blam...@charter.net> wrote:
> I have a couple of functions designed to poke a single hole, and to
> poke multiple holes, in a one-level list:
>
> We define a function which, given the imported pressure data, finds
> the subset of that pressure data excluding the pressure data points
> between "targetL " and "targetU".
>
> In[5]:= findsubset[data_?VectorQ,targetL_?NumericQ,targetU_?
> NumericQ] := Select[data,(#<=targetL || #>=targetU &)]
>
> This function will pluck out multiple holes in the data list.
>
> In[6]:= subsets[data_?VectorQ,tarList_?ListQ]:=Module[{tmp,tmp1},
> tmp=data;
> Do[tmp1=findsubset[tmp,tarList[[i,1]],tarList[[i,2]]];tmp=tmp1,
> {i,Dimensions[tarList][[1]]}];
> tmp
> ]
>
> The following works fine (big holes chosen not to give large result):
>
> In[7]:= datalist=Range[11,3411,10];
>
> In[12]:= targetlist={{40, 1500},{1600,3300}};
>
> In[13]:= resultdata=subsets[datalist,targetlist]
>
> Out[13]=
> {11,21,31,1501,1511,1521,1531,1541,1551,1561,1571,1581,1591,3301,3311,332=
1, 3331,3341,3351,3361,3371,3381,3391,3401,3411}

>
> But if "datalist" happens to be very large, surely there is a (much)
> more efficient method?
>
> I tried unsuccessfully to use pure functions with Select, but have a
> somewhat nebulous feeling there's a pure function way of doing this
> effectively much more efficiently.
>
> I know, I know: the above have no consistency checking. I also know
> "subsets" could be used in place of "findsubset" just by replacing the
> call of "findsubset" with the code of "findsubset" in "subsets".
>
> >From what I've seen on this forum there are some really experienced
>
> people who might provide an efficient way of implementing the above.
>
> -Brian L.

I didn't do any speed testing yet, but this functionality is available
through Interval.

With[{interval = Interval[{40, 1500}, {1600, 3300}]},
Select[Range[11, 123123, 10], ! IntervalMemberQ[interval, #] &]
]

Bill Rowe

unread,
Nov 21, 2009, 3:37:19 AM11/21/09
to
On 11/20/09 at 6:38 AM, bla...@charter.net (blamm64) wrote:

>I have a couple of functions designed to poke a single hole, and to
>poke multiple holes, in a one-level list:

>We define a function which, given the imported pressure data, finds
>the subset of that pressure data excluding the pressure data points
>between "targetL " and "targetU".

>In[5]:= findsubset[data_?VectorQ,targetL_?NumericQ,targetU_?
>NumericQ] := Select[data,(#<=targetL || #>=targetU &)]

on my machine the following has the same result but executes faster

fs[data_?VectorQ, targetL_?NumericQ, targetU_?NumericQ] :=

Join[SparseArray[Clip[data, {First[data], targetL}, {0, 0}]] /.
SparseArray[_, _, _, {_, _, a_}] -> a,
SparseArray[Clip[data, {targetU, Last[data]}, {0, 0}]] /.
SparseArray[_, _, _, {_, _, a_}] -> a]

>This function will pluck out multiple holes in the data list.

>In[6]:= subsets[data_?VectorQ,tarList_?ListQ]:=Module[{tmp,tmp1},
>tmp=data;
>Do[tmp1=findsubset[tmp,tarList[[i,1]],tarList[[i,2]]];tmp=tmp1,
>{i,Dimensions[tarList][[1]]}]; tmp
>]

I the following does the same thing is simpler in my view

subs[data_?VectorQ, tarList_?ListQ] :=
Fold[fs[#1, First[#2], Last[#2]] &, data, tarList]

>The following works fine (big holes chosen not to give large
>result):

>In[7]:= datalist=Range[11,3411,10];

>In[12]:= targetlist={{40, 1500},{1600,3300}};

=46irst to demonstrate the my solution produces the same result

In[7]:= subs[datalist, targetlist] == subsets[datalist, targetlist]

Out[7]= True

and then timing data on my system

In[8]:= Timing[subsets[datalist, targetlist];]

Out[8]= {0.000894,Null}

In[9]:= Timing[subs[datalist, targetlist];]

Out[9]= {0.000175,Null}


Ray Koopman

unread,
Nov 21, 2009, 3:37:41 AM11/21/09
to
On Nov 20, 3:38 am, blamm64 <blam...@charter.net> wrote:

In[1]:=
datalist = Range[11,3411,10];
targetlist = {{40, 1500},{1600,3300}}
rejectinterval = Interval@@({1,-1}+#&)/@targetlist
Select[datalist,!IntervalMemberQ[rejectinterval,#]&]

Out[2]=
{{40,1500},{1600,3300}}

Out[3]=
Interval[{41,1499},{1601,3299}]

Out[4]=
{11,21,31,1501,1511,1521,1531,1541,1551,1561,1571,1581,1591,
3301,3311,3321,3331,3341,3351,3361,3371,3381,3391,3401,3411}

David Park

unread,
Nov 21, 2009, 3:38:24 AM11/21/09
to
One possibility is:

datalist = Range[11, 3411, 10];

If[40 <= # <= 1500 \[Or] 1600 <= # <= 3300,
Unevaluated[Sequence[]], #] & /@ datalist

Sequence[] is dropped from a sequence of items. It has to be Unevaluated in
the If statement so it's not dropped there!


David Park
djm...@comcast.net
http://home.comcast.net/~djmpark/

Leonid Shifrin

unread,
Nov 21, 2009, 3:42:38 AM11/21/09
to
Hi Brian,

Your solution will be reasonably efficient if you need just a couple of
holes, but will become less and less efficient when you increase the number
of holes. Perhaps you will get better /more concise solutions which will
be using some built-in that I am not aware of, but here is a (rather
verbose) solution that should be fast enough even for a large number of
holes.

Clear[bsearchMinCompMassive ];
bsearchMinCompMassive =
Compile[{{list, _Real, 1}, {elems, _Real, 1}},
Module[{n0 = 1, n2 = 1, n1 = Length[list],
res = Array[0 &, {Length[elems]}], m = 1},
For[n2 = 1, n2 <= Length[elems], n2++,
n0 = 1; n1 = Length[list];
While[n0 <= n1,
m = Floor[(n0 + n1)/2];
If[list[[m]] < elems[[n2]], n0 = m + 1, n1 = m - 1]];
res[[n2]] = If[list[[m]] < elems[[n2]], m, m - 1]
];
res]];

Clear[bsearchMaxCompMassive];
bsearchMaxCompMassive =
Compile[{{list, _Real, 1}, {elems, _Real, 1}},
Module[{n0 = 1, n2 = 1, n1 = Length[list],
res = Array[0 &, {Length[elems]}], m = 1},
For[n2 = 1, n2 <= Length[elems], n2++,
n0 = 1; n1 = Length[list];
While[n0 <= n1,
m = Floor[(n0 + n1)/2];
If[list[[m]] > elems[[n2]], n1 = m - 1, n0 = m + 1]];
res[[n2]] = If[list[[m]] > elems[[n2]], m, m + 1]
];
res]];

Clear[deleteRegions];
deleteRegions[x : {__?NumberQ}, regs : {{_, _} ..}] :=
Module[{sorted, ord, result, positions, sortedRegs = Sort@regs},
sorted = x[[ord = Ordering[x]]];
positions =
Sort@ord[[Complement[Range[Length[sorted]],
Apply[Sequence,
Range @@@
Transpose[{bsearchMaxCompMassive[sorted, #1],
bsearchMinCompMassive[sorted, #2]} & @@
Transpose[sortedRegs]]]]]];
result = x[[positions]]];


Here are some benchmarks:

In[1]:= datalist=RandomSample[#,Length[#]]&@Range[11,100000,10];
In[2]:= targetlist={{40,1500},{1600,3300}};
In[3]:= targetList2 = Table[{10*i,10*i+50},{i,2,150}];

In[4]:= (resultdata10=subsets[datalist,targetlist])//Short//Timing

Out[4]=
{0.032,{88671,91041,81971,6041,86891,<<9673>>,55451,73361,28271,51601,72831}}

In[5]:= (resultdata20 = deleteRegions[datalist,targetlist])//Short//Timing

Out[5]=
{0.,{88671,91041,81971,6041,86891,<<9673>>,55451,73361,28271,51601,72831}}

In[6]:= resultdata10==resultdata20

Out[6]= True

In[7]:= (resultdata1=subsets[datalist,targetList2])//Short//Timing

Out[7]=
{2.313,{88671,91041,81971,6041,86891,<<9836>>,55451,73361,28271,51601,72831}}

In[8]:= (resultdata2 = deleteRegions[datalist,targetList2])//Short//Timing

Out[8]=
{0.016,{88671,91041,81971,6041,86891,<<9836>>,55451,73361,28271,51601,72831}}

In[9]:= resultdata1==resultdata2

Out[9]= True


Regards,
Leonid

Leonid Shifrin

unread,
Nov 22, 2009, 6:16:06 AM11/22/09
to
Hi Brian,

This is a follow-up to my previous post. Since you mentioned performance on
large datasets as a key objective, I further optimized the code of my
previous post. It now gives 3-4 times the speed of the solutions based on
Select/IntervalMemberQ,
at least in the region of the problem's parameter space that I was able to
test. Here is the code:

-------------------------------------------------------------------------------

Clear[deleteRegionsNew];
deleteRegionsNew[x_?VectorQ, regs_?ListQ, splitNumber_: Automatic] :=
With[{snum =
If[splitNumber === Automatic,
50000*(1 + IntegerPart[Length[regs]/10000]), splitNumber]},
Join @@
Map[deleteRegionsAux[#, regs] &, partitionWithTail[x, snum]]];

Clear[partitionWithTail];
partitionWithTail[l_List, size_Integer] :=
If[IntegerQ[Length[l]/size],
#,
Append[#, Drop[l, Length[Flatten[#, 1]]]]] &@
Partition[l, size];

Clear[deleteRegionsAux];
deleteRegionsAux[x_, regs_] :=
Module[{sorted, ord, sortedRegs, ones , xl, minn = Min[x] - 2,
maxx = Max[x] + 2},
xl = Join[x, {minn, maxx}];
sortedRegs =
Sort@DeleteCases[
Clip[regs, {minn + 1, maxx - 1}], {minn + 1,
minn + 1} | {maxx - 1, maxx - 1}];
ones = ConstantArray[1, {Length[sortedRegs]}];
sorted = xl[[ord = Ordering[xl]]];
Drop[ xl[[Sort[ord[[#]]]]], -2] &@


Complement[Range[Length[sorted]],

Apply[Join,
Range @@@ Transpose[{bsearchMaxMassive[sorted, #1, ones],
bsearchMinMassive[sorted, #2, ones]} & @@
Transpose[sortedRegs]]]]];

Clear[bsearchMinMassive];
bsearchMinMassive =
Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}},
Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0,
m = ones, diff = ones, un1 = ones, un2 = ones},
n1 = Length[list]*n0;
While[Sign[n0 - n1] != ones,


m = Floor[(n0 + n1)/2];

un1 = Floor@UnitStep[list[[m]] - elems];
un2 = ones - un1;
n1 = n1*un2 + (m - 1)*un1;
n0 = n0*un1 + (m + 1)*un2;
];
Floor[m - UnitStep[list[[m*UnitStep[m]]] - elems]]]];

Clear[bsearchMaxMassive];
bsearchMaxMassive =
Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}},
Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0,
m = ones, diff = ones, un1 = ones, un2 = ones, m1 = ones},
n1 = Length[list]*n0;
While[Sign[n0 - n1] != ones,


m = Floor[(n0 + n1)/2];

un1 = Floor@UnitStep[elems - list[[m]]];
un2 = ones - un1;
n1 = n1*un1 + (m - 1)*un2;
n0 = n0*un2 + (m + 1)*un1;
];
Floor[m + UnitStep[elems - list[[m]]]]]];


(* Daniel's solution for comparison*)

Clear[subsets2];
subsets2[data_?VectorQ, tarList_?ListQ] :=
Module[{intv = Apply[Interval, Map[# + {.5, -.5} &, tarList]]},
Select[data, ! IntervalMemberQ[intv, #] &]]

----------------------------------------------------------------------------------

Benchmarking on a large dataset:

In[1]:=
datalist=RandomInteger[1100000,6500000];
targetlist=Table[{n,n+20},{n,100,6500000,50}];

In[2]:= Length[targetlist]

Out[2]= 129999

In[3]:=

Timing[resultdata=subsets2[datalist,targetlist];]

Out[3]= {88.797,Null}

In[4]:= Timing[resultdata2=deleteRegionsNew[datalist,targetlist];]

Out[4]= {26.629,Null}

In[5]:= resultdata===resultdata2

Out[5]= True

In[6]:=
datalist = RandomInteger[1100000, 6500000];
targetlist = Table[{n, n + 20}, {n, 100, 6500000, 5000}];

In[7]:= Length[targetlist]

Out[7]= 1300

In[8]:=
Timing[resultdata = subsets2[datalist, targetlist];]

Out[8]= {46.927, Null}

In[9]:= Timing[resultdata2 = deleteRegionsNew[datalist, targetlist];]

Out[9]= {10.625, Null}

In[10]:= resultdata === resultdata2

Out[10]= True

My code uses one tuning parameter - the third optional parameter which
determines the size of the data chunk to split the original data list into.
I use some heuristics to set it up when it is not explicitly set.

The main idea of the implementation is to sort the initial dataset, use
binary search to determine the starting and ending positions of elements in
sorted set which get into the holes, and then extract only those elements
which do not. I can expand this description upon request.


Regards,
Leonid

Ray Koopman

unread,
Nov 23, 2009, 6:59:11 AM11/23/09
to
If you want something less complex than Leonid's code, here are two
simple speedups for the IntervalMemberQ approach. First, it seems
to run a little faster with integer limits than with real limits.

datalist = Table[Random[Integer,11*^5],{m = 1*^6}];
targetlist = Table[{n,n+20},{n,100,m,50}];

Block[{realreject = Interval@@({.5,-.5}+#&)/@targetlist},
Select[datalist,!IntervalMemberQ[realreject,#]&]] //Length //Timing

{9.75 Second, 654464}

Block[{integereject = Interval@@({1,-1}+#&)/@targetlist},
Select[datalist,!IntervalMemberQ[integereject,#]&]] //Length //Timing

{8.69 Second, 654464}

Second, if you expect to keep more items than you drop then
it's probably faster to look for the donuts than the holes.

Block[{keep = Interval@@Partition[Join[{Min[datalist,targetlist]-1},
Flatten@targetlist,{Max[datalist,targetlist]+1}], 2]},
Select[datalist,IntervalMemberQ[keep,#]&]] //Length //Timing

{7.47 Second, 654464}

On Nov 22, 3:16 am, Leonid Shifrin <lsh...@gmail.com> wrote:
> Hi Brian,
>
> This is a follow-up to my previous post. Since you mentioned performance
> on large datasets as a key objective, I further optimized the code of my
> previous post. It now gives 3-4 times the speed of the solutions based
> on Select/IntervalMemberQ,
> at least in the region of the problem's parameter space that I was able
> to test. Here is the code:
>
> ----------------------------------------------------------------------
>

Ray Koopman

unread,
Nov 24, 2009, 5:51:49 AM11/24/09
to
On Nov 20, 3:38 am, blamm64 <blam...@charter.net> wrote:
On Nov 20, 3:38 am, blamm64 <blam...@charter.net> wrote:

In your example, 'datalist' contains sorted integers, and 'targetlist'
contains sorted nonoverlapping "holes" with integer bounds. If this
will always be the case then the following code, that "walks" the two
lists together, will be much faster than using IntervalMemberQ.

funk = Compile[{{datalist,_Integer,1},{targetlist,_Integer,2}},
Module[{b = Append[Flatten[{0,-1}+#&/@targetlist],Last@datalist+1],
j = 1}, Select[datalist,(While[b[[j]]<#,j++]; OddQ@j)&]]]

0 new messages