GeoPipeline hasNext() / next() functions take a long time for the first time

147 views
Skip to first unread message

Abhijeet Deshpande

unread,
Oct 15, 2012, 8:10:11 AM10/15/12
to ne...@googlegroups.com
Hi
I am using neo4j spatial to find out locations near by a lat/long. The results returned are sorted on OrthodromicDistance and are further paginated using the range function (100 results per page). The code for this is

GeoPipeline flowList = (GeoPipeline)GeoPipeline.startNearestNeighborLatLonSearch(layer, loc, dist).sort("OrthodromicDistance").copyDatabaseRecordProperties(keys).range(low, high);

Now to iterate over this GeoPipline, I am using hasNext() / next() functions but for some reason the first call for either of these functions takes long time to execute.

When the application was run, first call to next() or hasNext() took approximately 400 sec to execute, subsequent 99 calls took 0ms.

Can someone please point out where the mistake is and if there is a faster way to find the nearest places. The layer has 10 million entries.

Regards
Abhijeet

Peter Neubauer

unread,
Oct 15, 2012, 9:04:42 AM10/15/12
to ne...@googlegroups.com
Hi there,
the first run is probably with cold caches. In real world scenarios,
you are running with warm caches, so you should try to warm up the
database by doing a few searches before the real work, or maybe loop
through your interesting nodes or so?

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

Neo4j 1.8 GA - http://www.dzone.com/links/neo4j_18_release_fluent_graph_literacy.html
> --
>
>

Abhijeet Deshpande

unread,
Oct 19, 2012, 8:13:45 AM10/19/12
to ne...@googlegroups.com
Hi Peter,
Thank you for the response.

Actually this is not related to the very first run of query where the cache is empty.

The problem is that when ever I try to search places and request for 100 places per page in response, first call to Pipeline.hasNext() / Pipeline.next() always takes about 400 sec. time and this happens for every request.

Hope I am able to convey the problem that I am facing.

Regards
Abhijeet

Craig Taverner

unread,
Oct 19, 2012, 9:34:53 AM10/19/12
to ne...@googlegroups.com
I can suggest that since you have a sorting component in the pipeline, this will cause the pipeline to internally cache the entire resultset on the first hasNext or next call, sort it and return the first sorted result. Subsequent calls will access the internal, sorted, cache. This is a necessary consequence of the sorting algorithm, and is mostly unavoidable.

So avoid this, you would need to take away the sort. Is it possible to get the same results you want without the sort? Or at least by moving the sort into your own code, or perhaps after a filter or some other action that reduces the resultset size?

--
 
 

Peter Neubauer

unread,
Oct 21, 2012, 7:59:31 AM10/21/12
to ne...@googlegroups.com
Mmh,
I think the whole query is re-issued when you request the next page.
in order to avoid that, could you hold the result in cache on the
server side for the next request and page it on your own?

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

Neo4j 1.8 GA - http://www.dzone.com/links/neo4j_18_release_fluent_graph_literacy.html


On Fri, Oct 19, 2012 at 2:13 PM, Abhijeet Deshpande
> --
>
>

Abhijeet Deshpande

unread,
Oct 22, 2012, 3:41:02 PM10/22/12
to ne...@googlegroups.com

Hi Craig, Peter

Thank you for the inputs.

Removing sorting helped quite a lot and I also reduced the result size from 100 to 10. Now the response times have come down to a few milliseconds but the downside is that the results are not sorted upon the distance.

The objective is to get 10 places per page starting with the closest one. To implement this I tried following GeoPipeline functions

1. startNearestNeighborSearch(layer, point, distance) and
2, startNearestNeighborLatLonSearch(layer, point, distance)

but none of them returned the results sorted on distance even if the function names seem to suggest so. As per the function definition, no additional sort filters are needed so please let me know if I am missing something.

I also observed that the results of startNearestNeighborLatLonSearch are better sorted than startNearestNeighborSearch and think it has got something to do with the propertyFilter that these functions use. I would like to know more on  "OrthodromicDistance" and "Distance" properties and the difference between their respecitive property filters.

Regards
Abhijeet

Abhijeet Deshpande

unread,
Oct 25, 2012, 6:10:49 AM10/25/12
to ne...@googlegroups.com
Hi
As specified below, can someone please help me understand why startNearestNeighborSearch / startNearestNeighborLatLonSearch functions don't return the results sorted on the distance. I have read some documentation which suggests that these functions return the results sorted on distance starting with the nearest one.

Thanks
Abhijeet

Peter Neubauer

unread,
Oct 25, 2012, 7:23:17 AM10/25/12
to ne...@googlegroups.com
Abhijeet,
from the docs and implementation, there is no sorting going on here.
Instead, all of the returned points satisfy the bonuding box you are
requesting

/**
* Extracts Layer items with a distance from the given point that is
less than or equal the given distance.
*
* @param layer with latitude, longitude coordinates
* @param point
* @param maxDistanceInKm
* @return geoPipeline
*/
public static GeoPipeline startNearestNeighborLatLonSearch(Layer
layer, Coordinate point, double maxDistanceInKm) {
Envelope searchWindow =
OrthodromicDistance.suggestSearchWindow(point, maxDistanceInKm);
GeoPipeline pipeline = start(layer, new SearchIntersectWindow(layer,
searchWindow))
.calculateOrthodromicDistance(point);

if (layer.getGeometryType() == Constants.GTYPE_POINT) {
pipeline = pipeline.propertyFilter("OrthodromicDistance",
maxDistanceInKm, FilterPipe.Filter.LESS_THAN_EQUAL);
}

return pipeline;
}

/**
* Calculates the distance between Layer items nearest to the given
point and the given point.
* The search window created is based on Layer items density and it
could lead to no results.
*
* @param layer
* @param point
* @param numberOfItemsToFind tries to find this number of items
for comparison
* @return geoPipeline
*/
public static GeoPipeline startNearestNeighborSearch(Layer layer,
Coordinate point, int numberOfItemsToFind) {
Envelope searchWindow =
SpatialTopologyUtils.createEnvelopeForGeometryDensityEstimate(layer,
point, numberOfItemsToFind);
return startNearestNeighborSearch(layer, point, searchWindow);
}


For this to happen, you could either contribute a
SortingNearestNeighborSearch (that does the sorting for you) or have a
second sort step?

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

Neo4j 1.8 GA - http://www.dzone.com/links/neo4j_18_release_fluent_graph_literacy.html


On Thu, Oct 25, 2012 at 12:10 PM, Abhijeet Deshpande
> --
>
>

Craig Taverner

unread,
Oct 25, 2012, 7:45:53 AM10/25/12
to ne...@googlegroups.com
The best solution is to perform the query with a sufficiently large bounding box to give at least the number of results you expect, and then soft and limit in the client code afterwards. This works very well, if you guess the bounding box correctly. That guess is best done with domain knowledge, something the client code is more likely to have.

The fundamental problem here, and the reason why the sorting is not done internally, is that the spatial index is based on location, not distance. While it is possible to make an index based on distance, the origin of the search would be specific to the index. This means the index would only work for searches of distance from a particular point always, not generalized to any point. So, to support searches around any point (the point you pass in the search query), we need to build a bounding box, query the index on that, and then filter to points at the right distance.

--



Abhijeet Deshpande

unread,
Nov 7, 2012, 1:32:50 PM11/7/12
to ne...@googlegroups.com
Hi Craig

Thank you for the response. As suggested, I have removed the sorting component from the request and now simply fetching paginated results, 100 per page, using the following call.

String[] keys = {"id","name","address","city","state","zip"};
GeoPipeline flowList = ((GeoPipeline)GeoPipeline.startNearestNeighborLatLonSearch(layer, loc, dist).range(low, high)).copyDatabaseRecordProperties(keys);

Further I traverse the result using this loop

while(flowList.hasNext()){
                    geoPipeFlow = flowList.next();
                    ------
                    -------
                }

I have observed that the graph search time is negligible but when I traverse the result in while loop, for lower page ranges (Page 1: 1-100, Page 2: 101-200, Page 3: 201-300) it takes about 800-900 milliseconds to execute the first hasNext() call and for later pages like 8,9,10 it takes about 4000 to 5000 ms. This slows down the overall performance.

Please let me know if it is possible to significantly reduce this result traversal time irrespective of the page being requested.

I suspect it may be related to the way I am creating the layer in graph database and hence I have also attached the java code file that populates graph database. Code steps at a high level are

1. Create graphDatabaseService
2. Create spatialDatabaseService
3. Create SimplePointLayer called places
4. Read a place record that contains lat and long along with other details from file and input file add it to the layer to create SpatialDatabaseRecord
5. To this newly created node add other properties like place name, address, zip code etc

Can the access time be reduced if we create some index which can be used while traversal?

Please let me know your thoughts on this.

Regards
Abhijeet
PlaceImporter_V2.java

Peter Neubauer

unread,
Nov 7, 2012, 6:03:46 PM11/7/12
to Neo4j User
So,
this is with only a couple of hundred layer entries? Sounds very slow. Do you have a sample data file so I could test it?

/peter


Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

Neo4j 1.8 GA - http://www.dzone.com/links/neo4j_18_release_fluent_graph_literacy.html


--
 
 

Abhijeet Deshpande

unread,
Nov 8, 2012, 5:34:27 AM11/8/12
to ne...@googlegroups.com
Hi Peter,
The graph database on which I am running these tests contains 10 million place records.
Attached here is a sample file that contains about 10,000 records. Each line contains a record and is seperated by "|". The columns are as mentioned below.

id | latitude | longitude | name | address | city | state | zip


-Abhijeet
places1.txt

Abhijeet Deshpande

unread,
Nov 8, 2012, 5:35:23 AM11/8/12
to ne...@googlegroups.com
Attached is the second chunk of 5000 records.
places2.txt

Craig Taverner

unread,
Nov 9, 2012, 8:14:35 AM11/9/12
to ne...@googlegroups.com
Hi,

I've not tried your data yet, but I did have a chance to write some more junit test code for the GeoPipes, and I tested two cases, both involving creating a random sample of 10000 points in a point layer. The first test times how long each chunk of 1000 points takes to read on the stream, just iterating over the entire stream. The second uses the range() operation you mentioned before, to select the page of data. So both tests have the same final result, but the second one will always re-start the iterator and only return the 1000 results for the relevant page.

Not surprisingly, the plain iterator runs at linear speed, with each 1000 points taking roughly as long to return, while the paging iterator slows down, because it keeps restarting at the beginning.

So, this leads me to believe you are also restarting at the beginning. The trick with using a plain iterator is to maintain state between calls. If you are running in embedded mode, this is obviously trivial. But if you are running a web app, I think it gets more subtle.

If you want to see my test code to verify if I am in fact modeling the same process as you, I can commit it to neo4j-spatial. I plan to do that anyway, after a little more double-checking.

Regards, Craig

--
 
 

Peter Neubauer

unread,
Nov 9, 2012, 11:58:27 AM11/9/12
to Neo4j User
Thank you Craig for chiming in!

/peter


Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

Neo4j 1.8 GA - http://www.dzone.com/links/neo4j_18_release_fluent_graph_literacy.html


--
 
 

Craig Taverner

unread,
Nov 9, 2012, 1:04:46 PM11/9/12
to ne...@googlegroups.com
No problem.

I realize also my entire discussion could be summarized with 'don't use range()'....
;-)

--
 
 

Reply all
Reply to author
Forward
0 new messages