[Infra] Add generic CSV/coord graph loaders + Andorra OSM template for JMH benches

Shai Eilat

unread,

May 19, 2026, 2:58:41 PMMay 19

to jgrapht-dev

Hi all,

Heads-up before opening a test-infrastructure PR. Test-scope only — no main-source change, no new public API in jgrapht-core, no behavioural difference to any algorithm.

The gap

Recent perf PRs have benefited a lot from real-graph benches (m2m PR #1340 and AllDirectedPaths PR #1341 both shipped JMH cells against the Andorra OSM road network), but every contributor has been rolling their own CSV loader and A* heuristic from scratch. As I'm queueing more perf work — batch point-to-point shortest paths, parallel coreness, incremental MST — that pattern needs to stop being one-off.

What's in the PR

Three commits in seilat/jgrapht:chore/perf-bench-graph-loaders, all under jgrapht-core/src/test/ (plus one test-scope dep in the parent pom.xml).

Commit 1 — generic utilities (org.jgrapht.perf.util)

WeightedEdgeListCsvReader<V, E> — reads src,dst,weight CSV (gzip or plain) into a provided Graph. Fluent options for vertex parser, header, delimiter, gzip-wrapping, and an addMissingVertices(boolean) toggle for callers that pre-add the vertex set in a specific order. Entry points: read(InputStream), readResource(Class, String), readFile(Path).
CoordinatesCsvReader<V> — same shape, reads id,lat,lon CSV into Map<V, double[]>.
HaversineHeuristic<V> — generic AStarAdmissibleHeuristic over a Function<V, double[]> coordinate lookup, default Earth radius with an override for non-Earth bodies.
JmhBenchRunner — small helper that runs a JMH bench class with forks=0 so it inherits the surefire JVM's argLine, and writes the text summary to a caller-supplied path. Useful when running benches via -Dtest=MyBenchSuite#runFoo rather than the command-line org.openjdk.jmh.Main entry point.

24 JUnit 5 tests cover the readers and heuristic (in-memory streams, gzip handling, line-numbered parse errors, custom parsers, missing-resource behaviour, etc.).

Commit 2 — Andorra OSM template (org.jgrapht.perf.shortestpath.osm)

AndorraGraphLoader — thin wrapper using the generic readers; returns the loaded graph plus a Map<Integer, double[]> of node coordinates that pairs with HaversineHeuristic. Adds isFixtureAvailable() and a helpful IllegalStateException pointing at the README when the CSV fixtures aren't on the classpath.
AndorraGraphLoaderSmokeTest — skips via Assumptions.assumeTrue when the fixture is absent, so the suite stays green on clean checkouts.
AndorraDijkstraManyToManyShortestPathsBench and AndorraAllDirectedPathsNonSimpleBench — AverageTime benchmark templates demonstrating the loader + JMH setup on the 36,618-vertex / 67,354-edge Andorra strongly-connected component.
AndorraBenchSuite — surefire @Test entry points dispatched through JmhBenchRunner.
src/test/resources/perf/osm/README.md — download URL, preprocessor invocation, and the "fixtures absent → tests skip" behaviour.

The raw 700 KB of Andorra CSV.gz fixtures are not committed: the data is freely re-derivable from Geofabrik under the Open Database License, changes upstream daily, and shipping a snapshot was the wrong default for a library repo. Contributors run the preprocessor once and drop the CSVs into src/test/resources/perf/osm/; the existing .gitignore *.gz rule keeps them from accidentally being committed.

Commit 3 — GPKG → CSV preprocessor (org.jgrapht.perf.util.GpkgRoadGraphPreprocessor)

The preprocessor that produces the CSVs is itself a Java class on the test classpath, not a Python or shell script:

Reads the Geofabrik GPKG via org.xerial:sqlite-jdbc (test scope, added to parent dependencyManagement).
Parses GPKG-wrapped WKB LINESTRING blobs by hand (the only geometry type the road layer uses — about 30 lines of ByteBuffer).
Snaps coordinates at 1e-7 deg to dedupe shared endpoints between adjacent line-strings.
Runs KosarajuStrongConnectivityInspector to find the largest Strongly Connected Component (SCC).
Dedupes parallel edges keeping the shortest Haversine weight.
Writes both gzipped CSVs in the schema the loader expects.

java.sql is a JDK module not currently required by org.jgrapht.core. Rather than polluting the production module-info.java for a test-only need, the additions live in <testCompilerArgs> and the surefire argLine (--add-modules java.sql --add-reads org.jgrapht.core=java.sql). The same class works against any Geofabrik free-tier region; I verified the byte stream produces a graph identical (to the level of vertex / edge counts and Dijkstra-reachability) to the smoke-test expectations on Andorra.

Why bother

The same utilities scale beyond Andorra. I've validated the readFile(Path) path locally against a Geofabrik Israel extract (2,338,873 vertices / 4,342,430 edges in the largest SCC, ~64× Andorra). Load time on this hardware: coords 0.7 s, vertex pre-add 0.3 s, edges 5.7 s, single full-target Dijkstra 1.9 s. The same scripts/andorra_to_csv.py works on any Geofabrik free-tier extract; the docstring spells out the schema assumptions.

Tests

24 new tests in org.jgrapht.perf.util (in-memory parsing, gzip, resource lookup, file I/O, error reporting).
AndorraGraphLoaderSmokeTest validates the worked example end-to-end when the fixture is present and skips cleanly when absent.
Full mvn -pl jgrapht-core test is green (6926 pass, 15 unrelated upstream skips).

Questions for the dev list

Package placement. org.jgrapht.perf.util for the generic readers / heuristic / runner / preprocessor; org.jgrapht.perf.shortestpath.osm for the Andorra worked example. Is that the right home, or do you want me to put the readers under a different package — org.jgrapht.test.util, org.jgrapht.bench.util, etc.?

New test dep + JPMS plumbing. sqlite-jdbc (test scope) is what the GPKG preprocessor needs; java.sql is reached via <testCompilerArgs> + surefire argLine so the production module-info.java stays unchanged. Comfortable with that pattern, or would you rather see requires static java.sql; (or requires java.sql;) declared in module-info.java?

Sample data policy. I deliberately ship zero raw data; the README points at Geofabrik. If the project would rather vendor a tiny sample fixture (~50-vertex hand-built graph, say) so the bench templates can run on a clean checkout without external setup, I'm happy to add one — but I didn't want to assume that on my own.

Anything else worth folding in before opening the PR — extra reader options (readDirected vs readUndirected toggles?), additional sample bench templates, naming nits?

I'll open the PR shortly unless someone wants me to hold for discussion.

Thanks, Shai

P.S. When I ran the new Java preprocessor on a Geofabrik Israel extract to validate it at scale, vertex 0 of the resulting graph landed at (29.5512804, 34.9536767) — the southernmost city in Israel, Eilat. Geofabrik exports gis_osm_roads_free in OSM way_id ascending, so vertex 0 is always the first node of the lowest-numbered way in the cut; in Israel that happens to be an early-digitized road in Eilat. The surname Eilat is held by ~835 people in Israel (and not many more globally), so the odds of this preprocessor landing your namesake at vertex 0 — given an Israeli developer wrote it — are around 1 in a million. I'll take it as the universe signing off on the PR.

John Sichi

unread,

May 20, 2026, 1:47:43 AMMay 20

to Shai Eilat, jgrapht-dev

Before answering your questions, it feels to me like the OSM portion wants to become its own new top-level jgrapht-osm module. Whether via preprocessing or otherwise, this would allow an application (not just a test) to import OSM data into a JGraphT graph. We've generally tried to keep those kinds of integrations out of jgrapht-core itself. The downside is that your new performance tests would live in the new module, rather than in core, but I don't think that's the end of the world.

That approach would also allow you to create a dependency on the existing CSV reader in the jgrapht-io module.

What do you think?

(A search led to https://github.com/NGXAUT/OSMToJGraphT but it's very old.)

--
You received this message because you are subscribed to the Google Groups "jgrapht-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jgrapht-dev...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/jgrapht-dev/b72dc594-6a5c-4661-aba1-ee8a934a6b82n%40googlegroups.com.

Shai Eilat

unread,

May 23, 2026, 5:30:09 AMMay 23

to jgrapht-dev

Hi John,

Thanks — that framing is a real improvement and I went with it. Pushed the rework to the same branch (seilat/jgrapht:chore/perf-bench-graph-loaders); the head is now 97a73df4cd chore(osm): refactor OSM integration into new jgrapht-osm module.

What changed since the original post

There is now a top-level jgrapht-osm module:

jgrapht-osm/

├── pom.xml # jgrapht-core (compile), jgrapht-io (compile), sqlite-jdbc (compile), JMH (test)

├── src/main/java/module-info.java # requires java.sql properly -- no --add-reads gymnastics

├── src/main/java/org/jgrapht/osm/

│ ├── GpkgRoadGraphPreprocessor.java # GPKG -> headerless CSVs (production, callable as main() or library)

│ ├── OsmCsvGraphLoader.java # CSV -> Graph<Integer, DefaultWeightedEdge>, wraps jgrapht-io CSVImporter

│ ├── OsmCoordinatesReader.java # CSV -> Map<Integer, double[]>

│ └── HaversineHeuristic.java # generic AStarAdmissibleHeuristic

└── src/test/java/org/jgrapht/osm/

├── AndorraGraphLoader.java # convenience fixture helper

├── AndorraGraphLoaderSmokeTest.java

├── OsmCsvGraphLoaderTest.java

├── OsmCoordinatesReaderTest.java

├── HaversineHeuristicTest.java

└── perf/ # JMH templates (m2m, ADP) + AndorraBenchSuite + JmhBenchRunner

Concretely:

The edge-list reader is now org.jgrapht.nio.csv.CSVImporter from jgrapht-io. The preprocessor writes headerless CSVs (src,dst,weight) so the importer consumes them with CSVFormat.EDGE_LIST + EDGE_WEIGHTS=true and vertexFactory=Integer::valueOf. My old WeightedEdgeListCsvReader is gone.
OsmCoordinatesReader is the only ad-hoc reader I kept: the coordinate CSV is a flat key-value table, not graph-shaped, so the existing importers do not apply. Twenty lines of BufferedReader plus parse-error line tagging.
HaversineHeuristic is now production code in jgrapht-osm (was test-scope perf-util in core). The module-info declares requires java.sql; so the <testCompilerArgs> + surefire argLine hack is gone from jgrapht-core/pom.xml.
sqlite-jdbc is now a compile dep of jgrapht-osm only. Version still managed in the parent dependencyManagement. Nothing in jgrapht-core depends on it.
Perf benches (Andorra m2m, AllDirectedPaths walk-enumeration, the batch shortest-paths bench queued for the follow-up PR) live in jgrapht-osm/src/test/java/org/jgrapht/osm/perf/. Surefire's <exclude>**/perf/**</exclude> keeps them out of the default test run, same as the existing org.jgrapht.perf.* packages in jgrapht-core.

Validation

mvn -pl jgrapht-osm test with locally-produced Andorra fixtures: 16 tests pass.
mvn -pl jgrapht-osm test without fixtures: 15 pass, 1 skip on the Andorra smoke test via Assumptions.assumeTrue.
mvn -pl jgrapht-core test: 6914 pass, 15 unrelated upstream skips. Net delta from master is the deletion of the moved perf-util / osm test classes.
Andorra round-trip end-to-end through the new module: GpkgRoadGraphPreprocessor → fixtures → AndorraGraphLoader.load() → DijkstraShortestPath on 36,618 V / 67,354 E.
Smoke of AndorraDijkstraManyToManyShortestPathsBench and the queued AndorraBatchShortestPathsBench via org.openjdk.jmh.Main from the new module's test classpath: both run end-to-end.

Main question for your call

GpkgRoadGraphPreprocessor is the only piece where I want explicit sign-off, because it is what brings sqlite-jdbc into the module's compile classpath. Right now it lives in src/main so users can wire it into their own apps, not only into JMH benches — that matches the "import OSM data into JGraphT graphs" framing from your earlier note.

If the project would rather not have a sqlite-jdbc compile dependency on a JGraphT module, the easy fallback is to move the preprocessor into src/test or an examples source root, and ship jgrapht-osm proper as CSV-loader + coordinates + Haversine only. That keeps the module dependency-light at the cost of forcing users to either fetch pre-produced CSVs or run the preprocessor as a one-shot tool outside the main module API.

My preference is to keep the current shape, because without the preprocessor the module is mostly a CSV loader for a format users must produce externally — but this is the part I am least attached to.

Minor housekeeping

module-info has requires transitive org.jgrapht.core and requires org.jgrapht.io. jgrapht.io is non-transitive because the loader wraps CSVImporter and consumers do not need to see the importer directly. Happy to flip if there is a project convention I am missing.
AndorraBenchSuite (@Test methods that dispatch JMH via JmhBenchRunner) is convenience only — a way to run a bounded smoke cell from Maven without remembering a long JMH command. The canonical benchmark path stays org.openjdk.jmh.Main. If the project prefers not to mix JUnit dispatch with JMH at all, dropping the suite and the runner is a one-class delete and does not touch the module's public API.

I will wait on this thread before opening the PR. I will keep the follow-up BatchShortestPathAlgorithm PR separate so this PR stays focused on OSM/test-infra plumbing.

Thanks again for the steer.

Shai

John Sichi

unread,

May 24, 2026, 1:01:20 AMMay 24

to Shai Eilat, jgrapht-dev

I'm fine with the module-level sqlite-jdbc compile-time dependency.

Also, I think including the tiny sample fixture data would be good instead of just skipping altogether.

For the questions about the remaining details, I'll take a look once you open the PR.

To view this discussion visit https://groups.google.com/d/msgid/jgrapht-dev/a65c57ba-4184-45c0-92bc-dc10876e2c3dn%40googlegroups.com.

Reply all

Reply to author

Forward