Dear Mongo users,
I would like to analyze a MongoDB database that makes intensive use of DBRefs to link different parts of a record back to a central document; sometimes these links form a multilevel hierarchy. In total, about 20 collections in the database are linked via DBRefs. The mongo-spark configuration requires specifying a single collection, and as far as I can tell there is no way to resolve these references within the library. I also tried adding the plain Java driver, but some of its classes do not implement the Serializable interface, so they cannot be used inside Spark closures. After trying different approaches and spending quite some time searching for a solution, I ended up with a rather odd workaround: I created a REST service that resolves the links and builds the JSON of the whole object, and I call it for each record in Spark:
JavaRDD<String> baseCountsRDD = rdd.map(record -> {
    String id = record.get("about", String.class);
    String jsonString = client.getRecord(id);
    return analyzer.analyze(jsonString);
});
Since the database is large, this REST layer adds significant overhead to the system. What I would like to achieve is something like this:
JavaRDD<String> baseCountsRDD = rdd.map(record -> {
    String jsonString = resolver.resolveLinks(record);
    return analyzer.analyze(jsonString);
});
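For context, the link-resolution logic I have in mind is roughly the following sketch. It uses plain Java maps standing in for BSON documents, and a pluggable lookup function standing in for the database fetch, so that the actual driver (and its serialization issues) stays out of the picture; the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class DBRefResolver {

    // Recursively walks a document tree and replaces every DBRef-shaped
    // node ({"$ref": <collection>, "$id": <id>}) with the document that
    // the lookup function returns for "<collection>/<id>". The returned
    // document is resolved again, so multilevel hierarchies are followed.
    static Object resolve(Object node, Function<String, Map<String, Object>> lookup) {
        if (node instanceof Map) {
            Map<?, ?> m = (Map<?, ?>) node;
            if (m.containsKey("$ref") && m.containsKey("$id")) {
                Map<String, Object> target = lookup.apply(m.get("$ref") + "/" + m.get("$id"));
                return resolve(target, lookup);
            }
            Map<String, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : m.entrySet()) {
                out.put(String.valueOf(e.getKey()), resolve(e.getValue(), lookup));
            }
            return out;
        }
        if (node instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) node) {
                out.add(resolve(item, lookup));
            }
            return out;
        }
        return node; // scalar value: keep as-is
    }
}
```

In Spark, I imagine the lookup function would be backed by a per-partition client (created inside mapPartitions, so nothing non-serializable is captured in the closure), but that is exactly the part I have not found a clean way to do with mongo-spark.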
Unfortunately, DBRef does not appear anywhere in the mongo-spark code base, and the documentation and all the examples are based on a single collection.
Do you have any ideas I could try?
Best,
Péter