I know that JanusGraph needs a column-family type NoSQL database as a storage backend, and hence that is why we have Scylla, Cassandra, HBase etc. Snowflake isn't a column-family database, but it has a column data type which can store any sort of data. So we could store complete JSON-oriented column-family data there after massaging / pre-processing the data. Is that a practical thought? Is it practical enough to implement?

If it is practical enough to implement, what needs to be done? I'm going through the source code, and I'm basing my ideas on my understanding of the janusgraph-cassandra and janusgraph-berkeleyje projects. Please correct me if I'm wrong in my understanding.

- We need to have a StoreManager class, like HBaseStoreManager, AbstractCassandraStoreManager or BerkeleyJEStoreManager, which extends either DistributedStoreManager or LocalStoreManager and implements KeyColumnValueStoreManager, right? These classes need to build a features object, which is more or less the storage connection configuration. They also need a beginTransaction method which creates the actual connection to the corresponding storage backend. Is that correct?
- You need corresponding transaction classes which create the transaction against the corresponding backend, like CassandraTransaction or BerkeleyJETx. The transaction class needs to extend the AbstractStoreTransaction class. Though I can see and understand the transaction being created in BerkeleyJETx, I don't see something similar for CassandraTransaction. So am I missing something in my understanding here?
- You need to have a KeyColumnValueStore class for the backend, like AstyanaxKeyColumnValueStore or BerkeleyJEKeyValueStore etc. They need to extend KeyColumnValueStore. This class takes care of massaging the data into key-column format so that it can then be inserted into the corresponding table inside the storage backend.
- So the questions on my mind are: what will be the structure of those classes? (A rough skeleton of what I have in mind is sketched below.)
- Are there some methods which always need to be present? I see getSlice() being used across all these classes. Also, how do they work? Do they just convert incoming gremlin queries into a KeyColumnValue structure?
- Are there any other classes I'm missing, or are these 3 the only ones that need to be modified to create a new storage backend?
- Also, if these 3 are the only classes needed, and let's say we succeed in using Snowflake as a storage backend, how does the read/query aspect of JanusGraph get solved? Are there any changes needed on that end as well, or is JanusGraph so abstracted that it can simply start picking up from the new source?
- And I thought there would be some classes which read in the "gremlin queries", do certain "pre-processing into certain data structures (tabular)", and then push the result through some connection into the respective backend. This is where we need help: is there a way to visualize those objects after the "pre-processing", store them as-is in Snowflake, and reuse them to fulfil gremlin queries?

I know we can store arbitrary objects in Snowflake; I'm just looking at the changes needed at the JanusGraph level to achieve that. Any help will be really appreciated. Thanks in advance.
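For anyone trying to picture the structure described in the bullets above, here is a minimal, self-contained skeleton (no JanusGraph dependencies) of the shape such a backend would take. The class names (SnowflakeStoreManager, SnowflakeTx, SnowflakeKeyColumnValueStore) are invented purely for illustration; in a real backend they would implement/extend the JanusGraph SPI types named in the question (KeyColumnValueStoreManager, AbstractStoreTransaction, KeyColumnValueStore), whose exact signatures should be checked against the JanusGraph version in use.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch only: in a real backend, SnowflakeStoreManager would implement
    // org.janusgraph.diskstorage.keycolumnvalue.KeyColumnValueStoreManager,
    // SnowflakeTx would extend org.janusgraph.diskstorage.common.AbstractStoreTransaction,
    // and SnowflakeKeyColumnValueStore would implement KeyColumnValueStore.
    public class SnowflakeStoreManager {

        private final Map<String, SnowflakeKeyColumnValueStore> stores = new ConcurrentHashMap<>();

        // Built from the configuration; tells JanusGraph what the backend supports
        // (roughly the "features object" mentioned in the question).
        public Object getFeatures() {
            return null; // e.g. a StandardStoreFeatures instance in a real implementation
        }

        // JanusGraph opens one named store per internal "table" (edgestore, graphindex, ...).
        public SnowflakeKeyColumnValueStore openDatabase(String name) {
            return stores.computeIfAbsent(name, SnowflakeKeyColumnValueStore::new);
        }

        // Creates the per-transaction context; the actual connection to the backend
        // can be established here or lazily inside the stores.
        public SnowflakeTx beginTransaction(Object txConfig) {
            return new SnowflakeTx(txConfig);
        }

        public static class SnowflakeTx {
            private final Object config;
            SnowflakeTx(Object config) { this.config = config; }
            public void commit() { /* flush buffered writes to the backend */ }
            public void rollback() { /* discard buffered writes */ }
        }

        public static class SnowflakeKeyColumnValueStore {
            private final String name;
            SnowflakeKeyColumnValueStore(String name) { this.name = name; }
            // getSlice(...), mutate(...) and getKeys(...) would live here.
        }
    }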
Hi Debasish,

Here are my 2 cents.

First of all, you need to be clear with yourself as to why exactly you want to build a new backend. E.g. do you find that the existing ones are sub-optimal for certain use cases, or are they too hard to set up, or do you just want to provide a backend for a cool new database in the hope that it will increase adoption, or something else? In other words, do you have a clear idea of what this new backend is going to provide which the existing ones do not, e.g. advanced scalability or performance or ease of setup, or just an option for people with existing Snowflake infra to put it to a new use?

Second, you are almost correct, in that basically all you need to implement are three interfaces:
- KeyColumnValueStoreManager, which allows opening multiple instances of named KeyColumnValueStores and provides a certain level of transactional context between the different stores it has opened;
- KeyColumnValueStore, which represents an ordered collection of "rows" accessible by keys, where each row is a
- KeyValueStore - basically an ordered collection of key-value pairs, which can be thought of as the individual "columns" of that row and their respective values.

Both row and column keys, and the data values, are generic byte data.

Have a look at this piece of documentation: https://docs.janusgraph.org/advanced-topics/data-model/

Possibly the simplest way to understand the "minimum contract" required by JanusGraph from a backend is to look at the inmemory backend. You will see that:
- KeyColumnValueStoreManager is conceptually a Map of store name -> KeyColumnValueStore,
- each KeyColumnValueStore is conceptually a NavigableMap of "rows" or KeyValueStores (i.e. a "table"),
- each KeyValueStore is conceptually an ordered collection of key -> value pairs ("columns").

In the most basic case, once you implement these three relatively simple interfaces, JanusGraph can take care of all the translation of graph operations such as adding vertices and edges, and of gremlin queries, into a series of read-write operations over a collection of KCV stores. When you open a new graph, JanusGraph asks the KeyColumnValueStoreManager implementation to create a number of specially named KeyColumnValueStores, which it uses to store vertices, edges, and various indices. It also creates a number of "utility" stores which it uses internally for locking, id management etc.

Crucially, whatever stores JanusGraph creates in your backend implementation, and whatever it is using them for, you only need to make sure that you implement those basic interfaces which allow it to store arbitrary byte data and access it by arbitrary byte keys.

So for your first "naive" implementation, you most probably shouldn't worry too much about translation of the graph model to the KCVS model and back - this is what JanusGraph itself is mostly about anyway.

Just use StoreFeatures to tell JanusGraph that your backend supports only the most basic operations, and concentrate on thinking about how to best implement the KCVS interfaces with your underlying database/storage system.

Of course, after that, as you start thinking about supporting better levels of consistency/transaction management across multiple stores, about performance, better utilising native indexing/query mechanisms, separate indexing backends, support for a distributed backend model etc. - you will find that there is more to it, and this is where you can gain further insights from the documentation, the existing backend sources, and asking more specific questions.

See for example this piece of documentation: https://docs.janusgraph.org/advanced-topics/eventual-consistency/

Hope this helps,
Dmitry
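As a thought experiment, the three conceptual levels described above can be written out as plain Java collections. This is not the JanusGraph inmemory backend itself, just a self-contained sketch of the data shape a KCVS backend has to expose: named stores, rows looked up by key, and columns kept in sorted order within each row.

    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentHashMap;

    public class ConceptualKcvs {
        // KeyColumnValueStoreManager: store name -> KeyColumnValueStore
        private final Map<String, NavigableMap<ByteBuffer, NavigableMap<ByteBuffer, ByteBuffer>>> stores =
                new ConcurrentHashMap<>();

        // KeyColumnValueStore: row key -> KeyValueStore (a sorted "wide row");
        // KeyValueStore: column key -> value, kept in column-key order.
        public NavigableMap<ByteBuffer, NavigableMap<ByteBuffer, ByteBuffer>> openStore(String name) {
            return stores.computeIfAbsent(name, n -> new TreeMap<>());
        }

        public void put(String store, ByteBuffer rowKey, ByteBuffer columnKey, ByteBuffer value) {
            openStore(store)
                    .computeIfAbsent(rowKey, k -> new TreeMap<>())
                    .put(columnKey, value);
        }
    }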
Debasish,

This sounds like an interesting project, but I do have a question about your choice of Snowflake. If I missed your response to this in the email chain, I apologize, but what problems with the existing high-performance backends (Scylla, for instance) are you trying to solve with Snowflake? The answer to that would probably inform your specific implementation over Snowflake.

Thanks,
Ryan
Hi.
Is this backend open-source/will be open-sourced?
Best regards,
Evgeniy Ignatiev.
// Prepared CQL statement for getSlice(): select all columns of one row (partition key)
// whose column name falls in the half-open range [sliceStart, sliceEnd), up to a limit.
this.getSlice = this.session.prepare(select()
        .column(COLUMN_COLUMN_NAME)
        .column(VALUE_COLUMN_NAME)
        .fcall(WRITETIME_FUNCTION_NAME, column(VALUE_COLUMN_NAME)).as(WRITETIME_COLUMN_NAME)
        .fcall(TTL_FUNCTION_NAME, column(VALUE_COLUMN_NAME)).as(TTL_COLUMN_NAME)
        .from(this.storeManager.getKeyspaceName(), this.tableName)
        .where(eq(KEY_COLUMN_NAME, bindMarker(KEY_BINDING)))
        .and(gte(COLUMN_COLUMN_NAME, bindMarker(SLICE_START_BINDING)))
        .and(lt(COLUMN_COLUMN_NAME, bindMarker(SLICE_END_BINDING)))
        .limit(bindMarker(LIMIT_BINDING)));

// At query time, the row key, slice bounds and limit from the KeySliceQuery are bound to the
// prepared statement and executed asynchronously; the result set is mapped to an EntryList.
final Future<EntryList> result = Future.fromJavaFuture(
        this.executorService,
        this.session.executeAsync(this.getSlice.bind()
                .setBytes(KEY_BINDING, query.getKey().asByteBuffer())
                .setBytes(SLICE_START_BINDING, query.getSliceStart().asByteBuffer())
                .setBytes(SLICE_END_BINDING, query.getSliceEnd().asByteBuffer())
                .setInt(LIMIT_BINDING, query.getLimit())
                .setConsistencyLevel(getTransaction(txh).getReadConsistencyLevel())))
        .map(resultSet -> fromResultSet(resultSet, this.getter));
interruptibleWait(result);

// The same where-clause, with the query values inlined instead of bind markers:
        .where(eq(KEY_COLUMN_NAME, query.getKey().asByteBuffer()))
        .and(gte(COLUMN_COLUMN_NAME, query.getSliceStart().asByteBuffer()))
        .and(lt(COLUMN_COLUMN_NAME, query.getSliceEnd().asByteBuffer()))
        .limit(query.getLimit()));
Hi Debasish,

In terms of wrapping one's head around what the getSlice() method does - conceptually it is not hard to understand if you peruse the link I referred you to in my original reply. The relevant part of it is really short, so I'll just copy it here (with added emphasis):

===quote===
Bigtable Data Model
Under the Bigtable data model each table is a collection of rows. Each row is uniquely identified by a key. Each row is comprised of an arbitrary (large, but limited) number of cells. A cell is composed of a column and value. A cell is uniquely identified by a column within a given row. Rows in the Bigtable model are called "wide rows" because they support a large number of cells and the columns of those cells don’t have to be defined up front as is required in relational databases.
JanusGraph has an additional requirement for the Bigtable data model: *The cells must be sorted by their columns and a subset of the cells specified by a column range must be efficiently retrievable* (e.g. by using index structures, skip lists, or binary search).
===/quote===
Basically, the getSlice method is the formal representation of the requirement emphasised above: based on the order defined over the "column key" space, it should return all "columns" whose keys lie between the start and end keys given in the SliceQuery - that is, >= start and < end (the end key is exclusive). Please refer to the javadoc for more detail.
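To make those range semantics concrete, here is a tiny, self-contained sketch (not JanusGraph code) of what a getSlice over one ordered row boils down to, using a NavigableMap with an inclusive start bound and an exclusive end bound:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class SliceExample {
        public static void main(String[] args) {
            // One "row": column keys kept in sorted order, as the Bigtable model requires.
            NavigableMap<ByteBuffer, ByteBuffer> row = new TreeMap<>();
            row.put(bytes("a"), bytes("1"));
            row.put(bytes("b"), bytes("2"));
            row.put(bytes("c"), bytes("3"));

            // getSlice(start, end): every column with start <= key < end, in key order.
            ByteBuffer start = bytes("a");
            ByteBuffer end = bytes("c");
            NavigableMap<ByteBuffer, ByteBuffer> slice = row.subMap(start, true, end, false);
            System.out.println(slice.size()); // prints 2 ("a" and "b"; "c" is excluded)
        }

        private static ByteBuffer bytes(String s) {
            return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
        }
    }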
However, answering the question of how to implement it effectively in your backend is pretty much the crux of your potential contribution.
If the underlying DB's data model more or less "natively" supports the above (as e.g. in the case of Cassandra, BDB etc), then it becomes relatively easy.
If the underlying data model is different, then it gets us back to the question which has been asked a couple of times in this thread - i.e. whether it is actually feasible and/or desirable to try and implement it?
For example, in order to implement it in a "classical" RDBMS, you would have to find one which supports ordering and indexing of byte columns/blobs, and then you would probably encounter scalability issues if you chose to model the whole key-column-value store as one table with row key, column key and data... It might still be possible to address these issues and implement it reasonably effectively, but it is unclear what the point would be - you would effectively have to circumvent the "relational/SQL" top abstraction layer, which is the whole point of an RDBMS, to get back to lower-level implementation details.
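For illustration, the "one table with row key, column key and data" mapping mentioned above might look roughly like the JDBC sketch below. The table and column names (kcv_store, row_key, col_key, col_value) are invented, and whether byte columns sort the way JanusGraph needs, or whether LIMIT accepts a bind parameter, depends entirely on the database - this only shows how a getSlice would translate into SQL, not that it would perform well.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class JdbcSliceSketch {
        // Hypothetical schema: CREATE TABLE kcv_store (row_key BLOB, col_key BLOB, col_value BLOB,
        //                                              PRIMARY KEY (row_key, col_key));
        private static final String SLICE_SQL =
                "SELECT col_key, col_value FROM kcv_store " +
                "WHERE row_key = ? AND col_key >= ? AND col_key < ? " +
                "ORDER BY col_key LIMIT ?";

        // Returns at most 'limit' columns of one row whose keys fall in [sliceStart, sliceEnd).
        public static void getSlice(Connection conn, byte[] rowKey, byte[] sliceStart,
                                    byte[] sliceEnd, int limit) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(SLICE_SQL)) {
                ps.setBytes(1, rowKey);
                ps.setBytes(2, sliceStart);
                ps.setBytes(3, sliceEnd);
                ps.setInt(4, limit);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        byte[] colKey = rs.getBytes(1);
                        byte[] value = rs.getBytes(2);
                        // ...turn each (colKey, value) pair into an Entry for the result list...
                    }
                }
            }
        }
    }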
Unfortunately I know nothing about Snowflake and its data model, and I don't have the time to learn about it in sufficient detail any time soon, so I cannot really advise you on either feasibility or implementation details.

Hope this helps,
Dmitry
Hello.
Awesome job! I have a couple of questions about your data loading
approach if you don't mind.
Is it simply aggregating writes locally before writing them to Snowflake? Or do you also use BerkeleyDB as a local write-through cache, from which reads are served for data that is not yet in Snowflake?
The drop in performance sounds expected in comparison to Cassandra; it is not simply RDBMS vs NoSQL, but DWH vs NoSQL. Snowflake is really not optimized to perform many small operations: a single insert has almost the same latency as a bulk insert, ideally a significantly large bulk insert so that the JDBC driver can leverage the internal stage loading optimization - as I understand it, you are going to do that manually through the PUT FILE + COPY INTO combination. Updates are significantly slower, and single updates are really devastating to performance (an order of magnitude degradation with hundreds of concurrent threads) due to the locking behavior and the write amplification that Snowflake micro-partitioning has to perform (overwriting a whole micro-partition and/or creating a single-record file, which results in a single object stored in the underlying storage like S3).
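As a rough illustration of the PUT + COPY INTO combination mentioned above, a bulk load through the Snowflake JDBC driver could look something like the sketch below. The stage, table and file names are placeholders, and the exact PUT/COPY options should be checked against the Snowflake documentation; the point is simply that writes are buffered locally into files and loaded in bulk rather than issued as many small INSERTs.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.Properties;

    public class SnowflakeBulkLoadSketch {
        public static void main(String[] args) throws SQLException {
            Properties props = new Properties();
            props.put("user", "<user>");          // placeholders
            props.put("password", "<password>");
            props.put("db", "<database>");
            props.put("schema", "<schema>");
            props.put("warehouse", "<warehouse>");

            // Hypothetical account URL; adjust to your deployment.
            String url = "jdbc:snowflake://<account>.snowflakecomputing.com/";

            try (Connection conn = DriverManager.getConnection(url, props);
                 Statement stmt = conn.createStatement()) {
                // 1. Upload a locally buffered batch of mutations (e.g. a CSV written by the
                //    backend) to an internal stage.
                stmt.execute("PUT file:///tmp/edgestore_batch.csv @edgestore_stage AUTO_COMPRESS=TRUE");

                // 2. Load the staged file(s) into the target table in one bulk operation.
                stmt.execute("COPY INTO edgestore FROM @edgestore_stage FILE_FORMAT = (TYPE = CSV)");
            }
        }
    }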
Also, bulk reading by means of SQL might not be worth it either. E.g. if you want to use SparkGraphComputer - the Snowflake Spark connector itself issues direct SQL queries only to request metadata, even for natively SQL-backed DataFrames/Datasets. The actual reading happens in parallel from the executors, by offloading data to an S3 stage and reading directly from it.
Best regards,
Evgeniy Ignatiev.