Preloading in CloudDataStore

thstart

May 28, 2018, 10:22:52 PM
to Google App Engine
How can I preload data into Cloud Datastore?

I have a .csv file of about 100 GB (100 million records).

The Datastore would be read-only. What is the best way to load it while minimizing cost? Each record has 200 fields, but only 10 are searchable.

Because of the size, I am thinking of storing only those 10 fields in Cloud Datastore, plus one property that links to an object in a Cloud Storage bucket holding the full record in JSON format.

I would like to preload the 10 searchable fields and the link into Cloud Datastore so that it is ready to use.

What is the best and most cost-efficient way to do this?



Jordan (Cloud Platform Support)

May 29, 2018, 11:12:18 AM
to Google App Engine
Because Google Cloud Datastore is a non-relational, NoSQL, highly scalable database, you need to write a script in a supported language of your choice that reads your CSV, converts each row into a Datastore Entity, and saves those Entities to your Datastore. Once your data is loaded, you can easily export it and import it into other projects' Datastores via the Managed Import/Export service.
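
A minimal sketch of such a loader script, assuming Python and the google-cloud-datastore client library. The kind name `Record` and the helper names `read_batches` and `load_csv` are hypothetical. The batch size of 500 is an assumption based on the per-commit entity limit and should be checked against the current quotas.

```python
import csv
from itertools import islice

BATCH_SIZE = 500  # assumed per-commit entity limit; verify against current quotas


def read_batches(csv_file, batch_size=BATCH_SIZE):
    """Yield lists of row dicts, at most batch_size rows per list."""
    reader = csv.DictReader(csv_file)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def load_csv(path, kind="Record"):
    """Stream a CSV file into Datastore one batch at a time."""
    # Imported lazily so read_batches can be used without the library installed.
    from google.cloud import datastore

    client = datastore.Client()
    with open(path, newline="") as f:
        for batch in read_batches(f):
            entities = []
            for row in batch:
                # A key built from the kind alone is incomplete,
                # so Datastore assigns the ID when the entity is saved.
                entity = datastore.Entity(key=client.key(kind))
                entity.update(row)
                entities.append(entity)
            client.put_multi(entities)
```

Streaming the file in batches keeps memory use flat regardless of the CSV size. For 100 million rows, a single-process loop like this would likely be slow, so running several loaders over separate slices of the file may be worth considering.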

The way to minimize Datastore costs is to use as few indexes as possible. You can think of an index as a sorted table: every query you run needs an associated sorted index (in effect, a copy of your data sorted specifically for that query). By default, an index is automatically created for each property of each Entity Kind. To avoid having 200 automatically created indexes, mark every property that you do not plan to query as unindexed to save money.
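
As an illustration of marking properties as unindexed, here is a sketch that assumes the Python google-cloud-datastore library, where `Entity` accepts an `exclude_from_indexes` argument. The searchable field names (`key_1` to `key_10`) and the helper names are placeholders, not names from the question.

```python
# Hypothetical names for the 10 searchable fields; substitute your own.
SEARCHABLE = frozenset(f"key_{i}" for i in range(1, 11))


def unindexed_properties(row, searchable=SEARCHABLE):
    """Return the property names that should be excluded from indexes."""
    return tuple(name for name in row if name not in searchable)


def build_entity(client, row, kind="Record"):
    """Build an Entity whose non-searchable properties are left unindexed."""
    # Imported lazily so the helper above works without the library installed.
    from google.cloud import datastore

    entity = datastore.Entity(
        key=client.key(kind),
        exclude_from_indexes=unindexed_properties(row),
    )
    entity.update(row)
    return entity
```

With this approach, only the properties named in `SEARCHABLE` get the automatic per-property index.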

It is actually recommended to use Datastore's automatic key generation when saving Entities rather than creating custom keys. Datastore will then ensure that your data is properly sharded and evenly distributed across its servers, which avoids hotspots and latency when specific entities are accessed heavily. Your link to Google Cloud Storage should therefore be saved as an additional Entity property, which is returned when you query for the Entity. In general, it is recommended to follow the Best Practices for Datastore.
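
A rough sketch of that layout, again assuming the Python google-cloud-datastore library. The property name `gcs_uri`, the bucket and object names, and the helper functions are all hypothetical.

```python
def gcs_uri(bucket, object_name):
    """Format a Cloud Storage location as a gs:// URI."""
    return f"gs://{bucket}/{object_name}"


def record_properties(row, searchable, bucket, object_name):
    """Keep only the searchable fields plus a link to the full JSON record."""
    props = {name: row[name] for name in searchable if name in row}
    props["gcs_uri"] = gcs_uri(bucket, object_name)
    return props


def save_record(client, props, kind="Record"):
    """Save the properties under an automatically generated key."""
    # Imported lazily so the helpers above work without the library installed.
    from google.cloud import datastore

    # The key has a kind but no name or ID, so Datastore allocates the ID.
    # The link is only returned with query results, never queried on,
    # so it is excluded from indexes.
    entity = datastore.Entity(
        key=client.key(kind),
        exclude_from_indexes=("gcs_uri",),
    )
    entity.update(props)
    client.put(entity)
    return entity
```

For example, `record_properties(row, ["key_1"], "my-bucket", "records/1.json")` keeps `key_1` from the row and adds the link `gs://my-bucket/records/1.json`.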

- Note that Google Groups is reserved for general product discussions and not for technical support. If you require further technical support for the Datastore, it is recommended to post your detailed questions to Stack Exchange using the supported Cloud tags. 

thstart

May 29, 2018, 12:35:01 PM
to Google App Engine
Hi Jordan,

Thank you for the detailed response. One thing is still not clear to me.
As a simple calculation: if I have only one index, what is the price, and how is it calculated?

In the pricing calculator I see Document Index at $2/GB, which seems too expensive.
Is this the same kind of index as the ones Datastore generates automatically?

Thank you,
--Constantine

Jordan (Cloud Platform Support)

May 30, 2018, 12:55:26 PM
to Google App Engine
You seem to be confusing the Search API document indexes with Datastore indexes. Datastore has its own tab in the Pricing Calculator, and charges $0.18 per GB per month.
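
As a back-of-the-envelope sketch of that storage rate: the function name is hypothetical, the default price is the $0.18 figure quoted above, and `stored_gb` has to be the size Datastore actually stores (entities plus their index entries), which will differ from the raw CSV size. Read and write operation charges are not covered here.

```python
def monthly_storage_cost(stored_gb, price_per_gb_month=0.18):
    """Estimate the monthly storage charge in dollars for stored_gb of data."""
    return round(stored_gb * price_per_gb_month, 2)
```

At this rate, 100 GB of stored data would come to about $18 per month, so keeping only the 10 searchable fields and the link in Datastore should cost less than storing all 200 fields.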

thstart

May 30, 2018, 1:19:45 PM
to Google App Engine
I know about the Datastore tab.

What I mean is this entry in the GC pricing calculator:
App Engine:
App Engine APIs and Services -> Indexing Documents

Where can I read about what these Documents are?

Jordan (Cloud Platform Support)

May 30, 2018, 1:35:09 PM
to Google App Engine
As previously mentioned, that is the Search API, which is not related to the Datastore. The Search API allows you to store document objects and query for them based on their contents. It uses ranking and indexing to find the most relevant objects, much like actual Google Search. It is not meant to be used as a primary database, and is strictly for generic search (as the search results may change).

thstart

May 30, 2018, 1:52:02 PM
to Google App Engine
Where are these Documents stored - in Google Cloud Storage?

Jordan (Cloud Platform Support)

May 30, 2018, 2:13:47 PM
to Google App Engine
Since the 'documents' are JSON objects and not raw binary, they cannot be saved into Google Cloud Storage. I think they used to be persisted to Bigtable, but I believe they may now actually be stored in Spanner.

thstart

May 30, 2018, 2:29:18 PM
to Google App Engine
Thank you, the documentation is not clear about that.

thstart

May 30, 2018, 4:28:00 PM
to Google App Engine
BigQuery can export to CSV and JSON in Cloud Storage.