Elasticsearch 6.3.2

Tisa Ammann

Aug 3, 2024, 6:07:30 PM
to ryonsatbolpa

I have an ArcGIS Enterprise (10.9.1) subscription, and I would like to expose high volumes of real-time data stored in Elasticsearch through map/feature services in order to visualize it in a web app. GeoEvent Server is able to handle this high-volume real-time ingestion and can then store the data in the Spatiotemporal Big Data Store, as seen in the attached screenshot taken from Esri's DevSummit event ( =iW7_w9Evr6c&ab_channel=EsriEvents).

However, I could not find any tutorials on how to connect to Elasticsearch from the GeoEvent Server input connectors. I would appreciate any guidance on how I can go about setting up a direct connection to Elasticsearch.

Once you have registered ArcGIS Enterprise within GeoEvent Manager, you can create output layers for adding and/or updating features within a big data store. If the GeoEvent Server machine is federated, you can start adding the output layers right away, since ArcGIS Enterprise is already registered.

After you install the ArcGIS Data Store application on a dedicated server, you will have the option to configure a data store. Choose "Spatiotemporal" as the data store type, then configure it against the ArcGIS Server instance that is federated with Portal. A common misconception is that you configure the Spatiotemporal data store with GeoEvent; that is not the case.

Adding to what Jake and Dan have said above, the fact that Elasticsearch is the search and analytics engine behind Esri's Spatiotemporal Big Data Store is an implementation detail. You should consider the SBDS another type of enterprise geodatabase, a capability you configure when installing ArcGIS Data Store. Direct connections to the Elasticsearch engine are not supported.

The only supported way to connect to and work with data stored in the Spatiotemporal Big Data Store is through tools included with the various ArcGIS Server advanced server roles (e.g. GeoEvent Server, GeoAnalytics Server, etc.). Typically, client access is limited to the REST interfaces exposed through the map/feature services produced when you create/publish hosted feature layers.

I am trying to understand what shards and replicas are in Elasticsearch, but I haven't managed to. If I download Elasticsearch and run the startup script, then from what I know I have started a cluster with a single node. Now this node (my PC) has 5 shards (?) and some replicas (?).

When you download Elasticsearch and start it up, you create an Elasticsearch node, which tries to join an existing cluster if one is available, or creates a new one. Let's say you created your own new cluster with a single node, the one you just started up. We have no data yet, so we need to create an index.

When you create an index (an index is also created automatically when you index your first document), you can define how many shards it will be composed of. If you don't specify a number, it will get the default number of shards: 5 primaries. What does that mean?
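To make that concrete, here is a sketch of the request body you would send with `PUT /my-index` on Elasticsearch 6.x to set the shard counts explicitly ("my-index" is just an example name; the setting names are the standard `index.number_of_shards` / `index.number_of_replicas` settings), shown in Python for clarity:

```python
import json

# Body for: PUT /my-index  (Elasticsearch 6.x REST API)
index_settings = {
    "settings": {
        "number_of_shards": 5,    # primaries; fixed at index creation time
        "number_of_replicas": 1,  # copies per primary; can be changed later
    }
}

body = json.dumps(index_settings)
print(body)
```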

Every time you index a document, Elasticsearch decides which primary shard is supposed to hold that document and indexes it there. Primary shards are not a copy of the data, they are the data! Having multiple shards does help take advantage of parallel processing on a single machine, but the whole point is that if we start another Elasticsearch instance in the same cluster, the shards will be distributed evenly across the cluster.
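That routing decision is essentially `shard = hash(routing_value) % number_of_primary_shards`, where the routing value defaults to the document id (Elasticsearch actually uses a murmur3 hash; the sketch below uses `zlib.crc32` as a stand-in just to show the idea):

```python
import zlib

NUM_PRIMARIES = 5

def route(doc_id: str, num_primaries: int = NUM_PRIMARIES) -> int:
    """Pick the primary shard for a document id.
    ES really uses murmur3; crc32 here is only a stand-in hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primaries

# The same id always maps to the same shard, so a later GET
# for that id knows exactly which shard to ask.
assert route("product-42") == route("product-42")
assert 0 <= route("product-42") < NUM_PRIMARIES
```

This is also why the primary shard count is fixed at creation time: changing it would change every document's `hash % n` result and invalidate the existing placement.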

Every elasticsearch index is composed of at least one primary shard since that's where the data is stored. Every shard comes at a cost, though, therefore if you have a single node and no foreseeable growth, just stick with a single primary shard.

Another type of shard is a replica. The default is 1, meaning that every primary shard will be copied to another shard that will contain the same data. Replicas are used to increase search performance and for fail-over. A replica shard is never going to be allocated on the same node where the related primary is (it would pretty much be like putting a backup on the same disk as the original data).

Back to our example: with 1 replica we'll have the whole index on each node, since the replica shards allocated on the first node contain exactly the same data as the primary shards on the second node, and vice versa.
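A toy allocator can illustrate that two-node layout (this is a simplified model, not Elasticsearch's real allocation algorithm; the only rule it enforces is the real one that a replica never lands on its primary's node):

```python
def allocate(num_primaries: int, num_replicas: int, nodes: list):
    """Toy allocator: primaries round-robin across nodes; each replica
    is placed on a node other than its primary's (ES never co-locates them)."""
    placement = {n: set() for n in nodes}
    for shard in range(num_primaries):
        primary_node = nodes[shard % len(nodes)]
        placement[primary_node].add(("P", shard))
        others = [n for n in nodes if n != primary_node]
        for r in range(min(num_replicas, len(others))):
            placement[others[r]].add(("R", shard))
    return placement

p = allocate(5, 1, ["node-1", "node-2"])
# With 1 replica and 2 nodes, each node ends up holding a copy
# (primary or replica) of every shard, i.e. the whole index.
for node in p:
    assert {s for _, s in p[node]} == set(range(5))
```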

With a setup like this, if a node goes down, you still have the whole index. The replica shards will automatically be promoted to primaries and the cluster will keep working properly despite the node failure.

Since you have "number_of_replicas": 1, the replicas can no longer be assigned, as a replica is never allocated on the same node as its primary. That's why you'll have 5 unassigned shards (the replicas), and the cluster status will be YELLOW instead of GREEN. There is no data loss, but it could be better, as some shards cannot be assigned.
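The status colors follow directly from which shards can be assigned, which a toy health function can mirror (a deliberate simplification of the real cluster health API, which also accounts for relocating and initializing shards):

```python
def cluster_health(num_primaries: int, num_replicas: int, num_nodes: int) -> str:
    """Toy model of ES status colors:
    red    - some primary is unassigned (data missing)
    yellow - all primaries assigned, but some replicas are not
    green  - every primary and every replica has a home
    A replica needs a node other than its primary's, so full
    assignment needs at least num_replicas + 1 nodes."""
    if num_nodes == 0:
        return "red"
    if num_nodes >= num_replicas + 1:
        return "green"
    return "yellow"

assert cluster_health(5, 1, 1) == "yellow"  # the single-node case above
assert cluster_health(5, 1, 2) == "green"   # second node joins
```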

As soon as the node that left comes back up, it will join the cluster again and the replicas will be assigned again. The existing shards on the second node can be reloaded, but they need to be synchronized with the other shards, since write operations most likely happened while the node was down. Once this operation completes, the cluster status will become GREEN.

Replicas are copies of the shards and provide reliability if a node is lost. This number is a frequent source of confusion, because a replica count of 1 means the cluster must have the primary and one replica copy of each shard available in order to be in the green state.

A cluster consists of one or more nodes which share the same cluster name. Each cluster has a single master node which is chosen automatically by the cluster and which can be replaced if the current master node fails.

I will explain this using a real-world scenario. Imagine you are running an ecommerce website. As you become more popular, more sellers and products are added to your website. You will realize that the number of products you need to index has grown too large to fit on the hard disk of one node. Even if it does fit, performing a linear search through all the documents on one machine is extremely slow. One index on one node does not take advantage of the distributed cluster configuration on which Elasticsearch works.

So Elasticsearch splits the documents in the index across multiple nodes in the cluster. Each split of the documents is called a shard, and each node carrying a shard holds only a subset of the documents. Suppose you have 100 products and 5 shards; each shard will hold 20 products. This sharding of data is what makes low-latency search possible in Elasticsearch: the search is conducted in parallel on multiple nodes, and the results are aggregated and returned. However, shards by themselves do not provide fault tolerance: if a node containing a shard goes down and there is no copy elsewhere, that portion of the data becomes unavailable and the cluster health degrades.
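The 100-products-over-5-shards split can be sketched like this (using a simple modulo over a sequence number for the illustration; real Elasticsearch hashes the document id, as discussed above):

```python
NUM_SHARDS = 5
products = [f"product-{i}" for i in range(100)]

# Distribute documents across shards; modulo here stands in for ES's hash.
shards = {s: [] for s in range(NUM_SHARDS)}
for i, product in enumerate(products):
    shards[i % NUM_SHARDS].append(product)

# 100 products over 5 shards -> 20 per shard, each searchable in parallel.
assert all(len(docs) == 20 for docs in shards.values())
```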

To increase fault tolerance, replicas come into the picture. By default, Elasticsearch creates a single replica of each shard. These replicas are always allocated on a node other than the one holding the primary shard. So to make the system fault tolerant, you may have to increase the number of nodes in your cluster, depending on the number of replicas of your index. The total number of shard copies the cluster must place is number_of_shards × (number_of_replicas + 1), and since a replica can never share a node with its primary, you need at least number_of_replicas + 1 nodes for every shard to be assigned. The standard practice is to have at least one replica for fault tolerance.
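These two small formulas are easy to get mixed up, so here they are spelled out:

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    # Every primary gets `replicas` copies, so the cluster must place this many shards.
    return primaries * (replicas + 1)

def min_nodes_for_green(replicas: int) -> int:
    # A replica can never share a node with its primary, so full
    # assignment needs at least one extra node per replica.
    return replicas + 1

assert total_shard_copies(5, 1) == 10  # 5 primaries + 5 replicas
assert min_nodes_for_green(1) == 2     # one replica -> at least two nodes
```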

Setting the number of shards is a static operation, meaning you have to specify it when you create the index. Any change after that would require a complete reindexing of the data, which takes time. Setting the number of replicas, however, is a dynamic operation and can be done at any time after index creation.
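Because `number_of_replicas` is a dynamic index setting, it can be changed on a live index with a request like `PUT /my-index/_settings` ("my-index" is an example name); the body for that request can be sketched as:

```python
import json

# Body for: PUT /my-index/_settings  (dynamic setting; no reindex needed)
update = {"index": {"number_of_replicas": 2}}
print(json.dumps(update))
```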

An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.

To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.

In Elasticsearch, at the top level we index documents into indices. Each index has a number of shards that internally distribute the data, and inside the shards live the Lucene segments, which are the core storage of the data. So if an index has 5 shards, the data has been distributed across those shards, and the same data does not exist in every shard.

We have Bitbucket 5.5.0 running under a dedicated user account (atlbitbcket) on Linux (CentOS 7.x). We now need Elasticsearch running on the server (I can see in the logs that it doesn't start up automatically when Bitbucket does), and when I try to go to :7992 it doesn't respond.

Locate the $BITBUCKET_HOME/shared/search/buckler/buckler.yml file, note the username/password, and update these values on the "Server Settings" page of the Bitbucket Server GUI. You can test the connection by clicking the 'Test' button on the page. Once done, Elasticsearch should be operational.
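Before fiddling with credentials, it is worth confirming the Elasticsearch port is even reachable, since "Connection refused" in the log means nothing is listening there at all. A quick sketch of such a check in Python (host and port are assumptions; 7992 is the port mentioned in this thread for Bitbucket's bundled Elasticsearch):

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Quick TCP check: is anything listening on the Elasticsearch port?
    A refused connection here matches the 'Connection refused' log entries."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(is_port_open("127.0.0.1", 7992))
```

If this prints False, the embedded Elasticsearch process itself is not running, and no GUI credential change will help until it starts.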

I'm facing a similar issue. I checked buckler.yml for the default login and copied it into the GUI, but when I click the Test button, it fails. The log indicates the connection to Elasticsearch is refused, with entries such as:

./atlassian-bitbucket.log:2018-05-16 13:52:49,492 INFO [pool-66-thread-1] c.a.b.s.s.t.DefaultElasticsearchConnectionTester Testing connection with Elasticsearch failed due to exception java.net.ConnectException: Connection refused

./atlassian-bitbucket.log:2018-05-16 13:52:49,494 ERROR [Caesium-1-3] c.a.b.i.s.c.s.p.AutomaticAuthenticationProvisioner Skipping automatic auth configuration: Elasticsearch instance is not available for connection.

I am having the same problem. Atlassian: could you provide some suggestions for debugging this? The embedded Elasticsearch is not starting for me, and I tried the recommendation above. The Elasticsearch process is not running even though the output of the init.d script says it is.
