Namenode failover during fetch


David Ongaro

Jan 23, 2017, 9:16:41 PM
to project-voldemort
Unfortunately we experience namenode failovers regularly, which leads to failed in-progress fetches because (also unfortunately) webhdfs URLs are not namenode agnostic. (We can't use HttpFS, since proxying all the data through a single server seems prohibitive.) So the question is: is there some mechanism to handle failovers, either server side or via the BnP configuration? Even something simple like trying another URL when the default one fails would be fine.

In that regard we also noticed that the cleanup after a failed BnP often doesn't work correctly. Since we don't have much disk space on our Voldemort nodes, that means the next fetch after a failed one also fails because disk space runs out. Is there a way to address that (besides cleaning up manually)? Server side we are currently using Voldemort 1.10.13; on the BnP side we are pretty much up to date (1.10.22).

Thanks

David

Arunachalam

Jan 24, 2017, 12:36:23 AM
to project-...@googlegroups.com
About deleting the failed data from all nodes, the following should help.


If you set push.rollback to true, it will delete the failed data from all nodes.
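
For example, something along these lines in the BnP job's properties file (a sketch, assuming it sits next to your other push.* settings):

# BnP job properties (sketch)
push.rollback=true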


About the namenode failover: Voldemort uses webhdfs. If I understand the WebHDFS file system correctly, it talks to the namenode to retrieve the block locations and then retrieves all the files from the datanodes. If you have frequent failovers, the build and push protocol would need to be chatty to keep track of the current namenode. That would be very tricky.

A better bet is to proxy the requests to the namenode via a proxy server and implement the redirection or active-namenode identification there. You can configure the Voldemort server to do a regex replacement of the namenode host with your proxy server.

Thanks,
Arun.


Félix GV

Jan 24, 2017, 1:33:33 AM
to project-...@googlegroups.com
Arun's right.

The commit he pointed to (included in 1.10.23) has a fix for deleting failed fetches properly. It also includes a few other important fixes related to BnP. The config he mentioned, push.rollback, is true by default, so you don't need to set it.

For the NN issue, I was also going to suggest the same. Perhaps a proxy or a DNS change during NN failover could solve the issue?

-F

David Ongaro

Jan 24, 2017, 2:53:28 PM
to project-voldemort
On Monday, January 23, 2017 at 10:33:33 PM UTC-8, Félix GV wrote:
Arun's right.

The commit he pointed to (included in 1.10.23) has a fix for deleting failed fetches properly. It also includes a few other important fixes related to BnP. The config he mentioned, push.rollback, is true by default, so you don't need to set it.

Thanks for the quick answer! That indeed seems like it could help and also explains why the cleanup sometimes works (since successful fetches on individual nodes are cleaned up). I'll try to convince our ops team to do the necessary updates.

For the NN issue, I was also going to suggest the same. Perhaps a proxy or a DNS change during NN failover could solve the issue?

Do we have to implement that ourselves or is there a ready-to-use proxy for that? As mentioned, we cannot use HttpFS, even though it supports webhdfs. The use case for HttpFS is to provide external access to an internal HDFS, so it doesn't deliver redirects to the appropriate datanodes but instead proxies all the traffic through a single service. What we would need is a proxy which delivers the request to the active namenode but just returns the unaltered redirect to the corresponding datanode.

Trying to do it via DNS doesn't sound very feasible to me, considering all the DNS caching going on at different levels.

Arunachalam

Jan 24, 2017, 4:08:34 PM
to project-...@googlegroups.com
David, you need to implement a simple proxy server.

WebHDFS, if I understand the protocol correctly, first talks to the namenode to get the location of a file's data blocks. Then it talks to each of the datanodes and retrieves the file. So for metadata it relies on the namenode, but not for the actual data (which is where it spends most of the time, downloading the files).

Of course, this is my understanding and I haven't verified it. If I were you, I would start with:
1) Fetch a file using a command line program, funneling all outgoing requests through a proxy or a TCP capture.
2) See how many times it connects to the namenode and how many times it goes directly to the datanodes.
3) Implement a proxy server which, when it sees a namenode request, forwards it to the currently active namenode. This should not be tricky.

Let us take an example and see how it would look. You have the file

webhdfs://namenode:port/path/file

In my understanding:
1) It talks to the namenode and retrieves the location of the data blocks.
2) For each data block, it downloads the data from the corresponding datanode. (Not sure if the information is retrieved and cached, or if there is some other variation of this.)

As you can see above, if you replace the namenode with your proxy server, only the metadata requests hit your proxy server. They should be few and far between. The data requests go directly to the datanodes.
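
Something along these lines (an untested sketch; host, port, and path are just placeholders) shows the two-step flow by issuing a webhdfs OPEN call with redirect-following disabled, so the namenode's 307 answer and the datanode it points to become visible:

import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRedirectProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode HTTP address and file path
        URL open = new URL("http://namenode:50070/webhdfs/v1/path/file?op=OPEN");

        HttpURLConnection conn = (HttpURLConnection) open.openConnection();
        conn.setInstanceFollowRedirects(false); // keep the namenode's redirect visible
        conn.setRequestMethod("GET");

        // The namenode only answers the metadata part; the actual read is
        // redirected to a datanode via 307 TEMPORARY_REDIRECT.
        System.out.println("Status from namenode: " + conn.getResponseCode());
        System.out.println("Data is served from:  " + conn.getHeaderField("Location"));
        conn.disconnect();
    }
}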

Please let us know if you need further information.

Thanks,
Arun.


--
You received this message because you are subscribed to the Google Groups "project-voldemort" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-voldemort+unsubscribe@googlegroups.com.

Greg Banks

Jan 24, 2017, 4:50:16 PM
to project-...@googlegroups.com
G'day,

The HDFS NameNode is a SPoF for all HDFS users and not just for Voldemort. Modern versions of Hadoop have a feature designed to mitigate this (NameNode High Availability).


I have no direct experience with it.



David Ongaro

Jan 25, 2017, 4:09:21 AM
to project-voldemort
On Tuesday, January 24, 2017 at 1:08:34 PM UTC-8, Arun Thirupathi wrote:
David, you need to implement a simple proxy server.

WebHDFS, if I understand the protocol correctly, first talks to the namenode to get the location of a file's data blocks. Then it talks to each of the datanodes and retrieves the file. So for metadata it relies on the namenode, but not for the actual data (which is where it spends most of the time, downloading the files).

The webhdfs service on the namenode itself does indeed only answer metadata queries (like LISTSTATUS) directly. For actual data it delivers a redirect (307 TEMPORARY_REDIRECT) to the appropriate webhdfs API call on the corresponding datanode. The webhdfs service of the HttpFS proxy/gateway always delivers data directly (by querying the active namenode and resolving redirects), which is useless for us since it doesn't scale to put all traffic through a single service.

So we would need something in between: a webhdfs service which doesn't run on a namenode but queries the active namenode for all metadata and leaves redirects unaltered.
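
Conceptually it could be something like this untested sketch built on the JDK's own com.sun.net.httpserver package; the hostnames and ports are placeholders, the assumption that the standby namenode answers webhdfs calls with 403 would need to be verified, and real usage would also need authentication, request bodies, and streaming instead of buffering:

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class WebHdfsFailoverProxy {
    // Placeholder namenode HTTP addresses; which one is active changes on failover.
    private static final String[] NAMENODES =
            { "http://namenode1:50070", "http://namenode2:50070" };

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(50070), 0);
        server.createContext("/webhdfs/", WebHdfsFailoverProxy::proxy);
        server.start();
    }

    private static void proxy(HttpExchange exchange) throws IOException {
        String pathAndQuery = exchange.getRequestURI().toString();
        for (String namenode : NAMENODES) {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(namenode + pathAndQuery).openConnection();
            conn.setInstanceFollowRedirects(false); // pass datanode redirects through unaltered
            conn.setRequestMethod(exchange.getRequestMethod());
            int status;
            try {
                status = conn.getResponseCode();
            } catch (IOException unreachable) {
                continue; // this namenode is down, try the next one
            }
            if (status == 403) {
                continue; // assumption: the standby namenode rejects webhdfs calls
            }
            // Copy the Location header (the 307 pointing at a datanode) and the body verbatim.
            String location = conn.getHeaderField("Location");
            if (location != null) {
                exchange.getResponseHeaders().set("Location", location);
            }
            String contentType = conn.getHeaderField("Content-Type");
            if (contentType != null) {
                exchange.getResponseHeaders().set("Content-Type", contentType);
            }
            InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            ByteArrayOutputStream buffered = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; in != null && (n = in.read(buf)) != -1; ) {
                buffered.write(buf, 0, n);
            }
            byte[] bytes = buffered.toByteArray();
            if (bytes.length == 0) {
                exchange.sendResponseHeaders(status, -1); // no response body
            } else {
                exchange.sendResponseHeaders(status, bytes.length);
                OutputStream out = exchange.getResponseBody();
                out.write(bytes);
                out.close();
            }
            exchange.close();
            return;
        }
        exchange.sendResponseHeaders(502, -1); // no namenode answered
        exchange.close();
    }
}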

Of course, this is my understanding and I haven't verified it. If I were you, I would start with:
1) Fetch a file using a command line program, funneling all outgoing requests through a proxy or a TCP capture.
2) See how many times it connects to the namenode and how many times it goes directly to the datanodes.
3) Implement a proxy server which, when it sees a namenode request, forwards it to the currently active namenode. This should not be tricky.

Let us take an example and see how it would look. You have the file

webhdfs://namenode:port/path/file

In my understanding:
1) It talks to the namenode and retrieves the location of the data blocks.
2) For each data block, it downloads the data from the corresponding datanode. (Not sure if the information is retrieved and cached, or if there is some other variation of this.)

As you can see above, if you replace the namenode with your proxy server, only the metadata requests hit your proxy server. They should be few and far between. The data requests go directly to the datanodes.

Yes, if we implement such a proxy server we would contact it directly instead of trying to inspect and rewrite outgoing TCP packets (which would seem hackish to me). It surely shouldn't be too difficult to implement such a proxy, but it keeps me wondering why there isn't such a thing already available. It seems a pretty common problem to me. Basically, webhdfs URLs which are not namenode agnostic seem pretty useless to me if namenode failover is supposed to be a Hadoop HA feature.

But it would be best if Voldemort's fetch code could figure out the active namenode itself; then we wouldn't have to set up and maintain another service.

David



Arunachalam

Jan 25, 2017, 12:59:40 PM
to project-...@googlegroups.com

I am not sure about the popularity of the webhdfs protocol. Major memory leak bugs were not fixed until much later in the life cycle of the webhdfs protocol.

Implementing dynamic namenode discovery inside Voldemort would be very tricky. Voldemort started with the philosophy that it is only for serving online traffic; the fetcher is not part of Voldemort proper, but fetchers are pluggable.

It is possible to add another file fetcher protocol which can do this redirection. But you would probably need to modify the webhdfs code to support dynamic namenodes, since webhdfs does not expose any of this. The downside of modifying the protocol is that if webhdfs makes future changes, the derived protocol needs to change as well.

About the proxy server: it would be a quick and dirty script, but it would work reasonably well. At LinkedIn we are not currently running into this problem, so it is not a burning problem for us.

Another way I could think of:
1) Voldemort supports regex replacement of the fetch URL; you can replace any host/protocol with any other host/protocol (and if it doesn't cover your case, fixing that is simple and trivial). Make a VIP which points to your current namenode, and discuss with your Hadoop operations team so that when there is a namenode failover (either manually, or via a hook if it is automated) they update the VIP to the new namenode.

Thanks,
Arun.


David Ongaro

Jan 25, 2017, 8:59:18 PM
to project-...@googlegroups.com
On Jan 25, 2017, at 9:59 AM, Arunachalam <arunac...@gmail.com> wrote:


I am not sure about the popularity of the webhdfs protocol. Major memory leak bugs were not fixed until much later in the life cycle of the webhdfs protocol.

I thought it’s the default and recommended protocol for store fetches? Are you saying I should use something different?

Implementing dynamic namenode discovery inside Voldemort would be very tricky. Voldemort started with the philosophy that it is only for serving online traffic; the fetcher is not part of Voldemort proper, but fetchers are pluggable.

Yeah, while thinking about it I also concluded that it would be the wrong layer to implement this. Failovers should be transparent to the client, otherwise it’s not really an HA feature.

That doesn’t necessarily apply to the libs used by the client though. So I took a deeper look and figured that the Hadoop 2.3.0 libraries currently used by Voldemort should indeed honor HA configurations. So it seems I just have to provide the dfs.nameservices, dfs.client.failover.proxy.provider.*, dfs.ha.namenodes.* and dfs.namenode.http-address.* properties to be able to use webhdfs URLs with logical authority names.

If this works we don’t have to consider workarounds via a proxy anymore. I’ll let you know when I find out.

Thanks,

David

Arunachalam

Jan 25, 2017, 9:08:40 PM
to project-...@googlegroups.com
It is the default protocol for store fetches, but I'm not sure what the recommended way to get data out of Hadoop is these days.

Thanks,
Arun.

Arunachalam

Jan 25, 2017, 9:09:51 PM
to project-...@googlegroups.com
Please keep us posted on what you find out; it will be useful if we have to go down that path in the future.

Thanks,
Arun.

David Ongaro

Jan 27, 2017, 9:12:24 PM
to project-...@googlegroups.com
As it turns out, Voldemort already has provisions to support the Hadoop namenode failover HA feature: just specify a local directory via the readonly.hadoop.config.path property in server.properties and put your core-site.xml and hdfs-site.xml in there. To support webhdfs you need at the very least these properties:

From core-site.xml

hadoop.security.authentication

From hdfs-site.xml

dfs.nameservices
dfs.client.failover.proxy.provider.*
dfs.ha.namenodes.*
dfs.namenode.http-address.*

But to also support hdfs:// URLs it can’t hurt to include the other dfs.namenode.* properties. If you fetch from different clusters you can combine their hdfs-site.xml files, which isn’t a problem since the relevant HA properties are parameterized by the logical name.
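
For example (a sketch only; hostnames and the local config path are placeholders, and the values obviously have to match your own cluster), the relevant pieces could look roughly like this for a logical nameservice called “had004”:

server.properties:

readonly.hadoop.config.path=/etc/voldemort/hadoop-conf

/etc/voldemort/hadoop-conf/core-site.xml:

<configuration>
  <!-- "simple" or "kerberos", depending on the cluster -->
  <property><name>hadoop.security.authentication</name><value>simple</value></property>
</configuration>

/etc/voldemort/hadoop-conf/hdfs-site.xml:

<configuration>
  <property><name>dfs.nameservices</name><value>had004</value></property>
  <property><name>dfs.ha.namenodes.had004</name><value>nn1,nn2</value></property>
  <property><name>dfs.namenode.http-address.had004.nn1</name><value>namenode1.example.com:50070</value></property>
  <property><name>dfs.namenode.http-address.had004.nn2</name><value>namenode2.example.com:50070</value></property>
  <!-- rpc-address entries are only needed for hdfs:// (non-web) fetches -->
  <property><name>dfs.namenode.rpc-address.had004.nn1</name><value>namenode1.example.com:8020</value></property>
  <property><name>dfs.namenode.rpc-address.had004.nn2</name><value>namenode2.example.com:8020</value></property>
  <property>
    <name>dfs.client.failover.proxy.provider.had004</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>

The push URL can then use the logical authority, e.g. webhdfs://had004/data/voldemort/readonlyusers7.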

Since Voldemort is using CDH 5.1.5 libs but our cluster is running CDH 5.4.5, I’m not sure if using hdfs directly would work reliably for us, but I will certainly try it out. Regarding that, I’m wondering though whether we should increase hdfs.fetcher.buffer.size when using hdfs, since its default is just 64 KB? Or doesn’t that matter much?

Thanks,

David

Felix GV

Jan 30, 2017, 6:42:56 PM
to project-...@googlegroups.com
Hi David,

If I remember correctly, in some previous testing I did, I came to the conclusion that the 'hdfs.fetcher.buffer.size' parameter does not apply to the WebHDFS client, only the regular HDFS (fat) client.

--
Felix GV
Staff Software Engineer
Data Infrastructure
LinkedIn
 
f...@linkedin.com
linkedin.com/in/felixgv

David Ongaro

Jan 30, 2017, 7:11:16 PM
to project-...@googlegroups.com
Yes, that’s also what I gathered from the source code comments. That’s why I asked explicitly “if we should increase hdfs.fetcher.buffer.size when using hdfs”.

Btw, I already tested with hdfs and it’s not working in our case: the fetch finishes “successfully” immediately after it starts, and the swap fails because the fetch directory is not even created. The somewhat redacted logs look like this:

[...]
2017-01-30 11:54:11 PM pool-1-thread-1 shell-job INFO: Push starting for cluster: tcp://dc1-voldemort06:6666
2017-01-30 11:54:12 PM pool-1-thread-1 VoldemortUtils INFO: Existing protocol = hdfs and port = -1
2017-01-30 11:54:12 PM pool-1-thread-1 VoldemortUtils INFO: New protocol = hdfs and port = 8020
2017-01-30 11:54:12 PM pool-1-thread-1 AbstractStoreClientFactory INFO: Client zone-id [-1] Attempting to get raw store [voldsys$_metadata_version_persistence] 
2017-01-30 11:54:12 PM pool-1-thread-1 net:6666 INFO: tcp://dc1-voldemort06:6666 : Initiating fetch of readonlyusers7 with dataDir: hdfs://had004/data/voldemort/readonlyusers7
2017-01-30 11:54:12 PM pool-3-thread-1 net:6666 INFO: tcp://dc1-voldemort06:6666 : Invoking fetch for Node dc1-voldemort25:6666 [id 9] for hdfs://had004:8020/data/voldemort/readonlyusers7/node-9
2017-01-30 11:54:12 PM pool-3-thread-1 AdminClient INFO: Node dc1-voldemort25:6666 [id 9] : AsyncOperationStatus(task id = 16, description = Fetch store 'readonlyusers7' v16, complete = false, status = 0 MB copied at 0 MB/sec - 0 % complete)
2017-01-30 11:54:12 PM pool-3-thread-1 AdminClient INFO: Node dc1-voldemort25:6666 [id 9] : AsyncOperationStatus(task id = 16, description = Fetch store 'readonlyusers7' v16, complete = true, status = Finished AsyncOperationStatus(task id = 16, description = Fetch store 'readonlyusers7' v16, complete = false, status = 0 MB copied at 0 MB/sec - 0 % complete))
2017-01-30 11:54:12 PM pool-3-thread-1 net:6666 INFO: tcp://dc1-voldemort06:6666 : Fetch succeeded on Node dc1-voldemort25:6666 [id 9]
2017-01-30 11:54:12 PM pool-1-thread-1 net:6666 INFO: tcp://dc1-voldemort06:6666 : Attempting swap for Node dc1-voldemort25:6666 [id 9], dir = Finished AsyncOperationStatus(task id = 16, description = Fetch store 'readonlyusers7' v16, complete = false, status = 0 MB copied at 0 MB/sec - 0 % complete)
2017-01-30 11:54:12 PM pool-1-thread-1 net:6666 ERROR: tcp://dc1-voldemort06:6666 : Error on Node dc1-voldemort25:6666 [id 9] during swap : 
voldemort.VoldemortException: Store directory 'Finished AsyncOperationStatus(task id = 16, description = Fetch store 'readonlyusers7' v16, complete = false, status = 0 MB copied at 0 MB/sec - 0 % complete)' is not a readable directory.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at voldemort.utils.ReflectUtils.callConstructor(ReflectUtils.java:116)
at voldemort.utils.ReflectUtils.callConstructor(ReflectUtils.java:103)
at voldemort.store.ErrorCodeMapper.getError(ErrorCodeMapper.java:84)
at voldemort.client.protocol.admin.AdminClient$HelperOperations.throwException(AdminClient.java:462)
at voldemort.client.protocol.admin.AdminClient$ReadOnlySpecificOperations.swapStore(AdminClient.java:4467)
at voldemort.store.readonly.swapper.AdminStoreSwapper.invokeSwap(AdminStoreSwapper.java:283)
at voldemort.store.readonly.swapper.AdminStoreSwapper.fetchAndSwapStoreData(AdminStoreSwapper.java:124)
at voldemort.store.readonly.mr.azkaban.VoldemortSwapJob.run(VoldemortSwapJob.java:159)
at voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob.runPushStore(VoldemortBuildAndPushJob.java:837)
at voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob$StorePushTask.call(VoldemortBuildAndPushJob.java:556)
at voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob$StorePushTask.call(VoldemortBuildAndPushJob.java:539)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2017-01-30 11:54:12 PM pool-1-thread-1 shell-job ERROR: Exception during push for cluster URL: tcp://dc1-voldemort06:6666. Rethrowing exception.
2017-01-30 11:54:12 PM main shell-job ERROR: Got exceptions during Build and Push:
2017-01-30 11:54:12 PM main shell-job ERROR: Exception for cluster: tcp://dc1-voldemort06:6666
java.util.concurrent.ExecutionException: voldemort.VoldemortException: Exception during swaps on nodes Node dc1-voldemort06:6666 [id 0] in zone 0 partitionList:[0, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168, 180, 192, 204, 216, 228, 240],Node dc1-voldemort07:6666 [id 1] in zone 0 partitionList:[1, 13, 25, 37, 49, 61, 73, 85, 97, 109, 121, 133, 145, 157, 169, 181, 193, 205, 217, 229, 241],Node dc1-voldemort08:6666 [id 2] in zone 0 partitionList:[2, 14, 26, 38, 50, 62, 74, 86, 98, 110, 122, 134, 146, 158, 170, 182, 194, 206, 218, 230, 242],Node dc1-voldemort09:6666 [id 3] in zone 0 partitionList:[3, 15, 27, 39, 51, 63, 75, 87, 99, 111, 123, 135, 147, 159, 171, 183, 195, 207, 219, 231, 243],Node dc1-voldemort10:6666 [id 4] in zone 0 partitionList:[4, 16, 28, 40, 52, 64, 76, 88, 100, 112, 124, 136, 148, 160, 172, 184, 196, 208, 220, 232, 244],Node dc1-voldemort21:6666 [id 5] in zone 0 partitionList:[5, 17, 29, 41, 53, 65, 77, 89, 101, 113, 125, 137, 149, 161, 173, 185, 197, 209, 221, 233, 245],Node dc1-voldemort22:6666 [id 6] in zone 0 partitionList:[6, 18, 30, 42, 54, 66, 78, 90, 102, 114, 126, 138, 150, 162, 174, 186, 198, 210, 222, 234, 246],Node dc1-voldemort23:6666 [id 7] in zone 0 partitionList:[7, 19, 31, 43, 55, 67, 79, 91, 103, 115, 127, 139, 151, 163, 175, 187, 199, 211, 223, 235, 247],Node dc1-voldemort24:6666 [id 8] in zone 0 partitionList:[8, 20, 32, 44, 56, 68, 80, 92, 104, 116, 128, 140, 152, 164, 176, 188, 200, 212, 224, 236, 248],Node dc1-voldemort25:6666 [id 9] in zone 0 partitionList:[9, 21, 33, 45, 57, 69, 81, 93, 105, 117, 129, 141, 153, 165, 177, 189, 201, 213, 225, 237, 249],Node dc1-voldemort26:6666 [id 10] in zone 0 partitionList:[10, 22, 34, 46, 58, 70, 82, 94, 106, 118, 130, 142, 154, 166, 178, 190, 202, 214, 226, 238, 250],Node dc1-voldemort27:6666 [id 11] in zone 0 partitionList:[11, 23, 35, 47, 59, 71, 83, 95, 107, 119, 131, 143, 155, 167, 179, 191, 203, 215, 227, 239, 251] failed
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob.run(VoldemortBuildAndPushJob.java:653)
at voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJobRunner.main(VoldemortBuildAndPushJobRunner.java:34)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: voldemort.VoldemortException: Exception during swaps on nodes Node dc1-voldemort06:6666 [id 0] in zone 0 partitionList:[0, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168, 180, 192, 204, 216, 228, 240],Node dc1-voldemort07:6666 [id 1] in zone 0 partitionList:[1, 13, 25, 37, 49, 61, 73, 85, 97, 109, 121, 133, 145, 157, 169, 181, 193, 205, 217, 229, 241],Node dc1-voldemort08:6666 [id 2] in zone 0 partitionList:[2, 14, 26, 38, 50, 62, 74, 86, 98, 110, 122, 134, 146, 158, 170, 182, 194, 206, 218, 230, 242],Node dc1-voldemort09:6666 [id 3] in zone 0 partitionList:[3, 15, 27, 39, 51, 63, 75, 87, 99, 111, 123, 135, 147, 159, 171, 183, 195, 207, 219, 231, 243],Node dc1-voldemort10:6666 [id 4] in zone 0 partitionList:[4, 16, 28, 40, 52, 64, 76, 88, 100, 112, 124, 136, 148, 160, 172, 184, 196, 208, 220, 232, 244],Node dc1-voldemort21:6666 [id 5] in zone 0 partitionList:[5, 17, 29, 41, 53, 65, 77, 89, 101, 113, 125, 137, 149, 161, 173, 185, 197, 209, 221, 233, 245],Node dc1-voldemort22:6666 [id 6] in zone 0 partitionList:[6, 18, 30, 42, 54, 66, 78, 90, 102, 114, 126, 138, 150, 162, 174, 186, 198, 210, 222, 234, 246],Node dc1-voldemort23:6666 [id 7] in zone 0 partitionList:[7, 19, 31, 43, 55, 67, 79, 91, 103, 115, 127, 139, 151, 163, 175, 187, 199, 211, 223, 235, 247],Node dc1-voldemort24:6666 [id 8] in zone 0 partitionList:[8, 20, 32, 44, 56, 68, 80, 92, 104, 116, 128, 140, 152, 164, 176, 188, 200, 212, 224, 236, 248],Node dc1-voldemort25:6666 [id 9] in zone 0 partitionList:[9, 21, 33, 45, 57, 69, 81, 93, 105, 117, 129, 141, 153, 165, 177, 189, 201, 213, 225, 237, 249],Node dc1-voldemort26:6666 [id 10] in zone 0 partitionList:[10, 22, 34, 46, 58, 70, 82, 94, 106, 118, 130, 142, 154, 166, 178, 190, 202, 214, 226, 238, 250],Node dc1-voldemort27:6666 [id 11] in zone 0 partitionList:[11, 23, 35, 47, 59, 71, 83, 95, 107, 119, 131, 143, 155, 167, 179, 191, 203, 215, 227, 239, 251] failed
at voldemort.store.readonly.swapper.AdminStoreSwapper.invokeSwap(AdminStoreSwapper.java:318)
at voldemort.store.readonly.swapper.AdminStoreSwapper.fetchAndSwapStoreData(AdminStoreSwapper.java:124)
at voldemort.store.readonly.mr.azkaban.VoldemortSwapJob.run(VoldemortSwapJob.java:159)
at voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob.runPushStore(VoldemortBuildAndPushJob.java:837)
at voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob$StorePushTask.call(VoldemortBuildAndPushJob.java:556)
at voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob$StorePushTask.call(VoldemortBuildAndPushJob.java:539)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2017-01-30 11:54:12 PM main shell-job INFO: Closing AdminClient with BootStrapUrls: [tcp://dc1-voldemort06:6666]
2017-01-30 11:54:12 PM main VoldemortBuildAndPushJobRunner ERROR: Exception while running BnP job!
voldemort.VoldemortException: An exception occurred during Build and Push !!
at voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob.run(VoldemortBuildAndPushJob.java:685)
at voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJobRunner.main(VoldemortBuildAndPushJobRunner.java:34)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: voldemort.VoldemortException: Got exceptions during Build and Push
at voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob.run(VoldemortBuildAndPushJob.java:681)
... 7 more


(I removed all node-specific logs besides dc1-voldemort06 and dc1-voldemort25, because they are basically the same for all nodes.) This might just be due to the different versions of the CDH libs, but it still keeps me wondering whether anyone is using hdfs for fetches at all. Maybe that code path is not really tested and not working anymore.

Thanks,

David

Arunachalam

Jan 30, 2017, 7:43:09 PM
to project-...@googlegroups.com
About the error: server-side logs generally contain more information. It seems like the server is silently skipping the download and erroring out, but not communicating the error to the BnP side. Some error with the download, or a wrong configuration?

About the fetch buffer size: it is used by the webhdfs reader. After the webhdfs file is opened, this buffer is passed in to read the content. I don't know how the webhdfs file system is implemented to read the content. This code is not easy to trace, as the fetch handler is created using reflection to avoid the compile-time dependency.



Thanks,
Arun.

David Ongaro

Jan 30, 2017, 11:58:25 PM
to project-...@googlegroups.com
On Jan 30, 2017, at 4:43 PM, Arunachalam <arunac...@gmail.com> wrote:

About the error: server-side logs generally contain more information. It seems like the server is silently skipping the download and erroring out, but not communicating the error to the BnP side. Some error with the download, or a wrong configuration?

Ok, here is an extract of the server side logs:

Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,446 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Started streaming slop pusher job at Mon Jan 30 11:56:02 UTC 2017 [voldemort-scheduler-s
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,449 voldemort.client.AbstractStoreClientFactory] INFO Client zone-id [-1] Attempting to get raw store [voldsys$_metadata_version_persistence]  [voldemort
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,455 voldemort.client.AbstractStoreClientFactory] INFO Client zone-id [-1] Attempting to get raw store [voldsys$_store_quotas]  [voldemort-scheduler-servi
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Acquiring lock to perform streaming slop pusher job  [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Acquired lock to perform streaming slop pusher job  [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 0 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 1 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 2 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 3 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 4 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 5 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 6 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 7 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 8 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 9 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 10 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 11 - Succeeded - 0 - Attempted - 0 [voldemort-scheduler-service1-t2]
Jan 30 11:56:02 dc1-voldemort07 voldemort-server.sh[5815]: [11:56:02,457 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Completed streaming slop pusher job which started at Mon Jan 30 11:56:02 UTC 2017 [volde

So no exception, just immediate SUCCESS after nothing is fetched. I don’t know what would be wrong with the configuration; it’s working fine when just changing “hdfs” to “webhdfs” and using the default port for webhdfs. If I get the chance I will try to recompile and deploy Voldemort with the CDH 5.4.5 libs to see if that works better, but if you can already say it won’t work anyway, I can save the time.

For now I’m happy that the failover with webhdfs is working, so whether or not hdfs is better is a minor issue.

About the fetch buffer size: it is used by the webhdfs reader. After the webhdfs file is opened, this buffer is passed in to read the content. I don't know how the webhdfs file system is implemented to read the content. This code is not easy to trace, as the fetch handler is created using reflection to avoid the compile-time dependency.



Thanks for the pointers, but I’m still wondering whether a 64 KB buffer size is fine or a bigger size is recommended.

Thanks,

David

Arunachalam

Jan 31, 2017, 1:41:42 AM
to project-...@googlegroups.com
The streaming slop pusher is a different job; Voldemort read-write servers rely on it for hinted handoff.

Look for the hdfs fetcher or anything with "fetcher" in the name. Also look for any errors around the fetch time.

About the buffer: 64 KB should be fine. Changing the value should not impact performance unless you lower it below 8 KB, in my opinion.

Félix GV

Jan 31, 2017, 10:12:37 AM
to project-...@googlegroups.com
It is quite possible that the (non-web) hdfs fetcher is broken. I remember that there was a conflict between Voldemort's version of protobuf and Hadoop's version. We have been running Voldemort in WebHDFS mode at LinkedIn for many years now, so the native hdfs client integration has not been thoroughly tested for a while.

It would be ideal to restore compatibility with the native hdfs client, but if you want it to "just work", I would recommend WebHDFS, especially if you are also planning on using a different Hadoop / CDH version than the one we ship with.

-F

David Ongaro

Jan 31, 2017, 10:10:29 PM
to project-...@googlegroups.com
Yes, that’s why we use a specially patched version on the Hadoop side with an updated protobuf lib: https://github.com/bitti/voldemort/commit/e04d5eadbb082e2f785aedbb4574842f2e28aa8b, even though this is probably unnecessary since these dependencies have been shadowed since 1.9.18 or so. Anyway, since it’s shadowed it also shouldn’t hurt to update this dependency, so I would be happy to provide my patch as a PR if you want. (In fact we also have a patch for updated Avro libs.)

I fail to see, though, how this would affect the server side. Even if protobuf data is (?) deserialized during fetching it shouldn’t matter, since the binary form of protobuf is backwards and forwards compatible. Anyway, we’re fine with webhdfs; I was just wondering whether anyone is actually using hdfs.

Thanks,

David

Felix GV

Feb 1, 2017, 2:23:02 PM
to project-...@googlegroups.com
Hmm... That's interesting...

The shadowing build action introduced a few versions ago was intended to make it easier to deploy the BnP job, although it might in fact contain everything the server needs to run as well. So perhaps the shadowing alone does solve the protobuf API incompatibility between Voldemort and Hadoop.

--
Felix GV
Staff Software Engineer
Data Infrastructure
LinkedIn
 
f...@linkedin.com
linkedin.com/in/felixgv




David Ongaro

Feb 1, 2017, 4:15:13 PM
to project-...@googlegroups.com
Again, if you’re talking about the Voldemort server, there shouldn’t be any API conflicts with Hadoop jobs. In fact, the whole point of a serialization format is to avoid such conflicts between remote systems. If you’re talking about voldemort-contrib: sure, there can be issues for parts which run in the Hadoop environment.

In fact I just realized that the reason we had to patch our Voldemort with updated Protobuf and Avro dependencies is that we just built and deployed the unshadowed contribJar instead of the shadowed bnpJar (even though you include other unshadowed dependencies there, like guava, which can lead to conflicts in the Hadoop environment). That seems a little confusing. Can’t the contribJar also be shadowed to avoid such gotchas?

Thanks,

David

