"Unable to start RBE reproxy" when building Android 13, with error log "Unavailable desc"

Faqiang Zhu

Feb 21, 2023, 11:45:06 PM
to Android Building
I'm trying to build Android 13 with RBE.

As suggested in the post "Build AOSP 11 with Google RBE service", I am trying BuildGrid, one of the alternative options listed at https://bazel.build/community/remote-execution-services.

I set up the BuildGrid server following its documentation. With Bazel as the client building the C++ tutorial examples, build actions are distributed from the client machine to the BuildGrid server as expected. I then tried to build Android 13 with RBE against this BuildGrid server, using the steps below:
  • Modify the file "build/soong/docs/rbe.json" as below:
    diff --git a/docs/rbe.json b/docs/rbe.json
    index f6ff10772..3f4c4ccf3 100644
    --- a/docs/rbe.json
    +++ b/docs/rbe.json
    @@ -10,8 +10,8 @@
             "RBE_R8": "1",
             "RBE_D8": "1",

    -        "RBE_instance": "[replace with your RBE instance]",
    -        "RBE_service": "[replace with your RBE service endpoint]",
    +        "RBE_instance": "main",
    +        "RBE_service": "grpc://10.193.102.33:50051",

             "RBE_DIR": "prebuilts/remoteexecution-client/live",


  • Create the credential file "$HOME/.config/gcloud/application_default_credentials.json" with the command below:
    gcloud auth application-default login --no-launch-browser --disable-quota-project

  • Start the build with the command below:
    ANDROID_BUILD_ENVIRONMENT_CONFIG=rbe ANDROID_BUILD_ENVIRONMENT_CONFIG_DIR=build/soong/docs make


But I got the failure below, and I could not find any related source code:

    18:58:52 Unable to start RBE reproxy
    FAILED: RBE bootstrap failed with: exit status 10
    E0221 18:58:52.597734 1344945 bootstrap.go:96] Unable to start reproxy: "E0221 18:58:50.166111 1344959 main.go:205] Failed to initialize remote-execution client: rpc error: code = Unavailable desc = rpc error: code = Unavailable desc = retry budget exhausted (6 attempts): all SubConns are in TransientFailure, authentication type (identity) used=\"application default credentials\"\n"

    Try restarting the build after running the following command:
        gcloud auth application-default login --no-launch-browser --disable-quota-project



Has anyone tried the alternative RE service options listed at https://bazel.build/community/remote-execution-services?
Which RE service did you choose?
Has anyone encountered the same or a similar issue?
Are there any fixes for the issue I encountered?

Best Regards,
Zhu Faqiang.

李力召

Feb 22, 2023, 1:28:59 PM
to Android Building
Hi Faqiang,

Reproxy calls the RBE service over HTTPS with credentials. You can disable this with the environment variable: "export RBE_service_no_security=true"
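
A minimal sketch of how this could be combined with the build command from the first post, assuming the BuildGrid endpoint is plain gRPC without TLS or authentication. Only variables already mentioned in this thread are used; the final make is left as a comment because it has to be run from the root of the AOSP tree.

```shell
# Assumption: the RE service listens without TLS, so reproxy is told
# not to use a secure connection.
export RBE_service_no_security=true

# Same build configuration as in the first post of this thread.
export ANDROID_BUILD_ENVIRONMENT_CONFIG=rbe
export ANDROID_BUILD_ENVIRONMENT_CONFIG_DIR=build/soong/docs

# Then, from the root of the AOSP tree:
#   make
```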

Faqiang Zhu

Mar 3, 2023, 2:08:04 PM
to Android Building
Hi 力召,

Thank you, reproxy can now be started.

Now I have another issue:
    The client distributes actions to the service, the service schedules the actions onto the workers, and the workers execute them.
    The Android build system restricts the use of host-installed tools; many tools under the "prebuilts/" directory, such as clang++, are used instead. How can a worker get the same environment as the local build?

    I set "RBE_CXX_EXEC_STRATEGY" to "remote" and then tried to build a test module with RBE. The log shows that it blocks on "clang++ test_source_file.cpp". On the service end I can see incoming input requests, but the worker seems to do nothing.
    I guess it's related to the clang++ tool. I installed clang++ on the worker machine, but Android should use its own.

Best Regards,
Zhu Faqiang.

李力召

Apr 4, 2023, 10:23:27 AM
to Android Building
Sorry for the late reply.

    The client distributes actions to the service, the service schedules the actions onto the workers, and the workers execute them.
    The Android build system restricts the use of host-installed tools; many tools under the "prebuilts/" directory, such as clang++, are used instead. How can a worker get the same environment as the local build?

No, reproxy in AOSP sends all the dependencies to the CAS as normal inputs, so the worker can be very lightweight.
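
To illustrate what "sent to the CAS as normal input" means: in the Remote Execution API, each input file is identified by a digest made of its content hash and its size in bytes, so a prebuilt tool such as clang++ is uploaded like any other file. The sketch below computes such a digest. The function name cas_digest is hypothetical, SHA-256 is assumed as the digest function, and the "hash/size" output format is only for display.

```shell
# Hypothetical helper: print "<sha256>/<size_bytes>" for one file,
# mirroring the (hash, size) pair that identifies a blob in a CAS.
cas_digest() {
  cas_hash="$(sha256sum "$1" | cut -d' ' -f1)"
  cas_size="$(wc -c < "$1" | tr -d ' ')"
  printf '%s/%s\n' "$cas_hash" "$cas_size"
}

# Example usage, from the root of the AOSP tree:
#   cas_digest prebuilts/clang/host/linux-x86/clang-r450784d/bin/clang++
```

Two files with identical content get the same digest, so a blob that is already in the CAS should not need to be uploaded again on later builds.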

    I set "RBE_CXX_EXEC_STRATEGY" to "remote" and then tried to build a test module with RBE. The log shows that it blocks on "clang++ test_source_file.cpp". On the service end I can see incoming input requests, but the worker seems to do nothing.
    I guess it's related to the clang++ tool. I installed clang++ on the worker machine, but Android should use its own.

Maybe the worker does not completely implement the remote API.

Faqiang Zhu

Apr 6, 2023, 10:26:16 AM
to Android Building
Oh, thank you, 力召.

Last time I didn't wait long enough when the log blocked. There were also some issues on the service end, which led me to misunderstand what was happening.

Now there are failure logs like the one below:


    reclient[1a42b3d9-0a8c-4c83-860f-7fa398daf641]: RemoteErrorResultStatus: failed to upload /home/faqiang/android13/prebuilts/clang/host/linux-x86/clang-r450784d/bin/ld64.lld: retry budget exhausted (6 attempts): context deadline exceeded


I can see that the CAS storage size keeps rising while dependencies are being transferred from the client end to the server end, and then the log above appears.

What does this usually mean? Does it mean the dependencies fail to be sent to the server end in time (i.e. a network speed issue)? Can I raise the 6 attempts to a higher number to relieve this, and if so, how?

Best Regards,
Zhu Faqiang.

Faqiang Zhu

Apr 7, 2023, 12:52:04 AM
to Android Building
Hi 力召,

It may really be related to the network speed, but I'm not sure.

I switched to two high-performance machines, one as the client end and one as the server end. This time the CAS storage size grows faster. The build still fails in the end, but now the failure log comes from the server end.

Best Regards,
Zhu Faqiang.

Faqiang Zhu

Apr 7, 2023, 9:45:19 AM
to Android Building
Hi 力召,

After switching to two high-performance machines, changing some configs on the service end, and modifying the code under "build/make", I can now build a test module containing a cpp file on the remote server.

The code I modified under "build/make" is shown below. I don't understand the purpose of the "container-image=docker" platform property. What is it used for?

    diff --git a/core/rbe.mk b/core/rbe.mk
    index fd3427abf4..2baff7302c 100644
    --- a/core/rbe.mk
    +++ b/core/rbe.mk
    @@ -64,7 +64,7 @@ ifneq ($(filter-out false,$(USE_RBE)),)
         d8_exec_strategy := remote_local_fallback
       endif

    -  platform := container-image=docker://gcr.io/androidbuild-re-dockerimage/android-build-remoteexec-image@sha256:582efb38f0c229ea39952fff9e132ccbe183e14869b39888010dacf56b360d62
    +  platform :=
       cxx_platform := $(platform),Pool=$(cxx_pool)
       java_r8_d8_platform := $(platform),Pool=$(java_pool)



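It seems that two platform properties are set in requests:
1. Pool=default
2. container-image=docker://gcr.io/androidbuild-re-dockerimage/android-build-remoteexec-image@sha256:582efb38f0c229ea39952fff9e132ccbe183e14869b39888010dacf56b360d62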

When starting the worker on the server end, I can use "--platform Pool=default", but with "--platform Pool=default --platform container-image=docker://gcr.io/androidbuild-re-dockerimage/android-build-remoteexec-image@sha256:582efb38f0c229ea39952fff9e132ccbe183e14869b39888010dacf56b360d62", something went wrong. I also don't know what purpose this property would serve when set on my worker.


Best Regards,
Zhu Faqiang.

李力召

Apr 13, 2023, 10:44:10 AM
to Android Building

    reclient[1a42b3d9-0a8c-4c83-860f-7fa398daf641]: RemoteErrorResultStatus: failed to upload /home/faqiang/android13/prebuilts/clang/host/linux-x86/clang-r450784d/bin/ld64.lld: retry budget exhausted (6 attempts): context deadline exceeded


    I can see that the CAS storage size keeps rising while dependencies are being transferred from the client end to the server end, and then the log above appears.

    What does this usually mean? Does it mean the dependencies fail to be sent to the server end in time (i.e. a network speed issue)? Can I raise the 6 attempts to a higher number to relieve this, and if so, how?

The CAS needs high-performance storage. Maybe ld64.lld is too large for the HTTP transfer, or the upload timed out, since reproxy uploads files over a very large number of concurrent connections.
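
One way to check the "too large" theory is to look at how big the toolchain binaries actually are, since each of them has to reach the CAS before the first remote action can run. A minimal sketch, assuming GNU find (for -printf); the function name largest_files is hypothetical, and the directory is the one from the log above.

```shell
# Hypothetical helper: list the N largest files under a directory,
# as "<size_bytes> <path>", largest first. -L follows symlinks, since
# toolchain binaries may be symlinks to one large executable.
largest_files() {
  find -L "$1" -type f -printf '%s %p\n' | sort -rn | head -n "${2:-10}"
}

# Example usage, from the root of the AOSP tree:
#   largest_files prebuilts/clang/host/linux-x86/clang-r450784d/bin 5
```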


    The code I modified under "build/make" is shown below. I don't understand the purpose of the "container-image=docker" platform property. What is it used for?

This is proto-specific. It means "I want this command to be executed in a Docker container". If you don't care about that, you can ignore it.
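
As a rough illustration of how these properties appear on a request: rbe.mk joins them into one comma-separated string of key=value pairs, and a worker is generally expected to pick up only actions whose properties it advertises (how strictly this is matched depends on the RE service). The sketch below splits such a string into one property per line. The name list_platform_properties is hypothetical, and the pool name "default" is taken from the earlier post.

```shell
# Hypothetical helper: split a comma-separated platform string into
# one key=value property per line, dropping empty entries.
list_platform_properties() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -v '^$'
}

# With "platform :=" left empty as in the rbe.mk change earlier in this
# thread, the C++ platform string becomes ",Pool=default", which
# leaves a single property:
list_platform_properties ",Pool=default"
# prints: Pool=default
```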