Running Ray under Gramine-sgx or gramine-direct: Bootstrap cluster fails ...

Aditya Gurajada

Dec 12, 2023, 3:10:08 AM
to Gramine Users
Hello, folks --

This post is about using Ray under gramine-sgx / gramine-direct.

I'm on gramine-SGX-V1:

Gramine was built from commit: 4212a2525efffecbc787419ccf349299957b679f

I had previously posted on the Gramine Discussions page (see discussion 1664). Thanks to the help received from Kailun-Qin and Dmitrii, I am now past the basic config / template / memory resources issues. I am extending the post to this channel in the hope that someone else out there has successfully started a Ray cluster under gramine-sgx.

The problem is: 'ray start' does seem to go through successfully [1], but immediately thereafter the bootstrap process fails with this cryptic message:

---
[P1:T1:python3.8] libos_syscall_exit_group() -> do_process_exit: First time=1
[P1:T1:python3.8] do_process_exit() -> do_thread_exit(): process 1 exited with status 0
vsgx-vm:[43] $ [P12:T187:python3.8] libos_syscall_exit_group() -> do_process_exit: First time=1
[P13:T189:python3.8] libos_syscall_exit_group() -> do_process_exit: First time=1
---

The brief function names in the above output are from debug instrumentation I added to figure out what's going on.

The Ray cluster never seems to come up successfully under gramine-sgx or gramine-direct. In [2] below I have shown snippets of the Ray dashboard_agent.log file showing more diagnostic messages leading to the failure. These messages simply give more info about the retry attempts, and show that eventually the node crashes.

The thing relevant to Gramine in that log is this brief message: FileNotFoundError: [Errno 2] No such file or directory: '/proc/net/dev'

Question to Gramine-devs: Do you know if /proc/net/dev is supported under gramine-sgx or gramine-direct? I did go through the online docs, but can't recall if this file is supported.
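
One quick way to answer this kind of question empirically is to run a small probe script inside the same Gramine environment and see which pseudo-files can actually be opened. The following is only a minimal sketch: the function name probe_pseudo_files and the list of paths other than /proc/net/dev are illustrative assumptions, not something taken from Gramine or Ray.

```python
# Probe which /proc pseudo-files can be opened in the current environment.
# Only /proc/net/dev comes from the traceback in this post; the other
# paths are illustrative picks (assumptions).
CANDIDATES = [
    "/proc/net/dev",
    "/proc/meminfo",
    "/proc/cpuinfo",
    "/proc/self/stat",
]


def probe_pseudo_files(paths):
    """Return {path: bool}, True if the path could be opened for reading."""
    result = {}
    for path in paths:
        try:
            with open(path, "rb"):
                result[path] = True
        except OSError:
            # Covers FileNotFoundError, PermissionError, and similar failures.
            result[path] = False
    return result


if __name__ == "__main__":
    for path, ok in probe_pseudo_files(CANDIDATES).items():
        print(f"{'present' if ok else 'MISSING':8} {path}")
```

Running the same script natively and then under gramine-direct / gramine-sgx (with a suitable manifest) should make any difference in the emulated pseudo-files visible.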

On the same SGX-enabled box, I am able to bring up Ray cluster (ray start, ray status) directly without using gramine-sgx.

Question is: has anyone in this group tried this exercise of integrating Ray under Gramine-sgx or gramine-direct?

Question to Gramine-sgx devs: Would it be possible for someone on your dev/QA team to try this integration out and let me know whether you are able to get 'ray start' to work under gramine-sgx? It should be a fairly simple install of the Ray software on some Linux box.

---
Digging further on the Ray-side, I found these two threads. Some of this may be useful to Gramine-devs to help triage / troubleshoot the problems I am seeing. The signature of the problem I am seeing is exactly the same as the issues reported here:

Ray Issue-29412: [Ray Core] Ray agent getting killed unexpectedly

which led to a tentative code fix in the Ray Python libraries:
Ray PR-29540: [Agent] Make agent shutdown more informative and graceful

The point of these two threads is that there seems to have been some issue with the Python library call psutil.Process.parent() mis-reporting that the parent is down, causing cascading shutdowns on the Ray side.
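
To make the suspected failure mode concrete, here is a simplified sketch of a parent-liveness watcher of the kind the log messages in [2] suggest. This is NOT Ray's actual process_watcher.py code. The function name, the injected get_parent_pid callable, and the reset-on-success behaviour are assumptions for illustration; only the limit of 5 is taken from the log text ("If it reaches to 5, the agent will kill itself").

```python
from typing import Callable, Optional


def count_polls_until_giving_up(
    get_parent_pid: Callable[[], Optional[int]],
    max_misses: int = 5,  # limit taken from the log message in [2]
) -> int:
    """Poll get_parent_pid() until it returns None max_misses times in a row.

    get_parent_pid stands in for a lookup such as psutil.Process.parent(),
    which can return None when no parent is known. Returns the number of
    polls performed before the watcher gives up. The loop does not end if
    the parent keeps being found, which is the normal case for a watcher.
    """
    misses = 0
    polls = 0
    while misses < max_misses:
        polls += 1
        if get_parent_pid() is None:
            misses += 1  # parent lookup failed: count toward the limit
        else:
            misses = 0   # parent seen again: reset (assumed behaviour)
    return polls
```

If the parent lookup returns None on every poll, as "Parent: None, parent_gone: True" in the log suggests, such a watcher gives up after exactly five polls even though the parent may in fact still be running.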

Question is: Can Gramine-devs speculate whether such issues with node patrolling on the Gramine side, induced by some Python library hiccups, could lead to 'ray start' totally aborting?

--
Question to Gramine/CI owners: I was hoping someone would have tested out this integration on your end and put up a nice tutorial on this page. (Several other interesting integrations have been tried out.)

Given that Ray / ML / Python workloads are becoming so popular, I would have thought there would be some push from Gramine CI/QA folks to try out this integration and give us a helpful tutorial on how to get this to work.

Thanks in advance, and thanks for reading this far. Any help / tips will be most graciously accepted.

--AdityA>

[1] Messages seen booting up 'ray start' from under gramine-direct:

[P16:T257:python3.8] libos_init():511: Process ID=16: LibOS initialized
[P16:T257:python3.8] libos_syscall_exit_group() -> do_process_exit: First time=1
[P16:T257:python3.8] do_process_exit() -> do_thread_exit(): process 16 exited with status 255
2023-12-12 02:09:15,068 SUCC scripts.py:781 -- --------------------
2023-12-12 02:09:15,069 SUCC scripts.py:782 -- Ray runtime started.
2023-12-12 02:09:15,069 SUCC scripts.py:783 -- --------------------
2023-12-12 02:09:15,069 INFO scripts.py:785 -- Next steps
2023-12-12 02:09:15,069 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2023-12-12 02:09:15,070 INFO scripts.py:791 --   ray start --address='10.208.196.155:6379'
2023-12-12 02:09:15,070 INFO scripts.py:800 -- To connect to this Ray cluster:
2023-12-12 02:09:15,071 INFO scripts.py:802 -- import ray
2023-12-12 02:09:15,071 INFO scripts.py:803 -- ray.init()
2023-12-12 02:09:15,071 INFO scripts.py:834 -- To terminate the Ray runtime, run
2023-12-12 02:09:15,071 INFO scripts.py:835 --   ray stop
2023-12-12 02:09:15,071 INFO scripts.py:838 -- To view the status of the cluster, use
2023-12-12 02:09:15,071 INFO scripts.py:839 --   ray status


The above lines indicate that 'ray start' did go through cleanly ... albeit very briefly.

[2] Snippets of messages from Ray's dashboard_agent.log showing this issue with Python psutil.Process.parent() not being able to 'locate' a parent node.


---
2023-12-11 22:22:20,007 INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2023-12-11_22-22-10_412945_1/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x2d4742bbf670>>
2023-12-11 22:22:20,007 INFO http_server_agent.py:79 -- Registered 30 routes.
2023-12-11 22:22:20,012 INFO process_watcher.py:44 -- raylet pid is 15
2023-12-11 22:22:20,012 WARNING process_watcher.py:89 -- Raylet is considered dead 1 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:20,016 INFO event_agent.py:56 -- Report events to 10.208.196.155:45899
2023-12-11 22:22:20,017 INFO event_utils.py:132 -- Monitor events logs modified after 1702331539.842946 on /tmp/ray/session_2023-12-11_22-22-10_412945_1/logs/events, the source types are all.
2023-12-11 22:22:20,019 ERROR reporter_agent.py:1149 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1132, in _perform_iteration
    stats = self._get_all_stats()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 630, in _get_all_stats
    network_stats = self._get_network_stats()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 434, in _get_network_stats
    v for k, v in psutil.net_io_counters(pernic=True).items() if k[0] == "e"
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/thirdparty_files/psutil/__init__.py", line 2122, in net_io_counters
    rawdict = _psplatform.net_io_counters()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1023, in net_io_counters
    with open_text("%s/net/dev" % get_procfs_path()) as f:
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/thirdparty_files/psutil/_common.py", line 786, in open_text
    fobj = open(fname, buffering=FILE_READ_BUFFER_SIZE,
FileNotFoundError: [Errno 2] No such file or directory: '/proc/net/dev'
2023-12-11 22:22:20,414 WARNING process_watcher.py:89 -- Raylet is considered dead 2 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:20,816 WARNING process_watcher.py:89 -- Raylet is considered dead 3 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:21,217 WARNING process_watcher.py:89 -- Raylet is considered dead 4 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:21,618 WARNING process_watcher.py:89 -- Raylet is considered dead 5 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:21,618 INFO agent.py:227 -- Terminated Raylet: ip=10.208.196.155, node_id=5ea8e24111ca76d5135365a6bea0e7da046378d411339af896a80d4f.
2023-12-11 22:22:21,619 ERROR process_watcher.py:142 -- Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] Event stats:
    [state-dump]     PeriodicalRunner.RunFnPeriodically - 11 total (2 active, 1 running), CPU time: mean = 599.819 us, total = 6.598 ms
    [state-dump]     NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]     NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]     NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]     NodeManager.GCTaskFailureReason - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
---

Dmitrii Kuvaiskii

Dec 12, 2023, 2:46:10 PM
to Aditya Gurajada, Gramine Users
Dear Aditya,

Could you copy-paste this email to the Gramine Discussions
(https://github.com/gramineproject/gramine/discussions)? Otherwise
it's very hard to follow this.

One quick answer: Gramine does *not* emulate `/proc/net/dev`. So your
Ray app won't detect this file. For the list of emulated files, see
the list here: https://gramine.readthedocs.io/en/stable/devel/features.html#list-of-pseudo-files

--
Yours sincerely,
Dmitrii Kuvaiskii

Aditya Gurajada

Dec 12, 2023, 3:05:10 PM
to Gramine Users
Dear Dmitrii, 

Your request was timely. Overnight, I myself was thinking of opening a new discussion topic to air out these issues.

I have re-posted the above thread as a new Gramine Discussions topic, no. 1680.

Looking forward to any tips / guidance you or other Gramine-devs can provide to help me get unblocked.

Thanks!
--AdityA>