Hello, folks --
This post is about using Ray under gramine-sgx / gramine-direct.
I'm on gramine-sgx v1; Gramine was built from commit 4212a2525efffecbc787419ccf349299957b679f.
I had previously posted on the Gramine Discussions page (see discussion 1664). Thanks to the help from Kailun-Qin and Dmitrii, I am now past the basic config / template / memory-resource issues. I am bringing the post to this channel in the hope that someone else out there has successfully started a Ray cluster under gramine-sgx.
The problem: 'ray start' does seem to go through successfully [1], but almost immediately thereafter the bootstrap process fails with this cryptic message:
---
[P1:T1:python3.8] libos_syscall_exit_group() -> do_process_exit: First time=1
[P1:T1:python3.8] do_process_exit() -> do_thread_exit(): process 1 exited with status 0
vsgx-vm:[43] $ [P12:T187:python3.8] libos_syscall_exit_group() -> do_process_exit: First time=1
[P13:T189:python3.8] libos_syscall_exit_group() -> do_process_exit: First time=1
---
The brief function names in the above output come from debug instrumentation I added to figure out what is going on. The Ray cluster never comes up successfully under gramine-sgx or gramine-direct. In [2] below I have included snippets from Ray's dashboard_agent.log showing the diagnostic messages leading up to the failure; they give more detail about the retry attempts and show that the node eventually crashes.
The Gramine-relevant item in that log is this brief message:

FileNotFoundError: [Errno 2] No such file or directory: '/proc/net/dev'

Question to Gramine devs: Is /proc/net/dev supported under gramine-sgx or gramine-direct? I did go through the online docs, but can't recall whether this pseudo-file is supported.
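To help triage, here is a quick probe script of my own (hypothetical, not part of Ray) that exercises the same psutil call Ray's reporter_agent makes; running it under gramine-direct / gramine-sgx should show which /proc entries the LibOS provides (assuming psutil is importable inside the enclave):
---
# probe_proc.py -- my own quick probe, not part of Ray.
import os

# Check a few procfs entries that Ray / psutil touch.
for path in ("/proc/net/dev", "/proc/stat", "/proc/meminfo"):
    print(path, "->", "present" if os.path.exists(path) else "MISSING")

import psutil

# Same call as reporter_agent.py's _get_network_stats(); raises
# FileNotFoundError if /proc/net/dev is absent under the LibOS.
print(psutil.net_io_counters(pernic=True))
---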
On the same SGX-enabled box, I am able to bring up the Ray cluster (ray start, ray status) directly, without gramine-sgx.

Question: has anyone in this group tried integrating Ray under gramine-sgx or gramine-direct?
Question to gramine-sgx devs: Would it be possible for someone on your dev/QA team to try this integration and let me know whether you can get 'ray start' to work under gramine-sgx? It should be a fairly simple install of the Ray software on some Linux box.
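For reference, the smallest repro I have in mind (assuming a stock 'pip install ray') is just the in-process equivalent of 'ray start':
---
# minimal_repro.py -- sketch only; assumes a stock `pip install ray`.
import ray

ray.init()                      # boots a local, single-node Ray cluster in-process
print(ray.cluster_resources())  # prints CPU/memory if the cluster came up
ray.shutdown()
---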
---
Digging further on the Ray side, I found these two threads. Some of this may be useful to Gramine devs in triaging / troubleshooting the problems I am seeing; the signature of my problem is exactly the same as the issues reported here:
Ray Issue-29412: [Ray Core] Ray agent getting killed unexpectedly
which led to a tentative code fix in Ray's Python libraries:
Ray PR-29540: [Agent] Make agent shutdown more informative and graceful
The gist of these two threads: there appears to have been an issue with the Python library call psutil.Process.parent() mis-reporting that the parent process is down, causing cascading shutdowns on the Ray side.
Question: Can Gramine devs speculate whether such process-watchdog issues, induced by Python library hiccups under Gramine, could lead to 'ray start' aborting entirely?
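For illustration, here is my simplified paraphrase of the parent check that process_watcher.py effectively performs (not Ray's exact code); if psutil returns None here under Gramine while the parent is actually alive, the agent would start counting the raylet as dead:
---
# parent_probe.py -- simplified paraphrase of the process_watcher check,
# not Ray's exact code. Launch it as a child process under gramine-direct.
import os
import psutil

me = psutil.Process(os.getpid())
parent = me.parent()  # None if PPID plumbing is off under the LibOS
print("getppid():", os.getppid(), "psutil parent:", parent)
---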
--
Question to Gramine/CI owners: I was hoping someone would have tested this integration on your end and put up a nice tutorial on this page (several other interesting integrations have been tried out). Given how popular Ray / ML / Python workloads have become, I would have thought this integration would get some push from the Gramine CI/QA folks, along with a helpful tutorial on how to get it working.
Thanks in advance, and thanks for reading this far. Any help / tips will be gratefully accepted.
--AdityA
[1]
Messages seen when booting 'ray start' under gramine-direct:
[P16:T257:python3.8] libos_init():511: Process ID=16: LibOS initialized
[P16:T257:python3.8] libos_syscall_exit_group() -> do_process_exit: First time=1
[P16:T257:python3.8] do_process_exit() -> do_thread_exit(): process 16 exited with status 255
2023-12-12 02:09:15,068 SUCC scripts.py:781 -- --------------------
2023-12-12 02:09:15,069 SUCC scripts.py:782 -- Ray runtime started.
2023-12-12 02:09:15,069 SUCC scripts.py:783 -- --------------------
2023-12-12 02:09:15,069 INFO scripts.py:785 -- Next steps
2023-12-12 02:09:15,069 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2023-12-12 02:09:15,070 INFO scripts.py:791 -- ray start --address='10.208.196.155:6379'
2023-12-12 02:09:15,070 INFO scripts.py:800 -- To connect to this Ray cluster:
2023-12-12 02:09:15,071 INFO scripts.py:802 -- import ray
2023-12-12 02:09:15,071 INFO scripts.py:803 -- ray.init()
2023-12-12 02:09:15,071 INFO scripts.py:834 -- To terminate the Ray runtime, run
2023-12-12 02:09:15,071 INFO scripts.py:835 -- ray stop
2023-12-12 02:09:15,071 INFO scripts.py:838 -- To view the status of the cluster, use
2023-12-12 02:09:15,071 INFO scripts.py:839 -- ray status

The above lines indicate that 'ray start' did go through cleanly ... albeit very briefly.
[2] Snippets from Ray's dashboard_agent.log showing the issue with Python's psutil.Process.parent() not being able to 'locate' the parent process.
---
2023-12-11 22:22:20,007 INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <StaticResource /logs -> PosixPath('/tmp/ray/session_2023-12-11_22-22-10_412945_1/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x2d4742bbf670>>
2023-12-11 22:22:20,007 INFO http_server_agent.py:79 -- Registered 30 routes.
2023-12-11 22:22:20,012 INFO process_watcher.py:44 -- raylet pid is 15
2023-12-11 22:22:20,012 WARNING process_watcher.py:89 -- Raylet is considered dead 1 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:20,016 INFO event_agent.py:56 -- Report events to 10.208.196.155:45899
2023-12-11 22:22:20,017 INFO event_utils.py:132 -- Monitor events logs modified after 1702331539.842946 on /tmp/ray/session_2023-12-11_22-22-10_412945_1/logs/events, the source types are all.
2023-12-11 22:22:20,019 ERROR reporter_agent.py:1149 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1132, in _perform_iteration
    stats = self._get_all_stats()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 630, in _get_all_stats
    network_stats = self._get_network_stats()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 434, in _get_network_stats
    v for k, v in psutil.net_io_counters(pernic=True).items() if k[0] == "e"
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/thirdparty_files/psutil/__init__.py", line 2122, in net_io_counters
    rawdict = _psplatform.net_io_counters()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1023, in net_io_counters
    with open_text("%s/net/dev" % get_procfs_path()) as f:
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/thirdparty_files/psutil/_common.py", line 786, in open_text
    fobj = open(fname, buffering=FILE_READ_BUFFER_SIZE,
FileNotFoundError: [Errno 2] No such file or directory: '/proc/net/dev'
2023-12-11 22:22:20,414 WARNING process_watcher.py:89 -- Raylet is considered dead 2 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:20,816 WARNING process_watcher.py:89 -- Raylet is considered dead 3 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:21,217 WARNING process_watcher.py:89 -- Raylet is considered dead 4 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:21,618 WARNING process_watcher.py:89 -- Raylet is considered dead 5 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:21,618 INFO agent.py:227 -- Terminated Raylet: ip=10.208.196.155, node_id=5ea8e24111ca76d5135365a6bea0e7da046378d411339af896a80d4f.
2023-12-11 22:22:21,619 ERROR process_watcher.py:142 -- Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
[state-dump] Event stats:
[state-dump]     PeriodicalRunner.RunFnPeriodically - 11 total (2 active, 1 running), CPU time: mean = 599.819 us, total = 6.598 ms
[state-dump]     NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]     NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]     NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]     NodeManager.GCTaskFailureReason - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
---