Can we maintain a per P epoll fd to make netpoll scalable?


Cholerae Hu

May 9, 2020, 11:32:13 AM
to golang-nuts
I'm maintaining a highly loaded proxy-like service, which serves a huge number of small RPC requests every day. Yesterday I profiled it and found that runtime.netpoll took 8.5% CPU (runtime.mcall took 20% CPU).

There is only one global epoll fd in the runtime, but every P calls netpoll. Inside the kernel, an fd list, an rbtree, and a lock are associated with each epoll fd, so I guess concurrent netpoll calls from many Ps may result in lock contention and poor cache locality.

Can we apply the same optimization that was done for timers to the netpoller: make the epoll fd per P, let each P poll its own epoll fd first, and have it steal ready fds from other Ps when it has no work to do?
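A user-space toy of what I mean (my sketch only, with hypothetical names; the real change would live inside the runtime's netpoll and findrunnable paths): each worker ("P") keeps its own queue of ready fds and falls back to stealing from its peers when its own queue is empty.

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a toy stand-in for a P: it owns a queue of fds that its
// own (hypothetical) per-P poller reported as ready.
type worker struct {
	mu    sync.Mutex
	ready []int
}

func (w *worker) push(fd int) {
	w.mu.Lock()
	w.ready = append(w.ready, fd)
	w.mu.Unlock()
}

func (w *worker) pop() (int, bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if len(w.ready) == 0 {
		return 0, false
	}
	fd := w.ready[len(w.ready)-1]
	w.ready = w.ready[:len(w.ready)-1]
	return fd, true
}

// next returns the next ready fd for worker i: its own queue first,
// then a steal from any other worker, mirroring how the scheduler
// steals runnable goroutines and timers today.
func next(workers []*worker, i int) (int, bool) {
	if fd, ok := workers[i].pop(); ok {
		return fd, true
	}
	for j := range workers {
		if j == i {
			continue
		}
		if fd, ok := workers[j].pop(); ok {
			return fd, true
		}
	}
	return 0, false
}

func main() {
	ws := []*worker{{}, {}}
	ws[1].push(42) // only worker 1's poller saw a ready fd
	fd, ok := next(ws, 0)
	fmt.Println(fd, ok) // worker 0 steals fd 42 from worker 1
}
```

The hard part Ian mentions below is exactly the stealing step: unlike goroutines, an fd's readiness lives in a kernel-side epoll instance, so "stealing" would mean cross-P epoll_wait calls or migrating registrations.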

Ian Lance Taylor

May 9, 2020, 6:54:30 PM
to Cholerae Hu, golang-nuts
If epoll contention really is a problem, then I think it would be
simpler to avoid contention in the runtime package by calling netpoll
less often. While we could theoretically have a different epoll FD
per P, I think the stealing requirements would be painful to
implement.

In any case the first step is to prove whether kernel contention on
the epoll descriptor is a problem.

Ian

Cholerae Hu

May 11, 2020, 2:40:24 AM
to golang-nuts
Thanks for responding. I will dig deeper into the kernel contention later.

On Sunday, May 10, 2020 at 6:54:30 AM UTC+8, Ian Lance Taylor wrote:

Cholerae Hu

May 12, 2020, 2:25:38 AM
to golang-nuts
Hi Ian. I replaced the Linux kernel of my online service with one built with CONFIG_LOCK_STAT enabled, and found that there really is a lot of kernel contention on the epoll locks compared to other kernel locks. The lock stat screenshot is:

[screenshot attachment: 截屏2020-05-12 14.22.27.png]




Cholerae Hu

May 12, 2020, 2:51:17 AM
to golang-nuts
I think there are a few ways to reduce this lock contention:
1. Simply call netpoll less often.
2. Make the netpoller per-P.
3. Add a group of netpoller Ms, each with its own netpoller. They would be specialized for netpolling, and would run a readied g directly, until that g is preempted, to achieve high cache locality. This design is used in brpc (https://github.com/apache/incubator-brpc), a C++ RPC framework that uses M:N user-space threads.

On Tuesday, May 12, 2020 at 2:26 PM, Cholerae Hu <chole...@gmail.com> wrote:
--
You received this message because you are subscribed to a topic in the Google Groups "golang-nuts" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/golang-nuts/mXrkXxNVZmE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/a51e80d3-e080-4e2d-b523-682d1c4ad781%40googlegroups.com.

Ian Lance Taylor

May 12, 2020, 3:18:28 PM
to Cholerae Hu, golang-nuts
On Mon, May 11, 2020 at 11:51 PM Cholerae Hu <chole...@gmail.com> wrote:

Let's definitely start with the simple approach, which is option 1.

We'll need realistic benchmarks to see if this actually makes any difference in practice.

Ian

