Increasing shared memory


Michał Daniluk

Apr 28, 2021, 7:57:57 AM
to recsys-challenge2021
My last submission failed because of insufficient shared memory. This problem occurs when I increase the num_workers parameter from 0 to 10 in the PyTorch DataLoader.


Solution:
Increase shared memory, which defaults to 64M. The fix is simply to pass the --shm-size 1024M parameter to docker run. Is it possible to increase it?
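A minimal sketch of the fix described above, in the same Python-list style used for docker flags later in this thread (the image name "my-submission" is a placeholder, not from this thread):

```python
import subprocess

# Sketch: raise /dev/shm above Docker's 64M default so DataLoader workers
# (num_workers > 0) can pass batches through shared memory.
# "my-submission" is a hypothetical image name.
cmd = [
    "docker", "run",
    "--shm-size", "1024M",   # shared memory size for the container
    "my-submission",
]
# subprocess.run(cmd, check=True)  # commented out: requires Docker locally
print(" ".join(cmd))
```

Inside the running container, `df -h /dev/shm` should then show the raised limit.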

Username: _mdaniluk
Method: md_4
Logs:
Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).

Best,
Michal

Alykhan Tejani

Apr 28, 2021, 8:04:04 AM
to Michał Daniluk, recsys-challenge2021
I will increase the shared memory size passed to docker run and redeploy. I'll let you know once this is done so you can test again.

--
You received this message because you are subscribed to the Google Groups "recsys-challenge2021" group.
To unsubscribe from this group and stop receiving emails from it, send an email to recsys-challenge...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/recsys-challenge2021/77f92ac1-9958-4f02-95f9-8b79d65d61e8n%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Alykhan Tejani | Staff Research Engineer | Twitter, Inc.    
20 Air Street, Soho, London W1B 5DL, UK
ate...@twitter.com @alykhantejani

Bo Liu

May 5, 2021, 3:52:13 PM
to recsys-challenge2021
Hi Alykhan,

Have you made this change? What's the new --shm-size parameter? 

Thanks,
Bo

Alykhan Tejani

May 6, 2021, 3:48:47 AM
to Bo Liu, recsys-challenge2021
Hi, 

The parameter is 1024M.



Bo Liu

May 11, 2021, 12:26:29 AM
to recsys-challenge2021
Hi Alykhan,

Besides the parameters --network none -m=64g --shm-size=1024m, do you restrict resources (especially CPU) with any other parameters in the docker run command? Does it have unrestricted access to all 16 cores?

I've tested my submission on an e2-highmem-16 GCP instance. It takes about 3 hours. But on the leaderboard it takes 22 hours.

Can you take a look?

Thanks,
Bo

Alykhan Tejani

May 11, 2021, 3:40:25 AM
to Bo Liu, recsys-challenge2021
Hi, 

Yes, we restrict with the following parameters:
'--memory', '64g',
'--cpus', '1',
'--memory-swap', '64g',
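Putting together the flags mentioned across this thread, a participant could approximate the scoring sandbox locally with something like the following sketch (the image name is a placeholder; the flag values are the ones quoted in this thread):

```python
import subprocess

# Approximate the leaderboard sandbox locally: no network, 64g memory,
# 64g swap cap, 1 CPU, 1024m shared memory.
# "my-submission" is a hypothetical image name.
cmd = [
    "docker", "run",
    "--network", "none",
    "--memory", "64g",
    "--memory-swap", "64g",
    "--cpus", "1",
    "--shm-size", "1024m",
    "my-submission",
]
# subprocess.run(cmd, check=True)  # commented out: requires Docker locally
print(" ".join(cmd))
```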



Gilberto Titericz

May 11, 2021, 9:01:15 AM
to recsys-challenge2021
Any reason for the 1 CPU restriction? Is it a bug, or is it planned to be this way?

The Leaderboard page says: "This image will be run on the e2-highmem-16 instances on GCP (see here for details). Your submission will have a 64 GB memory limit and a 24 hour time limit to score the entire test set (~10 million rows)."

So I believe most participants, including me, are assuming that 16 CPUs are available to score the 10M rows in the test set. We have been working hard and testing our solutions locally according to the official information. But as you can see, many competitors are seeing slowness in the submission process, and my guess is that everyone used multiprocessing to build their solutions but got constrained by the 1 CPU rule at inference time.

I know many top DS who didn't show interest in the competition this year when they saw that GPUs were not going to be available at inference time. Now probably many others will drop out over the lack of multiple CPUs.

Also, are we scoring 10M or 20M rows in the validation set (LB)?

Please be clearer about all aspects of the competition (and its constraints), or at least update the restriction information on the webpage.
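One wrinkle when testing this restriction locally: `docker run --cpus` imposes a CFS quota rather than pinning cores, so `os.cpu_count()` inside the container still reports every host core. A rough sketch for reading the effective limit from the cgroup v2 `cpu.max` file (the path assumes a cgroup v2 host; on cgroup v1 the file does not exist):

```python
import os

def cpus_from_cpu_max(text: str):
    """Parse a cgroup v2 cpu.max value ('<quota> <period>' or 'max <period>')
    into an effective CPU count, or None if unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None  # no quota set
    return int(quota) / int(period)

# os.cpu_count() still reports all host cores even under `docker run --cpus 1`:
print("visible cores:", os.cpu_count())
try:
    with open("/sys/fs/cgroup/cpu.max") as f:
        print("effective CPUs:", cpus_from_cpu_max(f.read()))
except FileNotFoundError:
    print("no cgroup v2 cpu.max file found")
```

Under `--cpus 1` with the default 100ms period, `cpu.max` reads "100000 100000", i.e. an effective limit of 1.0 CPUs.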

Giba

Alykhan Tejani

May 12, 2021, 6:53:57 AM
to Gilberto Titericz, recsys-challenge2021
Thanks for the feedback. The 24 hour constraint with 1 CPU is how we make the challenge more realistic; a 24 hour constraint with 16 CPUs would be prohibitively expensive in real-life situations for most businesses at 10M-20M samples. We hope this inspires top DS teams to invest in novelties in the model instead of simply throwing compute at the problem. We will update the information on the website.


For more options, visit https://groups.google.com/d/optout.

Jeong

May 14, 2021, 6:34:26 PM
to recsys-challenge2021
A 1 CPU restriction is even less realistic than 16 CPUs.

My laptop has 16 CPU threads. 

16-core instances on AWS cost between 40 and 80 cents per hour, i.e. less than $20-$40 for a full 24 hours.
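A quick spot-check of that arithmetic, using the quoted 40-80 cents/hour range:

```python
# Daily cost of a 16-core instance at the hourly rates quoted above.
low_rate, high_rate = 0.40, 0.80  # USD per hour
hours = 24
low_cost, high_cost = low_rate * hours, high_rate * hours
print(f"${low_cost:.2f} - ${high_cost:.2f} per day")  # $9.60 - $19.20
```

So even the high end of the range stays under $20 for a 24-hour run.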

I don't think any realistic production ML pipeline will run with only 1 CPU.

Jeong

Luca Belli

May 18, 2021, 2:34:13 PM
to Jeong, recsys-challenge2021
Hi Jeong,

I agree with you; realistically, no production ML system runs on a single CPU. I am not sure what number of CPUs we use for our service, but I am pretty sure it's >1.

At the same time, the point of this challenge is not to maximise metrics by throwing more hardware at the problem, but to find creative ways to deal with constraints.
Finally, keep in mind that the amount of real-time Twitter data we need to process is orders of magnitude larger than this dataset. If we did not constrain ourselves at this stage, we would run into all sorts of problems at scale.

We are thinking about how we can let people use their own hardware on the test set after the challenge is over, if there are particular methods they want to test.
Stay tuned for more!
Luca

ekander2

Jun 1, 2021, 9:02:50 PM
to recsys-challenge2021
Is there any plan to update the challenge website? It still lists the 16 vCPU compute instance as the constraint. I think many will agree with me that we would have approached the challenge differently knowing only 1/16 of the stated compute resources would be available.

Eric

Alykhan Tejani

Jun 2, 2021, 12:26:30 PM
to ekander2, recsys-challenge2021
Yes, thanks for pointing this out. I have updated the website

