The NumberFields Credit Debacle of 2019


Eric Driver

Apr 9, 2019, 7:45:55 PM
to Boinc Projects

Richard wanted me to put this information together regarding the credit situation at NumberFields over the last couple weeks.

I think it best to break down the events in chronological order, with a summary at the end.  So here goes...

First to set the stage, NumberFields had been using the credit_from_runtime validator option for many years.   It had been pretty stable with very few complaints from users.  Were there cheaters?  Yes, but that was mitigated using the runtime cap. Then one day I decided to introduce a GPU app...

Episode 1: Credit From Runtime
My biggest concern was producing a robust app and seamlessly integrating it into the project, but I hadn't considered the credit system.  Well, as I soon learned, credit_from_runtime does not play well with GPU apps.  For one, it could easily be cheated by the GPU folks - since the GPU version was so much faster, the cap essentially had no effect.  If the GPU had a separate runtime cap, then the cheating could have been mitigated as before.  But even without the cheating, the GPU FLOPS are way over-estimated and give outrageously high credits.

Episode 2: Credit New
Due to the problems with credit_from_runtime, we decided to switch to the default credit system (CreditNew).  This resulted in the GPU version paying out next to nothing.  The credit junkies started going into withdrawal and began leaving to find their fix somewhere else.  This of course is not good for a project that wants to retain its volunteers.  I expected CreditNew to take some time for the averaging to settle, but it never did.  Not knowing if or when the credits would correct themselves, I desperately switched to the credit_from_wu mechanism.  As a side note, I eventually learned that the stabilization problem was caused by the runtime outlier mechanism, which I had implemented years before and had forgotten about - the gpu tasks were so fast they were all flagged as outliers, and that of course broke CreditNew's averaging.
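To make the failure mode concrete, here is a minimal sketch of the interaction (illustrative only - this is not the actual BOINC validator logic, and the cutoff value is an assumption): a runtime-based outlier filter tuned for cpu tasks ends up flagging every gpu result, so none of them feed CreditNew's averaging.

```python
# Hypothetical sketch: an outlier cutoff chosen when only a CPU app
# existed. The constant is an assumption for illustration.
CPU_RUNTIME_LOW = 1000.0  # assumed lower cutoff, seconds

def is_runtime_outlier(runtime_s: float) -> bool:
    """Flag tasks that finish suspiciously fast for the CPU app."""
    return runtime_s < CPU_RUNTIME_LOW

cpu_runtimes = [3600.0, 4200.0, 3900.0]  # typical CPU tasks
gpu_runtimes = [120.0, 95.0, 150.0]      # same work, ~30x faster on GPU

# Every GPU result is excluded from the credit averaging:
flagged = [r for r in gpu_runtimes if is_runtime_outlier(r)]
```

With all gpu results excluded, CreditNew never sees the data it needs to converge for the gpu app version.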

Episode 3: Credit Per WU
In switching to fixed credit per WU, I had to select a value for the credit.  I made the mistake of listening to a credit junkie on the boards and set the credit/hour to something close to what another project paid.  Big mistake - the credit junkies were high as a kite (and very quiet), but now the cpu was paying almost 18x what it had originally.  This was clearly unacceptable, so I scaled it back down.  Now cpu credits are in the ballpark of their original value, gpu credits are significantly lower, and of course the credit junkies are whining again.

Summary
First, here is a link showing the credit debacle in a graphical format:
     https://boincstats.com/en/stats/122/user/detail/1969/charts
After having fixed the runtime outlier mechanism, the project went back to using CreditNew.
Anyway, I think Richard's goal for these notes was not for me to air my dirty laundry, but to provide "lessons learned" for future projects, especially ones with both gpu and cpu apps.  So here are my lessons learned:
1.  Never use credit_from_runtime with a gpu app.
2.  Use CreditNew if at all possible.
3.  Before introducing a gpu app, make sure the runtime outlier mechanism can handle it (otherwise CreditNew won't function properly).
4.  Don't listen to junkies.


-- Eric


Richard Haselgrove

Apr 10, 2019, 8:42:17 AM
to Boinc Projects, Eric Driver
Eric has covered the ground very thoroughly, but may I add some additional points to set the context for the wider BOINC community?

The project has been using Credit from Runtime since 2011. I was asked to visit their website and advise following a flurry of 'Maximum Elapsed Time Exceeded' errors: these occurred just as we were solving similar problems at SETI@Home, and I was able to advise on the 'outlier' mechanism, then in alpha testing.

The NumberFields project searches polynomial space, and does so via a series of distinct searches and batches within searches. We're now working through search 7, batch 29. There are significant variations in complexity, and hence runtime, between searches, between batches, and between tasks within batches. Credit from Runtime is ideal, because it has coped for the last 8 years with no adjustments to <rsc_fpops_est>. Project management would be much more complex if that value had to be adjusted for each run.

In CreditOptions, the credit claim for 'credit from runtime' is defined as "runtime*ncpus*peak_flops, where peak_flops is the host's Whetstone benchmark". That's actually a misnomer, because properly implemented Whetstone code should specifically exclude software optimisations: modern applications running SIMD optimisations like AVX regularly exceed Whetstone by a factor of three or more.
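To put numbers on the quoted definition, here is a hedged sketch of the claim formula. The COBBLESTONE_SCALE constant is BOINC's standard conversion (200 credits per day of sustained 1 GFLOPS); the benchmark figures below are purely illustrative.

```python
# BOINC's cobblestone scale: 200 credits per GFLOPS-day.
COBBLESTONE_SCALE = 200.0 / 86400e9  # credits per (FLOPS * second)

def runtime_claim(runtime_s: float, ncpus: int, peak_flops: float) -> float:
    """Credit claimed under 'credit from runtime': runtime*ncpus*peak_flops."""
    return runtime_s * ncpus * peak_flops * COBBLESTONE_SCALE

# A 1-hour task on one core benchmarked at 5 GFLOPS (Whetstone):
cpu_claim = runtime_claim(3600, 1, 5e9)      # ~41.7 credits

# The same runtime claimed against a GPU's 5 TFLOPS geometric peak:
gpu_claim = runtime_claim(3600, 1, 5e12)     # 1000x the CPU claim
```

The 1000x gap between the two claims is exactly the mismatch between a Whetstone figure and a geometric GPU peak that caused the trouble described above.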

The problems for Eric started when the GPU app used the same credit mechanism, and genuinely did use the GFLOPs Peak value calculated from 100% usage of the GPU's hardware geometry - a figure that is impossible to achieve in real-world parallel implementations. Whetstone and geometrical flops assessment should not be treated as equivalent.

CreditNew is an experiment in progress (although the code was first tested at SETI Beta in March 2010, and deployed live at SETI@Home in June 2010). It will be interesting to see how it develops here: the runtime variation without compensation in <rsc_fpops_est> may play a part, as may the project's use of replication = 1 (no high / low / average verification of credit claims). Previous tests, including a 'clean room' run at Albert@Home some years ago, suggest that we need to keep an eye on this.

Eric has reproduced the link to my account at BOINCstats. I'm a Windows user, and the project's GPU experiment is a beta app for Linux only: I'm not participating. My charts are for an unchanging number of CPUs only, and the variation - particularly in the Credit per day last 60 days chart - is down to the project's struggle to find a happy medium.

Finally, I think that NumberFields is a very good example of the sort of research that BOINC was designed to support. I think I'm right in saying that Eric is operating as a sole researcher, with only minimal server support from his institution's IT infrastructure team. It's a pleasure to work with him, and to see the good-natured way he interacts with users on his message boards on the rare occasions when things go wrong. I hope he survives the invasion of the credit junkies with minimal hard feelings, and that BOINC can continue to provide code and support that properly provides tools for the growing complexity of modern research methods.


--

Eric Korpela

Apr 10, 2019, 2:58:05 PM
to Richard Haselgrove, Boinc Projects, Eric Driver
I've been planning for a decade to replace the running means in the credit_new pfc scheme with running medians, because they are more stable in the presence of outliers (in fact, most projects could do without outlier detection entirely by using a 1000-point running median).  This is a fairly easy change, but testing it can take time, so I haven't actually put it in beta.
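The stability argument is easy to demonstrate. This sketch (illustrative, not BOINC code) uses hypothetical pfc-style samples with a 5% contamination of outliers:

```python
from statistics import mean, median

# Hypothetical pfc-style samples: 95 normal tasks plus 5 wild outliers
# (e.g. GPU tasks whose runtimes were mis-measured).
samples = [1.0] * 95 + [100.0] * 5

running_mean = mean(samples)      # dragged up to 5.95 by the 5% outliers
running_median = median(samples)  # unaffected: still 1.0
```

The mean is pulled nearly 6x off the typical value by just five bad samples, while the median does not move at all - which is why a running median could make separate outlier detection unnecessary.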

I've also been planning to add CPU use measurements so that actual CPU use is available on a per-host basis for both multithreaded and GPU apps.  The current CPU estimates for GPU apps are wrong to the point of being useless, since the real value (depending on the graphics card) is 0.5 ± 0.5 CPU.  We may as well be using a static 0.5 CPU.  It would also help with credit granting and app selection.  (Is running a 4-thread app really better than 4 single-threaded apps?)

--
Eric Korpela
kor...@ssl.berkeley.edu
AST:7731^29u18e3

Richard Haselgrove

Apr 10, 2019, 3:12:10 PM
to Eric Korpela, Boinc Projects, Eric Driver
Eric (K!) - could you have a look at my figures in https://github.com/BOINC/boinc/issues/2949, please?

It's seriously unhelpful to consider "all GPUs are equal" when assessing CPU-part requirements. The GPUs themselves may be equal, but the GPU applications vary greatly according to programming language and programmer.

Just look at the MB results for NVidia on SETI@Home: the CUDA apps require very little CPU, the OpenCL apps require 100% - and lose serious efficiency if they don't get it. Both the developers have strived over the years to get best efficiency: in this case it seems to be the language toolkits which are at fault.

I'd ask that any change to the default automatic evaluation should recognise that this is a binary step change from 0 to 1, and can't be handled simply by some analog scale.

Eric Korpela

Apr 10, 2019, 3:41:26 PM
to Richard Haselgrove, Boinc Projects, Eric Driver
The CPU stats would be stored in host_app_version, so they would be measured per host, per app version.  I don't see any way it could be treated as a single value for an app, a host, or a GPU model; the "number of cpus" for an app version on a host would be determined by how many cpus were used in prior runs of that app on that host.  We'd have to see how quickly those values converge to 1 on GPU apps that need a full CPU before deciding whether to implement a "round values between 0.5 and 1 up to 1" policy.  The app_version table would hold an average/median to be used for hosts that hadn't done prior work with that app, and the plan class values would only be used when an app is first released.
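One way to sketch the convergence question: maintain an exponentially weighted average of measured CPU use per (host, app_version) and see how many completed tasks it takes to approach 1.0 for a GPU app that really needs a full CPU. The smoothing factor and function name below are assumptions for illustration, not actual BOINC code.

```python
# Hypothetical sketch: one EMA update per completed task.
# alpha = 0.1 is an assumed smoothing factor, not a BOINC value.
def update_avg_ncpus(avg: float, measured: float, alpha: float = 0.1) -> float:
    """One exponential-moving-average step toward the measured CPU use."""
    return (1 - alpha) * avg + alpha * measured

# Start from a plan-class guess of 0.5 CPU; the app actually uses 1.0.
avg = 0.5
for _ in range(30):            # thirty completed tasks on this host
    avg = update_avg_ncpus(avg, 1.0)
# avg is now within a few percent of 1.0
```

With these assumed parameters it takes on the order of a few dozen tasks to get within a few percent of the true value, which suggests the per-host figure would converge before any rounding policy mattered much.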


Jose Luis Calabro

Mar 11, 2021, 9:22:14 PM
to boinc_projects, kor...@ssl.berkeley.edu, Boinc Projects, edri...@cox.net, r.hase...@btinternet.com
Hi, does anybody know if the new FAST radio telescope will work for SETI?

Jarod McClintock

Mar 15, 2021, 5:30:10 PM
to Jose Luis Calabro, boinc_projects, kor...@ssl.berkeley.edu, Boinc Projects, edri...@cox.net, r.hase...@btinternet.com

I imagine it will be years before SETI is in a position to look at new sources of data to process.

Jarod
