Proposal to use pycurl instead of urllib2 for cmsweb

Giffels, Manuel (IKP)

Mar 31, 2015, 11:56:24 AM
to grid-c...@googlegroups.com
Dear all,

I would propose to use pycurl instead of urllib2 for heavy interactions with cmsweb, for example the publication of datasets into DBS 3.

Benchmark:
=========
Before a new dataset can be inserted into DBS, one has to check that all parent files are already in the destination. Basically, that means one API call per parent file; for the dataset below, that is 8218 calls.

Result using pycurl and the dbs3 client:
-----------------------------------------------------
time python datasetDBS3Add.py ../work.dYToMuMuFilter_cern/
Blocks: 1
2015-03-31 16:20:47 - dbs3-migration:DEBUG - Checking parentage for block: /DYJetsToLL_M-50_13TeV-madgraph-pythia8-tauola_v2/Dataset_58142f5c78942717a195be5d8b5c1c3c/USER#790a2860-6323-8e0c-8067-67b42dc97f1a
Number of parent files: 8218

real 5m59.676s
user 2m39.977s
sys 0m5.043s

Result using urllib2 client instead:
--------------------------------------------
time python datasetDBS3Add.py ../work.dYToMuMuFilter_cern/
Blocks: 1
2015-03-31 17:27:51 - dbs3-migration:DEBUG - Checking parentage for block: /DYJetsToLL_M-50_13TeV-madgraph-pythia8-tauola_v2/Dataset_58142f5c78942717a195be5d8b5c1c3c/USER#790a2860-6323-8e0c-8067-67b42dc97f1a
Number of parent files: 8218

real 19m49.109s
user 3m39.461s
sys 0m9.891s
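
As a quick cross-check of the two `real` timings above, the following sketch converts them to seconds and derives the ratio and the average wall-clock time per parent-file check. The helper name `to_seconds` is made up for illustration. The per-call figures include everything else the script does, so they are only an upper bound on the cost of a single API call.

```python
import re

def to_seconds(stamp):
    """Convert a `time`-style stamp such as '5m59.676s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)

N_CALLS = 8218  # "Number of parent files" from the log output

pycurl_real = to_seconds("5m59.676s")    # pycurl / dbs3 client run
urllib2_real = to_seconds("19m49.109s")  # urllib2 client run

speedup = urllib2_real / pycurl_real
per_call_pycurl = pycurl_real / N_CALLS
per_call_urllib2 = urllib2_real / N_CALLS

print("wall-clock ratio urllib2/pycurl: %.2f" % speedup)
print("avg. seconds per call (pycurl):  %.3f" % per_call_pycurl)
print("avg. seconds per call (urllib2): %.3f" % per_call_urllib2)
```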

The result clearly shows that pycurl outperforms urllib2 by roughly a factor of 3 in wall-clock time (5m59s versus 19m49s). The reason is that pycurl supports SSL session caching, whereas urllib2 does not. That means urllib2 has to authenticate 8218 times with cmsweb, which causes unnecessary load on the cmsweb frontends and takes a lot of time for the client, too.
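
To illustrate the point about session caching, here is a minimal sketch of issuing many requests through a single pycurl handle, so that libcurl can keep the connection and the SSL session alive between calls. This is not the dbs3 client code; the function name `fetch_all` and its parameters are made up for illustration, and pycurl is assumed to be installed whenever no handle is passed in.

```python
from io import BytesIO

def fetch_all(urls, handle=None, cert=None, key=None):
    """Fetch every URL through ONE curl handle and return the raw bodies.

    Reusing the handle is the important part: a fresh handle per request
    would start with no connection and no cached SSL session.
    """
    owns_handle = handle is None
    if owns_handle:
        import pycurl  # assumption: pycurl is available (e.g. in a CMS environment)
        handle = pycurl.Curl()
    try:
        # Optional client certificate/key (e.g. a grid proxy); paths are caller-supplied.
        if cert:
            handle.setopt(handle.SSLCERT, cert)
        if key:
            handle.setopt(handle.SSLKEY, key)
        bodies = []
        for url in urls:
            buf = BytesIO()
            handle.setopt(handle.URL, url)
            handle.setopt(handle.WRITEFUNCTION, buf.write)
            handle.perform()  # same handle for every request
            bodies.append(buf.getvalue())
        return bodies
    finally:
        if owns_handle:
            handle.close()
```

Passing an existing handle via `handle=` lets a caller keep the same handle across several batches of requests.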

Please, let me know your opinion.

Cheers,
Manuel

P.S.: I ran both tests multiple times and the results do not vary much.

Fred-Markus Stober

Mar 31, 2015, 3:35:27 PM
to Giffels, Manuel (IKP), grid-c...@googlegroups.com
Hi,

Here are my 2ct:

1ct) is it possible to keep urllib2 as a fallback in the code? I guess a
simple solution would be to bundle the functions in the webservice
module into a class "RESTAPI". Depending on the availability of pycurl
we could load one or the other - that's what the dynamic class loader is
for :)

and the second

1ct) did you try the same benchmark with the "requests" lib
(http://docs.python-requests.org/en/latest/) ? I've found some
indications that it might support ssl session caching as well (not on
the homepage but on a stackoverflow page). Since pyCurl depends on a C
lib, bundling requests (apache 2.0) would be a more deployment friendly
solution, which would work everywhere ...

Cheers,
Fred
--
Dr. Fred Stober <sto...@cern.ch>
KIT - Karlsruhe Institute of Technology
IEKP, Bld. 30.23 8-22, +49-721-608-47243

Giffels, Manuel (IKP)

Apr 1, 2015, 6:36:57 AM
to Stober, Fred, grid-c...@googlegroups.com
Hi Fred,

thanks for the feedback.

> 1ct) is it possible to keep urllib2 as a fallback in the code? I guess a
> simple solution would be to bundle the functions in the webservice
> module into a class "RESTAPI". Depending on the availability of pycurl
> we could load one or the other - that's what the dynamic class loader is
> for :)

That is how it is currently implemented. ;-) It uses the pycurl dbs3 client and the fallback is urllib2. I just need to put the dbs3-client stuff somewhere in the repository. pycurl is available if the script is executed in a CMS environment.

> and the second
>
> 1ct) did you try the same benchmark with the "requests" lib
> (http://docs.python-requests.org/en/latest/) ? I've found some
> indications that it might support ssl session caching as well (not on
> the homepage but on a stackoverflow page). Since pyCurl depends on a C
> lib, bundling requests (apache 2.0) would be a more deployment friendly
> solution, which would work everywhere …

No, I did not try the same benchmark with the requests lib. However, I can do it and check whether SSL session caching is available there as well. I doubt it.
I know that packaging pycurl within grid-control would break the OS independence of grid-control. Sourcing a CMS environment would work, and if that is not possible we will still have urllib2 as fallback. I would probably add a warning/notification in that case, so that people can choose.
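
A minimal sketch of such a fallback with a warning could look like the following. It is not the grid-control implementation; the function name `select_backend` and the warning text are made up for illustration, and `urllib.request` stands in for urllib2 so that the snippet also runs on Python 3.

```python
import importlib.util
import warnings

def select_backend(preferred="pycurl", fallback="urllib.request"):
    """Return the name of the HTTP module to use.

    Prefers `preferred` if it can be imported; otherwise warns the user
    and returns `fallback`.
    """
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    warnings.warn(
        "%s is not available, falling back to %s; "
        "expect slower bulk requests without SSL session caching"
        % (preferred, fallback)
    )
    return fallback

backend = select_backend()
print("using HTTP backend: %s" % backend)
```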

Cheers,
Manuel