File transfer and management systems


Brett Viren

Mar 3, 2016, 11:56:40 AM
to hep-sf-t...@googlegroups.com
Hi HSF techies,

I'm seeking advice on a system that will be needed for managing and
transferring raw data files coming from DUNE prototype LArTPC detectors
that will run in CERN test beams in the 2017/2018 time frame.

We still have a lot of unknowns, but I think asking this list for early
ideas is exactly what it's for! Maybe someone knows of something close
that we can adopt, or some good toolkit/framework/system that we can use
as a component/basis for something we develop ourselves.

Given many unknowns, here is some idea of the parameters/requirements
for the system:

Data rates: I give rates for the detector I'm most familiar with (the
"single-phase" LArTPC) but the other ("dual-phase") should be similar,
certainly within my order-of-magnitude uncertainties:

- instantaneous data rates of 1-20 GByte/sec (during beam spill)

- sustained 0.1-5 GByte/sec (average over full beam cycle)

We will have online disk farms on the same LAN as each detector's DAQ.
We expect them to be at least fast enough to sink the "sustained" data
rate and large enough to hold ~1 day's worth of data (so 100 TB to 1 PB
scale).
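
To spell out the arithmetic behind that sizing (a quick back-of-envelope
sketch; the headroom reading is mine):

```python
# Back-of-envelope check of the ~1 day online buffer sizing.
SECONDS_PER_DAY = 86400

for rate_gb_per_s in (0.1, 5.0):        # sustained range quoted above
    tb_per_day = rate_gb_per_s * SECONDS_PER_DAY / 1000.0
    print(f"{rate_gb_per_s:3.1f} GB/s sustained -> ~{tb_per_day:.0f} TB/day")

# ~9 TB/day at 0.1 GB/s and ~432 TB/day at 5 GB/s, hence the
# 100 TB to 1 PB scale once some headroom is allowed.
```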

Starting from those buffer disk farms, the raw data file management
system must:

- copy all files to CERN EOS disk and then to CERN CASTOR tape.

- copy all files from EOS to Fermilab where other systems at FNAL take
over management of that copy.

- submit "express lane" jobs to CERN computers on some small subset of
data while it is on EOS.

- possibly do additional types of prompt file usage that we have not
yet considered.

- purge files from online disk buffer and/or EOS using a variety of
criteria which would be based on the outcomes of the above actions.

- centrally record per-file information for each action (start/finish
of file writes, copies, transfers, processing jobs) with a "real
time" latency for updates on the order of a few seconds.

- provide monitoring and display of the history of these records.


I do have a bare, almost cartoon design for a system. It boils down to
maintaining a per-file state machine in a central database and
developing a zoo of "agents" talking to that DB through a web service
API. The agents would enact and inform the DB of state transitions
(copy, archive, submit job, purge file, etc). Web applications would be
developed to query the API to provide monitoring views of the state.
Though conceptually simple, there are a lot of parts to develop if we
start from scratch.
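
To make the cartoon slightly more concrete, here is a minimal sketch of
the agent side of that idea (all state names, endpoints and helpers here
are made up for illustration, not a proposed design):

```python
import time
import requests  # agents talk to the central DB only via its web-service API

API = "https://filedb.example.org/api"  # hypothetical endpoint

def copy_to_eos(path):
    """Placeholder for the actual transfer mechanism (xrdcp, GridFTP, ...)."""
    ...

def run_agent():
    """One 'agent' from the zoo: moves files from the online buffer to EOS."""
    while True:
        # Atomically claim one file in this agent's input state; the DB
        # marks it 'copying_to_eos' so no other agent picks it up.
        r = requests.post(f"{API}/claim",
                          json={"state": "on_buffer", "next": "copying_to_eos"})
        if r.status_code == 204:      # nothing to do right now
            time.sleep(5)
            continue
        f = r.json()
        copy_to_eos(f["path"])
        # Report the completed transition; the server timestamps it, which
        # gives the few-second-latency history the monitoring views need.
        requests.post(f"{API}/transition",
                      json={"file_id": f["id"], "state": "on_eos"})
```

Other agents (archive to CASTOR, transfer to FNAL, express-lane job
submission, purging) would repeat the same claim/transition pattern
against other states.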


So, oh wise list, any ideas or suggestions for systems to adopt or
components to consider?

Thanks!
-Brett.

Marco Cattaneo

Mar 3, 2016, 12:30:00 PM
to Brett Viren, hep-sf-t...@googlegroups.com, Charpentier Philippe, Christophe Haen, Stagni Federico
Brett

What you describe largely overlaps with the dataflow that we use in LHCb, managed by (LHCb)DIRAC. Our sustained data rate to CASTOR is ~0.5 GB/s. You might want to discuss with our data management experts (in CC) whether there are ideas and/or tools that you can share in a simple way.

Brett Viren

Mar 5, 2016, 7:19:02 PM
to Marco Cattaneo, HEP Software Foundation Technical Forum, Charpentier Philippe, Christophe Haen, Stagni Federico
Hi Marco,

This sounds interesting. Are there some places I can start learning
about the nature of the system? Initial googling finds me various
presentations and TWiki pages, but I'm not sure if I'm finding what
you are suggesting.

Thanks!
-Brett.

norm...@gmail.com

Mar 8, 2016, 12:48:59 AM
to HEP Software Foundation Technical Forum

Brett,

So let me speak on behalf of the NOvA DAQ group, which I think is fair since I was in on the design from day one and continue as the lead of the DAQ today (so I know both the requirements that we set forth as well as all the dirty details of the real DAQ and its operational performance).

We encountered a very similar problem when designing the DAQ and data handling system for the NOvA experiment.  In particular we wanted a high-reliability system which could perform all of the standard functions that you are describing, as well as perform different workflows for different classes of data (i.e. raw data, different types of log files, calibration files, and survey data).   We also wanted a system that could run in a "lights out" mode and would require minimal to no effort to maintain over the life of the experiment.  Basically we were designing for a situation similar to MINOS/Soudan (which you are intimately acquainted with) but building on that experience to make it even more robust and tailored to the Ash River lab (which is more remote, computing-wise, than Soudan).

We drafted a detailed set of requirements and interface-specification documents for our system (which are public in the NOvA DocDB system — let me know if you want them and I'll fish up the actual document numbers and links) and then worked with Fermilab to develop our full data handling system.

Now in terms of actual technologies we settled on:

For the file transport out of the DAQ we used the Fermilab File Transfer System (FTS) with add-on modules we wrote specific to the NOvA DAQ setup and the data format.   The core FTS code is available from a number of official sources, and our modules are actually publicly available from the NOvA CVS repositories.  MicroBooNE and 35 Ton use the same code base but with modules specific to their detectors.

For the data/replica catalog we use the modern SAM system (the HTTP-based one).  It handles all the replica information and metadata that we need.  It also makes for a very smooth transition to the offline/analysis world, since it keeps our entire file provenance for us and has the standard analysis-project workflow code built in.  The other advantage is that SAM understands the tape systems and other storage systems, so we didn’t have to build specific support into our DAQ for any of this.
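
To give a flavor of what the catalog tracks, a per-file record carries
roughly this kind of information (illustrative, SAM-style field names,
not the exact schema):

```python
# Illustrative, SAM-style per-file metadata record.  The field names
# are representative only; the real schema is in the SAM documentation.
metadata = {
    "file_name": "nova_raw_r00012345_s01.raw",   # hypothetical file
    "file_size": 2147483648,
    "checksum": "adler32:deadbeef",
    "file_type": "detector",
    "data_tier": "raw",
    "runs": [12345],
    "start_time": "2016-03-01T00:00:00",
    "end_time": "2016-03-01T00:10:00",
}
# Replica locations are tracked separately from the metadata, so one
# record can point at copies on tape, in dCache, and at remote sites.
```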

 

For actual data transport we have gone through a whole host of different protocols as we were spinning NOvA up.  We’ve done everything from simple scp-based copies to full third-party GridFTP transfers.  In the end we hide all of the protocol details behind a simple API (this is formally “samcp”) and then have been able to configure or switch protocols as we found the need to.
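
The pattern is just one copy entry point with the protocols plugged in
behind it, something like this sketch (the shape of the idea only, not
the actual samcp interface):

```python
import subprocess

# Sketch of the "one copy API, pluggable protocols" pattern described
# above.  This is NOT the real samcp interface, just its shape.
COPIERS = {
    "scp": lambda src, dst: subprocess.run(["scp", src, dst], check=True),
    # GridFTP wants gsiftp:// URLs rather than host:path arguments.
    "gridftp": lambda src, dst: subprocess.run(
        ["globus-url-copy", src, dst], check=True),
}

def copy(src, dst, protocol="scp"):
    """Single call site for every transfer; the protocol is configuration."""
    COPIERS[protocol](src, dst)

copy("/data/raw/file.raw", "gateway.example.org:/buffer/file.raw")
```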

 

As for data rates, yeah, ProtoDUNE is going to really push the envelope.  At the same time, the rates that you are talking about are not very far off what NOvA and MicroBooNE have been doing.  NOvA’s raw rate peaks at about 4 GB/s off of the detector and is then heavily filtered down to 2-8% of that rate (depending on trigger mix).  MicroBooNE is more aggressive: it runs closer to a zero-bias readout and has data rates about 10-20x ours depending on their readout mode.  We’re both using the same infrastructure; it isn’t limiting us and it still has significant headroom and paths for scaling up.  I’m not sure of the exact numbers coming out of 35T, but you could ping Tom Junk and he would know.

Now in terms of topologies and technologies to adopt: I saw Maxim’s slides from the DAQ workshop, and if those are similar to what you are asking about here, then I think this is pretty straightforward.  The actual layout is very close to the NOvA and MicroBooNE layouts.  You really could just insert an FTS daemon on your DAQ RAIDs and then configure an EOS endpoint and an FNAL endpoint.  From there you use the standard dataset replication tools to clone the data to BNL, NERSC and any other locations.

I think our (NOvA) system is pretty straightforward and looks to map well onto what you have shown.  It’s also been adopted at this point in other parts of the neutrino community, so you shouldn’t have a hard time convincing those on DUNE of its value, since there is good overlap between that collaboration and NOvA, MicroBooNE and the other LAr experiments.

Let me know if you have any questions.  I actually went a bit over the top and wrote up much more detail on our systems initially, so if you want some lessons learned or best practices, I have that as well as all our technical documents.

Andrew

Gonzalo Merino

Mar 8, 2016, 1:23:57 PM3/8/16
to norm...@gmail.com, HEP Software Foundation Technical Forum
Hello,

This is Gonzalo Merino, from the IceCube neutrino telescope at UW-Madison.

Thanks Brett for posing these questions in this forum, and thanks
Marco and Andrew for replying.

This topic resonates as also being of interest for us in IceCube.
Something that we do not have in place today and would like to
deploy in the future is some sort of replica and metadata catalog. So
far, this has not been an urgent need for IceCube because a big chunk
of the operations was centralized and monolithic. We are moving
towards a more distributed model, with archive replicas at remote
sites, etc., so the need for a more consistent catalog gets
stronger.

A few of our in-house data management tools already make use of
internal catalogs, so one option we contemplate is to evolve some of
those to take care of the new distributed-replica use cases. However,
we are of course interested in knowing "what is available out there",
as usual, to try and understand whether we should plan on "writing
modules to customize existing tools" rather than fully developing our
own tool.

Andrew, several of the items you mention in your reply sound
interesting, and I would be happy to learn more about them. If that is
OK with you, I would be interested in pointers to specific documents
you would recommend that describe "the modern SAM" or "F-FTS", for
instance.

thanks!
Gonzalo

Brett Viren

Mar 8, 2016, 2:11:52 PM3/8/16
to norm...@gmail.com, HEP Software Foundation Technical Forum
Hi Andrew,

Thanks for your detailed descriptions!

My understanding from DUNE 35t data-taking experts is that Fermilab's
SAM + drop box ingest works pretty well and enjoys lots of support
from the computing division. So, I think everyone is assuming that a
setup like that will take over the data once it is deposited at
Fermilab's "front door".

But, at CERN I think it's not so clear if we should try to extend FNAL
tech there. Technically, I think what you describe can work but I worry
about optimizing the level effort needed since we have so little
available in DUNE. CERN-centric tech comes with the assumption of lots
of "free" nearby experts just like FNAL-centric tech at Fermilab does.
Somehow we have to make this optimization.

I'd like to hear more about NOvA's experience. I don't have access to
the NOvA DocDB, and with IceCube also interested I think publicly
posting any docs that you think are good would be very helpful.

Maybe one way to ask it: let's assume that today I want to install a
test environment at BNL that uses FTS to send data to FNAL. What
documentation should I read to understand the effort needed, and maybe
even to start doing it? What hardware and software is needed on my end?
How do I install and configure it?


Thanks!
-Brett.

Michel Jouvin

Mar 8, 2016, 2:48:13 PM3/8/16
to hep-sf-t...@googlegroups.com
Could be a good topic for an HSF Technical Note!

Cheers,

Michel

Thomas R. Junk

Mar 8, 2016, 5:36:07 PM3/8/16
to hep-sf-t...@googlegroups.com

  Dear all,

 I'll consider myself pinged!  On 35t, we have a two-step copy that resulted from a historical network configuration
where the nodes that wrote the raw data files the first time were not connected to the site network.  We copy
once using FTP to a second computer which buffers the data, runs offline monitoring jobs on a subset of the
data, extracts metadata, and uploads the files to Fermilab's central dCache using FTS.
Currently we're not bottlenecked on that step (as far as we can tell).

  The file transfer from PC-4 to dCache is done with FTS, and we've been very happy with it.  It's faster than
copying the data files over NFS, and it provides a web interface for monitoring progress and flagging failures
in the file transfers.  We have seen three different easily-handled failure modes, all of which were addressed
quickly by the Data Handling support team at Fermilab: expired certificates, failed copies due to dCache
outages (some due to scheduled upgrades), and a power cut which required a restart of the FTS server
processes.  The latter was partly our fault for not keeping a local copy of the FTS software: after the power
cut, the computer restarted faster than the network mount where its software lived.  Since we're only running
2-3 months, making a fully turnkey system was less of a priority than it would be for a long-running
experiment.  The other interventions were addressed very quickly, and most failed transfers I can retry
myself using the handy web interface instead of bothering the support team.

  As for bandwidth, we have a 1 Gbit network connection out of PC-4 to the rest of the site.  The link from the
DAQ machine making the file to the gateway computer which runs FTS is also 1 Gbit.  There is a minimum
amount of overhead needed to compute checksums and check them, but we are getting maybe 70 to 80
megabytes/sec pretty steadily through the system, where the bottleneck is upstream making the files and not
in the transfer.
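
For a feel of where that checksum overhead comes from, here is a generic
sketch of a streaming checksum pass (an illustration, not the code FTS
actually runs):

```python
import zlib

def adler32_of(path, chunk_size=1 << 20):
    """Stream a file once, folding each chunk into a running Adler-32."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF

# Every byte is read for the checksum (and typically re-read on the
# receiving side to verify), so checksum speed caps transfer speed.
print(f"{adler32_of('/data/raw/file.raw'):08x}")  # hypothetical path
```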

We have a second gateway computer in case the production one fails, and FTS is configured on it as well.
If we find that the gateway computer's disk writes are bottlenecking our throughput, we can send half of the
files to this other computer as a simple parallelization (see the sketch below).  For ProtoDUNE, our
discussions this morning indicated that we'd have to write data to many disks simultaneously, and
parallelizing the I/O at the file-writing step may be needed to share CPU and I/O loads.  ROOT compression
has been seen to be slow on 35-ton data and gives a compression factor of only around 2.
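
The split can be as simple as hashing the file name to pick a gateway (a
sketch; the host names are hypothetical):

```python
import zlib

GATEWAYS = ["gateway1.example.org", "gateway2.example.org"]  # hypothetical

def gateway_for(filename):
    """Deterministically spread files across gateways by name hash."""
    return GATEWAYS[zlib.crc32(filename.encode()) % len(GATEWAYS)]

# The same file always goes to the same gateway, and the load splits
# roughly evenly with no coordination between the senders.
print(gateway_for("np04_raw_run000123_0001.root"))  # hypothetical name
```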

  Not as polished as NOvA for sure!

  Tom

Elizabeth Sexton-Kennedy

Mar 9, 2016, 11:38:12 AM3/9/16
to Brett Viren, norm...@gmail.com, hep-sf-t...@googlegroups.com
Hi Brett,

I also want to thank you for raising this topic in this forum. Coincidentally, CMS just had a data management review mini-workshop at CERN last week. There were two time scales considered: (1) what is the strategy to reach the end of Run 2? and (2) what do we pursue for the HL-LHC timescale, 10 years from now?

My personal conclusion from the workshop is that we could probably survive to the end of Run 2 with our in-house solution, since it is based on the WLCG FTS3 tool. In fact ATLAS is using many more of the features of FTS3 than we are, and if need be we could thin down what we do in our layer, pushing functionality down the stack to FTS3. In the longer term, for HL-LHC, we will have 2 orders of magnitude more data to handle, and I’m not sure we can get there alone. You need the support of a lab to grow a tool to that level; it needs sustained development commitment over a decade.

On this second time scale we had presentations from FNAL on SAM and from CERN/Oslo on Rucio. I asked if I could share those, and both authors agreed: [1], [2]. I also reference the FTS3 talk [3]. Oli tells me that SAM could be adapted to use FTS3 as another protocol that it supports (just as it now supports GridFTP).

Cheers, Liz

[1] https://dl.dropboxusercontent.com/u/13077109/fnal_data_management_overview.pdf
[2] https://dl.dropboxusercontent.com/u/13077109/The_ATLAS_DDM_System_1.pdf
[3] https://dl.dropboxusercontent.com/u/13077109/FTS3.pdf