1) None of the sample datasets download data to make a local copy. They all re-serve remote data on-the-fly.
You can tell (reasonably well) by looking at each dataset's "type", which is right after each "<dataset " in datasets.xml. Some types get data from local files (the type ends in "Files", e.g., EDDGridFromNcFiles) or a local database or Cassandra ("EDDTableFromDatabase", "EDDTableFromCassandra"). But other types get data from a remote server (e.g., "...FromDap", "...FromERDDAP", "EDDTableFromSOS"). If you aren't sure, you can look up the type of dataset at
https://coastwatch.pfeg.noaa.gov/erddap/download/setupDatasetsXml.html#datasetTypes and read the details.
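To sketch the distinction in datasets.xml (hypothetical datasetIDs, URLs, and paths; details elided):

```xml
<!-- Gets data from local files: -->
<dataset type="EDDGridFromNcFiles" datasetID="myLocalGrid" active="true">
  <fileDir>/data/myLocalGrid/</fileDir>
  ...
</dataset>

<!-- Re-serves data from a remote DAP server on-the-fly: -->
<dataset type="EDDGridFromDap" datasetID="myRemoteGrid" active="true">
  <sourceUrl>https://someRemoteServer/thredds/dodsC/someDataset</sourceUrl>
  ...
</dataset>
```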
That said, an "EDD...From...Files" dataset can be based on remote files yet be set up to make a partial or full local copy of those files. But none of the sample datasets do that. It would just cause trouble, because it would presume that you had a ton of disk space available. Also, downloading an entire large dataset would take a long time. I wouldn't inflict that on newbie ERDDAP admins.
2) Nodes are not expected to hold local copies of data from other ERDDAP nodes. It would take a ton of disk space to replicate many of these datasets. In most cases, there is no need for the other nodes to maintain copies of the data. That's an intentional design feature of federations of ERDDAPs. See
https://coastwatch.pfeg.noaa.gov/erddap/download/grids.html (which is geared to one institution, but the idea is the same).
The exception would be if you know that users of your node will be making very high use of a given dataset. Then, if you maintain a local copy of the data, it takes a lot of burden off the source ERDDAP and also makes access faster for your users (since the data is closer). But I'm not aware of anyone actually doing that, or needing to. If you want to make a partial or full local copy, see
https://coastwatch.pfeg.noaa.gov/erddap/download/setupDatasetsXml.html#cacheFromUrl
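To sketch what that looks like (hypothetical names and paths; see the link above for the real options, including limiting the cache size to make a partial copy):

```xml
<!-- A files-based dataset that downloads and caches the remote files locally: -->
<dataset type="EDDTableFromNcFiles" datasetID="myLocalCopy" active="true">
  <cacheFromUrl>https://someRemoteServer/erddap/files/someDataset/</cacheFromUrl>
  <fileDir>/data/myLocalCopy/</fileDir>
  ...
</dataset>
```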
3) I'll let other people reply to #3.
I will say: you can do spatial queries to find datasets (in a simplistic way) via the advanced search option in my ERDDAP, which searches about 40 other ERDDAPs for matching datasets.
It isn't a great interface (you have to type in the lat and lon values), but it works.
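If I recall the parameter names from the advanced search form correctly, the bounds can also go straight into the URL (the search term here is just an example):

```
https://coastwatch.pfeg.noaa.gov/erddap/search/advanced.html?searchFor=sst&minLon=-130&maxLon=-120&minLat=30&maxLat=40
```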
4) If you're talking about automating the Java + Tomcat + ERDDAP installation, I think the problem is that every setup is different (different OS, different needs/goals). A partial answer to that is Docker, but Docker is yet another tool to learn and obscures some details that may be important. If you are already familiar with Docker and/or have a need for multiple ERDDAPs, Docker is probably the way to go.
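For example, a Docker setup can be as simple as this sketch (the image name and the container paths are assumptions; check the image's own documentation for the exact mount points for your version):

```shell
# Sketch only: run ERDDAP via the community axiom/docker-erddap image.
# The mount point below is an assumption -- verify it against the image docs.
docker run -d --name erddap -p 8080:8080 \
  -v /local/path/datasets.xml:/usr/local/tomcat/content/erddap/datasets.xml \
  axiom/docker-erddap
# Then browse to http://localhost:8080/erddap/
```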
If you're talking about setup.xml and datasets.xml, well, I've tried to make that easier with recent changes/reorganizations (at the expense of current admins (sorry)). The trade-off is always: more features lead to more effort (which can be minimized, but still ...). I continue to put tons of effort into GenerateDatasetsXml (which has a ton of heuristics) and DasDds, but setting up datasets will always involve some effort. It takes time and effort to learn to use tools well. I think people forget how long it took to learn to drive, type, use Matlab or R, learn their profession, etc. This effort can be minimized, but not eliminated.
Interestingly, THREDDS was originally designed to be a pass-through system (point it at a bunch of files (almost no effort) and it will serve those files, as is, as separate datasets). ERDDAP took a different approach (always encouraging aggregation and improvement of the metadata). ERDDAP's more active approach will always require more effort, but I think it is worth it. Until the singularity arrives, there are things that a human can do better than a computer.
I hope that helps.
Best wishes.