blah … forgot to change my ‘from’ address when sending.
> Begin forwarded message:
>
> From: Joe Hourcle <one...@dcr.net>
> Subject: Re: {SunPy} Re: JSOC IRIS LEVEL1 query
> Date: February 15, 2017 at 10:57:41 AM EST
> To: su...@googlegroups.com
> Cc: Kolja Glogowski <kol...@gmail.com>
>
>
> I recommend that people who haven’t had to manage a DRMS/SUMS instance not read this. We’ve had a rather high rate of burnout among people who have tried to do it. If you *really* want to go this route, let me know and I’ll see if I can get Niles to set you up with an account on the NSO jabber server … we have a support group on there.
>
>
>> On Feb 15, 2017, at 12:29 AM, Kolja Glogowski <kol...@gmail.com> wrote:
>>
>> Hi Joe,
>>
>> thanks for your comment concerning the FITS headers.
>>
>> I actually compared the original ('as-is') header from one of the files in Jakub's example with the header of the corresponding file that was obtained using the protocol='fits' option. Both headers were identical in this case, except for CHECKSUM, which probably differed because the date in the comment belonging to the DATASUM keyword had changed during the export process.
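The comparison Kolja describes, ignoring the checksum keywords that change during export, could be sketched like this with astropy (the header values here are made up for illustration; a real check would read the headers from the two downloaded files):

```python
# Sketch: compare two FITS headers while ignoring CHECKSUM/DATASUM,
# which the JSOC export regenerates. Assumes astropy is installed.
from astropy.io import fits

def headers_match(h1, h2):
    """True if the headers are identical apart from CHECKSUM/DATASUM."""
    diff = fits.HeaderDiff(h1, h2,
                           ignore_keywords=['CHECKSUM', 'DATASUM'],
                           ignore_comments=['DATASUM'])
    return diff.identical

# Hypothetical headers standing in for an 'as-is' file and a
# protocol='fits' export of the same record.
h_asis = fits.Header([('TELESCOP', 'SDO/AIA'), ('WAVELNTH', 171),
                      ('CHECKSUM', 'AAAA'), ('DATASUM', '123')])
h_fits = fits.Header([('TELESCOP', 'SDO/AIA'), ('WAVELNTH', 171),
                      ('CHECKSUM', 'BBBB'), ('DATASUM', '123')])

print(headers_match(h_asis, h_fits))  # → True
```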
>
> Yeah … I’m not sure if that’s the best way to do things. (I’m also of the opinion that there should be some information about the version of DRMS that generated the file, so if we discover in the future that NetDRMS 7 was generating improper FITS files, there’d be a way to know that a given file was generated with it and needed to be re-generated.)
>
> (I think it was 2.7 that fixed a problem with something in HMI … but 7 didn’t insert the DATASUM and CHECKSUM values back into the headers when exporting … which you’ll see if you download data from SDAC, as the attempt to replace it has dragged on for about 2.5 years or so.)
>
>
>> You are right that the keyword entries in the JSOC database can change, and that header entries in the original FITS files might be outdated. But the underlying problem (the metadata can be updated independently from the FITS files), is not resolved by simply using the protocol='fits' export option. For example the FITS headers of files that I download today could be already out of date by tomorrow, and AIA files that were downloaded a few years ago, most certainly contain obsolete metadata.
>>
>> I think the best way to handle this is to just ignore the metadata from the FITS headers and instead query/store the metadata separately. At KIS I usually don't have to deal with this, because we have our own DRMS server, and all the metadata for data series from JSOC are replicated automatically. But there are exceptions, where I need only a few files from JSOC. In this case I just download the 'as-is' files and use a separate DRMS query for the same dataset to obtain the corresponding metadata, which I then usually store as an HDF5 file. I also make sure to store the record number and storage unit number for each file (they can be obtained using the special keywords *recnum* and *sunum*). This way I can always check if the metadata was updated (recnum changed), or the actual file was replaced by a new one (sunum changed).
>
> sunum will give you something that you can check if the data’s changed … but for HMI data (except hmi.S_720s), if you want an actual index to the data you’ll also need ‘slotnum’, which will let you find the underlying file stored in SUMS.
>
> recnum is the prime key for the table of metadata in DRMS, which stores the headers. So if that’s changed, the metadata has changed.
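The bookkeeping described above can be sketched in a few lines. In practice the two snapshots would come from drms queries (e.g. `client.query(ds, key='*recnum*, *sunum*')`) run at download time and again later; here they are hard-coded, and the record names are invented:

```python
# Sketch: classify changes to DRMS records by comparing stored and
# current (recnum, sunum) pairs, per the discussion above.
def classify_changes(stored, current):
    """Compare {record: (recnum, sunum)} snapshots.

    Returns {record: status}: 'file-replaced' if sunum changed,
    'metadata-updated' if only recnum changed, else 'unchanged'.
    """
    result = {}
    for rec, (recnum0, sunum0) in stored.items():
        recnum1, sunum1 = current[rec]
        if sunum1 != sunum0:
            result[rec] = 'file-replaced'
        elif recnum1 != recnum0:
            result[rec] = 'metadata-updated'
        else:
            result[rec] = 'unchanged'
    return result

stored  = {'rec_a': (100, 5000), 'rec_b': (101, 5001), 'rec_c': (102, 5002)}
current = {'rec_a': (100, 5000), 'rec_b': (150, 5001), 'rec_c': (160, 6000)}
print(classify_changes(stored, current))
```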
>
> What I had asked for before launch was for the system to insert identifiers for different concepts, so that a researcher had a way to ask DRMS if there had been any updates. The JSOC refused, and said their system was done and they weren’t going to make any changes. (Then a couple of months later, they changed how the replication was going to be handled, brought in a contractor to re-write the whole thing, and took out all of the security provisions that I had requested, representing the SDAC & VSO.)
>
> For the last few years, I’ve been trying to get people interested in two things :
>
> 1. A way to pass FITS headers with a link to the data rather than the actual data attached, so we could pass updates to the FITS files cleanly, without people having to re-download the file. (As best I know, to export from DRMS to get the updated headers, the data must be in the local SUMS.) VOTable allows for this, but FITS does not. When I asked how to handle this on the FITSBITS mailing list, they dismissed it, saying it should never be done.**
>
> 2. A system that would allow researchers to easily ask if data had been updated … so as you’re about to submit a paper, you can check to make sure nothing’s been changed at the archive that would affect your conclusions. I got some interest from the SolarNet folks, but not too many other people. As we haven’t been able to set a standard for identifiers within the FITS files (and trying to get PIs to change their pipelines is near impossible … even when they’re processing for the ‘Final Archive’ to be sent to NASA before actually checking with the SDAC to see what we want … or even when it’s still before launch), I think the best approach is to write a program that researchers could run against a directory or list of FITS files. It would analyze them to figure out which dataset they are, then use knowledge of the datasets to determine unique identifiers and edition, and send off a query to the appropriate place to check if they’re up-to-date. ***
>
>
>
> ** After that incident, I did some testing and discovered that some of the IDL routines would read a file truncated after the header if you didn’t tell them to read the data. I haven’t tested it in PDL, PyFITS or IRAF. And I realized that it wouldn’t be the cleanest way to handle things, but I could claim a new compression scheme (“URL”?) so there was notice about what I did … but then you have to deal with all of the re-mapped ‘Z’ headers.
>
> *** Mind you, this still requires having a way to search for the records … which we can do for many of the sets served by VSO … but all we can do is tell the person to download the file again, we can’t just give them the updates.
>
>
>> PS.: The following is just a short explanation on how the JSOC export system works. I find it quite hard to get information about it, so maybe this is helpful to somebody.
>
> I’m serious. People should stop reading now.
>
>
>> FITS files from the HMI instrument are usually stored at JSOC without any header keywords (metadata), while FITS files from other instruments (like AIA or IRIS) may have metadata included in the FITS files from the time when they were imported into the JSOC system. After importing the files, the corresponding metadata are managed by the DRMS and stored in a dedicated database. If only the metadata of a record need to be changed, the DRMS database entries get updated, while the FITS headers of the files stored at JSOC will not be altered.
>>
>> When an export request is submitted using the protocol='fits' option, the system at JSOC creates new files for this particular request, by using the (image) data from the original FITS files and generating a new FITS header from the current metadata in the DRMS database. This can take some time and might put the JSOC servers under considerable load for large export requests.
>
> This is why the VSO has the network of caching servers. Unless otherwise specified with the ‘site’ keyword when retrieving data, HMI data requests go through NSO, AIA requests go through SDAC. There used to be additional public sites at SAO/CfA, UCLan, ROB and a few others.
>
> IRIS is served from LMSAL if you go through the VSO, not the copy in DRMS.
>
> The problem comes when you start requesting HMI data that’s not at the site you’re downloading from. The issue is that HMI data is stored as multiple files per storage unit (sunum). This is why you need the ‘slotnum’ to determine the sub-directory. The JSOC insists that the sunum is atomic, and said that if we forked SUMS, they wouldn’t support any of the data that was downloaded through our sites.
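To illustrate how slotnum selects a file within a multi-file storage unit: a storage unit sits in a directory named after its sunum, with slot sub-directories underneath. The partition root and zero-padding below are assumptions for illustration, not a statement of the actual SUMS spec:

```python
# Illustrative sketch only: build the sub-directory for one slot of a
# SUMS storage unit from its sunum and slotnum. The "/SUM12" partition
# name, the "D<sunum>" directory and the 5-digit "S<slotnum>" slot
# naming are assumptions, not taken from the SUMS documentation.
def sums_slot_dir(partition, sunum, slotnum):
    return f"{partition}/D{sunum}/S{slotnum:05d}"

print(sums_slot_dir("/SUM12", 987654321, 17))  # → /SUM12/D987654321/S00017
```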
>
> So, if we don’t have the file you wanted, we have to download a larger block of data. I believe it’s either 16 or 32 files to download from the JSOC, even when we only want one image. (Like for the person who just went and downloaded a few different HMI series at a 12hr cadence for multiple years.)
>
> We *did* manage to get the AIA data broken down into individual observations per storage unit, because they had planned on packing 8 different wavelengths per unit. (Which unfortunately may have resulted in some of the problems w/ DRMS & SUMS, as I don’t believe they were ever tested at the size of tables that this would end up generating.)
>
>
>> The server load (and the wait time) can be significantly reduced using protocol='as-is'. In this case the system does not create new files, but instead returns links to the original (unaltered) FITS files which have no (scientific) metadata stored in their headers (HMI), or may contain possibly outdated metadata (AIA or IRIS). The most recent metadata can be obtained independently, by directly querying the DRMS (for example by calling drms.Client.query() from the drms Python module or by directly using the HTTP/JSON interface from JSOC). If the method='url' option is used, the system also creates a plain-text file containing a table of the most recent metadata for each requested record, which can be easily parsed (for example using pandas.read_table()), and used instead of the (possibly obsolete) FITS header entries.
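Parsing that plain-text metadata table is a one-liner with pandas, as Kolja suggests. The column names and rows below are invented for illustration; a real table's columns depend on the keywords requested in the export:

```python
# Sketch: parse a whitespace-delimited metadata table like the one a
# method='url' export produces. The sample content here is made up.
import io
import pandas as pd

table = io.StringIO("""\
record T_REC WAVELNTH
aia.lev1[2017-02-15T00:00:00Z][171] 2017-02-15T00:00:09Z 171
aia.lev1[2017-02-15T00:00:00Z][193] 2017-02-15T00:00:07Z 193
""")

meta = pd.read_table(table, sep=r'\s+')
print(meta['WAVELNTH'].tolist())  # → [171, 193]
```

The resulting DataFrame can then stand in for the possibly stale FITS header entries of the 'as-is' files.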
>
> hmm … interesting … I hadn’t thought of that. You’re likely piggybacking off of one of the APIs that the JSOC ‘lookdata’ webpage uses. I’d still want to get it into a more universal form, so that people could use the typical tools on it.
>
> I was actually hoping to make an OO version of the IDL VSO client … so you’d get a list of objects as the results from vso_search, and you could call general ‘get’, ‘read’, ‘prep’, ‘plot’, etc. on them, rather than having to use different functions depending on which instrument the data came from. … yet another thing that never got done.
>
> -Joe
>