Thanks Andrew - good point. Another thing that's been at the back of my mind for some time is that 'ethnicity' is a good descriptor for human samples, but animal samples may need other fields.
One possibility would be to link to the NIH Biosample data. I've shown below the details of the record for an example Biosample. NIH provide a range of records tailored to different organisms, with
On the plus side, relying on the Biosample record would mean that the information is held as far 'upstream' as possible: it comes from the original depositor, is disseminated to anyone looking at
the sequences, and the original author attests to NIH that privacy and ethical issues have been addressed. And we'd know that our data was structured in a way that was consistent with NIH.
On the minus side, it would align us more strongly with NIH. With publications, I thought it was important to retrieve some details so that the submitter could verify that the correct PMID had been
entered: it's so easy to mistype a long number. The same is also true of the accession numbers such as the Biosample ID, but I don't check those at all at the moment, because we reference NIH
in the submission guidelines as just one repository people can use, and I was worried about getting drawn into checking details against multiple repositories with different data representations,
interfaces and so on. It could be quite a bit of work. A compromise might be to encourage people to submit NIH records where available (given that records are exchanged between the major
repositories), do a good job of integrating against NIH, but allow people to submit details from another repository with less checking and information exchange if the samples were for some reason
not available in NIH.
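
To make the compromise a bit more concrete, here is a rough sketch of the kind of lightweight check I have in mind for a Biosample ID. It is only an illustration, not what the site does today. The function names are made up, the accepted prefixes (SAMN, SAMEA, SAMD) are my assumption about what the major repositories issue, and the URL just uses the standard E-utilities esearch endpoint with its db and term parameters. Nothing here contacts NIH; it only catches obvious typos and builds the lookup URL.

```python
import re
from urllib.parse import urlencode

# Assumption: Biosample accessions start with SAMN, SAMEA or SAMD,
# followed by digits only. Adjust if other prefixes need to be accepted.
BIOSAMPLE_RE = re.compile(r"(SAMN|SAMEA|SAMD)\d+")

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def normalise_accession(raw):
    """Trim whitespace and upper-case the submitted accession (hypothetical helper)."""
    return raw.strip().upper()


def looks_like_biosample(raw):
    """Return True if the text has the assumed shape of a Biosample accession."""
    return BIOSAMPLE_RE.fullmatch(normalise_accession(raw)) is not None


def esearch_url(raw):
    """Build an esearch URL for the accession, rejecting malformed input first."""
    accession = normalise_accession(raw)
    if not looks_like_biosample(accession):
        raise ValueError("not a recognisable Biosample accession: %r" % raw)
    return ESEARCH_URL + "?" + urlencode({"db": "biosample", "term": accession})
```

A format check like this would work the same way for any repository's accession, while the lookup itself would only be attempted for records we expect to find in NIH.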
William