=== Here is my personal opinion on the question, based on the workflow and expertise of the members of our biomedical signal processing group:
- The expertise of the group would allow us to process either csv or xml.
- We typically store annotations in text-based formats (i.e., csv).
- I personally do not really see an advantage to using xml, and am generally concerned that a more flexible format will introduce unnecessary complexity into something that should in essence stay simple (annotations).

=== I prefer the CSV format, as it is much easier to process using simple utilities, as you said.

=== To me, simpler is better. ASCII is fine. Binary is the most efficient but cannot easily be fixed using a text editor. EEG files can become enormously large. XML just adds more bloat. ASCII is a good compromise.

=== I strongly prefer a csv format because it is easier for humans to read and it can be parsed easily. But much more important to me than the format itself is a sample of source code which processes the input properly. For the EEG data it was extremely helpful that the documentation also included python code for reading edf files. So please, whichever format you choose, try to also give an example of how to read a file. Optimal would be examples in several languages and data analysis tools: Python, R, Knime, Weka, ...

=== I/we definitely prefer the simpler CSV format; XML is way too complicated in general. I work daily with MRI data and its processing, using CSV and XML to accompany the data with metadata. I believe that for the UNIX community CSV is a much friendlier data format. In any case, in our own projects and startup plans we continue to use CSV for the labelling of EEG data and multi-modal stimulus material.

=== Re: human readable. XML is not human readable, and therefore, to explore the data, one first needs to spin up a script to parse the XML file. CSV has the advantage that everyone can open it with Excel (or some related program like the free Google Sheets). A tabular format is definitely more familiar and therefore more accessible to a broader audience / customers. XML vs. CSV is much like the debate between NoSQL (document-based, like XML) and SQL (tabular, like csv) databases. Each has its advantages, and both can be scalable in certain ways. I am not sure whether there exists a database (SQL or NoSQL) that archives all the releases. But if the format of the data does not change very often (or at all), then it makes sense to create a fixed schema (i.e., fixed columns) and release csv files. However, if the format changes from release to release, then an XML / NoSQL data format may make more sense, since we would not need to worry about the schema of the table every time. Alternatives to XML are yaml and json files, which are a little more human readable.

=== I would strongly cast my vote in favour of CSV right now. XML is a nightmare to deal with, and while there are programs available for it, just tuning them is a hard task. I would very much prefer the annotations in CSV. I wouldn’t be averse to multiple “layers” of CSV with different complexity levels of annotation. That’s still easier to deal with than XML.

=== As you say, the main advantage for us is to avoid losing time on development with interns working on the project. Beyond that, there's no problem going for xml.

=== Not sure if you intended a response to the entire group or not. Have you guys considered JSON? It might provide some of the structural benefits of XML, but less verbosely.
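For comparison, here is a minimal sketch of how a single annotation event might look in each format; the field names and values are invented for illustration and may not match the actual schema:

    csv (one header line, one row per event):
        start_time,stop_time,channel,label
        12.0,15.5,FP1-F7,seiz

    xml (the same event, with markup overhead):
        <event start="12.0" stop="15.5" channel="FP1-F7" label="seiz"/>

    json (structured like xml, but lighter):
        {"start_time": 12.0, "stop_time": 15.5, "channel": "FP1-F7", "label": "seiz"}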
=== > being able to leverage a large amount of software
> that supports xml. The disadvantage is the programming expertise
> required is more advanced than you typically find in small machine
> learning research groups. We are concerned this would make the data
> less accessible for our customers.

If this ends up being represented as XML, what about creating a tiny Python library that handles the processing of the data and releasing it on Gitlab / GitHub? People who want to use the data would do the following.

In a terminal:

    $ pip3 install tuh_eeg

In a script:

    import tuh_eeg
    data = tuh_eeg.load_annotations('my_annotations.xml')

And the tuh_eeg library would do the XML handling without people really knowing about it (using the xml.etree module shipped with Python, so without generating extra dependencies except the tuh_eeg library itself).

> So, today's question is: would you prefer we distribute annotation
> information in an xml or csv format?

Given the structure of the data, I would go for CSV because:
- There is no nested structure in the data. It is tabular, and CSV is perfectly adapted to storing this data.
- CSV is much easier to parse from a Linux script. If people have written scripts that work with the current format, it would be much simpler for them to modify the scripts to work with the new format; if the new choice is XML, this will be much harder.

If going for CSV, I would be careful with the column separation character, because the text of the annotations often contains commas and such (see the parsing sketch at the end of this thread).

=== Personally, I prefer yaml or json. They are machine readable and more human readable than xml.

=== Why not both? CSVs don't take up much space (relatively speaking), and XML is useful for more complex annotation, so folks who want to work with one form can use that. Alternatively, provide XML plus a macro that extracts the relevant material into a csv file?

=== In my opinion this comes down to what you want to represent with the data. CSV is inherently a tabular format, for data where each row has to have the same set of fields. If that’s a good fit for your data, then I don’t see any reason to go to the increased complexity and markedly increased file sizes associated with XML. XML is inherently a tree format for data. If there are additional aspects of the annotations that you think would have value, and that you either can’t represent in CSV or can represent much more effectively in the tree structure of XML, then it makes sense to start using XML. Of course, you can fit purely tabular data into a tree by making every branch have the same structure, but if you’re doing that you get all the disadvantages of XML with few of the advantages.

=== For the most part, I would vote for the csv format for tabular data and json (or yaml) for hierarchical data. When run through a formatter/linter, json (and yaml) are very readable but remain simple to parse. It is also notable that they have direct binary equivalents if they need to be served over an API with protocol buffers.

=== As someone who has programmed computers for almost 40 years and has some experience with xml, my opinion is that it just adds too much complexity. It would be a significant barrier for beginners, or for other people not experienced with XML, to working with your amazing datasets. You might be at risk of losing a lot of people that way. So, my vote is for CSV, unless that somehow removes a lot of information that could have been included otherwise.
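On the embedded-comma concern raised earlier in the thread: as long as free-text fields are quoted, standard CSV parsers handle commas inside a field. A minimal sketch in Python; the file name and column names here are invented for illustration and may not match the real schema:

    import csv

    # Read a hypothetical annotation file. csv.DictReader follows
    # RFC 4180 quoting, so a description field like
    # "spike, then slowing" stays intact despite the embedded comma.
    with open('annotations.csv', newline='') as f:
        for row in csv.DictReader(f):
            print(row['start_time'], row['stop_time'], row['label'])

Writing with Python's csv.writer applies the same quoting automatically, so fields containing commas round-trip safely.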