Join us for an upcoming Livestream event: Making Meaning 2024


chenoa pettrup

Jan 31, 2024, 8:06:43 PM
to Collections as Data

Calling all collections as data enthusiasts!

State Library of Queensland’s Making Meaning symposium is the perfect professional development opportunity for you.

Industry leaders will share their successful use of data in day-to-day practice.

Why you should attend

Discover cutting-edge technologies: Explore locally and internationally used technologies that unlock hidden trends and captivating stories within collections.

Inspiring keynotes and lightning talks: Be inspired by local, national, and international keynote speakers and lightning talks, delving into what 'collections as data' truly means.

Networking opportunities: Meet like-minded people, fostering connections for your next project and expanding your professional network.

Speakers include:

  • Mia Ridge of The British Library
  • Robert McLellan of The University of Queensland
  • Andrea Lau and Jack Zhao of Small Multiples
  • Plus a wide range of Lightning talks

Can’t make it to Brisbane? No worries…there’s no need for FOMO, with livestream tickets also available.

Hope to see you there!
-------

Date: Friday 8 Mar

Location: State Library of Queensland, Brisbane, Australia and online

Time: 8:45 am to 4 pm

Cost: AUD $30 (livestream) / AUD $149 (in person)

Registration: https://www.slq.qld.gov.au/making-meaning-2024

AARNet is a supporting sponsor of this event.

Eric Lease Morgan

Feb 1, 2024, 10:42:45 AM
to Collections as Data

When it comes to your collections as data, how do you go about managing the collections' long-term storage?

For the past few years I have been practicing collections as data. I identify and collect sets of narrative content (books, articles, etc.). I then do feature extraction against each item and save the results in both delimited and relational database formats. I then save the whole thing in a file system which can be zipped up and distributed. Thus, I build data sets.
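The pipeline described above -- feature extraction, saving results in both delimited and relational-database formats, then zipping the whole file system -- can be sketched in a few lines of Python. This is a toy illustration of that workflow, not Eric's actual carrel-building code; the file names, layout, and the word-count "feature" are invented for the example.

```python
import csv
import sqlite3
import zipfile
from pathlib import Path

def build_carrel(texts: dict[str, str], root: Path) -> Path:
    """Build a toy 'study carrel': per-item features saved as both
    a delimited file and a relational database, then zipped."""
    root.mkdir(parents=True, exist_ok=True)

    # Feature extraction: a rudimentary token count per item.
    rows = [(item, len(body.split())) for item, body in texts.items()]

    # Save the features as a delimited (TSV) file ...
    with open(root / "features.tsv", "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["item", "words"])
        writer.writerows(rows)

    # ... and as a relational database.
    db = sqlite3.connect(root / "features.db")
    db.execute("CREATE TABLE features (item TEXT, words INTEGER)")
    db.executemany("INSERT INTO features VALUES (?, ?)", rows)
    db.commit()
    db.close()

    # Zip the whole file system for distribution.
    archive = root.with_suffix(".zip")
    with zipfile.ZipFile(archive, "w") as zf:
        for path in root.rglob("*"):
            zf.write(path, path.relative_to(root.parent))
    return archive
```

The result is a single, self-contained zip file -- the distributable data set.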

Just as importantly, the data sets can be computed against -- modeled. Example modeling processes include rudimentary counts & tabulations of ngrams, parts-of-speech, named entities, and computed keywords. Other supported modeling processes include topic modeling, semantic indexing, full-text indexing and concordancing, extraction of sentences matching given grammars, and, most recently, indexing against large language models for the purposes of natural-language Q&A.
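For illustration, the most rudimentary of those processes -- counting and tabulating ngrams -- needs only the standard library (parts-of-speech and named entities would require an NLP toolkit such as NLTK or spaCy; the tokenization here is deliberately naive):

```python
from collections import Counter

def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count word ngrams with naive whitespace tokenization."""
    tokens = text.lower().split()
    # Slide n staggered views over the token list and zip them
    # together, yielding one tuple per ngram.
    return Counter(zip(*(tokens[i:] for i in range(n))))
```

For example, `ngram_counts("the cat sat on the mat the cat")` tabulates the bigram `("the", "cat")` twice.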

These data sets are self-contained things. I call them "study carrels", and they are purposely designed to be operating system- and network-independent; they are purposely designed to stand the test of time. In short, I have been assembling collections, processing them to create data sets, and slowly making them available for computation.

I have a growing number of these data sets -- study carrels. They are zip files. How would you suggest I support long-term storage? I need/want DOIs associated with each carrel. I'd put them in my local institutional repository, but my carrels may not be amenable to the repository's collection development policy.

What do you suggest?

--
Eric Lease Morgan <emo...@nd.edu>
Navari Family Center for Digital Scholarship
University of Notre Dame


Kevin Hawkins

Feb 1, 2024, 7:56:31 PM
to collecti...@googlegroups.com
Eric,

I would use a repository that allows for deposit of datasets (not just
documents) and doesn't have restrictions based on depositor
affiliation.  Zenodo is the first one that comes to mind, but you might
find a disciplinary data repository at https://www.re3data.org/ that you
like more.

Kevin

Gustavo Candela

Feb 2, 2024, 3:24:55 AM
to Kevin Hawkins, collecti...@googlegroups.com
Hi Eric,

Maybe these links about the publication of collections as data are useful:

- https://marketplace.sshopencloud.eu/workflow/I3JvP6
- https://doi.org/10.1108/GKMC-06-2023-0195
- https://glamlabs.io/checklist/

In addition to what you mentioned, and if possible, I would consider including the metadata of the datasets in collaborative editing platforms such as Wikidata and the Social Sciences & Humanities Open Marketplace.

I hope this is useful!

Best wishes,
Gustavo

--
This group aims to foster a welcoming and inclusive experience for everyone, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age, religion, nationality, or political beliefs. Harassment of participants will not be tolerated in any form. Harassment includes any behavior that participants find intimidating, hostile or offensive. Participants asked to stop any harassing behavior are expected to comply immediately. Please contact Thomas Padilla if you have concerns.
---
You received this message because you are subscribed to the Google Groups "Collections as Data" group.
To unsubscribe from this group and stop receiving emails from it, send an email to collectionsasd...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/collectionsasdata/5e157456-03d4-4616-8d19-872b483bd9d1%40ultraslavonic.info.

Eric Lease Morgan

Feb 5, 2024, 9:22:24 AM
to Collections as Data


On Feb 1, 2024, at 7:56 PM, Kevin Hawkins <kevin.s...@ULTRASLAVONIC.INFO> wrote:

>> When it comes to your collections as data, how do you go about managing the collections' long-term storage?
>
> I would use a repository that allows for deposit of datasets (not just documents) and doesn't have restrictions based on depositor affiliation. Zenodo is the first one that comes to mind, but you might find a disciplinary data repository at https://www.re3data.org/ that you like more.
>
> Kevin


On Feb 2, 2024, at 3:24 AM, Gustavo Candela <gustavo....@gmail.com> wrote:

> Maybe these links about the publication of Collections as data are useful:
>
> - https://marketplace.sshopencloud.eu/workflow/I3JvP6
> - https://doi.org/10.1108/GKMC-06-2023-0195
> - https://glamlabs.io/checklist/
>
> In addition to what you mentioned, and if possible, I would consider including the metadata of the datasets in collaborative editing platforms such as Wikidata and the Social Sciences & Humanities Open Marketplace.
>
> --
> Gustavo


Thank you for the prompt replies. Yes, Zenodo was on my list too, as well as OSF (the Open Science Framework). The links pointing to best practices are useful information too, and I'm glad that I have already put many of them into practice. Whew! --Eric Morgan


Eric Lease Morgan

Mar 11, 2024, 8:57:58 AM
to Collections as Data

To what degree is it unethical or unprofessional to deposit data sets -- collections as data -- in multiple repositories?

A long time ago, in a galaxy far, far away, the preservation of books and journals was ensured when multiple libraries included them in their collections. This philosophy of preservation was well articulated with the advent of LOCKSS and its motto, "Lots of copies keep stuff safe." See: https://www.lockss.org/

Nowadays, we relegate the preservation of the scholarly record -- whether that be books, journals, or data sets -- to centralized networked services. Hmmm.

For decades I have been using the Internet to provide access to library collections and services, and one of the things this experience has taught me is that links will break. Thus, if I deposit my data sets in multiple Internet locations, then the probability of losing access to them decreases. Yet, just as publishing the same article in multiple journals is seen as unethical, would publishing data sets in multiple locations be seen in the same light?

Put more simply, is it okay for me to deposit my data sets in my university's institutional repository as well as something like Zenodo?

--
Eric Morgan <emo...@nd.edu>
University of Notre Dame


Gregory Neil Jansen

Mar 12, 2024, 12:19:05 PM
to Eric Lease Morgan, Collections as Data
I think that if both deposits share the same citation (some kind of DOI or similar identifier), then it shouldn't be seen as "taking credit twice". If the DOI can be resolved to more than one of these locations, so much the better. As a consumer of such a dataset, I would want clear versioning and the ability to see which deposit was the most up to date. Maybe think of the extra copies as a sort of preprint, if you want to squeeze this into an existing scholarly publishing framework. Also, I think it is fairly common to deposit publications in up to three places: a journal, an institutional repository, and perhaps a disciplinary repository.
hope this helps,




--
Gregory N. Jansen
Senior Faculty Specialist, School of Information
University of Maryland in College Park

Eric Lease Morgan

Mar 13, 2024, 11:38:55 AM
to Collections as Data

On Mar 12, 2024, at 12:18 PM, Gregory Neil Jansen <jan...@umd.edu> wrote:

> I think that if both deposits share the same citation (some kind of DOI or similar identifier), then it shouldn't be seen as "taking credit twice". If the DOI can be resolved to more than one of these locations, so much the better. As a consumer of such a dataset, I would want to have clear versioning and be able to see which deposit was the most up to date, etc.. Maybe think of the extra copies as a sort of a preprint, if you want to squeeze it into an existing scholarly publishing framework. Also, I think it is fairly common to deposit publications in up to three places, a journal, an institutional repository, and perhaps a disciplinary repository.


The more I think about it, the more I think I will:

1. manually deposit four data sets in my local institutional repository
2. manually deposit the same four data sets in Zenodo
3. semi-automatically deposit twelve additional data sets in my local repository
4. use the API to deposit the same twelve data sets in Zenodo
5. rest

In all cases, I will get a DOI from my local repository and use it when I deposit into Zenodo. After I am done resting, I will re-evaluate and consider depositing more. On my mark. Get set. Go.
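For step 4, a deposit via Zenodo's REST API might look roughly like the sketch below. This is an untested outline based on the public Zenodo API (create a deposition, upload the carrel zip to its file bucket, attach metadata, publish); the `doi` metadata field asks Zenodo to register the DOI already minted by the local repository rather than assign a new one. The token, file name, and DOI shown are placeholders, and error handling is omitted -- check the current Zenodo documentation before relying on any of it.

```python
import json
from pathlib import Path
from urllib.request import Request, urlopen

ZENODO = "https://zenodo.org/api"

def carrel_metadata(title: str, description: str, doi: str) -> dict:
    """Deposition metadata; 'doi' tells Zenodo to reuse the DOI
    already minted by the local repository."""
    return {"metadata": {
        "title": title,
        "description": description,
        "upload_type": "dataset",
        "doi": doi,
    }}

def api(method: str, url: str, token: str, data: bytes,
        content_type: str = "application/json") -> dict:
    """Minimal authenticated call; returns the parsed JSON response."""
    req = Request(url, data=data, method=method,
                  headers={"Authorization": f"Bearer {token}",
                           "Content-Type": content_type})
    with urlopen(req) as resp:
        return json.load(resp)

def deposit(zip_path: str, metadata: dict, token: str) -> str:
    """Create, upload, describe, and publish one carrel; return its URL."""
    dep = api("POST", f"{ZENODO}/deposit/depositions", token, data=b"{}")
    with open(zip_path, "rb") as fh:  # upload the zip to the file bucket
        api("PUT", dep["links"]["bucket"] + "/" + Path(zip_path).name,
            token, data=fh.read(), content_type="application/octet-stream")
    api("PUT", dep["links"]["self"], token,
        data=json.dumps(metadata).encode())
    done = api("POST",
               f"{ZENODO}/deposit/depositions/{dep['id']}/actions/publish",
               token, data=b"")
    return done["links"]["record_html"]
```

Usage would be something like `deposit("carrel.zip", carrel_metadata("My carrel", "A study carrel", "10.7274/example"), token)`, looping over the twelve carrels.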

Thank you for the replies. They were helpful.

--
Eric Morgan <emo...@nd.edu>