Open Genealogy Data - US census project at opengendata.org

Doug Kennard

Jan 22, 2019, 6:27:30 PM
to rootsdev
Do any of you have projects or ideas that could benefit from a public domain transcription of the US Census? Other open datasets in the future?

The Open Genealogy Data project is currently transcribing the 1930 US Census at opengendata.org using a combination of automatic handwriting recognition and volunteers/crowd-sourcing. All the census transcriptions by volunteers will be released into the public domain using CC0 (Creative Commons Public Domain Dedication). Data snapshots will be made available periodically as transcription progresses. Improvements to the interface will come over time, but the data itself is the highest priority.

If you could benefit from open databases or know of others who could, please participate, help spread the word to others who might participate, and consider donating (or sponsoring the data for your location) to help offset the cost.


-Doug

Justin York

Jan 22, 2019, 10:26:17 PM
to root...@googlegroups.com
I really like what you're doing here. I didn't know that the 1930 census was available on archive.org.

It appears you wrote your own software for the transcription. Is that correct? What led you to that decision?

Do you have a process in place for checking data quality?

When the final data set is complete, I'd be willing to chip in to a prize pool for a contest on Kaggle. I've always wanted to see something like that happen.

Doug Kennard

Jan 22, 2019, 10:27:54 PM
to rootsdev
It was pointed out to me privately that I had not been very clear about who is doing this and why.

Who: The Open Genealogy Data project is currently sponsored and hosted by my company (Historic Journals), but is intended to be a community project that benefits everyone. Censcript AI (which licenses image processing and handwriting recognition software to Historic Journals for use in the project) is a separate company, also owned by me.

Why: The purpose of the project is exactly as you might infer from the name: to make genealogy data more open and accessible without the cumbersome restrictions that usually accompany data from the large genealogy organizations that already have it. I believe there are a lot of individuals, organizations, researchers, and companies that can benefit from the census and other databases, and I see no reason why they should be perpetually locked up as proprietary business assets that only a few well-funded organizations can afford to leverage completely. Those organizations obviously don't owe us the transcriptions they paid for, and so far haven't been too interested in completely opening the data (why would they?). So transcribing the data and making it available to everyone, so that nobody else has to go to the effort and expense again in the future, seems like a worthwhile thing to do.

By opening the data, students and faculty who are researching and developing new technologies don't need permission to use it or to publish it. People and companies with good ideas for new tools or services can build them without the obstacle of obtaining the foundational data. And individuals can be more certain about how thorough their searches and research are because they have complete access to entire databases instead of just relying on black-box search algorithms.

The Open Genealogy Data project is far from alone in the push for open data. Many of you and several organizations that have participated right here on this list (FreeUKGenealogy and Reclaim the Records, just to name two that immediately come to mind) as well as libraries, museums, archives, some governments, and others throughout the world also work to make data more open.

-Doug

Doug Kennard

Jan 22, 2019, 11:13:12 PM
to rootsdev
Thanks, Justin.

I did write the transcription software. I don't think anyone will accuse me of making it too pretty, but there were some specific things I felt I could do to streamline the user interaction and make the transcription process a little more efficient since we'd be working with millions of documents of the exact same type and layout.

Some checks for data quality are already in place; others will be added in the future. There will also be an interface for making corrections, but that is not in place yet and probably won't be for a while.

A Kaggle contest in the future sounds like a great idea.

Matt Misbach

Jan 23, 2019, 10:13:50 AM
to rootsdev
Doug,

Well done, what a great project. You are asking for participation... how do we participate?

Matt

Doug Kennard

Jan 23, 2019, 11:32:41 AM
to rootsdev
Thanks, Matt. The most immediate way to participate is through transcribing (and encouraging others to transcribe). There is a short YouTube tutorial video at the bottom of the page at opengendata.org/transcribe, and there are written instructions near the top of the transcription form itself.

Additionally, any influence that people might have with organizations that already have records and datasets is helpful, particularly with encouraging organizations to release public domain data under CC0 (or at most an attribution license). After all, there would be no need for this project if the data were already open! "Share-alike" and similar licenses are not preferred, as they can make it difficult for people to use in combination with data from other sources that do restrict redistribution.
