--
---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hey, so I know this isn't *exactly* a software-related update, but I did just want to drop a quick line to let people here know about a new project I'm involved in. [...]
We desperately need more open source tools to (1) upload, manage, and flip through scanned images in discrete sets, (2) transcribe information from the images into spreadsheets/databases, and (3) make those data sets searchable and public -- or else more and more of our heritage is going to end up being the literal property of whoever can marshal the resources to cut exclusive deals with state archives and/or stick it all behind their paywall and/or has the technical skill to run the servers. And right now that's all in the hands of, like, five companies worldwide.
[...]
Well done for chasing it up this far. Especially given the costs involved.
Do you have any idea yet how much the image scanning will cost? Are you considering crowdfunding the costs?
You're going to have more than "tens of thousands" of vital records. The earlier collection https://familysearch.org/search/collection/2240282 seems to have 120,000 births a year around 1908.
Internet Archive is one place to archive the whole reels (be sure to add a CC license, to override their default "scholarship and research purposes only" access restriction), but there doesn't seem to be a way to link to specific images (pages).
Would Flickr (or another image hoster) be more suitable for your point (1), with each reel in an album? They already have many large CC0 image archives (and have an upload API). When the images are individually addressable, and assuming they're in some logical order (date?), a first-pass index could list (for example) the starting date of each page, and link to each hosted image. Basically a table of contents.
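To make the "table of contents" idea concrete, here is a rough sketch, assuming each page image has its own URL and the pages are in date order. The dates and example.org URLs are invented placeholders, not real data from any reel.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical first-pass index: (first date appearing on the page, image URL).
# Must be sorted by date. All values here are made up for illustration.
TOC = [
    (date(1908, 1, 1), "https://example.org/reel-01/page-001.jpg"),
    (date(1908, 1, 15), "https://example.org/reel-01/page-002.jpg"),
    (date(1908, 2, 2), "https://example.org/reel-01/page-003.jpg"),
]


def page_for(toc, target):
    """Return the URL of the page whose date range should contain `target`."""
    starts = [start for start, _url in toc]
    # Index of the last page whose starting date is on or before the target.
    i = bisect_right(starts, target) - 1
    if i < 0:
        raise LookupError(f"{target} is earlier than the first indexed page")
    return toc[i][1]


print(page_for(TOC, date(1908, 1, 20)))
```

Someone looking for a record dated 20 January 1908 would be sent to the second page, without anyone having transcribed a single name yet.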
Do you know if the indexes are handwritten, typewritten, or typeset? Either of the latter two could be OCR'd - not fully accurately, but as a quick way to maybe get an easy 80% full-text index (like a newspaper archive index).
Ben Brumfield's FromThePage looks really nice as a transcription system (crowdsourced), but it doesn't seem ideal for tabular data. Pybossa's free crowdsource-as-a-service platform Crowdcrafting.org might be more suitable, as they already have a crowd to start it off. These might cover your point (2) (but I agree something simple for tabular data is needed, it's an area I'm looking at as I also need it).
Maybe a commercial company could do transcriptions faster (it shows in the poor results sometimes!), but it's always good to have multiple independent transcriptions (ideally double-keyed, and with an open license).
Point (3) - make the data searchable and public - is, I think, the easiest. Your transcribed index would be small (a few hundred MB), so assuming the images are hosted elsewhere, any small server would do the job, with a simple database browse. Even just a bunch of downloadable CSVs (split by first letter of last name, or by year) would be a start, so that anybody can then make the data available however they like.
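As a rough illustration of the "bunch of downloadable CSVs" idea, here is a minimal sketch that splits one transcribed index into per-letter CSVs. The column names (`surname`, `given`, `year`) and the sample rows are assumptions made up for the example; the real index may be laid out quite differently.

```python
import csv
import io
from collections import defaultdict


def split_by_initial(csv_text, surname_field="surname"):
    """Split CSV text into {initial letter: CSV text}, keeping the header in each part."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    groups = defaultdict(list)
    for row in reader:
        surname = (row.get(surname_field) or "").strip()
        # Rows with a blank or non-alphabetic surname go into a catch-all "_" group.
        key = surname[0].upper() if surname and surname[0].isalpha() else "_"
        groups[key].append(row)

    parts = {}
    for key, rows in sorted(groups.items()):
        buf = io.StringIO()
        writer = csv.DictWriter(
            buf, fieldnames=fieldnames, lineterminator="\n", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
        parts[key] = buf.getvalue()
    return parts


# Hypothetical sample data, not real records.
sample = "surname,given,year\nAdams,Ann,1908\nbaker,Bob,1909\nAllen,Al,1910\n"
for letter, text in split_by_initial(sample).items():
    print(letter, len(text.splitlines()) - 1, "rows")
```

Each value could then be written out to its own file (say, one per letter) and put on any static web host. Splitting by year instead would only mean changing how the grouping key is computed.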
> Internet Archive is one place to archive the whole reels (be sure to add a CC license, to override their default "scholarship and research purposes only" access restriction), but there doesn't seem to be a way to link to specific images (pages).
Exactly, they are basically just a free server to use, no matter how big the data is (1 TB? more?) and certainly "better than nothing". But their interface is not great for this purpose. As I lamented, the genealogy community unfortunately lacks a proper central location for hosting genealogically relevant image sets.
Filing FOIL requests is free! Even FOIL appeals! And all the legal help I received from the NY Committee on Open Government was free too.
> You're going to have more than "tens of thousands" of vital records. The earlier collection https://familysearch.org/search/collection/2240282 seems to have 120,000 births a year around 1908.
Um, those are births; the data set I'm going after is an index to marriages. :-)
Check out the State of Washington's wonderful new "SCRIBE" record transcription system:
If only someone made an Apache Solr-driven system that can do that (and won second place in the 2012 RootsTech competition for it): :-)
But they're both one-off customized systems. Until there's a standalone multi-tenant system somewhere, the broader genealogy community still lacks an easy way to spin up new searchable databases for any newly available data sets. And that kinda stinks.
I wrote above that the Internet Archive actually manages this really well, though the documentation is strangely labeled.
See the Book URL docs: https://openlibrary.org/dev/docs/bookurls
Based on that, if we look at a census reel that's been uploaded and made it through the IA "Derive" task-- https://archive.org/details/populationschedu1370unix -- but we only want the first page of the Pittsylvania County entries, which start halfway through the book, we can manually navigate to the first page using the Bookreader and find it at https://archive.org/stream/populationschedu1370unix#page/n161/mode/1up . But we don't want to use the Bookreader -- we want to directly link to the URL for the image. We can do that by converting the URL of the Bookreader, replacing "stream" with "download", replacing "#" with "/", and replacing "/mode/1up" with ".jpg". That yields https://archive.org/download/populationschedu1370unix/page/n161.jpg and if we visit it, we are redirected to the actual (non-persistent) URL for the image.
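The URL rewrite described above could be scripted roughly like this. It is only a sketch of the three string substitutions from this message; the function name is made up, and it assumes the Bookreader URL has exactly the `stream/...#page/.../mode/1up` shape shown in the example.

```python
def bookreader_to_image_url(reader_url):
    """Turn an archive.org Bookreader page URL into a direct page-image URL.

    Applies the three substitutions described above:
    "stream" -> "download", "#" -> "/", and "/mode/1up" -> ".jpg".
    """
    suffix = "/mode/1up"
    if (
        "/stream/" not in reader_url
        or "#page/" not in reader_url
        or not reader_url.endswith(suffix)
    ):
        raise ValueError(f"not a 1up Bookreader page URL: {reader_url}")
    url = reader_url.replace("/stream/", "/download/", 1)
    url = url.replace("#", "/", 1)
    return url[: -len(suffix)] + ".jpg"


print(bookreader_to_image_url(
    "https://archive.org/stream/populationschedu1370unix#page/n161/mode/1up"
))
```

Since the resulting URL redirects to a non-persistent image location, a transcription system should probably store the `download/.../page/nNNN.jpg` form rather than wherever the redirect ends up.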
Note that you'll need to ingest the IA metadata for the book into your transcription system.
I'll be attempting to touch base with the folks at Zooniverse and NYPL in the next few weeks, as Free UK Genealogy reviews the new systems to decide how to proceed. Of course, the folks on the other side may be a great deal busier than I am.
Regarding Washington State's Scribe--which is unrelated to the Zooniverse Scribe--I'd also like to know more. I gather that it was the product of the previous State Archivist, so it may be undergoing transition as personnel and priorities change. (NB I have no insight into state politics here in Texas, much less in Washington!) If you do find out anything about the status of the software, I'd love to know more.
Also, although I've said as much in another forum, good on you, Brooke! You're fighting the good fight, and we all stand to benefit.
Do you still have to pay for the actual copying (physical reels)? Or does the legal process get them to waive that?
(I did take a look at Leafseek a couple of years ago, but I didn't have any data that's suitable for it - and I find Java a pain to manage).
> But they're both one-off customized systems. Until there's a standalone multi-tenant system somewhere, the broader genealogy community still lacks an easy way to spin up new searchable databases for any newly available data sets. And that kinda stinks.
Could Leafseek do that role, with a bit of splitting things off into parameters for each tenant?
Hosting data like this is the sort of thing genealogy societies should have been taking the lead on, but most of them (especially in the UK) still sell their data in book or even microfiche form and charge for data access.
Maybe what is needed is a global volunteer "Genealogy Records Society" dedicated to the obtaining, preservation and (legal) free distribution of freely reusable genealogy records and data (without any commercial or religious motivation) ... does anything close to that exist?
You rock.
--
---
You received this message because you are subscribed to a topic in the Google Groups "rootsdev" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rootsdev/rr96Jc7Bgi4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rootsdev+u...@googlegroups.com.
I can't stop thinking about how awesome this is. I'm anxiously waiting for an opportunity to contribute money or time.