Community Indexing Project

Justin York

Mar 25, 2016, 5:50:10 PM
to root...@googlegroups.com
On Monday, Reclaim the Records will start uploading their first set of digital images to the Internet Archive. Congrats, Brooke -- you're doing amazing work.

Reclaim the Records is working on obtaining and digitizing records while leaving indexing up to the internet at large. Two of the expensive pieces are already taken care of: record acquisition and image hosting. That leaves two tasks: creating the indexes, and publishing them (both as downloadable files and as a searchable database). I can think of many websites that would be willing to add the indexes to their record search engines, so let's focus on the task of indexing.

How could we set up a community indexing project?

I'm aware of open source indexing tools such as FromThePage (created by our very own Ben Brumfield) and PyBossa, but I'm not very familiar with their features, nor with how much work it would take to set up and manage a project.

What other options do we have?

Ben Brumfield

Mar 26, 2016, 1:44:25 PM
to root...@googlegroups.com
Hi, Justin!

Great news for Brooke and everyone interested in open genealogy!

I haven't looked at the specific records in detail, but from what I recall from Brooke's blog post, the documents themselves are highly tabular -- more akin to census records and parish records than to prose-like genealogy documents such as wills or obituaries.  I'm grateful that you mentioned FromThePage, but it's not a great fit for purely tabular data.  In February I developed a feature allowing Markdown-encoded tables to be embedded within transcripts and exported as CSVs, but Markdown encoding may be asking a lot of transcribers.  Nevertheless, I'd be happy to host the project on FromThePage.com if someone else wants to run it -- the Internet Archive integration already works, so the import should be straightforward.
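
If it helps to picture that feature, the export amounts to pulling pipe-delimited rows out of a transcript.  Here's a rough Python sketch of the idea -- illustrative only, not FromThePage's actual (Rails) code:

    import csv

    def markdown_table_to_csv(transcript_text, csv_path):
        """Pull a pipe-delimited Markdown table out of a transcript
        and write it as CSV, skipping the |---|---| separator row."""
        rows = []
        for line in transcript_text.splitlines():
            line = line.strip()
            if not line.startswith("|"):
                continue  # ignore surrounding prose
            cells = [c.strip() for c in line.strip("|").split("|")]
            if all(set(c) <= set("-: ") for c in cells):
                continue  # the header/body separator row
            rows.append(cells)
        with open(csv_path, "w", newline="") as f:
            csv.writer(f).writerows(rows)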

You mentioned PyBossa -- I've never run a project on it, but you might take a look at their hosted version at https://crowdcrafting.org/.  I'm not sure about IA integration or whether you can upload manifests of image URLs, but it does support transcription.

For Free UK Genealogy, we're evaluating the new version of Scribe released by NYPL/Zooniverse -- http://scribeproject.github.io/ .  It looks very promising, though we're only a couple days into evaluation.  I'd love some other eyes on it, and know of a few other folks who are considering it for academic projects like toll registers on the Danube.

There is also the option of a Google Sheets/Google Forms integration.  Yes, it's quick and dirty, but "best is the enemy of done".
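
If anyone goes the spreadsheet route, appending transcribed rows programmatically is only a few lines.  A sketch with the gspread Python library -- the sheet name and the sample row below are made up:

    import gspread

    # assumes a Google service account credential is configured for gspread
    gc = gspread.service_account()
    sheet = gc.open("NYC Marriage Index Transcription").sheet1  # hypothetical sheet

    # one transcribed index entry per row
    sheet.append_row(["Abate", "Giuseppe", "12345", "Jun", "14", "1913"])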

Ben



Tom Morris

Mar 27, 2016, 11:57:06 AM
to rootsdev
These images of the Index to New York City Marriage Applications, Affidavits, and Licenses, 1908-1929, are closely related to the existing NYC Marriage Index to 1929 that was transcribed by ItalianGen and hosted by the German Genealogy Group.  As such, I'm not sure they'd be at the top of my list to transcribe, particularly when there are other, less duplicative datasets (hopefully) coming online, such as the index to all New York City marriage records, 1930-2015.

Having said that, the software question is largely independent of the data set, and if someone did want to tackle this one, there are example pages of the new batch of images on Facebook. According to the summary, there are about 450,000 records. Index entries are handwritten (except for the first two letters of the surname, which are pre-printed on the form). The left page of each two-page spread is the groom index and the right page is the bride index (the pages are independent of each other). Columns are surname, given name, (license?) number, month, day, and year. Books are broken down by year, so the year column is just a long line of dittos. Year, month(s), initial surname letter, and borough are available for potential OCRing in large block letters on the microfilm target. Index entries are in chronological order.
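
To make the target data concrete, here's a Python sketch of what one transcribed entry might look like, plus the ditto-expansion you'd do in post-processing.  The field names are mine, nothing official:

    from dataclasses import dataclass

    @dataclass
    class IndexEntry:
        surname: str
        given_name: str
        number: str      # the "(license?) number" column
        month: str
        day: str
        year: str        # mostly dittos within a book

    def expand_dittos(entries):
        """Replace ditto marks in the year column with the last explicit value."""
        last_year = None
        for e in entries:
            if e.year in ('"', "do.", "ditto", ""):
                e.year = last_year
            else:
                last_year = e.year
        return entries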

As far as software goes, PyBossa is much less sophisticated than Scribe and operates at a much lower level: if you're not cloning one of their existing apps, you are basically writing your own Python/JavaScript app largely from scratch. You can see an example of Scribe used for a project like this at the New York Public Library's Emigrant Bank project. The Scribe model of marking image regions, transcribing the regions, and then verifying the transcriptions may be overkill for a columnar index transcription, but it's pretty powerful for more variable material. I don't know how hard it'd be to modify it to make use of automatic page segmentation (or to use some initial region marking as input to a machine learning process that learns how to do segmentation).
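
On the segmentation idea: since the entries sit in regular rows, even a simple horizontal projection profile might get most of the way there.  A rough OpenCV sketch, untested against the actual scans (the ink-density cutoff would need tuning):

    import cv2

    img = cv2.imread("page.jpg", cv2.IMREAD_GRAYSCALE)
    # binarize: ink -> 1, background -> 0
    _, ink = cv2.threshold(img, 0, 1, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # horizontal projection: count of ink pixels in each pixel row
    profile = ink.sum(axis=1)
    is_text = profile > 0.01 * ink.shape[1]  # tune this cutoff per collection

    # group consecutive "text" pixel rows into bands, one band per index entry
    bands, start = [], None
    for y, flag in enumerate(is_text):
        if flag and start is None:
            start = y
        elif not flag and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(is_text)))

    # crop each band to its own image for use as a unit of work
    for i, (y0, y1) in enumerate(bands):
        cv2.imwrite("row_%04d.png" % i, img[y0:y1, :])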

Tom

Brooke Ganz

Mar 28, 2016, 4:03:29 PM
to rootsdev
Thanks, everyone!  😊  This is pretty exciting.

If you'd like to see some of the actual scans, which I randomly picked from various years and each of the five boroughs, they are attached to our latest Facebook post here:
(That page and its scans should be visible even if you don't use Facebook at all.)

The process of uploading the files was supposed to start today, but it has been pushed back to later this week: I'm still sick with a nasty case of gastroenteritis and spent Saturday night in the hospital hooked up to an IV, which is not quite how I would have liked to celebrate the arrival of the scans.  They are beautifully clear, and I've been told they are far easier to read than the old microfilm copies onsite at the NYC Municipal Archives.  "My" microfilm copies, won in the legal case, were made directly from the Archives' vault masters, and the digital scans were then very generously done by FamilySearch on their professional-grade equipment, which likely accounts for the clarity.

Tom is correct that this data, which is an index, is somewhat duplicative of the data in ItalianGen's index, even though the two underlying document sets are different (City Clerk's licenses/affidavits/applications versus Health Department certificates), because the time periods do overlap.  But according to previous statements by the head of the NYC Municipal Archives, this new City Clerk's data set has 10% more entries.  Some of that variance may be due to people who applied for a license/affidavit/application and then never married -- because one or both partners were found to be ineligible (bigamy? STDs? underage?), or died before the wedding, or got cold feet, or some other cause.  Or perhaps the City Clerk's index was simply a better index all around: it was apparently compiled contemporaneously, while the ItalianGen index to the Health Department certificates (really two separate indices, the Brides Index and the Grooms Index) was created years later as a WPA project, and maybe some errors and omissions crept in.

Anyway, I am totally agnostic about how the images get turned into a transcribed index, but I will cheer you all on from the sidelines.  Once the images are on the Internet Archive, you can download them in bulk, or even as a torrent if you want, so that should help disseminate them easily.  I do plan on giving a copy of the data on a hard drive to ItalianGen/GermanGen, but I know they still do things the tedious way: they burn individual CDs with a small number of files and mail them to people, who then work one image at a time, transcribing each image into an Excel row.  (Ugh!)  If a better way is set up to manage the project, perhaps their indexers might want to join that way instead.  Heck, maybe you can set up an arms race amongst indexers.  😀
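
If you want to script the bulk download, the internetarchive Python package should handle it.  Something like the following -- though the item identifier here is made up, since the upload hasn't happened yet:

    from internetarchive import download

    # grab all the JPEG scans from one (hypothetical) item identifier
    download("nyc-marriage-index-1908-1929",  # made-up identifier
             destdir="scans",
             glob_pattern="*.jpg",
             verbose=True)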

FYI, the next data set to come out *might* be the New York State (excluding NYC) death index for 1880-1956, and to my surprise it might already be in CSV format and not need an indexing project at all!  How cool would that be?  Right now I'm in a holding pattern with it because the New York State Department of Health has pushed back their official reply date on that FOIL request yet again, now aiming for April 29th.  You can follow the real-time updates on the MuckRock page for that request here: https://www.muckrock.com/foi/new-york-16/index-to-all-new-york-state-death-records-1880-1956-23256/

If that dataset does get released, and in a nice ready-to-go CSV format at that, that would be *such* a fun dataset to play around with, not just for research but also with visualization tools or statistical tools.  You could look at migration patterns of family surnames, most common surnames per county, counties with the most nonagenarians at time of death, and so on.  Unfortunately the statewide data supposedly wouldn't include Albany, Buffalo, or Yonkers prior to 1914, but they're next on my list once I get the state data.  And I keep getting mixed answers on whether inmates of state prisons or state mental hospitals who died on the grounds were included in the data or not.  Guess we'll find out.
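
Just to show how little code that kind of exploration would take, here's a pandas sketch.  The column names are pure guesses until we actually see the file:

    import pandas as pd

    # hypothetical columns: surname, given_name, county, age_at_death, year
    df = pd.read_csv("nys_deaths_1880_1956.csv")

    # most common surname in each county
    print(df.groupby("county")["surname"].agg(lambda s: s.value_counts().idxmax()))

    # counties with the most nonagenarians at time of death
    print(df[df["age_at_death"] >= 90].groupby("county").size().nlargest(10))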


- Brooke

Justin York

Mar 29, 2016, 10:23:04 AM
to root...@googlegroups.com
Thanks, Ben and Tom, for the tips. I agree that FromThePage and PyBossa are not ideal fits.

Scribe looks promising -- it has the type of workflow I would want -- but it has some important deficiencies. The whole product was designed around the idea that there is one subject (record) per image, while genealogy sources can have multiple images per record or multiple records per image. At this point they recommend dealing with those intricacies in post-processing of the data.

The Smithsonian runs a Transcription Center which looks like exactly what I had in mind. I sent them a message asking how it was implemented. I haven't heard back yet, but it looks like it's a custom Drupal module. :(

Tom Morris

Mar 29, 2016, 11:04:00 AM
to root...@googlegroups.com
It's been ages since I looked at it, but the 1901 Canadian Census transcription project had a reasonable interface; I don't know whether it is open source. http://automatedgenealogy.com/census/

The Zooniverse "subjects" are basically single images, whatever they contain. In the Old Weather transcription of ships' logs, a subject comprises an entire day's worth of readings (arguably multiple entries), so this can be dealt with in post, as they say. But another significant issue is information density and work chunking: having an entire census page, or even a marriage register page, as a single unit of work that a volunteer has to complete in one go represents a significant commitment of time and increases the probability of fatigue, etc.

I'll keep my eyes open for other potential candidates for open source transcription software.

Tom


Tom Morris

Mar 29, 2016, 11:07:55 AM
to root...@googlegroups.com
Hmm, actually, even though the Scribe "primary subject" is an image, they do support secondary (and tertiary) subjects. The Mark phase can generate multiple subjects, so perhaps this could be used to segment lines/entries in registers. It'd require more investigation, but it sounds like a possible lead.


Tom

Ben Brumfield

Mar 29, 2016, 12:31:31 PM
to root...@googlegroups.com
Scribe does indeed allow multiple records to be indexed from the same page, in the exact way you describe.

The three implementations of ScribeAPI run by Zooniverse or NYPL show a pretty broad spectrum of record types, but I think it's clear that multiple records-per-image can work:
https://whaling.oldweather.org/#/
https://www.measuringtheanzacs.org/#/
http://emigrantcity.nypl.org/#/

The big challenge I see from a transcription-interface perspective is regular, tabular data.  Scribe is very good at encoding structured data that appears irregularly, as with the Emigrant Bank.  Our data is regular enough that encoding multiple fields from the same subject (where a "subject" is a whole line from the index) is desirable, so we'd need to work with the "composite" task type.

I think it's the most promising option, though it will require the most set-up.

--------------

Regarding the Smithsonian Transcription Center, Justin is right that it is a highly customized version of the NARA Drupal module.  It is closed-source, mainly because it's incredibly tightly coupled to the content management system(s) and digital library system(s) used by the various units of the Smithsonian.  I really don't think it's a good option here, but would be happy to point Meghan Ferriter to this thread if we'd like her to weigh in.

--------------

I've thought a bit more about using FromThePage for the work, and I think I might be interested in running a trial on a smallish set of the documents.  Because they're so regular, we could pre-populate the Markdown tables and just ask transcribers to fill them in, reducing the encoding burden.  It's certainly the quickest thing to get up and running, so we could launch a trial a couple of weeks from now, play around with it, and then either move on to other platforms or try to get something serious running on FromThePage, depending on that experience.
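
Concretely, each page's transcript could be seeded with an empty table matching the columns Tom described, so transcribers only fill in cells.  Something like:

    | Surname | Given Name | Number | Month | Day | Year |
    |---------|------------|--------|-------|-----|------|
    |         |            |        |       |     |      |
    |         |            |        |       |     |      |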

--------------

You might be interested in the directory of crowdsourced transcription tools at http://tinyurl.com/TranscriptionToolGDoc.  I created it a few years ago, but it's been updated recently by Raffaele Viglianti to include a feature matrix.

Ben

Matt Misbach

Mar 29, 2016, 4:24:02 PM
to rootsdev
FamilySearch has a cool tool that is in a "pilot" phase. It's a Chrome extension that allows you to index any record on the internet. The indexed data is immediately available and searchable to the public.

The tool could be used as is, OR the source code could possibly be made available so that Reclaim the Records could run its own instance of the tool.

I think this is a fantastic model for indexing the world's images online. You can find more details about it here:

Justin York

Mar 29, 2016, 5:27:29 PM
to root...@googlegroups.com
That's a great tool for the FamilySearch ecosystem (especially once the data gets integrated into the main search data). But I don't like the idea of FamilySearch tools being used for this indexing because I want the data to have an open license.

Ben, does FromThePage have a mechanism for marking where data comes from on the image? I figure it wouldn't, since it was developed for free-form text. In some ways, skipping the mark step could be beneficial, because marking would slow the process down. On the other hand, having that data could be extremely advantageous long-term -- for example, it could turn into a useful training set for handwriting OCR. But perhaps there are already useful data sets available.


Ben Brumfield

Mar 29, 2016, 5:37:37 PM
to root...@googlegroups.com
Justin,

FromThePage only links transcripts to images at the page level -- it does not do within-image linking, so you are absolutely correct.

That said, transcripts from FromThePage (and presumably from other tools) can be linked back to image regions using computer vision tools.  Desmond Schmidt successfully used TILT to connect transcribed words produced by FromThePage and DigiVol back to the cursive regions of the facsimile for the William Brewster field books project.  I was not involved, but I have corresponded with Desmond briefly; he's written about his efforts at http://bltilt.blogspot.com/ and a shorter (but older) summary is at http://britishlibrary.typepad.co.uk/digital-scholarship/2014/06/text-to-image-linking-tool-tilt.html

I suspect that TILT might be overkill, since it strives for the same bounding-box/word granularity that OCR tools produce, and genealogists generally prefer to see an entire entry from a facsimile rather than a single word.  In this material, the regularity of the writing on the images might be amenable to simpler computer vision tools for segmentation.  (I wish I had time to spend on this, since CV segmentation would also be ideal for pre-"marking" the pages and generating record-level "subjects" for Scribe.)
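
And since both the index entries and the transcript rows are in order, the linking itself could be as naive as pairing the nth transcribed entry with the nth segmented row image.  A sketch, assuming a CSV export with one row per entry and row crops named like the ones a projection-profile segmenter would emit:

    import csv

    with open("transcripts.csv") as f:
        entries = list(csv.DictReader(f))

    # pair the nth transcript row with the nth cropped entry image
    links = [dict(image="row_%04d.png" % i, **entry)
             for i, entry in enumerate(entries)]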

Ben


Matt Misbach

Mar 30, 2016, 10:04:57 AM
to rootsdev
I agree, that's why I suggested using the source code to create a separate instance and branding the tool however you want. The data wouldn't touch any FamilySearch servers.

Dovy Paukstys

Mar 30, 2016, 10:23:14 AM
to rootsdev
Justin, we could just roll our own pretty easily.  :P

I have some ideas that would work well.

Justin York

Mar 30, 2016, 10:32:38 AM
to root...@googlegroups.com
Oh, yeah, that's an idea worth considering. If we decided we wanted a lightweight client, instead of a heavyweight like Scribe, we could create a Chrome extension that works while viewing the images at the Internet Archive.
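
The skeleton of such an extension is small.  A sketch of the manifest, with a content script that would run on Internet Archive pages -- all the names are placeholders:

    {
      "manifest_version": 2,
      "name": "IA Index Transcriber (sketch)",
      "version": "0.1",
      "content_scripts": [
        {
          "matches": ["*://archive.org/*"],
          "js": ["transcriber.js"]
        }
      ],
      "permissions": ["storage"]
    }

The content script would then inject a transcription form next to the image viewer and post entries to whatever backend we stand up.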

Tom Morris

Mar 30, 2016, 11:01:23 AM
to root...@googlegroups.com
On Wed, Mar 30, 2016 at 10:04 AM, Matt Misbach <mis...@gmail.com> wrote:
I agree, that's why I suggested using the source code to create a separate instance and branding the tool however you want. The data wouldn't touch any FamilySearch servers.

Do you have a pointer to the source code? I didn't see it anywhere in the announcement.

Tom

Matt Misbach

Mar 31, 2016, 10:18:47 AM
to rootsdev
The source hasn't been released yet, but they have said that if a formal request were made, they would most likely release it. Send me an email offline at ma...@misbach.org and I can put you in touch with the person you need to talk to.

Justin York

Mar 31, 2016, 10:21:16 AM
to root...@googlegroups.com
You can also install the extension and see all the source code in Chrome's local data on your file system.
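
If memory serves, installed extensions live under the Chrome profile directory, keyed by extension ID:

    macOS:   ~/Library/Application Support/Google/Chrome/Default/Extensions/<extension id>/
    Windows: C:\Users\<user>\AppData\Local\Google\Chrome\User Data\Default\Extensions\<extension id>\
    Linux:   ~/.config/google-chrome/Default/Extensions/<extension id>/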

Matt Misbach

Apr 1, 2016, 4:49:59 PM
to rootsdev
Perfect, even easier :-)

Trent Larson

Apr 2, 2016, 2:07:12 PM
to rootsdev


Just another thumbs-up for this whole idea.  I'll keep watching.

I'm very interested in the data ownership and sharing parts of these efforts; I believe the future will involve more distributed storage and workflows.  (Interesting that the Trepo announcement just came out to this group... I wonder if there might be overlap.)  Thanks for the pointers to Scribe and CrowdCrafting.org... I'd love to hear of other tools or projects in those areas.

Trent
familyhistories.info

Matthew LaFlash

Apr 8, 2016, 8:36:12 PM
to rootsdev
I've been mulling this idea over for a while, but I'm afraid I'm arriving a little late to this conversation.  I'd just like to echo Ben Brumfield's suggestion of Zooniverse's Scribe platform for the project.  I really think it would be ideal: it allows for the traditional double-keying and arbitration that we are accustomed to with projects like FamilySearch Indexing or Ancestry's World Archives Project, but with the plus of being maintained completely independently.  I have previously been in touch with Zooniverse about the potential of this tool for genealogy, and they have recognized the potential of this area.  I have also been able to spin up copies of their "Measuring the ANZACs" and "Emigrant City" projects locally to give it a run.

I haven't yet tried to write my own project, but now that images are available, I think that a proof of concept is definitely doable if there is interest in pursuing this as an option.

I'm looking forward to continued discussion on the topic.

Matt

Justin York

Apr 8, 2016, 10:36:09 PM
to root...@googlegroups.com
In an announcement today about the release of images on Internet Archive, Reclaim the Records stated:

> Details about how to join a new volunteer-led transcription project for these images, to turn them into a free online searchable database, will be posted shortly!

That was a surprise. I'm looking forward to hearing about what they come up with.


Brooke Ganz

Apr 9, 2016, 2:11:45 PM
to rootsdev
Hi guys. Guess you saw our announcement of records availability. :-)

The only transcription project announcement I am planning (so far) is to give out the e-mail address of the coordinator for the ItalianGen group, who is planning to do a transcription project eventually, maybe many months from now. I talked to them on the phone and they are interested in having their volunteers work on the records, but as previously mentioned, they do things in the most basic and low-tech way: burning small batches of records to DVDs, mailing them out, transcribing to spreadsheets, manually merging the results, and so on. It sounds brutal, although in practice they have done amazing work on many projects over the years, albeit slowly.

I would love it if rootsdev could come up with a better transcription *system* for these records to harness the ItalianGen volunteer pool's energy. But in the meantime, I have to at least have a human point of contact for people wanting to volunteer.

So please, feel free to move ahead to set up an online transcription system! It would be such a benefit...


- Brooke

Roger Moffat

Apr 9, 2016, 2:45:20 PM
to root...@googlegroups.com

> On Apr 9, 2016, at 2:11 PM, Brooke Ganz <aspar...@gmail.com> wrote:
>
> So please, feel free to move ahead to set up an online transcription system! It would be such a benefit...

Yes please ;-)

I'm involved in projects with the Western Michigan Genealogical Society and the Michigan Genealogical Council, and I would love to have a system a bit akin to FamilySearch Indexing that I could roll out on my server to allow volunteers to index the many records we have access to.

I run the Western Michigan Genealogical Society website, with over 2.7 million records online, using FileMaker Pro (I have used it since 1999). I have tried to build something with it that would let us do at least basic indexing online, but I would be more at ease with some type of community project, where ideas for how it works come from more places than just my poor olde braine.

Roger

Roger Moffat

Colin Spencer

Apr 10, 2016, 1:41:20 AM
to rootsdev
Let me throw something odd out there.

There are two good desktop transcription tools available:

Transcript - http://www.jacobboerema.nl/en/Freeware.htm

Genscriber - http://genscriber.com/genapps/

It may be worth reaching out to the respective authors to see whether they have considered an online version of their tools, or whether the tools could easily be converted to run server-side. Alternatively, it may be worth suggesting either of these tools to the ItalianGen group instead of the way they do it now.


Ben Brumfield

Apr 10, 2016, 6:14:30 AM
to root...@googlegroups.com
That's great news, Brooke.

The looming question in the conversation has been "who will bell the cat?" 
  • Who has the time and resources to build (or customize) the software? 
  • Who will run and maintain the servers? 
  • Who will publicize the project, recruit and manage volunteers?
I suspect I'm not the only one who's comfortable with the first two, but scared off by the third.  Getting the ItalianGen group on board may solve that.

I'm going to start a new thread for those of us who are interested in a solution based on http://scribeproject.github.io/ to start talking about the sources, target data model, and task design of an implementation in that system.  I'd suggest that those interested start similar threads for the FamilySearch Chrome extension or other tools.  Perhaps we can each explore the technologies that interest us and make some progress.

Ben

Justin York

Apr 11, 2016, 10:21:04 AM
to root...@googlegroups.com
Ben, I see in the new thread you started that you and Tom are way ahead of me in your technical understanding of the situation.

I am actually very willing to assist with publicizing and managing the projects.

Matthew LaFlash

Apr 11, 2016, 10:33:35 AM
to rootsdev
I'm definitely interested in joining the conversation -- can someone point me to the thread that has been referenced?  I'm sure it's right in front of me, but I'm new to Google Groups and I'm not finding it.

Matt

Ben Brumfield

Apr 11, 2016, 10:36:12 AM
to rootsdev
Sorry, Matt, I should have posted a link here.

The Scribe implementation thread is at https://groups.google.com/forum/#!topic/rootsdev/Sd1_h_f8o6Y

Ben
