Scribe configuration for NYC Marriage Index


Ben Brumfield

Apr 10, 2016, 6:45:27 AM
to rootsdev
This is an open thread for discussing and designing an open-source indexing solution for the NYC Marriage Index based on the NYPL/Zooniverse Scribe software.

(Scribe is not the only possible solution for indexing these records, but efforts relying on other platforms should start their own threads.)

Ben W. Brumfield
http://manuscripttranscription.blogspot.com/

Ben Brumfield

Apr 10, 2016, 8:24:23 AM
to rootsdev
My initial notes:

Target Data:
According to Brooke, researchers will use this index to order copies of records from the NYC Municipal Archives as follows:
In your letter, make sure to list the full name of the bride or groom, the full name of the person's spouse if you know it, the Volume number (if listed in the index), the Page number (if listed in the index), the Document number, and the date of the document (month, day, and year). Remember that the date of these documents is probably a few weeks before the wedding took place. If you already know for sure the exact date of the wedding, you should include that information in your letter, too.
So researchers will be searching on names, dates and borough to find volume number, page number, document number, and document date.

It also seems possible to me that researchers might find the name of a bride, then use the vol/page/doc numbers to find the name of a groom, or vice-versa.  I do not know enough about the sources to say whether this cross-correlation would work, however.

Sources:
I spot-checked 13 rolls, choosing the ones that appeared on the left column of the Internet Archive collection listing.
Brooklyn 1919:
  • Each photograph was made of an entire opening, rather than a single page.  As a result, any given horizontal band may contain two records rather than one.
  • Verso pages contain listings for the groom, recto for the bride. 
  • Bride and groom entries on the same line of same opening are not correlated -- the surname is the bride's maiden name, so the two names appearing on the same horizontal band of the opening do not represent the couple in the wedding.  (See "leaf" 10 for an example of bride entries with no groom entries on the same line.)

Codicology:
The books appear to be organized as follows:
  • A series records documents within a single year.  The microfilm reels on the Internet Archive contain one series per reel.
  • Records from that year are divided by surname range (from "A" to "XYZ") into volumes (e.g. 1919 B)
  • Each volume contains several quires, each quire recording between one and four months of entries. (e.g. 1919 B Jan-Mar precedes B Apr-Jun, B Jul-Sep, B Oct-Dec, and then B Dec.)
  • Each quire is divided into pre-printed surname sub-ranges -- "Aa", "Ab", etc. The number of entries allotted to each sub-range varies according to the printer's understanding of the distribution of English surname spellings: twoscore entries for "Cz" follow three entries for "Cy", ten entries for "Cv", forty for "Cu", and fifty for "Cr".  (Note also the clerk manually filling in a "Cw" entry here.)  The chronological span of the quires appears to be irregular -- my suspicion is that when the longest sub-range ran out of room in a quire, the clerk would grab a new blank notebook and start a new range.
  • Each entry contains the following fields:
    • Name of (Bride|Groom)
      • Surname
      • Given Name
    • Vol
    • Page
    • Number
    • Date
      • Month
      • Day
      • Year
Ditto marks are used extensively throughout, so if we were to use CV to slice up entries into individual Zooniverse subjects (not a bad idea), we'd either need to preserve enough context for transcribers to convert the ditto marks, or we'd need to post-process the records to convert ditto marks into searchable values. 
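As a sketch of that post-processing option (assuming transcribed entries arrive as field dicts and that ditto marks are normalized to a literal `"` -- both illustrative assumptions, not anything Scribe produces out of the box):

```python
# Sketch of ditto-mark post-processing: carry forward the previous row's
# value whenever a field was transcribed as a ditto mark.
DITTO = '"'

def resolve_dittos(rows):
    """Replace ditto marks with the value from the row above."""
    resolved = []
    previous = {}
    for row in rows:
        clean = {}
        for field, value in row.items():
            clean[field] = previous.get(field, value) if value == DITTO else value
        resolved.append(clean)
        previous = clean
    return resolved

# Illustrative rows; names are made up.
rows = [
    {"surname": "Cohen", "given": "Abraham", "month": "Jan", "year": "1919"},
    {"surname": DITTO, "given": "Samuel", "month": DITTO, "year": DITTO},
]
print(resolve_dittos(rows)[1]["surname"])  # Cohen
```

The alternative (preserving enough context for transcribers to resolve dittos themselves) avoids this step but makes each subject bigger.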

The microfilming recorded the beginning of each volume with some kind of marker frame (sorry that I don't know the terminology here), as in this "Start B File" slide.  It appears that no attempt was made to photograph covers, spines, front-matter or end-matter.  That's unfortunate, since it loses context, but it will simplify the data model.

Brooklyn 1929: identical
Manhattan 1917: identical (however note this page of G entries inserted in the middle of XYZ Jun-Sep)
Manhattan 1909:
The microfilm begins with this warning:
2 Part Year
Each month(s) time section
May be taken from 2 books-
(1) - Odd number license
(2) - Even number license
Therefore each month(s)
time section may be in
2 parts.

There is an occasional
exception wherein a month(s)
time perion[sic] was only one
book

Unlike the other registers, this microfilm photographed a page at a time, rather than a two-page opening.
As predicted by the warning, two quires record A Jan-Mar, in a series, the quire with odd-numbered documents preceding the one indexing even-numbered documents.
This roll ends with a MISC volume containing random pages.
Manhattan 1924: Identical to Brooklyn 1919
Manhattan 1919: Identical to Brooklyn 1919
Brooklyn 1924: Identical to Brooklyn 1919
Manhattan 1911:
Begins with this document:
Notice
Each volume is made
up of two books,
odd and even numbers.
Therefore each alphabet
is in two parts.
Otherwise identical to Brooklyn 1919.

Manhattan 1914:
Identical to Brooklyn 1919, however several entries have "NO RETURN" stamped on them after the surname.  See XYZ Nov-Dec Z p2 for an example. 
Does anyone know what this means?  Brooke? 
It certainly seems like a datum worth transcribing.

Bronx 1921: Identical to Brooklyn 1919
Bronx 1922: Identical to Brooklyn 1919
Bronx 1927: Identical to Brooklyn 1919
Queens 1910: Identical to Brooklyn 1919

Ben




Tom Morris

Apr 10, 2016, 10:20:53 AM
to root...@googlegroups.com
Great stuff Ben.  A few quick comments:

- correlating couples by marriage license volume, (page), and number should work, as I understand the system. Date can be used as a cross-check. Of course this isn't really very solid evidence in the absence of an image of the original document with both names present, but it's a good hint.

- The microfilm start of book thing is called a "target" as far as I know.

- Good point about retaining context, but rather than segmenting a page using OpenCV, I think it'd be better to mark the zones/fields and then do the transcription in situ. Regardless, there's a decision to be made as to whether to do verbatim transcription or not. I'd lean towards verbatim with interpretation done as a separate step.

- As for NO RETURN, remembering that this is an index of marriage license APPLICATIONS, I suspect that the "no returns" are those who never actually returned a completed license (i.e. didn't get married).

- I note that there's no page numbering in the volumes I checked. That'll make it more difficult to catch page misses (although the pre-printed letters help).

- There's volume metadata, page scan metadata, raw images in both JPEG2000 & TIFF format, etc available by clicking on the "details" link e.g. https://archive.org/download/NYC_Marriage_Index_Brooklyn_1919

- There's no OCR data (*_abbyy.gz) available because the language was set to "english-handwritten", but the volumes could be run through Tesseract to pick up any of the pre-printed info, if that was deemed useful.

- All items are part of the NY Marriage Index collection (somewhat misleading since it's an index to marriage license *applications*, not marriages)
- Years 1911-1913 show as only having films for Brooklyn in the year facet, but other films (e.g. Queens 1911, Manhattan 1911-1913) are available and simply missing from the facet
- Queens only has films for 1908-1911
- Bronx runs 1914 - 1929
- Manhattan & Brooklyn run 1908-1929

I could write a quick Python script to download and summarize the metadata to generate page counts, etc if that would be useful for planning purposes.
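For what it's worth, the skeleton of such a script could look like this -- it assumes the Internet Archive's JSON metadata endpoint (https://archive.org/metadata/<identifier>) and the `imagecount` field found on scanned-book items, and keeps the summarizing logic separate so it can be exercised without network access:

```python
# Sketch of a metadata-summary script for the NYC_Marriage_Index_* items.
import json
from urllib.request import urlopen

def summarize(metadata):
    """Return (title, image count) from an IA metadata dict.

    'imagecount' appears on scanned-book items (an assumption here);
    fall back to 0 if it's absent.
    """
    meta = metadata.get("metadata", {})
    return meta.get("title", "?"), int(meta.get("imagecount", 0))

def fetch(identifier):
    """Fetch an item's metadata JSON from the Internet Archive."""
    with urlopen("https://archive.org/metadata/%s" % identifier) as resp:
        return json.load(resp)

# Live use (requires network):
# title, pages = summarize(fetch("NYC_Marriage_Index_Brooklyn_1919"))

# Offline demonstration with a stubbed response (the count is made up):
sample = {"metadata": {"title": "NYC Marriage Index Brooklyn 1919",
                       "imagecount": "1046"}}
print(summarize(sample))
```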

Tom

--

---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ben Brumfield

Apr 10, 2016, 3:38:12 PM
to rootsdev
Thanks, Tom.

My responses inline:


On Sunday, April 10, 2016 at 9:20:53 AM UTC-5, Tom Morris wrote:
- correlating couples by marriage license volume, (page), and number should work, as I understand the system. Date can be used as a cross-check. Of course this isn't really very solid evidence in the absence of an image of the original document with both names present, but it's a good hint.

Great!
 
- The microfilm start of book thing is called a "target" as far as I know.

Thanks!

- Good point about retaining context, but rather than segmenting a page using OpenCV, I think it'd be better to mark the zones/fields and then do the transcription in situ. Regardless, there's a decision to be made as to whether to do verbatim transcription or not. I'd lean towards verbatim with interpretation done as a separate step.

You're absolutely right here, in my opinion.

Looking over the Scribe documentation, it seems like we could populate the database with each image as a "Primary Subject", then use OpenCV to fake out "Secondary Subjects".  If we were doing this manually, the primary subjects would present a drawing task asking volunteers to highlight each line, which would then create those secondary subjects.  It looks like the next step there is for someone with a Scribe instance running to do some drawing, then post what a "secondary subject" looks like in the database.
 
- As for NO RETURN, remembering that this is an index of marriage license APPLICATIONS, I suspect that the "no returns" are those who never actually returned a completed license (i.e. didn't get married).

Another excellent point.  That also indicates that presenting the database to researchers requires a great big honking caveat about all those indexes that didn't track returns at all.
 
- I note that there's no page numbering in the volumes I checked. That'll make it more difficult to catch page misses (although the pre-printed letters help).

That's true, but we can use the canonical names of Internet Archive "pages" for reference.  Better would be to use the IIIF Canvas IDs in the IA->IIIF shim.

(see https://iiif.archivelab.org/iiif/NYC_Marriage_Index_Manhattan_1914/manifest.json for an example)
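Assuming the shim serves a Presentation API 2.x manifest (sequences -> canvases, with each canvas carrying an "@id"), pulling out those canvas IDs would be something like:

```python
# Sketch: extract canvas IDs from a IIIF Presentation (v2) manifest.
# Structure per the v2 spec: manifest -> sequences[0] -> canvases[*]['@id'].
def canvas_ids(manifest):
    """Return the list of canvas IDs, in page order."""
    return [c["@id"] for c in manifest["sequences"][0]["canvases"]]

# Stubbed manifest fragment; the IDs follow the archivelab pattern
# but are illustrative, not fetched from the live service.
sample = {"sequences": [{"canvases": [
    {"@id": "https://iiif.archivelab.org/iiif/NYC_Marriage_Index_Manhattan_1914$0/canvas"},
    {"@id": "https://iiif.archivelab.org/iiif/NYC_Marriage_Index_Manhattan_1914$1/canvas"},
]}]}
print(len(canvas_ids(sample)))  # 2
```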
 
- There's volume metadata, page scan metadata, raw images in both JPEG2000 & TIFF format, etc available by clicking on the "details" link e.g. https://archive.org/download/NYC_Marriage_Index_Brooklyn_1919

And there's now a IIIF endpoint which should respond to the IIIF Image API.

That means that we can deep-link directly to regions of the image:
https://iiif.archivelab.org/iiif/NYC_Marriage_Index_Manhattan_1914$1103/900,2030,2100,130/full/0/default.jpg

should retrieve page 1103, at a region offset 900 pixels right and 2030 pixels down from the upper left, grabbing a section 2100 pixels wide by 130 high.

Which it does!

Such a thing might not be great for transcription (as you mentioned above), but it would be pretty handy for display to researchers.
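A tiny helper for building such region URLs (the pattern is copied from the example above; the `$page` separator is specific to the archivelab shim):

```python
# Build a IIIF Image API deep link to a rectangular region of one page.
# x/y are the offset of the region's upper-left corner; w/h its size.
BASE = "https://iiif.archivelab.org/iiif"

def region_url(item, page, x, y, w, h):
    """Deep-link to a region of one page image via the IIIF Image API."""
    return "%s/%s$%d/%d,%d,%d,%d/full/0/default.jpg" % (
        BASE, item, page, x, y, w, h)

url = region_url("NYC_Marriage_Index_Manhattan_1914", 1103, 900, 2030, 2100, 130)
print(url)
```

This reproduces the example URL above, so individual index entries could be shown to researchers without serving cropped copies ourselves.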

- There's no OCR data (*_abbyy.gz) available because the language was set to "english-handwritten", but the volumes could be run through Tesseract to pick up any of the pre-printed info, if that was deemed useful.

Agreed.  I haven't yet seen anything compelling about the printed sections, but I may have missed something.
 
- All items are part of the NY Marriage Index collection (somewhat misleading since it's an index to marriage license *applications*, not marriages)
- Years 1911-1913 show as only having films for Brooklyn in the year facet, but other films (e.g. Queens 1911, Manhattan 1911-1913) are available and simply missing from the facet
- Queens only has films for 1908-1911
- Bronx runs 1914 - 1929
- Manhattan & Brooklyn run 1908-1929

I could write a quick Python script to download and summarize the metadata to generate page counts, etc if that would be useful for planning purposes.

I think that would be of great interest! 

Ben


Justin York

Apr 11, 2016, 10:22:25 AM
to root...@googlegroups.com
I get the idea that we would need a new Scribe installation for each record set. Meaning, we would set up one for the NYC Marriage Index and then need to set up another instance of Scribe for whatever Reclaim the Records releases next. Is that correct?


Ben Brumfield

Apr 11, 2016, 10:33:58 AM
to root...@googlegroups.com
I'm not clear on that yet, Justin, and think we will need to experiment.  It looks like there is support for multiple different workflows, and for grouping subjects into collections.  It appears, however, that workflows are associated with subjects at the project level, and the documentation says that "there SHOULD be only one project".

That may be something we can modify, however, either by adding support for multiple projects per installation, or by creating a new configuration to explicitly associate workflows with subjects.

Ben

Ben Brumfield

Apr 11, 2016, 2:21:27 PM
to rootsdev
I may have about an hour tonight to try to configure a few subjects and workflows.  What's the best way we can collaborate on this?

1) Create a git repository under rootsdev for sample data and configurations only?  ScribeConfigNYCMarriages, containing project directories and test data directories?
2) Fork the entire ScribeAPI repository from zooniverse to rootsdev, then create a new branch for this work, with project directories in the working location and test_data directories somewhere that seems to make sense?
3) Something else?

Ben

Justin York

Apr 11, 2016, 2:43:16 PM
to root...@googlegroups.com
The project isn't very active any more so I don't think we can rely on PRs being merged and therefore would want to manage the code ourselves. So at least #2. We could still have a separate repo for the configuration but I don't know how hard it would be to put the two together.


Tom Morris

Apr 11, 2016, 2:50:40 PM
to root...@googlegroups.com
I just pushed https://github.com/tfmorris/scribeAPI/tree/marriages which is a lightly edited version of Emigrant's Bank and confirmed that the instructions in the ReadMe will create a working site.

When you get to the end of the install instructions, use

rake project:load['marriages']
rails s

to bring the web site up at http://localhost:3000

I'm probably not going to do much more today, so feel free to fork that and use it as a starting point.

Tom

Ben Brumfield

Apr 11, 2016, 2:53:24 PM
to root...@googlegroups.com
I like that idea.

Brooke and I met with Ben Vershbow a couple of months ago, and the impression that I got was that NYPL Labs is still very invested in Scribe, but now that Labs has been promoted from sort of a skunk-works operation to being involved in all of NYPL's digitization, it's going to be a while before they can get back to this.

I sent Ben an email a couple of hours ago, asking him to point some of his technical folks at this thread.

That said, I agree that we should be prepared to maintain this ourselves, and that a fork is the right solution.

Reading a bit more, it looks like the subject configuration files under /projects would live pretty comfortably in our fork, though we probably want them in a separate branch, and to do any generally-useful stuff in a different branch without config information so that we can actually issue pull requests.

Ben


Ben Brumfield

Apr 11, 2016, 2:57:20 PM
to root...@googlegroups.com
Wow, Tom.  This is starting to feel like a barn raising!

I'm planning to hack on this from 17:40-18:50 CDT this evening.  Happy to collaborate via gchat if anyone else is around then.

Ben

Tom Morris

Apr 11, 2016, 2:57:26 PM
to root...@googlegroups.com
On Mon, Apr 11, 2016 at 2:53 PM, Ben Brumfield <benw...@gmail.com> wrote:

Reading a bit more, it looks like the subject configuration files under /projects would live pretty comfortably in our fork, though we probably want them in a separate branch, and to do any generally-useful stuff in a different branch without config information so that we can actually issue pull requests.

That's what I did. The new stuff is on the "marriages" branch. If you look at the repo, it's got separate branches for dev & production of each of the sites independently, so that individual deployments can be managed without stepping on each other's toes. I suggest we follow the same approach (assuming we use Scribe).

Tom 

Tom Morris

Apr 11, 2016, 6:06:28 PM
to root...@googlegroups.com
I ran the volume stats and committed the program and CSV output to the repo.  
There are 68 volumes for the period 1908-1929, totalling 72,049 images (mostly two pages per image).
Volumes range from 210 pages (Queens 1908) to 1972 pages (Manhattan 1919) and average 6600 pixels wide by 4500 pixels high. 
Two volumes (Manhattan 1908 & 1909) have split, half-width (3700-pixel) pages with correspondingly higher image counts (2560 & 3099)

I also sketched out some initial workflows for the front end and pushed those to the repo. I think the next steps are probably to get some images loaded that we can play with and make any necessary database updates to match the new workflow. I think for the time being the subject/image loading can be just generating some IIIF image URLs [1] by hand and putting them in a CSV file. It looks like the existing images in the Emigrant Bank app are about 1400 wide x 1000 high with thumbnails of 150x110, but I'm not sure what the best target for these registers would be. We may want to consider an image processing pipeline that crops, deskews, segments, etc. the images if it can be done reliably.

It turns out that there are page numbers after all. They're just very small and at the very bottom of the left page (only) so they're hard to see. We should decide if we want to transcribe them (I think yes) and add them to the workflow if so.

Tom





Ben Brumfield

Apr 11, 2016, 6:52:24 PM
to root...@googlegroups.com
I've just forked Tom's fork into the rootsdev organization and added Tom and Justin as collaborators.  If anyone else wants on, let me know and I'll add you too.

Ben

Tom Morris

Apr 11, 2016, 6:59:24 PM
to root...@googlegroups.com
Thanks Ben. I actually went ahead and added a (single) subject from the marriage indexes, so you should have a fully functioning system when you build it.

One thing that's immediately obvious is that a simple-minded implementation like the one sketched out now is going to be unwieldy from a usability point of view, so we probably want to do some investigation into Scribe's capabilities and tools for this type of work, as well as brainstorm about what the workflow should look like from a high-level point of view.

One of the things rattling around in my mind is that perhaps we should start with some high-level page level meta tasks like collecting page numbers, leading initials, etc. Outside the box thinking such as marking and transcribing all dittoed values as a single block might make things more efficient too.

Tom

Ben Brumfield

Apr 11, 2016, 9:56:07 PM
to root...@googlegroups.com
I believe that I'm very close to done with a rake task which will create a subjects_X.csv file by querying the IIIF service for a given microfilm at Archive.org.  I'm running into a challenge which Tom might be able to help me out with, however, which is that I'm not sure I understand what the subject.csv columns mean.

The column headers in the source code are as follows:
order,file_path,thumbnail,capture_uuid,page_uri,book_uri,source_rotated,width,height,source_x,source_y,source_w,source_h

While the documentation at https://github.com/zooniverse/scribeAPI/wiki/Project-Subjects makes reference to only a handful of these:
  • order - Integer - the sequence of the subjects
  • file_path - String - the URL to the full media file
  • thumbnail - String - the URL to the thumbnail image of the media file
  • width - Integer - width in pixels of media file
  • height - Integer - height in pixels of media file

Did I miss something in the docs, or will we need to do code archaeology to figure this out?  Regardless, I'm just using some stub values for the things I don't understand so I can try to get something checked in tonight.
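For what it's worth, here's roughly what "stub values" could look like -- the defaults chosen for the undocumented columns (`source_rotated`, `source_x`, etc.) below are guesses, not anything from the docs:

```python
# Sketch: emit a subjects CSV using the headers found in the source code,
# stubbing the columns the wiki doesn't document.
import csv, io

HEADERS = ["order", "file_path", "thumbnail", "capture_uuid", "page_uri",
           "book_uri", "source_rotated", "width", "height",
           "source_x", "source_y", "source_w", "source_h"]

def subject_row(order, file_path, thumbnail, width, height):
    """Fill the documented columns; stub the rest (defaults are guesses)."""
    row = dict.fromkeys(HEADERS, "")
    row.update(order=order, file_path=file_path, thumbnail=thumbnail,
               width=width, height=height, source_rotated="false",
               source_x=0, source_y=0, source_w=width, source_h=height)
    return row

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=HEADERS)
writer.writeheader()
# URLs are placeholders; dimensions match Tom's ~6600x4500 average.
writer.writerow(subject_row(1, "https://example.org/page1.jpg",
                            "https://example.org/page1_thumb.jpg", 6600, 4500))
print(buf.getvalue().splitlines()[0])
```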

Ben



Tom Morris

Apr 11, 2016, 10:30:22 PM
to root...@googlegroups.com
Cool! I'll check in more detail later, but my working assumption in looking at the column headers was that the extra columns were app-specific values that one could toss into the row and would get carried along through the processing pipeline as opaque values to help with post-processing without having to do additional lookups/joins.

Not sure I'll get to it tonight, but I'll try to check into it further and report back.

Tom

Ben Brumfield

Apr 11, 2016, 11:14:05 PM
to root...@googlegroups.com
I've just pushed a new rake task that takes an Internet Archive ID as a parameter and outputs (to STDOUT) a CSV file that should work for populating subjects.  Unfortunately, I've run out of time halfway through testing it, and figure an hour or so worth of work remains before it's where I want it to be.

Since it doesn't modify any functionality, I went ahead and checked it in and pushed to the rootsdev/ScribeAPI marriages branch, in case anybody needs it before I can return to the work.

I expect to make another pass tomorrow around 20:00CDT and hope to finish this piece up then.

Nice job with the task and introductory configuration work, Tom -- it's looking really promising!

Ben

Justin York

Apr 12, 2016, 6:33:00 PM
to root...@googlegroups.com
For those of us that don't know anything about Ruby or Scribe, is there any way to help with the technical side?

Tom Morris

Apr 12, 2016, 6:56:21 PM
to root...@googlegroups.com
On Tue, Apr 12, 2016 at 6:32 PM, Justin York <justi...@gmail.com> wrote:
For those of us that don't know anything about Ruby or Scribe, is there any way to help with the technical side?

Sure. I'd never looked at the Scribe implementation before the other day and hadn't used Ruby/Rails in years, but had a working system bootstrapped in a couple of hours, most of which was spent hacking on JSON config files. The Scribe README is pretty complete and if you follow its directions using the marriages fork, you could have a working system up in well under an hour.

At this point the two biggest tasks are:

1. Figuring out a reasonable workflow based on the contents of the registers, the desired output, and the tools available from Scribe. For the last piece, besides the Scribe documentation, there are also the Emigrant Bank, Old Weather, and Measuring the Anzacs systems, which can be studied to see what they've done for workflows and how they work in practice. These are included in the repo, so you can not only run them locally, but also hack on them.

2. Figuring out an image processing pipeline to produce images to feed the system with. At a high level, this is probably something like: split double-page images, crop out extraneous stuff, deskew any crooked images (most of the marking tools use horizontal rectangles) and crop again, and, optionally, OCR printed bits like page #, leading letters, and "NO RETURN" stamps.
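For the first of those steps (splitting double-page images), the geometry could be as simple as the sketch below; the gutter width is a made-up placeholder, and the boxes are (x, y, w, h) tuples that could feed IIIF region parameters directly:

```python
# Sketch: split a two-page opening into left/right crop boxes.
# The gutter trims a strip around the binding; 40px is a guess that
# would need tuning against real scans.
def split_opening(width, height, gutter=40):
    """Return (left_box, right_box) as (x, y, w, h) tuples."""
    half = width // 2
    left = (0, 0, half - gutter, height)
    right = (half + gutter, 0, width - half - gutter, height)
    return left, right

# A typical opening per the volume stats: ~6600 x 4500 pixels.
left, right = split_opening(6600, 4500)
print(left, right)  # (0, 0, 3260, 4500) (3340, 0, 3260, 4500)
```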

Until the high level stuff is sketched out it's too early to start, but there will also be a ton of work putting together help pages with screen shots, screen capture videos, etc. There will also be a whole backend process of analyzing the raw data and turning it into something useful.

There's no shortage of technical work to go around. :-)

Tom

Ben Brumfield

Apr 12, 2016, 7:02:59 PM
to root...@googlegroups.com
I agree with Tom, and in particular with his recommendation of #2.  The Scribe folks recommend OpenCV, which I've never used -- I gather it's a Python library?
I'd love to see a way of detecting actual lines of text from an index, which seems like a common enough task that someone in the Computer Vision world has figured it out already. 

I also don't have any options for finding a staging server we can all get at.  I'd be happy to contribute to long-term hosting costs, and happy to contribute the OpenSourceIndexing.org domain name, but don't have the time at present to spin up a development server.

In addition to #1 and #2, we're going to need text and site design.  Tom's done some of this, editing the CSV files to pull in an image of a couple getting married, but most of the text, the help files, the links to external resources, and the other assets still need to be addressed. 

Ben

Brooke Ganz

Apr 12, 2016, 8:45:36 PM
to rootsdev
Quick note on the stats that were generated for the NYC Marriage Index: there are only 44 out of 48 microfilms online so far.  The final four microfilms should contain about twelve items/years for Queens and Staten Island, but they will be a bit different in that some of these items will span multiple years, and were not broken up into smaller parts during the microfilm filming or the later digitization.  These years/items should be going up in the next few weeks.

(Frankly, they probably would have all been posted by now, but this week is spring break, so I'm on vacation with my family at LegoLand, wooo!  There's enough WiFi bandwidth here for reading the web, but not for uploading hundreds of images.  So uploads will resume when I return.)

Also, there are some slight oddities with the existing online items/years that still need to be worked out with the Internet Archive: sorting the items by date seems to miss some of the items, even though they are indeed online; some items were accidentally placed into the "community media" category instead of Reclaim The Records; a few items need to be re-generated; the Internet Archive's front-end web cache seems to take forever to notice any changes to the underlying files; and so on.  Nothing major, but my point is to consider this all to still be slightly buggy until further notice.  And bug reports are always welcome, of course.

The total number of images, when everything's online and working right, should be 79,735, as per the auto-generated text and XML files that accompanied the scanned images on the FamilySearch hard drive.  I assume that number is correct.

Also, a small update on another project: the New York State Department of Health may indeed give us the New York statewide (minus NYC) death index for 1880-1956, but their previous attorney's insinuation -- and my hopes -- that the data might already be in CSV format seem to have been dashed.  We are very likely going to be looking at microfiche (not microfilm) scans, and lots of them.  So yeah, getting an open source indexing system ready for that eventual tsunami of records later this year -- seriously, that data set could be millions of names -- would be awesome...!


- Brooke

Ben Brumfield

Apr 18, 2016, 11:56:00 AM
to rootsdev
Now that the weekend is over and I need to get back to work, I've pushed my rake task to generate Scribe subjects from the Internet Archive.

# ingest NYC_Marriage_Index_Brooklyn_1919 from the Internet Archive.  This will
# 1) create project/marriages/subjects/group_nyc_marriage_index_brooklyn_1919.csv and
# 2) print a line to be appended to project/marriages/subjects/groups.csv
rake project:subject_from_archive[marriages,NYC_Marriage_Index_Brooklyn_1919]



Having done that, running rake project:reload[marriages] loads the subjects into the system.

The one thing I don't understand is why the images aren't displaying once I run
rails s

Tom, would you have time to take a look at the csv file and compare it to the one you hand-coded?  I went ahead and checked in the modified groups.csv file and the new subject file for Brooklyn 1919.

Ben


Ben Brumfield

Apr 18, 2016, 11:56:53 AM
to rootsdev
I should have mentioned that the changes are at https://github.com/rootsdev/scribeAPI/tree/marriages

Ben

Tom Morris

Apr 18, 2016, 12:06:42 PM
to root...@googlegroups.com
On Tue, Apr 12, 2016 at 7:02 PM, Ben Brumfield <benw...@gmail.com> wrote:
I agree with Tom, and in particular with his recommendation of #2.  The Scribe folks recommend OpenCV, which I've never used -- I gather it's a Python library?
I'd love to see a way of detecting actual lines of text from an index, which seems like a common enough task that someone in the Computer Vision world has figured it out already. 

OpenCV is written in C, but has Python bindings. Leptonica is another package, used internally by Tesseract OCR for its image processing. ImageMagick is a command line utility which can do all the operations I mentioned (although it might be harder to do things like look at multiple pages to see where the best average crop boundary is). The classic approach to line detection uses the Hough transform. The Leptonica site has a page on skew angle detection (the hard part of the deskewing operation).

Note that, because of the power of the IIIF service, the "pipeline" may just consist of figuring out the correct crop points, rotation angles, etc. so that they can be plugged into the URL for fetching the page images (presuming the IIIF service isn't too slow).

Tom

Tom Morris

Apr 18, 2016, 12:07:53 PM
to root...@googlegroups.com
On Mon, Apr 18, 2016 at 11:56 AM, Ben Brumfield <benw...@gmail.com> wrote:
Now that the weekend is over and I need to get back to work, I've pushed my rake task to generate Scribe subjects from the Internet Archive.

# ingest NYC_Marriage_Index_Brooklyn_1919 from the Internet Archive.  This will
# 1) create project/marriages/subjects/group_nyc_marriage_index_brooklyn_1919.csv and
# 2) print a line to be appended to project/marriages/subjects/groups.csv
rake project:subject_from_archive[marriages,NYC_Marriage_Index_Brooklyn_1919]



Having done that, running rake project:reload[marriages] loads the subjects into the system.

The one thing I don't understand is why the images aren't displaying once I run
rails s

Tom, would you have time to take a look at the csv file and compare it to the one you hand-coded?  I went ahead and checked in the modified groups.csv file and the new subject file for Brooklyn 1919.

Sure, I'll check it out later today.

Tom 

Justin York

Apr 18, 2016, 1:48:06 PM
to root...@googlegroups.com
I'll work on getting a staging environment set up. The project is configured nicely to support Heroku so it should be pretty straightforward. I'm just trying to get it running in Cloud9 first (my default IDE these days) and that's turning out to be non-trivial.


Tom Morris

Apr 18, 2016, 2:56:46 PM
to root...@googlegroups.com
I've fixed the script and regenerated the CSV file. A missing column caused the shifted data to be read as an image height of 0, which confused the viewer.

I cut the resolution down to improve performance a little, but I didn't split the left and right pages into separate subjects, which I believe is the right way to go. 

For anyone who looks at the current workflows, I want to make it clear that I do not think the current workflows are optimal or even close. They were just the quickest way to get something up and running based on the Emigrant starting point. I'm also not convinced that the IIIF server will have adequate performance for production use. We may want static images stashed on S3 or somewhere.

In addition to the image processing pipeline and the workflow design, a completely non-technical task that someone could tackle is to identify all the different types of information that would be valuable to extract. Mainly this is just perusing a big enough sample of the images to see what variety of stuff they contain. For example, today I came across a new class of notation like "Dup 1468 1919" which I'm guessing is a duplicate marriage license application #1468 in year 1919. Ben mentioned the "No Return" notation in his original survey. There may be other stuff like this lurking. The workflows are fixed for all time when they get deployed, but the more stuff we can design in up front rather than having to add later, the cleaner things will be.

I've created an initial wiki page with Ben's instructions: https://github.com/rootsdev/scribeAPI/wiki
The wiki might be a good place to sketch out what we want the workflows to look like.

It might be a good idea to switch the default branch of our fork to be the `marriages` branch (I don't have privileges to do that). We could also consider enabling the GitHub issue tracker for the repo when we get a little further along and use it for tracking issues specific to the marriages fork/project.

Tom

Justin York

unread,
Apr 18, 2016, 2:59:15 PM4/18/16
to root...@googlegroups.com

On Mon, Apr 18, 2016 at 1:56 PM, Tom Morris <tfmo...@gmail.com> wrote:
It might be a good idea to switch the default branch of our fork to be the `marriages` branch (I don't have privileges to do that). We could also consider enabling the GitHub issue tracker for the repo when we get a little further along and use it for tracking issues specific to the marriages fork/project.

Great idea. I made those changes.

Justin York

unread,
Apr 19, 2016, 10:31:10 AM4/19/16
to root...@googlegroups.com
We got it running in heroku: https://scribe-marriages.herokuapp.com

Ben Brumfield

unread,
Apr 19, 2016, 10:37:24 AM4/19/16
to root...@googlegroups.com
Wow!  Great job!

Is the project ready for people to try out, so we can come up with needs beyond the ones we've already mentioned?

Ben

Justin York

unread,
Apr 19, 2016, 10:38:41 AM4/19/16
to root...@googlegroups.com
Not yet. I've identified some bugs introduced in the Mongoid upgrade. I'm in the process of filing them in Github.

--

Tom Morris

unread,
Apr 19, 2016, 10:46:00 AM4/19/16
to root...@googlegroups.com
On Tue, Apr 19, 2016 at 10:37 AM, Ben Brumfield <benw...@gmail.com> wrote:

Is the project ready for people to try out, so we can come up with needs beyond the ones we've already mentioned?

DEFINITELY NOT.

I probably should have made this a message by itself instead of burying it in with a bunch of the other stuff, but the current workflow is just a crude placeholder to get something up and running. It's nowhere near acceptable even to expose people to as a proof of concept, because it will have the effect of "anchoring" them so that they propose tweaks on what's there rather than thinking about how it should really be done.

I'd rather see people start with the microfilms and a whiteboard or sheet of paper and think through what the workflow should be.

Tom 

todd.d....@gmail.com

unread,
Apr 19, 2016, 11:53:39 AM4/19/16
to Rootsdev
Looking good everyone! Should I hold off on QA for things like UI bugs?

Cheers!

--

---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Tod Robbins
Digital Asset Manager, MLIS

Tom Morris

unread,
Apr 19, 2016, 12:05:54 PM4/19/16
to root...@googlegroups.com
On Tue, Apr 19, 2016 at 11:52 AM, todd.d....@gmail.com <todd.d....@gmail.com> wrote:
Looking good everyone! Should I hold off on QA for things like UI bugs?

Hi Todd. Thanks for your interest! Since we haven't even begun to design the UI yet, yup, it'd be too early to attempt to QA something which doesn't exist.

Of course, if you'd like to offer feedback on what you think the UI should look like, how the workflow should progress, etc, that would be valuable.

I know it's a little difficult to design/comment in the abstract without knowing what's easy and what's hard to do in Scribe, but you can take a look at the Emigrants Bank, Old Weather, etc projects to get an idea. Perhaps we could also put together a small smorgasbord of example workflows that people could play with to get an idea of what's possible (e.g. using the table row marker, rather than the cell marker).

Also, making even a rough start depends on getting our image processing pipeline in place which no one has signed up to tackle yet. I've got some ideas on rough building blocks, but haven't had a chance to experiment with them yet. We need square, vertical, & true images to work well with the Scribe marking tools, which also implies that we need to separate the left and right pages since they often need different amounts of rotation.

Tom


Justin York

unread,
Apr 19, 2016, 1:07:34 PM4/19/16
to root...@googlegroups.com
On Tue, Apr 19, 2016 at 11:05 AM, Tom Morris <tfmo...@gmail.com> wrote:
I know it's a little difficult to design/comment in the abstract without knowing what's easy and what's hard to do in Scribe, but you can take a look at the Emigrants Bank, Old Weather, etc projects to get an idea. Perhaps we could also put together a small smorgasbord of example workflows that people could play with to get an idea of what's possible (e.g. using the table row marker, rather than the cell marker).
 
Also, making even a rough start depends on getting our image processing pipeline in place which no one has signed up to tackle yet. I've got some ideas on rough building blocks, but haven't had a chance to experiment with them yet. We need square, vertical, & true images to work well with the Scribe marking tools, which also implies that we need to separate the left and right pages since they often need different amounts of rotation.

Could you create issues in Github to start the conversation on those tasks?

BTW, Heroku will automatically update and redeploy the app when changes are pushed to the marriages branch. So with the recent updates, you can now transcribe since pages are saved properly after being marked.

todd.d....@gmail.com

unread,
Apr 19, 2016, 1:18:05 PM4/19/16
to Rootsdev

On Tue, Apr 19, 2016 at 11:07 AM, Justin York <justi...@gmail.com> wrote:
Could you create issues in Github to start the conversation on those tasks?

I'd suggest we tag issues with "workflows" and "features" to keep some order.

Matthew LaFlash

unread,
Apr 19, 2016, 3:34:07 PM4/19/16
to root...@googlegroups.com
All -- I've taken a stab at a workflow -- just putting it out here to develop further discussion:

Page -- Is there info to transcribe (Yes/No -- pickOne) (ANZAC)
    If Yes, is the image an index of Brides or Grooms (Bride/Groom -- pickOne) (ANZAC)
    If No, next image.

Mark Records -- Name (Surname, Given Name(s) -- Composite) (EmigrantCity)
   -- Citation (Volume, Page, Number -- Composite) (EmigrantCity)
   -- Date  (Month, Day, Year -- Composite) (EmigrantCity)

Is there anything else to mark? -- (Yes/No -- pickOne) (ANZAC)

To me it seems doable to mark one record and then run through each of the three composite tasks.  I have reviewed a sampling of the images, and I have some concerns about losing the year from the filming context for the grooms, since it seems to be used inconsistently -- for some entries it is written and then dittoed until it changes, for others it is not entered at all.  And then there are cases like Staten Island, 1911-1912, where the ledgers span more than one calendar year, so the date will still need to be captured at the record level.  If a year is not recorded at the record level, perhaps we could just pull it in from a higher level -- the page or collection?
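That fallback could be sketched as a simple forward-fill applied after transcription rather than built into the marking workflow. A minimal sketch (the ditto markers and the helper name here are hypothetical, not part of Scribe):

```python
def fill_dittoed_years(years, default=None):
    """Forward-fill dittoed year values in transcription order.

    `years` is a list of transcribed year strings; an empty string or a
    ditto mark (markers here are assumptions) means "same as the entry
    above".  `default` supplies the page- or collection-level year for
    entries before the first explicit one.
    """
    filled, last = [], default
    for y in years:
        if y and y not in ('"', "do", "ditto"):
            last = y
        filled.append(last)
    return filled
```

So a Staten Island column transcribed as `["1911", "", "", "1912", ""]` would come out as five fully dated entries without asking volunteers to retype the dittoed years.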

I also noticed that the filming or digitization process has been inconsistent -- see Manhattan, 1902, whose images have already been divided into bride and groom pages.  Another concern with separating the pages is that I have seen a few groom pages that have lost the majority of the printed surname prefix, which can easily be inferred from the same pre-printed prefixes on the facing bride page.  Just putting these out there as possible issues I have noted.

It also seems like dividing the transcription by borough might be useful in recruiting volunteers to participate in the transcription project -- letting volunteers choose which borough and perhaps even time period.  This idea seems similar to how Old Weather had separate "voyages".

Sorry for joining the conversation late, but I'm very interested in contributing where I may.

Matt


--

---
You received this message because you are subscribed to a topic in the Google Groups "rootsdev" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rootsdev/Sd1_h_f8o6Y/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rootsdev+u...@googlegroups.com.

Ben Brumfield

unread,
Apr 20, 2016, 8:46:13 AM4/20/16
to rootsdev
Yesterday evening, I brought the project to Cafe Bedouins to see if I could make any progress on image segmentation/entry detection.

Tom's recommended an approach based on Hough line detection, which I did not attempt but which sounds very promising.

After talking the problem over with Chad Bailey, we considered using OCR to identify the entry header letters and using them to anchor the entries.  So I pulled down an image, cropped out the groom page, deskewed it, and tried running Tesseract on the results.

The results were discouraging.  Tesseract did a fine job of identifying the location of the header texts, which might be useful for extrapolating the page layout.  It did a mediocre job on the entry heading letters.  Actually recognizing the text was a total failure.

All of this was done with a totally untrained Tesseract.  Were I to continue this approach, I'd want to find training sets that match the fonts in the pages, or to create them using a methodology similar to the eMOP project's.

So I'd say that OCR isn't a dead end, but it's not a quick solution either.

I'll attach the hocr file Tesseract produced, as well as a couple of images showing bounding boxes for lines and words.
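If all we want from the hOCR output is the bounding-box geometry rather than the (failed) text recognition, the boxes are easy to pull out: hOCR stores them in `title` attributes like `title="bbox 36 92 618 610; x_wconf 87"`. A rough stdlib-only sketch (regex scraping rather than a proper HTML parser, so treat it as a debugging aid):

```python
import re

# hOCR encodes geometry as: <span class="ocrx_word" title="bbox x0 y0 x1 y1; ...">
BBOX_RE = re.compile(r"""title=["'][^"']*?bbox (\d+) (\d+) (\d+) (\d+)""")

def hocr_bboxes(hocr_text):
    """Return every (x0, y0, x1, y1) box found in an hOCR document."""
    return [tuple(int(g) for g in m.groups()) for m in BBOX_RE.finditer(hocr_text)]

def line_y_bands(hocr_text):
    """Collapse the boxes to sorted (top, bottom) bands -- the
    y-coordinates we were hoping to use for entry segmentation."""
    return sorted((y0, y1) for x0, y0, x1, y1 in hocr_bboxes(hocr_text))
```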

Ben




groom_deskewed.hocr.lines.jpg
groom_deskewed.hocr.words.jpg
groom_ocr_hocr.html

Ben Brumfield

unread,
Apr 22, 2016, 5:09:18 PM4/22/16
to rootsdev
Another set of recommendations comes from Ryan Baumann on the digital humanities slack channel, who has (among other things) written a tool to do line-of-text cropping based on Tesseract outputs:

11:41 AM Does anybody have a favorite set of methods for identifying lines of hand-written material for separation?  Even keywords to look for would be handy, since "line detection" apparently refers to something else in Computer Vision.
2:07 PM My favorite method is to use Tesseract's line segmenter (https://github.com/ryanfb/tesslinesplit)…probably not the greatest for handwriting. I think you might get better search results looking for "line segmentation" "line extraction" or "text segmentation". I think I recall seeing some interesting/good results for segmenting handwritten documents with a seam carving approach?
GitHub
tesslinesplit - Standalone Tesseract line segmentation
10:02 AM That's really nice, Ryan.  I gave Tesseract a pass on Tuesday night in the hopes of just being able to use bounding boxes to detect y-coordinates on the images, but the results I got were modest:  https://groups.google.com/d/msg/rootsdev/Sd1_h_f8o6Y/0PF_WzA6AQAJ
1:18 PM @benwbrum: Looking at that, if the form is the same across the corpus, the approach I would try would be to create one clean image of a blank form page so I could use pyramidal template matching to find the rotation/scale. That way I could just manually figure out the x,y+w/h offsets (+ metadata) for the cells in my master blank form once, then transform all the images into that coordinate system to extract the cells for any page.
4:01 PM Thanks, @ryanfb -- they are indeed homogeneous across the 40-60K images we have, so time invested up front can really pay off.  I'm going to have to do a lot more reading on CV to follow you, which will be Sunday evening at best.  Do you mind if I cut-and-paste this conversation into the thread at the rootsdev group so that someone else can run with your suggestions if they get to it before I do?
4:02 PM Sure, no problem!
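Once template matching has recovered the rotation, scale, and translation of a page relative to the master blank form, mapping the hand-measured cell rectangles into page coordinates is just a similarity transform. A sketch of that second half (parameter names are mine; recovering `scale`/`theta_deg`/`tx`/`ty` is the part template matching would supply):

```python
import math

def map_master_rect(x, y, w, h, scale, theta_deg, tx, ty):
    """Map a cell rectangle measured once on the master blank form into
    the coordinate system of a photographed page, given the similarity
    transform (scale, rotation, translation) recovered by matching."""
    c = math.cos(math.radians(theta_deg))
    s = math.sin(math.radians(theta_deg))

    def pt(px, py):
        return (scale * (px * c - py * s) + tx,
                scale * (px * s + py * c) + ty)

    corners = [pt(x, y), pt(x + w, y), pt(x + w, y + h), pt(x, y + h)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    # axis-aligned bounding box of the (possibly rotated) cell
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))
```

Measuring the x,y+w/h offsets once per form layout and transforming every page into that coordinate system is exactly the "pay off up front" step Ryan describes.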

Justin York

unread,
Apr 22, 2016, 5:39:27 PM4/22/16
to root...@googlegroups.com
Ryan's ideas sound promising. Thanks for asking.

What are your thoughts on splitting the images? It seems like that decision ought to be made first since it affects the entire process.


Ben Brumfield

unread,
Apr 22, 2016, 6:06:33 PM4/22/16
to root...@googlegroups.com
I suspect that actually splitting the images will be unnecessary, since each "subject" takes an x,y,w,h rectangle, and can refer to the same original image URL. 

If we want to de-couple the line identification work from the workflow work, we could just set up a quick "draw" workflow for our use to create rectangles around entries as secondary subjects, then export them and reload them as primaries.

Ben

Tom Morris

unread,
Apr 22, 2016, 11:48:13 PM4/22/16
to root...@googlegroups.com
I did some experimentation with this a couple of days ago, but haven't had a chance to write up my notes yet (and don't have them with me), but will try to write them up over the weekend.

I wouldn't expect Tesseract to work well in a generic scenario. The large black borders could screw up the adaptive thresholding and the weird layout is going to confuse the page segmentation logic. It could probably be used to recognize pieces of a pre-segmented image, but I don't think it would help with the segmentation.

I'll write up more detail, but from my experimentation, it looks like the Hough Transform shows promise. Preceding it with a Canny edge detector, as is done with natural images, didn't seem as productive as just using thresholded gradients with a noise filter and a Sobel filter earlier in the pipe.

Some of the investigations done as part of this experimentation turned up some anomalies in the image stream which make me think that some level of image processing has already been done. I generated a list of outliers that have things like three pages in a single image, very short header targets, letter start targets cropped down to just the target, etc. I'll post the list when I write up my notes.

Although the API doesn't require pages to be separated or cropped, due to its ROI capabilities, I still think doing this would be desirable from a performance point of view.

The basic sketch of the pipeline that I've got in mind is:
- separate & deskew pages (perhaps using info from Hough line detection, perhaps something else)
- use Sobel filter (or something else) to enhance X & Y gradients, the Hough line to find vertical & horizontal lines
- filter/prune detected lines with goal of identifying strong double horizontal top lines & strong vertical midline
- use that info to segment page, hopefully being able to identify line boundaries 
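As a toy illustration of the line-finding step, here's the Hough transform restricted to θ ∈ {0°, 90°}, where ρ = x·cosθ + y·sinθ degenerates to ρ = x for vertical lines and ρ = y for horizontal ones. A real pipeline would use an optimized implementation (e.g. OpenCV's `HoughLines`) over all angles after the gradient filtering above; this stdlib-only sketch just shows the voting idea:

```python
def hough_axis_lines(edges, min_votes):
    """Vote each edge pixel into two accumulators: theta=90deg (rho = y,
    horizontal lines) and theta=0deg (rho = x, vertical lines), then
    keep the rhos whose vote count clears `min_votes`.

    `edges` is a binary image given as a list of rows of 0/1 values.
    """
    h, w = len(edges), len(edges[0])
    row_acc = [0] * h  # accumulator for horizontal lines
    col_acc = [0] * w  # accumulator for vertical lines
    for y in range(h):
        for x in range(w):
            if edges[y][x]:
                row_acc[y] += 1
                col_acc[x] += 1
    horizontals = [y for y, v in enumerate(row_acc) if v >= min_votes]
    verticals = [x for x, v in enumerate(col_acc) if v >= min_votes]
    return horizontals, verticals
```

On the real images the accumulator peaks would then be filtered by position -- e.g. keep the strong double horizontal top lines and the strong vertical midline -- before segmenting.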

Desired subject targets are a full horizontal entry with 50% overlap top & bottom to account for miss-segmentation, sloppy handwriting, etc. Page crop goal is full extent of physical page to allow marginal annotations, etc to be picked up. Page scale goal is ~2X target display size to allow transcribers to zoom in, if necessary.
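The 50% overlap target is simple arithmetic once the horizontal rule positions are known. A sketch (function and variable names are mine): given the detected y-coordinates of the rules separating entries, emit one crop per entry padded by half the entry height above and below:

```python
def entry_crops(rule_ys, page_width, overlap=0.5):
    """Turn detected horizontal rule positions into (x, y, w, h) crops,
    one per entry, each padded by `overlap` * entry-height on top and
    bottom to tolerate mis-segmentation and sloppy handwriting."""
    crops = []
    for top, bottom in zip(rule_ys, rule_ys[1:]):
        height = bottom - top
        pad = int(height * overlap)
        # clamp the top edge at 0; a fuller version would clamp the
        # bottom edge at the page height too
        crops.append((0, max(0, top - pad), page_width, height + 2 * pad))
    return crops
```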

I'll try to write up some more concrete stuff when I have my notes available.

Tom

Justin York

unread,
Apr 23, 2016, 8:40:57 AM4/23/16
to root...@googlegroups.com

Tom, I am particularly interested to hear why you think splitting the pages will boost performance. It appears that you're the only one recommending that we split pages. I have no opinion on it yet, so I would like more detail on the benefits.

Ben Brumfield

unread,
Apr 23, 2016, 11:05:42 AM4/23/16
to root...@googlegroups.com
I think that Tom's main concern is load on the Internet Archive servers to handle scaling, rotating, and cropping. 

Is that right?  If so, it should be testable.

Ben

Tom Morris

unread,
Apr 23, 2016, 11:25:58 AM4/23/16
to root...@googlegroups.com
On Sat, Apr 23, 2016 at 8:40 AM, Justin York <justi...@gmail.com> wrote:

Tom I am particularly interested to hear why you think splitting the pages will boost performance. It appears that your the only one recommending that we split pages. I have no opinion on it test so I would like more detail on the benefits.

There are two immediate reasons:

1. The pages aren't bound and thus aren't aligned to each other, so need to be deskewed separately. Some of the algorithms work better with pages deskewed, but also the default Scribe drawing tools can only draw horizontally aligned boxes.

2. Bigger images take longer to download. This is a simple matter of physics. Even downsampled & compressed, a single page image is going to be approximately half the size of a two page image.

On Sat, Apr 23, 2016 at 11:05 AM, Ben Brumfield <benw...@gmail.com> wrote:
I think that Tom's main concern is load on the Internet Archive servers to handle scaling, rotating, and cropping.  

I have a separate performance concern about using the experimental IIIF server to crop, rotate, & downsample the image on-the-fly while the user waits.

Perhaps we have different expectations about performance, but I think, since we're asking people to volunteer their time, we should value that time and make their work efficient. I don't want to see the busy spinner at all, let alone stare at it for seconds on end. Perhaps we should establish some performance goals for us to meet.

One more thing - if we are able to use the IIIF servers, separating the pages, deskewing the desired page, and cropping it can all happen virtually using parameters in the URL. This means that we don't have to store any intermediary images, but they're still separate images from Scribe's point of view. As Ben pointed out earlier, a single image can be reused by multiple subjects.
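For reference, those "virtual" operations are just path segments in the IIIF Image API URL scheme, `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. A sketch of building such a URL against the archivelab endpoint seen earlier in the thread (the identifier below is illustrative, and non-90° rotation is an optional IIIF feature that the server may not support):

```python
def iiif_image_url(base, identifier, region="full", size="full",
                   rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API request URL.

    region:   "full" or "x,y,w,h" in source-image pixels (the crop)
    size:     "full", "w,", ",h", or "!w,h" (the downsample)
    rotation: degrees clockwise; values that aren't multiples of 90
              are an optional feature not every server implements
    """
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# e.g. crop the right-hand page out of an opening and serve it at half width
url = iiif_image_url("https://iiif.archivelab.org/iiif",
                     "NYC_Marriage_Index_Brooklyn_1919$10",
                     region="1600,0,1600,2400", size="800,")
```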

Tom

Tom Morris

unread,
Apr 23, 2016, 11:41:16 AM4/23/16
to root...@googlegroups.com
Matt - Thanks for taking a stab at the workflow. I've used your content to begin a wiki page here: https://github.com/rootsdev/scribeAPI/wiki/Workflow-design

Everyone - Please review, edit, add to the page so that we can use it to represent consensus as it develops. This thread has gotten a little unwieldy with the number of different topics being discussed.

Tom


Tom Morris

unread,
Apr 23, 2016, 12:36:44 PM4/23/16
to root...@googlegroups.com
I'll send more when I get back from the MIT Open House later this afternoon, but here are a couple of the artifacts that I generated the other day. The first is an image overdrawn with lines generated from the Hough line detector. This particular example actually has pages which are pretty parallel, so it doesn't show a lot of non-parallel lines, but the way the algorithm works, all lines run edge to edge, although there's a probabilistic version that generates line segments. This version has relatively few lines because it looks too cluttered with lots drawn, but one strategy would be to tune the parameters to generate lots and lots of lines and then filter them by angle, position, etc. The other approach would be to tune the preprocessing so only the "important" lines (whatever we decide those to be) are detected. If we can reliably detect a couple of vertical and horizontal lines (and know which they are), we'd have enough information to segment the pages accurately.

The second attachment is a list of outliers which are more than 6 standard deviations from the mean with respect to size. This means they are not in the 99.9999998026825% covered by six stddevs (in other words, they're extreme outliers). There's all kinds of interesting stuff in there.
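That outlier screen is only a few lines to reproduce; a sketch using the stdlib `statistics` module (function name is mine, and the threshold matches the 6-sigma figure above):

```python
from statistics import mean, stdev

def extreme_outliers(values, n_sigma=6):
    """Return (index, value) pairs more than n_sigma standard deviations
    from the mean -- e.g. run over image pixel areas to flag three-page
    openings, cropped-down letter targets, and similar anomalies."""
    m, s = mean(values), stdev(values)
    return [(i, v) for i, v in enumerate(values) if abs(v - m) > n_sigma * s]
```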

Inline image 1
outliers-6sd.tsv

Brooke Ganz

unread,
Apr 23, 2016, 7:16:10 PM4/23/16
to rootsdev
Some very overdue replies to Ben:

So researchers will be searching on names, dates and borough to find volume number, page number, document number, and document date.

This should probably be "searching on names and/or dates and/or borough".  A number of researchers are finding that Grandma registered for her license in borough X (often Manhattan) even though the wedding was held in borough Y and/or the bride or groom lived in borough Z.

It also seems possible to me that researchers might find the name of a bride, then use the vol/page/doc numbers to find the name of a groom, or vice-versa.  I do not know enough about the sources to say whether this cross-correlation would work, however.

It is supposed to work like that, in theory, but in practice some researchers have reported seeing their Grandma's name in the index but not their Grandfather's (or vice versa), so they're using their Grandma's information on its own to try to order the records, and they're hoping that Grandpa's transcription was erroneously left out.  But there's no way to know for sure until they get their record orders filled.

In practice, I would assume that there will be some cases where one spouse has a document number, but their spouse was not recorded or was recorded with a different document number by accident.  Not a lot of cases, I hope.

Identical to Brooklyn 1919, however several entries have "NO RETURN" stamped on them after the surname.  See XYZ Nov-Dec Z p2 for an example. 
Does anyone know what this means?  Brooke? 
It certainly seems like a datum worth transcribing.

Yes, definitely!  We *think* it means that the couple applied for their license but either (1) never actually went through with the wedding, or (2) the clergyman who married them never mailed back the completed license to the city as required to create a corresponding Health Department marriage certificate.  Anecdotally, a few Catholic Churches in NYC were really bad about that sort of thing and kept the records in their basements.  Transcribing that information -- a stamp on a city document officially stating that there wasn't any Health Department marriage certificate for that couple -- would be very nice.

In a print interview several years ago with a major New York genealogy society, the head of the NYC Municipal Archives said that these new City Clerk's Office files have 10% more records than the Health Department marriage certificates.

Also anecdotally, quite a lot of people have commented on the Reclaim The Records Facebook page saying that because of this new index, they were finally able to find relatives' records that they were never able to find before, or which were outright missing from the Health Department marriage certificates' "Bride Index" and "Grooms Index".  (Note that those two "official" indices were created years after the fact of the marriage, handwritten on 3"x5" index cards during the WPA days, while these City Clerk's Office indices were created roughly at the time of the marriage license applications.)

As you might have guessed, I am very happy to hear about all the people finding relatives in this newly-available index.  :-)


- Brooke

Justin York

unread,
May 13, 2016, 11:41:01 PM5/13/16
to root...@googlegroups.com
You've convinced me that deskewing is necessary.

I see the advantages to separating pages, though I'd still like to explore other possibilities. FamilySearch and Ancestry use DeepZoom to mitigate the performance problems of loading large images. OpenSeadragon (which I think is the image viewer that FamilySearch uses) supports IIIF. Would that be easier than trying to separate the pages?

How did you generate that image with the detected lines? You say it was the Hough line detector. What lib and params did you use? I'd be willing to play around with it but I don't have a clue about how to start. 

How close are we to having CV model that can accurately separate the pages?

When you talk about page segmentation, are you hoping to detect boxes on the page that contain data? Do you think it would be okay to launch the project without doing page segmentation? I'm okay skipping this step since Scribe can be configured to have the mark step for the users to do. I'm worried about the release of the project being delayed by tuning the CV models.

Tom Morris

unread,
May 14, 2016, 12:25:21 AM5/14/16
to root...@googlegroups.com
I apologize for not following up as I promised. The bad news is that I haven't had any time to put into programming up a custom image processing pipeline. The good news is that I think scantailor could get us 90% of the way there at the expense of a little extra human effort. I did a volume in a couple of hours, including the time to build scantailor and figure out how it works. I suspect that someone without much experience could do a volume in an hour if we wrote up some decent instructions and provided binaries ready to run.

I have a bunch of screen caps from May Day a couple of weeks ago, but I'm not sure whether they'll make it through the mailing list, so I'll post them as attachments in the next message.

Tom

Tom Morris

unread,
May 14, 2016, 12:27:14 AM5/14/16
to root...@googlegroups.com
Default page separation by scantailor without any manual intervention

Inline image 1

Tom Morris

unread,
May 14, 2016, 12:29:53 AM5/14/16
to root...@googlegroups.com
Deskewing

Inline image 1Inline image 2

Tom Morris

unread,
May 14, 2016, 12:30:27 AM5/14/16
to root...@googlegroups.com
Content selection


Inline image 2Inline image 1

Tom Morris

unread,
May 14, 2016, 12:42:39 AM5/14/16
to root...@googlegroups.com
I haven't played with it extensively, but scantailor appears to feature:

- good automatic defaults
- reasonably intuitive manual overrides (e.g. you get to see how cropping will affect all the pages, not just the page you're working on)
- a natural left/right workflow (which requires overriding when, for example, there's an image with just the beginning of alphabet letter target)
- the ability to do the interactive marking work up front and then submitting the multi-hundred page processing as a batch job

Probably the biggest drawback that I noticed is that it's *strongly* oriented towards scanned book images with left/right pages, so for the microfilm images, all the alphabet letter targets and "retake" targets (and their preceding retaken pages) need to be manually deleted from the image stream (meaning we also lose the information contained in those pages).

It's not ideal, but it's workable. I could put up the image set that I created as soon as we have a place to host it, and people could start knocking out the remaining volumes in 1-3 hours each.

We could certainly extract the metadata for the page split region, deskew angle, crop ROI, etc. and feed that into an IIIF pipeline, but hopefully we can all agree that a precomputed image would be quicker to serve than one that has to be processed on the fly. The advantage to using IA IIIF is that you get hosting for free (at the cost of wasting your volunteers' time).

The things a custom image processing pipeline could add would include column and row identification, automatic identification and removal of leader/trailer/retake/alphabet targets, and tighter control over expected parameters (to catch outliers like three page images, one page images, etc).

If I can find a way to extract ROI metadata from the scantailor pipeline, perhaps I'll see if I can get the current volume uploaded to the staging instance for people to look at -- or I could just dump a tar/zip someplace for people to download.

Tom

Justin York

unread,
May 20, 2016, 10:04:42 AM5/20/16
to root...@googlegroups.com
Thanks Tom for looking at scantailor. It looks like a nice solution.

We might need the ROI metadata no matter how the images are hosted. I think there's value in being able to correlate the data we publish with the original images hosted in Internet Archive. If the data we publish can only be understood in the context of the images we modify and host for indexing then we either host the images forever or our data loses value sometime in the future. I don't like either of those options.

Ben Brumfield

unread,
Jun 8, 2016, 8:07:18 AM6/8/16
to rootsdev
I've made a little more progress on the code front, but haven't managed to keep up with this thread.  Tom's progress with scantailor looks very promising, and might tie in well with one of the two things I've been working on:

1) Calculating ROIs based on page layout.  We all know that the drawing task will be a deal-breaker, so we have to calculate ROIs in a pre-processing phase.
Since these documents are so consistent, we should be able to calculate fields geometrically if we're given the corners of a single page.  While I have no insights on identifying corners (though it looks like Tom does), I was able to create a script to identify rectangles and draw them on a local image: https://github.com/rootsdev/scribeAPI/blob/marriages/lib/tasks/image_preprocess.rake

An example result is here:




The next step on this is to remove the image-drawing code (really only useful for debugging purposes) and instead use the rectangles to create secondary subjects within the Scribe database.  (That still leaves the questions of actually getting the page layout correct, dealing with skew at presentation time, and not presenting blank images to users, all of which are major challenges that Tom may be better able to address.)
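The "calculate fields geometrically given the corners of a single page" idea can be sketched as bilinear interpolation: express each field as fractions of the master page, then map those fractions through the four detected corners (helper names are mine, and this ignores lens distortion):

```python
def map_point(corners, u, v):
    """Bilinearly map fractional coords (u, v) in [0,1] x [0,1] through
    the page's detected corners, ordered (tl, tr, br, bl)."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = corners
    top = (x0 + u * (x1 - x0), y0 + u * (y1 - y0))
    bot = (x3 + u * (x2 - x3), y3 + u * (y2 - y3))
    return (top[0] + v * (bot[0] - top[0]), top[1] + v * (bot[1] - top[1]))

def field_roi(corners, fu, fv, fw, fh):
    """Map a field given as fractional (u, v, w, h) on the master form
    to a pixel-space (x, y, w, h) bounding box on this page."""
    x0, y0 = map_point(corners, fu, fv)
    x1, y1 = map_point(corners, fu + fw, fv + fh)
    return (round(x0), round(y0), round(x1 - x0), round(y1 - y0))
```

Because the ledger layouts are so consistent, the fractional field table would be measured once per form layout and reused across all pages of that layout.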



2) Exposing transcripts via IIIF/OpenAnnotation.  While I'm interested in publishing the indexed records via stand-alone, web-based search engines like MyopicVicar or through researcher-friendly bulk downloads as CSVs, the IIIF connection gives us the ability to anchor genealogy records within the Linked Open Data world.  I've added a new controller that exposes transcribed volumes as IIIF manifests, exposes pages as SharedCanvas canvases, and exposes transcribed ROIs as OpenAnnotation annotationLists on the canvases. 

Example AnnotationList:
{
  "@context": "http://iiif.io/api/presentation/2/context.json",
  "@id": "http://localhost:3000/iiif/list/5756379da020dd53e83fe0e3",
  "@type": "sc:AnnotationList",
  "resources": [
    {
      "@id": "http://localhost:3000/iiif/list/5756379da020dd53e83fe0e3/annotation/5756ad2ba020dd5a893a80fe/em_number",
      "@type": "oa:Annotation",
      "motivation": "sc:painting",
      "on": "https://iiif.archivelab.org/iiif/NYC_Marriage_Index_Manhattan_1908$2077/#xywh=1307,361,100,42",
      "resource": {
        "@id": "em_number_5756ad2ba020dd5a893a80fe",
        "@type": "cnt:ContentAsText",
        "format": "text/plain",
        "chars": "7259"
      }
    },
    {
      "@id": "http://localhost:3000/iiif/list/5756379da020dd53e83fe0e3/annotation/5756ad38a020dd5a893a8100/em_number",
      "@type": "oa:Annotation",
      "motivation": "sc:painting",
      "on": "https://iiif.archivelab.org/iiif/NYC_Marriage_Index_Manhattan_1908$2077/#xywh=1307,399,103,39",
      "resource": {
        "@id": "em_number_5756ad38a020dd5a893a8100",
        "@type": "cnt:ContentAsText",
        "format": "text/plain",
        "chars": "7523"
      }
    },
    ...
  ]
}

At this point, if you re-generate a group of subjects from a volume, load it into a clean Scribe database, and do some transcribing you can use an IIIF client like Mirador to view the results:



To try this out, go to http://projectmirador.org/demo and close one of the viewer panes, then select "New object" from the menu and paste http://localhost:3000/iiif/manifest/nyc_marriage_index_manhattan_1908 into the URL field.  Clicking on the item that loads will re-open the viewer pane on the marriage application index volume.  Clicking the two word balloons will display the ROIs and text from the transcripts.
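Other tools can consume these annotation lists too. A minimal sketch of pulling the transcribed text and image regions out of an AnnotationList (the `xywh` media-fragment syntax is standard; the sample data is abridged from the example above, and `extract_regions` is just an illustrative helper, not part of Scribe):

```python
# Sketch: extract (canvas URI, region, text) triples from an IIIF
# AnnotationList like the one shown above.
import re

anno_list = {
    "@type": "sc:AnnotationList",
    "resources": [
        {
            "on": "https://iiif.archivelab.org/iiif/NYC_Marriage_Index_Manhattan_1908$2077/#xywh=1307,361,100,42",
            "resource": {"@type": "cnt:ContentAsText", "chars": "7259"},
        },
    ],
}

def extract_regions(anno_list):
    """Yield (canvas_uri, (x, y, w, h), text) for each annotation."""
    for anno in anno_list["resources"]:
        on = anno["on"]
        # The region is carried as a media fragment on the canvas URI.
        m = re.search(r"#xywh=(\d+),(\d+),(\d+),(\d+)", on)
        xywh = tuple(int(n) for n in m.groups()) if m else None
        canvas = on.split("#")[0]
        yield canvas, xywh, anno["resource"]["chars"]
```

A downstream indexer could walk every list in a manifest this way and emit one CSV row per region.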


I'll be demoing this to the IIIF group today on their community call at 11am Central (notes doc), and am hoping for some advice and maybe some help from that group.  I'm definitely a newbie to linked open data, but if other people have ideas for ways genealogy tools can use records presented as LOD, I'm all ears.


Ben




To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+unsubscribe@googlegroups.com.

todd.d....@gmail.com

unread,
Jun 24, 2016, 1:47:26 PM6/24/16
to Rootsdev
Awesome progress everyone! I think Mirador is the stronger IIIF image viewer and has good support for Open Annotations as well.

–Tod


Tom Morris

unread,
Jun 24, 2016, 3:49:28 PM6/24/16
to root...@googlegroups.com
On Fri, May 20, 2016 at 10:04 AM, Justin York <justi...@gmail.com> wrote:
Thanks, Tom, for looking at ScanTailor. It looks like a nice solution.

We might need the ROI metadata no matter how the images are hosted. I think there's value in being able to correlate the data we publish with the original images hosted at the Internet Archive. If the data we publish can only be understood in the context of the images we modify and host for indexing, then we either host the images forever or our data loses value sometime in the future. I don't like either of those options.

Good point about provenance and reproducibility. The ScanTailor project files include all the information about the various operations, ROIs, etc. Some snippets below:

  <files>
    <file dirId="1" id="2" name="Reclaim_The_Records_-_NYC_Marriage_Index_-_Microfilm_Roll_39_-_Bronx_-_1914_-_00007.jpg"/>

  <images>
    <image subPages="2" fileImage="0" fileId="2" id="3">
      <size width="6464" height="4530"/>
      <dpi vertical="400" horizontal="400"/>
    </image>

  <pages>
    <page imageId="3" subPage="left" selected="selected" id="4"/>
    <page imageId="3" subPage="right" id="5"/>

  <filters>
    <fix-orientation/>
    <page-split defaultLayoutType="auto-detect">
      <image id="3">
        <params mode="auto">
          <pages type="two-pages">
            <outline>
              <point x="0" y="0"/>
              <point x="6464" y="0"/>
              <point x="6464" y="4530"/>
              <point x="0" y="4530"/>
              <point x="0" y="0"/>
            </outline>
            <cutter1>
              <p1 x="3220.658624141709" y="0"/>
              <p2 x="3398.721389100468" y="4532"/>
            </cutter1>
          </pages>
          <dependencies>
            <rotation degrees="0"/>
            <size width="6464" height="4530"/>
            <layoutType>auto-detect</layoutType>
          </dependencies>
        </params>
      </image>

Obviously, if we developed our own pipeline, we'd want to keep all the equivalent information about the various transformations so they could be inverted to map back to the original.
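As a sketch of what that inversion looks like for the page-split step above: the `cutter1` line defines where the opening was divided, so a point on the right sub-page maps back to the original by adding the cutter's x position at that height. (Element names are taken from the ScanTailor snippet; the helper functions are hypothetical, not ScanTailor APIs.)

```python
# Sketch: map a right-sub-page coordinate back to the full opening,
# using the cutter line from a ScanTailor project-file fragment.
import xml.etree.ElementTree as ET

SNIPPET = """
<params mode="auto">
  <pages type="two-pages">
    <cutter1>
      <p1 x="3220.658624141709" y="0"/>
      <p2 x="3398.721389100468" y="4532"/>
    </cutter1>
  </pages>
</params>
"""

def cutter_x(params_xml, y):
    """Linearly interpolate the split line's x position at height y."""
    root = ET.fromstring(params_xml)
    p1 = root.find(".//cutter1/p1")
    p2 = root.find(".//cutter1/p2")
    x1, y1 = float(p1.get("x")), float(p1.get("y"))
    x2, y2 = float(p2.get("x")), float(p2.get("y"))
    t = (y - y1) / (y2 - y1)
    return x1 + t * (x2 - x1)

def right_page_to_original(params_xml, x, y):
    """Translate a right-sub-page point into full-opening coordinates."""
    return (x + cutter_x(params_xml, y), y)
```

A real pipeline would compose this with the inverse of every later step (rotation, deskew, crop), which is why keeping each step's parameters matters.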