--
---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
- correlating couples by marriage license volume, (page), and number should work, as I understand the system. Date can be used as a cross-check. Of course this isn't really very solid evidence in the absence of an image of the original document with both names present, but it's a good hint.
- The microfilm start of book thing is called a "target" as far as I know.
- Good point about retaining context, but rather than segmenting a page using OpenCV, I think it'd be better to mark the zones/fields and then do the transcription in situ. Regardless, there's a decision to be made as to whether to do verbatim transcription or not. I'd lean towards verbatim, with interpretation done as a separate step.
- As for NO RETURN, remembering that this is an index of marriage license APPLICATIONS, I suspect that the "no returns" are those who never actually returned a completed license (i.e. didn't get married).
- I note that there's no page numbering in the volumes I checked. That'll make it more difficult to catch page misses (although the pre-printed letters help).
- There's volume metadata, page scan metadata, raw images in both JPEG2000 & TIFF format, etc available by clicking on the "details" link e.g. https://archive.org/download/NYC_Marriage_Index_Brooklyn_1919
- There's no OCR data (*_abbyy.gz) available because the language was set to "english-handwritten", but the volumes could be run through Tesseract to pick up any of the pre-printed info, if that was deemed useful.
- All items are part of the NY Marriage Index collection (somewhat misleading since it's an index to marriage license *applications*, not marriages)
- Years 1911-1913 show films only for Brooklyn in the year facet, but other films (e.g. Queens 1911, Manhattan 1911-1913) are available and simply missing from the facet
- Queens only has films for 1908-1911
- Bronx runs 1914-1929
- Manhattan & Brooklyn run 1908-1929
I could write a quick Python script to download and summarize the metadata to generate page counts, etc. if that would be useful for planning purposes.
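A rough sketch of what that metadata summary script might look like. It assumes the Internet Archive metadata endpoint at https://archive.org/metadata/<identifier>, and that the returned JSON has a "files" list with a "format" per file and, on some items, an "imagecount" value under "metadata". The function names are made up for illustration, and the field names should be checked against a real response before relying on them.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Assumed endpoint pattern for Internet Archive item metadata.
METADATA_URL = "https://archive.org/metadata/{identifier}"


def fetch_metadata(identifier):
    """Download the metadata JSON for one Internet Archive identifier."""
    with urlopen(METADATA_URL.format(identifier=identifier)) as response:
        return json.load(response)


def summarize_item(item):
    """Reduce one metadata document to the fields useful for planning."""
    meta = item.get("metadata", {})
    files = item.get("files", [])
    imagecount = meta.get("imagecount")
    return {
        "identifier": meta.get("identifier"),
        # imagecount may be absent on some items, in which case this is None.
        "page_images": int(imagecount) if imagecount is not None else None,
        # Count the files in each format (JPEG2000, TIFF, and so on).
        "files_by_format": dict(
            Counter(f.get("format", "unknown") for f in files)
        ),
    }
```

Something like summarize_item(fetch_metadata("NYC_Marriage_Index_Brooklyn_1919")) would then give one summary row per volume, which could be looped over every identifier in the collection.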
Tom
--
Reading a bit more, it looks like the subject configuration files under /projects would live pretty comfortably in our fork, though we probably want them in a separate branch, and to do any generally-useful stuff in a different branch without config information so that we can actually issue pull requests.
--
order - Integer - the sequence of the subjects
file_path - String - the URL to the full media file
thumbnail - String - the URL to the thumbnail image of the media file
width - Integer - width in pixels of media file
height - Integer - height in pixels of media file
Did I miss something in the docs, or will we need to do code archaeology to figure this out? Regardless, I'm just using some stub values for the things I don't understand so I can try to get something checked in tonight.
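For anyone who wants to experiment outside Ruby, here is a minimal sketch of writing a subjects file with the five fields listed above. The column order, the 1-based "order" values, and the example.org URLs are all assumptions for illustration. This is not the rake task, and it does not claim to match what Scribe actually expects.

```python
import csv
import io

FIELDS = ["order", "file_path", "thumbnail", "width", "height"]


def subject_rows(page_urls, width=0, height=0):
    """Build one row per page image; width/height default to stub values."""
    rows = []
    for order, (full_url, thumb_url) in enumerate(page_urls, start=1):
        rows.append({
            "order": order,
            "file_path": full_url,
            "thumbnail": thumb_url,
            "width": width,
            "height": height,
        })
    return rows


def write_subjects_csv(rows, handle):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical URLs, for illustration only.
pages = [
    ("https://example.org/full/0001.jpg", "https://example.org/thumb/0001.jpg"),
    ("https://example.org/full/0002.jpg", "https://example.org/thumb/0002.jpg"),
]
buffer = io.StringIO()
write_subjects_csv(subject_rows(pages, width=2000, height=3000), buffer)
print(buffer.getvalue())
```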
Ben
For those of us that don't know anything about Ruby or Scribe, is there any way to help with the technical side?
I agree with Tom, and in particular with his recommendation of #2. The Scribe folks recommend OpenCV, which I've never used -- I gather it's a Python library?
I'd love to see a way of detecting actual lines of text from an index, which seems like a common enough task that someone in the Computer Vision world has figured it out already.
Now that the weekend is over and I need to get back to work, I've pushed my rake task to generate Scribe subjects from the Internet Archive.
# ingest NYC_Marriage_Index_Brooklyn_1919 from the Internet Archive. This will
# 1) create project/marriages/subjects/group_nyc_marriage_index_brooklyn_1919.csv and
# 2) print a line to be appended to project/marriages/subjects/groups.csv
rake project:subject_from_archive[marriages,NYC_Marriage_Index_Brooklyn_1919]
Having done that, running rake project:reload[marriages] loads the subjects into the system.
The one thing I don't understand is why the images aren't displaying once I run
rails s
Tom, would you have time to take a look at the csv file and compare it to the one you hand-coded? I went ahead and checked in the modified groups.csv file and the new subject file for Brooklyn 1919.
--
It might be a good idea to switch the default branch of our fork to be the `marriages` branch (I don't have privileges to do that). We could also consider enabling the GitHub issue tracker for the repo when we get a little further along and use it for tracking issues specific to the marriages fork/project.
--
Is the project ready for people to try out, so we can come up with needs beyond the ones we've already mentioned?
--
Looking good everyone! Should I hold off on QA for things like UI bugs?
I know it's a little difficult to design/comment in the abstract without knowing what's easy and what's hard to do in Scribe, but you can take a look at the Emigrants Bank, Old Weather, etc projects to get an idea. Perhaps we could also put together a small smorgasbord of example workflows that people could play with to get an idea of what's possible (e.g. using the table row marker, rather than the cell marker).
Also, making even a rough start depends on getting our image processing pipeline in place, which no one has signed up to tackle yet. I've got some ideas on rough building blocks, but haven't had a chance to experiment with them yet. We need square, vertical, & true images to work well with the Scribe marking tools, which also implies that we need to separate the left and right pages since they often need different amounts of rotation.
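As one possible building block for the page separation step, here is a sketch that only computes the crop boxes for the left and right pages of a two-page scan. The function name, the default gutter position at the middle of the image, and the 2% overlap are assumptions. Finding the real gutter and the rotation each page needs is left open.

```python
def split_spread(width, height, gutter=0.5, overlap=0.02):
    """Return (left, upper, right, lower) crop boxes for the two pages.

    gutter is the horizontal position of the binding as a fraction of the
    image width; overlap widens each box past the gutter so that text near
    the binding is not cut off.
    """
    if not 0.0 < gutter < 1.0:
        raise ValueError("gutter must be between 0 and 1")
    centre = round(width * gutter)
    margin = round(width * overlap)
    left_page = (0, 0, min(width, centre + margin), height)
    right_page = (max(0, centre - margin), 0, width, height)
    return left_page, right_page


left_box, right_box = split_spread(4000, 3000)
print(left_box, right_box)
```

The boxes use the same (left, upper, right, lower) convention as Pillow's Image.crop, so each half could be cropped and then rotated independently.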
Could you create issues in GitHub to start the conversation on those tasks?
--
Tom, I am particularly interested to hear why you think splitting the pages will boost performance. It appears that you're the only one recommending that we split pages. I have no opinion on it yet, so I would like more detail on the benefits.
I think that Tom's main concern is load on the Internet Archive servers to handle scaling, rotating, and cropping.
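To make the server-side work concrete, here is a sketch of the kind of request that asks an image server to crop, scale, and rotate on every page view. It follows the IIIF Image API 2.x URL pattern {base}/{region}/{size}/{rotation}/{quality}.{format}. The base URL is the one that appears in the annotation targets elsewhere in this thread. Whether that server honours arbitrary rotation angles is an assumption, and the 1.5 degree angle is invented.

```python
def iiif_image_url(base, region="full", size="full", rotation=0,
                   quality="default", fmt="jpg"):
    """Build an IIIF Image API 2.x request URL from its path segments."""
    return f"{base}/{region}/{size}/{rotation}/{quality}.{fmt}"


BASE = "https://iiif.archivelab.org/iiif/NYC_Marriage_Index_Manhattan_1908$2077"

# Left half of the scan, scaled to 50%, rotated by 1.5 degrees.
left_page = iiif_image_url(BASE, region="pct:0,0,50,100", size="pct:50",
                           rotation=1.5)
# Right half of the same scan, with no rotation.
right_page = iiif_image_url(BASE, region="pct:50,0,50,100", size="pct:50")
print(left_page)
print(right_page)
```

Every distinct region, size, and rotation is a separate derivative the server has to produce, which is the load being traded off against hosting pre-split images ourselves.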
--
So researchers will be searching on names, dates and borough to find volume number, page number, document number, and document date.
It also seems possible to me that researchers might find the name of a bride, then use the vol/page/doc numbers to find the name of a groom, or vice-versa. I do not know enough about the sources to say whether this cross-correlation would work, however.
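If the cross-correlation does hold, the lookup could be as simple as grouping transcribed entries by their vol/page/doc numbers. The sketch below assumes each entry is a dict with "name", "volume", "page", "number", and "date" keys. Those field names and the sample people are made up for illustration, and the date is used only as a cross-check.

```python
from collections import defaultdict


def candidate_couples(entries):
    """Pair up index entries that share volume, page, and document number."""
    groups = defaultdict(list)
    for entry in entries:
        key = (entry["volume"], entry["page"], entry["number"])
        groups[key].append(entry)
    couples = []
    for key, people in groups.items():
        # Only references with exactly two entries are candidate couples.
        if len(people) == 2:
            couples.append({
                "reference": key,
                "names": sorted(p["name"] for p in people),
                "dates_agree": people[0]["date"] == people[1]["date"],
            })
    return couples


# Invented sample entries, not taken from the index.
entries = [
    {"name": "Doe, Jane", "volume": 3, "page": 12, "number": 7259, "date": "1908-03-04"},
    {"name": "Roe, John", "volume": 3, "page": 12, "number": 7259, "date": "1908-03-04"},
    {"name": "Poe, Ann", "volume": 3, "page": 12, "number": 7523, "date": "1908-03-09"},
]
print(candidate_couples(entries))
```

A match found this way is a hint rather than proof, since nothing in the index itself puts both names on one line.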
Identical to Brooklyn 1919; however, several entries have "NO RETURN" stamped on them after the surname. See XYZ Nov-Dec Z p2 for an example.
Does anyone know what this means? Brooke?
It certainly seems like a datum worth transcribing.
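Whatever the stamp turns out to mean, one way to capture it without losing what is on the page is to store the field exactly as transcribed and derive a flag from it afterwards. A minimal sketch, assuming the stamp is typed into the surname field as the literal words "NO RETURN" (the function and key names are hypothetical):

```python
import re

# Matches the stamp plus any surrounding whitespace, in any letter case.
NO_RETURN = re.compile(r"\s*\bNO\s+RETURN\b\s*", re.IGNORECASE)


def interpret_surname(verbatim):
    """Split a verbatim surname field into the name and a no-return flag."""
    flagged = bool(NO_RETURN.search(verbatim))
    surname = NO_RETURN.sub(" ", verbatim).strip()
    return {"verbatim": verbatim, "surname": surname, "no_return": flagged}


print(interpret_surname("SMITH NO RETURN"))
print(interpret_surname("NORTON"))
```

Keeping the verbatim value alongside the interpreted one means the interpretation can be redone later if the meaning of the stamp turns out to be different.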
{
  "@context": "http://iiif.io/api/presentation/2/context.json",
  "@id": "http://localhost:3000/iiif/list/5756379da020dd53e83fe0e3",
  "@type": "sc:AnnotationList",
  "resources": [
    {
      "@id": "http://localhost:3000/iiif/list/5756379da020dd53e83fe0e3/annotation/5756ad2ba020dd5a893a80fe/em_number",
      "@type": "oa:Annotation",
      "motivation": "sc:painting",
      "on": "https://iiif.archivelab.org/iiif/NYC_Marriage_Index_Manhattan_1908$2077/#xywh=1307,361,100,42",
      "resource": {
        "@id": "em_number_5756ad2ba020dd5a893a80fe",
        "@type": "cnt:ContentAsText",
        "format": "text/plain",
        "chars": "7259"
      }
    },
    {
      "@id": "http://localhost:3000/iiif/list/5756379da020dd53e83fe0e3/annotation/5756ad38a020dd5a893a8100/em_number",
      "@type": "oa:Annotation",
      "motivation": "sc:painting",
      "on": "https://iiif.archivelab.org/iiif/NYC_Marriage_Index_Manhattan_1908$2077/#xywh=1307,399,103,39",
      "resource": {
        "@id": "em_number_5756ad38a020dd5a893a8100",
        "@type": "cnt:ContentAsText",
        "format": "text/plain",
        "chars": "7523"
      }
    },
To try this out, go to http://projectmirador.org/demo and close one of the viewer panes, then select "New object" from the menu and add a URL corresponding to http://localhost:3000/iiif/manifest/nyc_marriage_index_manhattan_1908 to the input field. Clicking on the item that loads will re-open the viewer pane on the marriage application index volume. Clicking the two word balloons will display ROIs and text from the transcripts.
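For tools that want the regions and text without a viewer, the annotation list can be read directly. The sketch below assumes the structure shown in the excerpt earlier in this message: each resource has an "on" target ending in an #xywh= fragment and a "resource" with the transcribed "chars". The sample is the first annotation from that excerpt, closed off by hand so that it parses.

```python
from urllib.parse import urlsplit


def regions_from_annotation_list(annotation_list):
    """Extract (canvas, xywh, text) records from an annotation list dict."""
    rows = []
    for annotation in annotation_list.get("resources", []):
        fragment = urlsplit(annotation["on"]).fragment
        if not fragment.startswith("xywh="):
            continue  # skip targets that are not rectangular regions
        x, y, w, h = (int(v) for v in fragment[len("xywh="):].split(","))
        rows.append({
            "canvas": annotation["on"].split("#", 1)[0],
            "xywh": (x, y, w, h),
            "text": annotation["resource"]["chars"],
        })
    return rows


sample = {
    "@type": "sc:AnnotationList",
    "resources": [
        {
            "@type": "oa:Annotation",
            "on": "https://iiif.archivelab.org/iiif/NYC_Marriage_Index_Manhattan_1908$2077/#xywh=1307,361,100,42",
            "resource": {"@type": "cnt:ContentAsText", "chars": "7259"},
        }
    ],
}
print(regions_from_annotation_list(sample))
```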
I'll be demoing this to the IIIF group today on their community call at 11am central (notes doc), and am hoping for some advice and maybe some help from that group. I'm definitely a newbie to linked open data, but if other people have ideas for ways genealogy tools can use records presented as LOD, I'm all ears.
Ben
--
Thanks, Tom, for looking at scantailor. It looks like a nice solution.
We might need the ROI metadata no matter how the images are hosted. I think there's value in being able to correlate the data we publish with the original images hosted at the Internet Archive. If the data we publish can only be understood in the context of the images we modify and host for indexing, then we either host the images forever or our data loses value sometime in the future. I don't like either of those options.
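To illustrate what keeping that correlation might involve, here is a sketch that maps a marked region on a derived (cropped and scaled) image back to pixel coordinates on the original scan. It assumes the only transformations are a crop offset and a uniform scale. A deskewed page would also need the inverse rotation, which is not handled here. The function and parameter names are hypothetical.

```python
def to_original_coords(roi, crop_origin, scale=1.0):
    """Map an (x, y, w, h) box on a derived image back to the original scan.

    crop_origin is the (x, y) position of the derived image's top-left
    corner within the original; scale is derived pixels per original pixel.
    """
    x, y, w, h = roi
    ox, oy = crop_origin
    return (
        round(ox + x / scale),
        round(oy + y / scale),
        round(w / scale),
        round(h / scale),
    )


# A region marked on a right-hand page that was cropped at x=1920 and
# served at half size.
print(to_original_coords((100, 50, 200, 40), crop_origin=(1920, 0), scale=0.5))
```

Storing the crop origin and scale for each derived page would be enough to publish regions against the original images, under these assumptions.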