Issue 49 in gref-mvz: author search yields incomplete results


codesite...@google.com

May 8, 2008, 5:55:50 PM
to gr...@googlegroups.com
Issue 49: author search yields incomplete results
http://code.google.com/p/gref-mvz/issues/detail?id=49

New issue report by eightysteele:
if you search GReF for author 'N. B. Stern', you get three sections (pub
ids 2379-2381), but each has only 2 pages. if you query the bscit
site, you get 4 sections with many more pages:

http://bscit.berkeley.edu/cgi-bin/mvz_volume_query?special=pagescan_directory=v1660_s1&section_order=1&page=1&orig_query=674689
http://bscit.berkeley.edu/cgi-bin/mvz_volume_query?special=pagescan_directory=v1660_s2&section_order=2&page=1&orig_query=674689
http://bscit.berkeley.edu/cgi-bin/mvz_volume_query?special=pagescan_directory=v1660_s3&section_order=3&page=1&orig_query=674689
http://bscit.berkeley.edu/cgi-bin/mvz_volume_query?special=pagescan_directory=v1660_s4&section_order=4&page=1&orig_query=674689

gref somehow isn't getting the sections and pages correctly.


Issue attributes:
Status: Accepted
Owner: eightysteele
CC: carlacic
Labels: Type-Defect Priority-Medium Usability Component-Persistence Component-Logic


codesite...@google.com

May 8, 2008, 5:59:51 PM
to gr...@googlegroups.com

Comment #1 by eightysteele:
(No comment was entered for this change.)


Issue attribute updates:
Labels: -Priority-Medium Priority-Critical

codesite...@google.com

May 8, 2008, 8:06:19 PM
to gr...@googlegroups.com

Comment #2 by eightysteele:
important note:
bscit uses a url in the following form to reference a scanned field
notebook page:
http://bscit.berkeley.edu/mvz/notebookjpegs/v500_s6/v500_s6_p000.jpg

here is the mapping from bscit url parameters to the mvz data model:
1) `v` (as in v500 above) refers to book_section.book_id in the mvz
data model.
2) `s` (as in s6 above) refers to book_section.book_section_order in
the mvz data model.
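
the mapping above can be sketched in python. this is only an illustration: the names `ScanRef` and `parse_bscit_url` are made up here, and reading the trailing `p` number as a page number is an assumption based on the example url, not something bscit documents.

```python
import re
from typing import NamedTuple, Optional


class ScanRef(NamedTuple):
    book_id: int             # book_section.book_id (the `v` number)
    book_section_order: int  # book_section.book_section_order (the `s` number)
    page: Optional[int]      # trailing `p` number, if present (assumed to be a page number)


# matches e.g. "v500_s6", optionally followed by "_p000"
_SCAN_RE = re.compile(r"v(\d+)_s(\d+)(?:_p(\d+))?")


def parse_bscit_url(url: str) -> ScanRef:
    """Pull the book id, section order and page number out of a bscit url."""
    matches = _SCAN_RE.findall(url)
    if not matches:
        raise ValueError(f"no v<N>_s<N> component found in {url!r}")
    # a page image url repeats the directory name, so the last match carries the page
    v, s, p = matches[-1]
    return ScanRef(int(v), int(s), int(p) if p else None)


ref = parse_bscit_url(
    "http://bscit.berkeley.edu/mvz/notebookjpegs/v500_s6/v500_s6_p000.jpg"
)
print(ref.book_id, ref.book_section_order, ref.page)
```

the same function also accepts the `pagescan_directory=v1660_s1` form used in the query urls, in which case `page` is `None`.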

CASE 1: scanned field notebook pages exist on bscit but not in the mvz database.

use gref to search for N.B. Stern. the results show that 2 pages are
available for:
1) publication.publication_id = 2379, book_section.book_section_order = 1
2) publication.publication_id = 2380, book_section.book_section_order = 2
3) publication.publication_id = 2381, book_section.book_section_order = 3

all pages above are from book_section.book_id = 1660. the underlying
query to the mvz data model is:

SELECT page_id, book_id, book_section_order, COUNT(PAGE_ID) AS PAGE_COUNT,
       PUBLICATION_ID, AGENT_NAME, PUBLISHED_YEAR, PUBLICATION_TITLE,
       PUBLICATION_REMARKS
FROM PUBLICATION pub
JOIN BOOK_SECTION bs USING(publication_id)
JOIN FIELD_NOTEBOOK_SECTION fns USING(PUBLICATION_ID)
JOIN PUBLICATION_AUTHOR_NAME pan USING(PUBLICATION_ID)
JOIN AGENT_NAME pa USING(AGENT_NAME_ID)
LEFT OUTER JOIN PAGE p USING(PUBLICATION_ID)
WHERE 1=1
  AND pa.AGENT_NAME_TYPE = 'preferred'
  AND upper(pa.AGENT_NAME) like '%N. B. STERN%'
GROUP BY PUBLICATION_ID, AGENT_NAME, PUBLISHED_YEAR, PUBLICATION_TITLE,
         PUBLICATION_REMARKS, page_id, book_id, book_section_order
ORDER BY PUBLISHED_YEAR, PUBLICATION_ID

the same search using bscit shows results for v1660 (book_section.book_id
1660 in the mvz database) for sections 1-4, all with field notebook pages
scanned. the problem is that, for example, v1660 section 1 in bscit shows
29 pages scanned, while gref shows only 2 pages available. so the number
of scanned pages for each section does not match what's in the mvz
database, and since gref uses the mvz database, it only shows what's
available there. the mismatch arises because the metadata for these pages
simply do not exist yet in the mvz database.

CASE 2: field notebook metadata exist in the mvz database but their pages
have not yet been scanned.

using gref, search for Annie M. Alexander in 1939. the results show that
102 pages are available using the following query:

SELECT COUNT(PAGE_ID) AS PAGE_COUNT, PUBLICATION_ID, AGENT_NAME,
       PUBLISHED_YEAR, PUBLICATION_TITLE, PUBLICATION_REMARKS
FROM PUBLICATION pub
JOIN BOOK_SECTION bs USING(publication_id)
JOIN FIELD_NOTEBOOK_SECTION fns USING(PUBLICATION_ID)
JOIN PUBLICATION_AUTHOR_NAME pan USING(PUBLICATION_ID)
JOIN AGENT_NAME pa USING(AGENT_NAME_ID)
LEFT OUTER JOIN PAGE p USING(PUBLICATION_ID)
WHERE 1=1
  AND pa.AGENT_NAME_TYPE = 'preferred'
  AND upper(pa.AGENT_NAME) like '%ANNIE M. ALEXANDER%'
  AND PUBLISHED_YEAR >= 1939 AND PUBLISHED_YEAR <= 1939
GROUP BY PUBLICATION_ID, AGENT_NAME, PUBLISHED_YEAR, PUBLICATION_TITLE,
         PUBLICATION_REMARKS
ORDER BY PUBLISHED_YEAR, PUBLICATION_ID

modifying the above query to project the `book_section.book_id` and
`book_section.book_section_order` columns reveals that all 102 results
are from book_section.book_id = 1768 for
book_section.book_section_order = 1.
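
a rough sketch of that per-section count, run against an in-memory sqlite database rather than the real mvz database. the two-table schema is a heavily simplified assumption (only the columns named in this thread), and the publication id and the 102 page rows are placeholder data made up to mirror the numbers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# simplified, hypothetical stand-in for the mvz tables
conn.executescript("""
    CREATE TABLE book_section (
        publication_id INTEGER,
        book_id INTEGER,
        book_section_order INTEGER
    );
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        publication_id INTEGER
    );
""")

PUB_ID = 1  # placeholder publication id, not taken from the real database
conn.execute("INSERT INTO book_section VALUES (?, 1768, 1)", (PUB_ID,))
conn.executemany("INSERT INTO page (publication_id) VALUES (?)", [(PUB_ID,)] * 102)

# project book_id and book_section_order and count pages per section
rows = conn.execute("""
    SELECT book_id, book_section_order, COUNT(page_id) AS page_count
    FROM book_section bs
    LEFT OUTER JOIN page p USING (publication_id)
    GROUP BY book_id, book_section_order
    ORDER BY book_id, book_section_order
""").fetchall()

for book_id, section_order, page_count in rows:
    print(f"v{book_id}_s{section_order}: {page_count} pages")
```

because of the LEFT OUTER JOIN, a section with no page rows would still show up with a count of 0 instead of disappearing from the results.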

the same search using bscit shows results for v1768 (book_section.book_id
1768 in the mvz database) in section 1, but bscit doesn't have any field
notebook pages scanned for it. so in this case, the mvz specimen database
contains 102 records for book_section.book_id = 1768 where
book_section.book_section_order = 1, but the corresponding field notebook
pages have either not been scanned or have not been processed by bscit.
this is why gref shows 102 pages available: in the mvz database, they are
available. however, not all of those pages are available yet on bscit.
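
the two cases could be told apart mechanically by comparing per-section page counts from both sources. the sketch below assumes the counts have already been collected into dicts keyed by (book_id, book_section_order). how they would be gathered from bscit is not covered here, and `classify_sections` is a hypothetical helper, not part of gref.

```python
from typing import Dict, Tuple

SectionKey = Tuple[int, int]  # (book_id, book_section_order)


def classify_sections(
    db_counts: Dict[SectionKey, int],
    bscit_counts: Dict[SectionKey, int],
) -> Dict[SectionKey, str]:
    """Label each section by which source has more pages."""
    result = {}
    for key in sorted(set(db_counts) | set(bscit_counts)):
        in_db = db_counts.get(key, 0)       # pages known to the mvz database
        scanned = bscit_counts.get(key, 0)  # pages scanned on bscit
        if scanned > in_db:
            result[key] = "case 1: scans on bscit missing from mvz database"
        elif in_db > scanned:
            result[key] = "case 2: mvz metadata without scans on bscit"
        else:
            result[key] = "consistent"
    return result


# counts taken from the two examples in this comment
db_counts = {(1660, 1): 2, (1768, 1): 102}
bscit_counts = {(1660, 1): 29, (1768, 1): 0}

for key, label in classify_sections(db_counts, bscit_counts).items():
    print(key, label)
```

a report like this would only flag the mismatches. fixing case 1 still means loading the missing page metadata into the mvz database, and fixing case 2 means scanning or processing the pages on the bscit side.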
