Better format determination (part 2 of 2)

Demian Katz

Aug 29, 2011, 8:44:04 AM
to solrma...@googlegroups.com
...and here's the rest of the discussion.

- Demian
________________________________________
From: Walker, David [dwa...@calstate.edu]
Sent: Tuesday, August 16, 2011 12:41 PM
To: VuFind Tech
Subject: [VuFind-Tech] Proposed changes to VuFindIndexer.getFormat, Part 2

Okay, here are my proposed changes.

https://gist.github.com/1149060

It's quite lengthy, but here is why I think it addresses the flaws I mentioned in the previous email.

1. It is much more complete.

As you can see in the enumerations at the top of the file, there are many more format designations than in the current getFormat method.

This goes essentially two levels deep for material types and content types. MARC, amazingly, has even more detailed formats, at least for some material types. For example, you can divide books into 'encyclopedias', 'dictionaries', 'yearbooks', and so on. And music is even more detailed. But I had to stop somewhere, so I didn't parse these out. It also picks up secondary content types (from the 006).

2. It distinguishes between content type and media/carrier type.

There is a getMediaTypes method that essentially parses the 007, plus a few other fields. The getContentTypes method does the same for the leader/008/006 and a few data fields. Each returns all available values.

If I've done my job well, then these two 'lower-level' functions should not need to be customized by libraries. (If there is something missing or wrong here, it should be corrected in the distro.)

Instead, my thought here is that you could have 'higher-level' functions, or BeanShell scripts, that utilize, combine, or otherwise customize these values for the actual indexing.

The getPrimaryContentTypePlusOnline method is an example of this. It takes just the first content type from the getContentTypes set, and then also checks if the item is online. It then combines the content type 'Book' and the media type 'Online' into a single combined type called 'EBook'.

But this is just one of many different examples of how you might do this. I think this allows for a great deal of flexibility without people having to localize or re-write a ton of code.
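
A minimal sketch of the two-level idea in point 2, not the code from the gist. The enum values, the class name, and the way the content types are passed in are all assumptions made for illustration; the real lower-level methods parse the leader/008/006 and 007.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical format values; the real enumerations live at the top of the gist.
enum ContentType { BOOK, JOURNAL, MUSIC_RECORDING, EBOOK, UNKNOWN }

class PrimaryPlusOnlineSketch {

    // Stand-in for a lower-level getContentTypes(): the values are simply passed
    // in, and a LinkedHashSet keeps them in the order they were found.
    static Set<ContentType> contentTypes(List<ContentType> parsed) {
        return new LinkedHashSet<>(parsed);
    }

    // Higher-level function: take only the first content type, then fold the
    // "online" flag into it so that BOOK + online becomes EBOOK.
    static ContentType primaryContentTypePlusOnline(Set<ContentType> types, boolean online) {
        ContentType primary = types.isEmpty() ? ContentType.UNKNOWN : types.iterator().next();
        if (online && primary == ContentType.BOOK) {
            return ContentType.EBOOK;
        }
        return primary;
    }

    public static void main(String[] args) {
        Set<ContentType> types =
                contentTypes(List.of(ContentType.BOOK, ContentType.MUSIC_RECORDING));
        System.out.println(primaryContentTypePlusOnline(types, true));   // prints EBOOK
        System.out.println(primaryContentTypePlusOnline(types, false));  // prints BOOK
    }
}
```

A library that wanted a different rule would replace only the small higher-level method and leave the parsing alone.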

3. It makes a best guess attempt at determining if this is an online resource.

The SolrMarc indexer already includes a getFullTextUrls method that checks to see if the record has a full-text link. It is a 'best guess' since many MARC records infamously contain links to table of contents and other information *about* the item without always consistently marking them as such.

But, all things considered, I think this is much preferable to the current indexer which makes no such effort, at least for format.
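
To illustrate why this can only be a best guess, here is a rough sketch of one possible heuristic for a single 856 link. It is not what getFullTextUrls actually does. The second-indicator meanings follow MARC 21 field 856; the keyword list and the method name are assumptions made for this example.

```java
import java.util.Locale;

class OnlineGuessSketch {

    // ind2: second indicator of the 856 field.
    // note: text from the link's note subfields (e.g. $3 or $z); may be null.
    static boolean looksLikeFullText(char ind2, String note) {
        if (ind2 == '0' || ind2 == '1') {
            return true;   // the resource itself, or a version of it
        }
        if (ind2 == '2') {
            return false;  // explicitly marked as a related resource
        }
        // Indicator blank or otherwise uninformative: fall back to the note text.
        // These keywords are guesses, and real records will defeat them.
        String n = (note == null) ? "" : note.toLowerCase(Locale.ROOT);
        return !(n.contains("table of contents")
                || n.contains("abstract")
                || n.contains("description")
                || n.contains("sample"));
    }
}
```

A record whose table-of-contents link has a blank indicator and no note would still be counted as online here, which is exactly the kind of miss described above.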

So there it is, at least as far as the basic issues are concerned.

There are, IMO, some other (minor) improvements over the current getFormat method.

One of the other complaints I have with the current getFormat function is that it could be much better commented. I essentially took the MARC standards documentation at loc.gov, cut and pasted the relevant portions into my file, and then wrote the code around that. So hopefully it's well commented in a way that corresponds with the documentation online.

Also, following the BlackLight indexer, I used enums for the format values rather than just strings. That way, I didn't accidentally typo one of the formats.

Comments, criticisms, questions all welcome.

Except from Demian. Go get some sleep first. ;-)

--Dave

==================
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

_______________________________________________
Vufind-tech mailing list
Vufin...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/vufind-tech

Alan Rykhus

Aug 29, 2011, 9:45:52 AM
to solrma...@googlegroups.com
Hello,

In the past I've submitted a getFormat function that I use in our MnPALS
Plus site. I would like to put forth an argument for that.

I use a multi-value format field. Each bib record can have multiple
formats that are in a hierarchical list from specific to generic. I
think this is a better solution.

If a patron is searching the database for information on some subject,
they can limit to ebook, but they can also limit to just book and have
the ebooks included. Even though they might have the latest ebook
reader, they can probably still pick up a book and read it, so they
would want to limit only to book. If you have just the specific facets,
then once you select ebook you have eliminated all of the books in the
stacks of the libraries.

The patron can still limit to ebook and get just the electronic
versions.
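
A small sketch of the specific-to-generic idea described above, not the MnPALS getFormat function itself. The parent table and the format names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class HierarchicalFormatSketch {

    // Hypothetical hierarchy: each specific format points at its more generic parent.
    static final Map<String, String> PARENT = Map.of(
            "EBook", "Book",
            "EJournal", "Journal");

    // Returns the format followed by its ancestors, from specific to generic.
    static List<String> withAncestors(String format) {
        List<String> out = new ArrayList<>();
        for (String f = format; f != null; f = PARENT.get(f)) {
            out.add(f);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(withAncestors("EBook"));  // prints [EBook, Book]
        System.out.println(withAncestors("Book"));   // prints [Book]
    }
}
```

Indexing every value in the returned list into a multi-valued format field means a "Book" limit also matches ebooks, while an "EBook" limit matches only the electronic versions.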

al

--
Alan Rykhus
PALS, A Program of the Minnesota State Colleges and Universities
(507)389-1975
alan....@mnsu.edu
"It's hard to lead a cavalry charge if you think you look funny on a
horse" ~ Adlai Stevenson

Demian Katz

Aug 29, 2011, 9:49:36 AM
to solrma...@googlegroups.com
I believe that David's solution is compatible with yours -- his code allows multiple formats per item; it just does a more thorough job of interpreting all of the MARC fields than the current default VuFind getFormat method. In fact, in many cases, his code is probably TOO specific -- hence the need to use translation map files to group together some of his outputs. I believe it is possible that, if we get David's code into the SolrMarc core, you can reimplement your custom indexer simply by setting up an appropriate translation map file.
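
As a rough illustration of the grouping step, here is a sketch that collapses detailed values into broader ones using a key = value table in the style of a properties file. It does not use SolrMarc's own translation map loader; the keys, the values, and the pass-through behaviour for unmapped values are assumptions of this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

class TranslationMapSketch {

    // Parse "key = value" lines, as they might appear in a map file.
    static Properties load(String text) throws IOException {
        Properties map = new Properties();
        map.load(new StringReader(text));
        return map;
    }

    // Replace each raw value with its mapped group; unmapped values pass through
    // unchanged (a choice made for this sketch). Duplicates collapse in the set.
    static Set<String> translate(Set<String> raw, Properties map) {
        Set<String> out = new LinkedHashSet<>();
        for (String value : raw) {
            out.add(map.getProperty(value, value));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Properties map = load("Thesis = Book\nAtlas = Map\nGlobe = Map\n");
        Set<String> raw = new LinkedHashSet<>();
        raw.add("Thesis");
        raw.add("Atlas");
        raw.add("Globe");
        raw.add("Journal");
        System.out.println(translate(raw, map));  // prints [Book, Map, Journal]
    }
}
```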

Of course, I haven't looked too closely at any code yet, so perhaps your indexer has some feature that David's lacks -- if this is the case, I'm sure he would be interested to hear about it.

- Demian

--
You received this message because you are subscribed to the Google Groups "solrmarc-tech" group.
To post to this group, send email to solrma...@googlegroups.com.
To unsubscribe from this group, send email to solrmarc-tec...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/solrmarc-tech?hl=en.

Walker, David

Aug 29, 2011, 12:07:04 PM
to solrma...@googlegroups.com
Hi Alan,

One of the goals I had with this proposal was precisely to make the determination of format as flexible as possible.

Rather than having a single, monolithic getFormat method, my proposal includes several methods at two different levels. The bulk of the code consists of two 'lower-level' functions that parse format in terms of content type and media/carrier type (see the previous emails for a discussion of this). Higher level functions, or BeanShell scripts, can then call these two lower-level functions and combine, contract, or otherwise customize the results as they see fit.

In that way, each institution can do this slightly differently, while keeping the bulk of the code untouched.

> they can limit to an ebook, but they
> can also limit to just a book and have
> the ebooks included.

I completely agree with you on this. The getPrimaryContentTypePlusOnline method (line 205) is an example of one of these 'higher-level' functions. It specifically addresses this scenario.

So you can have multiple format values, or just a primary one, or a combination of the two, or even separate facets for content type and media/carrier type. The options here are myriad.
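
Two of these options, sketched side by side. This is only an illustration; the field names "content_type" and "media_type" and the sample values are made up, and the lists stand in for whatever the lower-level functions return.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FacetOptionsSketch {

    // Option A: one multi-valued format field holding content and media values together.
    static Set<String> combined(List<String> content, List<String> media) {
        Set<String> out = new LinkedHashSet<>(content);
        out.addAll(media);
        return out;
    }

    // Option B: two separate facet fields, one per kind of value.
    static Map<String, List<String>> separate(List<String> content, List<String> media) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("content_type", content);  // hypothetical field name
        doc.put("media_type", media);      // hypothetical field name
        return doc;
    }

    public static void main(String[] args) {
        List<String> content = List.of("Book");
        List<String> media = List.of("Online", "Microfilm");
        System.out.println(combined(content, media));
        System.out.println(separate(content, media));
    }
}
```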

--Dave

==================
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

Owen Stephens

Mar 2, 2012, 4:34:19 AM
to solrma...@googlegroups.com
Where did this proposal get to? Was it ever integrated into SolrMARC?

Thanks,

Owen


Demian Katz

Mar 2, 2012, 7:21:22 AM
to solrma...@googlegroups.com
This was introduced into SolrMarc in version 2.3. More details are available in the release notes here:

https://groups.google.com/group/solrmarc-tech/msg/d6875e5b11fe8d61?hl=en

(As a side note, I had a lot of trouble figuring out how to link to this specific message. Maybe it would be better if we copied release notes into the SolrMarc wiki for easier future reference.)

I don't think this has been widely tested yet, so there may be some issues depending on the nature of your MARC records. For further discussion, see this thread:

https://groups.google.com/forum/#!msg/solrmarc-tech/JNINy1crNug/md1fjyQ6SzAJ

- Demian

Walker, David

Mar 2, 2012, 8:03:52 AM
to solrma...@googlegroups.com
We've been using this code for almost a year now among nearly 20 campuses. So it's been around the block at Cal State -- which is a pretty big and diverse block.

We're only using the 'content type' designation (plus electronic), however. And so the 'media type' designation is not as well tested, per the thread below.

--Dave
-----------------


David Walker
Library Web Services Manager
California State University

https://groups.google.com/group/solrmarc-tech/msg/d6875e5b11fe8d61?hl=en

https://groups.google.com/forum/#!msg/solrmarc-tech/JNINy1crNug/md1fjyQ6SzAJ

Owen Stephens

Mar 2, 2012, 10:53:33 AM
to solrma...@googlegroups.com
Thanks both

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: ow...@ostephens.com
Telephone: 0121 288 6936
