Marc4j on Github


Robert Haschart

Dec 17, 2012, 6:07:37 PM
to solrma...@googlegroups.com
Greetings all,

Bas Peters, the original creator of Marc4j and the owner of the marc4j
project on Tigris.org, contacted me to say that he wants to remove the
project from tigris.org, largely because Tigris is outdated and he
receives tons of spam via that project that the tigris.org spam
filters don't catch.

This would seem to be the opportunity that several people here have asked
for: to move the project to a new official location, on a more up-to-date
service that uses repository software newer than the CVS that
Tigris relies on.

Since git and github were (strongly) suggested the last time that this
issue was raised, I've looked at github and it seems that there are
several marc4j repositories there, some of which seem to have been
created just recently. Some of them seem to be owned by Bill Dueber,
but for the one named marc4j/marc4j, my name seems to be the only one
associated with it. If anyone knows the status of the github projects
for Marc4j, it would be useful to find out.

Bob Haschart


Bill Dueber

Dec 17, 2012, 8:21:10 PM
to solrma...@googlegroups.com
I've been kind of sitting on those, using them as a starting point for some messing around I've been doing, until the wider community got to the point where they wanted to make the move. That seems to be now :-)

I think the best place to start would be with something like the repos now known as marc4j_maven and marc4j_codetable_converter, both at https://github.com/organizations/marc4j.

marc4j_codetable_converter is just a breakout of the codetable generation code, since its presence within the marc4j distro made the build a little complicated. [Briefly: there are java classes that generate other java classes, which then need to be compiled with the main marc4j distro. This made the compilation phase a two-phase ordeal; splitting it off allows things to be simpler.]
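
To make the "java classes that generate other java classes" idea concrete, here is a minimal sketch of such a generator. It is not the actual marc4j codetable code: the class name `CodeTableGeneratorSketch`, the generated class name, and the two table entries are all hypothetical, chosen only to show why the build needs two phases (run the generator, then compile what it wrote).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical generator, NOT the marc4j codetable code: phase one runs this
// class to produce Java source text; phase two compiles that source.
final class CodeTableGeneratorSketch {

    // Emits the source of a class whose lookup method is a switch over the table.
    static String generate(String className, Map<Integer, Integer> table) {
        StringBuilder src = new StringBuilder();
        src.append("public final class ").append(className).append(" {\n");
        src.append("    public static int toUnicode(int code) {\n");
        src.append("        switch (code) {\n");
        for (Map.Entry<Integer, Integer> e : table.entrySet()) {
            src.append(String.format("            case 0x%02X: return 0x%04X;\n",
                    e.getKey(), e.getValue()));
        }
        src.append("            default: return -1; // unmapped code\n");
        src.append("        }\n    }\n}\n");
        return src.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> table = new LinkedHashMap<>();
        table.put(0x41, 0x0041); // illustrative entries only
        table.put(0x42, 0x0042);
        // A real build would write this text to a .java file and compile it.
        System.out.print(generate("GeneratedCodeTable", table));
    }
}
```

Splitting the generator into its own project means the main build only ever sees the already-generated class as an ordinary dependency.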

marc4j_maven is, as you might expect, everything left after I pulled out what's in marc4j_codetable_converter, re-organized as per maven standard practice.

Both are maven projects, with the latter having the former as a dependency.

There have been no changes to the code from the CVS except moving things around (and having a dependency on the real icu4j instead of an internal clone), but I also didn't preserve the CVS logs in this pass. There's no good reason for that: I can certainly duplicate my efforts with the full log history and we won't have to lose anything.

[If you're a java person and it looks like I don't know what I'm doing, well, that's because I'm not a java person and I don't know what I'm doing.]

I propose something along the lines of the following:

Reorganization and mavenification
  • marc4j be split into two pieces, as I did, but with the full log history intact
  • both be maven projects (which seems like the least-awful of the awful java build processes that everyone has easy access to)
  • both be uploaded to the maven central repository under the org.marc4j namespace
Github account
  • major contributors to marc4j be co-owners of the github.com/marc4j organization (I'll happily step aside after adding the first few names)
  • All co-owners be first-class owners, able to review and apply pull requests
Release process
  • Whatever is there now be tagged as "2.5.0" and released
  • Switch to semantic tagging
  • Tagged releases will be uploaded to the maven central repo, which will act as the de facto repository of the latest stable build. 
Expansion?
  • Should the readers/writers currently under the solrmarc namespace be folded into marc4j and housed under the github account? I'm thinking of stuff like the marc combining reader
  • There's also that git repo at indexdata that includes the turbomarc stuff. It should probably stay with them, but we need to make sure it's easy to develop against the core.
Anyone else with thoughts on this? 

--
You received this message because you are subscribed to the Google Groups "solrmarc-tech" group.
To post to this group, send email to solrma...@googlegroups.com.
To unsubscribe from this group, send email to solrmarc-tech+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/solrmarc-tech?hl=en.




--
Bill Dueber
Library Systems Programmer
University of Michigan Library

Naomi Dushay

Dec 18, 2012, 3:13:51 AM
to solrma...@googlegroups.com
I am in favor of marc4j using git for source control and github as its home.  Among other things, interested parties will be able to easily "watch" the repository to be notified of changes.

release process
I would love to see a better release process, with more transparency of tagged releases and with semantic versioning.  I would like to see all future work of solrmarc use tagged releases of marc4j.

expansion
I have no objection to some of the readers and writers in solrmarc migrating to marc4j.  I believe I wrote one of the combining readers.

tests, continuous integration
I don't know if marc4j has tests … but if it doesn't, all commits starting now should be expected to have them.
I don't know if marc4j has a continuous integration build … but if it doesn't, it should get one immediately.  travis.org works seamlessly with github and is free.


I have never built marc4j, and I have never used maven, so I can't speak to the rest of Bill's suggestions.

- Naomi



Daniel Lovins

Dec 18, 2012, 8:15:23 AM
to solrma...@googlegroups.com

Just FYI I think the link for Travis should be travis-ci.org.

 

- Daniel


Naomi Dushay

Dec 18, 2012, 12:48:02 PM
to solrma...@googlegroups.com
I also forgot to mention that http://github.com/solrmarc exists and is available to Bob or whomever for solrmarc-related work. I will be happy to add owners and remove myself as an owner, as folks see fit. It is my hope that solrmarc will live there someday.

- Naomi

Robert Haschart

Dec 19, 2012, 5:22:52 PM
to solrma...@googlegroups.com
Bill,

I will freely admit to not being a fan of maven. To me it seems that maven has its preferred way of doing things, and if you need to do something even a little outside what it expects, you have to jump through all sorts of hoops to make maven happy. Not being able to generate the codetable source files without creating an additional sub-project clearly demonstrates this, to me. Furthermore, it was a conscious decision to include the normalizer.jar rather than the full icu4j distribution. Only a tiny portion of icu4j is needed by marc4j, the classes for converting from composed form to decomposed form, and even these classes aren't always needed. It makes no sense to me to require the inclusion of a utility library that is over 10 times the size of the project being developed.

http://www.iternum.com/knowhow/guidelines/maven-vs-ant/document.pdf

We have a project used internally here for which the (now-departed) developer decided to use maven as the build tool, and the complaints in the above document about maven being inconsistent seem to be true of that project. As a matter of course, when I need to build that project, I tell it to build, and it fails two or three times before returning success. That project is also a web application, and if I need to debug it as it is receiving messages, the plugin that supposedly supports that doesn't work with the web container I'm trying to test with.

However for the rest of your suggestions I think you are right on the mark.

Specifically, I think that the repo should have multiple full-fledged co-owners, and that there should be a 2.5 release very shortly after migrating the code.
Some of the additional readers/writers from solrmarc should also be migrated to marc4j where that makes sense, and support should be added for readers/writers from other sources.

-Bob Haschart

Bill Dueber

Dec 19, 2012, 7:12:15 PM
to solrma...@googlegroups.com
I'm not married to maven, but I *am* married to the idea of having the build system generate and deal with a real dependency graph, and of making marc4j available as a dependency to other projects that use the standard build systems. Maven seems to be the way people do that these days (whether straight-up, or via something like buildr). I don't care how it's built, just that the dependencies are pulled in automatically and that marc4j is available in the maven central repo. And even the former is less important to me than the latter; getting marc4j in the central repo will make projects that depend on it much easier to build.

I understand the concern about the ICU library, but we're in a situation where any improvements to the underlying library will bypass us because we ignore them (or is this not something we're worried about in this particular case?). The icu jar looks to be about 8.8MB; I don't know if people consider that "too big" or not, and of course if you'll be using some of the other stuff in the icu jar as well, it's a sunk cost.

Anyone else wanna weigh in before everyone disappears for the holidays? :-)

Demian Katz

Dec 21, 2012, 7:08:48 AM
to solrma...@googlegroups.com
I can't really provide an informed opinion on the Maven issue, since I have very little experience with the Java build process in general (I think I'm blessed in this regard).  It does seem to make sense to make the code available via Maven regardless of how it is built, though, assuming that doing this does not require any major architectural changes to the library.

Regarding the ICU library issues, I usually tend to favor standardization and convenience over disk space, though adding 8MB to every project using marc4j does seem a little excessive if the library is not used heavily. Is it possible to set up a build option to either include the mini-ICU code or to try to find the library on the classpath? In the case of VuFind, for example, we already have the ICU library included with Solr, so having another copy bundled inside marc4j bundled inside SolrMarc would be a waste. Of course, I realize that this is probably not as easy in practice as it sounds in theory... but it might be a nice solution if it's feasible.
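
A rough sketch of the "find the library on the classpath" half of that idea, done at run time rather than as a build option. The helper class `DecomposerChooser` is hypothetical and is not part of marc4j. It assumes an ICU4J release that provides `com.ibm.icu.text.Normalizer2.getNFDInstance()`, and it swaps in the JDK's own `java.text.Normalizer` as the fallback, where the thread is talking about a bundled normalizer.jar.

```java
import java.text.Normalizer;

// Hypothetical helper, not marc4j code: use ICU4J when it is already on the
// classpath (e.g. shipped with Solr), otherwise fall back to the JDK.
final class DecomposerChooser {

    // Assumed ICU4J class; only resolvable when an icu4j jar is present.
    private static final String ICU_NORMALIZER = "com.ibm.icu.text.Normalizer2";

    // Returns the ICU class if it can be loaded, or null if it cannot.
    static Class<?> findIcu() {
        try {
            return Class.forName(ICU_NORMALIZER);
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    // Decomposes a string (NFD), via ICU4J if available, else via the JDK.
    static String decompose(String s) {
        Class<?> icu = findIcu();
        if (icu != null) {
            try {
                // Reflection keeps this class compilable without icu4j present.
                Object nfd = icu.getMethod("getNFDInstance").invoke(null);
                return (String) icu.getMethod("normalize", CharSequence.class)
                        .invoke(nfd, s);
            } catch (ReflectiveOperationException e) {
                // Unexpected ICU version: fall through to the JDK normalizer.
            }
        }
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        System.out.println("ICU4J on classpath: " + (findIcu() != null));
        System.out.println("decomposed length: " + decompose("\u00e9").length());
    }
}
```

The point of the reflection is that nothing here forces icu4j to be bundled; whichever copy the host application already ships is the one that gets used.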

- Demian


Simon Spero

Dec 21, 2012, 12:17:57 PM
to solrma...@googlegroups.com
One of the things that Maven takes care of is downloading necessary dependencies. The downside is that it's maven :-)

Apache also has ivy, a subproject of ant, which handles dependency management and downloading. 

In the marc4j case, most of the need to call ICU can be compiled out with smart MARC to Unicode translation; composed marc characters can be mapped directly to unicode characters or sequences, since the range of characters to which diacritics can be correctly applied is generally restricted, and can be learned with a high degree of accuracy using a corpus.  Exceptions can be handled by falling back to a slow-path conversion, but these should always result in non-composable unicode anyway.
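
The fast-path/slow-path split described here might look roughly like the following. This is only a sketch under loud assumptions: the class `FastPathComposer` and its three-entry table are hypothetical (a real table would be derived from a corpus, as suggested above), the input is assumed to be already transcoded to a Unicode base character followed by a combining mark, and the slow path uses the JDK's `java.text.Normalizer` as a stand-in for whatever normalizer the library would actually call.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not marc4j code: a small lookup table handles the
// common base+diacritic pairs, and only misses reach the full normalizer.
final class FastPathComposer {

    // Hand-picked illustrative entries: base + combining mark -> precomposed.
    private static final Map<String, String> FAST = new HashMap<>();
    static {
        FAST.put("e\u0301", "\u00e9"); // e with acute
        FAST.put("u\u0308", "\u00fc"); // u with diaeresis
        FAST.put("n\u0303", "\u00f1"); // n with tilde
    }

    static String compose(String baseAndMark) {
        String hit = FAST.get(baseAndMark);
        if (hit != null) {
            return hit; // fast path: no normalizer call at all
        }
        // Slow path: full NFC; pairs with no precomposed form come back unchanged.
        return Normalizer.normalize(baseAndMark, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(compose("e\u0301").length()); // table hit
        System.out.println(compose("q\u0301").length()); // no precomposed form
    }
}
```

With a table that covers the pairs actually seen in records, the slow path should be reached mostly for sequences that cannot be composed anyway.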

Simon

Naomi Dushay

Dec 21, 2012, 7:24:04 PM
to solrma...@googlegroups.com
I use ICU4j for our Solr;  I don't mind the space taken up by it.  I'm more concerned with it "hiding" in the marc4j jar so I could be using a different version of ICU4j  than I expected in my Solr analysis.  

Disk space is cheap. An empty file with a name akin to YOU_WILL_NEED_ICU4J_JAR distributed with the marc4j jar could be helpful, along with documentation (the file could be non-empty, with instructions on where to get the goodies).

The Lucene/Solr build uses ivy. 

Anyone wanna get together at Code4Lib, or just before or after, to work on this stuff?

- Naomi

Demian Katz

Dec 24, 2012, 7:53:58 AM
to solrma...@googlegroups.com
I'd be happy to join a conversation at Code4lib -- perhaps it would be worthwhile to go over the whole MARC suite (including SolrMarc as well as Marc4j) since I think there have been some partial discussions on list that might be better advanced in person.  Would this make sense as a breakout, or do you already have those slots reserved for other topics?  If nothing else, I'd be happy to discuss this over dinner one evening.

- Demian

Tod Olson

Jan 3, 2013, 11:02:06 AM
to solrma...@googlegroups.com, Tod Olson
I agree, taking some time for such a conversation at code4lib would be a fine thing.

-Tod