I should note that I have also been muddling about in MARC.js in the
past few days, as I try to get MARC-21's multilingual data support to
feed into the multilingual frameworks of Frank's Multilingual Zotero
project. Any major new development on the core translators like MARC
should probably take into account the coming possibility of storing
translations and transliterations of titles, authors, and other data.
I'll take a look at the materials soon and comment as I can -- it
would be good if you took a look at what Frank has produced
(http://gsl-nagoya-u.net/http/pub/zotero-multilingual-overview.html)
and perhaps experimented with the test XPI or the SVN branch
(https://www.zotero.org/trac/browser/extension/branches/trunk-multilingual).
You probably shouldn't try too hard to make a translator work with
this experimental branch right now, since it is liable to change
before it merges to the Zotero trunk, but it's still reason enough to
think about how you can gather the multiple representations of
bibliographic data from MARC.
Regards,
Avram
2010/10/8 ziche <zi...@noos.fr>:
> [..] I have uploaded some discussion materials to
> http://zotero-dev.googlegroups.com/web/Marc2-suggestions.zip. [..]
I'm having trouble downloading the zip file. In light of the announced
coming closure of the Files sections on Google Groups, could you post
this somewhere else -- maybe as separate files on Github or Bitbucket?
That way you'll get revision histories as a nice bonus as well, and
people will be able to reliably access the files.
Group admins: Can you change the group description to reflect that the
Files section is no longer recommended for code submissions?
- Avram
- I uploaded a screenshot of what the result should look like:
http://github.com/zomark/zotero-marc/blob/master/screenshots/MultilingualItem01.png
--
You received this message because you are subscribed to the Google Groups "zotero-dev" group.
To post to this group, send email to zoter...@googlegroups.com.
To unsubscribe from this group, send email to zotero-dev+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/zotero-dev?hl=en.
This frustrated me as well. To be fair, the language variant system is
more flexible than the script variant system. My map was:
switch (tag) {
    case "(3": return "Arab";
    case "(B": return "Latn";
    // MARC has a single tag for CJK. That won't do...
    case "$1": return "Qabk";
    case "(N": return "Cyrl";
    case "(S": return "Grek";
    case "(2": return "Hebr";
    // Return something technically valid
    default: return "Zyyy";
}
I thought about doing detection to distinguish CJK, but that's likely
to cause even more issues. My solution here is to use a private-use
script tag. I've thought some about private-use script and variant
tags -- they're not currently supported by the code (right?), but
perhaps we could approve certain ones as they prove necessary (like
this one), and submit truly necessary and useful ones to IETF for
approval. Thus, we'd enrich our registry a little to support user
needs, but then we could migrate those tags to the final approved ones
if/when they pass the gauntlet of ietf-languages.
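
To illustrate the private-use idea: ISO 15924 reserves the script codes
Qaaa through Qabx for private use, which is why "Qabk" is technically
valid as a script subtag. Below is a minimal sketch of a check for that
range. The function name isPrivateUseScript is hypothetical and is not
part of Zotero or the multilingual branch.

```javascript
// Hypothetical helper for illustration; not part of Zotero or the
// multilingual branch.
// ISO 15924 reserves the four-letter codes Qaaa..Qabx for private use.
function isPrivateUseScript(script) {
    if (typeof script !== "string" || !/^[A-Za-z]{4}$/.test(script)) {
        return false;
    }
    // Normalize to title case (e.g. "qabk" -> "Qabk") before comparing.
    var code = script.charAt(0).toUpperCase() + script.slice(1).toLowerCase();
    // Plain string comparison works because all codes have the same length.
    return code >= "Qaaa" && code <= "Qabx";
}
```

A registry of approved private-use tags could run a check like this
before accepting a new entry, so that only codes from the reserved range
are ever handed out.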
I'm behind on the progress you and Frank have made with the
translators, but I'm excited and happy to see that this next important
piece of the multilingual project is coming together so cleanly and
quickly. I'll provide more feedback as soon as I get a chance to make
a running install with all the new test translators.
I should also note that some other catalogs make wider use of the
multi-script support in MARC-21-- most recent Russian publications in
the UCLA catalog include Cyrillic (ru-Cyrl) in addition to LoC
romanization (ru-Latn-alalc97); see http://catalog.library.ucla.edu
(persistent links are not a strong point of the catalog). It also uses
Voyager, so the translator should work unchanged.
Regards to the list,
Avram
2010/10/10 ziche <zi...@noos.fr>:
> I had a look at UCLA and found, for example, a resource with "LC
> control number: 2007222933" (it can be searched as "keyword
> anywhere"). It turns out their parallel titles, places or edition
> statements do not carry any language/script information at all...
> Maybe we should translate these cases simply to dependent entries with
> a private tag, and the multilingual UI should allow the user to
> manually switch the language/script tag for an entry, or even for all
> "private tag" entries of an item at the same time.
UCLA does show some of the data quality issues that are endemic to
library catalogs. In cases where alternate languages and scripts are
present in the input but not tagged, I think it would be best to
assign them the unknown language/script tags as appropriate.
Another record at UCLA (ISBN 9785170545421) provides "Cyrl" alternate
data for the title and other information, but the language is not
specified in the record. I think that we should save the primary data
(fields 100, 245, etc) as the parent record without marking its
language, and save the Cyrl-tagged alternate data in the 880 fields,
using the tag "und-Cyrl".
In general, I don't think that most script-sniffing is necessary,
since the language tag is most important, and few if any sniffing
algorithms can determine that with any reliability. We have ways to
mark data as missing using the existing language tag system-- we can
leverage that.
My suggestion of a private use tag for the CJK catch-all used in
MARC-21 was intended to find a way to maintain the data we have
gleaned from the MARC record-- we don't want to say "Zyyy"
(undetermined) since we in fact know that the script is one of Hans,
Hani, Kana, Jpan, Kore.
In short, I support saving all language variants with their own
language tags if known, and with "und" if not known. If a script
variant is provided and the script is not specified (Florian has found
one of these), then we should save the data as "<lang>-Zyyy". If we also
don't know the language, I suppose the data will have to be saved as
"und-Zyyy". In Florian's example, the data in the 880 fields would
have to be saved as und-Zyyy, since it is not declared to be Cyrillic
and the language of the metadata is not specified anywhere.
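
The default rules above can be sketched roughly as follows. This is only
an illustration under stated assumptions: the function name
buildVariantTag and its parameters are hypothetical, and nothing here is
taken from the actual translator code.

```javascript
// Hypothetical helper illustrating the default tagging rules described
// above; the name and signature are assumptions, not translator code.
//   language        - declared language subtag, or null/undefined if unknown
//   script          - declared script subtag, or null/undefined if unknown
//   isScriptVariant - true when the data is an alternate-script
//                     representation (e.g. it came from an 880 field)
function buildVariantTag(language, script, isScriptVariant) {
    // Unknown language is recorded as "und".
    var lang = language || "und";
    if (script) {
        return lang + "-" + script;
    }
    // A script variant with no declared script gets "Zyyy".
    return isScriptVariant ? lang + "-Zyyy" : lang;
}
```

With this sketch, the UCLA record with Cyrl-tagged 880 data and no
declared language would come out as buildVariantTag(null, "Cyrl", true),
and Florian's example as buildVariantTag(null, null, true).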
This is default behavior for the MARC translator that I am describing
-- if we know that a specific catalog has a systematically
non-standard way of handling language variants, we can then override
the process for that catalog.
The introduction of und and Zyyy for data raises the possibility of
multiple language variants that are all tagged with these unknown
tags. Do we want to allow such a beast? Will that make the system far
too complicated?
A final thought is that not only CiNii could benefit from Zotero users
contributing improved metadata -- multilingual metadata is hard to do
and universally weak. A path for improved metadata back to library
OPACs would be quite valuable as well.
Regards,
Avram
Oops. My sight-reading of MARC isn't very good, as you can tell.
> correctly. I still think my former sample (declared Russian,
> undeclared Cyrillic), as well as many BnF records I have seen, could
> benefit from script sniffing. I uploaded a proposal for a database
> table [..]
Now that I see how this would work, it seems a lot more reasonable
than I had feared. This sort of limited lookup would be good for those
cases when we have field 880 alternate graphic representations and we
don't know how to tag the alternate content. We could use this for all
unspecified scripts, since many US catalogs have primary data
exclusively in Latn-- we don't want to end up with "ru" and "ru-Cyrl",
where the former is actually referring to "ru-Latn", just because I
wanted to be cautious about sniffing.
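
For the sake of discussion, here is a minimal sketch of what a limited
script lookup could look like. It is not the proposed database table:
the range list, the choice of Unicode blocks, and the name sniffScript
are all assumptions made for illustration, and the block ranges are only
approximate stand-ins for real script membership.

```javascript
// Hypothetical sketch of limited script sniffing; not the proposed
// database table. Ranges are approximate Unicode block boundaries.
var SCRIPT_RANGES = [
    { script: "Latn", from: 0x0041, to: 0x005A }, // A-Z
    { script: "Latn", from: 0x0061, to: 0x007A }, // a-z
    { script: "Latn", from: 0x00C0, to: 0x024F }, // accented Latin letters
    { script: "Grek", from: 0x0370, to: 0x03FF },
    { script: "Cyrl", from: 0x0400, to: 0x04FF },
    { script: "Hebr", from: 0x0590, to: 0x05FF },
    { script: "Arab", from: 0x0600, to: 0x06FF }
];

// Returns the script with the most matching characters, or "Zyyy" when
// no character falls into any known range (digits, punctuation, etc.).
function sniffScript(text) {
    var counts = {};
    for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);
        for (var j = 0; j < SCRIPT_RANGES.length; j++) {
            var range = SCRIPT_RANGES[j];
            if (code >= range.from && code <= range.to) {
                counts[range.script] = (counts[range.script] || 0) + 1;
                break;
            }
        }
    }
    var best = "Zyyy";
    var bestCount = 0;
    for (var script in counts) {
        if (counts[script] > bestCount) {
            best = script;
            bestCount = counts[script];
        }
    }
    return best;
}
```

Note that this deliberately leaves CJK alone, since distinguishing
Chinese, Japanese, and Korean by code point is exactly the kind of
detection that is likely to cause more problems than it solves.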
- Avram
When I started populating the Language field with language tags in
anticipation of the field contents mattering for the Multilingual
branch, I realized that I was being limited in ways of expressing the
language of the underlying real item (book, article, document) that I
didn't really appreciate. Consider a book in multiple languages, or
translated from one to another. These are cases that we could solve in
a rather complex manner with language tags, but the language of the
underlying item isn't necessarily needed for the processor to handle
the metadata (itself language-tagged) in a useful way.
There are cases, unfortunately, when a machine readable representation
of the language of the underlying item would be necessary; I'm
thinking in particular of Russian bibliographic practice where a
bibliography may be split by language. Nonetheless, we can drive such
citation generation on the basis of trusting the "master" version of a
multilingual field to be the language that we are privileging and
likely the language of the document for citation purposes.
MARC does try to provide information on whether an item is a
translation, and the source and target languages, but that kind of
information is not necessary for Zotero to maintain unless we're
trying to make it a fully MARC-interoperable library catalog. Which I
hope is not our goal at this point.
If there's a compelling case to make Language tag-driven, maybe I can
go along with it, but I don't yet see what we gain, and we lose the
ease-of-use of the free-form tag.
Perhaps Dan or someone can let us know how frequently the language tag
is actually used by Zotero users? Or does the Zotero team consider
that to be data that shouldn't be aggregated and disclosed from the
synced databases?
Regards,
Avram
I'll be exploring how much it will take to support it using the
current (and proposed) MARC translators, but it reminded me that we
might want to explore some of these issues -- are there other MARC
dialects out there that we don't know about?
The RUSMARC format is discussed here: http://www.rba.ru/rusmarc/ (in
Russian, short English explanation at
http://www.rba.ru/rusmarc/rusmarc_e.html , examples (in HTML!) at
http://www.rba.ru/rusmarc/soft/examples.htm)
Best,
Avram
RUSMARC is used by the National Library of Russia:
http://www.nlr.ru/ , catalogs at http://www.nlr.ru:8101/poisk/index.html
No direct link. Example data at
http://github.com/ajlyon/zotero-bits/raw/master/RUSMARC.sample1
- Avram