Executive summary:
This patch to BibTeX.js:
http://groups.google.com/group/zotero-dev/web/sb-bibtex-patch.txt
Fixes the two problems I reported here:
http://forums.zotero.org/discussion/12816/bug-bibtex-key-generation/#Item_4
Note that the patch is to the release version of BibTeX.js, not the
development version, but I think the recent patches to that version
don't touch these lines. If you need me to get the dev version and
generate a patch on that, I can.
For detail, read on...
PROBLEMS:
1. The "banned words" regex in BibTeX.js slurps the whitespace both
before and after the words it extracts, which caused some title words
to be concatenated when producing a BibTeX key.
TITLE: The Man from Nantucket
AUTHOR: Smith
DATE: 2010
---:> BIBTEX KEY: smith_mannantucket_2010
---:> EXPECTED: smith_man_2010
2. The same regex only includes English articles and a small (and
somewhat haphazard?) collection of excluded words.
var citeKeyTitleBannedRe = /(\s+|\b)(a|an|from|does|how|it\'s|its|on|some|the|this|why)(\s+|\b)/g;
This leads to non-descriptive keys like this:
TITLE: Der Alptraum
AUTHOR: Schmidt
DATE: 2010
---:> BIBTEX KEY: schmidt_der_2010
---:> BETTER KEY: schmidt_alptraum_2010
I had a lot of entries in my bibliography with keys that included
_der_, _le_, _la_, etc.
SOLUTION:
Easy. Both (1) and (2) can be fixed by adjusting this one regex: (1)
by replacing the leading (\s+|\b) with a plain \b, so that only a word
boundary is matched there, and (2) by extending the list of banned
words. (I also made a slight adjustment to a later 'split' call, in
case multiple spaces slip through.)
This extends coverage to French, Spanish and German articles and the
common French/Spanish particle 'de'. It also refines the English list
to include only articles, common prepositions, and the helping verb
'do'.
[1]
var citeKeyTitleBannedRe = /\b(a|an|the|some|from|on|in|to|of|do|with|der|die|das|ein|eine|einer|eines|einem|einen|un|une|la|le|l\'|el|las|los|al|uno|una|unos|unas|de|des|del|d\')(\s+|\b)/g;
This improved the 'friendliness factor' of a significant number of my
BibTeX keys (300 out of 1800), with no regressions or side effects.
Also, the export was about 15% quicker than before the patch. I'd be
glad for some others to test it and comment, and of course I think
it's worth applying to the code.
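
For anyone who wants to see the difference without applying the patch,
here is a minimal sketch that runs both regexes over the two example
titles. The helper firstTitleWord is hypothetical and only approximates
how the title part of a key is picked (lowercase, strip banned words,
take the first remaining word); the real logic in BibTeX.js may differ
in its details.

```javascript
// The regex from the release version of BibTeX.js, as quoted above.
var oldBannedRe = /(\s+|\b)(a|an|from|does|how|it\'s|its|on|some|the|this|why)(\s+|\b)/g;

// The regex proposed in the patch: leading \b only, longer word list.
var newBannedRe = /\b(a|an|the|some|from|on|in|to|of|do|with|der|die|das|ein|eine|einer|eines|einem|einen|un|une|la|le|l\'|el|las|los|al|uno|una|unos|unas|de|des|del|d\')(\s+|\b)/g;

// Hypothetical helper (not the actual BibTeX.js code): returns the first
// title word that survives banned-word stripping.
function firstTitleWord(title, bannedRe) {
  var stripped = title.toLowerCase().replace(bannedRe, "");
  // Split on runs of whitespace so that doubled spaces cannot yield
  // an empty first element.
  var words = stripped.split(/\s+/).filter(function (w) {
    return w.length > 0;
  });
  return words.length > 0 ? words[0] : "";
}

console.log(firstTitleWord("The Man from Nantucket", oldBannedRe)); // mannantucket
console.log(firstTitleWord("The Man from Nantucket", newBannedRe)); // man
console.log(firstTitleWord("Der Alptraum", oldBannedRe));           // der
console.log(firstTitleWord("Der Alptraum", newBannedRe));           // alptraum
```

The old regex eats the space on both sides of "from", which is what
glues "man" and "nantucket" together; the new one only consumes the
trailing whitespace.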
The long-term, i18n-friendly option would be to put the relevant part
of the above regex in a user-configurable preference.
(Someday I'd love to see an "Advanced BibTeX Export" page in the UI.)
That way if you want to add Italian, you can tack this on to the end:
|lo|i|gli|dei|delle|degli|di|della|al|alle|ed|dell\'|all\'|un\'
and, presto, you're Italian.
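
To make the preference idea concrete, here is a rough sketch of how the
regex could be assembled from word lists instead of being hard-coded.
Everything here is an assumption for illustration: no such preference
exists in Zotero, and the names DEFAULT_BANNED, ITALIAN_EXTRAS,
buildBannedRe and firstTitleWord are made up.

```javascript
// Word list from the patched regex; in this sketch it stands in for a
// (hypothetical) user preference.
var DEFAULT_BANNED = [
  "a", "an", "the", "some", "from", "on", "in", "to", "of", "do", "with",
  "der", "die", "das", "ein", "eine", "einer", "eines", "einem", "einen",
  "un", "une", "la", "le", "l'", "el", "las", "los", "al",
  "uno", "una", "unos", "unas", "de", "des", "del", "d'"
];

// The Italian additions suggested above.
var ITALIAN_EXTRAS = [
  "lo", "i", "gli", "dei", "delle", "degli", "di", "della",
  "al", "alle", "ed", "dell'", "all'", "un'"
];

// Escape regex metacharacters so user-supplied words are matched literally.
function escapeForRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a regex with the same shape as the patched one:
// word boundary, one banned word, then trailing whitespace or boundary.
function buildBannedRe(words) {
  var alternatives = words.map(escapeForRegex).join("|");
  return new RegExp("\\b(" + alternatives + ")(\\s+|\\b)", "g");
}

// Hypothetical helper: first title word left after stripping banned words.
function firstTitleWord(title, bannedRe) {
  var words = title.toLowerCase().replace(bannedRe, "").split(/\s+/)
    .filter(function (w) { return w.length > 0; });
  return words.length > 0 ? words[0] : "";
}

var defaultRe = buildBannedRe(DEFAULT_BANNED);
var italianRe = buildBannedRe(DEFAULT_BANNED.concat(ITALIAN_EXTRAS));

console.log(firstTitleWord("I promessi sposi", defaultRe)); // i
console.log(firstTitleWord("I promessi sposi", italianRe)); // promessi
```

Escaping each word is a design choice that keeps a stray "." or "(" in
a user's list from changing the meaning of the whole regex.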
REMAINING ISSUES:
Two small and one ugly.
I'd be glad if these didn't hold up the application of the patch! The
first two of these occur in a pretty small number of keys and might be
an easy fix for someone more versed in JavaScript. The last is
probably a bigger problem.
(a) The stripping regexes don't remove a colon, so for now, we still
get:
TITLE: Dads: Whither American Fatherhood?
AUTHOR: Smith
DATE: 2010
---:> BIBTEX KEY: smith_dads:_2010
                            ^
This is legal BibTeX, but not what one would expect. It should be
simple to fix, but I couldn't figure it out.
(b) I am still not entirely rid of backticks in my keys:
TITLE: `This' and `That'
---:> BIBTEX KEY: smith_`this_2010
This may have been fixed by the recent backtick fixing patch.
(c) All non-ASCII letters are removed from the keys. This is safe
and valid, but ugly and unfriendly.
title = {Tradiciones rabínicas en el Nuevo Testamento},
author = {Miguel Pérez Fernández},
year = {2002},
CURRENT KEY = prez_fernndez_tradiciones_2002,
NICER WOULD BE = perez_fernandez_tradiciones_2002
title = {La distinción},
author = {Domingo Muñoz León},
year = {1992},
CURRENT KEY = muoz_len_distincin_1992,
NICER KEY = munoz_leon_distincion_1992
I suppose that a relatively simple substitution would catch most
cases, replacing the most common letter+diacritic combinations with
ASCII transliterations (ü -> ue, etc.).
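
As a starting point for whoever picks this up, here is a sketch of one
way such a substitution could look. It is not taken from BibTeX.js: the
function name cleanKeyPart and the small SPECIAL_CASES table are
hypothetical, and it assumes a JavaScript engine that provides
String.prototype.normalize. It is meant to be applied to each part of
the key (author, title word) separately, before the parts are joined
with underscores. As a side effect, the final step also drops the colon
from (a) and the backtick from (b).

```javascript
// Multi-letter transliterations applied before the generic pass.
// Illustrative and German-oriented only; a real table would need more
// entries and probably some thought about language.
var SPECIAL_CASES = {
  "\u00e4": "ae", // a with diaeresis
  "\u00f6": "oe", // o with diaeresis
  "\u00fc": "ue", // u with diaeresis
  "\u00df": "ss"  // sharp s
};

// Hypothetical helper: reduce one key component to [a-z0-9].
function cleanKeyPart(text) {
  return text
    .toLowerCase()
    .normalize("NFC") // compose first so the table lookup sees single letters
    .replace(/[\u00e4\u00f6\u00fc\u00df]/g, function (ch) {
      return SPECIAL_CASES[ch];
    })
    .normalize("NFD")                 // split base letters from combining marks
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .replace(/[^a-z0-9]/g, "");       // drop punctuation such as ":" and "`"
}

console.log(cleanKeyPart("P\u00e9rez"));       // perez
console.log(cleanKeyPart("Mu\u00f1oz"));       // munoz
console.log(cleanKeyPart("distinci\u00f3n"));  // distincion
console.log(cleanKeyPart("Dads:"));            // dads
console.log(cleanKeyPart("`This"));            // this
```

One caveat with the "ue" rule: it is right for German but not for a
Spanish word that happens to contain a u with diaeresis, so the table
is a judgment call rather than a universal answer.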
I realize the core dev team likely have bigger fish to fry. I just
add this note here as documentation of the problem in case someone
wants to take it on. My JavaScript is so far limited to adjusting
regexes.
Scot
Footnote:
[1] The original English list (a|an|from|does|how|it\'s|its|on|some|
the|this|why) seemed haphazard. I modified it to include only the
most common English prepositions whose use I judge to be more
grammatical than lexical, keeping also the helping verb 'do' for the
same reason. It's still subjective, but I think more predictable, and
a little more comprehensive. I compared the differences across my
1800-item library, and I invariably liked the keys better with this
list.