Another BibTeX export fix [PATCH]

Becker

Jun 5, 2010, 7:43:10 AM
to zotero-dev
Executive summary:

This patch to BibTeX.js:
http://groups.google.com/group/zotero-dev/web/sb-bibtex-patch.txt

Fixes the two problems I reported here:
http://forums.zotero.org/discussion/12816/bug-bibtex-key-generation/#Item_4

Note that the patch is to the release version of BibTeX.js, not the
development version, but I think the recent patches to that version
don't touch these lines. If you need me to get the dev version and
generate a patch on that, I can.

For detail, read on...
PROBLEMS:

1. The "banned words" regex in BibTeX.js slurps the whitespace both
before and after the words it extracts, which caused some title words
to be concatenated when producing a BibTeX key.

TITLE: The Man from Nantucket
AUTHOR: Smith
DATE: 2010

---:> BIBTEX KEY: smith_mannantucket_2010
---:> EXPECTED: smith_man_2010

2. The same regex only includes English articles and a small (and
somewhat haphazard?) collection of excluded words.

var citeKeyTitleBannedRe = /(\s+|\b)(a|an|from|does|how|it\'s|its|on|some|the|this|why)(\s+|\b)/g;

This leads to non-descriptive keys like this:

TITLE: Der Alptraum
AUTHOR: Schmidt
DATE: 2010

---:> BIBTEX KEY: schmidt_der_2010
---:> BETTER KEY: schmidt_alptraum_2010

I had a lot of entries in my bibliography with keys that included
_der_, _le_, _la_, etc.

SOLUTION:

Easy. Both (1) and (2) can be fixed by adjusting this one regex: (1)
by replacing the (\s+|\b) at the beginning with a plain \b, so that
only a word boundary is matched, and (2) by extending the options for
banned words. (I also added a slight adjustment to a later 'split'
function, in case multiple spaces slip through.)

This extends coverage to French, Spanish and German articles and the
common French/Spanish particle 'de'. It also refines the English list
to include only articles, common prepositions, and the helping verb
'do'. [1]

var citeKeyTitleBannedRe = /\b(a|an|the|some|from|on|in|to|of|do|with|der|die|das|ein|eine|einer|eines|einem|einen|un|une|la|le|l\'|el|las|los|al|uno|una|unos|unas|de|des|del|d\')(\s+|\b)/g;
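
To compare the two regexes side by side, here is a minimal sketch. It
assumes the title word is chosen by lowercasing the title, deleting
banned words, and taking the first remaining token. The helper name
firstTitleWord is made up for illustration and is not the function
used in BibTeX.js.

```javascript
// Old and new banned-word regexes, copied from this message.
const oldBannedRe = /(\s+|\b)(a|an|from|does|how|it\'s|its|on|some|the|this|why)(\s+|\b)/g;
const newBannedRe = /\b(a|an|the|some|from|on|in|to|of|do|with|der|die|das|ein|eine|einer|eines|einem|einen|un|une|la|le|l\'|el|las|los|al|uno|una|unos|unas|de|des|del|d\')(\s+|\b)/g;

// Hypothetical helper: first title word left after removing banned words.
function firstTitleWord(title, bannedRe) {
  return title
    .toLowerCase()
    .replace(bannedRe, "")
    .trim()
    .split(/\s+/)[0]; // split on runs of whitespace, in case several spaces remain
}

console.log(firstTitleWord("The Man from Nantucket", oldBannedRe)); // mannantucket
console.log(firstTitleWord("The Man from Nantucket", newBannedRe)); // man
console.log(firstTitleWord("Der Alptraum", oldBannedRe));           // der
console.log(firstTitleWord("Der Alptraum", newBannedRe));           // alptraum
```

The old regex eats the space on both sides of "from", which glues
"man" and "nantucket" together. The new one only consumes trailing
whitespace, so the words stay separate.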

This improved the 'friendliness factor' of a significant number of my
BibTeX keys (300 out of 1800), with no regressions or side effects.
Also, the export was about 15% quicker than before the patch. I'd be
glad for some others to test it and comment, and of course I think
it's worth applying to the code.

Of course the long-term, i18n-friendly option would be to put the
relevant part of the above regex in a user-configurable preference.
(Someday I'd love to see an "Advanced BibTeX Export" page in the UI.)
That way if you want to add Italian, you can tack this on to the end:

|lo|i|gli|dei|delle|degli|di|della|al|alle|ed|dell\'|all\'|un\'

and, presto, you're Italian.
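
A rough sketch of how such a preference could work, assuming the
banned words are stored as a plain list and the regex is built from it
at export time. The names defaultBannedWords, italianWords,
escapeRegExp and buildBannedRe are hypothetical; nothing like this
exists in BibTeX.js, and the default list is shortened here.

```javascript
// Hypothetical (shortened) default list and a user-supplied Italian extension.
const defaultBannedWords = ["a", "an", "the", "der", "die", "das", "le", "la", "l'", "de"];
const italianWords = ["lo", "i", "gli", "dei", "delle", "degli", "di", "della",
                      "al", "alle", "ed", "dell'", "all'", "un'"];

// Escape regex metacharacters so that each word is matched literally.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a banned-word regex of the same shape as the patched one:
// a word boundary, one of the words, then trailing whitespace or a boundary.
function buildBannedRe(words) {
  const alternatives = words.map(escapeRegExp).join("|");
  return new RegExp("\\b(" + alternatives + ")(\\s+|\\b)", "g");
}

const bannedRe = buildBannedRe(defaultBannedWords.concat(italianWords));
console.log("gli anni di piombo".replace(bannedRe, "")); // anni piombo
```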

REMAINING ISSUES:
Two small and one ugly.

I'd be glad if these didn't hold up the application of the patch! The
first two of these occur in a pretty small number of keys and might be
an easy fix for someone more versed in JavaScript. The last is
probably a bigger problem.

(a) The stripping regexes don't remove a colon, so for now, we still
get:

TITLE: Dads: Whither American Fatherhood?
AUTHOR: Smith
DATE: 2010

---:> BIBTEX KEY: smith_dads:_2010
                            ^
This is legal BibTeX, but not expected. It should be simple to fix,
but I couldn't figure it out.


(b) I am still not entirely rid of backticks in my keys:

TITLE: `This' and `That'

---:> BIBTEX KEY: smith_`this_2010

This may have been fixed by the recent backtick fixing patch.
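
Issues (a) and (b) both come from punctuation surviving in the chosen
title word. One possible approach, sketched here under the assumption
that a key part should contain only ASCII letters, digits, hyphens and
underscores, is to strip everything else before assembling the key.
cleanKeyPart is a hypothetical helper, not code from the patch.

```javascript
// Hypothetical helper: keep only lowercase ASCII letters, digits, "_" and "-".
function cleanKeyPart(word) {
  return word.toLowerCase().replace(/[^a-z0-9_-]/g, "");
}

console.log(cleanKeyPart("Dads:"));  // dads
console.log(cleanKeyPart("`This'")); // this
```

Note that this also drops accented letters, which is exactly issue
(c), so any transliteration step would have to run before it.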


(c) All non-ascii letters are removed from the keys. This is safe
and valid, but ugly and unfriendly.

title = {Tradiciones rabínicas en el Nuevo Testamento},
author = {Miguel Pérez Fernández},
year = {2002},
CURRENT KEY = prez_fernndez_tradiciones_2002,
NICER WOULD BE = perez_fernandez_tradiciones_2002

title = {La distinción},
author = {Domingo Muñoz León},
year = {1992},
CURRENT KEY = muoz_len_distincin_1992,
NICER KEY = munoz_leon_distincion_1992

I suppose that a relatively simple substitution would catch most
cases, replacing the most common letter+diacritical combinations with
ASCII transliterations (ü -> ue, etc.).
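
A minimal sketch of such a substitution, assuming a small lookup table
and input in precomposed form (one code point per accented letter).
The table asciiMap and the helper toAsciiKeyPart are hypothetical and
deliberately incomplete; they only cover the examples in this thread.

```javascript
// Hypothetical, incomplete table of letter+diacritic replacements.
const asciiMap = new Map([
  ["ä", "ae"], ["ö", "oe"], ["ü", "ue"],
  ["á", "a"], ["é", "e"], ["í", "i"], ["ó", "o"], ["ú", "u"], ["ñ", "n"],
]);

// Replace mapped characters; drop any other non-ASCII character,
// which is what happens to all of them today.
function toAsciiKeyPart(text) {
  return text
    .toLowerCase()
    .replace(/[^\x00-\x7F]/g, (ch) => (asciiMap.has(ch) ? asciiMap.get(ch) : ""));
}

console.log(toAsciiKeyPart("Pérez"));      // perez
console.log(toAsciiKeyPart("Fernández"));  // fernandez
console.log(toAsciiKeyPart("Muñoz"));      // munoz
console.log(toAsciiKeyPart("distinción")); // distincion
```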

I realize the core dev team likely have bigger fish to fry. I just
add this note here as documentation of the problem in case someone
wants to take it on. My JavaScript is so far limited to adjusting
regexes.

Scot





Footnote:

[1] The original English list
(a|an|from|does|how|it\'s|its|on|some|the|this|why) seemed haphazard.
I modified it to include only the most common English prepositions
whose use I judge to be more grammatical than lexical, keeping also
the helping verb 'do' for the same reason. It's still subjective, but
I think more predictable, and a little more comprehensive. I compared
the differences in my 1800-item library, and I invariably liked the
keys better with this list.

Becker

Jun 11, 2010, 10:31:25 PM
to zotero-dev
Patch updated.

Now it is made against the current svn version of
translators/BibTeX.js (which it should have been before). It includes
a few further code comments and a fix for one further bug in the LaTeX
output: Zotero's "# of pages" field was exported to the "pages" field
in the item type @book. This is not what some BibTeX styles expect;
they treat it as if it were a page range. Discussion here:

http://forums.zotero.org/discussion/10603/bibtex-import-book-with-field-pages/
http://forums.zotero.org/discussion/13021/does-your-style-say-omit-the-month-for-journals-if-you-have-an-issue-number/#Comment_63930
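
As a rough illustration of the "pages" problem, the sketch below
assumes a simplified item object with a pages property (a page range)
and a numPages property (a page count). Both helper functions are
hypothetical models of the behaviour described here and do not
reproduce the code in BibTeX.js.

```javascript
// Model of the behaviour described above: a book's page count ends up in "pages".
function pagesFieldBefore(item) {
  return item.pages || item.numPages || null;
}

// Sketch of the proposed behaviour: only a real page range is exported.
function pagesFieldAfter(item) {
  return item.pages || null;
}

const book = { itemType: "book", numPages: "312" };
const chapter = { itemType: "bookSection", pages: "45-67" };

console.log(pagesFieldBefore(book));   // 312  (a style may print this as a page range)
console.log(pagesFieldAfter(book));    // null (field omitted)
console.log(pagesFieldAfter(chapter)); // 45-67
```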

The patch can be found here:
http://github.com/commonman/zotero-bibtex-sb/raw/master/proposed-for-stock-zotero/BibTeX-patch.txt
Or if you'd rather, a dropin version (same thing, but with a unique
translatorID and label, for easy testing without replacing the
existing BibTeX.js):
http://github.com/commonman/zotero-bibtex-sb/raw/master/proposed-for-stock-zotero/BibTeX-proposed.js

You can see the differences between the output files produced for a
small data set by diff'ing the .bib output files for --proposed and
--stockexport in this directory:
http://github.com/commonman/zotero-bibtex-sb/tree/master/reference

(the BibTeX-lowfat output is for a more assertive set of changes,
which I'm not proposing here)

Cheers,

Scot

Avram Lyon

Jun 13, 2010, 6:53:01 PM
to zotero-dev
This looks pretty good to me as a fix for some known issues,
specifically unwanted "pages" entries which are prone to
misinterpretation and the choice of title words for citation keys.
I've tried it, and it looks like it works as advertised. Is there any
reason this smaller patch can't be committed?

Avram

Becker

Jun 15, 2010, 8:05:35 AM
to zotero-dev
Avram,

Thanks very much for testing this. Can I trouble you to have a look
at a slightly enhanced version? In the meantime I was able to fix the
problem I listed above as (c).

The patch in the first link below includes the previous one, but adds
one small feature, also affecting only auto-generated BibTeX keys.
Instead of removing all accented letters from the BibTeX key (as we do
now), it first converts non-English accented characters to ASCII
equivalents (mostly just stripping the accents, but umlauts get
converted to 'ae', 'oe', and 'ue'). This results in friendlier keys
for non-English items in Latin script. They don't have missing
letters, but they are still ASCII-safe.

So instead of keys like these:

@article{mnguez_potica_1980,
title = {Poética generativa del Magnificat.},
author = {Dionisio Mínguez},
year = {1980}
}

@book{steinfhrer_magnificat_1908,
title = {Das magnificat},
author = {Wilhelm Steinführer},
year = {1908}
}

We get keys like these:
minguez_poetica_1980
steinfuehrer_magnificat_1908


Everything else is the same. This is the whole patch, including that
given above:

http://github.com/commonman/zotero-bibtex-sb/raw/master/proposed-for-stock-zotero/BibTeX.diff

--------------------
And here is just the new part if you'd rather:
http://github.com/commonman/zotero-bibtex-sb/raw/master/proposed-for-stock-zotero/BibTeX-patch-tidy-accents.diff
(Note that the file names the patch wants to apply to are wrong.
Still, you can apply it on top of the one given above or on its own,
as you wish.)

As before, you can try the drop-in version (with a new GUID and name,
so it doesn't conflict). It is also in the repo at:
http://github.com/commonman/zotero-bibtex-sb/raw/master/proposed-for-stock-zotero/BibTeX-proposed-dropin.js

Scot


Richard Karnesky

Jun 15, 2010, 11:01:54 AM
to zotero-dev
All of these fixes seem good to me. More accented characters,
UTF-8 ligatures, etc. could probably be mapped, but I guess you got
most of the common ones. Should Unicode escapes (\uXXXX) be used, as
they are in mappingTable, for any reason (easier to spot characters to
add, possibly more robust against accidental changes of the encoding
of that file, etc.)?
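
For illustration, a table of this kind can be written with \uXXXX
escapes so that the source file itself stays pure ASCII. This is only
a sketch with a hypothetical accentMap and tidyAccents helper; it is
not the mappingTable from BibTeX.js and covers just a few characters.

```javascript
// Hypothetical table written with Unicode escapes; comments show the characters.
const accentMap = {
  "\u00E4": "ae", // ä
  "\u00F6": "oe", // ö
  "\u00FC": "ue", // ü
  "\u00E9": "e",  // é
  "\u00ED": "i",  // í
};

// Replace mapped characters and drop any other non-ASCII character.
function tidyAccents(text) {
  return text.replace(/[\u0080-\uFFFF]/g, (ch) =>
    Object.prototype.hasOwnProperty.call(accentMap, ch) ? accentMap[ch] : ""
  );
}

console.log(tidyAccents("steinf\u00FChrer")); // steinfuehrer
console.log(tidyAccents("m\u00EDnguez"));     // minguez
console.log(tidyAccents("po\u00E9tica"));     // poetica
```

Once parsed, an escape and the literal character are the same string,
so the choice only affects how the file reads and how it survives a
change of encoding.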

--Rick