Now this (a developer willing to submit patches who effectively cannot)
is not good. What's the solution to this problem?
Bruce
For the record, I'm not pretending that it is something that should be
integrated "as is" into Zotero. It should certainly not replace the
current Bibtex filter for one thing and my solution of creating an
entirely separate filter is not elegant. Quite likely there is a better
way to support this.
Right. We probably need two separate prefs--say,
zotero.extensions.exportUnicodeBibTeX and exportUnicodeRIS (both enabled
by default?)--and some code in translate.js to 1) override the character
set when using these formats with the pref enabled and 2) put some flag
into the sandbox so the BibTeX translator knows not to try to replace
Unicode characters in the places indicated by this patch.
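A minimal sketch of how that pref-driven override might hang together (the pref names, return shape, and the translator-facing flag are illustrative guesses, not Zotero's actual API; the real wiring would live in translate.js):

```javascript
// Hedged sketch of the proposed pref-driven character-set override.
// Pref names and the sandbox flag are illustrative, not Zotero's real API.
const prefs = {
  "extensions.zotero.export.exportUnicodeBibTeX": true, // proposed default?
  "extensions.zotero.export.exportUnicodeRIS": true,
};

function charsetForExport(format) {
  // Map each affected format to its Unicode pref; any other format keeps
  // whatever charset the translator asked for (signalled here as null).
  const prefByFormat = {
    BibTeX: "extensions.zotero.export.exportUnicodeBibTeX",
    RIS: "extensions.zotero.export.exportUnicodeRIS",
  };
  const pref = prefByFormat[format];
  if (pref === undefined) return { charset: null, unicodeFlag: false };
  const unicode = prefs[pref];
  return {
    charset: unicode ? "UTF-8" : "ISO-8859-1",
    // The flag the sandbox would expose so the BibTeX translator knows
    // whether to escape characters above U+007F itself.
    unicodeFlag: unicode,
  };
}

console.log(charsetForExport("BibTeX").charset); // "UTF-8" with default prefs
```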
If someone wants to take care of this, take the ticket
(https://www.zotero.org/trac/ticket/749) or post here.
Many thanks to Julian for the patch.
Some notes and questions:
1) What's the more reasonable default setting for outputting UTF-8, on
or off? Julian's patch has it off, and Rick recommends off in the forums.
2) I know it was in the original code, but rather than manually
replacing \u0080-\uFFFF, I think we can just add an else {
Zotero.setCharacterSet("latin1"); } if Zotero.useBibtexUTF8 is false and
it'll replace out-of-bounds characters with question marks
automatically. I'm not sure what character set it's trying to use when
UTF-8 isn't set explicitly, but as it is now an unmapped (Chinese)
character gets mangled in the output file.
3) When the setting is off, we may need to do something smarter with
multibyte characters in keys to keep them valid (since I suspect
question marks aren't valid). Or maybe even when the setting is on--do
UTF-8-aware implementations handle keys with Unicode characters?
4) The useBibtexUTF8 setting doesn't seem to have an effect on Quick
Copy, which just uses Unicode regardless. Quick Copy for BibTeX is
handled by fileInterface.js::exportItemsToClipboard() and may not
trigger the same stream code in translate.js that allows
setCharacterSet() to work. Simon should be able to fix this if it's not
clear.
Indeed. Thanks Julian.
> Some notes and questions:
>
> 1) What's the more reasonable default setting for outputting UTF-8, on
> or off? Julian's patch has it off, and Rick recommends off in the forums.
I recommend it on. I submit that distributions of TeX which can support
UTF8 are in the majority even if users don't realize it. I myself made
that discovery by chance about a year ago. And I'm not running some
strange setup either: the OS is Ubuntu 7.10 and the TeX distribution is
TeXLive 2007 which is produced by the TeX User Group (not marginal by
any means). When I made the discovery I was running Debian sid and the
TeX distribution was a TeXLive release prior to 2007 but I don't
remember what precise release it was.
Moreover, if the exported file is UTF8, it can easily be processed by
other tools (grep, Python/Perl scripts, etc.) and it can be trivially
indexed, etc.
> 2) I know it was in the original code, but rather than manually
> replacing \u0080-\uFFFF, I think we can just add an else {
> Zotero.setCharacterSet("latin1"); } if Zotero.useBibtexUTF8 is false and
> it'll replace out-of-bounds characters with question marks
> automatically. I'm not sure what character set it's trying to use when
> UTF-8 isn't set explicitly, but as it is now an unmapped (Chinese)
> character gets mangled in the output file.
Here's a related thought: I'd like Zotero to warn the user that some
characters are not representable in the output format. If programmatic
facilities are lacking to do this in a semi-automatic way, a test is
required for the presence of characters in the \u0080-\uFFFF range which
cannot be converted to BibTeX's dialect.
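One way such a test might look (the escape table here is a tiny illustrative sample, not the translator's real mapping table, and the function name is made up):

```javascript
// Hedged sketch: flag characters in the U+0080-U+FFFF range that have no
// known TeX escape, so the user could be warned before export. The table
// below is a tiny sample, not the BibTeX translator's full mapping.
const texEscapes = {
  "\u00e9": "\\'e",   // é
  "\u00fc": '\\"u',   // ü
  "\u00df": "\\ss{}", // ß
};

function unrepresentableChars(text) {
  const found = new Set();
  for (const ch of text) {
    const code = ch.codePointAt(0);
    if (code >= 0x80 && code <= 0xffff && !(ch in texEscapes)) {
      found.add(ch);
    }
  }
  return [...found];
}
```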
> 3) When the setting is off, we may need to do something smarter with
> multibyte characters in keys to keep them valid (since I suspect
> question marks aren't valid). Or maybe even when the setting is on--do
> UTF-8-aware implementations handle keys with Unicode characters?
To my knowledge, keys are not supposed to contain anything other than a
subset of the characters representable in ASCII (I don't know the
precise subset), no matter which TeX distribution is used. I've done a
quick search and found nothing to confirm or refute my hypothesis.
Emacs' bibtex.el does operate on the assumption that a key must be
representable in ASCII but I did not find a reference to a reliable
source in that file.
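For what it's worth, a conservative validity check along those lines might look like this (the accepted character set is a guess at a safe subset, not a documented BibTeX rule):

```javascript
// Hedged sketch: conservative check that a cite key stays within a safe
// subset of printable ASCII. The exact legal set is not well documented;
// this deliberately excludes BibTeX-significant punctuation such as
// braces, commas, quotes, # and %.
function isSafeCiteKey(key) {
  return /^[A-Za-z0-9:._+-]+$/.test(key);
}
```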
I have nothing to say on point 4.
Thanks,
Louis
On Thu, 2007-12-13 at 08:22 -0800, Richard Karnesky wrote:
> > I recommend it on. I submit that distributions of TeX which can support
> > UTF8 are in the majority even if users don't realize it. I myself made
> > that discovery by chance about a year ago. And I'm not running some
> > strange setup either: the OS is Ubuntu 7.10 and the TeX distribution is
> > TeXLive 2007 which is produced by the TeX User Group (not marginal by
> > any means).
>
> I don't think this single datapoint is enough to base the decision
> on. TeXLive is available for all platforms, however it is not the
> dominant TeX distribution (nor representative of what is deployed).
You counter my impression about the popularity of TeXLive with your own
impression of its popularity, but we are still at the level of
impressions.
> Further: those that use older versions of TeXLive would not have the
> same experience you are having.
True, but the question is not whether such people exist but how
representative they are of the people who would want to use Zotero.
> I believe you are finding it easy to
> use UTF-8 because, beginning with version 2007, TeXLive shipped with
> XeTeX (which handles native unicode (although still not perfectly)).
XeTeX is included in TeXLive 2007 but I do not use XeTeX. For Ubuntu,
TeXLive is divided into several smaller packages: the one which contains
XeTeX (named texlive-xetex) is not installed on my machine. And the
mere inclusion of XeTeX in the distribution does not mean that
everything else has been changed. Moreover, if you go back to my
original post, you'll see that TeXLive started supporting
Unicode-encoded BibTeX before the 2007 release.
So, no, XeTeX is not a factor.
> Also note that arXiv & other popular/online/public TeX users are
> confined to ISO Latin 1.
What on earth are "popular/online/public TeX users"?
arXiv is an example of a site that is not at all designed to respond to
the needs of the Humanities. I've never used it before because it does
not cater to my needs but I've done a few searches and found problems
(can't do searches with diacritics, articles indexed without diacritics,
etc.). If those guys claimed to cover the Humanities by any means,
they'd get some flak.
In the grand scheme of things, the important thing is to have the options
of producing UTF8 or a BibTeX file in BibTeX's native format. What the
default is set to is quite secondary. RefWorks only outputs BibTeX in
UTF8, which is problematic for those with old setups. Connotea is
buggy: it outputs a mixture of TeX-coded accents and UTF8 for accents it
seems unable to handle. JSTOR is buggy: if you ask for a BibTeX record,
it strips all accents.
Ciao,
Louis
...
> arXiv is an example of a site that is not at all designed to respond to
> the needs of the Humanities.
Ahem, TeX and BibTeX are "not at all designed to respond to the needs of
the Humanities" ;-). It's a system designed for mathematicians and
scientists. So I'd say arXiv is likely representative.
> In the grand scheme of things the important thing is to have the options
> of producing UTF8 or a BibTeX file in bibtex's native format. What the
> default is set to is quite secondary.
Correct.
You make a good point that the preference here should take into account
the desires of Zotero BibTeX users, rather than worrying about the entire
spectrum of BibTeX users.
I suggest people who care voice their preference, and in the absence of
a clear consensus, Dan just make an executive decision.
I'm not going to vote per se since I don't use BibTeX.
Bruce
That's what I've been using since I switched back to TeX/LaTeX in the
middle of last year. And it is after that that I found that BibTeX
would accept UTF8 files. My guess is that people who don't need to use
UTF8 don't know about \usepackage[utf8]{inputenc}.
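For reference, a minimal document along those lines (assuming a reasonably recent LaTeX whose inputenc ships the utf8 option):

```latex
% Minimal UTF-8 LaTeX document; requires an inputenc with the utf8
% option, standard in recent distributions such as TeXLive 2007.
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}
Frau M\"uller met Frau Müller.  % escaped and literal forms coexist
\end{document}
```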
Before that I was using Omega/Lambda and Aleph/Lamed, which both proved
utterly complicated to use and unstable. The straw that broke the
camel's back was the inability to include Chinese and transliterated
Sanskrit in the same paper: I could get one or the other but not both at
the same time. There's probably some incantation somewhere that could
do it but I did a few tests with LaTeX, saw that I could do it fairly
easily and happily ditched Lambda and Lamed.
> Or do you do something else to make it work? (This is only to satisfy
> my own curiosity--it sounds as if you're using eTeX/pdfTeX & I think
> they need to be prodded like this to accept utf-8.)
In the latest TeXLive, latex resolves to pdftex.
> > What on earth are "popular/online/public TeX users"?
>
> Not all LaTeX is compiled on personal desktop machines & so it is
> important to look at the limitations of some of the places you may
> submit zotero-produced content to.
I see.
>
> I said:
> > I don't know what other reference managers do to citekeys in UTF-8 export.
>
> I'm not a frequent RefWorks user, but I do have a testing account. Is
> it the case that the only BibTeX keys that can be assigned are the
> RefWorks ID #s? If I'm missing the way to have more useful keys, how
> do they handle UTF-8 in keys?
They don't handle UTF8 in keys. They only generate keys that correspond
to RefWork IDs. I've checked for a preference somewhere that would
change that but I have not found it. I could have missed something
though. So if anybody knows otherwise, please correct me. That RefWorks
ID scheme is not user-friendly at all. I'm tolerating it right now
because, for historical (or hysterical) reasons which are totally my own
fault, I'm managing my references in a half-broken way, but I think for my
next paper I will make some changes in the way I manage my
bibliographies, and then the RefWorks scheme will become an obstacle that
I will have to deal with somehow (unless I switch to Zotero in the
meantime).
> [Note that recent versions of bibtex and latex do not consider these
> to be invalid characters. Indeed, you shouldn't even need an inputenc
> since it is in a control sequence--try entering it in by hand instead
> of allowing bibtex-mode to do it for you.]
Good to know.
> It is hard to say what is the right thing to do: I personally find it
> more usable to transliterate to ASCII (since I type them out on a US
> keyboard). However, I know that some consider this "bastardization"
> to be bad & may prefer it to be blanked (as JabRef does) or kept in
> UTF-8.
I know there's been some discussion about key generation being
customizable to some extent. Maybe "convert key to ascii" should be
part of the options?
But this brings up other issues. Let's assume a key is generated from
the author name and title of the work. Let's also assume that those
contain Chinese characters. How do you convert to ASCII? It is
possible in theory to do it automatically but I can see two problems:
1) Incorporating the functionality into a Firefox plugin could make it
beefier than desirable. I've written a Java library to access the
Unihan database. Java library and database are 4.4Mb in a jar file (most
of this is the database).
2) Several Chinese characters (probably the majority of them) have
multiple possible pronunciations. Taking the first pronunciation listed
often works but not always. And that's just assuming Mandarin, but if a
scholar is referring to articles written by Cantonese scholars, then
Cantonese pronunciation might be in order. (Both Mandarin and Cantonese
are in Unihan.)
I've never been able to include Chinese directly into a BibTeX file so I
don't know how urgent this kind of support is.
Ciao,
Louis
...
> I know there's been some discussion about key generation being
> customizable to some extent. Maybe "convert key to ascii" should be
> part of the options?
>
> But this brings up other issues. Let's assume a key is generated from
> the author name and title of the work. Let's also assume that those
> contain Chinese characters. How do you convert to ASCII? It is
> possible in theory to do it automatically but I can see two problems:
>
> 1) Incorporating the functionality into a Firefox plugin could make it
> beefier than desirable. I've written a Java library to access the
> Unihan database. Java library and database are 4.4Mb in a jar file (most
> of this is the database).
...
Yikes. That sounds like an awful lot of complication to support what
is, in essence, legacy technology (pre-Unicode BibTeX).
Keep in mind that LuaTeX is on the near horizon (final release scheduled
for next summer IIRC), which has support for unicode and OpenType
out-of-box. It also embeds Lua, which leaves room, say, for plugging in
a more modern BibTeX replacement to the LuaTeX core (say one that uses
CSL to configure styles ;-)).
Also, I think Dan mentioned they're working on allowing user-defined
keys as an option.
That doesn't help you now, but it's just to say perhaps better to look
towards the unicode future rather than worry too much about supporting
the limitations of ASCII?
Bruce
Not overnight, but in the same way that while there are still desktop
applications and OSes that don't have real unicode support, I think it
will further contribute to the momentum that says that unicode is the
norm, and ascii and other such encodings are legacy.
E.g., at some point it will be time for the "some publishers" to get
with the 21st century.
Of course, the only publishers I've ever dealt with want nothing to do
with TeX, and tend to insist on .doc files (though sometimes I can get
the reasonable tech person who will make exceptions and accept, say, XHTML).
> Given the apparent complexities of transliterating CJK, it seems more
> sensible to either leave out a field from the key (as JabRef does) or
> to just use the UTF-8 (and link to a tool in the FAQ to strip these
> for those times where the "unicode future" is impossible to achieve).
Yeah.
Bruce
It looks like I distracted us unnecessarily with the whole
which-should-be-the-default question, though the info was helpful
regardless. Simon has committed an update to Julian's patch that removes
the global pref in favor of a runtime option that should retain the
last-used setting. The translator now also uses only ASCII in cite keys,
regardless of the mode. (We can revisit this later if we find support
for UTF-8 cite keys in other software, and of course customizable keys
are still planned.) Finally, BibTeX imports are now UTF-8 only, meaning
that extended characters in files encoded as ISO-8859-1 won't import
correctly unless mapped to their ASCII BibTeX representations.
Unfortunately the updated translator, with the enhanced mapping tables,
now may be too large to reliably import via Firefox 2 mozStorage, so
we're not pushing it to existing clients until Zotero 1.5 (which will
require Firefox 3). The updated translator can be downloaded from Trac
(https://www.zotero.org/trac/raw-attachment/ticket/749/bibtex-translator.sql)
and installed into the Zotero DB with the SQLite command line client
(e.g., sqlite3 zotero.sqlite < bibtex-translator.sql). Be sure to close
Firefox and make a backup of the DB before importing.
- Dan