updates to RIS.js to support better import from endnote exports

jonathan.morgan

Nov 20, 2010, 7:27:46 PM
to zotero-dev
Hello,

I am migrating from EndNote to Zotero, and in the process, I've been
updating the RIS.js file to handle idiosyncrasies in EndNote's RIS
export.

Would you all like me to at least upload the JS file so you can
see the changes? This might be better implemented as an
EndNote-specific RIS import.

Changes:
- added function to detect MIME type from file name extension.
- added function to parse filename from URL or file system path.
- updated doImport() so there is a dataArray as well as a data string.
For the L1 tag, it behaves the same as before (creates a
space-delimited string) but also creates an array where each line is
its own string, so it can import all attachments.
- created processLinkTag() function where:
  - then, based on tag,
    - if UR - sets item.url, pushes value onto item.attachments.
    - if L1 - loops over data array passed in, processing each
      item separately. Derives link type based on protocol prefix
      ("http://" or "https://" = URL; "file://" = file; "internal-pdf://" =
      EndNote import file), looks up MIME type, parses file name from
      string, then depending on type adds attachment to item passed in with
      derived values (so PDF files get their name as name instead of "Full
      Text (PDF)", and other types of document get correct MIME type - Word
      documents, Excel documents, etc.).
    - if L2 or L4 - same as before.
- updated processTag() so it calls processLinkTag() for UR, L1, L2,
L4.
- Added variables where you can tell it to use AB as abstract field
(since EndNote places abstract in AB), and your file path to the PDF
folder in EndNote (so it can substitute that into each internal-pdf).
These should probably be parameterized somehow, not sure how to do
that.
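
For anyone following along, the link handling in the list above might
look roughly like this. It is only a sketch: the function names
(getLinkType, parseFileName, getMimeType) and the short extension table
are made up for illustration and are not taken from the actual RIS.js.

```javascript
// Hypothetical extension table; a real one would cover many more types.
var MIME_BY_EXTENSION = {
    pdf: "application/pdf",
    doc: "application/msword",
    xls: "application/vnd.ms-excel",
    html: "text/html"
};

// Classify a link by its protocol prefix, as described in the post.
function getLinkType(link) {
    if (/^https?:\/\//i.test(link)) { return "url"; }
    if (/^file:\/\//i.test(link)) { return "file"; }
    if (/^internal-pdf:\/\//i.test(link)) { return "endnote"; }
    return "unknown";
}

// Take everything after the last "/" or "\", dropping any query string
// or fragment first.
function parseFileName(pathOrUrl) {
    var withoutQuery = pathOrUrl.split(/[?#]/)[0];
    var parts = withoutQuery.split(/[\/\\]/);
    return parts[parts.length - 1];
}

// Look up a MIME type from the file name extension; null if unknown.
function getMimeType(fileName) {
    var dot = fileName.lastIndexOf(".");
    if (dot < 0) { return null; }
    var ext = fileName.substring(dot + 1).toLowerCase();
    return MIME_BY_EXTENSION.hasOwnProperty(ext) ? MIME_BY_EXTENSION[ext] : null;
}
```

Returning null for an unknown extension leaves it to the caller to decide
whether to attach the file without a MIME type or to skip it.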

If this might be of value, let me know, and I'd be glad to work
through submitting it.

Thanks,

Jonathan Morgan

Avram Lyon

Nov 20, 2010, 8:07:13 PM
to zoter...@googlegroups.com
Please do upload your proposed changes. We've generally taken the
position that RIS import should be based on the standard, but there's
clearly a need for EndNote support. It is possible to specify
translator options in the JSON header if necessary, so maybe some of
what you need can be enabled that way.

But by all means post the code, and we'll figure out how to best
integrate into Zotero.

- Avram

jonathan.morgan

Nov 20, 2010, 8:39:54 PM
to zotero-dev
For some reason, now that it actually knows of and can upload many PDF
files (I attached 6 to the sole record in my import test, and it
worked once), it causes the browser to go out to lunch 9 times out of
ten (the other 9 or so times I tried the import).

I might have to pull the actual trunk out of SVN and look at how files
are copied in import.

Should I still upload the code while it has this problem? If so, do I
just upload the whole file, or some sort of patch (if patch, please
let me know where to go to read up on that) in the Files section?

I looked over the Refman standard, and some of this is in the
standard, but different - you should be able to process multiple links
from one L1 tag, but the standard says they should be in a semi-colon
delimited list on one line instead of on multiple lines. In the
EndNote output style editor, I couldn't find a way to output a list
with a delimiter instead of on multiple lines. So at the least, it
should support multi-item L1, but having them delimited by newlines
instead of ";" is non-standard.
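
One way to accept both forms would be to split every L1 line on
semicolons and collect the pieces into one list. This is a sketch under
that assumption; splitLinkValues is a made-up name, and it would
mis-split any link that itself contains a ";".

```javascript
// Accept L1 values either as one semicolon-delimited line (per the
// standard) or as several lines (as EndNote writes them).
function splitLinkValues(lines) {
    var links = [];
    for (var i = 0; i < lines.length; i++) {
        var pieces = lines[i].split(";");
        for (var j = 0; j < pieces.length; j++) {
            var link = pieces[j].trim();
            if (link) { links.push(link); }   // skip empty pieces
        }
    }
    return links;
}
```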

Thanks,

Jon

jonathan.morgan

Nov 20, 2010, 9:19:30 PM
to zotero-dev
This works on my machine for a single reference with up to about 5
related files (I stepped up through to 5 and at 5 it started dying),
so it will probably work for all but 6 or 7 of my 500 references.
Pretty good. Let me know if I should upload it in this state.

Thanks,

Jon

Avram Lyon

Nov 21, 2010, 7:11:08 AM
to zoter...@googlegroups.com
Since the Files section is closing soon, it's probably best to upload
it to GitHub in a repository or as a public Gist and post the link
here.

Avram

jonathan.morgan

Nov 21, 2010, 5:07:54 PM
to zotero-dev
OK. The updated RIS.js is here: https://github.com/jonathanmorgan/jsm-zotero/blob/master/RIS.js

I also did a little digging in the EndNote RIS output style to see
what is actually in the article types I am using in my library. The
Excel (2011...) spreadsheet with the results is here:
https://github.com/jonathanmorgan/jsm-zotero/blob/master/RIS_spec_overview.xlsx

It includes:
- tab where I went through and marked all the tags in the spec, plus
tags in EndNote export, with notes on how each is used.
- tab with the valid TY values from the RIS spec.
- tab with notes (URLs to RIS spec and latest version of Refman
manual, which is more recent, and has RIS spec in it).
- tab that has columns for RIS field, EndNote field, EndNote reference
type, and whether the field is in the spec or not for all items output
in the EndNote RIS output format (EndNote X4) for Book, Book Section,
Conference Paper, Edited Book, Electronic Article, Electronic Book,
Generic, Journal Article, Magazine Article, Report, Web Page, and
Unpublished Work. You can sort by RIS field to see all the EndNote
fields that are put in each RIS field. You can sort by EndNote field
to see all the RIS fields each EndNote field is placed in, and see
which reference type does what. And each has the in-spec flag so you
can see which RIS "fields" are in the spec and which are not.

It looks like the RIS spec is pretty outdated, so it might make sense
to have a translator for EndNote-extended RIS (there are fields in
EndNote that have corresponding fields in Zotero but no standard RIS
field to hold the value, like DOI and call number), and then maybe a
special EndNote RIS output style that is cleaned and homogenized to
make the mapping from EndNote field to RIS field more consistent
across reference types (they are all over the map for some things).

I'll keep updating RIS.js as I find and fix problems (in mine, for
some of these common ones, I am going to add the non-standard fields
to my copy of the javascript file just to get the values over, and for
some, I'll just make them separate additional N1 fields in the EndNote
output format).

And this is all based on the output formats in X4, so I'm not sure
what it looks like in X3's output format or earlier.

Jon


Bruce D'Arcus

Nov 21, 2010, 5:22:42 PM
to zoter...@googlegroups.com

Just a random thought:

It occurs to me it might be helpful to publicly document a mapping
table for these different formats. Probably the authoritative source
would have to be the C-based source for bibutils, which has content
like:

static lookups article[] = {
{ "AU", "AUTHOR", PERSON, LEVEL_MAIN },
{ "A1", "AUTHOR", PERSON, LEVEL_MAIN },
{ "A2", "AUTHOR", PERSON, LEVEL_HOST },
{ "A3", "AUTHOR", PERSON, LEVEL_SERIES },
{ "ED", "EDITOR", PERSON, LEVEL_HOST },
{ "PY", "PARTYEAR", DATE, LEVEL_MAIN },
{ "Y1", "PARTYEAR", DATE, LEVEL_MAIN },
{ "Y2", "PARTMONTH", SIMPLE, LEVEL_MAIN },
{ "SN", "SERIALNUMBER", SERIALNO,LEVEL_HOST },
{ "TI", "TITLE", TITLE, LEVEL_MAIN },
{ "T1", "TITLE", TITLE, LEVEL_MAIN },
{ "T2", "SHORTTITLE", SIMPLE, LEVEL_HOST },
{ "T3", "TITLE", SIMPLE, LEVEL_SERIES },
{ "JO", "TITLE", SIMPLE, LEVEL_HOST }, /* JOURNAL */
......

jonathan.morgan

Nov 23, 2010, 12:39:41 PM
to zotero-dev
Hello,

As I've worked through this, I think I am going to make a more
flexible framework for all this in my RIS.js. I'll keep updating the
copy in my git repository as I go, and if you want to make use of any
of this, please let me know. If I make this framework-y, then I'll
really only have to write the majority of this once (it will be my
Thanksgiving holiday relaxation!) and it will make it easy to adjust
later, and to translate this to other formats should I encounter
others that frustrate me.

Once I get this implemented, I'll have the mappings you are talking
about (though some will be more complicated than a simple data
structure like the one above can accommodate, I think) and so I should
be more ready to talk about the mapping you suggest, which is a good
idea.

If you have any suggestions for ways I could alter this to make it
reusable for others or for Zotero, please let me know. I appreciate that
this is moving beyond the RIS standard, but I just want to get all
information possible from EndNote to Zotero (it is particularly
annoying to me that EndNote's data model places groups/tags in a
separate place from reference, so you can't get at a given reference's
groups/tags in the export process).

I am going to try to make this so standard RIS imports correctly, even
while this framework will aim to support the entirety of EndNote's
idiosyncrasies (still a reason, I think, for Zoterans to consider
having a separate import translation for EndNote RIS if it ends up
that EndNote's use of standard parts of the RIS spec is non-
standard). If that is the case, it would make me nervous leaving this
RIS.js in place for standard imports (though I'd probably do that and
watch with interest to see if the "standard" RIS exports of various
databases are really truly standard).

Jonathan


Here is what I'm planning on doing:

--------------------------------------------------
Plan for redo of Zotero RIS.js:
--------------------------------------------------

to do:
- pull all RIS tag mappings for each EndNote reference type supported
by the EndNote RIS output style out of the style, put them into the
spreadsheet so I can go EndNote field by EndNote field and see true
damage.
- EDBOOK type is not accounted for in import (and looks like it
doesn't fit the pattern in RIS.js, which maps zotero item type to
EndNote reference type, and so can't deal with two EndNote types
mapping to same Zotero type). Need to rewrite the mapping of EndNote
reference type to Zotero item type so the EndNote value is the key,
references value of associated Zotero type, since multiple EndNote
types can map to one Zotero type. Will need to keep original EndNote
type string around, too, so we can reference it during processing
(since EDBOOK and BOOK have different output, for instance).

Import:
- Create an ImportField object for each RIS property that holds:
  - inType - type of input mapping - either "direct" (get direct
    mapping to Zotero item property from inMapping) or "function" (get
    function pointer of function used to process RIS property from
    inFunction).
  - inMapping - if mapping between RIS field and Zotero item
    property is direct (inType = "direct"), name of associated Zotero item
    property.
  - inFunction - if mapping between RIS field and Zotero item is
    complicated (inType = "function"), so if different for different
    types, or lots of translation required, then this reference stores a
    function pointer to a function with a standard signature (use
    signature of function processLinkTag( item_IN, tag_IN, value_IN,
    valueArray_IN )) that accepts item, tag name, value, and an optional
    value array, uses those values to appropriately process incoming tag,
    place the results in the item passed in.
  - isInSpec - boolean variable, set to true if RIS property is from
    actual spec, false if not (maybe have a source, too?).
  - all library functions re-used across multiple RIS tags.
  - method to accept item_IN, tag_IN, value_IN, valueArray_IN and
    deal with the internal configuration of the ImportField, return
    item_IN with the RIS tag appropriately processed.

- Create a map of RIS properties to ImportField objects for each RIS
property where you create and initialize an ImportField instance per
RIS property. These mappings are where the meat of the import will
live.

- Make a function that works as doImport does now - go line by line,
grabbing tags and building up their values. When it comes time to
process a tag, though, this method goes to the tag-to-ImportField map
and lets the ImportField for that tag handle processing.
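
To make the import plan above a bit more concrete, here is a rough
sketch of what an ImportField could look like. The property names
(inType, inMapping, inFunction, isInSpec) and the handler signature come
from the plan; the constructor shape, the process() method name, the
processKeywordTag handler and the two example mappings are assumptions
for illustration only.

```javascript
// Sketch of the planned ImportField object (constructor shape assumed).
function ImportField(config) {
    this.inType = config.inType;          // "direct" or "function"
    this.inMapping = config.inMapping;    // item property name when "direct"
    this.inFunction = config.inFunction;  // handler when "function"
    this.isInSpec = !!config.isInSpec;    // true if the tag is in the RIS spec
}

// Apply this field's configuration to one tag and return the item.
ImportField.prototype.process = function (item_IN, tag_IN, value_IN, valueArray_IN) {
    if (this.inType === "direct") {
        item_IN[this.inMapping] = value_IN;
    } else if (this.inType === "function") {
        this.inFunction(item_IN, tag_IN, value_IN, valueArray_IN);
    } else {
        throw new Error("Unknown inType for tag " + tag_IN + ": " + this.inType);
    }
    return item_IN;
};

// Hypothetical handler using the shared signature: each keyword
// becomes one tag on the item.
function processKeywordTag(item_IN, tag_IN, value_IN, valueArray_IN) {
    item_IN.tags = (item_IN.tags || []).concat(valueArray_IN || [value_IN]);
}

// Illustrative tag-to-ImportField map with one mapping of each kind.
var tagToImportField = {
    T1: new ImportField({ inType: "direct", inMapping: "title", isInSpec: true }),
    KW: new ImportField({ inType: "function", inFunction: processKeywordTag, isInSpec: true })
};
```

With this shape, adding support for a new tag means adding one entry to
the map rather than another branch in the import loop.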

Export:
- Have an ExportField object for each Zotero item property that holds:
  - outputBuffer -
  - outType - type of output mapping - either "direct" (get direct
    mapping to RIS property/tag name from outMapping) or "function" (get
    function pointer of function used to process Zotero output item from
    outFunction).
  - outMapping - if mapping between Zotero item property and RIS
    field is direct (outType = "direct"), name of associated RIS tag.
  - outFunction - if mapping between Zotero item property and RIS
    field is complicated (outType = "function"), so if different for
    different types, or lots of translation required, then this reference
    stores a function pointer to a function with a standard signature
    (just accepts item_IN?) that accepts item, appropriately renders
    export tag(s), stores the result in output buffer.

- Create a map of Zotero item properties to ExportField instances for
each supported Zotero item property. This is where the meat of the
export will live.

- Make a function to process output - will need to see how the export
works now, make sure to not lose any of the logic there (I've been
focusing on import thus far).
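
The export side of the plan could mirror the import side. The sketch
below is only a guess at the shape: outType, outMapping and outFunction
come from the plan, but render(), formatRisLine(), renderItem() and the
two example mappings are hypothetical, and it returns strings instead of
writing to an outputBuffer to keep it short. It also assumes RIS lines
are written as the tag, two spaces, a hyphen, a space, the value, and a
CRLF line ending.

```javascript
// Sketch of the planned ExportField object (constructor shape assumed).
function ExportField(config) {
    this.outType = config.outType;          // "direct" or "function"
    this.outMapping = config.outMapping;    // RIS tag when "direct"
    this.outFunction = config.outFunction;  // renderer when "function"
}

// Format one RIS line (assumed layout: TAG, two spaces, "- ", value, CRLF).
function formatRisLine(tag, value) {
    return tag + "  - " + value + "\r\n";
}

// Render the RIS line(s) for one item property.
ExportField.prototype.render = function (item_IN, propertyName) {
    if (this.outType === "direct") {
        return formatRisLine(this.outMapping, item_IN[propertyName]);
    }
    return this.outFunction(item_IN);
};

// Illustrative property-to-ExportField map with one mapping of each kind.
var propertyToExportField = {
    title: new ExportField({ outType: "direct", outMapping: "T1" }),
    tags: new ExportField({
        outType: "function",
        outFunction: function (item_IN) {
            return (item_IN.tags || []).map(function (tag) {
                return formatRisLine("KW", tag);
            }).join("");
        }
    })
};

// Walk the map and render every property the item actually has.
function renderItem(item) {
    var out = "";
    for (var prop in propertyToExportField) {
        if (propertyToExportField.hasOwnProperty(prop) && item[prop] !== undefined) {
            out += propertyToExportField[prop].render(item, prop);
        }
    }
    return out;
}
```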

Bruce D'Arcus

Nov 23, 2010, 1:01:15 PM
to zoter...@googlegroups.com
On Tue, Nov 23, 2010 at 7:39 AM, jonathan.morgan
<jonathan....@gmail.com> wrote:
> Hello,
>
> As I've worked through this, I think I am going to make a more
> flexible framework for all this in my RIS.js.

A couple of little details (I haven't looked at the rest, and haven't
really followed the conversation):

First, on GitHub, it will automatically render markdown documents if
you give it an appropriate extension (IIRC, both "mdml" and "markdown"
should work).

Second, it's not super convenient longer-term to dump documentation in an
Excel file. FWIW, every project wiki on GH is also its own repo of
markdown files. That can be a place to put documentation in some
cases.

Bruce

jonathan.morgan

Nov 23, 2010, 3:17:31 PM
to zotero-dev
I can do that, but since I am currently analyzing and assessing the
different ways EndNote uses RIS tags across fields and across
reference types (different reference types use a particular RIS tag
differently, for example, but there are essentially higher-level
classes of reference types that use sets of RIS tags the same within
the class, but that vary across the classes) I need to be able to sort
on columns, so I can continue to cross-compare which EndNote fields
are placed in which different RIS tags, and which RIS tags behave
differently based on the reference type in EndNote. And I need to be
able to keep adding columns, so I can capture and sort on different
classes of reference types.

I am not too familiar with mdml or markdown. Do they allow
hierarchical sorting on multiple columns and rudimentary querying? If
not, I think for now I personally need to stick to an Excel document
so I can understand the data, then I can then make static views in a
markup language. Once I have a handle on the data, know which views
make sense, then I could start making more static markup.

Long-term, I could just mark up the different sorted permutations of
the spreadsheet for different needs. Is this what you are
suggesting? For now, though, I am essentially creating a
database in the Excel document and using that for my assessment,
analysis and design, and I'd rather get this working (so I can use my
reference library with Word 2011 ASAP), then document.

Does this sound reasonable?

Thanks for the feedback!

Jon

Avram Lyon

Nov 29, 2010, 9:38:01 AM
to zoter...@googlegroups.com
I think that Bruce's concerns with using Excel were mainly because
Excel isn't so great for sharing and tracking the development of a
document, especially in a cross-platform manner.

That said, you should certainly use whatever technology makes the job
of understanding the morass of EndNote exports easier. When you've
worked it out, it'd be great if you could document it. That
documentation would hopefully be in some format that is easy to share
and maintain, lest you get stuck with maintaining the RIS export
documentation in perpetuity. Thus, that format could be something
based on plain text.

I look forward to your help explaining EndNote's exports, and to
improving interoperability.

Best wishes,

Avram

jonathan.morgan

Nov 29, 2010, 3:22:03 PM
to zotero-dev
Understood. I'll definitely document once I understand better what is
going on (and I think it will be interesting to have it all laid out
there for people). You and Bruce both have actually been quite
welcoming and helpful, and I just wanted you to know I appreciate it,
also!

I did more work over Thanksgiving, and at some point we'll have to
look at what it takes to get EndNote RIS in, see if that needs to be a
separate thing from standard RIS, so we don't make EndNote import work
but as a result break standard RIS import or make RIS export into a
non-standard RIS. I'm trying to stick with what you said about not
breaking the standard - it should at least import and populate
standard fields per the standard correctly, then we can decide if we
want to use the EndNote non-standard fields to export stuff that has
come along since RIS standard was last updated.

I added in all output formats to that spreadsheet over the weekend,
and it is interesting to see the number of non-RIS-standard fields
added by EndNote, and the amount of variation in some fields,
specifically when dealing with the difference between a section of
something and the whole something. A section or chapter of something
essentially has two sets of authors/editors. The way Zotero's data
model works is good for this: it just has lots of authors with
different types, so you can have editors and authors on a given book
excerpt. I am trying to wrap my head around whether EndNote's RIS is
consistent in the way it moves the primary editor of an edited book to a
secondary author position for a chapter, excerpt or section, so you at
least can say that for a section of a book, the section author is in X
and the editors are in Y.

At any rate, hoping to get the javascript updated for just the record
types in my library as a test of framework today or tomorrow, and I'll
upload that javascript as soon as I have it working, so you can check
it out.

Thanks,

Jon

Bruce D'Arcus

Nov 29, 2010, 3:45:15 PM
to zoter...@googlegroups.com
Yes, just to clarify ...

On Mon, Nov 29, 2010 at 10:22 AM, jonathan.morgan
<jonathan....@gmail.com> wrote:
> Understood.  I'll definitely document once I understand better what is
> going on (and I think it will be interesting to have it all laid out
> there for people).  You and Bruce both have actually been quite
> welcoming and helpful, and I just wanted you to know I appreciate it,
> also!

Avram's interpretation of my comments was correct, and I have the
same take as he does (if Excel is useful for you now, no problem).

At some point I guess I'm just shooting for extracting the mappings
from bibutils' source into something human readable (a nice table),
and then working from that. That is, the hard work is actually the logic
of the mappings, not the parsing code.

BTW, there are two classes of issues with EndNote RIS export:

1) bugs (the infamous KW bug where multiple keywords get a single tag)
2) mappings

Bruce

Kieren Diment

Nov 29, 2010, 9:01:11 PM
to zoter...@googlegroups.com
Sorry to be annoying, but while you're there is there any chance you can look at importing the pdfs from internal-pdf:// links? I think just expecting the PDF storage file to be in the subdirectory below the ris file being imported would be fine.

If you can't do this or don't want to do this yourself, is there any chance you can mark it as a TODO in the appropriate place in the code you write please?

jonathan.morgan

Nov 30, 2010, 7:56:05 AM
to zotero-dev
I just checked in refactored RIS.js code to my little git repository
(https://github.com/jonathanmorgan/jsm-zotero/blob/master/RIS.js).

The import is more modular and compartmentalized now, so it isn't just
a giant if-then-else cascade.

There is an object named ImportField that holds an import type
("direct" or "function") and then either the name of the item property
into which the RIS field value should be placed, or a function pointer
to the javascript function that should be called to process each tag.

I abstracted out the common code across the import routines into a
small set of javascript functions. There were a number of places
where the exact same code had been cut and pasted - those have now
been broken out into functions that are called from each place where
they had formerly been pasted.

Each RIS field that is supported in the import is in a map named
risFieldToImportFieldMap that maps incoming RIS field to an instance
of ImportField. So, there is now one map you can go to to see all
fields being processed, and how they are being processed, either being
mapped directly to a field in item, or being passed to a handler
function.

All handler functions are named process<something>. They are in
alphabetical order starting at the top of the file. Some just process
one tag, some process classes of tags where multiple tags are similar.

The doImport() method now simply loops over the lines and for each
line, grabs the ImportField instance for the current tag, and calls
the processImportField() method on each instance. This method looks
at each ImportField's internal variables and processes the current
line of the file accordingly. Unknown fields are added to the record
as notes, where the note is prepended with the string "Unknown tag
<RIS_tag>: ". This way you don't lose any import data, even if the
import doesn't support a given non-standard RIS tag.
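
In outline, the loop described above might look like the sketch below.
It is not the code from the repository: importRecord and tagHandlers are
stand-in names, the handlers are plain functions instead of ImportField
instances, and the line pattern assumes the usual RIS layout of a
two-character tag, two spaces, a hyphen and a space.

```javascript
// Stand-ins for the entries in risFieldToImportFieldMap.
var tagHandlers = {
    TY: function (item, value) { item.risType = value; },
    T1: function (item, value) { item.title = value; },
    AB: function (item, value) { item.abstractNote = value; }
};

// Parse one RIS record and return the resulting item.
function importRecord(text) {
    var item = { notes: [] };
    var tag = null;
    var value = "";

    // Hand the tag collected so far to its handler, or keep it as a note.
    function flush() {
        if (tag === null) { return; }
        if (tagHandlers.hasOwnProperty(tag)) {
            tagHandlers[tag](item, value);
        } else {
            item.notes.push({ note: "Unknown tag " + tag + ": " + value });
        }
    }

    var lines = text.split(/\r?\n/);
    for (var i = 0; i < lines.length; i++) {
        var match = /^([A-Z][A-Z0-9])  - ?(.*)$/.exec(lines[i]);
        if (match) {
            flush();                       // previous tag is complete
            tag = match[1];
            value = match[2];
            if (tag === "ER") {            // end of record
                tag = null;
                break;
            }
        } else if (tag !== null && lines[i].trim()) {
            value += " " + lines[i].trim();   // continuation line
        }
    }
    flush();
    return item;
}
```

Because unknown tags fall through to a note, a record with
non-standard EndNote fields still imports with all of its data visible.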

I made a number of modifications to better support EndNote's version
of RIS (I'll have to go back to my notes to find them all, perhaps
tomorrow when I am not bleary-eyed). Most of this still works as it
did for fields in the spec, but the import now knows about additional
fields. I did override the spec behavior for abstracts since EndNote
insists on putting abstract in AB instead of N2. I also brought the
parts of the spec where multiple values could be in semi-colon
delimited lists up to spec, so the import can handle multiple items on
a line where it should.

For PDFs (and all file attachments - EndNote allows many file types),
this code works as follows:
- there is a variable where you set your PDF root folder,
ENRIS_internalPDFPath.
- for each file, the import uses this path to transform internal-
pdf:// into an absolute file:// path.
- the code parses out the file name from the path and uses it to look
up MIME type based on file extension (an additional function at the
bottom that is probably too comprehensive, but I figured better to be
comprehensive than to not know the MIME type for literate Haskell
files).
- It then attaches the external file with its actual name and MIME
type to the record (so files keep their names on import and Zotero
knows the difference between Word docs and PDFs, for example).
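
The path substitution step in the list above could be sketched like
this. ENRIS_internalPDFPath is the variable named in the post, but the
value shown is invented, resolveInternalPdf is a made-up name, and the
percent-encoding of path segments is my own choice rather than something
the post describes. It assumes a POSIX-style root folder.

```javascript
// Assumed value; the post only says this holds your PDF root folder.
var ENRIS_internalPDFPath = "/Users/me/Documents/My EndNote Library.Data/PDF/";

// Turn an internal-pdf:// link into an absolute file:// URL.
// Links with any other prefix are returned unchanged.
function resolveInternalPdf(link, rootPath) {
    var prefix = "internal-pdf://";
    if (link.indexOf(prefix) !== 0) { return link; }
    var relative = link.substring(prefix.length);
    var root = rootPath.replace(/\/+$/, "") + "/";   // exactly one trailing slash
    // Encode each path segment so spaces and similar characters are
    // valid in a URL, while keeping the "/" separators.
    var encoded = (root + relative).split("/").map(encodeURIComponent).join("/");
    return "file://" + encoded;
}
```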

You could default this path variable to whatever you want, but you
could also make it a property somewhere (I would have done this if I
knew how). Be warned - if you have references with many PDFs (4 or
5), these references can make the Zotero import go out to lunch
forever. If you have lots of PDFs attached to a single reference, I'd
probably strip them out and add them in once you've imported the
reference itself.

I also changed the way that incoming RIS reference types are mapped to
native zotero reference types, so that multiple RIS types can map to a
single Zotero type - so EDBOOK and BOOK can both map to Zotero's
"book" reference type. These mappings are in inputTypeMap. I added
in all incoming reference types contained in the EndNote RIS output
style, tried to figure out mappings, but that is going to need some
more eyes (I added comments of each's descriptive name next to its
mapping so you can see the problems - check it out if you get a
chance). One apparent omission in Zotero's types is any Zotero
reference type for data, datasets, or databases.
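
As a small illustration of the many-to-one mapping, keyed by the
incoming RIS type: inputTypeMap is the name used above, but the entries
shown, the mapReferenceType helper and the "document" fallback are
assumptions for this sketch, not a copy of the real table.

```javascript
// Keyed by RIS type, so several RIS types can share one Zotero type.
var inputTypeMap = {
    BOOK: "book",
    EDBOOK: "book",          // edited book: same Zotero type as BOOK
    CHAP: "bookSection",
    JOUR: "journalArticle"
};

// Return both the original RIS type and the mapped Zotero type, so
// later processing can still tell EDBOOK from BOOK.
function mapReferenceType(risType) {
    var key = risType.trim().toUpperCase();
    return {
        risType: key,
        // Assumed fallback for types that are not in the map.
        itemType: inputTypeMap.hasOwnProperty(key) ? inputTypeMap[key] : "document"
    };
}
```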

I tested this on relatively small numbers of references in one file.
I am going to test it more thoroughly tomorrow on larger sets of
references, and I'll let you know how it goes. I'll probably still be
making minor adjustments to mappings based on input reference type

I am trying to think up a clever and efficient way to get groups
(EndNote's tag system) out of EndNote (perhaps some type of output
style that just outputs internal IDs, so I can grab list of ids for
each group...?). Not sure how I'd work that into Zotero, though (I
couldn't get the JavaScript that makes a pop-up prompt box to work
from inside RIS.js in the few minutes I fiddled with it). I think I'll
have to just rebuild my tags over time.

I haven't mucked with export at all (well, other than to decouple
import from export, so changes to import won't affect export yet),
since I am already a little uncomfortable with making such substantial
changes without really being a part of this community.

I should also apologize in advance for my variable naming conventions
- I always put "_IN" on the end of parameters passed into functions
and "_OUT" on a given routine's return reference, so I can keep them
separate from internal variables. I also adjusted the curly-braces so
they line up to the left.

Bruce, all the existing logic is intact, just moved into functions so
that it is no longer a big long sprawl of conditional code. Sorry to
just go and overhaul this whole thing, but if nothing else, this
should make it easier for you all to see the discrete changes I've
made, and migrate them as you see fit.

Let me know if I can help you make use of this, and I'll try to get
all the stuff in this spreadsheet marked up and more friendly to the
eyes in the next few days.

Thanks,

Jonathan

jonathan.morgan

Dec 5, 2010, 4:15:01 PM
to zotero-dev
Hello again,

Bruce, how can I help get this RIS documentation in a form that is
useful for you? In trying to figure out authors, it has proven
helpful for me to be able to look at a list of all EndNote reference
types, the RIS tags used in export, and the EndNote fields to which
they map, sorted by reference type, then by RIS tag; and then to be
able to look at the same list, resorted by RIS tag, then by EndNote
reference type, so I can see the way EndNote uses a given tag across
reference types; and then to be able to look at the same list, sorted
by EndNote field, then by RIS tag, to detect variations in export of a
given EndNote field across reference types. I could make these views
into HTML (though since they are really the same data and there are
1793 rows, they'd probably get out of sync pretty quickly if the
output style changes).

I also have a tab in the spreadsheet where each RIS tag has a series
of columns indicating whether it is in the true spec (old and outdated
though it may be), used by EndNote, and then a description, link to
the page where it is specified in the RefMan spec, etc. I could make
this HTML as well.

Part of the usefulness is the sortability, but you could always keep a
copy of the spreadsheet around for that.

A few questions:
- in the EDBOOK type exported from EndNote, the AU field is used for
editors, not for authors, but "EDBOOK" maps to "book" just the same as
"BOOK" does (and before I rewrote the mappings, it just mapped to
generic since "BOOK" took up the one slot for mapping to "book" in
zotero). To address this, I started storing the original reference
type in the item along with the zotero item type, and then for an
author, in the section for books, I check if EDBOOK, and if EDBOOK, I
make AU create editors instead of authors. Does this sound about
right?
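
In code, the EDBOOK check described above amounts to something like
this sketch. The risType property name and the addCreatorFromAU function
are hypothetical; the name splitting assumes the usual "Last, First"
form and ignores suffixes.

```javascript
// Add one AU value to the item as a creator. For EDBOOK records the
// AU tag holds editors, so the creator type is switched.
function addCreatorFromAU(item, name) {
    var creatorType = (item.risType === "EDBOOK") ? "editor" : "author";
    var parts = name.split(",");
    item.creators = item.creators || [];
    item.creators.push({
        lastName: parts[0].trim(),
        firstName: parts.length > 1 ? parts[1].trim() : "",
        creatorType: creatorType
    });
    return item;
}
```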

- Authors are different for book sections, too - EndNote uses the A3
tag (series author, standard part of Refman spec) to hold the series
editor in these reference types: Conference Proceedings, Book Section,
Audiovisual Material, Serial, Electronic Book Section; but uses A2 for
series editors in these types: Book, Computer Program, Edited Book,
Report, Map, Web Page, Online Multimedia, Classical Work. I updated it so it
knows about the A3 field, but that is going to be a problem, too,
because A3 has the following uses:


RIS tag  EndNote field         EndNote Reference Type
=======  ====================  =======================
A3       Advisor               Thesis
A3       Higher Court          Case
A3       Illustrator           Blog
A3       International Author  Patent
A3       Producer              Manuscript-2
A3       Publisher             Report
A3       Series Author         Electronic Book
A3       Series Editor         Conference Proceedings
A3       Series Editor         Book Section
A3       Series Editor         Audiovisual Material
A3       Series Editor         Serial
A3       Series Editor         Electronic Book Section
A3       Tertiary Author       Generic
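
One way to handle a tag whose meaning depends on the reference type is a two-level lookup keyed by RIS tag and then by the original RIS type. The following is only a rough sketch, not the code in RIS.js: the table name, the function name, and the "contributor" fallback are assumptions made for illustration, and only two rows (EDBOOK and CHAP) are filled in.

```javascript
// Hypothetical table: RIS tag -> original RIS type -> Zotero creator type.
// "*" is the default used when no type-specific rule exists.
const CREATOR_ROLE_BY_TAG_AND_TYPE = {
  AU: { EDBOOK: "editor", "*": "author" },
  A3: { CHAP: "seriesEditor", "*": "contributor" }, // fallback is an assumption
};

// Returns the creator type for a tag in the context of a reference type,
// or null when the tag is not a creator tag this table knows about.
function creatorRole(tag, risType) {
  const hasOwn = Object.prototype.hasOwnProperty;
  if (!hasOwn.call(CREATOR_ROLE_BY_TAG_AND_TYPE, tag)) return null;
  const byType = CREATOR_ROLE_BY_TAG_AND_TYPE[tag];
  return hasOwn.call(byType, risType) ? byType[risType] : byType["*"];
}
```

With this shape, storing the original reference type on the item is what makes the lookup possible: AU in an EDBOOK record resolves to an editor, while AU in a BOOK record still resolves to an author.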

Also if it runs into RIS tags it doesn't know, it makes a note and
stores the information, and if it runs into parsing errors, I'm making
a little function that will also append the value as a note, with
details on where it was intended to go and what broke, so at least
data doesn't get lost. The version in my little git repository will
be updated later today to include everything I need to implement for
my references.

I am about to the point where I can bust my references out of EndNote
(you can assign a tag to multiple records easily in zotero! I can
even get groups migrated in an hour or so!), so please let me know if
there is a chance this code could be integrated, and if so, how I can
help. If not, I'll probably just put a post on the zotero forums to
let people know they are free to play with it if they want to and I'll
help if I can, but I don't want to implement and maintain a forked
version of the RIS importer. If I'm going to stay involved, I'd
rather see if I can help make the trunk one work better within the way
you want it implemented, or help work on a better interoperability
dialect (and I think the importers would benefit from using a
framework at least similar to the one in this file that separates
overall control flow from mapping logic, instead of having a list of
if-then-else statements that is overall control flow with mapping
logic nested inside, for it to be easily and reliably maintained over
time - it is risky having the control flow for the import process so
tightly coupled to processing of the mappings - better to have the two
split out, so changes to mappings don't inherently also involve
changes to control flow that need to be tested, as well).
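
To illustrate the separation, here is a minimal sketch (not the actual RIS.js code) in which the mapping is plain data and the loop never names a specific tag. `FIELD_BY_TAG` and `applyTags` are hypothetical names, the item is a plain object standing in for the translator's item, and the handful of tag-to-field pairs are only examples.

```javascript
// Mapping logic: pure data, editable without touching the loop below.
const FIELD_BY_TAG = {
  TI: "title",
  T2: "publicationTitle",
  VL: "volume",
  IS: "issue",
  SP: "pages",
};

// Control flow: walks [tag, value] pairs and consults the mapping.
// Tags with no mapping are kept as notes so that no data is lost.
function applyTags(pairs, mapping) {
  const item = { notes: [] };
  for (const [tag, value] of pairs) {
    if (Object.prototype.hasOwnProperty.call(mapping, tag)) {
      item[mapping[tag]] = value;
    } else {
      item.notes.push({ note: "Unmapped RIS tag " + tag + ": " + value });
    }
  }
  return item;
}
```

Supporting another dialect would then mean passing a different mapping object to `applyTags`, and the loop itself would not need retesting.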

Thanks,

Jonathan

Bruce D'Arcus

Dec 5, 2010, 4:53:46 PM12/5/10
to zoter...@googlegroups.com
On Sun, Dec 5, 2010 at 11:15 AM, jonathan.morgan
<jonathan....@gmail.com> wrote:
> Hello again,
>
> Bruce, how can I help get this RIS documentation in a form that is
> useful for you?

This isn't so much for me personally. But ...

> In trying to figure out authors, it has proven
> helpful for me to be able to look at a list of all EndNote reference
> types, the RIS tags used in export, and the EndNote fields to which
> they map, sorted by reference type, then by RIS tag; and then to be
> able to look at the same list, resorted by RIS tag, then by EndNote
> reference type, so I can see the way EndNote uses a given tag across
> reference types; and then to be able to look at the same list, sorted
> by EndNote field, then by RIS tag, to detect variations in export of a
> given EndNote field across reference types.  I could make these views
> into HTML (though since they are really the same data and there are
> 1793 rows, they'd probably get out of sync pretty quickly if the
> output style changes).
>
> I also have a tab in the spreadsheet where each RIS tag has a series
> of columns indicating whether it is in the true spec (old and outdated
> though it may be), used by EndNote, and then a description, link to
> the page where it is specified in the RefMan spec, etc.  I could make
> this HTML as well.
>
> Part of the usefulness is the sortability, but you could always keep a
> copy of the spreadsheet around for that.
>
> A few questions:
> - in the EDBOOK type exported from EndNote, the AU field is used for
> editors, not for authors,

Is that because EndNote doesn't support authors for that type?

> but "EDBOOK" maps to "book" just the same as
> "BOOK" does (and before I rewrote the mappings, it just mapped to
> generic since "BOOK" took up the one slot for mapping to "book" in
> zotero).  To address this, I started storing the original reference
> type in the item along with the zotero item type, and then for an
> author, in the section for books, I check if EDBOOK, and if EDBOOK, I
> make AU create editors instead of authors.  Does this sound about
> right?

Well, if you look at the bibutils mappings I earlier posted:

{ "A2", "AUTHOR", PERSON, LEVEL_HOST },
{ "A3", "AUTHOR", PERSON, LEVEL_SERIES },
{ "ED", "EDITOR", PERSON, LEVEL_HOST },

So he's mapping the "ED" tag to his internal "EDITOR" variable. But
this is for an article, and I don't see anything for an "EDBOOK" type. Is
that actually a standard RIS type?

In any case, I would probably interpret that as a bug in EndNote and
map the AU tag to editor.

> - Authors are different for book sections, too - EndNote uses the A3
> tag (series author, standard part of Refman spec) to hold the series
> editor in these reference types: Conference Proceedings, Book Section,
> Audiovisual Material, Serial, Electronic Book Section; but uses A2 for
> series editors in these types: Book, Computer Program, Edited Book,
> Report, Map, Web Page, Online Multimedia, Classical Work.

So to correct you a bit, EndNote uses the A3 tag to hold a variety of
non-primary contributors. In other words, they're abusing the spec a
bit in places, I think. The "publisher" one for reports seems odd.

Dan will have to comment on this.

Bruce

Avram Lyon

Dec 5, 2010, 5:39:48 PM12/5/10
to zoter...@googlegroups.com
Jonathan said:
>> In trying to figure out authors, it has proven
>> helpful for me to be able to look at a list of all EndNote reference
>> types, the RIS tags used in export, and the EndNote fields to which
>> they map, sorted by reference type, then by RIS tag; and then to be
>> able to look at the same list, resorted by RIS tag, then by EndNote
>> reference type, so I can see the way EndNote uses a given tag across
>> reference types; and then to be able to look at the same list, sorted
>> by EndNote field, then by RIS tag, to detect variations in export of a
>> given EndNote field across reference types.  I could make these views
>> into HTML (though since they are really the same data and there are
>> 1793 rows, they'd probably get out of sync pretty quickly if the
>> output style changes).
>>
>> I also have a tab in the spreadsheet where each RIS tag has a series
>> of columns indicating whether it is in the true spec (old and outdated
>> though it may be), used by EndNote, and then a description, link to
>> the page where it is specified in the RefMan spec, etc.  I could make
>> this HTML as well.
>>
>> Part of the usefulness is the sortability, but you could always keep a
>> copy of the spreadsheet around for that.

I think that a simple tab-delimited plain-text table would be great;
it could be easily maintained and distributed, and it can be imported
into Excel rather easily, or manipulated into mappings or re-sorted
using basic command-line tools or text editors.
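
As a rough sketch of how little code such a table needs, the snippet below parses tab-delimited text with a header row and re-sorts it by any list of columns. The column names (`risTag`, `endnoteField`, `referenceType`) and both function names are assumptions for illustration, not an agreed file layout.

```javascript
// Parse tab-delimited text whose first line is a header row.
function parseTsv(text) {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  const header = lines[0].split("\t");
  return lines.slice(1).map((line) => {
    const cells = line.split("\t");
    const row = {};
    header.forEach((name, i) => { row[name] = cells[i] || ""; });
    return row;
  });
}

// Return a sorted copy, comparing the given columns in order.
function sortBy(rows, keys) {
  return rows.slice().sort((a, b) => {
    for (const key of keys) {
      if (a[key] < b[key]) return -1;
      if (a[key] > b[key]) return 1;
    }
    return 0;
  });
}
```

For example, `sortBy(rows, ["risTag", "referenceType"])` would give the "by tag, then by reference type" view, and swapping the keys gives the other views from the same file.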

>> I am about to the point where I can bust my references out of EndNote
>> (you can assign a tag to mulitple records easily in zotero!  I can
>> even get groups migrated in an hour or so!), so please let me know if
>> there is a chance this code could be integrated, and if so, how I can
>> help.  If not, I'll probably just put a post on the zotero forums to
>> let people know they are free to play with it if they want to and I'll
>> help if I can, but I don't want to implement and maintain a forked
>> version of the RIS importer.  If I'm going to stay involved, I'd
>> rather see if I can help make the trunk one work better within the way
>> you want it implemented, or help work on a better interoperability
>> dialect (and I think the importers would benefit from using a
>> framework at least similar to the one in this file that separates
>> overall control flow from mapping logic, instead of having a list of
>> if-then-else statements that is overall control flow with mapping
>> logic nested inside, for it to be easily and reliably maintained over
>> time - it is risky having the control flow for the import process so
>> tightly coupled to processing of the mappings - better to have the two
>> split out, so changes to mappings don't inherently also involve
>> changes to control flow that need to be tested, as well).

I think it should be possible to make this into a framework with
support for the core standard as well as the Endnote dialect; we're
hopefully moving in this direction with the MARC translator, which
runs into rather similar issues of data providers using and abusing
the spec in different ways (as well as multiple specs!).

It should be possible to set the RIS dialect (and thus the logic and
mappings) by using translator options, so users can specify what kind
of RIS they have in the import/export dialogs. Smart dialect sniffing,
if possible, would be great too (the MARC translator currently
attempts to sniff which MARC spec is being used).
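
A very rough sketch of what sniffing might look like for RIS, assuming the only signal is a reference type that EndNote emits but the RefMan spec does not define (EDBOOK is the one example this thread has so far). The list and the function name are hypothetical, and a real sniffer would need more signals than this.

```javascript
// Hypothetical list of TY values taken as a sign of EndNote output.
const ENDNOTE_ONLY_TYPES = ["EDBOOK"];

// Scan every "TY  - XXX" line and report "endnote" if any type is on the list.
function sniffRisDialect(risText) {
  const typePattern = /^TY\s+-\s+(\S+)/gm;
  let match;
  while ((match = typePattern.exec(risText)) !== null) {
    if (ENDNOTE_ONLY_TYPES.includes(match[1])) return "endnote";
  }
  return "standard";
}
```

A translator option could still override the result when the guess is wrong.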

I can't speak for Dan and the core team, but I would be very glad to
see your reworked code make it into the trunk if it can make
translator maintenance more pleasant and allow us to correctly import
more dialects of RIS. Out of respect for specifications, I'd want
Zotero's RIS export (at least in its default setting) to adhere to the
spec as we understand it, but added flexibility in import and
customization would be wonderful.

Regards,

Avram

jonathan.morgan

Dec 6, 2010, 5:23:00 AM12/6/10
to zotero-dev
OK. I'll convert the spreadsheet into tab-delimited files, and people
can convert from there as needed. Should I just chuck those files up
in my little git-space?

I'm happy to help with this if it is wanted. I did systems
integrations and migrations for years before deciding I wanted to
follow my first career love and work in journalism, so I've done a lot
of this kind of thing before. I actually enjoy the design of zotero.
It is great that the import translators are so easily accessible
(though I can't get accessDate to accept data, no matter what I do).

I placed another revision of RIS.js up a few minutes ago. I ran
through an import with 20 records in it and it worked OK, though I had
to run it a few times - I think when you start being able to actually
pull in PDFs and external files, it becomes a lot more memory
intensive, and so it is not perfectly stable (though it is much more
stable when you disable all other add-ons). I tested it a lot today,
fixed bugs, and refined import of the following EndNote reference
types:

- Book (BOOK)
- Book Section (CHAP)
- Conference Paper (CPAPER)
- Edited Book (EDBOOK)
- Electronic Article (EJOUR) - web pages, for the most part. - still
issues with this one.
- Electronic Book (EBOOK)
- Generic (GEN)
- Journal Article (JOUR)
- Magazine Article (MGZN)
- Report (RPRT)
- Unpublished Work (UNPB)
- Web Page (WEB)

I also rewrote the addDate() function so that it uses the built-in
JavaScript Date.parse() method to try to recover valid date
information from strings that do not follow the RIS spec (and also does a few
transformations - converts dashes to slashes, for example, and removes
periods - JavaScript won't convert "Jul. 21, 1976", for example, but
it will convert "Jul 21, 1976"). I fixed a few TODOs related to dates
I saw in there, too, while I was at it.
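
The clean-up described above could be sketched roughly as follows. This is an illustration, not the addDate() code itself: `parseLooseDate` is a made-up name, and how Date.parse() treats non-ISO strings depends on the JavaScript engine, so results may differ between browsers.

```javascript
// Normalize a loosely formatted date string, then let Date.parse() try it.
// Returns a Date on success or null when the engine cannot parse the result.
function parseLooseDate(raw) {
  const cleaned = raw
    .trim()
    .split("-").join("/")  // dashes to slashes, e.g. 1976-07-21 -> 1976/07/21
    .split(".").join("");  // drop periods, e.g. "Jul. 21" -> "Jul 21"
  const millis = Date.parse(cleaned);
  return Number.isNaN(millis) ? null : new Date(millis);
}
```

The split/join calls do the replacements without regular expressions. When the function returns null, the caller can fall back to storing the raw string so the value is not lost.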

A couple of things to ponder:
- I changed the MIME type lookup so it is one check of an object that
maps file extension to MIME type string, instead of 600 if-then-else's
chained together. It is still an object with 600 or so things in it,
though. One could probably pare this down, or find a way for people
to enable only those they need.
- I got the import to perform much better by disabling all add-ons
other than zotero and python while I was running the import. This
actually made it so I could pull in lots of PDFs (one reference had 8)
and performance improved substantially. It still occasionally goes
out to lunch, though, and when it does, you have to kill the browser,
and your entire import is not committed, so I'd take it in small
chunks with this script.
- Though disabling all the add-ons made it not matter so much, Regular
Expressions were the code that consistently killed the browser, and so
I tore most of them out of my code (though left the lines in there,
commented out, so they are documented). I tried adding a few back in
after disabling add-ons and it seemed to work about as well, but I
think I am going to leave out any I haven't put back for now. Have
you seen this before?
- Leaving the export to spec is wise. The Refman spec is old and
under-specified, though, so it could be that you'd guarantee that tags
in the spec will match the spec, then you could make up your own
additional tags that can be taken or left - though even at that, it
should probably be implemented as a different output type (Expanded
RIS or something like that).
- EndNote alone is so inconsistent with its use of RIS tags in its
export that supporting all of its idiosyncrasies is probably going to
be huge and make the import break for RIS created by other systems.
I'll have to check out the MARC importer. Is the version you are
working on in the production release, or is it somewhere else? In
this file, I didn't make too many customizations. I just altered
things where data I had entered went either the wrong place, or
nowhere. There is probably a "good enough" level this import can aim
for that will be fine for most people without coming close to supporting
all the random things EndNote does.
- I did have to update the output style one more time - Electronic
Book had an error in output of series editor (it was outputting
SeriesAuthor, which does not exist, not SeriesEditor). Other than
that and forcing journal articles to output related files, this is all
built to the default RIS Output Style in EndNote X4 (which I think is
the same for previous versions, as well).
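
For the MIME type lookup mentioned in the first point above, a single object keyed by file extension might look like the sketch below. Only five common extensions are shown rather than the 600 or so in the real table, and the names `MIME_BY_EXTENSION` and `mimeTypeForFileName` are hypothetical.

```javascript
// One lookup table instead of a long if-then-else chain (abbreviated).
const MIME_BY_EXTENSION = {
  pdf: "application/pdf",
  doc: "application/msword",
  xls: "application/vnd.ms-excel",
  html: "text/html",
  txt: "text/plain",
};

// Returns the MIME type for a file name, or null when the extension
// is missing or not in the table.
function mimeTypeForFileName(fileName) {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0 || dot === fileName.length - 1) return null;
  const ext = fileName.slice(dot + 1).toLowerCase();
  return Object.prototype.hasOwnProperty.call(MIME_BY_EXTENSION, ext)
    ? MIME_BY_EXTENSION[ext]
    : null;
}
```

Paring the table down is then just a matter of deleting entries, since the lookup function does not change.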

Once I get all references in, I'll post any additional notes.

Thanks,

Jonathan

jonathan.morgan

Dec 12, 2010, 11:36:30 PM12/12/10
to zotero-dev
a few more small changes just checked in:
- two or three bug fixes.
- made it so keywords are split on semi-colons as well as newlines.
It seems like there is some confusion as to Refman standard - some
fields expect semi-colon delimited lists on one line, some expect
multiple tags, one item per tag (like keywords). I had references in
EndNote that were on multiple lines, and each line had multiple tags
separated by semi-colons (not sure if it was originally one line, and
EndNote made it more lines, or if it came like that from external
database).
- no more regular expressions in this code for now (though it looks
like the freeze-ups were from the PDF libraries that zotero uses to
index PDFs).

I'll get the data from the spreadsheet up as csv files after my finals
(later this coming week).

Jonathan

jonathan.morgan

Dec 12, 2010, 11:38:21 PM12/12/10
to zotero-dev
And EDBOOK isn't from the standard. That is an EndNote creation.

Jonathan

On Dec 6, 12:23 am, "jonathan.morgan" <jonathan.morgan....@gmail.com>
wrote:

Bruce D'Arcus

Dec 13, 2010, 12:09:28 AM12/13/10
to zoter...@googlegroups.com
On Sun, Dec 12, 2010 at 6:36 PM, jonathan.morgan
<jonathan....@gmail.com> wrote:

...

> - made it so keywords are split on semi-colons as well as newlines.
> It seems like there is some confusion as to Refman standard - some
> fields expect semi-colon delimited lists on one line, some expect
> multiple tags, one item per tag  (like keywords).  I had references in
> EndNote that were on multiple lines, and each line had multiple tags
> separated by semi-colons (not sure if it was originally one line, and
> EndNote made it more lines, or if it came like that from external
> database).

For the longest time (years and years), Endnote had a bug where it
output multiple keywords as:

KW - one kw
two kw
three kw

I'm pretty certain correct behavior with RIS is ....

KW - one kw
KW - two kw
KW - three kw

... and that ...

KW - one kw; two kw; three kw

... would also be a bug of sorts (if less onerous than the first one).

Bruce

jonathan.morgan

Dec 13, 2010, 3:18:25 PM12/13/10
to zotero-dev
Yeah, both are bugs, I think (one keyword per line, each its own KW
tag is what the spec says).

EndNote still outputs tags on multiple lines without appending a "KW"
to the beginning of each. I think some databases also export tags as
semi-colon delimited lists, broken on new lines - I ran into a few
references from my EndNote database that had the newline bug, but then
within each line, there were between 3 and 5 tags, separated by
semicolons.

So, it was something like:

KW - one kw; two kw; three kw;
four kw; five kw; six kw;
seven kw; eight kw; nine kw;


I made the code break on newlines, then go through each line and break
on semi-colons, so it can deal with the different possible
permutations. I think that might be a bug further upstream than
EndNote, though. I'll see if I can figure out which database those
references came out of, maybe go look for the original import file and
see what that looks like.
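
The two-pass split just described can be sketched like this (an illustration with a hypothetical function name, not the code in RIS.js). It takes the accumulated value of one KW field, continuation lines included, and returns the individual keywords.

```javascript
// Break a KW value on newlines first, then on semicolons within each line.
// Empty pieces (from trailing semicolons or blank lines) are dropped.
function splitKeywords(rawValue) {
  const keywords = [];
  for (const line of rawValue.split("\n")) {
    for (const piece of line.split(";")) {
      const keyword = piece.trim();
      if (keyword !== "") keywords.push(keyword);
    }
  }
  return keywords;
}
```

Because each pass tolerates the absence of its delimiter, the same function handles one keyword per line, a single semicolon-delimited line, and the mixed form shown above.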

Here's one actual reference with the keywords this way (not sure from
where):

TY - JOUR
AB - Students obtain information from a variety of sources that vary
with respect to their trustworthiness. And, while the trustworthiness
of a source ought to influence the weight and position that the
information in a document plays in students' overall comprehension of
a topic, there is little empirical evidence of this relationship. In
this study, the authors examine whether source evaluation is related
to single or cross-document comprehension. Participants read seven
texts on global warming that varied on important source
characteristics. The authors found that both trustworthiness ratings
of the most reliable documents and the use of document type as rating
criteria independently predicted comprehension. The authors also found
both similarities and differences between source evaluation regarding
a science topic compared to history topics. Several possible
mechanisms for the observed relationship between source evaluation and
comprehension when students use multiple texts are discussed.
AN - ISI:000262948500001
AU - Braten, I.
AU - Stromso, H. I.
AU - Britt, M. A.
DA - Jan-Mar
DO - 10.1598/rrq.41.1.1
KW - HIGH-SCHOOL-STUDENTS; HISTORICAL PROBLEM; EXPOSITORY TEXTS;
INFORMATION; DOCUMENTS; WEB; CREDIBILITY; INTEGRATION; ARGUMENTS;
EXPERTISE

IS - 1
M3 - Article
N1 - Braten, Ivar Stromso, Helge I. Britt, M. Anne
67
PY - 2009
SN - 0034-0553
SP - 6-28
ST - Trust Matters: Examining the Role of Source Evaluation in
Students' Construction of Meaning Within and Across Multiple Texts
T2 - Reading Research Quarterly
TI - Trust Matters: Examining the Role of Source Evaluation in
Students' Construction of Meaning Within and Across Multiple Texts
VL - 44
ID - 71
ER -

Richard Karnesky

Dec 21, 2010, 5:12:56 AM12/21/10
to zotero-dev
The semicolon-separated entries were surely imported from another tool
(or you manually entered them).

EndNote stores the keywords field as a long string & the citation-
formatting system (which is responsible for generating the RIS output)
is not able to split the string on either semicolons or new lines.
(They have special support for splitting on the author/editor fields,
but almost all other fields have extremely limited support for
manipulation; you can't replace delimiters.) It is not possible to make
an EndNote style file that will generate valid RIS with multiple
keywords.

Parenthetically, the reason their XML export is better (though, by
many reports, not great in at least earlier versions) is that they
built a separate function to export the data, rather than rely on the
citation formatter.

--Rick

Jonathan Morgan

Dec 28, 2010, 3:28:40 PM12/28/10
to zotero-dev
Yes, they were imported from one of the online reference databases (I
am not sure which one - I wish the importer would have recorded that
in EndNote).

I am going to briefly poke around today in the current file handling,
see if I can see where I'd get the directory of the uploaded file. If
I can find that, then we'd have to see if it is saved after the upload
completes. If yes, I can just build that code so it knows how to
retrieve it from the Zotero object. If no, then we are once again at
a point where we'd need to alter code outside the translators.

Then sometime soon I'll have a look at the file handling in the trunk
and see what it is doing, see what options are going forward.

Jonathan

adamsmith

Jan 20, 2011, 6:44:55 PM1/20/11
to zotero-dev

jkr

Jan 21, 2011, 1:54:18 AM1/21/11
to zotero-dev
@adamsmith - thanks for the link to this thread. Just to make things
easier - this is what I was asking for over on the other thread:

Wish-list:
1 - BOOK (with only Editor/s as author/s)
- EDITOR is not being transferred to EDITOR in Endnote
- the entry is being marked as BOOK not as EDITED BOOK in Endnote

2 - Newspaper
Would it be possible to transfer the data for:
ISSUE DATE
and/or
EDITION
to the same fields in Endnote?

3 - Report
Right now the following data is transferred wrongly:
- REPORT NUMBER (is being exported to VOLUME, should be exported to
field REPORT NUMBER in Endnote)
- REPORT TYPE is not being exported at all (there is a field TYPE in
Endnote)
- DATE is not being exported at all (there is a field DATE in Endnote)
- ACCESSED is not being exported to ACCESS DATE in Endnote (or
anywhere else)

On Jan 20, 1:44 pm, adamsmith <bst...@gmx.de> wrote:
> See this thread for some other notes on compatibility issues: http://forums.zotero.org/discussion/15376/zotero-ris-export-misses-bo...