Custom item types

Dan Stillman

May 31, 2007, 5:49:46 PM

to zotero-dev

Custom, user-created item types and fields are not currently possible in
Zotero. A few users have modified the Zotero database to add their own
custom types, but such modifications break upwards-compatibility,
locking users into obsolete versions of the software, and items that use
these custom types cannot be exchanged with other users. We'd like to
make it possible for users to easily create and share items using custom
item types and fields.

To get the conversation going, here's one possible way custom item types
and fields could work:

1) Users could create new item types via an interface in the client by
choosing an existing item type to use as a template. They could then add
custom fields, and they'd have the option of indicating that a new
field was essentially the same as a field higher up the hierarchy so
that data would be preserved when switching between types (as is already
possible now with many of the built-in types--for example, "label" in
"audioRecording" is the same as "studio" in "videoRecording" because
they both map to "publisher").
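
The field mapping described in 1) can be sketched as a small lookup table. This is only an illustration, not Zotero's actual schema or code: the `BASE_FIELD` table and the `convert_item` helper are hypothetical names, and the sketch assumes that any field without a mapping is simply reported as dropped.

```python
# Hypothetical table: (item type, type-specific field name) -> shared base field.
# Only the two mappings mentioned in the message are included.
BASE_FIELD = {
    ("audioRecording", "label"): "publisher",
    ("videoRecording", "studio"): "publisher",
}


def convert_item(item_type, fields, new_type):
    """Carry field values from item_type over to new_type via base fields.

    Returns (converted, dropped): values that found a home in the new
    type, and values that have no equivalent there.
    """
    # Reverse lookup for the target type: base field -> its field name.
    target_names = {
        base: name
        for (type_name, name), base in BASE_FIELD.items()
        if type_name == new_type
    }
    converted, dropped = {}, {}
    for name, value in fields.items():
        base = BASE_FIELD.get((item_type, name))
        if base in target_names:
            converted[target_names[base]] = value
        else:
            dropped[name] = value
    return converted, dropped


converted, dropped = convert_item(
    "audioRecording", {"label": "Blue Note"}, "videoRecording"
)
print(converted, dropped)
```

A custom field declared as "the same as" a built-in field would just be one more row in such a table, which is what lets its data survive a change of item type.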

2) Metadata for user-created item types and fields would be included in
the Zotero RDF export format so that items could be exchanged with
others. (Data in those fields would be exchanged in other output formats
only if the field was designated as being the same as a built-in field
that already had a mapping.) When importing an item with custom item
types that they hadn't added themselves, users would be prompted to A)
add the custom item types and fields and import all the data or B) not
add the custom item types and fields and discard any data outside of
built-in types and fields.
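
The two import choices in 2) could look roughly like the sketch below. Everything here is an assumption made for illustration: the built-in sets, the `import_item` function, and in particular the idea that under option B an item of a custom type falls back to the template type it was created from (the proposal does not say what type such an item would end up with).

```python
# Hypothetical built-in vocabulary, for illustration only.
BUILT_IN_TYPES = {"book", "document", "letter"}
BUILT_IN_FIELDS = {"title", "date", "publisher"}


def import_item(item, add_custom):
    """Import one item, following option A (add_custom=True) or B (False).

    item is a dict with "type", "template" (the built-in type the custom
    type was based on) and "fields". Returns (imported_item, registered),
    where registered is the set of custom names added to the client.
    """
    is_custom_type = item["type"] not in BUILT_IN_TYPES
    custom_fields = {f for f in item["fields"] if f not in BUILT_IN_FIELDS}

    if add_custom:
        # Option A: register the custom type and fields, keep all data.
        registered = set(custom_fields)
        if is_custom_type:
            registered.add(item["type"])
        return {"type": item["type"], "fields": dict(item["fields"])}, registered

    # Option B: keep only data that fits built-in types and fields.
    kept = {f: v for f, v in item["fields"].items() if f in BUILT_IN_FIELDS}
    item_type = item["template"] if is_custom_type else item["type"]
    return {"type": item_type, "fields": kept}, set()


census = {
    "type": "censusRecord",
    "template": "document",
    "fields": {"title": "1900 Census", "birthDate": "1871"},
}
print(import_item(census, add_custom=True))
print(import_item(census, add_custom=False))
```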

3) After creating custom types and fields, users would have the option
of submitting them to the Zotero server, where, through some community
process, they could be designated as recognized entities, added to the
main Zotero distribution, and synched to existing clients via the
repository. "Official" types and fields, unlike custom ones, would be
able to be localized for other languages, would not present a warning
when importing into other Zotero clients, and could be mapped to
fields of other data formats where possible. There would be a way to
show/hide fields so that, even with a larger set of recognized fields,
the metadata pane didn't become unwieldy.

Essentially we'd be encouraging--but not enforcing--a controlled
vocabulary for item types and fields to allow for localization and to
try to prevent having hundreds of user-created field names for the same
things. The community process would essentially be to say, "OK, that's a
reasonable English string to use for this type of field data, and there
isn't already an existing string to describe it," and if necessary to
map it to existing fields for conversion purposes.

This is related to but separate from the bibliographic ontology issues,
though it'd be preferable to implement a good hierarchical ontology
first so that custom types and fields were built on a strong semantic
foundation. Bruce and others can comment on this from RDF and other
perspectives. Our goal is to get this functionality into Zotero in the
near future for the benefit of users who are struggling to use Zotero in
new fields that it doesn't currently support (e.g. genealogical research).

- Dan

Bruce D'Arcus

May 31, 2007, 6:54:10 PM

to zotero-dev

Dan,

I think you lay out the beginnings of a reasonable solution.

But I think we've yet to really define the problem. Can we perhaps do
that before moving on? I know you've thought about a lot of the below,
but I'd just like us to be explicit.

The most fundamental question is:

1) what do we mean by a resource type, and what purpose does it
achieve?

Is it just a hint for a GUI to configure a form template with the
right fields and their labels so that users can enter their data
reliably? Is it to enable reliable citation formatting? Or something
more like a tag that users assign to find stuff?

Does it refer to the medium or format of a resource (a "DVD"), to its
intellectual content (I dunno, a "poem"), or to the way it's
distributed (an "online document")? And what happens if the types
overlap (a poem included on a DVD)?

Lying behind this question are of course large implementation
questions like:

How would you fit this into the new hierarchical model? In the
database would a journal article be a resource with a type of
"JournalArticle" or is it an "Article" in a "Journal"?

Is a reference type here analogous to an RDF class or an object type?
Or is it rather more like an object attribute?

Or can we think of different kinds of type-like things? Maybe there
are types for GUI configuration and CSL templates (think "article-
journal"), a more internal and more relational type or class for the
database and RDF import/export, and even perhaps a third that is more
a natural language description or label ("press release")?

Two related questions are these:

2) why might users need custom types?
3) why might users need custom properties?

Finally:

4) what are the requirements that we would identify with custom types
and properties?

I think if we can find some consensus answers to these questions, and
also put in place that strong semantic foundation you mention, we
might just end up with an excellent solution that balances what are
often conflicting priorities (say flexibility versus
interoperability). There are dangers if we don't though.

RDF is flexible on the type issue, BTW. A resource may include
multiple types. We could, for example, say:

<http://ex.net/1>
    rdf:type <http://bibliontology.org/ns#Article/> ;
    rdf:type <http://bibliontology.org/ns#Review/> .

... or:

<http://ex.net/2>
    rdf:type <http://bibliontology.org/ns#Broadcast/> ;
    rdf:type <http://bibliontology.org/ns#Interview/> .

Note: I'm not saying what I've done above makes the most sense; just
that RDF allows it.
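
To show what multiple types mean for code that consumes the data, here is a minimal sketch that holds the two examples as plain (subject, predicate, object) tuples and looks them up by type. It deliberately avoids any RDF library; the `types_of` and `instances_of` helpers are made-up names, and the type URIs are copied verbatim from the snippets in this message.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# The statements from the message, as (subject, predicate, object) tuples.
TRIPLES = {
    ("http://ex.net/1", RDF_TYPE, "http://bibliontology.org/ns#Article/"),
    ("http://ex.net/1", RDF_TYPE, "http://bibliontology.org/ns#Review/"),
    ("http://ex.net/2", RDF_TYPE, "http://bibliontology.org/ns#Broadcast/"),
    ("http://ex.net/2", RDF_TYPE, "http://bibliontology.org/ns#Interview/"),
}


def types_of(subject):
    """All types asserted for a resource; there may be more than one."""
    return {o for s, p, o in TRIPLES if s == subject and p == RDF_TYPE}


def instances_of(type_uri):
    """All resources that carry the given type, whatever else they are."""
    return {s for s, p, o in TRIPLES if p == RDF_TYPE and o == type_uri}


print(sorted(types_of("http://ex.net/1")))
print(instances_of("http://bibliontology.org/ns#Interview/"))
```

The point of the sketch is that a lookup by type returns a resource regardless of its other types, so an interview that is also a broadcast is found either way.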

Bruce

Elena Razlogova

May 31, 2007, 9:56:50 PM

to zoter...@googlegroups.com

Dan & Bruce--

I'm in complete agreement with all parts of Dan's general plan, with
one caveat: I think that before doing what Dan proposes, the "contained
in" feature for item types should be implemented (not the entire
hierarchy--it looks like it will take months to work that out--but
just the ability to designate existing item types as containers for
other types), as described here:
https://www.zotero.org/trac/wiki/HierarchicalOntology

A lot of item types that people might want to add are already in
Zotero, but in two parts--for "letter in book" there is a "letter"
and "book" item type, for example--and it would be redundant to have
people add those as custom types. The "container" feature alone would
create dozens of additional "official" Zotero item types people have
been asking for.

To respond to Bruce's questions:

> 1) what do we mean by a resource type, and what purpose does it
> achieve?
>
> Is it just a hint for a GUI to configure a form template with the
> right fields and their labels so that users can enter their data
> reliably? Is it to enable reliable citation formatting? Or something
> more like a tag that users assign to find stuff?
>
> Does it refer to the medium or format of a resource (a "DVD"), to its
> intellectual content (I dunno, a "poem"), or to the way it's
> distributed (an "online document")? And what happens if the types
> overlap (a poem included on a DVD)?

Some of these questions can be answered with the "contained in"
feature. On others, let me argue again for maximum user flexibility,
where the item type is there to aid in research, and hence the same
source could be added in different ways dependent on one's research
focus. Take Bruce's example from the forum:
http://forums.zotero.org/discussion/826/ There are already item types "interview," "TV
broadcast", and "website". For Bruce, the resulting "combined" item
type would be "interview in website", citing interviewer,
interviewee, website title, and url. However, for a TV studies
scholar, the relevant item type would be "interview in broadcast",
with interview fields combined with title of broadcast, title of tv
series, date of broadcast, etc., plus the url for the transcript.
(This gets us back to the universal ID debate. I guess for the
hypothetical TV studies scholar the URL for the transcript would have
to be added as a web-link attachment rather than go into the URL
field, because part of a broadcast would have a different "universal
ID" than the web transcript. But at the very least it is important to
allow people to enter what looks like the same "object" in different
ways. This is done in real life all the time--film scholars would
analyze and cite the film itself, with theatrical date of release,
not the DVD they use to study the film. But a media scholar writing a
book on DVDs would analyze and cite the DVD.)

> How would you fit this into the new hierarchical model? In the
> database would a journal article be a resource with a type of
> "JournalArticle" or is it an "Article" in a "Journal"?

In terms of the GUI, I think it would make sense, for example, to have
one item type "letter" in the "add item" menu, and then have the user add
various "containers" (book, journal, magazine, newspaper, website). I
think Journal Article and a few other popular types may be an
exception to this--it's so common that it'll be counterintuitive not
to have it in the "add item" menu.

> Two related questions are these:
>
> 2) why might users need custom types?
> 3) why might users need custom properties?

A genealogical example from my own research: I wrote a translator
from Ancestry.com that collects information from US census records.
Right now, it goes into the "book section" item type, with census
citation info as a book, plus a name of the person in the census as
"contributor" and the placename "Winesburg, Ohio" as the book section
title--this is really a hack rather than a proper entry. For my
research I'd like to search and sort the census info by city, date of
birth of the person, etc. All of this info can be parsed by the
translator but in order to search and sort I need dedicated fields
for city, dob, etc. which don't exist.

> 4) what are the requirements that we would identify with custom types
> and properties?

Not sure what you mean by "requirements", but to continue with the
flexibility argument: It doesn't make sense to me to designate
certain item types (e.g. "journal") as container types only, while
making others primary types. In theory, a letter could be a primary
item ("letter in book") or a container ("poem in letter"), or both
("poem in letter in book"). Likewise, what is now designated as an
"ancillary" item type in the ontology above may need to be primary--I
really need a "serial" item type, for example, to enter periodicals I'm
working on.

> Or can we think of different kinds of type-like things? Maybe there
> are types for GUI configuration and CSL templates (think "article-
> journal"), a more internal and more relational type or class for the
> database and RDF import/export, and even perhaps a third that is more
> a natural language description or label ("press release")?

This makes good sense to me (see the example of the GUI for entering
Journal Article vs. Letter above--there the interface logic would be
different from the RDF).

Best,
Elena

Bruce D'Arcus

May 31, 2007, 11:10:04 PM

to zotero-dev

On May 31, 9:56 pm, Elena Razlogova <elena.razlog...@gmail.com> wrote:

> Some of these questions can be answered with the "contained in"
> feature. On others, let me argue again for maximum user flexibility,
> where the item type is there to aid in research, and hence the same
> source could be added in different ways dependent on one's research
> focus. Take Bruce's example from the forum:
> http://forums.zotero.org/discussion/826/ There are already item types "interview," "TV
> broadcast", and "website". For Bruce, the resulting "combined" item
> type would be "interview in website", citing interviewer,
> interviewee, website title, and url. However, for a TV studies
> scholar, the relevant item type would be "interview in broadcast",
> with interview fields combined with title of broadcast, title of tv
> series, date of broadcast, etc., plus the url for the transcript.

Right. This is what I favor: a defined set of core classes and
standard properties, which can be combined in flexible ways.

> (This gets us back to the universal ID debate. I guess for the
> hypothetical TV studies scholar the URL for the transcript would have
> to be added as a web-link attachment rather than go into the URL
> field, because part of a broadcast would have a different "universal
> ID" than the web transcript. But at the very least it is important to
> allow people to enter what looks like the same "object" in different
> ways. This is done in real life all the time--film scholars would
> analyze and cite the film itself, with theatrical date of release,
> not the DVD they use to study the film. But a media scholar writing a
> book on DVDs would analyze and cite the DVD.)

Yes, this is a really interesting and messy issue here. I actually did
recently cite an interview in a television broadcast, but I cited the
published transcript for it, and I used the URL for the transcript, in
the URL field.

> > How would you fit this into the new hierarchical model? In the
> > database would a journal article be a resource with a type of
> > "JournalArticle" or is it an "Article" in a "Journal"?
>
> In terms of GUI, i think it would make sense, for example, to have
> one item type "letter" in the "add item" menu, and then have user add
> various "containers" (book, journal, magazine, newspaper, website). I
> think Journal Article and a few other popular types may be an
> exception to this--it's so common that it'll be counterintuitive not
> to have it in the "add item" menu.

Yes, though I worry that the question of intuitiveness is likely to be
community-specific. And in any case, as I mentioned below, it is
possible to distinguish between what a user sees, and how it's stored
in the database or the RDF.

> > Two related questions are these:
>
> > 2) why might users need custom types?
> > 3) why might users need custom properties?
>
> A genealogical example from my own research: I wrote a translator
> from Ancestry.com that collects information from US census records.
> Right now, it goes into the "book section" item type, with census
> citation info as a book, plus a name of the person in the census as
> "contributor" and the placename "Winesburg, Ohio" as the book section
> title--this is really a hack rather than a proper entry.

Yeah, ouch. I tend to use the "Document" type for this sort of stuff,
but that might not work here.

> For my research I'd like to search and sort the census info by city, date of
> birth of the person, etc. All of this info can be parsed by the
> translator but in order to search and sort I need dedicated fields
> for city, dob, etc. which don't exist.

Right. So in the case of "city" we really need a generic place (or
maybe jurisdiction?) property in the DB and accompanying RDF.

> > 4) what are the requirements that we would identify with custom types
> > and properties?
>
> Not sure what you mean by "requirements",

I mean what do we want to achieve? Flexibility is good, for example,
but not if your citations don't format correctly, or there are
problems transferring data.

> but to continue with flexibility argument: It doesn't make sense to me to designate
> certain item types (i.e. "journal") as container types only, while
> making others primary types. In theory, a letter could be a primary
> item ("letter in book") or a container ("poem in letter"), or both
> ("poem in letter in book"). Likewise, what is now designated as
> "ancillary" item type in ontology above may need to be primary--I
> really need a "serial" item type for example to enter periodicals I'm
> working on.

I have never liked the "ancillary types" bucket. I certainly wouldn't
model that in the RDF.

I also wouldn't define container types per se. As you say, the real
world is too messy for that, and the contained-container thing is a
relation.

What I have done in the RDF and when I was experimenting with my own
SQL schemas previously is to separate out "collections." These are
things one would never cite independently, nor list in the library
view. I include in those periodicals, archival collections, and so
forth.

> > Or can we think of different kinds of type-like things? Maybe there
> > are types for GUI configuration and CSL templates (think "article-
> > journal"), a more internal and more relational type or class for the
> > database and RDF import/export, and even perhaps a third that is more
> > a natural language description or label ("press release")?
>
> This makes good sense to me (see example of GUI for entering Journal
> Article vs. Letter above--there interface logic would be different
> from RDF).

Right.

Bruce

Dan Stillman

Jun 1, 2007, 2:22:06 AM

to zoter...@googlegroups.com

On 5/31/07 11:10 PM, Bruce D'Arcus wrote:
...

> What I have done in the RDF and when I was experimenting with my own
> SQL schemas previously is to separate out "collections." These are
> things one would never cite independently, nor list in the library
> view. I include in those periodicals, archival collections, and so
> forth.

I suspect people will have different preferences for what should show up
in the library view (and perhaps even what would be cited
independently), and it will also be different for different items: users
may not want a parent book item to show up for every book section item
they create, but sometimes they will want the separate parent item (say
when they're citing multiple chapters). Our thinking on this is that all
items contained within other items would automatically create all
necessary parent items in the database, but whether or not the parent
items were shown would be configurable by the user on a per-item basis.
This would allow maximum flexibility and also greatly simplify the
implementation, as almost everything would be an item type as far as the
code was concerned.
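
A rough sketch of that idea, with every name invented for illustration (`Item`, `Library`, `add_chain`, the `visible` flag) and the assumption that auto-created parents start out hidden: adding an item together with its chain of containers creates all the parents, and a per-item flag decides what the library view lists.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Item:
    item_id: int
    item_type: str
    title: str
    parent_id: Optional[int] = None
    visible: bool = True  # per-item switch for the library view


class Library:
    def __init__(self):
        self.items = {}
        self._ids = count(1)

    def add_chain(self, chain):
        """Add an item plus all its containers.

        chain lists (item_type, title) pairs from the innermost item
        outwards. Parents are created automatically and start hidden
        (an assumption of this sketch); only the innermost item is shown.
        """
        parent_id = None
        item = None
        outermost_first = list(reversed(chain))
        for position, (item_type, title) in enumerate(outermost_first):
            is_leaf = position == len(outermost_first) - 1
            item = Item(next(self._ids), item_type, title, parent_id, is_leaf)
            self.items[item.item_id] = item
            parent_id = item.item_id
        return item

    def library_view(self):
        """Titles of the items the user has chosen to see, in creation order."""
        return [i.title for i in self.items.values() if i.visible]


lib = Library()
lib.add_chain([
    ("poem", "Untitled Poem"),
    ("letter", "Letter to a Friend"),
    ("bookSection", "Letters, 1900-1910"),
    ("book", "Collected Letters"),
])
print(len(lib.items), lib.library_view())
```

Because every level is an ordinary item, making the book show up later is just a matter of flipping its `visible` flag; no separate "container" code path is needed.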

Bruce D'Arcus

Jun 1, 2007, 8:47:37 AM

to zotero-dev

On Jun 1, 2:22 am, Dan Stillman <dstill...@zotero.org> wrote:
> On 5/31/07 11:10 PM, Bruce D'Arcus wrote:
> ...
>
> > What I have done in the RDF and when I was experimenting with my own
> > SQL schemas previously is to separate out "collections." These are
> > things one would never cite independently, nor list in the library
> > view. I include in those periodicals, archival collections, and so
> > forth.
>
> I suspect people will have different preferences for what should show up
> in the library view (and perhaps even what would be cited
> independently), and it will also be different for different items: users
> may not want a parent book item to show up for every book section item
> they create, but sometimes they will want the separate parent item (say
> when they're citing multiple chapters).

This I'm not really following. If I am reading a new edited book, I
expect my workflow to be:

1. suck in book metadata from Amazon, worldcat.org, etc.
2. do "add Chapter" (or whatever) and enter the chapter information;
to add information about container, I select the already-stored book
(by autocomplete?).

So the "parent" here is its own row, and only (and always) gets
displayed in the library view once.

Right?

> Our thinking on this is that all
> items contained within other items would automatically create all
> necessary parent items in the database,

Makes sense.

> but whether or not the parent
> items were shown would be configurable by the user on a per-item basis.

But here's the thing: an edited book is a citable item. It belongs in
the library view regardless.

When I am talking "collections" I am talking not about edited books: I
am talking about book series.

I don't know; maybe with the nested library view, it might be better
to leave it flexible. I guess I could imagine seeing an archival
collection in the library view, where I could drill down to
collections and documents from there.

> This would allow maximum flexibility and also greatly simplify the
> implementation, as almost everything would be an item type as far as the
> code was concerned.

Yes, I can see that, particularly if you're using a relational
database for storage. My main worry, obviously, is in keeping this in
sync with CSL and the RDF.

Bruce

Dan Stillman

Jun 1, 2007, 2:43:09 PM

to zoter...@googlegroups.com

On 6/1/07 8:47 AM, Bruce D'Arcus wrote:
> On Jun 1, 2:22 am, Dan Stillman <dstill...@zotero.org> wrote:
>> I suspect people will have different preferences for what should show up
>> in the library view (and perhaps even what would be cited
>> independently), and it will also be different for different items: users
>> may not want a parent book item to show up for every book section item
>> they create, but sometimes they will want the separate parent item (say
>> when they're citing multiple chapters).
>
> This I'm not really following. If I am reading a new edited book, I
> expect my workflow to be:
>
> 1. suck in book metadata from Amazon, worldcat.org, etc.
> 2. do "add Chapter" (or whatever) and enter the chapter information;
> to add information about container, I select the already-stored book
> (by autocomplete?).
>
> So the "parent" here is its own row, and only (and always) gets
> displayed in the library view once.
>
> Right?

If you create the parent item first, that would make sense, but it seems
quite possible that someone might not want the automatically created
parent to show up in the items list for every child item they create. If
you're citing a poem in a letter in a book section in an edited book, do
you really want four new items created in the library? Maybe, but I
imagine some people wouldn't.

...

>
> When I am talking "collections" I am talking not about edited books: I
> am talking about book series.

OK, book series might be an example of something that would never be
cited, but for some item types, your own requirements may be different
from others'--Elena has mentioned needing to cite periodicals and
archives as independent items
(http://forums.zotero.org/discussion/391/1/hierarchical-item-relationships/#Item_48).

> I don't know; maybe with the nested library view, it might be better
> to leave it flexible. I guess I could imagine seeing an archival
> collection in the library view, where I could drill down to
> collections and documents from there.

We hadn't discussed using the hierarchical view in this way, but that
might make sense.

>> This would allow maximum flexibility and also greatly simplify the
>> implementation, as almost everything would be an item type as far as the
>> code was concerned.
>
> Yes, I can see that, particularly if you're using a relational
> database for storage. My main worry, obviously, is in keeping this in
> sync with CSL and the RDF.

Actually, the relational DB is the biggest problem with this approach.
If things were separate entities, there'd be a separate table for each
one and we'd need fewer queries, but this way everything would be in the
same table. The standard way to approach this would be with a parent
field, but (unless you're using Oracle's hierarchical queries) that's
inefficient performance-wise, since you need to either pull the entire
table or walk the hierarchy with separate queries. There are some more
advanced ways to structure hierarchical data in a DB, though, and we
might use some of those approaches.

But DB/performance issues aside, this would make the code much simpler,
since we wouldn't have to handle each type separately.
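
The trade-off with a parent field can be illustrated with a small SQLite sketch. The `items` table, its columns, and the two helper functions are hypothetical, not Zotero's schema. The first helper walks the hierarchy with one query per level, as described above. The second uses a recursive common table expression, which is not one of the approaches named in this thread; it is swapped in here as the closest standard-SQL analogue to Oracle's hierarchical queries and assumes an SQLite build that supports `WITH RECURSIVE` (3.8.3 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    " id INTEGER PRIMARY KEY,"
    " type TEXT,"
    " title TEXT,"
    " parent_id INTEGER REFERENCES items(id))"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [
        (1, "book", "Collected Letters", None),
        (2, "bookSection", "Letters, 1900-1910", 1),
        (3, "letter", "Letter to a Friend", 2),
        (4, "poem", "Untitled Poem", 3),
    ],
)


def ancestors_by_walking(conn, item_id):
    """Follow parent_id upwards, issuing one query per level."""
    titles, queries = [], 0
    current = item_id
    while current is not None:
        title, parent_id = conn.execute(
            "SELECT title, parent_id FROM items WHERE id = ?", (current,)
        ).fetchone()
        queries += 1
        titles.append(title)
        current = parent_id
    return titles, queries


def ancestors_by_cte(conn, item_id):
    """Fetch the same chain in a single recursive query."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, title, parent_id, depth) AS (
            SELECT id, title, parent_id, 0 FROM items WHERE id = ?
            UNION ALL
            SELECT i.id, i.title, i.parent_id, c.depth + 1
            FROM items AS i JOIN chain AS c ON i.id = c.parent_id
        )
        SELECT title FROM chain ORDER BY depth
        """,
        (item_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(ancestors_by_walking(conn, 4))
print(ancestors_by_cte(conn, 4))
```

Both helpers return the chain from the poem up to the book; the difference is that the walking version needs as many round trips as there are levels.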