Structured names?

Tom Morris

Dec 17, 2009, 6:21:00 PM
to bib...@googlegroups.com
I was looking at the BibJSON spec dated 25 Nov 2009 and the associated
schema at http://www.bibkn.org/drupal/bibjson/bibjson_schema_v012.json
and, as far as I can tell, there's no provision for including any
structure in names (e.g. surname, given name, etc). Is there a way to
specify this information? If not, is there a plan to include it?

Databases often have information about the various name parts, so
forcing serialization into BibJSON as a single name string will result
in a lossy transformation which is difficult to undo later.
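
To make the concern concrete, here is a minimal sketch of how flattening loses information. The keys `given` and `family` are hypothetical and are not taken from the BibJSON schema; the point is only that two different structured names can collapse into the same string.

```python
def flatten(name):
    """Join the name parts into one display string (the lossy step)."""
    parts = (name.get("given"), name.get("family"))
    return " ".join(part for part in parts if part)


# Two different people, as a database with separate columns might store them.
a = {"given": "Mary Ann", "family": "Smith"}
b = {"given": "Mary", "family": "Ann Smith"}

# Both flatten to "Mary Ann Smith", so the original split cannot be recovered.
print(flatten(a) == flatten(b))
```

Once only the flattened string is stored, a consumer has to guess where the family name starts.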

Tom

p.s. The JSON schema is pretty difficult to work with in its raw
format. What modeling tools are recommended for viewing, editing,
etc?

Frederick Giasson

Dec 18, 2009, 10:20:25 AM
to bib...@googlegroups.com
Hi Tom,

> I was looking at the BibJSON spec dated 25 Nov 2009 and the associated
> schema at http://www.bibkn.org/drupal/bibjson/bibjson_schema_v012.json
> and, as far as I can tell, there's no provision for including any
> structure in names (e.g. surname, given name, etc). Is there a way to
> specify this information? If not, is there a plan to include it?
>
> Databases often have information about the various name parts, so
> forcing serialization into BibJSON as a single name string will result
> in a lossy transformation which is difficult to undo later.
>
BKN did talk about this, but it hasn't made it in so far. I would
suggest that you use the same method (vocab) as BIBO
(http://bibliontology.com), and extend BibJSON with this new schema
extension.

> p.s. The JSON schema is pretty difficult to work with in its raw
> format. What modeling tools are recommended for viewing, editing,
> etc?
>
Unfortunately, nothing other than text editors for now (like Notepad++). I
don't particularly like the general JSON serialization myself, and I
know others who don't either. That is why we helped BKN to
integrate BibJSON into the irON [1] notation and its irJSON profile [2].
That way, you can serialize BibJSON into other serialization formats
such as irXML [3] and commON [4].


Hope this helps!

[1] http://openstructs.org/iron/iron-specification
[2] http://openstructs.org/iron/iron-specification#mozTocId462570
[3] http://openstructs.org/iron/iron-specification#mozTocId408837
[4] http://openstructs.org/iron/iron-specification#mozTocId603499

Thanks,


Take care,


Fred

Tom Morris

Dec 18, 2009, 3:42:36 PM
to bib...@googlegroups.com
Thanks Frederick. Comments inline below.

On Fri, Dec 18, 2009 at 10:20 AM, Frederick Giasson <fr...@fgiasson.com> wrote:
> Hi Tom,
>>
>> I was looking at the BibJSON spec dated 25 Nov 2009 and the associated
>> schema at http://www.bibkn.org/drupal/bibjson/bibjson_schema_v012.json
>> and, as far as I can tell, there's no provision for including any
>> structure in names (e.g. surname, given name, etc).  Is there a way to
>> specify this information?  If not, is there a plan to include it?
>

> BKN did talk about this, but it hasn't made it in so far. I would suggest
> that you use the same method (vocab) as BIBO (http://bibliontology.com), and
> extend BibJSON with this new schema extension.

So to unwrap the layering here, that would appear to be givenname and
family_name from FOAF and a private prefixName and suffixName?

I'd definitely recommend something along those lines, although
choosing just one of underscores, camelCase, or simple concatenated
names as a naming convention might make it easier for users. A way to
indicate preferred name for collating purposes might be useful as
well, although I recognize that there are often
culture/language/application-specific requirements on collating
sequences that make any such information in an interchange format
advisory at best. If you're not going to support honorificPrefix from
FOAF, you might want to at least mention it so that people are aware
it could get dropped.
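
As a rough illustration of the suggestion above, here is a sketch of a person record that sticks to a single naming convention (camelCase) and carries an advisory sort name. Every attribute name here (`givenName`, `familyName`, `prefixName`, `suffixName`, `sortName`) is an assumption made for the example, not something defined by BibJSON, BIBO, or FOAF.

```python
# Hypothetical attribute names, all in one convention (camelCase).
person = {
    "name": "Vincent van Gogh",
    "givenName": "Vincent",
    "familyName": "van Gogh",
    "prefixName": "",
    "suffixName": "",
    "sortName": "Gogh, Vincent van",  # advisory only; consumers may ignore it
}


def collation_key(p):
    """Prefer the advisory sortName; otherwise fall back to 'family, given'."""
    if p.get("sortName"):
        return p["sortName"].casefold()
    return f'{p.get("familyName", "")}, {p.get("givenName", "")}'.casefold()


print(collation_key(person))
```

Because the sort name is only a hint, an application with its own language-specific collation rules can ignore it and rebuild a key from the separate parts.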

>> p.s. The JSON schema is pretty difficult to work with in its raw
>> format.  What modeling tools are recommended for viewing, editing,
>> etc?
>
> Unfortunately, nothing other than text editors for now (like Notepad++). I
> don't particularly like the general JSON serialization myself, and I know
> others who don't either. That is why we helped BKN to
> integrate BibJSON into the irON [1] notation and its irJSON profile [2].
> That way, you can serialize BibJSON into other serialization formats such as
> irXML [3] and commON [4].

I've never heard of irON and as far as I can tell none of my modeling
tools support it. By 'modeling tool' I mean something which shows the
relationships between types/properties, allows navigation around the
model, etc. Examples might be Protégé for OWL or the Rational tools
(commercial) or ArgoUML (open source) for UML. What are the
equivalents for JSON schema (or irON)? Even a simple HTML based
browser like ontology-browser would be an improvement over paging
around a text file.

Tom

Benjamin Kalish

Dec 19, 2009, 11:38:07 PM
to bib...@googlegroups.com
When thinking about structured names it may be helpful to look at how BibTeX handles them. BibTeX doesn't allow for much structure besides simple attribute-value pairs and not only does it not provide separate attributes for different components of a name, it requires that all names for a given attribute (authors, editors, etc.) be given in a single string! To make up for this, BibTeX uses a set of rules to interpret the text in such fields and break them down into their component parts.

The rules used by BibTeX are as follows (based on my understanding of Oren Patashnik's "BibTeXing", which is as close to a manual for BibTeX as I have found -- if you want to know exactly how BibTeX decomposes names, especially if you want to know how it handles ambiguous or nonconforming strings, I would recommend examining the source code):

1. If the names of many people are given in a single field they must be separated by the word "and".

2. The phrase "and others" may be used at the end of such a field to indicate that additional, unnamed, persons could also have been included in the field. This is similar to the use of "et al." in bibliographies.

3. A name consisting of a single token is always interpreted as a last name.

4. If a name consists of multiple tokens, each of which begins with an uppercase letter, and contains no commas, then all tokens but the last are interpreted as the first name while the last token is interpreted as the last name.

5. If a name consists of multiple tokens, some of which begin with a lowercase letter, and contains no commas, then all the tokens before the first lowercase token are interpreted as the first name, all the tokens after the last lowercase token are interpreted as the last name, and the tokens in the middle are considered nobiliary particles or "von words".

6. If a name contains a single comma then any initial lowercased tokens are considered nobiliary particles, remaining tokens preceding the comma are considered the last name, and everything after the comma is considered the first name.

7. If a name contains two commas then any initial lowercased tokens are considered nobiliary particles, remaining tokens preceding the first comma are considered the last name, tokens found between the commas are considered the suffix, and everything after the last comma is considered the first name.
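
Rules 3 through 7 can be sketched in a few lines of Python. This follows the rules as summarized above, not BibTeX's actual source code. It ignores brace-protected groups, and it makes one assumption of its own: the final token is always kept as part of the last name, so the last name is never empty. The function name and the output keys (`first`, `von`, `last`, `jr`) are hypothetical.

```python
def _starts_lower(token):
    """True when the token begins with a lowercase letter."""
    return token[:1].islower()


def parse_bibtex_name(name):
    """Split one name into first/von/last/jr parts using rules 3-7 above."""
    groups = [group.split() for group in name.split(",")]
    first, von, last, jr = [], [], [], []

    if len(groups) == 1:
        # Rules 3-5: no commas. The final token is never tested for case
        # (an assumption of this sketch), so "last" always gets a token.
        tokens = groups[0]
        lower = [i for i, tok in enumerate(tokens[:-1]) if _starts_lower(tok)]
        if lower:
            first = tokens[:lower[0]]
            von = tokens[lower[0]:lower[-1] + 1]
            last = tokens[lower[-1] + 1:]
        else:
            first, last = tokens[:-1], tokens[-1:]
    elif len(groups) in (2, 3):
        # Rules 6-7: "von Last, First" or "von Last, Jr, First".
        before = groups[0]
        n = 0
        while n < len(before) - 1 and _starts_lower(before[n]):
            n += 1
        von, last = before[:n], before[n:]
        if len(groups) == 3:
            jr = groups[1]
        first = groups[-1]
    else:
        raise ValueError(f"too many commas in name: {name!r}")

    return {
        "first": " ".join(first),
        "von": " ".join(von),
        "last": " ".join(last),
        "jr": " ".join(jr),
    }


print(parse_bibtex_name("Ludwig van Beethoven"))
print(parse_bibtex_name("King, Jr., Martin Luther"))
```

Even this simplified version needs a fair amount of case analysis, which is the cost any consumer of unstructured name strings would have to pay.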

Admittedly, these rules are a mess, and were designed only with bibliography creation in mind, not the accurate description of names, but if BibJSON must support the conversion of BibTeX data to BibJSON using simple and lightweight tools, then we have little choice but to inherit them.

This doesn't mean that BibJSON can't include additional attributes to support structured names. Some may argue that supporting both approaches might be overly complex and lead to confusion, though I think we all agree that authors may always elect to include additional attributes if they wish. Personally, I would like to see the inclusion of such attributes, though their use could not be required.

Benjamin Kalish

Tom Morris

Dec 20, 2009, 6:09:08 PM
to bib...@googlegroups.com
On Sat, Dec 19, 2009 at 11:38 PM, Benjamin Kalish <bka...@gmail.com> wrote:
> Admittedly, these rules are a mess, and were designed only with bibliography
> creation in mind, not the accurate description of names, but if BibJSON must
> support the conversion of BibTeX data to BibJSON using simple and
> lightweight tools, then we have little choice but to inherit them.

Certainly a conversion tool would have to follow the rules, but
there's no reason they need to pollute the BibJSON spec itself if
BibJSON supports explicit declaration of the necessary information.
For example, using the English word "and" as a list separator wasn't a
particularly good practice in the 1980s, but it would be positively
hideous in 2010, particularly when JSON has native list structures.
The various name pieces that the conversion tool parses out using the
idiosyncratic rules would result in a set of tagged name pieces that
could be encoded directly in JSON. Users can then correct errors
directly rather than having to remember what the parsing rules will
do.
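
For instance, a converter could turn the "and"-separated string into a native JSON list in one place. The sketch below covers only the list-splitting rules (the "and" separator and the trailing "and others"). The function name and the `authorListTruncated` key are made up for illustration, and each list entry could just as well carry tagged name pieces instead of a single `name` string.

```python
import json
import re


def split_name_list(field):
    """Return (names, has_others) for a BibTeX-style "A and B and others" field."""
    names = [n.strip() for n in re.split(r"\s+and\s+", field.strip()) if n.strip()]
    has_others = bool(names) and names[-1].lower() == "others"
    if has_others:
        names.pop()  # "others" is a marker, not a person
    return names, has_others


names, has_others = split_name_list("Knuth, Donald E. and Leslie Lamport and others")
record = {
    "author": [{"name": n} for n in names],
    "authorListTruncated": has_others,  # hypothetical key
}
print(json.dumps(record, indent=2))
```

A user who spots a mistake can then edit one list entry directly, without knowing anything about the separator convention.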

Tom

Benjamin Kalish

Dec 20, 2009, 7:04:56 PM
to bib...@googlegroups.com
Hi Tom,

To be honest, the idea of working with a data structure that requires such complicated processing makes me shudder. I have suggested, as you have, that conversion tools should bear the burden of normalizing such fields. Apparently, however, this goes against the philosophy of BibJSON in many people's eyes, and some feel that it would put an unnecessary burden on dataset authors. I think this conversation is a good reminder that that burden must rest *somewhere*. We can let BibJSON inherit conventions from a wide variety of existing bibliographic standards, including BibTeX, making conversion from these existing formats extremely easy; however, in doing so we are putting a large burden on the tools that will use BibJSON and limiting its applications. On the other hand, it can't be denied that a stricter BibJSON spec that doesn't inherit such conventions will require more complicated conversion tools.

I like the idea of the stricter spec both because it will make datasets more uniform and (in my opinion) easier to understand, and also because it means that code necessary to parse complicated strings (such as the BibTeX name fields described above) need only reside in a single conversion tool, and not in every application which needs to extract names from BibJSON data.

I'd love to hear from those who have expressed a preference for a more flexible specification and lightweight conversion tools. I am sure I haven't done their argument justice. I know Fred Giasson has spoken up about similar issues in the past. Fred, how do you feel about the above arguments?

Benjamin Kalish

Nitin Borwankar

Dec 20, 2009, 8:25:41 PM
to bib...@googlegroups.com
Hi all,

I have a preference for lightweight tools but I also understand the need for unambiguous specification.
Given that schema tools for JSON are still primitive I have tried some experiments with the "bibutils" tools at 

These are XML-based but do a reasonably good job of parsing names into components for the easier cases.
Also the tools provide conversions of multiple bib formats to/from a common MODS (library metadata) format.
Given that XML has a rigorous schema foundation and a rich assortment of validation tools in multiple languages I found it intriguing to consider the following approach.

a) use bibutils to go from BibTeX to MODS 
b) use XML <--> JSON tools to create a JSON version of MODS which allows lossless bidirectional conversion between XML MODS and JSON "MODS".
c) do all the validation, parsing, rigorous checking, etc. on the XML MODS, and when we are happy that the data is as it should be (for some definition of "as it should be"), then we do a 1-1 mapping to JSON.

This delegates the hard part (parsing, validation, etc.) to battle-tested tools that are used by many and well supported by the author, who is at a well-known organization (Scripps).
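
As a rough sketch of the XML-to-JSON half of steps (b) and (c), the following reads personal names out of a MODS fragment using only the Python standard library. The fragment is hand-written to stand in for converter output. The output keys (`given`, `family`, `role`) and the function name are assumptions for the example, not an agreed JSON "MODS" mapping.

```python
import json
import xml.etree.ElementTree as ET

MODS_URI = "http://www.loc.gov/mods/v3"
NS = {"m": MODS_URI}

# Hand-written sample standing in for the output of step (a).
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <name type="personal">
    <namePart type="given">Ludwig</namePart>
    <namePart type="family">van Beethoven</namePart>
    <role><roleTerm authority="marcrelator" type="text">author</roleTerm></role>
  </name>
</mods>"""


def mods_names_to_json(mods_xml):
    """Map each MODS <name> element to a dict of lists (always lists, never scalars)."""
    root = ET.fromstring(mods_xml)
    people = []
    for name in root.iter(f"{{{MODS_URI}}}name"):
        person = {"given": [], "family": [], "role": []}
        for part in name.findall("m:namePart", NS):
            kind = part.get("type")
            if kind in ("given", "family"):
                person[kind].append((part.text or "").strip())
        for term in name.findall("m:role/m:roleTerm", NS):
            person["role"].append((term.text or "").strip())
        people.append(person)
    return people


print(json.dumps(mods_names_to_json(SAMPLE_MODS), indent=2))
```

Using a list for every field, even when there is only one value, is a deliberate choice here: it keeps the JSON shape stable no matter how many `namePart` elements a record has.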

This approach also has the following benefits  
a) gives us the ability to ingest data from Endnote, RIS, MS Word etc with zero cost - thus including cross-disciplinary data which is a huge plus.
b) allows us to focus on JSON without having to build the whole tool chain from the ground up just to get some work done.
c) plugs us into well-defined library standards (yes, they may have many shortcomings, but I'd rather use and extend them than try to "boil the ocean")
d) most importantly allows us to use mature tools which are in use every day.

If we don't like MODS we could use it as a starting point and apply some XSLT transforms to it but this then essentially is a fork which is less desirable from a maintenance and support point of view.

I have been exploring some XML <--> JSON tools but don't yet have any recommendations - the transformation is not obvious and needs some case-by-case analysis, as seen in
http://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html
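
One small example of why the mapping is not obvious: a naive element-to-dict converter (a deliberately simplistic sketch, not one of the tools mentioned here) produces differently shaped JSON depending on whether a child element occurs once or several times, and it silently drops attributes.

```python
import xml.etree.ElementTree as ET


def naive_to_json(elem):
    """Naive conversion: text for leaf elements, a dict keyed by tag otherwise."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    out = {}
    for child in children:
        value = naive_to_json(child)
        if child.tag in out:
            # A repeated tag turns the earlier scalar into a list.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    return out


one = naive_to_json(ET.fromstring("<name><namePart>Knuth</namePart></name>"))
two = naive_to_json(ET.fromstring(
    "<name><namePart>Donald</namePart><namePart>Knuth</namePart></name>"))
print(one)
print(two)
```

The first result holds a string under `namePart` and the second holds a list, so a consumer has to handle both shapes, and converting back to XML cannot restore attributes that were never carried over.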

So, assuming this approach is useful, the work shifts from building the full tool chain for BibJSON and JSON schema validation to doing lossless bidirectional transformation between XML and JSON for bibliographic data sets.
I believe the latter is a far more tractable problem given our resources.
 
Thoughts?

Nitin Borwankar

------------------------------------------------------------------
Nitin Borwankar
nborw...@gmail.com