Hi!
> I'll chime in on the value of allowing multiple formats. Allowing
> multiple formats:
>
> 1) Allows tools which migrate data to bibjson to preserve strings with
> their original markup while also creating new strings in a more
> standard format.
> 2) Allows data set publishers to provide alternative formats for
> strings to ensure that records can be displayed on a wide variety of
> systems.
>
Yes
> In fact, I think we can do better than support multiple formats, and
> support multiple alternative strings. In this way we can attach
> arbitrary
> attributes to a string. We could, for example, attach
> language attributes, which would be especially useful for user
> interfaces. (Users will probably prefer to see abstracts and subject
> headings in their own language.)
>
Since the beginning of these BibJSON discussions (what, 8 months to a
year ago?) I thought the goal was to make this set of record
descriptions as simple as possible: just describe the information about
the records, nothing more, nothing less, in a more or less free form.
What you are describing below certainly works; however, I would like to
mention a couple of things that could make the whole thing cleaner, and
that explain some of the process we have been trying to explain for a
couple of weeks now.
The problem I see with this solution is that you are mixing two totally
different things in the same *instance records* description file:
(1) all properties related to the records themselves (title, authors, etc.)
(2) system processing properties, such as the language of a string, the
format of a string, and so on.
This can be OK, don't get me wrong, but it certainly makes instance
records files (which are what data publishers really care about in the
end) much more complex.
Let's take the BKN use case: sharing research-related information between
departments. In this use case, the data publishers are professors,
students, etc.: people with or without any knowledge of data description
and maintenance. They don't know what Unicode is, what a charset is, what
the ISO standard for language codes is, what an "original" record is,
and so on.
They know that their article has a title, an ISBN, authors, affiliated
institutions, etc. They only want to know which "attribute" to use to
describe that information. (Give me the cookbook, but don't talk about
the chemistry of the interactions between all the nutrients.)
It is with this specific vision, this specific use case, and this kind
of user in mind that we created the schemas, and that we put all the
"system oriented" information directly in the schemas. We could have
done what XML, N3, and other serialization formats do and added the
format, language, and datatype information directly in the instance
records files, but given the use case, we chose not to, and put that
in the schema instead.
We want the record files to be as simple as possible.
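For illustration, the schema side might look something like this (a
sketch only; the attribute names and the schema syntax here are
hypothetical, not an agreed BibJSON schema):

```json
{
  "attributes": {
    "title":        { "type": "string", "format": "ascii" },
    "titleUnicode": { "type": "string", "format": "unicode" },
    "titleLatex":   { "type": "string", "format": "latex" }
  }
}
```

Defined once here, the format information never needs to appear in any
instance record.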
That is why we created a "format" attribute that is defined in the
*schema* and not in the *instance records* file. As for language, is it
a necessary addition of complexity? We could add it to the schema too;
however, wouldn't it be better to create language-specific datasets
instead of putting all languages in the same one? Take Wikipedia: for
the record "USA", would you prefer having one dataset per language, or
having the USA record repeat the same attribute 45 times in different
languages?
Also, isn't it a little complex just to describe a title literal?
> "title" : {
> "type" : "RichStringList",
> "strings" : [
> {
> "type" : "RichString"
> "value" : "An L(⅓ + ε) Algorithm for the Discrete Logarithm
> Problem for Low Degree Curves",
> "format" : "unicode",
> "language" : "en"
> },
> {
> "type" : "RichString"
> "value" : "An $L (1/3 + \epsilon)$ Algorithm for the Discrete
> Logarithm Problem for Low Degree Curves",
> "format" : "latex",
> "language" : "en",
> "original" : "true",
> "preferred" : "true"
> },
> {
> "type" : "RichString"
> "value" : "An L(1/3 + Epsilon) Algorithm for the Discrete
> Logarithm Problem for Low Degree Curves",
> "format" : "ascii",
> "language" : "en"
> }
> ]
> }
>
Wouldn't this be easier, faster, and more efficient?
{
  "title": "An L(1/3 + Epsilon) Algorithm for the Discrete Logarithm
Problem for Low Degree Curves",
  "titleUnicode": "An L(⅓ + ε) Algorithm for the Discrete Logarithm
Problem for Low Degree Curves",
  "titleLatex": "An $L (1/3 + \epsilon)$ Algorithm for the Discrete
Logarithm Problem for Low Degree Curves"
}
All the burden is put in the schema. Data publishers won't have to care
about defining all these properties, record after record after record.
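On the consumer side, a display tool can still pick the best variant
with a small lookup. Here is a sketch, assuming the flat attribute names
from the example above (title, titleUnicode, titleLatex); in a real
system the format-to-attribute mapping would come from the schema:

```python
# Hypothetical helper: pick the best available title variant for the
# formats a display system supports, in preference order.
def pick_title(record, supported_formats):
    # Maps a format name to the attribute that, by convention, carries it.
    attribute_for_format = {
        "unicode": "titleUnicode",
        "latex": "titleLatex",
        "ascii": "title",
    }
    for fmt in supported_formats:
        attr = attribute_for_format.get(fmt)
        if attr and attr in record:
            return record[attr]
    return record.get("title")  # fall back to the plain title

record = {
    "title": "An L(1/3 + Epsilon) Algorithm for the Discrete "
             "Logarithm Problem for Low Degree Curves",
    "titleLatex": "An $L (1/3 + \\epsilon)$ Algorithm for the Discrete "
                  "Logarithm Problem for Low Degree Curves",
}
print(pick_title(record, ["unicode", "latex", "ascii"]))  # the LaTeX variant wins here
```

The record itself stays a flat bag of strings; all the format knowledge
sits in one small table that the schema can supply.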
But there are even more considerations here: your example is 712
characters; mine is 335, so yours is more than twice the size (mine is
about 53% smaller). If we extrapolate this to a whole dataset, your
dataset would be considerably bigger than mine just because we put all
these technical considerations in the instance record file instead of
the schema. (In reality the overhead would be lower, maybe 20 to 30%
for an entire dataset.)
But it is not only a matter of physical space; it is also a matter of
parsing/processing time for these datasets.
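To make the size argument concrete, here is a quick Python sketch
comparing the two shapes (the strings come from the title examples
above; exact counts depend on whitespace and on how many variants a
record carries):

```python
import json

# Nested RichString form: format/language metadata is repeated per string.
nested = {
    "title": {
        "type": "RichStringList",
        "strings": [
            {
                "type": "RichString",
                "value": "An L(1/3 + Epsilon) Algorithm for the Discrete "
                         "Logarithm Problem for Low Degree Curves",
                "format": "ascii",
                "language": "en",
            },
            {
                "type": "RichString",
                "value": "An $L (1/3 + \\epsilon)$ Algorithm for the Discrete "
                         "Logarithm Problem for Low Degree Curves",
                "format": "latex",
                "language": "en",
                "original": "true",
                "preferred": "true",
            },
        ],
    }
}

# Flat form: the format is implied by the attribute name,
# which the schema defines once for the whole dataset.
flat = {
    "title": "An L(1/3 + Epsilon) Algorithm for the Discrete "
             "Logarithm Problem for Low Degree Curves",
    "titleLatex": "An $L (1/3 + \\epsilon)$ Algorithm for the Discrete "
                  "Logarithm Problem for Low Degree Curves",
}

nested_size = len(json.dumps(nested))
flat_size = len(json.dumps(flat))
print(nested_size, flat_size)  # the nested form is substantially larger
```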
So, by putting all these considerations at the level of the schema (and
maybe of the dataset description, for the language), we gain:
(1) simpler instance records files
(1.1) easier to use
(1.2) easier to understand
(1.3) easier to analyze
(2) smaller instance records files
(3) faster processing of records files
(4) fewer transmission/transformation errors, because all this
information lies in one place, in one record: the description of types
and attributes
(5) easier dataset maintenance, since a single change in the schema
impacts all the records; otherwise, you have to change *all* the
records if you want to change one of these things
(6) the ability to validate and process the same dataset using
different schemas for different purposes (think of all the different
HTML validation profiles: strict, relaxed, etc.)
(7) anything else I haven't thought of?
The drawbacks are:
(1) defining more attributes, such as title, titleLatex, titleUnicode
(2) anything else I haven't thought of?
> The specification would have to provide a schema for these types and
> provide vocabularies for the language and format attributes.
>
> This seems preferable to having separate attributes for different
> formats as it allows us to indicate which string should be preferred,
> which string is the original, the language used, or any other metadata
> that the dataset publisher wishes to include. It also would keep the
> number of attributes in our vocabulary relatively small, since their
> wont be a _html and _latex variant for every attribute that can take a
> text value.
>
Yes, but every solution has advantages and disadvantages. If a perfect
solution existed, we wouldn't have all these serialization formats, and
all these programming languages, and all these cars, and all these
different kinds of paper, light, etc. :)
So we have to think about the use case. Does something already exist
that does what we want? If yes, we reuse; if no, we create.
What I stated above is the use case I understood from Jim and Nitin
months and months ago. And as this discussion continues, I see this
good vision fading away in favor of other considerations. There was a
need, there was a use case that created that need, and a solution
started to be created. I don't think that your example above helps
anybody meet this need for the core BKN use case.
So, this is what I have to say about it. I can be right and I can be
wrong. I am not the one who will take any final decision for BKN &
BibJSON, but these are the things to watch out for and to consider,
given my experience and knowledge. From there, it is up to BKN to
choose what it thinks is best for itself.
Thanks!
Take care,
Fred