parsing a bibtex file

রুদ্র ব্যাণার্জী

Sep 22, 2012, 6:06:17 PM

Dear friends,
BibTeX is a citation file format, used largely with LaTeX. A sample
BibTeX entry (copied from bibtex.org) looks like:

@article{mrx05,
Author = "Mr. X",
Title = {Something Great},
Publisher = "nob" # "ody",
Year = 2005,
}

Generally a BibTeX file contains hundreds of such entries.
I am trying to parse these files and use the value of each token in a C program.
To my knowledge, flex is the way of doing this, and I am trying to do so.
So far, I have managed to write this code:
$ cat pbib.l
%{
#include <stdio.h>
#include <stdlib.h>
%}

%{
char yylval;
int YEAR;
%}
%x author
%x title
%x pub
%x year
%%
@[a-zA-Z][a-zA-Z0-9]*    { printf("%s",yytext); }
[Aa]uthor=               { BEGIN(author); }
<author>\"[a-zA-Z\/.]+\" { printf("%s",yytext);
                           BEGIN(INITIAL); }
[Yy]ear=                 { BEGIN(year); }
<year>\"[0-9]+\"         { printf("%s",yytext);
                           BEGIN(INITIAL); }
[Tt]itle=                { BEGIN(title); }
<title>\"[a-zA-Z\/.]+\"  { printf("%s",yytext);
                           BEGIN(INITIAL); }
[Pp]ublisher=            { BEGIN(pub); }
<pub>\"[a-zA-Z\/.]+\"    { printf("%s",yytext);
                           BEGIN(INITIAL); }
[a-zA-Z0-9\/.-]+=        printf("ENTRY TYPE ");
\"                       printf("QUOTE ");
\{                       printf(" ");
\}                       printf(" ");
;                        printf("SEMICOLON ");
\n                       printf("\n");
[\,\n\}][\,\}]           printf("\nNEWENTRY");
%%

This works fine if I only print to stdout. But since I want to use the output of lex.yy.c
in my program, I need to have each yytext from every entry stored (probably in arrays): every
yytext for the Author token should go into array_author[i], every one for Year into
array_year[i], and so on. I have not managed to do this.

What I am currently doing is using a shell script to write each entry to a different file,
and then having the C program read those files... which is clearly a very BAD way of doing things.

I am using flex only because I don't know any better tool, so kindly advise me if a better
way is available, but it has to be for C (hence, Boost is probably not a solution).

Can someone here kindly show me how I can extend my flex code so that these data
are stored?

Rouben Rostamian

Sep 23, 2012, 3:39:41 AM

In article <1348351577.3333.14.camel@roddur>,

রুদ্র ব্যাণার্জী <bnrj....@gmail.com> wrote:
>BibTeX is a citation file format, used largely with LaTeX. A sample
>BibTeX entry (copied from bibtex.org) looks like:

>...

>Generally a BibTeX file contains hundreds of such entries.
>I am trying to parse these files and use the value of each token in a C
>program.
>To my knowledge, flex is the way of doing this, and I am trying to do so.

Have a look at btOOL:

http://www.gerg.ca/software/btOOL/

It is a C library of functions to read and manipulate bibtex files.

--
Rouben Rostamian

Rudra Banerjee

Sep 23, 2012, 5:50:22 PM

Is it possible to achieve my goal using flex?

Ben Bacarisse

Sep 23, 2012, 6:10:14 PM

Rudra Banerjee <bnrj....@gmail.com> writes:

> Is it possible to achieve my goal using flex?

It would help if you quoted some material when you reply.

Your reply was to a suggestion of a library you could use, but you don't
say why you can't use that library. For example, you might just be
trying to learn to use flex, or your project is not permitted to use
certain kinds of library and so on. Knowing this will help everyone
make more useful suggestions.

But to return to your latest question, I don't know what your goal is so
I can't say if it can be achieved using flex. You previously wanted to
store the tokens that flex had found, and of course you can do that, but
how you should do it depends entirely on what the overall goal is. For
example, if the goal is to count how many people every author has
co-authored a paper with, you would need to store only authors and
counts and you might do that with, say, a hash table. But if the goal
is to write a scanner that can print out publications that match some
given criteria, then you don't need to store anything except the start
position of the most recent entry (though you might choose to store each
entry as you match it to avoid having to seek the file backwards).

--
Ben.

রুদ্র ব্যাণার্জী

Sep 23, 2012, 6:50:30 PM

Ben,
Thanks for your reply.
I am trying to learn parsing (and maybe scraping some time later),
which means I should not use a ready-to-use solution (it's self-assigned
work; please don't assume I am asking you to solve my homework).
As I stated in the first post in this thread, I need to store the
value of each author token.
So, if my BibTeX file has entries like:

@article{mrx05,
Author = "Mr. X",
Title = {Something Great},
Publisher = "nob" # "ody",
Year = 2005,
}

@article{mra12,
Author = "Mr. A and Mr. B",
Title = {Something Equally Great},
Publisher = "nob" # "ody",
Year = 2012,
}

I would like to have
array_author[0]="Mr. X"
array_author[1]="Mr. A and Mr. B"

array_year[0]=2005
array_year[1]=2012

and so on.
For completeness, I am pasting the flex code that I have managed so far
(copied from the first post). It parses everything correctly, but I don't
know how to store the results in arrays. Please help.

Ben Bacarisse

Sep 23, 2012, 7:26:39 PM

That's a rather crude data structure but since you don't say (or, more
likely, you don't yet know) what you want to do with the data once
you've stored it I can't suggest anything better.

It looks like the current problem is that you don't know how to copy a
string (for titles and so on) nor how to turn a string into an int for
the year. You can use strdup to copy the string values and you can use
strtol (or, for simplicity while learning, atoi) to convert yytext to an
integer value. If you don't want to rely on strdup, you should write
your own using malloc followed by strcpy.
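
A minimal sketch of those two conversions, for illustration only. The helper
names copy_string and parse_year are invented for this example; the only
library calls used are malloc, strcpy, strlen and strtol.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for strdup: malloc followed by strcpy.
 * Returns NULL if allocation fails; the caller frees the result. */
char *copy_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);   /* +1 for the terminating '\0' */
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Hypothetical helper: convert text such as "2005" to an int with strtol.
 * Returns 0 and stores the value in *out on success; returns -1 if the
 * text is not a complete decimal number or does not fit in an int. */
int parse_year(const char *text, int *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0')
        return -1;                        /* no digits, or trailing junk */
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return -1;                        /* out of range */
    *out = (int)value;
    return 0;
}
```

Unlike atoi, the strtol version can tell "not a number" apart from a
legitimate 0, which is why it is usually preferred once the basics work.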

You'll need to keep a global index that you increment when a new entry
starts and you'll have to decide what to do when an entry lacks some
data.
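
One way this bookkeeping might look, as a sketch under assumptions: the array
names follow the original post, while MAX_ENTRIES, YEAR_MISSING, entry_count,
begin_entry, set_author and set_year are hypothetical names chosen for this
example. Missing data is represented here by NULL for strings and a marker
value for the year; other choices are equally possible.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES  1000      /* arbitrary fixed capacity for this sketch */
#define YEAR_MISSING (-1)      /* marker for an entry without a year */

char *array_author[MAX_ENTRIES];   /* NULL means "no author seen" */
int   array_year[MAX_ENTRIES];
int   entry_count = 0;             /* number of entries started so far */

/* Call when a new entry starts. Resets the slots so that missing
 * fields are detectable. Returns the new index, or -1 if full. */
int begin_entry(void)
{
    if (entry_count >= MAX_ENTRIES)
        return -1;
    array_author[entry_count] = NULL;
    array_year[entry_count] = YEAR_MISSING;
    return entry_count++;
}

/* Store a copy of the author text in the current entry.
 * Returns 0 on success, -1 if no entry is open or memory runs out. */
int set_author(const char *text)
{
    char *copy;

    if (entry_count == 0)
        return -1;
    copy = malloc(strlen(text) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, text);
    free(array_author[entry_count - 1]);   /* replace an earlier value */
    array_author[entry_count - 1] = copy;
    return 0;
}

/* Store the year in the current entry. Returns 0, or -1 if none is open. */
int set_year(int year)
{
    if (entry_count == 0)
        return -1;
    array_year[entry_count - 1] = year;
    return 0;
}
```

In a flex scanner, begin_entry would presumably be called from the rule that
matches the start of an entry, and the setters from the field rules.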

<snip>
--
Ben.

Ben Bacarisse

Sep 23, 2012, 7:33:59 PM

রুদ্র ব্যাণার্জী <bnrj....@gmail.com> writes:
<snip>

> I am trying to learn parsing

<snip>

> I would like to have
> array_author[0]="Mr. X"
> array_author[1]="Mr. A and Mr. B"
>
> array_year[0]=2005
> array_year[1]=2012
>
> and so on.

I should have made another point.

If you are trying to learn parsing, you probably shouldn't be storing
the values into an array here in the lexical analysis code. It's fun to
see how much you can do with flex alone, but ultimately you should be
building the data structure in the parser, not the lexer. All the lexer
need do is copy the value associated with the token (often a string or
an int) somewhere where the parser can see it. If the parser is
generated by bison (the yacc-a-like that often goes with flex) there are
standard ways to do that. A good yacc/bison tutorial will cover them
even in the simplest examples.
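
To make the division of labour concrete, here is a small hand-written C
imitation of it; it is not flex or bison output. With bison the usual channel
is the yylval variable, which the lexer action assigns before returning the
token code. In this sketch current_value plays that role, and every name
(next_token, lexer_start, parse_year_field, the TOK_ constants) is made up
for the illustration.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum token { TOK_END, TOK_NAME, TOK_NUMBER, TOK_EQUALS, TOK_OTHER };

/* Stand-in for bison's yylval: the lexer fills it in, the parser reads it. */
union token_value {
    char name[64];   /* meaningful after TOK_NAME (stored lower-cased) */
    long number;     /* meaningful after TOK_NUMBER */
};
union token_value current_value;

static const char *input = "";   /* text still to be scanned */

void lexer_start(const char *text) { input = text; }

/* Lexer side: recognise one token and record its value, nothing more. */
enum token next_token(void)
{
    while (isspace((unsigned char)*input))
        input++;
    if (*input == '\0')
        return TOK_END;
    if (isdigit((unsigned char)*input)) {
        char *end;
        current_value.number = strtol(input, &end, 10);
        input = end;
        return TOK_NUMBER;
    }
    if (isalpha((unsigned char)*input)) {
        size_t n = 0;
        while (isalnum((unsigned char)*input)) {
            if (n < sizeof current_value.name - 1)
                current_value.name[n++] = (char)tolower((unsigned char)*input);
            input++;
        }
        current_value.name[n] = '\0';
        return TOK_NAME;
    }
    return *input++ == '=' ? TOK_EQUALS : TOK_OTHER;
}

/* Parser side: knows the shape "year = number" and decides what to keep.
 * Returns 0 and stores the number in *year, or -1 on anything else. */
int parse_year_field(const char *text, long *year)
{
    lexer_start(text);
    if (next_token() != TOK_NAME || strcmp(current_value.name, "year") != 0)
        return -1;
    if (next_token() != TOK_EQUALS)
        return -1;
    if (next_token() != TOK_NUMBER)
        return -1;
    *year = current_value.number;
    return 0;
}
```

The lexer never touches an array or an entry structure; only the parser
function decides where a value ends up.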

<snip>
--
Ben.

রুদ্র ব্যাণার্জী

Sep 23, 2012, 7:34:52 PM

On Mon, 2012-09-24 at 00:26 +0100, Ben Bacarisse wrote:
> It looks like the current problem is that you don't know how to copy a
> string (for titles and so on) nor how to turn a string into an int for
> the year.

Ben,
Probably I was not clear about my problem. The problem is:
flex hands me every match as yytext, as in

<author>\"[a-zA-Z\/.]+\" { printf("%s",yytext);

My problem is how to redirect this yytext to a specific array.
So, for the first entry, I want the yytext matched by

<author>\"[a-zA-Z\/.]+\" { printf("%s",yytext);

to be written to array_author[0].

So the problem is basically how to handle yytext.

James K. Lowden

Sep 23, 2012, 8:15:03 PM

On Sun, 23 Sep 2012 14:50:22 -0700 (PDT)
Rudra Banerjee <bnrj....@gmail.com> wrote:

> Is it possible to achieve my goal using flex?

Just to give you another way to frame the question: Is BibTeX a
context-free language? If so, the answer to your question is Yes --
at least in conjunction with yacc/bison -- because those tools generate
an LALR(1) parser, which by definition parses a deterministic context-free
language.

HTH.

--jkl

Ben Bacarisse

Sep 23, 2012, 9:09:28 PM

রুদ্র ব্যাণার্জী <bnrj....@gmail.com> writes:

Yes, I get that. I thought I'd said enough. Should I write out how to
call strdup:

<author>\"[a-zA-Z\/.]+\" { array_author[index] = strdup(yytext); }

with an appropriate #include <string.h>? Obviously you need to change
"index" when an entry is finished.

There are issues to do with feature test macros, but I'm going to leave
that for the moment since I think the problem is a much simpler one
right now.
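
For reference, a sketch of both points in plain C. On the feature test
macros: strdup comes from POSIX rather than from the C standards before C23,
so under a strict mode such as -std=c99 its declaration may be hidden unless
a macro like _POSIX_C_SOURCE is defined before the first #include. Separately,
the pattern in that rule includes the double quotes, so yytext will contain
them too. The helper dup_unquoted below is a hypothetical name, not part of
flex or any library.

```c
#define _POSIX_C_SOURCE 200809L   /* request POSIX declarations such as strdup */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy a matched token, dropping one pair of
 * surrounding double quotes if present. The caller frees the result. */
char *dup_unquoted(const char *text)
{
    size_t len = strlen(text);
    char *copy;

    if (len < 2 || text[0] != '"' || text[len - 1] != '"')
        return strdup(text);            /* nothing to strip */

    copy = malloc(len - 1);             /* len - 2 characters plus '\0' */
    if (copy != NULL) {
        memcpy(copy, text + 1, len - 2);
        copy[len - 2] = '\0';
    }
    return copy;
}
```

In the rule above, strdup(yytext) could then be swapped for
dup_unquoted(yytext) if the stored value should be Mr. X rather than "Mr. X".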

--
Ben.

Rui Maciel

Sep 23, 2012, 9:16:10 PM

Rudra Banerjee wrote:

> Is it possible to achieve my goal using flex?

Flex generates lexers, which are routines that convert sequences of
characters into sequences of tokens. If you wish to write a parser then
getting a hold of a sequence of tokens is half the battle; the other half is
making sense of that sequence of tokens. For that you need another
routine, called a parser, which builds upon the lexer.

The standard combo to develop parsers consists of relying on lex/flex to
generate a lexer and yacc/bison to generate a parser that uses flex's lexer.
There is a considerable amount of resources dedicated to this specific
topic, both on the web and in good old dead tree format.

In spite of that, I believe that flex/bison does more harm than good,
particularly when developing parsers for languages which are relatively
simple. It takes a bit of time to learn how to use it properly, and it will
also cause a number of problems that may not be trivial to fix. As an
alternative, you can always write the parser yourself, without any fancy
tool or code generator. There are some disadvantages to that, such as
taking a bit of work to write and debug, and ending up with a component
which is harder to maintain. Yet, the advantages often outweigh the
disadvantages: you have much more control over what code goes into your
parser, your parser tends to be considerably more efficient, you don't have
to fiddle with your build system to support flex and bison, and once you
know the technique you will be able to develop parsers in languages other
than C or C++.

If you decide to try writing your own BibTeX parser by hand as a
learning experience, then you will do no wrong in looking up LL parsers.
Wikipedia has an article on that topic which you might find interesting.

http://en.wikipedia.org/wiki/LL_parser
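
To give a flavour of the hand-written approach, here is a rough
recursive-descent (LL-style) sketch in C for a simplified subset of BibTeX.
The grammar in the comments is an assumption made for this example, not the
full BibTeX syntax: it ignores '#' concatenation, nested braces and comments,
and all names (struct entry, parse_entry, entry_get, and so on) are invented.

```c
#include <ctype.h>
#include <string.h>

#define MAX_FIELDS 16
#define MAX_TEXT   128

struct field { char name[MAX_TEXT]; char value[MAX_TEXT]; };
struct entry {
    char type[MAX_TEXT];
    char key[MAX_TEXT];
    struct field fields[MAX_FIELDS];
    int nfields;
};

static const char *p = "";                 /* current read position */

static void skip_space(void) { while (isspace((unsigned char)*p)) p++; }

/* Consume character c (after optional white space); 1 on success. */
static int expect(char c)
{
    skip_space();
    if (*p != c) return 0;
    p++;
    return 1;
}

/* word := one or more of letter, digit, '_', '-', ':' */
static int parse_word(char *out, int lower)
{
    size_t n = 0;
    skip_space();
    while (isalnum((unsigned char)*p) || *p == '_' || *p == '-' || *p == ':') {
        if (n < MAX_TEXT - 1)
            out[n++] = lower ? (char)tolower((unsigned char)*p) : *p;
        p++;
    }
    out[n] = '\0';
    return n > 0;
}

/* value := '"' text '"' | '{' text '}' | word */
static int parse_value(char *out)
{
    char close;
    size_t n = 0;
    skip_space();
    if (*p == '"') close = '"';
    else if (*p == '{') close = '}';
    else return parse_word(out, 0);
    p++;                                   /* opening delimiter */
    while (*p != '\0' && *p != close) {
        if (n < MAX_TEXT - 1) out[n++] = *p;
        p++;
    }
    out[n] = '\0';
    if (*p != close) return 0;             /* unterminated value */
    p++;
    return 1;
}

/* entry := '@' word '{' word { ',' field } [ ',' ] '}'
 * field := word '=' value
 * Returns 1 if text holds one well-formed entry, 0 otherwise. */
int parse_entry(const char *text, struct entry *e)
{
    p = text;
    e->nfields = 0;
    if (!expect('@') || !parse_word(e->type, 1)) return 0;
    if (!expect('{') || !parse_word(e->key, 0)) return 0;
    while (expect(',')) {
        struct field *f;
        skip_space();
        if (*p == '}') break;              /* trailing comma before '}' */
        if (e->nfields == MAX_FIELDS) return 0;
        f = &e->fields[e->nfields];
        if (!parse_word(f->name, 1) || !expect('=') || !parse_value(f->value))
            return 0;
        e->nfields++;
    }
    return expect('}');
}

/* Look up a field by lower-case name; returns NULL if absent. */
const char *entry_get(const struct entry *e, const char *name)
{
    int i;
    for (i = 0; i < e->nfields; i++)
        if (strcmp(e->fields[i].name, name) == 0)
            return e->fields[i].value;
    return NULL;
}
```

Each grammar rule maps onto one function, which is the main attraction of the
technique. Because '#' concatenation is left out, this sketch would reject
the Publisher line of the sample entry; handling it would mean extending
parse_value.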

Hope this helps,
Rui Maciel