EBNF grammar for .proto files


Alek Storm

Jul 11, 2008, 7:14:25 PM
to Protocol Buffers
Here is an EBNF grammar for .proto files. It's based on my reading of
the protoc source (mainly parser.cc, tokenizer.cc, and
unittest.proto). It's not much use to me until later, if I write a
standalone validator or .proto->serialized-message converter.
Hopefully it's useful to somebody else looking for a formal definition
of the schema. If there are any errors, please let me know, and I'll
update it (so scroll down, because I don't think I can edit this
message later).

It's not actually compatible as-is with any lexer/parser I know of,
but it should only require a few syntax changes and splitting the
lexer and parser into separate sections to get it running. I haven't
tested it thoroughly, mainly because the parser I use, PLY, is LALR,
so it doesn't handle EBNF. I suppose the best way to test it would be
to use ANTLR, which would require the fewest changes from this format.

There were a few pleasant (undocumented) surprises for me when I
examined the source code:
- For user-defined types, a leading dot means the name is fully
qualified.
- Group names must start with a capital letter.
- Tag numbers must be 2^28-1 or lower

I think 'extensions' should be a message-level option, not its own
keyword. It's not a field or a declaration, it's just... an option. I
really can't put it more plainly, as is often the case when things
make too much sense to me.
For example:
message Foo {
option extensionStart = 100;
}
message Bar {
option extensionStart = 100;
option extensionEnd = 199;
}
would replace
message Foo {
extensions 100 to max;
}
message Bar {
extensions 100 to 199;
}

In addition, options should use text format, because they're really
just setting values in descriptor.proto. The descriptor would be
exposed through the object's name. So the example above would become:
message Foo {
Foo.extensions: { start: 100; };
}
message Bar {
Bar.extensions: { start: 100; end: 199; };
}

I'm not sure about file-level options; perhaps expose __file or
something?

These are only proposed changes; the grammar below reflects the
current spec.

proto ::= ( message | extend | enum | import | package | option | ";" )*

import ::= "import" strLit ";"

package ::= "package" ident ";"

option ::= "option" optionBody ";"

optionBody ::= ident ( "." ident )* "=" constant

message ::= "message" ident messageBody

extend ::= "extend" userType messageBody

enum ::= "enum" ident "{" ( option | enumField | ";" )* "}"

enumField ::= ident "=" intLit ";"

service ::= "service" ident "{" ( option | rpc | ";" )* "}"

rpc ::= "rpc" ident "(" userType ")" "returns" "(" userType ")" ";"

messageBody ::= "{" ( field | enum | message | extend | extensions | group | option | ";" )* "}"

group ::= modifier "group" camelIdent "=" intLit messageBody

# tag number must be 2^28-1 or lower
field ::= modifier type ident "=" intLit ( "[" fieldOption ( "," fieldOption )* "]" )? ";"

fieldOption ::= optionBody | "default" "=" constant

extensions ::= "extensions" intLit "to" ( intLit | "max" ) ";"

modifier ::= "required" | "optional" | "repeated"

type ::= "double" | "float" | "int32" | "int64" | "uint32" | "uint64"
| "sint32" | "sint64" | "fixed32" | "fixed64" | "sfixed32" | "sfixed64"
| "bool" | "string" | "bytes" | userType

# leading dot for identifiers means they're fully qualified
userType ::= ( "."? ident )+

constant ::= ident | intLit | floatLit | strLit | boolLit

ident ::= /[A-Za-z_][\w_]*/

# according to parser.cc, group names must start with a capital letter
# as a hack for backwards-compatibility
camelIdent ::= /[A-Z][\w_]*/

intLit ::= decInt | hexInt | octInt

decInt ::= /\d+/

hexInt ::= /0[xX]([A-Fa-f0-9])+/

octInt ::= /0[0-7]+/

# allow_f_after_float_ is disabled by default in tokenizer.cc
floatLit ::= /\d+(\.\d+)?([Ee][\+-]?\d+)?/

boolLit ::= "true" | "false"

strLit ::= quote ( hexEscape | octEscape | charEscape | /[^\0\n]/ )* quote

quote ::= /["']/

hexEscape ::= /\\[Xx][A-Fa-f0-9]{1,2}/

octEscape ::= /\\0?[0-7]{1,3}/

charEscape ::= /\\[abfnrtv\\\?'"]/
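
To make the "split the lexer and parser" remark concrete, here is a minimal sketch in plain Python of how one rule, enum, could be turned into a hand-written tokenizer plus a recursive-descent parser. The function names (tokenize, int_value, parse_enum) are hypothetical, options inside the enum body are left out, and only the tokens this one rule needs are recognised, so treat it as an illustration rather than a faithful port of protoc.

```python
import re

# One regex per token class: identifier, integer literal, punctuation.
TOKEN_RE = re.compile(
    r"\s*(?:([A-Za-z_][A-Za-z0-9_]*)|(0[xX][0-9A-Fa-f]+|\d+)|([{}=;]))"
)


def tokenize(text):
    """Split text into (kind, value) pairs; raise on anything unrecognised."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        ident, number, punct = m.groups()
        if ident is not None:
            tokens.append(("ident", ident))
        elif number is not None:
            tokens.append(("int", number))
        else:
            tokens.append(("punct", punct))
        pos = m.end()
    return tokens


def int_value(lit):
    """Convert a hex, octal (leading 0), or decimal literal to an int."""
    if lit[:2].lower() == "0x":
        return int(lit, 16)
    if len(lit) > 1 and lit[0] == "0":
        return int(lit, 8)
    return int(lit, 10)


def parse_enum(tokens):
    """enum ::= "enum" ident "{" ( enumField | ";" )* "}"   (options omitted)"""
    pos = 0

    def expect(kind, value=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok_kind, tok_value = tokens[pos]
        if tok_kind != kind or (value is not None and tok_value != value):
            raise SyntaxError(f"expected {value or kind}, got {tok_value!r}")
        pos += 1
        return tok_value

    expect("ident", "enum")
    name = expect("ident")
    expect("punct", "{")
    fields = {}
    while pos < len(tokens) and tokens[pos] != ("punct", "}"):
        if tokens[pos] == ("punct", ";"):
            pos += 1  # empty statement
            continue
        field_name = expect("ident")
        expect("punct", "=")
        fields[field_name] = int_value(expect("int"))
        expect("punct", ";")
    expect("punct", "}")
    return name, fields
```

For example, parse_enum(tokenize("enum Color { RED = 0; BLUE = 0x10; }")) yields the enum name together with a dict of its values.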

Alek Storm

Jul 11, 2008, 7:52:06 PM
to Protocol Buffers
Next up: grammars for PB text and serialized formats. Assuming my
headache subsides, and I don't get distracted by other projects.

Kenton Varda

Jul 11, 2008, 8:29:14 PM
to Alek Storm, Protocol Buffers
Thanks, Alek, this is useful info!  I looked it over and corrected a couple things below...

A couple comments:

On Fri, Jul 11, 2008 at 4:14 PM, Alek Storm <alek....@gmail.com> wrote:
> I think 'extensions' should be a message-level option, not its own
> keyword.  It's not a field or a declaration, it's just... an option.

This is debatable.  One immediate problem with using an option is that it's actually possible to declare multiple extension ranges in a single message, either on multiple lines, or with something like:

extensions 1 to 3,6,9,30 to max;

(That doesn't seem to be covered in your grammar, nor is it documented...)

It's actually useful to declare multiple extension ranges because sometimes you want to take an existing field and convert it to an extension, e.g. to break a dependency that most people don't need.

Another issue is that the compiler needs to verify that field numbers and extension ranges don't overlap, and that extensions defined for a message have numbers that reside in that message's extension ranges.  Options are normally used for things that are not critical to the message definition and could reasonably be ignored by an implementation.  Think of options like Java annotations.  It's kind of mushy logic, but I feel that extension ranges are too important to be options.
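
The two checks described here can be sketched in a few lines of Python. This is only an illustration: check_message is a hypothetical helper, ranges are modelled as inclusive (start, end) pairs, and nothing here reflects how protoc itself structures its validation.

```python
def check_message(field_numbers, extension_ranges, extension_numbers):
    """Return a list of problems; an empty list means the message is consistent.

    field_numbers:      numbers of ordinary fields declared in the message
    extension_ranges:   inclusive (start, end) pairs declared with `extensions`
    extension_numbers:  numbers used by `extend` blocks targeting the message
    """
    problems = []
    ranges = sorted(extension_ranges)

    # After sorting by start, any overlap shows up between neighbours.
    for (s1, e1), (s2, e2) in zip(ranges, ranges[1:]):
        if s2 <= e1:
            problems.append(f"extension ranges {s1}-{e1} and {s2}-{e2} overlap")

    def in_some_range(n):
        return any(start <= n <= end for start, end in ranges)

    for n in field_numbers:
        if in_some_range(n):
            problems.append(f"field number {n} falls inside an extension range")

    for n in extension_numbers:
        if not in_some_range(n):
            problems.append(f"extension number {n} is outside every extension range")

    return problems
```

Calling check_message([1, 2, 3], [(100, 199)], [150]) returns an empty list, while a field numbered 100 in the same message would be reported.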
 
> In addition, options should use text format, because they're really
> just setting values in descriptor.proto.

I agree.  In particular, if we ever introduce message-typed options, the syntax for them should be protobuf text format.
 
> The descriptor would be
> exposed through the object's name. So the example above would become:

Interesting idea, but I think we're stuck with the "option foo =" syntax.  The option syntax actually predates descriptor.proto, so the idea of using text format for options wasn't obvious at the time.

> proto ::= ( message | extend | enum | import | package | option | ";" )*
>
> import ::= "import" strLit ";"
>
> package ::= "package" ident ";"

package ::= "package" ident ( "." ident )* ";"
 
> option ::= "option" optionBody ";"
>
> optionBody ::= ident ( "." ident )* "=" constant
>
> message ::= "message" ident messageBody
>
> extend ::= "extend" userType messageBody

extend ::= "extend" userType "{" ( field | group | ";" )* "}"
 
> enum ::= "enum" ident "{" ( option | enumField | ";" )* "}"
>
> enumField ::= ident "=" intLit ";"
>
> service ::= "service" ident "{" ( option | rpc | ";" )* "}"
>
> rpc ::= "rpc" ident "(" userType ")" "returns" "(" userType ")" ";"
>
> messageBody ::= "{" ( field | enum | message | extend | extensions | group | option | ";" )* "}"
>
> group ::= modifier "group" camelIdent "=" intLit messageBody
>
> # tag number must be 2^28-1 or lower

I think the limit is actually 2^29-1.  Did you find otherwise?

Also, they must be positive (non-zero), and the range 19000 through 19999 is reserved.
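
Taken together, the three constraints on tag numbers (positive, at most 2^29-1, and outside 19000 through 19999) fit in a one-line predicate. The sketch below uses hypothetical names; the 29-bit limit is consistent with a 32-bit wire-format key that spends its low 3 bits on the wire type, though that reasoning is not stated in this thread.

```python
MAX_FIELD_NUMBER = 2**29 - 1        # 536870911
RESERVED = range(19000, 20000)      # 19000 through 19999 inclusive


def is_valid_field_number(n):
    """True if n is positive, within the 29-bit limit, and not reserved."""
    return 1 <= n <= MAX_FIELD_NUMBER and n not in RESERVED
```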
 
> field ::= modifier type ident "=" intLit ( "[" fieldOption ( "," fieldOption )* "]" )? ";"
>
> fieldOption ::= optionBody | "default" "=" constant
>
> extensions ::= "extensions" intLit "to" ( intLit | "max" ) ";"

extensions ::= "extensions" extRange ( "," extRange )* ";"

extRange ::= intLit ( "to" ( intLit | "max" ) )?
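
As a rough illustration of the multi-range form, the following Python sketch parses a statement such as "extensions 1 to 3,6,9,30 to max;" into inclusive (start, end) pairs. The helper name parse_extensions is hypothetical, only decimal literals are handled, and mapping "max" to 2^29-1 is an assumption made for this sketch.

```python
import re

MAX_FIELD_NUMBER = 2**29 - 1  # assumed meaning of "max" in this sketch

_STATEMENT_RE = re.compile(r"\s*extensions\s+(.*?)\s*;\s*")
_RANGE_RE = re.compile(r"(\d+)(?:\s+to\s+(\d+|max))?")


def parse_extensions(statement):
    """Parse an `extensions` statement into a list of inclusive (start, end) pairs."""
    m = _STATEMENT_RE.fullmatch(statement)
    if not m:
        raise ValueError("not an extensions statement")
    ranges = []
    for part in m.group(1).split(","):
        r = _RANGE_RE.fullmatch(part.strip())
        if not r:
            raise ValueError(f"bad extension range: {part!r}")
        start = int(r.group(1))
        end_text = r.group(2)
        if end_text is None:
            end = start                 # a single number, e.g. "6"
        elif end_text == "max":
            end = MAX_FIELD_NUMBER      # open-ended range
        else:
            end = int(end_text)
        ranges.append((start, end))
    return ranges
```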
 
> modifier ::= "required" | "optional" | "repeated"

I would either call this "label" or "cardinality".  ("Label" is the word we use throughout the code currently.  "Cardinality" would have been a better choice.  Oh well.)
 
> type ::= "double" | "float" | "int32" | "int64" | "uint32" | "uint64"
>       | "sint32" | "sint64" | "fixed32" | "fixed64" | "sfixed32" | "sfixed64"
>       | "bool" | "string" | "bytes" | userType
>
> # leading dot for identifiers means they're fully qualified
> userType ::= ( "."? ident )+

I think that matches ".foo bar".  Maybe it should be:

userType ::= "."? ident ( "." ident )*
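
The difference between the two userType rules is easiest to see at the token level. The Python sketch below (hypothetical helper names, tokens given as a plain list of strings) implements both rules by hand and shows that only the first one accepts the token sequence for ".foo bar".

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_ident(token):
    return _IDENT.fullmatch(token) is not None


def matches_original(tokens):
    # ( "."? ident )+ : every ident may optionally be preceded by a dot.
    i, groups = 0, 0
    while i < len(tokens):
        if tokens[i] == ".":
            i += 1
        if i >= len(tokens) or not is_ident(tokens[i]):
            return False
        i += 1
        groups += 1
    return groups >= 1


def matches_corrected(tokens):
    # "."? ident ( "." ident )* : idents after the first must follow a dot.
    i = 0
    if i < len(tokens) and tokens[i] == ".":
        i += 1
    if i >= len(tokens) or not is_ident(tokens[i]):
        return False
    i += 1
    while i < len(tokens):
        if tokens[i] != "." or i + 1 >= len(tokens) or not is_ident(tokens[i + 1]):
            return False
        i += 2
    return True
```

With tokens [".", "foo", "bar"], matches_original returns True and matches_corrected returns False; both accept [".", "foo", ".", "bar"].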
 
> constant ::= ident | intLit | floatLit | strLit | boolLit
>
> ident ::= /[A-Za-z_][\w_]*/
>
> # according to parser.cc, group names must start with a capital letter
> # as a hack for backwards-compatibility
> camelIdent ::= /[A-Z][\w_]*/
>
> intLit ::= decInt | hexInt | octInt
>
> decInt ::= /\d+/

To avoid ambiguity with octInt:

decInt ::= /[1-9]\d*/
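
The ambiguity can be checked mechanically. The Python sketch below (classify is a hypothetical helper) reports which of the three integer rules match a whole literal, once with the original decInt and once with the revised one. One side effect worth noting: with the rules exactly as written, a lone "0" matches none of the revised rules, since octInt requires at least one digit after the leading zero. Letting octInt accept a bare 0 would be one possible adjustment, though nobody in the thread proposes it.

```python
import re

OLD_DEC = re.compile(r"\d+")
NEW_DEC = re.compile(r"[1-9]\d*")
HEX = re.compile(r"0[xX][A-Fa-f0-9]+")
OCT = re.compile(r"0[0-7]+")


def classify(literal, dec=NEW_DEC):
    """Return the names of every integer rule that matches the whole literal."""
    rules = {"decInt": dec, "hexInt": HEX, "octInt": OCT}
    return [name for name, rx in rules.items() if rx.fullmatch(literal)]
```

For instance, classify("017", OLD_DEC) names both decInt and octInt, whereas classify("017") names only octInt.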
 
> hexInt ::= /0[xX]([A-Fa-f0-9])+/
>
> octInt ::= /0[0-7]+/
>
> # allow_f_after_float_ is disabled by default in tokenizer.cc
> floatLit ::= /\d+(\.\d+)?([Ee][\+-]?\d+)?/
>
> boolLit ::= "true" | "false"
>
> strLit ::= quote ( hexEscape | octEscape | charEscape | /[^\0\n]/ )* quote

Need to clarify that strLit cannot contain unescaped internal quotes matching the outer quotes, but that makes the regex really complicated.

Alek Storm

Jul 11, 2008, 10:01:48 PM
to Protocol Buffers
On Jul 11, 7:29 pm, "Kenton Varda" <ken...@google.com> wrote:
> On Fri, Jul 11, 2008 at 4:14 PM, Alek Storm <alek.st...@gmail.com> wrote:
> > I think 'extensions' should be a message-level option, not its own
> > keyword. It's not a field or a declaration, it's just... an option.
>
> This is debatable. One immediate problem with using an option is that it's
> actually possible to declare multiple extension ranges in a single message,
> either on multiple lines, or with something like:
>
> extensions 1 to 3,6,9,30 to max;
>
> (That doesn't seem to be covered in your grammar, nor is it documented...)

Ah, I missed that. In that case, the current implementation of options
won't work ("option extensionStart ="), but ranges are easy to handle
using the new option text format I described above: just move the
following from DescriptorProto to MessageOptions in descriptor.proto:
message ExtensionRange {
optional int32 start = 1;
optional int32 end = 2;
}
repeated ExtensionRange extension_range = 5;

You can then use the text format I previously described for each
range. I was about to write the *exact same thing* myself, only to
discover it already existed (you might want to make 'start' required).
I'm guessing 'end' is omitted when 'max' is specified.

> Another issue is that the compiler needs to verify that field numbers and
> extension ranges don't overlap, and that extensions defined for a message
> have numbers that reside in that message's extension ranges. Options are
> normally used for things that are not critical to the message definition and
> could reasonably be ignored by an implementation. Think of options like
> Java annotations. It's kind of mushy logic, but I feel that extension
> ranges are too important to be options.

I see what you're saying - these decisions are based more on a gut
feeling. It's just that options give us a much more flexible
framework. Using options, if we want to change how extensions work in
the future, we can just modify descriptor.proto. There are obvious
versioning advantages here, since we can take advantage of tag numbers
in the MessageOptions message. Using a keyword, if we changed how
extensions work, we'd have to change the language, breaking old .proto
files. I can't believe I'm trying to convince someone to use their own
framework ;)

> > In addition, options should use text format, because they're really
> > just setting values in descriptor.proto.
>
> I agree. In particular, if we ever introduce message-typed options, the
> syntax for them should be protobuf text format.
>
> > The descriptor would be
> > exposed through the object's name. So the example above would become:
>
> Interesting idea, but I think we're stuck with the "option foo =" syntax.
> The option syntax actually predates descriptor.proto, so the idea of using
> text format for options wasn't obvious at the time.

So we're good for Protocol Buffers 3.0? :)

-----------------

Old: package ::= "package" ident ";"
New: package ::= "package" ident ( "." ident )* ";"

Thanks.

Old: extend ::= "extend" userType messageBody
New: extend ::= "extend" userType "{" ( field | group | ";" )* "}"

Okay, I checked the source, and you're right. But what if I want to
add a message to an already-defined message's namespace? Take the
AddressBook example, and say phone numbers were added through an
extension, like this:

message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
extensions 4 to max;
}

extend Person {
enum PhoneType { ... }
message PhoneNumber { ... }
repeated PhoneNumber phone = 4;
}

This doesn't work, but I think it should. Nested types should be able
to be added through extensions. Or we could do:
extend Person {
repeated PhoneNumber phone = 4;
}

enum Person.PhoneType { ... }
message Person.PhoneNumber { ... }

> > # tag number must be 2^28-1 or lower
>
> I think the limit is actually 2^29-1. Did you find otherwise?

Looks like an error in src/google/protobuf/unittest.proto. Line 331
says 2^28-1, and that's what it's using for its test value. I checked,
and protoc barfs on 2^29, not 2^28, so unittest.proto needs to be
fixed. Shouldn't have trusted the documentation on that one.

> Also, they must be positive (non-zero), and the range 19000 through 19999 is
> reserved.

With the modification you made to decInt, we can do this:
Old: field ::= modifier type ident "=" intLit ( "[" fieldOption ( ","
fieldOption )* "]" )? ";"
New: field ::= modifier type ident "=" decInt ( "[" fieldOption ( ","
fieldOption )* "]" )? ";"

Note that this is a hack, and a comment should be included to explain.

> I would either call this "label" or "cardinality". ("Label" is the word we
> use throughout the code currently. "Cardinality" would have been a better
> choice. Oh well.)

Old: modifier ::= "required" | "optional" | "repeated"
New: label ::= "required" | "optional" | "repeated"
(References also changed)

> > # leading dot for identifiers means they're fully qualified
> > userType ::= ( "."? ident )+
>
> I think that matches ".foo bar". Maybe it should be:
>
> userType ::= "."? ident ( "." ident )*

Wow, that was a stupid mistake. Thanks. I think I had it that way
originally, but then I changed it ("It's shorter!" I thought).

> > intLit ::= decInt | hexInt | octInt
> > octInt ::= /0[0-7]+/
> To avoid ambiguity with octInt:

Old: decInt ::= /\d+/
New: decInt ::= /[1-9]\d*/

> > strLit ::= quote ( hexEscape | octEscape | charEscape | /[^\0\n]/ )*
> > quote
>
> Need to clarify that strLit cannot contain unescaped internal quotes
> matching the outer quotes, but that makes the regex really complicated.

Yup, forgot that. In addition, the starting and ending quotes have to
match. Here's how I would do it with Python's re module:
strLit ::= /(?P<quote>["'])(\\[Xx][A-Fa-f0-9]{1,2}|\\0?[0-7]{1,3}|\\[abfnrtv\\\?'"]|(?!(?P=quote))[^\0\n\\])*(?P=quote)/

That's with the escape rules combined, and Python's re certainly can't
be relied upon for a definition. So I guess we should just add a
comment to strLit explaining.
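
One detail matters when trying this in Python: a backreference such as (?P=quote) has no special meaning inside a character class, so "any character other than the opening quote" has to be written as a negative lookahead in front of the class. The plain-character class also needs to exclude the backslash, otherwise backtracking lets an escaped quote end the string. A runnable sketch of that approach follows; STR_LIT is just a name chosen for this example.

```python
import re

STR_LIT = re.compile(
    r"""(?P<quote>["'])                 # opening quote, remembered by name
        (?: \\[Xx][A-Fa-f0-9]{1,2}      # hex escape
          | \\0?[0-7]{1,3}              # octal escape
          | \\[abfnrtv\\?'"]            # single-character escape
          | (?!(?P=quote))[^\0\n\\]     # plain char: not the opening quote, not a backslash
        )*
        (?P=quote)                      # closing quote must equal the opening one
    """,
    re.VERBOSE,
)


def is_string_literal(text):
    """True if the whole of text is one well-formed string literal."""
    return STR_LIT.fullmatch(text) is not None
```

Under these rules a double-quoted string may contain an unescaped single quote and vice versa, but mismatched outer quotes and unescaped inner quotes of the same kind are rejected.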

Also, I should point out that the grammar is *not* designed for non-
ambiguity, it's designed for readability. This is especially obvious
with rules like 'camelIdent' and 'boolLit', which plainly conflict
with 'ident'. I think it should stay that way (nobody's challenged
this yet, but just in case).

Once you think the grammar is okay, I'll repost the whole thing with
changes.

Thanks again for the feedback!

K Livingston

Jul 12, 2008, 12:54:16 AM
to Protocol Buffers
Hey this is great, I am working on a lisp implementation of PB (oh ps,
I'm working on a common lisp implementation**) and I was hoping to get
my hands on some more specific specifications or BNF for the .proto
file language, but I was waiting until my implementation was a little
further before I started asking a lot of questions though.

The leading dot to be a fully qualified name is news to me, and good
to know (I was starting to make some assumptions about all that).
Also I had to reorder some operations when I realized all names didn't
need to be forward declared, when I looked at unittest.proto the other
day.

since we're on the topic, question:
I'm assuming name spaces and package statements are always local to
the file they are in, and don't apply to imported files. ie. imported
files start in a clean namespace, and it's not the same as inlining
their contents into the current file. also import statements can only
occur at the top level in a file.

thanks,
Kevin

** more to come on that very soon. (I'm probably about 75% to a
functioning implementation.)

Alek Storm

Jul 12, 2008, 1:50:42 AM
to Protocol Buffers
On Jul 11, 11:54 pm, K Livingston <kevinlivingston.pub...@gmail.com>
wrote:
> I'm assuming name spaces and package statements are always local to
> the file they are in, and don't apply to imported files.  ie. imported
> files start in a clean namespace, and it's not the same as inlining
> their contents into the current file.  also import statements can only
> occur at the top level in a file.

I can answer the latter definitely: yes, they can only occur at the
top level. As for the former, I'm guessing you mean that importing
foo into bar doesn't cause it to become bar.foo. I can only assume
this is the case, as it would be consistent with C++'s and Java's
semantics.

Kenton Varda

Jul 12, 2008, 7:06:53 PM
to Alek Storm, Protocol Buffers
Grammar looks ok.

Actually, I don't think you need to detect and reject field value 0 as part of the grammar.  protoc actually does this in the validation phase (in DescriptorPool), not in the parser.

On Fri, Jul 11, 2008 at 7:01 PM, Alek Storm <alek....@gmail.com> wrote:
> Okay, I checked the source, and you're right. But what if I want to
> add a message to an already-defined message's namespace?

Extensions do not add things to other namespaces.  If you have:

package pkg;
message Foo { extensions 100 to max; }
extend Foo {
  optional int32 bar = 123;
}

"bar" is *not* placed in Foo's namespace.  It's placed at the package scope.  So you would refer to it in C++ as pkg::bar, not pkg::Foo::bar.  This is why you need to use special methods to access extensions, e.g.:
  pkg::Foo foo;
  foo.SetExtension(pkg::bar, 1);

So there's no reason to declare types inside an extend block, since it would be equivalent to declaring it outside the extension block.

When we added extensions to the language, I actually argued that the syntax should have been:

  extend Foo with optional int32 bar = 123;

That way it doesn't look like you're putting "bar" in any other scope.  Others disagreed with me, though.  Oh well.
 
> Looks like an error in src/google/protobuf/unittest.proto. Line 331
> says 2^28-1, and that's what it's using for its test value. I checked,
> and protoc barfs on 2^29, not 2^28, so unittest.proto needs to be
> fixed.  Shouldn't have trusted the documentation on that one.

Thanks, I'll correct that.

Kenton Varda

Jul 12, 2008, 7:13:47 PM
to K Livingston, Protocol Buffers
On Fri, Jul 11, 2008 at 9:54 PM, K Livingston <kevinliving...@gmail.com> wrote:

> Hey this is great, I am working on a lisp implementation of PB (oh ps,
> I'm working on a common lisp implementation**) and I was hoping to get
> my hands on some more specific specifications or BNF for the .proto
> file language, but I was waiting until my implementation was a little
> further before I started asking a lot of questions though.

I'm happy to hear someone is implementing LISP support!

You should consider just writing a new CodeGenerator for the existing compiler rather than writing a whole new parser.  That way if the parser changes you don't have to do anything to keep up.
 
> since we're on the topic, question:
> I'm assuming name spaces and package statements are always local to
> the file they are in, and don't apply to imported files.  ie. imported
> files start in a clean namespace, and it's not the same as inlining
> their contents into the current file.  also import statements can only
> occur at the top level in a file.

Imports can actually appear anywhere in the file.  The only effect of an import is to make the declarations in other files usable within the current file.  In that sense it is like a Python import, not a C++ #include.

As you noticed, declarations can occur in any order.  So, you have to parse everything before you can start looking up type names.
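
Because declarations may appear in any order, a natural structure is two passes: first collect every fully qualified name, then resolve references. The Python sketch below covers only the second pass. It assumes lookups proceed from the innermost enclosing scope outward, which this thread does not spell out, and it ignores subtleties of the real protoc lookup; resolve and the example names are hypothetical.

```python
def resolve(name, scope, symbols):
    """Resolve a type reference against a set of fully qualified names.

    name:    reference as written, e.g. "Bar", "Foo.Bar" or ".pkg.Foo.Bar"
             (a leading dot means the name is already fully qualified)
    scope:   dotted name of the enclosing message or package, e.g. "pkg.Foo"
    symbols: every fully qualified name collected in the first pass
    Returns the fully qualified name, or None if nothing matches.
    """
    if name.startswith("."):
        full = name[1:]
        return full if full in symbols else None

    parts = scope.split(".") if scope else []
    while True:
        # Try the innermost scope first, then walk outwards to the root.
        candidate = ".".join(parts + [name])
        if candidate in symbols:
            return candidate
        if not parts:
            return None
        parts.pop()
```

With symbols {"pkg.Foo", "pkg.Foo.Bar", "pkg.Bar"}, a reference to Bar from inside pkg.Foo resolves to pkg.Foo.Bar, while .pkg.Bar always resolves to pkg.Bar.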

Alek Storm

Jul 12, 2008, 11:35:20 PM
to Protocol Buffers
On Jul 12, 6:13 pm, "Kenton Varda" <ken...@google.com> wrote:
> On Fri, Jul 11, 2008 at 9:54 PM, K Livingston <
> > since we're on the topic, question:
> > I'm assuming name spaces and package statements are always local to
> > the file they are in, and don't apply to imported files.  ie. imported
> > files start in a clean namespace, and it's not the same as inlining
> > their contents into the current file.  also import statements can only
> > occur at the top level in a file.
>
> Imports can actually appear anywhere in the file.  The only effect of an
> import is to make the declarations in other files usable within the current
> file.  In that sense it is like a Python import, not a C++ #include.
>
> As you noticed, declarations can occur in any order.  So, you have to parse
> everything before you can start looking up type names.

He didn't mean the top line, he meant the top *level*. As in, they
can't occur inside a message declaration.

Alek Storm

Jul 13, 2008, 1:54:05 AM
to Protocol Buffers
On Jul 12, 6:06 pm, "Kenton Varda" <ken...@google.com> wrote:
> Grammar looks ok.
> Actually, I don't think you need to detect and reject field value 0 as part
> of the grammar.  protoc actually does this in the validation phase (in
> DescriptorPool), not in the parser.

Since the grammar is designed for human comprehension, it really
doesn't matter what a specific implementation does, as all parsers
will work differently. Therefore, we need to include as much
information as possible in the reference grammar. That being said,
you're completely right for a different reason: that change would make
the grammar reject octal 01 and hex 0x1 as tag numbers. I'll just
include a comment about it.

> Extensions do not add things to other namespaces.  If you have:
>
> package pkg;
> message Foo { extensions 100 to max; }
> extend Foo {
>   optional int32 bar = 123;
>
> }
>
> "bar" is *not* placed in Foo's namespace.  It's placed at the package scope.

Oh, okay. I see why it has to be that way.

Alek Storm

Jul 13, 2008, 3:20:16 AM
to Kenton Varda, Protocol Buffers
Okay, here is the revised grammar.  Kenton, can we make this official somehow?  Not just because it's mostly my work (okay, kinda), but because it's the *reference* grammar for the language.  It can't just stay in a Groups thread; nobody will find it.  Perhaps a link to it in the 'Language Guide' page?  Or posted there in full?  Developers need something official so they can be confident that their implementations match.  And if the spec changes in the future, it's a lot easier to just check the new grammar.

Because line breaks got mangled in my first post, I've included this one as an attachment.
proto2.ebnf

Kenton Varda

Jul 15, 2008, 1:04:13 AM
to Alek Storm, Protocol Buffers
On Sat, Jul 12, 2008 at 8:35 PM, Alek Storm <alek....@gmail.com> wrote:
> He didn't mean the top line, he meant the top *level*. As in, they
> can't occur inside a message declaration.

Oops, sorry.  Yes, imports can only appear at the top level, not inside a message definition.

K Livingston

Jul 16, 2008, 2:54:58 AM
to Protocol Buffers
On Jul 12, 6:13 pm, "Kenton Varda" <ken...@google.com> wrote:
> On Fri, Jul 11, 2008 at 9:54 PM, K Livingston <
> kevinlivingston.pub...@gmail.com> wrote:
> > Hey this is great, I am working on a lisp implementation of PB (oh ps,
> > I'm working on a common lisp implementation**) and I was hoping to get
> > my hands on some more specific specifications or BNF for the .proto
> > file language, but I was waiting until my implementation was a little
> > further before I started asking a lot of questions though.
>
> I'm happy to hear someone is implementing LISP support!
>
> You should consider just writing a new CodeGenerator for the existing
> compiler rather than writing a whole new parser.  That way if the parser
> changes you don't have to do anything to keep up.

I'll take a look at that... although, this really started as me saying
to myself, hey that's neat, and it would be nice if Lisp could play
too... I bet I could do that in a couple hundred lines of lisp code
(and it'll give me something to do at the airport)... and make a nice
tutorial on how to do some more interesting things in Lisp out of
the whole experience (like mucking with the readtable, for example).

Well, I was wrong, I don't think it's a couple hundred, as I have that
already, but it's only about O(1000) not including comments (so far)
to have a complete scanner and (nearly complete) parser implemented.
Lisp has a pretty powerful object/method model so the code generator
and generated code should be able to be made incredibly terse. Then
there's the library for (de/)serialization to the specific wire
formats - which again shouldn't be that large, but something I intend
to work on last.

It should have been done by now, to at least play with, but I got hung
up on a few things that we have been discussing elsewhere here, and my
free time has gone to virtually zero starting last Friday and until
this Friday, for various reasons.

By the way, if anyone else is interested in looking into this with me,
shoot me an email. I set this up to document and organize the
effort:
http://code.google.com/p/cl-protobuf/
right now that's about 95% placeholder though.

Kevin



K Livingston

Jul 22, 2008, 2:07:34 PM
to Protocol Buffers
On Jul 11, 6:14 pm, Alek Storm <alek.st...@gmail.com> wrote:
> Here is an EBNF grammar for .proto files. It's based on my reading of

> # leading dot for identifiers means they're fully qualified
> userType ::= ( "."? ident )+
>
> ident ::= /[A-Za-z_][\w_]*/
>
> # according to parser.cc, group names must start with a capital letter
> as a
> # hack for backwards-compatibility
> camelIdent ::= /[A-Z][\w_]*/


Is it a correct understanding that, although there are style guide
recommendations for user-created identifiers, they are effectively
treated as case-*in*sensitive? E.g., do all of the following in .proto
content map to the same symbol: "foo", "Foo", "FOO", "fOo"?
(Which is nice if true, because that's the default in Common Lisp.)

I'm assuming a symbol cannot simultaneously be a message name and an
enum value (or enum type name, for that matter).

Also, just for clarity, if I'm reading the above correctly, identifiers
start with a letter or underscore, and then contain any non-
whitespace characters, right? But they cannot start with a number, like
"1stPerson" or "3rdPerson"?

thanks,
Kevin

Kenton Varda

Jul 22, 2008, 2:11:26 PM
to K Livingston, Protocol Buffers
Symbols are case-sensitive.  For example, you can have:

message Foo {
  message Bar {}
  optional Bar bar = 1;
}

However, some languages are not case-sensitive, so it is up to their code generators to decide how to deal with conflicts.

Identifiers start with a letter or underscore, and can contain letters, underscores, and digits (nothing else).
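
For a code generator that targets a case-insensitive language, one simple way to find the conflicts mentioned above is to group names by their lower-cased form. The Python sketch below is an illustration only; case_collisions is a hypothetical helper, and what to do about a collision (rename, reject, mangle) is left to the generator.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that differ only by case; return only groups with a clash."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

For the example above, case_collisions(["Foo", "Bar", "bar"]) flags Bar and bar as a clash, even though .proto itself treats them as distinct symbols.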

K Livingston

Jul 22, 2008, 5:10:36 PM
to Protocol Buffers
On Jul 22, 1:11 pm, "Kenton Varda" <ken...@google.com> wrote:
> Symbols are case-sensitive.  For example, you can have:
>
> message Foo {
>   message Bar {}
>   optional Bar bar = 1;
>
> }

Ok, although, in that case context can make it clear what is what. So
then, just to be pedantic, it's a violation of style, but technically
allowed to have the following definitions (with different/conflicting
bodies).

message Foo { ... }
message FOO { ... }
message foo { ... }


> However, some languages are not case-sensitive, so it is up to their code
> generators to decide how to deal with conflicts.
>
> Identifiers start with a letter or underscore, and can contain letters,
> underscores, and digits (nothing else).

oh ok, good to know, I assumed hyphens were ok - I'll take that out.

thanks for the help.
it's easiest to get all this right the first time,
Kevin

Kenton Varda

Jul 22, 2008, 5:28:52 PM
to K Livingston, Protocol Buffers
On Tue, Jul 22, 2008 at 2:10 PM, K Livingston <kevinliving...@gmail.com> wrote:
> Ok, although, in that case context can make it clear what is what.  So
> then, just to be pedantic, it's a violation of style, but technically
> allowed to have the following definitions (with different/conflicting
> bodies).
>
> message Foo { ... }
> message FOO { ... }
> message foo { ... }

Yes, technically you could do that.

Ruffian Eo

May 31, 2013, 10:17:51 AM
to prot...@googlegroups.com
I am not very good with reading regular expressions - so there is a good chance I misunderstood:
 

ident ::= /[A-Za-z_][\w_]*/

The rule above, my brain parses as: A letter or underscore followed by 0..* characters which are not whitespaces. (you may laugh if I misread the expression ;) )
 
Trying to parse a proto file with the enum below, produces an error, though. So I would suspect that the ident rule above is not as implemented in the protoc.exe. 
The admittedly obfuscated name of the enum being: A()?@@@@@$$$$$009934^^^^° which would match the regular expression to my understanding.
 

enum A()?@@@@@$$$$$009934^^^^°

{

A$ = 1;

A$$ = 2;

A%%% = 3;

}

->>> Output of protoc.exe
> protoc test5.proto -I. --cpp_out=.
test5.proto:5:7: Expected "{".
test5.proto:5:22: Numbers starting with leading zero must be in octal.
 
where  "line 5" is the line with the enum with the funny name...
 
For my implementation I hoped to simply pass the EBNF given here to customers, with the implication that "my tool works as long as the grammar is...". So this post is by no means just joking around and finding esoteric cases.. you probably know how creative customers can be... ;)

Stephen Tu

May 31, 2013, 10:31:40 AM
to Ruffian Eo, prot...@googlegroups.com
The [\w_]* means to match any sequence of "word" characters or an underscore (which I believe is redundant, since \w already includes the underscore). Please see: http://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Regular_expression



Ruffian Eo

May 31, 2013, 10:55:23 AM
to prot...@googlegroups.com, Ruffian Eo
Oh yes - thanks! I just noticed myself.
[^\s]* would have been what I first understood.
I so hate that write-only language they call "Regular expressions".
To make it easier to read I propose to rephrase as the more intuitive:
 
ident ::= /[A-Za-z_][A-Za-z0-9_]*/
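
Beyond readability, the explicit character class has a practical advantage in some regex engines. In Python 3, for example, \w on a str pattern also matches non-ASCII letters unless the re.ASCII flag is given, so the two spellings of ident are not equivalent there. A small sketch comparing them (the pattern names are made up for this example):

```python
import re

IDENT_W = re.compile(r"[A-Za-z_][\w_]*")               # as in the original grammar
IDENT_W_ASCII = re.compile(r"[A-Za-z_]\w*", re.ASCII)  # \w restricted to ASCII
IDENT_EXPLICIT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*") # proposed spelling


def accepted_by(name):
    """Return which of the three patterns accept the whole name."""
    patterns = {
        "w": IDENT_W,
        "w_ascii": IDENT_W_ASCII,
        "explicit": IDENT_EXPLICIT,
    }
    return [label for label, rx in patterns.items() if rx.fullmatch(name)]
```

All three accept _foo9 and reject both 1stPerson and A$, but only the unrestricted \w version accepts a name such as naïve.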

Walter Schulze

Jan 1, 2015, 5:11:38 AM
to prot...@googlegroups.com, ken...@google.com
Is this grammar now officially available and supported somewhere?