Best way to parse SourceCodeInfo Data From Protobuf Files

Kyle Papili

Sep 1, 2022, 10:16:52 AM
to Protocol Buffers

I'm parsing a large number of protobuf files and am using the Source Code Info descriptor to extract comment data from the source files as well. I currently use the FileDescriptorProto.ListFields() method to extract the DescriptorProto objects I care about as well as the SourceCodeInfo.

To my knowledge, the only way to pair up Location fields with the corresponding objects is via the path attribute. This is fine, except that it requires me to step through the path manually to land at my parsed protobuf object. This gets complicated with layers of nested_types, and I am convinced there must be a way to extract the path from a particular DescriptorProto object and then use it to match the object with the path in the corresponding Location field.

In short: How can I easily pair up DescriptorProto objects with the Location objects that correspond to them? Specifically for comment parsing purposes.

sh...@google.com

Sep 7, 2022, 6:11:32 PM
to Protocol Buffers
First keep in mind that some comments are detached and thus ignored by SourceCodeInfo.

Jerry Berg

Sep 9, 2022, 4:02:05 PM
to Protocol Buffers
Unfortunately, the only way to know the path to the Location object is to know the path to the descriptor proto object in question.
Alternatively, you could iterate through all the sourcecodeinfo elements and use their paths to navigate to the correct descriptor object.
One technique I have used in the past is to iterate through all the sourcecodeinfo elements and store the location object in a custom option extension on the object in question (or on the parent object if it is something that doesn't have options).

Also, as shaod@ points out, some comments will not show up in sourcecodeinfo.
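
The path matching Jerry describes can also be turned around: index every Location by its path once, then walk the descriptors while tracking the path that leads to each one. Below is a minimal Python sketch of that idea, assuming the `protobuf` package is installed. `index_locations` and `walk_messages` are hypothetical helper names, not part of the protobuf API, and the sample file is built by hand as a stand-in for protoc output. It covers messages and nested messages only.

```python
from google.protobuf import descriptor_pb2

# Field numbers taken from descriptor.proto.
FILE_MESSAGE_TYPE = 4  # FileDescriptorProto.message_type
MSG_NESTED_TYPE = 3    # DescriptorProto.nested_type


def index_locations(file_proto):
    """Map each Location's path (as a tuple) to the Location itself."""
    return {tuple(loc.path): loc for loc in file_proto.source_code_info.location}


def walk_messages(messages, prefix):
    """Yield (path, DescriptorProto) for each message and its nested messages."""
    for i, msg in enumerate(messages):
        path = prefix + (i,)
        yield path, msg
        yield from walk_messages(msg.nested_type, path + (MSG_NESTED_TYPE,))


# Hand-built sample (an assumption for illustration, not real protoc output).
fd = descriptor_pb2.FileDescriptorProto(name="example.proto")
outer = fd.message_type.add(name="Outer")
outer.nested_type.add(name="Inner")
loc = fd.source_code_info.location.add()
loc.path.extend([4, 0, 3, 0])
loc.leading_comments = " Comment on Inner\n"

locations = index_locations(fd)
comments = {}
for path, msg in walk_messages(fd.message_type, (FILE_MESSAGE_TYPE,)):
    found = locations.get(path)
    comments[msg.name] = found.leading_comments if found else None
```

Keying the index by `tuple(path)` keeps each lookup cheap. Enums, fields, and services could be handled the same way by extending the walk with their own field numbers.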

Kyle Papili

Sep 9, 2022, 4:08:12 PM
to Protocol Buffers
Yes, the "hacky method" proposed by shaod@ is basically what I am doing currently. It just seems to be unnecessarily complicated. 

What do you mean by "store the location object in a custom option extension on the object in question"? How would I store the location object as a custom extension of the object without knowing the object? If I knew which object that location corresponded to, my problem would be resolved. From what I've found, the only way to match up location objects to proto objects is the hacky path traversal suggested by shaod@. Am I missing something here?

Jerry Berg

Sep 9, 2022, 4:14:46 PM
to Kyle Papili, Protocol Buffers
Ah, no, there is no magic. I only meant that if you wanted to have one part of your code match up location data to descriptor object and attach the location info directly, you could do it in a custom option. There's no getting around the actual awkward stepping through the paths to match them up.

Kyle Papili

Sep 9, 2022, 4:15:51 PM
to Protocol Buffers
Is there somewhere in the documentation that provides a clear table describing which numbers in the path correlate to which types? I have found some inconsistencies with what I had assumed. Is there a link to a table like this?

Jerry Berg

Sep 9, 2022, 4:38:21 PM
to Kyle Papili, Protocol Buffers
The ints in the path are the field numbers and array indices along the way from the top-level FileDescriptorProto, like this:
Path: [4, 0, 2, 0]
Starting with the FileDescriptorProto:
4 -> FileDescriptorProto {
  ...
  repeated DescriptorProto message_type = 4;
  repeated EnumDescriptorProto enum_type = 5;
}
0 -> index into FileDescriptorProto.message_type[0]
2 -> DescriptorProto {
  optional string name = 1;
  repeated FieldDescriptorProto field = 2;
  ...
}
0 -> index into DescriptorProto.field[0]

Thus this path/Location [4, 0, 2, 0] applies to the whole field statement.
I believe the index of a message in the message_type array generally corresponds to the order of all top-level message items in the file.
I also believe that the index of a field likewise corresponds to the ordering of fields within the message.

So if you have to deal with nested messages, the path will start with:
[4, (top-level-message-index), 3, (index-of-nested-message-type), ...]

If I remember correctly, this breaks down for options: sometimes the comments/location for an option is dropped, and when it is present the path points to field 999, uninterpreted_option.

But maybe you already had gotten that far and I misunderstood your question.
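
The nested-message pattern above ([4, top-level index, 3, nested index, ...]) extends to any depth: each extra level of nesting appends 3 (DescriptorProto.nested_type) and an index. A small sketch in Python follows. `nested_message_path` is a hypothetical helper written for illustration, not something the protobuf library provides.

```python
def nested_message_path(top_level_index, *nested_indices):
    """Build the SourceCodeInfo path to a (possibly nested) message.

    4 is FileDescriptorProto.message_type and 3 is DescriptorProto.nested_type,
    both taken from descriptor.proto.
    """
    path = [4, top_level_index]
    for index in nested_indices:
        path += [3, index]
    return path


# Path to the second nested message inside the first top-level message.
inner = nested_message_path(0, 1)

# Appending [2, n] (DescriptorProto.field, index n) addresses one of its fields.
inner_field = inner + [2, 3]
```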


Kyle Papili

Sep 9, 2022, 4:45:02 PM
to Protocol Buffers
My question is far dumber haha. Is there a table that describes what Field numbers correlate to what object types?

I've seen 1, 2, 3, 4, 5, 6, 7 show up in paths as field numbers. My naive brain was under the impression that they correlated to object types, no?

4: Message, 5: Enum, 6: Extension??

Is this not correct? Is there a table that can show me what each field number correlates to?

Jerry Berg

Sep 9, 2022, 5:07:10 PM
to Kyle Papili, Protocol Buffers
Yeah, it's a bit confusing, but the numbers are not types.  They are field numbers and array indices -- that's all. So the table is descriptor.proto. 

Start with the FileDescriptorProto and dereference by field number; if you hit an array (a repeated field), the next number is an index:
message FileDescriptorProto {
optional string name = 1; // file name, relative to root of source tree
optional string package = 2; // e.g. "foo", "foo.bar", etc.
// Names of files imported by this file.
repeated string dependency = 3;
...
// All top-level definitions in this file.
repeated DescriptorProto message_type = 4;
repeated EnumDescriptorProto enum_type = 5;
repeated ServiceDescriptorProto service = 6;
repeated FieldDescriptorProto extension = 7;
}

A path starting with 1 would refer to the file name (shouldn't have any further numbers).
A path starting with 2 would refer to the package (shouldn't have any further numbers).
A path starting with 3 would refer to a dependency (import), the next number in the path is an array index (which dependency)
A path starting with 4 refers to a message, the next number is the array index that tells you which message.
A path starting with 5 refers to an enum, the next number is the array index that tells you which enum.
A path starting with 6 refers to a service, the next number is the array index that tells you which service.

If you have a path starting with [4,0,...] you are looking at fileDescriptor.getMessageType(0);
If you have a path starting with [4,1,...] you are looking at fileDescriptor.getMessageType(1);
If you have a path starting with [5,2,...] you are looking at fileDescriptor.getEnumType(2);

The next path element tells you the field number within the top-level item's descriptor. For example, paths that point into a top-level message definition:
message DescriptorProto {
optional string name = 1;
repeated FieldDescriptorProto field = 2;
repeated FieldDescriptorProto extension = 6;
repeated DescriptorProto nested_type = 3;
repeated EnumDescriptorProto enum_type = 4;
...
}

A path [4,0,2,1,...] corresponds to: fileDescriptor.getMessageType(0).getField(1);
A path [4,0,3,1,2,3...] corresponds to: fileDescriptor.getMessageType(0).getNestedType(1).getField(3);

Just follow the proto field numbers.
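
As a rough illustration of "follow the field numbers", the sketch below turns a path into a readable accessor string using only the two descriptor.proto excerpts quoted above. The `FIELDS` table and `describe_path` are hypothetical names for this example. The table is deliberately partial, so anything outside those excerpts is left as raw numbers.

```python
# field number -> (field name, message type of the element or None, repeated?)
FIELDS = {
    "FileDescriptorProto": {
        1: ("name", None, False),
        2: ("package", None, False),
        3: ("dependency", None, True),
        4: ("message_type", "DescriptorProto", True),
        5: ("enum_type", "EnumDescriptorProto", True),
        6: ("service", "ServiceDescriptorProto", True),
        7: ("extension", "FieldDescriptorProto", True),
    },
    "DescriptorProto": {
        1: ("name", None, False),
        2: ("field", "FieldDescriptorProto", True),
        3: ("nested_type", "DescriptorProto", True),
        4: ("enum_type", "EnumDescriptorProto", True),
        6: ("extension", "FieldDescriptorProto", True),
    },
}


def describe_path(path):
    """Render a SourceCodeInfo path as e.g. 'message_type[0].field[1]'."""
    current = "FileDescriptorProto"
    parts = []
    steps = iter(path)
    for number in steps:
        entry = FIELDS.get(current, {}).get(number)
        if entry is None:
            # Outside the excerpted tables: keep the raw numbers and stop.
            parts.append("?" + str([number, *steps]))
            break
        name, element_type, repeated = entry
        if repeated:
            index = next(steps, None)  # the number after a repeated field is an index
            parts.append(name if index is None else f"{name}[{index}]")
        else:
            parts.append(name)
        current = element_type
    return ".".join(parts)


first = describe_path([4, 0, 2, 1])
second = describe_path([4, 0, 3, 1, 2, 3])
```

Here `first` is "message_type[0].field[1]" and `second` is "message_type[0].nested_type[1].field[3]", mirroring the two Java getter chains above.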


Kyle Papili

Sep 9, 2022, 5:11:33 PM
to Protocol Buffers
This was a great description. I appreciate you taking the time to write that out. I hadn't been able to find something as clear as this in the documentation. Thank you!

Kyle Papili

Sep 11, 2022, 3:40:25 PM
to Protocol Buffers
I'm not seeing these methods supported in the Python API. Any idea if this is just unsupported?

Jerry Berg

Sep 11, 2022, 11:31:25 PM
to Kyle Papili, Protocol Buffers
I am mostly familiar with the Java API.
How are you currently getting the SourceCodeInfo? If you have access to it, you should be able to access the FileDescriptor.

Kyle Papili

Sep 11, 2022, 11:35:53 PM
to Protocol Buffers
Yes, I had the FileDescriptor with no problem, but those functions don't exist in Python. I figured it out though: I can access the elements using .ListFields(), .nested_type, .message_type, .enum_type, .service, .extension, and .options.

A few questions I still had:
// Test Comment 1
message mainMessage {
// Test Comment 2
enum internalEnum {
     SomeField = 1; // Test Comment 3
     SomeOtherField = 2;
}
}

The location for Test Comment 2 is [4, 0, 4, 0]. Shouldn't it be [4, 0, 5, 0]??

Also, how exactly do you set and retrieve custom option extensions via the Python API? So far I have tried:
setattr(pointer.Extensions, "my_custom_option", comments)
but that is not correct. 

Any ideas here?

Jerry Berg

Sep 11, 2022, 11:41:34 PM
to Kyle Papili, Protocol Buffers
[4, 0, 4, 0] is the correct path to Test Comment 2.

4 = FileDescriptorProto.message_type
0 = FileDescriptorProto.message_type[0]
4 = DescriptorProto.enum_type
0 = DescriptorProto.enum_type[0]

I have never done custom options using Python. I've only used them in Java and C++.
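
For the Python side, the same walk can be driven by the descriptor messages themselves, since each message's DESCRIPTOR already maps field numbers to field names. The sketch below assumes the `protobuf` package is installed. `resolve_path` is a hypothetical helper, and the FileDescriptorProto is built by hand to mirror the mainMessage/internalEnum example rather than produced by protoc.

```python
from google.protobuf import descriptor_pb2
from google.protobuf.descriptor import FieldDescriptor


def resolve_path(file_proto, path):
    """Follow a SourceCodeInfo path from a FileDescriptorProto to its target."""
    node = file_proto
    steps = iter(path)
    for field_number in steps:
        field = node.DESCRIPTOR.fields_by_number[field_number]
        value = getattr(node, field.name)
        if field.label == FieldDescriptor.LABEL_REPEATED:
            index = next(steps, None)
            if index is None:
                return value  # the path names the repeated field as a whole
            value = value[index]
        node = value
    return node


# Hand-built equivalent of the example file (an assumption for illustration).
fd = descriptor_pb2.FileDescriptorProto(name="example.proto")
main = fd.message_type.add(name="mainMessage")
main.enum_type.add(name="internalEnum")

target = resolve_path(fd, [4, 0, 4, 0])
```

Here `target` is the same object as `fd.message_type[0].enum_type[0]`, which is the plain attribute form of the Java getters shown earlier. The sketch does no error handling: an unknown field number raises KeyError, and an out-of-range index raises IndexError.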


Kyle Papili

Sep 11, 2022, 11:42:24 PM
to Protocol Buffers
How would you do custom types in Java and / or C++?

Also does that mean 4 can represent both .message_type and .enum_type? 

Thank you again!

Jerry Berg

Sep 11, 2022, 11:53:08 PM
to Kyle Papili, Protocol Buffers
4 doesn't represent a type. It is a field number. So it means a different thing depending on the context.
In the context of a FileDescriptorProto, it refers to the message_type field which is declared with field number 4.
// Describes a complete .proto file.
message FileDescriptorProto {
optional string name = 1; // file name, relative to root of source tree
optional string package = 2; // e.g. "foo", "foo.bar", etc.
// Names of files imported by this file.
repeated string dependency = 3;
// Indexes of the public imported files in the dependency list above.
repeated int32 public_dependency = 10;
// Indexes of the weak imported files in the dependency list.
// For Google-internal migration only. Do not use.
repeated int32 weak_dependency = 11;
// All top-level definitions in this file.
repeated DescriptorProto message_type = 4;
...
}

In the context of a DescriptorProto, 4 refers to the enum_type field which is declared with field number 4.
// Describes a message type.
message DescriptorProto {
optional string name = 1;
repeated FieldDescriptorProto field = 2;
...
repeated EnumDescriptorProto enum_type = 4;
...
}

The path [4, 0, 4, 0] starts at the FileDescriptorProto and says to access field #4, which is message_type, then access the 0th element of that list, which gives you a DescriptorProto.
The next 4, 0 starting from that DescriptorProto means: access field #4 which is enum_type, then access the 0th element of that list.

In Java you get the options object for a descriptor, then use the extensions API to set/get them. I would highly recommend reading the generated code and experimenting with it in an IDE that gives you contextual prompts, like IntelliJ or Eclipse for Java, or whatever the equivalent is for Python.
