How to extract StructTreeRoot?

Ashish Sethi

Jan 13, 2021, 1:30:52 AM

to PDF::Reader

Hi,

I am unable to get to root[:StructTreeRoot].

I can see it when I inspect the reader object:

require 'pdf/reader'
reader = PDF::Reader.new(ARGV[0])

.

@page_count=4,
@root=
 {:AcroForm=>#<PDF::Reader::Reference:0x000000002dfce0e0 @gen=0, @id=965>,
  :Lang=>"\xFE\xFF\x00E\x00N\x00-\x00U\x00S",
  :MarkInfo=>{:Marked=>true},
  :Metadata=>#<PDF::Reader::Reference:0x000000002dfccce0 @gen=0, @id=140>,
  :Outlines=>#<PDF::Reader::Reference:0x000000002dfcc9e8 @gen=0, @id=185>,
  :Pages=>#<PDF::Reader::Reference:0x000000002dfcc768 @gen=0, @id=929>,
  :StructTreeRoot=>#<PDF::Reader::Reference:0x000000002dfd7b18 @gen=0, @id=209>,
  :Type=>:Catalog,
  :ViewerPreferences=>{:DisplayDocTitle=>true}}>

Please advise.

I would really appreciate it.

Thanks,

Ashish

Wayne Brissette

Jan 13, 2021, 12:13:37 PM

to pdf-r...@googlegroups.com

You obviously have a tagged PDF, since it has that element. I haven't been able to access it either, so maybe there's some trick to it. I'm curious what you're hoping to do: access the titles in some hierarchy? (I've often wondered about that myself, but most of the time I simply export to HTML and use Nokogiri to parse things.)

-Wayne

Ashish Sethi

Jan 13, 2021, 2:13:30 PM

to PDF::Reader

Hi,

I found a way:

reader.trailer[:Root]

This gives you an indirect reference to the catalog object, and from there you can dig further into the individual tags.
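To make the "dig further" step concrete, here is a minimal sketch of resolving the catalog and its :StructTreeRoot entry. It assumes the object table exposed by PDF::Reader#objects (a PDF::Reader::ObjectHash, which has #trailer and #deref); the helper name struct_tree_root is made up for illustration and is not part of the gem.

```ruby
# Sketch: follow trailer -> /Root (catalog) -> /StructTreeRoot.
# `objects` is anything that responds to #trailer and #deref, for example
# the PDF::Reader::ObjectHash returned by PDF::Reader#objects.
# struct_tree_root is a hypothetical helper name, not a pdf-reader method.
def struct_tree_root(objects)
  # The trailer's :Root entry is an indirect reference to the catalog.
  catalog = objects.deref(objects.trailer[:Root])
  return nil unless catalog.is_a?(Hash)

  # Untagged PDFs have no :StructTreeRoot entry at all.
  ref = catalog[:StructTreeRoot]
  ref && objects.deref(ref)
end

# Only runs when invoked as a script with a PDF path argument.
if $PROGRAM_NAME == __FILE__ && ARGV[0]
  require 'pdf/reader'

  reader = PDF::Reader.new(ARGV[0])
  root = struct_tree_root(reader.objects)
  puts root ? root.keys.inspect : 'no StructTreeRoot (untagged PDF?)'
end
```

The dereferenced root should be a plain Hash; its :K entry holds the top-level structure elements.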

I am trying to get the reading order of a PDF/UA document from its structure tree, check that it is not messed up, and compare it with the NVDA Speech Viewer output.

Now, if there is an easier way to extract the reading order, that would be great.
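One way to approximate the reading order is a depth-first walk of the structure tree, following each element's :K (kids) entry. The sketch below lists each structure element's type (:S) with the marked-content IDs (MCIDs) that are its direct children. It assumes the tree follows the usual tagged-PDF layout and that `objects` behaves like PDF::Reader#objects; the method names each_struct_elem and direct_mcids are invented for illustration. Mapping MCIDs back to actual text would still require parsing each page's content stream, which is not covered here.

```ruby
# Collect the MCIDs that are direct children of a structure element.
# A /K entry may be an integer MCID, a marked-content reference dictionary
# (/Type /MCR), another structure element, or an array mixing these.
def direct_mcids(objects, kids)
  kids = objects.deref(kids)
  list = kids.is_a?(Array) ? kids : [kids]
  list.map { |kid| objects.deref(kid) }.flat_map do |kid|
    case kid
    when Integer then [kid]
    when Hash then kid[:Type] == :MCR ? [objects.deref(kid[:MCID])] : []
    else []
    end
  end
end

# Depth-first walk in /K order, yielding (depth, structure_type, mcids).
# Only /K is followed (never the /P parent link), so a well-formed tree
# cannot send the walk in circles.
def each_struct_elem(objects, node, depth = 0, &block)
  node = objects.deref(node)
  case node
  when Array
    node.each { |kid| each_struct_elem(objects, kid, depth, &block) }
  when Hash
    if node[:S] # structure elements carry their type in /S
      block.call(depth, node[:S], direct_mcids(objects, node[:K]))
      depth += 1
    end
    each_struct_elem(objects, node[:K], depth, &block) if node.key?(:K)
  end
end

# Only runs when invoked as a script with a PDF path argument.
if $PROGRAM_NAME == __FILE__ && ARGV[0]
  require 'pdf/reader'

  objects = PDF::Reader.new(ARGV[0]).objects
  catalog = objects.deref(objects.trailer[:Root])
  each_struct_elem(objects, catalog[:StructTreeRoot]) do |depth, type, mcids|
    puts "#{'  ' * depth}#{type} #{mcids.inspect}"
  end
end
```

The indented listing of tag types can then be compared by eye against what the screen reader announces.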