Extracting Alt Tags using marked_content flags.


Zonker Harris

Aug 25, 2014, 3:09:49 AM
to pdfne...@googlegroups.com
Hey all,

I have been trying to get this example to work.


    DictIterator itr = mc_prop.GetDictIterator();
    while (itr.HasNext()) {
      Obj key = itr.Key();
      // Console.WriteLine("{0}", key.GetName());  // Key
      Obj value = itr.Value();
      // ...
      itr.Next();
    }


The itr.Value() call works, but I cannot for the life of me figure out how to extract the values it returns.

I have tried a bunch of different approaches along the lines of

      Console.WriteLine("{0}", value.GetACCESSOR());  // value

and nothing seems to work.

Could you post a short chunk of example code to show me what I am missing?

Thanks in advance for your help

zonker harris
Vitalsource (Ingram)
 

Aaron

Aug 25, 2014, 2:39:57 PM
to pdfne...@googlegroups.com
Hello Zonker,

The SDF.Obj API (http://www.pdftron.com/pdfnet/PDFNet/html/Methods_T_pdftron_SDF_Obj.htm) follows the composite pattern, so from an Obj you can call GetAsPDFText (http://www.pdftron.com/pdfnet/PDFNet/html/M_pdftron_SDF_Obj_GetAsPDFText.htm) to get a printable string for Name and String objects.  You can also use Obj.IsNumber() / Obj.GetNumber() to obtain doubles from PDF numbers.
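
A minimal sketch of that idea in Ruby (the language used later in this thread): ask the object what type it is first, then call the matching accessor. It assumes the Ruby binding exposes the same Obj methods as the .NET API (IsNumber/GetNumber, IsName/GetName, IsString/GetAsPDFText, IsArray/Size/GetAt). FakeObj and describe_obj are hypothetical names introduced here only so the snippet runs without PDFNet installed; they are not part of the library.

```ruby
# Hypothetical stand-in for an SDF Obj so the sketch runs without PDFNet.
# It mimics only the type tests and accessors used below.
class FakeObj
  def initialize(value)
    @value = value
  end

  def IsNumber; @value.is_a?(Numeric); end
  def IsName;   @value.is_a?(Symbol);  end
  def IsString; @value.is_a?(String);  end
  def IsArray;  @value.is_a?(Array);   end

  def GetNumber;    @value.to_f; end
  def GetName;      @value.to_s; end
  def GetAsPDFText; @value;      end
  def Size;         @value.size; end
  def GetAt(i);     @value[i];   end
end

# Check the type before calling an accessor, instead of calling
# every accessor and rescuing the ones that fail.
def describe_obj(obj)
  if obj.IsNumber
    obj.GetNumber.to_s
  elsif obj.IsName
    "/" + obj.GetName
  elsif obj.IsString
    obj.GetAsPDFText
  elsif obj.IsArray
    items = (0...obj.Size).map { |i| describe_obj(obj.GetAt(i)) }
    "[" + items.join(" ") + "]"
  else
    "(unhandled type)"
  end
end

puts describe_obj(FakeObj.new(53))       # prints 53.0
puts describe_obj(FakeObj.new(:Figure))  # prints /Figure
```

With the real library, the same helper would presumably be called as describe_obj(itr.Value) inside the dictionary loop; dictionaries and booleans would need their own branches.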





Support

Aug 27, 2014, 6:30:35 PM
to pdfne...@googlegroups.com

For deeper coverage of the SDF API, see https://www.pdftron.com/pdfnet/intro.html

Zonker Harris

Aug 29, 2014, 2:28:59 PM
to pdfne...@googlegroups.com
Hey Aaron.

I am still not getting them.

Here is the output around a Figure/Caption block

CURRENT TAG: Span
Traversing the marked content properties dictionary
Key: MCID
Text: 5.0

CURRENT TAG: Figure
Traversing the marked content properties dictionary
Key: BBox
Key: MCID
Text: 53.0
Key: Type

CURRENT TAG: Caption
Traversing the marked content properties dictionary
Key: MCID
Text: 44.0



And here is the code that generates that

    itr = mcProp.GetDictIterator
    puts "Traversing the marked content properties dictionary"
    while itr.HasNext do
        key = itr.Key
        puts "Key: " + key.GetName.to_s
        value = itr.Value

        ##this is a really dumb way to do it
        ##but if i can find the alt tag this way, can figure out a better way to
        ##extract them

        begin
            eval value.GetNumber.to_s
        rescue StandardError => boom
        else
            puts "Text: " + value.GetNumber.to_s
        end
        begin
            eval value.GetAsPDFText.to_s
        rescue StandardError => boom
        else
            puts "Text: " + value.GetAsPDFText.to_s
        end
        begin
            eval value.IsArray
        rescue StandardError => boom
        else
            puts "Warn: Is array"
        end

        itr.Next
    end
    elsif element.GetType == Element::E_marked_content_end
        puts "MC End"
    end
    puts "\n"
end
    element = reader.Next
    end


It is seeing everything except the Alt Tags.


I have checked the PDF source. The tags are there as prescribed by Adobe, so I have no idea what I am missing.


Thanks for your help in advance.



zonker


Vitalsource (Ingram)

Support

Sep 8, 2014, 7:49:48 PM
to pdfne...@googlegroups.com


It looks like you are able to extract the MCID (Marked Content Identifier), so the remaining question is how to get the relevant ‘Structure Element’. This is shown in the LogicalStructure sample project:

 

  https://www.pdftron.com/pdfnet/samplecode/LogicalStructureTest.cs.html

  https://www.pdftron.com/pdfnet/samplecode.html#LogicalStructure
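
As a rough illustration of the idea (not the sample's actual code), the Ruby sketch below walks a structure tree and records the Alt text that applies to each MCID. The method names (HasAlt, GetAlt, GetNumKids, IsContentItem, GetAsContentItem, GetAsStructElem, GetMCID) are modeled on PDFNet's SElement and ContentItem classes and should be checked against your PDFNet version. FakeSElement, FakeContentItem, collect_alt_by_mcid and the sample Alt text are hypothetical, introduced only so the snippet runs without the library.

```ruby
# Hypothetical stand-ins so the sketch runs without PDFNet.
class FakeContentItem
  def initialize(mcid)
    @mcid = mcid
  end

  def GetMCID; @mcid; end
end

class FakeSElement
  def initialize(type, alt, kids)
    @type = type
    @alt = alt
    @kids = kids
  end

  def GetType;    @type;       end
  def HasAlt;     !@alt.nil?;  end
  def GetAlt;     @alt;        end
  def GetNumKids; @kids.size;  end
  def IsContentItem(i);    @kids[i].is_a?(FakeContentItem); end
  def GetAsContentItem(i); @kids[i]; end
  def GetAsStructElem(i);  @kids[i]; end
end

# Walk the tree depth-first. For every marked-content item, record the
# Alt text of the nearest enclosing structure element that has one.
def collect_alt_by_mcid(element, inherited_alt = nil, result = {})
  alt = element.HasAlt ? element.GetAlt : inherited_alt
  (0...element.GetNumKids).each do |i|
    if element.IsContentItem(i)
      mcid = element.GetAsContentItem(i).GetMCID
      result[mcid] = alt unless alt.nil?
    else
      collect_alt_by_mcid(element.GetAsStructElem(i), alt, result)
    end
  end
  result
end

# Made-up tree mirroring the Figure/Caption block from the output above.
figure  = FakeSElement.new("Figure", "A red square", [FakeContentItem.new(53)])
caption = FakeSElement.new("Caption", nil, [FakeContentItem.new(44)])
root    = FakeSElement.new("Document", nil, [figure, caption])

alts = collect_alt_by_mcid(root)
puts alts[53]        # prints A red square
puts alts.key?(44)   # prints false
```

In a real document an MCID is unique only within one content stream, so the lookup key would likely need to include the page as well. Real content items can also be object references, which carry no MCID, so the item type should be checked before calling GetMCID.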



For more info about marked content, see Sections 14.6 and 14.7 (Marked Content and Logical Structure) of the PDF Reference.