How do I extract text from a given PDF layer using PDFNet SDK?


Support

Jan 25, 2013, 7:21:01 PM
to pdfne...@googlegroups.com
Q: 
 
Our scenario is this:

· Input file is a layered PDF (normally one page, but could be more)
· We need to check that a particular layer has live (not outlined) text on it
· We know the layer name we are looking for will contain the word ‘artwork’
· Therefore, we want to attempt to extract text only on this particular layer (if it is found)
· If the extracted text is empty, we will fail the process, otherwise we continue

Is there a recommended approach to this? My developers have been struggling a little with this, as there doesn’t appear to be a way to extract text from only one layer.
 
---------------
A:
 

Yes, this is somewhat tricky. One approach that comes to mind is to copy the required layer's content onto a temp page and then use ‘pdftron.PDF.TextExtractor’ to get the text from that page.

 

To extract the layer you can use the approach shown in the ElementEdit sample: http://www.pdftron.com/pdfnet/samplecode.html#ElementEdit

 

To copy elements you would initialize ElementReader with an OCG Context, similar to the way PDFDraw is set up in the PDFLayers sample (http://www.pdftron.com/pdfnet/samplecode.html#PDFLayers):

 

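// Start from the document's default layer configuration.
Config init_cfg = doc.GetOCGConfig();
Context ctx = new Context(init_cfg);

// Turn every layer off, then turn on only the target layer 'ocg'.
ctx.ResetStates(false);
ctx.SetState(ocg, true);

// Pass the context so that IsOCVisible() is evaluated against it.
reader.Begin(page, ctx);

Element element;
while ((element = reader.Next()) != null) {
    // Copy only the elements that are visible on the selected layer.
    if (element.IsOCVisible()) {
        writer.WriteElement(element);
    }
}
reader.End();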
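
The snippet above leaves two pieces open: where 'ocg' comes from, and how the text is read back from the temp page. A minimal sketch of both follows, based on the calls used in the PDFLayers and TextExtract samples. The class and method names (LayerTextCheck, FindLayer, ExtractText) are made up for illustration, the case-insensitive match on the layer name is an assumption about how you want to find ‘artwork’, and the code has not been run against your files:

```csharp
using System;
using pdftron.PDF;
using pdftron.PDF.OCG;
using pdftron.SDF;

static class LayerTextCheck   // hypothetical helper class, not part of PDFNet
{
    // Returns the first layer whose name contains 'word', or null if none does.
    public static Group FindLayer(PDFDoc doc, string word)
    {
        if (!doc.HasOC()) return null;          // document has no layers at all
        Obj ocgs = doc.GetOCGs();               // array of OCG dictionaries
        if (ocgs == null) return null;
        for (int i = 0; i < ocgs.Size(); ++i)
        {
            Group g = new Group(ocgs.GetAt(i));
            // Assumption: the match on the layer name is case-insensitive.
            if (g.GetName().IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
                return g;
        }
        return null;
    }

    // Extracts text from a page that already holds only the copied layer content.
    public static string ExtractText(Page tempPage)
    {
        using (TextExtractor txt = new TextExtractor())
        {
            txt.Begin(tempPage);
            return txt.GetAsText();
        }
    }
}
```

You would pass the result of FindLayer(doc, "artwork") as 'ocg' in ctx.SetState, and fail the process if FindLayer returns null or if ExtractText returns an empty or whitespace-only string. Outlined text is stored as path data rather than text objects, so it should not show up in the extracted text.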