Not getting the expected text from some annotations when extracting text.

Ryan

May 17, 2018, 5:57:37 PM
to PDFTron PDFNet SDK
Question:

I am extracting text from FreeText annotations, and some of them return a different value than what I see on screen.

How do I get the text that I see on screen?

Answer:

Unfortunately, the PDF specification gives FreeText annotations two entries that can hold the text. There is also a third location, the optional appearance stream, which, if present, is what you actually see on screen.

There is a Contents entry, which holds the plain text, and an RC entry (rich text) that supports a subset of XHTML. Ideally the two are kept synchronized, but this is not enforced or guaranteed. Furthermore, the appearance stream (AP) could show a third value; it should reflect either Contents or RC, but again this is not enforced or guaranteed.

You can do the following to get the RC entry, if present.
SDF.Obj rc_obj = annot.GetSDFObj().FindObj("RC");
if (rc_obj != null && rc_obj.IsString())
{
    // Decode the PDF string into a regular string.
    string rc_str = rc_obj.GetAsPDFText();
    // Strip out all markup to get the raw text. See this post: https://stackoverflow.com/a/5870471/3761687
    // Now you can compare rc_str to the Contents string and pick one, or always pick RC if present.
}
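
The "strip out all markup" step could look something like the sketch below. RichTextHelper and StripMarkup are hypothetical names and are not part of PDFNet. The sketch assumes the RC value is usually well-formed XHTML, so it tries an XML parse first and falls back to a simple tag-removing regex when the markup does not parse. It does not try to preserve paragraph breaks.

```csharp
using System.Net;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not a PDFNet API.
public static class RichTextHelper
{
    // Returns the text of a rich text (RC) string with all markup removed.
    public static string StripMarkup(string rc)
    {
        if (string.IsNullOrEmpty(rc))
        {
            return string.Empty;
        }

        try
        {
            // Assumption: RC is well-formed XHTML, so a real XML parse works.
            // Root.Value concatenates every text node under the root element.
            return XDocument.Parse(rc).Root.Value.Trim();
        }
        catch (XmlException)
        {
            // Fallback for malformed markup: drop anything that looks like a tag,
            // then decode entities such as &amp; or &nbsp;.
            string noTags = Regex.Replace(rc, "<[^>]+>", string.Empty);
            return WebUtility.HtmlDecode(noTags).Trim();
        }
    }
}
```

With this in place, the snippet above could call RichTextHelper.StripMarkup(rc_str) and compare the result with the Contents string. Keep in mind that neither value is guaranteed to match the appearance stream.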


