Extra text is getting selected while doing annotation


raman ratnakar

Aug 24, 2022, 3:02:49 AM
to inception-users
Hi Team,

While annotating, extra sections or paragraphs are getting selected.
Please find the attached screenshot.
We are trying to annotate paragraph 4.3, but other paragraphs (1, 2, and 3) are also getting selected. The entire selected text is highlighted in green.

Thanks, 
Raman 
extra text is getting selected while doing annotation_issue.PNG

Richard Eckart de Castilho

Aug 24, 2022, 3:19:49 AM
to inception-users
Hi,
It depends on the PDF. We use the PDFBox library to extract the text information from the PDF. It has a configuration option to "sort characters by position", which can be on or off.

INCEpTION currently uses the "off" option. This means that we read the text in the order in which it is stored in the PDF data stream. This is generally a good idea, in particular for computer-generated PDFs, because the text is usually stored in reading order. E.g. if you have a two column text, the tool that generates the PDF will usually first write the left column, then the right, then the footer, then the header of the next page, and so on.
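
To illustrate why the stored order matters for selection, here is a minimal sketch in Python. It assumes that a selection always covers a contiguous range of the extracted text, and the chunk labels and their stored order are made up for the example; it is not INCEpTION or PDFBox code.

```python
# Hypothetical text chunks in the order they are stored in the PDF data stream.
# Assumption: the stored order differs from what the reader sees on the page,
# with unrelated paragraphs stored between two lines of the target paragraph.
stream_order = [
    "target paragraph, line 1",
    "unrelated paragraph A",
    "unrelated paragraph B",
    "target paragraph, line 2",
]


def select(chunks, start, end):
    """Return every chunk between start and end in extraction order."""
    i, j = sorted((chunks.index(start), chunks.index(end)))
    return chunks[i : j + 1]


# Selecting a single line stays clean.
one_line = select(stream_order, "target paragraph, line 1", "target paragraph, line 1")

# Extending the selection to the second line pulls in everything stored in between.
two_lines = select(stream_order, "target paragraph, line 1", "target paragraph, line 2")

print(one_line)
print(two_lines)
```

If a PDF stores its text like this, a selection that looks like two adjacent lines on screen would also include the chunks stored between them.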

The "on" option can help for PDFs in which the above does not work well. However, "sort characters by position" uses a fairly basic algorithm. For example, it does not detect multi-column text. So if you start selecting in the left column, the selection would carry over to the parallel row in the right column before it goes to the next row in the left column.
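
The two-column effect can be sketched roughly as follows. This is a simplified stand-in for position-based sorting (top to bottom, then left to right), not PDFBox's actual algorithm; the `Chunk` type and the coordinates are assumptions made up for the illustration. In PDFBox itself, the option is set via `PDFTextStripper.setSortByPosition`.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    x: float  # left edge of the chunk
    y: float  # distance from the top of the page


# Stored in reading order: the whole left column first, then the right column.
stream_order = [
    Chunk("L1", 50, 100),
    Chunk("L2", 50, 120),
    Chunk("R1", 300, 100),
    Chunk("R2", 300, 120),
]


def sort_by_position(chunks):
    """Naive ordering by row, then by column; no column detection."""
    return sorted(chunks, key=lambda c: (c.y, c.x))


reading = [c.text for c in stream_order]
positional = [c.text for c in sort_by_position(stream_order)]

print(reading)     # ['L1', 'L2', 'R1', 'R2']
print(positional)  # ['L1', 'R1', 'L2', 'R2']
```

With sorting on, a selection from L1 to L2 would pass through R1, which is the carry-over into the right column described above.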

It is a little hard to see in your screenshot what exactly is going on. You might try starting your selection at the end of paragraph 4.3 and slowly move the mouse towards the beginning of the paragraph to see exactly when the undesired text starts getting selected. My guess is it might happen when you reach the "4.3" number. Try it out please.

-- Richard

raman ratnakar

Aug 24, 2022, 3:52:58 AM
to inception-users
Hi,

We tried what you suggested.
We started selecting from the last word of paragraph 4.3 and moved the cursor upward. When we reached the word "and" in the first line of paragraph 4.3, paragraphs 1, 2, and 3 were selected automatically, as highlighted in the attached screenshot.

Please have a look.

Thanks,
Raman
extra text is getting selected while doing annotation_issue_1.PNG

Richard Eckart de Castilho

Aug 24, 2022, 4:08:00 AM
to inception-users
On 24. Aug 2022, at 09:52, raman ratnakar <ramanra...@gmail.com> wrote:
>
> We tried what you suggested.
> We started selecting from the last word of paragraph 4.3 and moved the cursor upward. When we reached the word "and" in the first line of paragraph 4.3, paragraphs 1, 2, and 3 were selected automatically, as highlighted in the attached screenshot.

Can you select only the first line of 4.3 independently?

-- Richard

raman ratnakar

Aug 24, 2022, 5:05:35 AM
to inception-users
Hi,

When we select only the first line of paragraph 4.3, there is no issue. As soon as we move the cursor to the second line, the other paragraphs (1, 2, and 3) are selected as well.
Please have a look at the attached screenshots, as requested.

Thanks,
Raman
extra text is getting selected while doing annotation_issue_2.PNG
extra text is getting selected while doing annotation_issue_3.PNG

Richard Eckart de Castilho

Aug 24, 2022, 5:11:29 AM
to inception-users
Hi,

> On 24. Aug 2022, at 11:05, raman ratnakar <ramanra...@gmail.com> wrote:
>
> When we select only the first line of paragraph 4.3, there is no issue. As soon as we move the cursor to the second line, the other paragraphs (1, 2, and 3) are selected as well.
> Please have a look at the attached screenshots, as requested.

Then, for now, you could use the "Fragment" annotation approach that I suggested in the other mail:

> What you can do is create a new span layer e.g. "Fragment" and add a link feature to that layer to your main annotation layer.
> Then you create an annotation of your main layer on say page 1, add a "Fragment" annotation on page two and then link the fragment into the annotation you created first.

There are various potential directions I can think of to develop this further in a more user-friendly way in future versions, but currently that is the only option I can think of that INCEpTION supports for this case.
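
Conceptually, the fragment approach could be pictured like the sketch below. It is only a data-structure illustration in Python: the `Span` class, the `links` field, and the layer name "Clause" are hypothetical and not INCEpTION's API or export format. It assumes the main annotation carries the link feature and the fragments are its targets.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    layer: str
    begin: int
    end: int
    # Hypothetical link feature: fragments that belong to this annotation.
    links: list[Span] = field(default_factory=list)


def covered_text(text: str, main: Span) -> str:
    """Join the main span and its linked fragments in document order."""
    parts = sorted([main, *main.links], key=lambda s: s.begin)
    return " ".join(text[s.begin : s.end] for s in parts)


text = "First half of the clause. Unrelated text in between. Second half of the clause."
first = "First half of the clause."
second = "Second half of the clause."

# Main annotation on the first part, a Fragment annotation on the second part.
main = Span("Clause", text.index(first), text.index(first) + len(first))
fragment = Span("Fragment", text.index(second), text.index(second) + len(second))

# Link the fragment into the annotation created first.
main.links.append(fragment)

result = covered_text(text, main)
print(result)
```

The point of the design is that neither span has to cover the unwanted text in between; the link records that the two pieces belong together.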

The other option may be to re-OCR the document externally using an OCR tool with better column detection; INCEpTION should then be able to handle it better.

-- Richard

