Get plain text with marked hyperlink locations and page titles of outlinks

Nitish Gupta

Oct 24, 2016, 2:22:59 PM

to jwpl-users

page.getText() returns the markup text, which wraps the anchor text of each outlink in [[anchor text]] but does not contain the hyperlink itself.

getPlainText() loses this information as well.

Is there a way to get the plain text while also knowing the location of each hyperlinked span and the link it refers to?

E.g., one of the sentences in the Wikipedia page for Germany:

' Germany was a founding member of the European Union in 1993. '

Is there some functionality, or some code that could be written, to get output like:

' Germany was a founding member of the [[European Union]]{https://en.wikipedia.org/wiki/European_Union} in 1993. '

Or something that gives the same information for the whole page?

Thanks in advance.

Torsten Zesch

Oct 24, 2016, 2:33:29 PM

to jw...@googlegroups.com

We internally use the Sweble parser

http://sweble.org/projects/swc/swc-example-basic/

to extract the plain text.

If you retrieve the text with markup and use your own parsing with Sweble (you may use the PlainTextConverter in JWPL as a starting point), it should be easy to implement what you need.

-Torsten

--
You received this message because you are subscribed to the Google Groups "jwpl-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jwpl+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nitish Gupta

Oct 24, 2016, 2:50:51 PM

to jwpl-users

One issue I have with pursuing this: the output of page.getText() has some markup, but for internal links it only contains [[ ]] around the hyperlinked text and not the links themselves. So there is no way for a parser that takes this input to recover the hyperlinks. Where can I get an output of the page that contains the hyperlinks as well, and which parser (possibly with a particular configuration) can I use to implement the output I want?

-- Nitish

Torsten Zesch

Oct 24, 2016, 2:57:10 PM

to jw...@googlegroups.com

Internal Wikipedia links use that syntax.

The URL is always the base URL of the Wikipedia you are looking at plus whatever is between [[ ]].
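
A minimal sketch of that rule in Java, in case it helps. The class name WikiLinkUrls, the method toUrl, and the English-Wikipedia base URL are assumptions for illustration, not part of JWPL. The sketch also assumes the usual title normalization (spaces become underscores, first letter upper-cased) and does no percent-encoding.

```java
class WikiLinkUrls {
    // Assumed base URL (English Wikipedia); use the base of whichever
    // Wikipedia edition the text was taken from.
    static final String BASE_URL = "https://en.wikipedia.org/wiki/";

    // Builds a page URL from the target found between [[ and ]].
    static String toUrl(String linkTarget) {
        // Page titles in URLs use underscores instead of spaces.
        String title = linkTarget.trim().replace(' ', '_');
        if (title.isEmpty()) {
            return BASE_URL;
        }
        // Assumption: the first letter of a title is case-insensitive,
        // so [[european Union]] and [[European Union]] name the same page.
        title = Character.toUpperCase(title.charAt(0)) + title.substring(1);
        return BASE_URL + title;
    }

    public static void main(String[] args) {
        // prints https://en.wikipedia.org/wiki/European_Union
        System.out.println(toUrl("European Union"));
    }
}
```

Titles with unusual characters would additionally need URL encoding, which this sketch leaves out.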

-T

Nitish Gupta

Oct 24, 2016, 3:32:01 PM

to jwpl-users

Okay. After squinting at the output of page.getText(), I found that some links look like [[Los Angeles]], where the surface form is the same as the linked wiki page title, and some look like [[Load (album)|Load]], where the part before the | is the wiki page title and the part after it is the surface form. I hope this is consistent and correct.
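
The two link forms can be told apart with a small regular expression. This is only a rough sketch: the class WikiLinkParser and its parse method are hypothetical names, and the pattern deliberately ignores nested links (as in image captions) and does not filter out File: or Category: links, which use the same brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiLinkParser {
    // Matches [[Title]] and [[Title|surface form]].
    // Group 1 is the page title; group 2 (optional) is the surface form.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\[\\]|]+)(?:\\|([^\\[\\]]*))?\\]\\]");

    // Returns one {title, surfaceForm} pair per link found in the markup.
    static List<String[]> parse(String markup) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(markup);
        while (m.find()) {
            String title = m.group(1).trim();
            // Without a pipe, the surface form is the title itself.
            String surface = m.group(2) != null ? m.group(2) : title;
            links.add(new String[] {title, surface});
        }
        return links;
    }

    public static void main(String[] args) {
        for (String[] link : parse("[[Los Angeles]] and [[Load (album)|Load]]")) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```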

Now I have to find a way to use the Sweble parser to remove all other information (something that the PlainTextConverter does) but keep this information somehow. Any ideas? Otherwise I will try to find a way myself.
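
One possible shape for the result, sketched without Sweble: replace each link by its surface form and record the character offsets of that surface form in the output together with the page title. Everything here (LinkAwarePlainText, LinkSpan, Result, strip) is a hypothetical name, and the sketch only handles [[...]] links; templates, bold/italic markers, and other markup are left untouched, so a real solution would still do this bookkeeping inside a Sweble-based converter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkAwarePlainText {
    // Offsets [start, end) of a link's surface form in the stripped text.
    static final class LinkSpan {
        final int start;
        final int end;
        final String title;
        LinkSpan(int start, int end, String title) {
            this.start = start;
            this.end = end;
            this.title = title;
        }
    }

    // Stripped text plus the spans that were links in the markup.
    static final class Result {
        final String text;
        final List<LinkSpan> spans;
        Result(String text, List<LinkSpan> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\[\\]|]+)(?:\\|([^\\[\\]]*))?\\]\\]");

    static Result strip(String markup) {
        StringBuilder out = new StringBuilder();
        List<LinkSpan> spans = new ArrayList<>();
        Matcher m = LINK.matcher(markup);
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous link and this one unchanged.
            out.append(markup, last, m.start());
            String title = m.group(1).trim();
            String surface = m.group(2) != null ? m.group(2) : m.group(1);
            int start = out.length();
            out.append(surface);
            spans.add(new LinkSpan(start, out.length(), title));
            last = m.end();
        }
        out.append(markup, last, markup.length());
        return new Result(out.toString(), spans);
    }

    public static void main(String[] args) {
        Result r = strip("Germany was a founding member of the [[European Union]] in 1993.");
        System.out.println(r.text);
        for (LinkSpan s : r.spans) {
            System.out.println(r.text.substring(s.start, s.end) + " -> " + s.title);
        }
    }
}
```

Keeping offsets separate from the text means the plain text stays clean, and the URL for each span can still be derived from its title and the base URL of the Wikipedia in question.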