Get Plain Text with marked/location of Hyperlink/Page Title of Outlinks


Nitish Gupta

Oct 24, 2016, 2:22:59 PM
to jwpl-users
Using page.getText(), one gets the markup text, which contains the text of each outlink as [[anchor text]] but does not contain the hyperlink itself.
getPlainText() loses this information as well.

Is there a way to get the plain text while also knowing the location of the hyperlinked text and the link it refers to?

E.g., one of the sentences on the Wikipedia page for Germany:

Germany was a founding member of the European Union in 1993.

Is there some functionality, or some code that can be written, to get information like:
Germany was a founding member of the [[European Union]]{https://en.wikipedia.org/wiki/European_Union} in 1993.

Or something that gives the same information for the whole page?

Thanks in advance.

Torsten Zesch

Oct 24, 2016, 2:33:29 PM
to jw...@googlegroups.com
We internally use the Sweble parser to extract the plain text.

If you retrieve the text with markup and use your own parsing with Sweble (you may use the PlainTextConverter in JWPL as a starting point), it should be easy to implement what you need.

-Torsten

--
You received this message because you are subscribed to the Google Groups "jwpl-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jwpl+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nitish Gupta

Oct 24, 2016, 2:50:51 PM
to jwpl-users
One issue I have with pursuing this is that the output of page.getText() has some markup, but for internal links it only contains [[ ]] around the hyperlinked text and not the links. So there is no way for a parser that takes this input to get the hyperlinks. Where can I get an output of the page that contains the hyperlinks as well, and which parser (possibly with what configuration) can I use to implement the output I want?

-- Nitish

Torsten Zesch

Oct 24, 2016, 2:57:10 PM
to jw...@googlegroups.com
Internal Wikipedia links use that syntax.
It will always be the same base URL as the Wikipedia you are looking at + whatever is between [[ ]]

-T
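
A minimal sketch of the rule described above (base URL plus the link target), added for illustration. The class name `WikiLinkUrl` and the method `toUrl` are hypothetical and not part of JWPL or Sweble, and the base URL is the English Wikipedia one from the example earlier in the thread. It assumes the target is the part before any `|` and that spaces map to underscores; it does not percent-encode special characters or handle namespace and interwiki prefixes.

```java
final class WikiLinkUrl {

    /**
     * Builds a page URL from the content found between [[ and ]].
     * Hypothetical helper: keeps only the target (the part before '|')
     * and replaces spaces with underscores. No percent-encoding is done.
     */
    static String toUrl(String baseUrl, String linkContent) {
        int pipe = linkContent.indexOf('|');
        String target = (pipe >= 0 ? linkContent.substring(0, pipe) : linkContent).trim();
        return baseUrl + target.replace(' ', '_');
    }

    public static void main(String[] args) {
        String base = "https://en.wikipedia.org/wiki/"; // assumed base URL
        System.out.println(toUrl(base, "European Union"));
        // prints https://en.wikipedia.org/wiki/European_Union
        System.out.println(toUrl(base, "Load (album)|Load"));
        // prints https://en.wikipedia.org/wiki/Load_(album)
    }
}
```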


Nitish Gupta

Oct 24, 2016, 3:32:01 PM
to jwpl-users
Okay. After squinting at the output of page.getText(), I found some links to be [[Los Angeles]], where the surface form is the same as the linked wiki title, and some to be [[Load (album)|Load]], where the text before the | is the wiki page title and the text to the right of the | is the surface form. I hope this is consistent and correct.


Now I have to find a way to use the Sweble parser to remove all other information (something that the PlainTextConverter does) but keep this information somehow. Any ideas? I will try to find a way otherwise.


Thanks, 

Nitish
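
One possible way to keep the link information, sketched here with a plain regular expression rather than Sweble: replace each [[target]] or [[target|surface form]] with its surface form and record where that surface form lands in the output. The names `LinkSpanExtractor`, `LinkSpan`, `Result`, and `extract` are hypothetical and not part of JWPL or Sweble. The sketch only handles the two link forms mentioned above; other markup (templates, nested links such as image captions, empty surface forms) is left untouched or not treated specially, so it is a starting point rather than a replacement for a real parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkSpanExtractor {

    /** One link: target page title and [start, end) offsets of its surface form in the output text. */
    static final class LinkSpan {
        final String target;
        final int start;
        final int end;

        LinkSpan(String target, int start, int end) {
            this.target = target;
            this.start = start;
            this.end = end;
        }
    }

    /** Text with link brackets removed, plus the recorded link spans. */
    static final class Result {
        final String text;
        final List<LinkSpan> links;

        Result(String text, List<LinkSpan> links) {
            this.text = text;
            this.links = links;
        }
    }

    // Matches [[target]] or [[target|surface]]; links containing nested brackets are not matched.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\[\\]|]+)(?:\\|([^\\[\\]]*))?\\]\\]");

    static Result extract(String markup) {
        StringBuilder out = new StringBuilder();
        List<LinkSpan> links = new ArrayList<>();
        Matcher m = LINK.matcher(markup);
        int last = 0;
        while (m.find()) {
            out.append(markup, last, m.start());      // copy text before the link unchanged
            String target = m.group(1).trim();
            String surface = m.group(2) != null ? m.group(2) : target;
            int start = out.length();
            out.append(surface);                      // keep only the surface form
            links.add(new LinkSpan(target, start, out.length()));
            last = m.end();
        }
        out.append(markup, last, markup.length());    // copy the remainder
        return new Result(out.toString(), links);
    }

    public static void main(String[] args) {
        Result r = extract("Germany was a founding member of the [[European Union]] in 1993.");
        System.out.println(r.text);
        for (LinkSpan s : r.links) {
            System.out.println(s.target + " @ " + s.start + "-" + s.end);
        }
    }
}
```

The recorded offsets refer to the output string, so `r.text.substring(s.start, s.end)` gives back the surface form, and the target can be turned into a URL by appending it to the wiki's base URL as described earlier in the thread.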
