Getting Plain Text from Wiki Discussions

Kevin Lang

Mar 17, 2017, 8:34:00 AM
to jwpl-users
Hello,

I'm using DKPro JWPL with the current English Wikipedia dump, including talk pages (discussions). I'd like to analyze them and need their plain text for that, but when I call "page.getPlainText()", every line that starts with one or more indentation colons (":") is dropped from the plain text. For example:

(in discussion article id: 34984282)

Wiki Text:
== Update needed? ==

From the lead: 'The film's purpose is to promote the charity's "Stop Kony" movement to make Ugandan cult and militia leader, indicted war criminal and the International Criminal Court fugitive Joseph Kony globally known in order to have him arrested by the end of 2012, the time when the campaign expires.' Well, it's 2013, and Kony remains at large. Would it be a fair assessment to say that, judged by its own standards, the campaign failed? Or would that be original research? [[User:Robofish|Robofish]] ([[User talk:Robofish|talk]]) 17:16, 1 January 2013 (UTC)
:I think that would be original research if we said it ourselves, but I bet that we can find a source somewhere that addresses this for us. The Kony 2012 site itself lists this goal as unachieved, so perhaps that's fair game. -- [[User:Khazar2|Khazar2]] ([[User talk:Khazar2|talk]]) 18:25, 1 January 2013 (UTC)

:Of course, they've been producing a number of documentaries toward this end for years. So it's not surprising. It's an ongoing thing, really. But, yes, would be original research. <font color="silver">[[User:Silver seren|Silver]]</font><font color="blue">[[User talk:Silver seren|seren]]</font><sup>[[Special:Contributions/Silver seren|C]]</sup> 11:05, 2 January 2013 (UTC)
::Hey, could you edit the article to fix the issues raised in the review above? --[[User:Niemti|Niemti]] ([[User talk:Niemti|talk]]) 15:08, 2 January 2013 (UTC)


Plain Text:
Update needed?
From the lead: 'The film's purpose is to promote the charity's "Stop Kony" movement to make Ugandan cult and militia leader, indicted war criminal and the International Criminal Court fugitive Joseph Kony globally known in order to have him arrested by the end of 2012, the time when the campaign expires.' Well, it's 2013, and Kony remains at large. Would it be a fair assessment to say that, judged by its own standards, the campaign failed? Or would that be original research? Robofish (talk) 17:16, 1 January 2013 (UTC)


The parser handles the rest of the wiki markup, but lines that start with ":" are simply dropped, even though they are crucial for discussions. Using "wiki.getDiscussionPage(34984282)" doesn't help either: the plain text stays the same.

Could it be that the lines starting with indents no longer exist in the SQL database because the JWPL DataMachine already discards them while creating the MySQL files?

Is there a way to get the right plain text from the discussion pages?

Johannes Daxenberger

Mar 20, 2017, 2:59:36 AM
to jw...@googlegroups.com
Hi Kevin,

Have you tried getText() instead of getPlainText()? It should return the discussion page including all markup (i.e. including the lines that seem to be missing after the page has been parsed). The database should contain the whole page content (anything else would be a bug); it is the parser's responsibility to deal with the markup and return correct plain text. In this case, the Sweble parser seems to have failed.
You could try parsing the page with the (deprecated) JWPL MediaWiki parser (package de.tudarmstadt.ukp.wikipedia.parser). Please see the comments on the respective methods in de.tudarmstadt.ukp.wikipedia.api.Page.
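
To confirm that the database still holds the indented replies, the raw markup can be scanned for lines that begin with a colon. The following is a minimal sketch using only the Java standard library. IndentedLineFinder and its method are hypothetical helpers, not part of JWPL, and the input is assumed to be the string returned by getText().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of JWPL): collects colon-indented lines from raw wiki markup. */
final class IndentedLineFinder {

    /** Returns every line of the raw markup that starts with at least one ':'. */
    static List<String> indentedLines(String rawMarkup) {
        List<String> result = new ArrayList<>();
        // \R matches any line break sequence (Java 8 and later)
        for (String line : rawMarkup.split("\\R")) {
            if (line.startsWith(":")) {
                result.add(line);
            }
        }
        return result;
    }
}
```

If this list is non-empty for the output of getText() while the same replies are absent from getPlainText(), the lines were lost during parsing rather than during the database import.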

Best,
Johannes

Kevin Lang

Mar 20, 2017, 6:28:02 AM
to jwpl-users

> Have you tried getText() instead of getPlainText()? It should return the discussion page including all markup (i.e. including the lines that seem to be missing after the page has been parsed). The database should contain the whole page content (anything else would be a bug); it is the parser's responsibility to deal with the markup and return correct plain text. In this case, the Sweble parser seems to have failed.

Yes, I compared getText() with getPlainText(). The Sweble parser deletes every line that starts with ":".

> You could try parsing the page with the (deprecated) JWPL MediaWiki parser (package de.tudarmstadt.ukp.wikipedia.parser). Please see the comments on the respective methods in de.tudarmstadt.ukp.wikipedia.api.Page.

I added the de.tudarmstadt.ukp.wikipedia.parser 1.1.0 jar and am now using it like this:

// Fetch the talk page and parse its raw markup with the deprecated JWPL parser
Page page = wiki.getDiscussionPage(34984282);
MediaWikiParserFactory pf = new MediaWikiParserFactory(Language.english);
MediaWikiParser parser = pf.createParser();
ParsedPage pp = parser.parse(page.getText());

// Print each section title followed by the text of its paragraphs
for (Section section : pp.getSections()) {
    System.out.println(section.getTitle());
    for (Paragraph para : section.getParagraphs()) {
        System.out.println(para.getText());
    }
    System.out.println();
}



The output then looks like this (e.g. article ID 34984282, topic "The climax of the video is"):

Original:
== The climax of the video is ==

when all the [[Dudley Do-Right]]s cheer for the reading of Obama's letter. The video shows the letter (18:35) with the following text highlighted, "[I have authorized a] small number of combat equipped U.S. forces to deploy to central Africa to provide assistance to regional forces that are working toward the removal of Joseph Kony from the battlefield." Jason Russell reads these words while a crowd, including a serious looking African lady (18:42), look on in hushed reverence before breaking into estatic celebration. The musical crescendo carries to the resolution, asking you and all your friends to join the quest to get more Western policy-makers to bring UAV and Green Beret justice to these savages.

A synopsis is not complete without a quick overview of the climax and resolution of a video. [[User:Luke 19 Verse 27|Luke 19 Verse 27]] ([[User talk:Luke 19 Verse 27|talk]]) 22:34, 12 April 2012 (UTC)
:While your personal opinion of the film is nice, it is not proper for an encyclopedic article. It is [[WP:OR|original research]]. <font color="silver">[[User:Silver seren|Silver]]</font><font color="blue">[[User talk:Silver seren|seren]]</font><sup>[[Special:Contributions/Silver seren|C]]</sup> 00:11, 13 April 2012 (UTC)

::I am not saying the above paragraph should be in the article. The edit I made, and that Blake Burba improved, was no more OR than the sentences that proceed it in the synopsis. It is appropriate to spoil the plot of a half-hour video. Don't you want to include information in this article?
::Please [[WP:Assume|assume good faith]] of me, my brother, as I assume good faith of you. I wish neither I nor you be made an ass, but neither you or me. So let's not assume anything, other than things from the [[Kony 2012|subject of this article]] obviously have [[WP:verify|verifilibililitty!]] [[User:Luke 19 Verse 27|Luke 19 Verse 27]] ([[User talk:Luke 19 Verse 27|talk]]) 05:36, 13 April 2012 (UTC)
:::Then you need a [[WP:RS|reliable source]] for the addition. I was also, if you noticed, removing the son sentence because it was unsourced, though it is sourced now.

:::Furthermore, considering the wording of your earlier attempts, it seems quite clear that you're trying to emphasize a few short seconds of the film in order to make the film look negative. This is not proper. <font color="silver">[[User:Silver seren|Silver]]</font><font color="blue">[[User talk:Silver seren|seren]]</font><sup>[[Special:Contributions/Silver seren|C]]</sup> 06:28, 13 April 2012 (UTC)

::::To be more clear.

A. Watch the video, that is the verifiability. It is information readily available on the internet.

B. Assume good faith. You don't know why I do what fore. Making such accusations makes it seem like ''you'' have the agenda. I'm trying to improve and expand the synopsis. I thought it was fine when others edited out words like "white" and "climax." Please try to reach consensus with me and others, in a circle like we did for the white climax, and we'll all get off together on a better article, more filling than what we had before, sharing our assumtions of good faith and love for each other, just like Jason would want it, love all over everyones faces.

C. Smile at me brother, I love you. Let's improve the article instead of deleting content with little or no justification. [[User:Luke 19 Verse 27|Luke 19 Verse 27]] ([[User talk:Luke 19 Verse 27|talk]]) 00:44, 14 April 2012 (UTC)
:The problem is that the film itself is a primary source. Normally, this wouldn't be an issue if you were using it for a quote of what someone said in the film, but using it to interpret what is happening in a scene is de facto [[WP:OR|original research]]. <font color="silver">[[User:Silver seren|Silver]]</font><font color="blue">[[User talk:Silver seren|seren]]</font><sup>[[Special:Contributions/Silver seren|C]]</sup> 00:56, 14 April 2012 (UTC)
::So the current version is ok with you? It doesn't have any interpretation, just a synopsis. I don't want this situation to get [[WP:EDITWAR|furry]]. [[User:Luke 19 Verse 27|Luke 19 Verse 27]] ([[User talk:Luke 19 Verse 27|talk]]) 17:37, 14 April 2012 (UTC)
:::I reworded it a little and added a reference. <font color="silver">[[User:Silver seren|Silver]]</font><font color="blue">[[User talk:Silver seren|seren]]</font><sup>[[Special:Contributions/Silver seren|C]]</sup> 18:30, 14 April 2012 (UTC)
::::Good edit. Thanks for doing the source. [[User:Luke 19 Verse 27|Luke 19 Verse 27]] ([[User talk:Luke 19 Verse 27|talk]]) 19:20, 14 April 2012 (UTC)

Parsed:
The climax of the video is
when all the Dudley Do-Rights cheer for the reading of Obama's letter. The video shows the letter (18:35) with the following text highlighted, "[I have authorized a] small number of combat equipped U.S. forces to deploy to central Africa to provide assistance to regional forces that are working toward the removal of Joseph Kony from the battlefield." Jason Russell reads these words while a crowd, including a serious looking African lady (18:42), look on in hushed reverence before breaking into estatic celebration. The musical crescendo carries to the resolution, asking you and all your friends to join the quest to get more Western policy-makers to bring UAV and Green Beret justice to these savages.
A synopsis is not complete without a quick overview of the climax and resolution of a video. Luke 19 Verse 27 (talk) 22:34, 12 April 2012 (UTC)
While your personal opinion of the film is nice, it is not proper for an encyclopedic article. It is original research. SilverserenC 00:11, 13 April 2012 (UTC)
:I am not saying the above paragraph should be in the article. The edit I made, and that Blake Burba improved, was no more OR than the sentences that proceed it in the synopsis. It is appropriate to spoil the plot of a half-hour video. Don't you want to include information in this article?
:Please assume good faith of me, my brother, as I assume good faith of you. I wish neither I nor you be made an ass, but neither you or me. So let's not assume anything, other than things from the subject of this article obviously have verifilibililitty! Luke 19 Verse 27 (talk) 05:36, 13 April 2012 (UTC)
::Then you need a reliable source for the addition. I was also, if you noticed, removing the son sentence because it was unsourced, though it is sourced now.
::Furthermore, considering the wording of your earlier attempts, it seems quite clear that you're trying to emphasize a few short seconds of the film in order to make the film look negative. This is not proper. SilverserenC 06:28, 13 April 2012 (UTC)
:::To be more clear.
A. Watch the video, that is the verifiability. It is information readily available on the internet.
B. Assume good faith. You don't know why I do what fore. Making such accusations makes it seem like you have the agenda. I'm trying to improve and expand the synopsis. I thought it was fine when others edited out words like "white" and "climax." Please try to reach consensus with me and others, in a circle like we did for the white climax, and we'll all get off together on a better article, more filling than what we had before, sharing our assumtions of good faith and love for each other, just like Jason would want it, love all over everyones faces.
C. Smile at me brother, I love you. Let's improve the article instead of deleting content with little or no justification. Luke 19 Verse 27 (talk) 00:44, 14 April 2012 (UTC)
The problem is that the film itself is a primary source. Normally, this wouldn't be an issue if you were using it for a quote of what someone said in the film, but using it to interpret what is happening in a scene is de facto original research. SilverserenC 00:56, 14 April 2012 (UTC)
:So the current version is ok with you? It doesn't have any interpretation, just a synopsis. I don't want this situation to get furry. Luke 19 Verse 27 (talk) 17:37, 14 April 2012 (UTC)
::I reworded it a little and added a reference. SilverserenC 18:30, 14 April 2012 (UTC)
:::Good edit. Thanks for doing the source. Luke 19 Verse 27 (talk) 19:20, 14 April 2012 (UTC)

Well, it works better now. It no longer deletes the comments and parses them into proper plain text... but it removes one ":" from the beginning of each line (if there is one), which is also strange behavior. Is this solvable?

But I can work with this, too, so thanks for the help!

Johannes Daxenberger

Mar 20, 2017, 2:40:36 PM
to jw...@googlegroups.com

> but it removes one ":" from the beginning of each line (if there is one),

Well, a leading colon is wiki syntax for indentation, see https://en.wikipedia.org/wiki/Wikipedia:Indentation
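
Because each leading colon stands for one level of indentation, the reply depth of a line can be read directly from the raw markup, independent of what the parser does with the colons. The following is a minimal sketch with standard Java only. ReplyDepth is a hypothetical helper, not a JWPL class, and each input line is assumed to come from the text returned by getText().

```java
/** Hypothetical helper (not part of JWPL): reads the reply depth from a raw talk-page line. */
final class ReplyDepth {

    /** Counts the leading ':' characters, i.e. the indentation level of the line. */
    static int depth(String line) {
        int n = 0;
        while (n < line.length() && line.charAt(n) == ':') {
            n++;
        }
        return n;
    }

    /** Returns the line without its leading colons and surrounding whitespace. */
    static String text(String line) {
        return line.substring(depth(line)).trim();
    }
}
```

For the line "::Hey, could you edit the article" this gives a depth of 2, which can be stored next to the parsed text to keep the thread structure.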

Why the parser only seems to delete the first colon, I don't know.

Potential workaround: fetch the page with getPage(), strip the repeated leading colons from its raw text, and apply the parser to the result.
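
The colon-stripping step of this workaround could look roughly like the sketch below. It uses only java.util.regex. ColonStripper is a hypothetical helper, not part of JWPL, and the choice to also drop spaces and tabs that directly follow the colons is an assumption.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of JWPL): removes indentation colons from raw wiki markup. */
final class ColonStripper {

    // (?m) makes ^ match at the start of every line; the pattern covers
    // one or more colons plus any spaces or tabs directly after them.
    private static final Pattern LEADING_COLONS = Pattern.compile("(?m)^:+[ \\t]*");

    /** Returns the markup with all leading colons removed; colons inside a line are kept. */
    static String stripIndentation(String rawMarkup) {
        return LEADING_COLONS.matcher(rawMarkup).replaceAll("");
    }
}
```

The cleaned string can then be passed to parser.parse(...) in place of page.getText(). Note that this discards the reply depth, so it should be recorded beforehand if the thread structure matters.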

 

HTH,

Johannes
