Issue 96 in gwtwiki: Convert Wikipedia pages to HTML/XML


gwt...@googlecode.com

May 26, 2012, 10:30:00 AM
to bl...@googlegroups.com
Status: Started
Owner: axe...@gmail.com
Labels: Type-Enhancement Priority-Medium

New issue 96 by axe...@gmail.com: Convert Wikipedia pages to HTML/XML
http://code.google.com/p/gwtwiki/issues/detail?id=96

Copied from:
https://groups.google.com/forum/?fromgroups#!topic/bliki/eBsfyHZ4xVY

I don't think I can use the dumps, because I would like to process the
article of the day, which is not effectively available as a dump (for
example "http://de.wikipedia.org:80/w/api.php?action=query&titles=Wikipedia:Hauptseite/Artikel_des_Tages/Montag&prop=revisions&rvprop=content&format=xml").
I would also like to avoid the overhead of using the filesystem and the Derby
database for downloading/caching. I just need to parse/convert the response
of the URL above, provided as a string or stream.
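
For illustration, the raw wiki text can be pulled out of such an API response with the JDK's own XML parser before it is handed to any converter. This is only a minimal sketch: the class and method names (RevisionExtractor, extractWikiText) are made up for this example and are not part of bliki, and it assumes the response contains at most one <rev> element of interest.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not a bliki class): extracts wikitext from an API response. */
class RevisionExtractor {

    /** Returns the text of the first <rev> element, or null if there is none. */
    static String extractWikiText(String apiXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(apiXml)));
            NodeList revs = doc.getElementsByTagName("rev");
            return revs.getLength() == 0 ? null : revs.item(0).getTextContent();
        } catch (Exception e) {
            // Wrap parser/IO errors so callers do not need checked exceptions.
            throw new IllegalArgumentException("Not a parsable API response", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><api><query><pages><page title=\"T\">"
                + "<revisions><rev xml:space=\"preserve\">'''Hello''' [[World]]</rev>"
                + "</revisions></page></pages></query></api>";
        System.out.println(extractWikiText(xml));
    }
}
```

The same method works for a stream if the InputSource is built from an InputStream instead of a StringReader.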


gwt...@googlecode.com

May 26, 2012, 10:42:25 AM
to bl...@googlegroups.com

Comment #1 on issue 96 by axe...@gmail.com: Convert Wikipedia pages to
HTML/XML
http://code.google.com/p/gwtwiki/issues/detail?id=96

Committed r5313.

With this change the example call:

testWikipediaENAPI("Wikipedia:Hauptseite/Artikel_des_Tages/Montag", "http://de.wikipedia.org/w/api.php",
Locale.GERMAN);

creates at least the HTML file and downloads the referenced image file.

To avoid caching the template files, you can copy (don't derive) your own
wiki model from the APIWikiModel, override the getRawWikiContent() method,
and eliminate the usage of the Derby database.
If a template name is requested in your getRawWikiContent() method, serve it
from your own static files instead of calling fWikiDB.selectTopic().
That way you should be able to eliminate the dependency on the Derby database.
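
As a rough sketch of the "own static files" idea, the template texts could be kept in a plain map that a custom getRawWikiContent() consults instead of the database. StaticTemplateStore and its methods are hypothetical names for this example, not bliki classes, and the title normalization only assumes the usual Wikipedia rules (underscores equal spaces, first letter case-insensitive).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a template lookup; it is not the bliki API. */
class StaticTemplateStore {
    private final Map<String, String> templates = new HashMap<>();

    /** Registers the raw wiki text of a template under its name. */
    void put(String templateName, String rawWikiText) {
        templates.put(normalize(templateName), rawWikiText);
    }

    /** Returns the stored wiki text for a template, or null if it is unknown. */
    String getRawTemplate(String templateName) {
        return templates.get(normalize(templateName));
    }

    // Treat "_" like " " and ignore the case of the first letter.
    private static String normalize(String name) {
        String n = name.trim().replace('_', ' ');
        return n.isEmpty() ? n : Character.toUpperCase(n.charAt(0)) + n.substring(1);
    }
}
```

A copied wiki model could then return store.getRawTemplate(name) wherever a template is requested, so no Derby access is needed for templates.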


gwt...@googlecode.com

May 26, 2012, 6:11:14 PM
to bl...@googlegroups.com

Comment #2 on issue 96 by sven.str...@gmail.com: Convert Wikipedia pages to
HTML/XML

Wow, thank you for the fast support. :-)

I have successfully eliminated the downloading of images and the usage of
the Derby database by deriving an "own" wiki model as you suggested.

But the User object is still required to let the model itself load the
templates and the article via the URL. Do you have an idea how this can be
eliminated? At the moment I parse the raw wiki text myself and let bliki
just convert the article text. But my self-implemented parsing logic for
raw wiki text is very complicated and cannot handle all situations.
Therefore I would like to use bliki for parsing the raw wiki text as well.

For the current parsing and rendering I have a lot of unit tests which
check the final rendering with various contents. These tests cannot be
switched to the new bliki integration, because I can only call bliki with a
URL and cannot inject the pre-loaded raw wiki text as a string. The unit
tests should of course not load the text via the internet. That would not
work anyway, because the content behind the URL of the article of the day
changes weekly. ;-) So I need either a way to initialize the wiki model with
pre-loaded raw wiki text as a string or InputStream, or a way to mock the
remote call that loads the article. I have not found one yet. I tried to
replace "List<Page> thePages = myUser.queryContent(thePageTitles)" within
"getRawWikiContent" with "XMLPagesParser theParser = new
XMLPagesParser(theRawWikiTextAsString); theParser.parse(); List<Page>
thePages = theParser.getPagesList();", but could not get it working because
this method is also used to load templates.
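
One possible way to keep unit tests off the network is a small loader that answers from pre-loaded texts first and only falls back to a remote lookup for titles it does not know. The sketch below is an assumption about how such a seam could look; PreloadedPageLoader and its methods are invented for this example and do not exist in bliki.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical loader (not a bliki class): pre-loaded pages first, remote lookup second. */
class PreloadedPageLoader {
    private final Map<String, String> preloaded = new HashMap<>();
    private final Function<String, String> remote;

    /** The remote function stands in for whatever would query the wiki by title. */
    PreloadedPageLoader(Function<String, String> remote) {
        this.remote = remote;
    }

    /** Registers raw wiki text for an article or template title. */
    void preload(String title, String rawWikiText) {
        preloaded.put(title, rawWikiText);
    }

    /** Serves both articles and templates, so one lookup path covers both cases. */
    String load(String title) {
        String text = preloaded.get(title);
        return text != null ? text : remote.apply(title);
    }
}
```

In a unit test the remote function can simply throw, which makes any accidental network access fail loudly instead of silently fetching content that changes every week.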

Thank you in advance

Regards,

Sven S.

Attachments:
APIWikiModelLite.java 2.4 KB

gwt...@googlecode.com

May 28, 2012, 5:48:30 AM
to bl...@googlegroups.com

Comment #3 on issue 96 by axe...@gmail.com: Convert Wikipedia pages to
HTML/XML
http://code.google.com/p/gwtwiki/issues/detail?id=96

I committed r5377.

With these new methods:
DocumentCreator#renderToFile(String rawWikiText, String title,
ITextConverter converter, String filename) throws IOException;

HTMLCreatorExample#testWikipediaText(String rawWikiText, String title,
Locale locale);

you can render a wiki text snippet directly into a file.

This is a quick and dirty solution.
You should copy DocumentCreator to your own class and delete/refactor the
things you don't need.

If possible please contribute back your finished solution, so that other
users can also use your Creator and WikiModel classes.


gwt...@googlecode.com

May 31, 2012, 2:52:42 PM
to bl...@googlegroups.com

Comment #4 on issue 96 by sven.str...@gmail.com: Convert Wikipedia pages to
HTML/XML

Hi,

it is almost done, and I will post it or provide a patch when it is ready.
One thing regarding the article image is strange. The example at the bottom
contains the image name/reference "Datei:Nyatapole2.jpg", but when I
convert it to HTML with bliki, the result is "Datei:116px-Nyatapole2.jpg".
The image size is prepended to the filename, which isn't correct. The
actual image can be found
via "http://de.wikipedia.org/w/api.php?action=query&titles=Datei:Nyatapole2.jpg&prop=imageinfo&iiprop=url&format=xml",
but not with the bliki-modified image
name: "http://de.wikipedia.org/w/api.php?action=query&titles=Datei:116px-Nyatapole2.jpg&prop=imageinfo&iiprop=url&format=xml".

Do you have an idea why this is happening and how it can be avoided?

Example

<?xml version="1.0"?><api><query><normalized><n
from="Wikipedia:Hauptseite/Artikel_des_Tages/Donnerstag"
to="Wikipedia:Hauptseite/Artikel des
Tages/Donnerstag"/></normalized><pages><page pageid="964888" ns="4"
title="Wikipedia:Hauptseite/Artikel des Tages/Donnerstag"><revisions><rev
xml:space="preserve">{{Shortcut|WP:ADTDO}}{{Wikipedia:Hauptseite/Artikel
des Tages/Bearbeitungshinweise}}
<onlyinclude> {{AdT-Vorschlag
| DATUM = 28.07.2011
| LEMMA = Bhaktapur
| BILD = Datei:Nyatapole2.jpg
| BILDBESCHREIBUNG = Nyata-Tempel, 1708 erbaut, dreißig Meter hoch und der
hinduistischen Gottheit Lakshmi geweiht
| BILDGROESSE = 116px
| BILDUMRANDUNG =
| TEASERTEXT = '''[[Bhaktapur]]''' (nepali भक्तपुर ‚Stadt der Frommen‘)
oder ''Khwopa'' (newari ख्वप ''Khvapa'') ist neben Kathmandu und Lalitpur
mit über 78.000 Einwohnern die dritte und kleinste der Königsstädte des
Kathmandutals in Nepal. Bhaktapur liegt am Fluss Hanumante und wie
Kathmandu an einer alten Handelsroute nach Tibet, was für den Reichtum der
Stadt verantwortlich war. Das Bild der Stadt wird bestimmt von der
Landwirtschaft, der Töpferkunst und besonders von einer lebendigen
traditionellen Musikerszene. Wegen seiner über 150 Musik- und 100
Kulturgruppen wird Bhaktapur als Hauptstadt der darstellenden Künste Nepals
bezeichnet. Die Einwohner von Bhaktapur gehören ethnisch zu den Newar und
zeichnen sich durch einen hohen Anteil von 60 Prozent an Bauern der
Jyapu-Kaste aus. Die Bewohner sind zu fast 90 Prozent Hindus und zu zehn
Prozent Buddhisten. Vom 14. Jahrhundert bis zur zweiten Hälfte des 18.
Jahrhunderts war Bhaktapur Hauptstadt des Malla-Reiches. Aus dieser Zeit
stammen viele der 172 Tempelanlagen, der 32 künstlichen Teiche und der mit
Holzreliefs verzierten Wohnhäuser. Zwar verursachte ein großes Erdbeben
1934 viele Schäden an den Gebäuden, doch konnten diese wieder so instand
gesetzt werden, dass Bhaktapurs architektonisches Erbe bereits seit 1979
auf der UNESCO-Liste des Weltkulturerbes steht.
}} </onlyinclude>
[[Kategorie:Wikipedia:Hauptseite/Artikel des Tages|
Donnerstag]]</rev></revisions></page></pages></query></api>

gwt...@googlecode.com

Jun 7, 2012, 3:03:05 AM
to bl...@googlegroups.com

Comment #5 on issue 96 by axe...@gmail.com: Convert Wikipedia pages to
HTML/XML
http://code.google.com/p/gwtwiki/issues/detail?id=96

I'm appending the width with the "iiurlwidth" parameter like this:
http://de.wikipedia.org/w/api.php?action=query&titles=Datei:Nyatapole2.jpg&prop=imageinfo&iiprop=url&format=xml&iiurlwidth=116

See the example I've committed: r5528

See the info.bliki.wiki.impl.APIWikiModel#appendInternalImageLink() method
for details;
http://code.google.com/p/gwtwiki/source/browse/trunk/info.bliki.wiki/bliki-pdf/src/main/java/info/bliki/wiki/impl/APIWikiModel.java
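
To make the query shape concrete, here is a small sketch that builds such an imageinfo URL from the unmodified image title and passes the width separately via "iiurlwidth". ImageInfoUrl is a made-up helper for this example and is not how APIWikiModel is implemented; note that it percent-encodes the title, so the colon appears as %3A.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper (not a bliki class) that builds a MediaWiki imageinfo query URL. */
class ImageInfoUrl {

    /** Builds the query; a width of 0 or less omits the iiurlwidth parameter. */
    static String build(String apiBase, String imageTitle, int width) {
        try {
            StringBuilder url = new StringBuilder(apiBase)
                    .append("?action=query&titles=")
                    .append(URLEncoder.encode(imageTitle, "UTF-8"))
                    .append("&prop=imageinfo&iiprop=url&format=xml");
            if (width > 0) {
                url.append("&iiurlwidth=").append(width);
            }
            return url.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://de.wikipedia.org/w/api.php", "Datei:Nyatapole2.jpg", 116));
    }
}
```

Keeping the width out of the title means the lookup always uses the original file name, which is the name the API knows.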


gwt...@googlecode.com

Jun 7, 2012, 5:58:25 PM
to bl...@googlegroups.com

Comment #6 on issue 96 by sven.str...@gmail.com: Convert Wikipedia pages to
HTML/XML

Hm, I tried to override appendInternalImageLink, but the call parameters
already contain the "extended" image filename. Therefore
appendInternalImageLink cannot be what causes the magic extension.

hrefImageLink = "Datei:116px-Nyatapole2.jpg"
srcImageLink = "116px-Nyatapole2.jpg"
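
Until the source of the prefix is found, one workaround is to strip a leading "<digits>px-" from the file name before querying the API. The following is only a sketch of that idea with an invented class name (ImageNames); it would also strip the prefix from a file whose real name happens to start with such a pattern, so it is a heuristic rather than a fix.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a bliki class) that undoes a "<width>px-" file name prefix. */
class ImageNames {
    // Only a run of digits followed by "px-" at the very start of the file name.
    private static final Pattern WIDTH_PREFIX = Pattern.compile("^\\d+px-");

    /** Accepts links with or without a namespace such as "Datei:". */
    static String stripWidthPrefix(String imageLink) {
        int colon = imageLink.indexOf(':');
        String namespace = colon < 0 ? "" : imageLink.substring(0, colon + 1);
        String fileName = imageLink.substring(colon + 1);
        return namespace + WIDTH_PREFIX.matcher(fileName).replaceFirst("");
    }
}
```

Because the pattern is anchored at the start of the file name, other "-" characters in the name are left untouched.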

gwt...@googlecode.com

Aug 20, 2012, 3:36:03 PM
to bl...@googlegroups.com

Comment #7 on issue 96 by sven.str...@gmail.com: Convert Wikipedia pages to
HTML/XML

Hi,

I have created a new "in-memory" APIWikiModel along with an example and
another modification to the DocumentCreator. Everything is contained within
the attached patch (SVN). Is it possible to apply and commit this patch?

Regards,

Sven S.

Attachments:
in-memory-support.patch 8.3 KB

gwt...@googlecode.com

Aug 21, 2012, 5:32:57 PM
to bl...@googlegroups.com

Comment #8 on issue 96 by axe...@gmail.com: Convert Wikipedia pages to
HTML/XML
http://code.google.com/p/gwtwiki/issues/detail?id=96

I added your patch with this commit: r6831.

gwt...@googlecode.com

Sep 1, 2012, 7:21:28 AM
to bl...@googlegroups.com

Comment #9 on issue 96 by sven.str...@gmail.com: Convert Wikipedia pages to
HTML/XML

Hi,

thanks for adding the patch.

I detected two new problems which I have fixed with another patch. Could
you please also add this patch?

1. Problem: When the image file has an SVG extension, the WikiModel changes
the extension from ".svg" to ".svg.png". This behavior isn't desired in the
in-memory model, because it breaks the image URL. I added a quick fix
similar to the one for the image-size prefix I described above. This should
be improved in the future, for example by making this behavior overridable
in the WikiModel.

2. Problem: An article image wasn't detected, because the prefix/namespace
check for images sometimes does not work. INamespace#getImage() returned
"Datei" (German locale for "File") and INamespace#getImage2() returned
"Image", but the article contained "File" (not localized). So all three
prefixes should be checked, because some article requests return "Datei"
and others return "File".
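
Both quick fixes can be sketched independently of bliki. The code below is an illustration under assumptions, not the content of the attached patch: ImagePrefixCheck and its methods are invented names, the list of accepted prefixes is passed in by the caller, and the suffix handling simply drops a trailing ".png" after ".svg".

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helpers (not bliki classes) for the two image problems described above. */
class ImagePrefixCheck {

    /** True if the link starts with one of the given image namespaces, ignoring case. */
    static boolean isImageLink(String link, List<String> imageNamespaces) {
        int colon = link.indexOf(':');
        if (colon <= 0) {
            return false; // no namespace prefix at all
        }
        String prefix = link.substring(0, colon).trim();
        for (String namespace : imageNamespaces) {
            if (namespace.equalsIgnoreCase(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Turns "name.svg.png" back into "name.svg"; other names are returned unchanged. */
    static String stripPngSuffixFromSvg(String fileName) {
        boolean rendered = fileName.toLowerCase(Locale.ROOT).endsWith(".svg.png");
        return rendered ? fileName.substring(0, fileName.length() - ".png".length()) : fileName;
    }
}
```

With the prefixes "Datei", "Image" and "File" in the list, both localized and non-localized image links are recognized.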

Attachments:
in-memory-support-2.patch 2.3 KB

gwt...@googlecode.com

Sep 2, 2012, 4:55:13 AM
to bl...@googlegroups.com

Comment #10 on issue 96 by axe...@gmail.com: Convert Wikipedia pages to
HTML/XML
http://code.google.com/p/gwtwiki/issues/detail?id=96

Added your patch with commit r6896.

gwt...@googlecode.com

Oct 17, 2013, 5:27:21 PM
to bl...@googlegroups.com

Comment #11 on issue 96 by sven.str...@googlemail.com: Convert Wikipedia
pages to HTML/XML

Hi,

I have improved the code one final time. The solution is now more stable
(an error occurred when the original image name contained a "-" sign), the
code is now clean (the ToDo has also been resolved), and performance should
be better.

Could you please integrate the patch in 3.0.20? I hope it can also get
deployed to Sonatype soon. :-)

Thanks.

Attachments:
in-memory-support-3.patch 3.6 KB


gwt...@googlecode.com

Oct 17, 2013, 5:29:45 PM
to bl...@googlegroups.com

Comment #12 on issue 96 by sven.str...@googlemail.com: Convert Wikipedia
pages to HTML/XML

I think the issue can also be marked as fixed once the
in-memory-support-3.patch is applied.

gwt...@googlecode.com

Oct 20, 2013, 1:14:36 PM
to bl...@googlegroups.com

Comment #13 on issue 96 by axe...@gmail.com: Convert Wikipedia pages to
HTML/XML
http://code.google.com/p/gwtwiki/issues/detail?id=96

Committed r9124 and r9125