Parsing WXR (WordPress eXtended RSS) XML file data


Aharon Varady

May 28, 2019, 9:12:18 PM
to Open Siddur Technical Discussion List
I mentioned in the previous email that the site data for opensiddur.org is now available in downloadable WXR (WordPress eXtended RSS) XML files. 

Making these files publicly accessible is mainly intended as a way for researchers to access the site data without having to scrape opensiddur.org. Aside from RSS, there's really no public API (that I know of) for accessing all 960+ posts on opensiddur.org.

But I have another objective for beginning to move our site data into XML. For all of our transcribed text on opensiddur.org, I want to separate our data from its presentation.

A digression:
Such a goal should be no surprise to folk watching this project from its early days. Efraim and I envisioned the Open Siddur as a database by which we could serve liturgists sharing new prayers, scholars researching liturgy, and crafters compiling collections of prayers and related work into new prayerbooks. By separating data from its presentation, that data could be presented in an infinity of ways in an infinity of variations. Our project was founded with great hope in 2009 with this in mind. However, by late 2010, it was clear we wouldn't be realizing this vision soon. So, I began to do something simple and useful with my own modest skills -- just to help collect and curate liturgical content contributed by our community on the wordpress site that had up till then mainly served as a blogspace. In that way, opensiddur.org became the CMS it is today. Meanwhile, development continued on our collaborative transcription environment and siddur building web application at app.opensiddur.org.

Back to these WXR files. By themselves they are large, unwieldy XML files containing the raw HTML and postmeta data of every one of the posts and pages of opensiddur.org. It seems to me that the next step in making this data accessible is to parse these files into 960+ individual post files containing both the raw HTML data and relevant postmeta data such as title, author, co-author(s), content license, date published, categories, and tags -- and to do as much as we can to provide that as structured data. (Further steps can link these files to the manifests of page images hosted at the Internet Archive, convert them into nice, JLPTEI-conforming XML, and write some XSLT to display them once again as HTML.)
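
To give a concrete sense of what that parse might look like, here is a minimal sketch using only Python's standard library. It assumes the export uses the WXR 1.2 namespace (older exports use .../export/1.0/ or .../export/1.1/) and a hypothetical input filename; the elements read here are the standard WXR ones.

import xml.etree.ElementTree as ET

# WXR namespaces; the wp: URI varies with the export version (1.0/1.1/1.2).
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

tree = ET.parse("opensiddur.wordpress.xml")  # hypothetical filename
for item in tree.getroot().iter("item"):
    post_id = item.findtext("wp:post_id", namespaces=NS)
    title = item.findtext("title", default="(untitled)")
    author = item.findtext("dc:creator", namespaces=NS)
    date = item.findtext("wp:post_date", namespaces=NS)
    body = item.findtext("content:encoded", default="", namespaces=NS)
    categories = [c.text for c in item.findall("category[@domain='category']")]
    tags = [c.text for c in item.findall("category[@domain='post_tag']")]
    with open("post-%s.html" % post_id, "w", encoding="utf-8") as f:
        # Carry the postmeta along as HTML comment headers above the raw body.
        f.write("<!-- title: %s | author: %s | date: %s -->\n" % (title, author, date))
        f.write("<!-- categories: %s | tags: %s -->\n"
                % (", ".join(categories), ", ".join(tags)))
        f.write(body)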

I've had some success parsing the WXR posts file into individual text files containing the body of each post, using a wxr2txt Python script I found here: https://gist.github.com/ruslanosipov/b748a138389db2cda1e8

Unfortunately, that script doesn't bother to copy over the postmeta data along with the HTML in the post body. So I'm still trying to figure out what I need to add to this script to better parse the WXR file. (I also noticed that the script seems to choke on the pages WXR file.) So there's room for improvement for folk who want to help out and flex their Python skills. Python's HTMLParser module should come into service here, as can be seen in this fork of the script:
https://gist.github.com/aegis1980/4d00c381b0eb67f83cf93365cd7b69ad

(For some reason, HTMLParser isn't working for me in my Python install, so if you can get the above fork to work, let me know.)
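
One likely culprit: the module was renamed between Python versions -- Python 2's "from HTMLParser import HTMLParser" became Python 3's "from html.parser import HTMLParser" -- so a fork written against one will fail to import on the other. For the curious, here is the general shape of the tag-stripping approach under Python 3 (a sketch, not the fork's actual code):

from html.parser import HTMLParser  # "from HTMLParser import HTMLParser" on Python 2

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

extractor = TextExtractor()
extractor.feed("<p>A prayer <em>with emphasis</em> and a <a href='#'>link</a></p>")
print(extractor.text())  # -> "A prayer with emphasis and a link"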

So have fun experimenting with the wxr2txt.py script -- and let me know what success you have in parsing the site data.

Aharon


--
Aharon Varady
Founding Director, shamesh
the Open Siddur Project
https://opensiddur.org

Twitter | Facebook
Pronouns: He/him/his

Aharon Varady

Jul 18, 2019, 9:19:43 PM
to opensiddur-tech

WEBSITE UPDATE

Folks watching this space know that for a while I've been looking for ways to make the content on opensiddur.org "site-neutral." Opensiddur.org currently runs on a WordPress site, which means all its data is stored in a MySQL database on the site's backend. While that database is theoretically accessible through WordPress's REST API, I've been keen to make sure that all the content on opensiddur.org is practically accessible through an easily downloadable public repository. I also have my eye on a future when Efraim Feinstein's app.opensiddur.org will be able to make use of the last ten years of content shared on opensiddur.org -- on our way to realizing our Open Siddur builder application.
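
(For anyone who wants to test that theoretical route, the standard REST endpoint can be paged through from Python. A sketch, assuming the API is enabled on the site, using the third-party requests library:)

import requests

API = "https://opensiddur.org/wp-json/wp/v2/posts"

posts, page = [], 1
while True:
    resp = requests.get(API, params={"per_page": 100, "page": page})
    if resp.status_code != 200:  # WordPress answers 400 once past the last page
        break
    batch = resp.json()
    if not batch:
        break
    posts.extend(batch)
    page += 1

print("fetched %d posts" % len(posts))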


tl;dr -- You can now download or navigate through all our site's posts via https://github.com/aharonium/opensiddur.org


The posts are preserved as simple HTML pages. Each is saved with a Markdown (.md) extension so that GitHub's frontend will render it.
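
In terms of the parsing script, this just means writing each post's body out under a .md filename instead of .html. A sketch -- the slugify helper here is a hypothetical stand-in for however the post titles actually get turned into filenames:

import re

def slugify(title):
    """Hypothetical filename helper: lowercase, hyphens, ASCII-only."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "untitled"

def write_post(title, body_html, outdir="."):
    # GitHub renders .md files and passes raw HTML through, so the HTML
    # body can be saved as-is under a Markdown extension.
    path = "%s/%s.md" % (outdir, slugify(title))
    with open(path, "w", encoding="utf-8") as f:
        f.write(body_html)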


This isn't the first time I've attempted this, but I've had much more success on this latest try using a Python script to parse the XML file that WordPress generates. I'm updating this script here: https://gist.github.com/aharonium/1d148b57e2b8488f68e2f2781ce92e00
The closest I've come to bootstrapping from an XSLT solution is via this script: https://www.oipapio.com/question-361474
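
(For what it's worth, an XSLT stylesheet along those lines can also be driven from Python via lxml, so the two approaches aren't mutually exclusive. A sketch -- wxr2html.xsl stands in for whatever stylesheet you adapt:)

from lxml import etree

# Compile a (hypothetical) stylesheet and run it over the WXR export.
transform = etree.XSLT(etree.parse("wxr2html.xsl"))
result = transform(etree.parse("opensiddur.wordpress.xml"))
result.write_output("posts.html")  # honors the stylesheet's xsl:output settings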


There is still work to be done to make the Python script we're using for this process more robust, so please speak up if you'd like to help with this task. (Co-author, category, tag, license, and attribution metadata aren't being parsed out from the XML yet.) If you have experience parsing XML via Python or XSLT, let me know.
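
A sketch of where that remaining metadata lives in the XML: categories and tags are plain <category> elements on each <item>, while co-author, license, and attribution data most likely sit in <wp:postmeta> key/value pairs. The actual meta_key names are site-specific, so this collects them all for inspection:

import xml.etree.ElementTree as ET

NS = {"wp": "http://wordpress.org/export/1.2/"}  # match the export's version

def item_metadata(item):
    """Gather the per-post metadata the current script skips."""
    meta = {
        "categories": [c.text for c in item.findall("category[@domain='category']")],
        "tags": [c.text for c in item.findall("category[@domain='post_tag']")],
    }
    # Everything else (co-authors, license, attribution) is typically stored
    # as postmeta; collect every key so the useful ones can be identified.
    for pm in item.findall("wp:postmeta", NS):
        key = pm.findtext("wp:meta_key", namespaces=NS)
        value = pm.findtext("wp:meta_value", namespaces=NS)
        meta.setdefault(key, []).append(value)
    return meta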
