WikiSpaces

Emilio J. Rodríguez-Posada

May 4, 2018, 3:52:43 PM
to wikiteam...@googlegroups.com
Hello;

Please, somebody add WikiSpaces to ArchiveBot. The website is closing in July.

We don't have a scraper to extract the metadata and histories for the WikiSpaces wiki engine, so I think ArchiveBot is a good solution for now.

Thanks

Emilio J. Rodríguez-Posada

May 4, 2018, 4:25:54 PM
to wikiteam...@googlegroups.com
I am running our spider for WikiSpaces to retrieve a list of URLs of wikis (subdomains). It could help us launch wget on individual wikis.

But an ArchiveBot wget job on the main domain wikispaces.com is still needed. I don't know how much content it will discover from the main page...

Federico Leva (Nemo)

May 4, 2018, 5:25:30 PM
to wikiteam...@googlegroups.com, Emilio J. Rodríguez-Posada
Emilio J. Rodríguez-Posada, 04/05/2018 22:52:
> Please, somebody add WikiSpaces to ArchiveBot. The website is closing in
> July.

18.07 <@SketchCow> Wikispaces should be the thing we go after

This was several days ago, I think, but
<https://www.archiveteam.org/index.php?title=Wikispaces> is still
basically empty.

ArchiveBot needs at least some URLs to cycle through. On their main page
I don't see a directory of wikis. Can you compile a list of domains to
be fed into ArchiveBot?

It would be useful to add the URL structure to the wiki page, too. I
can't tell whether there is some page which generates loops, or some
directory of all pages on a domain.

Federico

Emilio J. Rodríguez-Posada

May 4, 2018, 5:37:04 PM
to Federico Leva (Nemo), wikiteam...@googlegroups.com
2018-05-04 23:25 GMT+02:00 Federico Leva (Nemo) <nemo...@gmail.com>:
> ArchiveBot needs at least some URLs to cycle through. On their main page
> I don't see a directory of wikis. Can you compile a list of domains to
> be fed into ArchiveBot?
>
> It would be useful to add the URL structure to the wiki page, too. I
> can't tell whether there is some page which generates loops, or some
> directory of all pages on a domain.

I will try to add some info to the AT wiki.

Emilio J. Rodríguez-Posada

May 4, 2018, 8:20:17 PM
to Federico Leva (Nemo), wikiteam...@googlegroups.com
I just committed a first version of the WikiSpaces downloader: https://github.com/WikiTeam/wikiteam/commit/cfb225ea5ecba2271b3839c42e661ad1aa1b0290

It downloads: 1) all pages, 2) all files, 3) CSV metadata for all pages, 4) CSV metadata for all files.

But it downloads just the current version of every page or file, not the full history. I think it is better like this to avoid overloading the servers: there is no way to download a page's full history in a single request, so you would have to request every revision one at a time.

Right now pages are downloaded as HTML. Tomorrow I will fix it to download wikitext.
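
Roughly, the per-wiki loop looks like this (a simplified sketch; the endpoint paths and file names below are placeholders, not the real WikiSpaces URLs; wikispaces.py has the actual ones and also grabs the files themselves):

    # Simplified sketch only: BASE and the endpoint paths are placeholders,
    # not the real WikiSpaces URLs. See wikispaces.py for the real logic.
    import os
    import requests

    BASE = "https://examplewiki.wikispaces.com"   # placeholder wiki
    OUTDIR = "examplewiki.wikispaces.com"
    os.makedirs(OUTDIR, exist_ok=True)
    session = requests.Session()

    # Current HTML of every page (placeholder page-list endpoint).
    pages = session.get(BASE + "/page/list").text.splitlines()    # placeholder
    for page in pages:
        html = session.get(BASE + "/" + page).text
        filename = page.replace("/", "%2F") + ".html"             # keep names filesystem-safe
        with open(os.path.join(OUTDIR, filename), "w", encoding="utf-8") as f:
            f.write(html)

    # CSV metadata, one dump for pages and one for files (placeholder endpoints).
    for kind in ("pages", "files"):
        csv = session.get(BASE + "/space/" + kind + "?format=csv").text   # placeholder
        with open(os.path.join(OUTDIR, kind + ".csv"), "w", encoding="utf-8") as f:
            f.write(csv)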

You can run some tests (please, just one or two small wikis for now) and report bugs.

A list of 10,000 WikiSpaces wikis has been uploaded: https://github.com/WikiTeam/wikiteam/commit/145b040784afea6bd4e39946cfd0afb89c2e502f

Most of them are empty, I guess. More to arrive soon; the spider is still running.

Emilio J. Rodríguez-Posada

May 5, 2018, 4:27:28 AM
to Federico Leva (Nemo), wikiteam...@googlegroups.com
New version uploaded. It should work fine. I have seen that sometimes the server returns an empty CSV for the files metadata history. Not a big problem...

python3 wikispaces.py https://wikiurl.wikispaces.com

Some example small wikis to test:

https://andaluciacd.wikispaces.com

https://conociendopraga.wikispaces.com

https://conociendotrujillo.wikispaces.com

Run your tests and give me feedback.

Emilio J. Rodríguez-Posada

May 6, 2018, 8:37:08 AM
to Federico Leva (Nemo), wikiteam...@googlegroups.com
The download & upload process is coded in the same script.

https://github.com/WikiTeam/wikiteam/blob/master/wikispaces.py

Requirements:

zip command (apt-get install zip)
ia command (pip install internetarchive, and configure it)

Usage:

python3 wikispaces.py wikilist.txt --upload (--admin too if you are a WikiTeam collection admin)

For further details:

python3 wikispaces.py --help


The first wikis are being uploaded: https://archive.org/search.php?query=subject%3A%22wikispaces%22%20AND%20subject%3A%22wikiteam%22

You can start archiving WikiSpaces now.

Federico Leva (Nemo)

May 6, 2018, 2:29:58 PM
to Emilio J. Rodríguez-Posada, wikiteam...@googlegroups.com
Emilio J. Rodríguez-Posada, 06/05/2018 15:36:
> The download & upload process is coded in the same script.

Thanks! If you need me to run the script on a list of wikis from a fast
server to speed things up, just send me the list.

Federico

Emilio J. Rodríguez-Posada

May 6, 2018, 3:17:42 PM
to Federico Leva (Nemo), wikiteam...@googlegroups.com
I just uploaded a few lists with 1,000 wikis each (wikispaces00-04): https://github.com/WikiTeam/wikiteam/tree/master/listsofwikis/wikispaces

python3 wikispaces.py wikispaces00 --upload --admin
 


Emilio J. Rodríguez-Posada

May 6, 2018, 3:19:11 PM
to Federico Leva (Nemo), wikiteam...@googlegroups.com
Be sure to always use the latest version of wikispaces.py.

I made some changes in the last few hours.

Emilio J. Rodríguez-Posada

May 6, 2018, 3:22:22 PM
to Federico Leva (Nemo), wikiteam...@googlegroups.com
I am having issues uploading to IA right now. Not sure if some server is down.

Emilio J. Rodríguez-Posada

May 7, 2018, 4:24:23 AM
to Federico Leva (Nemo), wikiteam...@googlegroups.com
Servers are working well again.

Federico Leva (Nemo)

May 14, 2018, 1:47:03 PM
to Emilio J. Rodríguez-Posada, wikiteam...@googlegroups.com
Emilio J. Rodríguez-Posada, 07/05/2018 11:23:
> Servers are working well again.

Good. I see that you've already archived over 7k wikis. I can take, let's
say, 50k wikis if you tell me which lists to go through. How many
parallel instances of the script is it OK to run?

Federico

Emilio J. Rodríguez-Posada

May 14, 2018, 2:08:32 PM
to Federico Leva (Nemo), wikiteam...@googlegroups.com
You can use the wikispaces00-04 files, 1,000 wikis each. When you need more, I will commit more lists. The spider has found about 150,000 WikiSpaces wikis so far and it is still running.

Every batch of 1,000 wikis needs about 50 GB of space (directories + zips) and 2 or 3 days to download. You could run 3 to 5 parallel jobs at the same time.

If any batch has errors, you can delete the URL from the list and relaunch; the script will skip all previously uploaded wikis.
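
For example, a minimal way to drive several instances in parallel, one per list file (just a sketch; it assumes wikispaces.py and the list files are in the current directory and uses the same flags shown above):

    # Sketch: one wikispaces.py run per list file, a few at a time.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    lists = ["wikispaces00", "wikispaces01", "wikispaces02",
             "wikispaces03", "wikispaces04"]

    def run(listfile):
        # Relaunching is safe: the script skips wikis that were already uploaded.
        return subprocess.call(["python3", "wikispaces.py", listfile, "--upload"])

    with ThreadPoolExecutor(max_workers=3) as pool:   # 3 to 5 parallel jobs
        exit_codes = list(pool.map(run, lists))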

Federico Leva (Nemo)

May 14, 2018, 3:54:58 PM
to Emilio J. Rodríguez-Posada, wikiteam...@googlegroups.com
Emilio J. Rodríguez-Posada, 14/05/2018 21:07:
> You can use the wikispaces00-04 files, 1000 wikis each.

Ok, doing. Could you call the internetarchive upload library from inside
Python, rather than shelling out? That way it's easier to run without
being root.
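
Something along these lines with the internetarchive library would do (just a sketch; the identifier and metadata values are made-up examples, not what the script actually uses):

    # Sketch: upload a finished zip with the internetarchive library instead
    # of shelling out to the "ia" command. Identifier and metadata are examples.
    from internetarchive import upload

    responses = upload(
        "wiki-examplewiki.wikispaces.com",             # made-up identifier
        files=["examplewiki.wikispaces.com.zip"],
        metadata={
            "mediatype": "web",
            "subject": "wiki;wikiteam;wikispaces",
            "title": "Wiki - examplewiki.wikispaces.com",
        },
    )
    print([r.status_code for r in responses])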

Federico

Federico Leva (Nemo)

May 16, 2018, 9:19:54 AM
to Emilio J. Rodríguez-Posada, wikiteam...@googlegroups.com
Federico Leva (Nemo), 14/05/2018 22:54:
> Ok, doing. Could you call the internetarchive upload library from inside
> Python, rather than shelling out? That way it's easier to run without
> being root.

I've patched that locally. Using the same method as in uploader.py, you
can also do without the block of code which checks for existence of an item.

I still can't get it to resume uploads where they left off: it doesn't
upload the zip files which are already there. I have 2178 zips but I
think very few have been uploaded.

Federico

Emilio J. Rodríguez-Posada

May 24, 2018, 4:44:48 AM
to Federico Leva (Nemo), wikiteam...@googlegroups.com

Emilio J. Rodríguez-Posada

May 28, 2018, 10:28:16 AM
to Federico Leva (Nemo), wikiteam...@googlegroups.com

Federico Leva (Nemo)

May 28, 2018, 10:48:30 AM
to wikiteam...@googlegroups.com, Emilio J. Rodríguez-Posada
Emilio J. Rodríguez-Posada, 28/05/2018 17:21:
> Biggest is 4.5 GB.

Remarkable! And those are mostly lecture material, although with some
audio/video files here and there.

Federico

Federico Leva (Nemo)

Jul 16, 2018, 5:16:07 AM
to wikiteam...@googlegroups.com, Emilio J. Rodríguez-Posada
What's the status of the download? I see we're at 160k items in the
WikiTeam collection, mostly (?) uploaded by emijrp: well done!

My offer to run more instances of the script is still valid, as soon as
the script cleans up after itself and resumes partial downloads/uploads
without manual intervention. I don't have the energy to maintain my
local hacks while the version in master diverges.

Federico

Federico Leva (Nemo)

Jul 31, 2018, 1:46:41 AM
to wikiteam...@googlegroups.com, Emilio J. Rodríguez-Posada
Now at 227k, do you have enough resources for the final sprint?
https://archive.org/search.php?query=subject%3A%22wikispaces%22

Federico