Note that the multistream dump file consists of multiple bz2 streams (each with its own header, body, and footer) concatenated into one file, in contrast to the vanilla dump file, which is a single stream. Each separate stream (effectively a standalone bz2 file) in the multistream dump contains 100 pages, except possibly the last one.
In the dumps.wikimedia.org directory you will find the latest SQL and XML dumps for the projects, not just English. The sub-directories are named for the language code and the appropriate project. Some other directories (e.g. simple, nostalgia) exist, with the same structure. These dumps are also available from the Internet Archive.
Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair-use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from dumps.wikimedia.org. In conclusion, download these images at your own risk (see Legal).
The dump files are heavily compressed and will take up large amounts of drive space once decompressed. A large list of decompression programs is described in Comparison of file archivers. The following programs in particular can be used to decompress bzip2, .bz2, .zip, and .7z files.
If you seem to be hitting the 2 GB limit, try using wget version 1.10 or later, cURL version 7.11.1-1 or later, or a recent version of lynx (using -dump). You can also resume interrupted downloads (for example, with wget -c).
Also, if you want to get all the data, you'll probably want to transfer it as efficiently as possible. The wikipedia.org servers need to do quite a bit of work to convert the wikicode into HTML. That's time-consuming both for you and for the wikipedia.org servers, so simply spidering all pages is not the way to go.
You can do Hadoop MapReduce queries on the current database dump, but you will need an extension to the InputRecordFormat to have each <page> be a single mapper input. A working set of Java methods (jobControl, mapper, reducer, and XmlInputRecordFormat) is available at Hadoop on the Wikipedia.
As part of Wikimedia Enterprise, a partial mirror of HTML dumps is made public. Dumps are produced for a specific set of namespaces and wikis, and then made available for public download. Each dump output file is a tar.gz archive which, when uncompressed and untarred, contains one file, with a single line per article, in JSON format. This is currently an experimental service.
MediaWiki 1.5 includes routines to dump a wiki to HTML, rendering the HTML with the same parser used on a live wiki. As the following page states, putting one of these dumps on the web unmodified will constitute a trademark violation. They are intended for private viewing in an intranet or desktop installation.
The wikiviewer plugin for Rockbox permits viewing converted Wikipedia dumps on many Rockbox devices. It requires a custom build and conversion of the wiki dumps using the instructions available at . The conversion recompresses the file and splits it into 1 GB files and an index file, which all need to be in the same folder on the device or microSD card.
Instead of converting a database dump file to many pieces of static HTML, one can also use a dynamic HTML generator. Browsing a wiki page is just like browsing a Wiki site, but the content is fetched and converted from a local dump file on request from the browser.
I want to count entities/categories in a wiki dump of a particular language, say English. The official documentation is very tough to find/follow for a beginner. What I have understood so far is that I can download an XML dump (but what do I download out of all the available files?) and parse it to count entities (the article topics) and categories.
As a reminder, Kiwix is an offline reader: once you download your zim file (Wikipedia, StackOverflow or whatever) you can browse it without any further need for internet connectivity. There's much talk that one could fit Wikipedia into 21 GB, but that would be a text-only, compressed and unformatted (i.e. not human-readable) dump. Kiwix, on the other hand, is ready for consumption, and use cases range from preppers to rural schools to Antarctic bases and anything in between.
This is an unofficial listing of Wikimedia data dump torrents, dumps of Wikimedia site content distributed using BitTorrent, the most popular peer-to-peer file sharing protocol. This includes both dumps already being distributed at dumps.wikimedia.org and dumps created and distributed solely by others.
BitTorrent is not officially used to distribute Wikimedia dumps; this article lists user-created torrents. Please protect your computer and verify the checksum at dumps.wikimedia.org for any file downloaded from these unofficial mirrors (e.g. for the last enwiki dumps).
There are several different kinds of data dumps available. Note that while JSON and RDF dumps are considered stable interfaces, XML dumps are not. Changes to the data formats used by stable interfaces are subject to the Stable Interface Policy.
JSON dumps containing all Wikidata entities in a single JSON array can be found under the entities directory. The entities in the array are not necessarily in any particular order; e.g., Q2 doesn't necessarily follow Q1. The dumps are created on a weekly basis.
Secondly, so-called truthy dumps are provided. They use the nt (N-Triples) format and contain the same data as the full dumps, but limited to direct, truthy statements. Therefore, they do not contain metadata such as qualifiers and references.
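For illustration, the truthy statement "Douglas Adams (Q42) is an instance of (P31) human (Q5)" appears in an N-Triples dump as a single triple along these lines:

```
<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> .
```

Note the prop/direct/ predicate: that is what marks it as a direct, truthy statement rather than a full statement node with qualifiers and references.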
The -all dump files contain all entity information in Wikidata with the exception of order (of aliases, of statements, etc.), which is not naturally represented in RDF. The -truthy dump files encode the *best* statements (i.e. the ones with the highest rank for each given (subject, property) pair) as single RDF triples (qualifiers and references are omitted).
Warning: The format of the JSON data embedded in the XML dumps is subject to change without notice, and may be inconsistent between revisions. It should be treated as opaque binary data. It is strongly recommended to use the JSON or RDF dumps instead, which use canonical representations of the data!
Incremental dumps (or Add/Change dumps) for Wikidata are also available for download. These dumps contain the entities added or changed in the last 24 hours, reducing the need to download the full database dump. They are considerably smaller than the full database dumps.
Hi, I very often load a current Wikipedia dump (without pictures) (like -20130602-pages-articles.xml.bz2) and convert it into Wikitaxi (an importer is included).
Is there any importer for these files into the ZIM format? That would be very helpful for getting a fresh Wikipedia version into Kiwix.
Wikitaxi is simple, fast and OK, but Kiwix has many more nice features; only the importer is missing.
So Kiwix would have a future if importing new Wikipedia dumps were easy. (No question, such a conversion procedure takes time, but that is a good compromise.)
Here I will briefly describe how the data is structured inside the dump - which is a humongous XML file - and try to make sense of that structure while not going full cliché and only displaying the XML schema. In the end I'll give an example of how I visualize the document as a relational model, and in following posts I'll describe how I wrote a processor for this XML to fill the given relational model - parallelizing bz2 streams, hopefully. Let's go and understand this data first.
To store the clean data we need, we first have to somehow ingest it from the source, and the source is the Wikimedia database backup dump. Those backup dumps come in the form of big XML files compressed into big bz2 multistream archives. The provided XML schema is a good way of understanding the whole specification, but sometimes looking at the data itself is very important to understand what we have and what we need from this immense amount of information. To make things easier, I'll refer to these XML documents in an object-oriented manner, and I swear it'll be much easier than referring to all the different types of "complex elements" that might be inside this XML document.
We can see that the first field is the sitename, which is the name of the site this dump came from - I'll not dive deeper here into how the MediaWiki software is used to build what is now the Wikipedia website, so I'll let you discover all that if you want to. Then comes the dbname, and I like to think of this dbname as a discriminator to be used across the Wikipedia dumps. For example, we might have "enwiki" for the English variant, "dewiki" for the German variant, and so on. The base attribute indicates the base page to facilitate access, and the case attribute is a formatting detail that should not be useful for us right now.
The namespaces element is a child object of siteinfo which allows us to search for information in the right place. Looking at the explanation for each namespace, we can see that if we only need to analyse articles, we only need to retrieve from this dump the pages associated with namespace 0; but if we want to analyse what users are discussing about changes to a certain page, we should look for pages associated with namespace 1. The namespaces object is formatted as:
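An abridged example, with keys and names as they appear in English Wikipedia dumps (the real list is longer):

```xml
<namespaces>
  <namespace key="-2" case="first-letter">Media</namespace>
  <namespace key="-1" case="first-letter">Special</namespace>
  <namespace key="0" case="first-letter" />
  <namespace key="1" case="first-letter">Talk</namespace>
  <namespace key="2" case="first-letter">User</namespace>
  <namespace key="14" case="first-letter">Category</namespace>
</namespaces>
```

Note that namespace 0 - the article namespace - has no name, which is why article titles carry no prefix while, say, category pages are prefixed with "Category:".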
Now things get a bit more exciting. The page object is where all the information of a single page is stored inside this big XML file, and since Wikipedia has millions of pages, we have millions of such objects inside the XML dump. Each page object is given as:
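An illustrative, abridged page element - the title, IDs, timestamp, and contributor below are placeholder values, and some optional children are omitted:

```xml
<page>
  <title>Example title</title>
  <ns>0</ns>
  <id>12</id>
  <revision>
    <id>1002250816</id>
    <parentid>1001115588</parentid>
    <timestamp>2021-01-23T15:15:01Z</timestamp>
    <contributor>
      <username>ExampleUser</username>
      <id>12345</id>
    </contributor>
    <comment>edit summary</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">Article wikitext goes here.</text>
  </revision>
</page>
```

In a pages-articles dump each page carries only its latest revision; the full-history dumps instead repeat the revision element once per revision.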