
Wiktionary Dump Download


Pok Gramby

Dec 31, 2023, 7:51:27 PM
I've just finished my first small project in Rust: a very limited parser for the English Wiktionary. It's true that nobody needs another Wiktionary parser, and it may even be true that I didn't need a custom one myself, but it turned out to be a nice task for getting acquainted with Rust, even covering unsafe and macro_rules a little. If you want to take a look, it's on GitHub: -dubovik/wiktionary-parsley









2. Good ecosystem. I use quick-xml, regex, and serde_json for the primary operations. All these packages were easy to find, easy to use (a nod to Cargo), and have good documentation. And while I don't know how their speed compares with packages from other languages, they are fast enough for my use case: all in all, it takes about 45 s on my machine to parse an extracted Wiktionary dump.


I'm trying to parse the Wiktionary data dump provided by Wikimedia. My intention is to parse that XML dump into a MySQL database. I couldn't find proper documentation on the structure of this XML, and I'm not able to open the file in an editor because it is really huge (1 GB). I thought of parsing it with a PHP script, but without any idea of the XML structure I can't proceed. So if anyone has already parsed this dump into MySQL using PHP, or knows of a tool that does, please share the details. If nothing exists in PHP, other methods are also fine. I followed this post ( -a-local-copy-of-wiktionary-mysql/) but it didn't work out. If anybody has succeeded in this process, please help.
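For what it's worth, the dump follows the standard MediaWiki XML export schema (`<page>` elements containing a `<title>` and a `<revision>/<text>`), so you can stream it with the standard library instead of opening the whole 1 GB file. A minimal Python sketch under that assumption (the namespace URI varies by dump version, and the file path is a placeholder):

```python
# Stream a MediaWiki pages-articles dump without loading it into memory.
import bz2
import xml.etree.ElementTree as ET

# Namespace of the export schema; check the <mediawiki> root tag of your
# dump, as the version number changes between dumps.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(path):
    """Yield (title, wikitext) pairs one page at a time."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(f"{NS}revision/{NS}text") or ""
                yield title, text
                elem.clear()  # release the processed page's subtree
```

Each `(title, wikitext)` pair can then be inserted into MySQL (or anything else) as it is yielded, so memory use stays flat regardless of dump size.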


If you don't have a good reason to import the dump into MySQL, it's better to avoid it: importing is extremely slow with such a large amount of data (unless you're a good DBA with a couple of fast machines at your disposal).
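If you do need a database import, one thing that helps is batching inserts instead of issuing one INSERT and commit per page. A sketch of the pattern, using the stdlib sqlite3 module as a stand-in for a MySQL driver (the table name and schema are made up for illustration; the same executemany-plus-periodic-commit idea applies to MySQL connectors):

```python
# Batched inserts: accumulate rows and flush them in chunks, committing
# once per chunk rather than once per row.
import sqlite3

def load_pages(db_path, pages, batch_size=1000):
    """Insert an iterable of (title, wikitext) pairs in batches."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS page (title TEXT, wikitext TEXT)")
    batch = []
    for title, text in pages:
        batch.append((title, text))
        if len(batch) >= batch_size:
            con.executemany("INSERT INTO page VALUES (?, ?)", batch)
            con.commit()
            batch.clear()
    if batch:  # flush the final partial batch
        con.executemany("INSERT INTO page VALUES (?, ?)", batch)
    con.commit()
    con.close()
```

With per-row commits a large import spends most of its time in transaction overhead; chunking typically cuts that by orders of magnitude.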






This data is extracted from Wiktionary and is updated regularly. The full original Wiktionary data can be downloaded from Wikimedia dumps. This data is made available under the same licenses as Wiktionary: both CC-BY-SA and GFDL. See the Wiktionary copyright page for more information.


This page is a part of the kaikki.org machine-readable dictionary. This dictionary is based on structured data extracted on 2023-12-23 from the enwiktionary dump dated 2023-12-20 using wiktextract (f8a5b86 and f21d6ca). The data shown on this site has been post-processed and various details (e.g., extra categories) removed, some information disambiguated, and additional data merged from other sources. See the raw data download page for the unprocessed wiktextract data. If you use this data in academic research, please cite Tatu Ylonen: Wiktextract: Wiktionary as Machine-Readable Structured Data, Proceedings of the 13th Conference on Language Resources and Evaluation (LREC), pp. 1317-1325, Marseille, 20-25 June 2022. Linking to the relevant page(s) under would also be greatly appreciated.


This is a Python package and tool for extracting information from Wiktionary data dumps. It reads the enwiktionary--pages-articles.xml.bz2 file (or corresponding files from other wiktionaries) and returns Python dictionaries containing most of the information in Wiktionary.


The tool can be used to extract machine translation dictionaries, language understanding dictionaries, semantically annotated dictionaries, and morphological dictionaries with declension/conjugation information (where this information is available for the target language). Dozens of languages have extensive vocabulary in enwiktionary, and several thousand languages have partial coverage.


The wiktwords script is the easiest way to extract data from Wiktionary. Just download the dump file from dumps.wikimedia.org and run the script. The correct dump file has the name enwiktionary--pages-articles.xml.bz2.


The parse_wiktionary call will call word_cb(data) for each word and redirect found in the Wiktionary dump. data is information about a single word and part-of-speech as a dictionary (multiple senses of the same part-of-speech are combined into the same dictionary). It may also be a redirect (indicated by the presence of a "redirect" key in the dictionary). It is in the same format as the JSON-formatted dictionaries returned by the wiktwords tool. The format is described below.
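Since the wiktwords output uses the same format, the "redirect" key described above is also how you tell redirects apart when consuming its JSON-lines output. A small sketch (the field names other than "redirect" are illustrative, not guaranteed by the format):

```python
# Split wiktextract-style JSON-lines records into word entries and
# redirects, using the presence of a "redirect" key as the marker.
import json

def split_entries(lines):
    """Return (words, redirects) from an iterable of JSON lines."""
    words, redirects = [], []
    for line in lines:
        data = json.loads(line)
        (redirects if "redirect" in data else words).append(data)
    return words, redirects
```

In practice you would pass it an open file, e.g. `split_entries(open("out.jsonl"))`, processing one line (one word/part-of-speech record) at a time.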


Before JWKTL is ready to use, you need to parse the obtained Wiktionary dump file. The rationale behind this is to be able to access the Wiktionary data efficiently within a production application environment by separating out all preparatory work into a parsing step. In this step, the wiki syntax is parsed by JWKTL and stored in a Berkeley DB. The parsing methods are based on text-mining techniques, which naturally require some computation time. This is, however, a one-time effort. The resulting database can then be accessed repeatedly and quickly, as discussed in the next section.


You can take the -pages-articles.xml.bz2 from the Wikimedia dumps site and process it with WikiTaxi (download in the upper left corner). The WikiTaxi import tool will create a .taxi file (around 15 GB for Wikipedia) out of the .bz2 file. That file is then used by the WikiTaxi program to search through articles. The experience is very similar to browsing.


Or you can use Kiwix, which is faster to set up because it also provides the already-processed dumps (.zim files). As the comments point out, mwoffliner can be used to capture other MediaWiki sites for Kiwix; it may not work with all of them, since they may have custom differences, but it is the only option I came across.


Another possibility is to install a local LAMP/WAMP server stack (usually Apache, MySQL, and PHP), then install MediaWiki (the software that runs Wikipedia, etc.), get a MySQL dump of Wiktionary, and set up a local instance. Complex and messy! It uses too many resources and takes days to configure!



