Problem with enwiki 20240701 dump

AardF...@web.de

Jul 7, 2024, 10:53:18 AM
to aarddict


Running the command

mw2slob dump -b $binsize -c lzma2 -o ~/Downloads/$lang"wiki"-$DAT.slob --siteinfo $lang"wiki".si.json ~/data/tmp/$lang"wiki"-NS0-$DAT-ENTERPRISE-HTML.json.tar.gz -f wiki common

fails after 4.3 GB of data with:
S Bill Sorvino (34988)
ERROR:mw2slob.core:
Traceback (most recent call last):
  File "/home/markus/env-slob/lib/python3.9/site-packages/mw2slob/core.py", line 140, in run
    for title, aliases, text, error in resulti:
  File "/usr/lib/python3.9/multiprocessing/pool.py", line 448, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/usr/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
zlib.error: Error -3 while decompressing data: invalid block type

Finished adding content in 1 day, 8:07:46
Finalizing...
Sorting... sorted in 0:04:51
Resolving aliases...
Sorting... sorted in 0:04:58
Resolved aliases in 0:04:58
Finalized in 0:10:34
Traceback (most recent call last):
  File "/home/markus/env-slob/bin/mw2slob", line 8, in <module>
    sys.exit(main())
  File "/home/markus/env-slob/lib/python3.9/site-packages/mw2slob/cli.py", line 394, in main
    args.func(args)
  File "/home/markus/env-slob/lib/python3.9/site-packages/mw2slob/cli.py", line 109, in cli_dump
    run(outname, info, itertools.chain(*scrape_articles, dump_articles), args)
  File "/home/markus/env-slob/lib/python3.9/site-packages/mw2slob/cli.py", line 67, in run
    core.create_slob(
  File "/home/markus/env-slob/lib/python3.9/site-packages/mw2slob/core.py", line 197, in create_slob
    run(slb, articles, filters, info.interwikimap, info.namespaces, html_encoding)
  File "/home/markus/env-slob/lib/python3.9/site-packages/mw2slob/core.py", line 140, in run
    for title, aliases, text, error in resulti:
  File "/usr/lib/python3.9/multiprocessing/pool.py", line 448, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/usr/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
zlib.error: Error -3 while decompressing data: invalid block type


Is there anything wrong with the data file?
I checked the hashes and they are identical.

The same command runs fine with other dump files.
Any idea?

Igor Tkach

Jul 7, 2024, 12:05:09 PM
to aard...@googlegroups.com
On Sun, Jul 7, 2024 at 10:53 AM 'AardF...@web.de' via aarddict <aard...@googlegroups.com> wrote:

is there anything wrong with the datafile?

It does seem that way, yes.
 
I checked hashes and they are identical.

If so, then I guess they published a broken archive. You can test the archive like so:

tar -tzf enwiki-NS0-20240701-ENTERPRISE-HTML.json.tar.gz

or, better, check whether it can actually be extracted:

tar -xvzf enwiki-NS0-20240701-ENTERPRISE-HTML.json.tar.gz -O > /dev/null  
 
(this immediately discards extracted bytes so you don't need extra disk space, but it will take a while)
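
The same idea can be sketched in Python for anyone who wants to script the check. This is only a rough sketch, not something mw2slob provides: `check_gzip_stream` is a made-up helper name, and it tests only the gzip layer (roughly what `gzip -t` does), not the tar structure inside. It reads the archive in chunks and throws the decompressed bytes away, so it needs no extra disk space either.

```python
import gzip
import zlib


def check_gzip_stream(path, chunk_size=1024 * 1024):
    """Decompress the whole file and discard the output.

    Returns (True, decompressed_byte_count) if the stream is intact,
    otherwise (False, description_of_the_error).
    Hypothetical helper for illustration, not part of mw2slob.
    """
    total = 0
    try:
        with gzip.open(path, "rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (OSError, EOFError, zlib.error) as exc:
        # OSError covers gzip.BadGzipFile (bad header or CRC mismatch),
        # EOFError a truncated stream, zlib.error corrupt deflate data.
        return False, f"{type(exc).__name__} after {total} bytes: {exc}"
    return True, total


if __name__ == "__main__":
    import sys

    ok, detail = check_gzip_stream(sys.argv[1])
    print("OK" if ok else "BROKEN", detail)
```

Run it with the archive path as the only argument. On a 120 GB file it will take about as long as the tar command above.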




AardF...@web.de

Jul 9, 2024, 4:15:43 AM
to aard...@googlegroups.com
Yep. That's what I meant with the hashes. I am recalculating the MD5 sum and comparing it to the published one in case there was a download glitch with the 120 GB file.
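
For reference, a minimal Python sketch of that MD5 comparison. It reads the file in chunks so the 120 GB archive never has to fit in memory. The helper names `file_md5` and `matches_published` are made up for illustration, and the published checksum is assumed to be available as a hex string.

```python
import hashlib


def file_md5(path, chunk_size=8 * 1024 * 1024):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare the local digest with the published one, ignoring case and whitespace."""
    return file_md5(path) == published_hex.strip().lower()
```

A matching sum only shows that the download is identical to what was published. It says nothing about whether the published archive itself is intact.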

I tried to unpack the file and tar throws errors:

gzip: stdin: invalid compressed data--format violated
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

I guess there will be no enwiki update this month.

Thank you for pointing me in the right direction to find the cause.


Markus Braun



From: Igor Tkach <itk...@gmail.com>
Sent: Sunday, July 7, 2024 18:04
To: aard...@googlegroups.com
Subject: Re: Problem with enwiki 20240701 dump