"pymongo.errors.AutoReconnect: connection closed" after 127 bulk inserts


PlatonB

Aug 22, 2019, 9:08:54 PM
to mongodb-user
elementary OS 5.0
MongoDB server 4.2.0
PyMongo 3.9.0

I tried to convert a large tab-separated file to a MongoDB collection. Exactly after inserting the 127th 10,000-row fragment, an error occurs.

import pymongo, gzip

src_file_path = input()

client = pymongo.MongoClient(compressors='zstd')
vcf_db = client.vcf_db
vcf_coll = vcf_db.vcf_coll

with gzip.open(src_file_path, mode = 'rt') as src_file_opened:
        for line in src_file_opened:
                if line.startswith('##'):
                        continue
                header_row = line.split('\n')[0].split('\t')
                break

        fragment, fragment_len, added_fragment_num = [], 0, 0
        for line in src_file_opened:
                row = line.split('\n')[0].split('\t')
                fragment.append(dict(zip(header_row, row)))
                fragment_len += 1
                if fragment_len == 10000:
                        vcf_coll.insert_many(fragment)
                        added_fragment_num += 1
                        fragment.clear()
                        fragment_len = 0
        if fragment_len > 0:
                vcf_coll.insert_many(fragment)
                added_fragment_num += 1

>>> added_fragment_num
127

I don't see a 127-fragment limit on the corresponding documentation page.

Maybe it's a bug?

PlatonB

Oct 18, 2019, 5:24:00 AM
to mongodb-user
Answer please.

On Friday, August 23, 2019 at 4:08:54 UTC+3, PlatonB wrote:

Robert Cochran

Oct 18, 2019, 8:30:58 AM
to mongodb-user
Hi,

I suggest that you extract approximately 100 records from the gzipped file at the location where the error occurs and look at the input data. It might not be what you are expecting. Processing comma-separated or tab-separated values files can be quite tricky, so I'd check the input data.
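For example, a minimal sketch (the gzipped file name and the exact offset are assumptions on my part) that prints the raw lines around where the 127th fragment would start, so you can inspect them:

import gzip, itertools

# Assumed source file name; substitute the dataset you are actually loading.
src = 'ALL.chr6_GRCh38.genotypes.20170504.vcf.gz'
# Roughly where the 127th 10000-row fragment begins among the data lines.
start = 127 * 10000

with gzip.open(src, mode='rt') as src_file_opened:
    # repr() makes stray control characters or odd whitespace visible.
    for line in itertools.islice(src_file_opened, start, start + 100):
        print(repr(line))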

I hesitate to download the file you provide since it is apparently very large and I do not know its contents. You could provide a small set of just a few records plus the header record.

Can you reproduce this on Ubuntu 18.04 or some other free server-side operating system?

Please note I am not an employee of MongoDB, Inc. I'm just another list user trying to help you.

Thanks so much

Bob

PlatonB

Oct 19, 2019, 5:46:52 AM
to mongodb-user
I can easily do similar bulk inserts of the same data in other DBMSs.

Start of one of the source tables:
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
1    10019    rs775809821    TA    T    .    .    RS=775809821;RSPOS=10020;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
1    10039    rs978760828    A    C    .    .    RS=978760828;RSPOS=10039;dbSNPBuildID=150;SSR=0;SAO=0;VP=0x050000020005000002000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP
1    10043    rs1008829651    T    A    .    .    RS=1008829651;RSPOS=10043;dbSNPBuildID=150;SSR=0;SAO=0;VP=0x050000020005000002000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP
1    10051    rs1052373574    A    G    .    .    RS=1052373574;RSPOS=10051;dbSNPBuildID=150;SSR=0;SAO=0;VP=0x050000020005000002000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP

Yes, I'm using Ubuntu 18.04-based OS.

On Friday, October 18, 2019 at 15:30:58 UTC+3, Robert Cochran wrote:

Tim Hawkins

Oct 19, 2019, 7:47:09 AM
to mongodb-user
Have you checked for invalid UTF-8 characters around the place the error occurred?
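For instance, a quick sketch (the source file name is assumed) that reports any line whose raw bytes do not decode as UTF-8:

import gzip

# Assumed source file name; adjust as needed.
src = 'ALL.chr6_GRCh38.genotypes.20170504.vcf.gz'

with gzip.open(src, mode='rb') as src_file_opened:
    for line_num, raw_line in enumerate(src_file_opened, start=1):
        try:
            raw_line.decode('utf-8')
        except UnicodeDecodeError as exc:
            print(line_num, exc)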


PlatonB

Oct 19, 2019, 10:17:25 AM
to mongodb-user
All characters are correct.

On Saturday, October 19, 2019 at 14:47:09 UTC+3, Tim Hawkins wrote:

Robert Cochran

Oct 20, 2019, 6:28:19 PM
to mongodb-user
Hi,

Let me explain the steps I have taken to investigate this issue. Bear in mind I am not an employee of MongoDB, Inc. I'm just another list user trying to help.

1. I worked with your small sample dataset given above; however, running that through your Python code made me believe I needed to download the dataset that you quoted in your original post, so I did.
2. I also downloaded the README file for that group of datasets but didn't find it helpful for purposes of reformatting the input data. It is helpful to a biologist but not to an information technology person.
3. Examining the data in the input file plus your python code makes me believe that the header record starts at line number 131, which is the line with only a single hash mark ('#') at the beginning. I hope this is correct.
4. I used a sed command to extract the header record plus the two data lines that follow it. The '134q' command instructs sed to quit upon reaching line 134. Therefore, lines 131, 132, and 133 of the input file are extracted and added to the file test2.csv.

$ zcat ALL.chr6_GRCh38.genotypes.20170504.vcf.gz | sed -n '131p;132p;133p;134q' > test2.csv

 
5. I then added test2.csv as the only file in a gzipped tar file:

$ tar -cvzf test2.tar.gz test2.csv

6. I then modified your code in order to connect to a database named 'platon_db' and a collection named 'vcf1'. I also started to add documentary comments to the code.

$ cat modified_code.py
# This code is expecting Python 3. You need to install compatible
# modules with pip3.
#
import pymongo, gzip
from pymongo import MongoClient
src_file_path = input()

client = MongoClient(compressors='zstd')
vcf_db = client.platonb_db
vcf_coll = vcf_db['vcf1']

with gzip.open(src_file_path, mode = 'rt') as src_file_opened:
        for line in src_file_opened:
                if line.startswith('##'):
                        continue
                header_row = line.split('\n')[0].split('\t')
                break

        fragment, fragment_len, added_fragment_num = [], 0, 0
        for line in src_file_opened:
                row = line.split('\n')[0].split('\t')
                fragment.append(dict(zip(header_row, row)))
                fragment_len += 1
                if fragment_len == 10000:
                        vcf_coll.insert_many(fragment)
                        added_fragment_num += 1
                        fragment.clear()
                        fragment_len = 0
        if fragment_len > 0:
                vcf_coll.insert_many(fragment)
                added_fragment_num += 1


7. I then ran your code on an Ubuntu 18.04 LTS server (not a "Ubuntu-based OS", but an actual Ubuntu 18.04 OS) which is running MongoDB Enterprise Server version 4.0.13. I received the following error:

$ python3 modified_code.py
test2.tar.gz
Traceback (most recent call last):
  File "modified_code.py", line 30, in <module>
    vcf_coll.insert_many(fragment)
  File "/home/vagrant/.local/lib/python3.6/site-packages/pymongo/collection.py", line 758, in insert_many
    blk.execute(write_concern, session=session)
  File "/home/vagrant/.local/lib/python3.6/site-packages/pymongo/bulk.py", line 511, in execute
    return self.execute_command(generator, write_concern, session)
  File "/home/vagrant/.local/lib/python3.6/site-packages/pymongo/bulk.py", line 346, in execute_command
    self.is_retryable, retryable_bulk, s, self)
  File "/home/vagrant/.local/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1385, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/home/vagrant/.local/lib/python3.6/site-packages/pymongo/bulk.py", line 341, in retryable_bulk
    retryable, full_result)
  File "/home/vagrant/.local/lib/python3.6/site-packages/pymongo/bulk.py", line 295, in _execute_command
    result, to_send = bwc.execute(ops, client)
  File "/home/vagrant/.local/lib/python3.6/site-packages/pymongo/message.py", line 895, in execute
    request_id, msg, to_send = self._batch_command(docs)
  File "/home/vagrant/.local/lib/python3.6/site-packages/pymongo/message.py", line 889, in _batch_command
    self.codec, self)
  File "/home/vagrant/.local/lib/python3.6/site-packages/pymongo/message.py", line 1376, in _do_bulk_write_command
    namespace, operation, command, docs, check_keys, opts, ctx)
  File "/home/vagrant/.local/lib/python3.6/site-packages/pymongo/message.py", line 1301, in _do_batched_op_msg
    operation, command, docs, check_keys, ack, opts, ctx)
bson.errors.InvalidDocument: Key names must not contain the NULL byte
 

8. I believe this error happened because you are not properly formatting each document into key-value pairs. Each column header in the header line needs to be treated as a key, and you must pair it with the actual value string that you are reading in from the data. The PyMongo tutorial documentation explains how to do a correct bulk insert; look under the heading "Bulk inserts" to see how the documents need to be formatted. (A small pre-insert check is sketched after this list.)

9. Your pymongo code needs additional remediation in order to work correctly. This is where your true problem is. My advice to you is to try to correctly reformat just one or two records before you attempt to process a huge input dataset. It is much easier to find and correct problems if you start with small test datasets at first. What you are attempting to do is not a trivial project. It can easily take you 40-50 hours and quite possibly a lot more to get the code working exactly right. 

10. The utility 'mongoimport' can handle csv input files that have header records, and it might be easier for you to try mongoimport...again start with a tiny dataset of a header record plus 2 or 3 input records. You will still need to put in a significant amount of work to get 'mongoimport' to successfully add even 1 document. 
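As a rough illustration of the formatting check in point 8 (the helper name and call site below are hypothetical, not part of your original code), something like this could be run on each fragment just before insert_many to catch the NULL-byte condition from the traceback:

def check_fragment(fragment):
    # Flag any key or value that carries a NULL byte; BSON rejects NULL bytes in key names.
    for doc_num, doc in enumerate(fragment):
        for key, value in doc.items():
            if '\x00' in key or '\x00' in value:
                print('document %d has a bad pair: %r -> %r' % (doc_num, key, value))

# Hypothetical call, placed just before vcf_coll.insert_many(fragment):
# check_fragment(fragment)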

I am not a professional Python programmer. I started work on a very different project to reformat a very large input csv file (from a different source) into JSON-formatted documents that mongoimport can then import. I'm using Node.js 10.16.x and its built-in 'readline' API. I'm not that good with Node.js either; however, I picked it because it is a lot simpler to use than some other languages for this purpose, and by sticking with very small input datasets I can see my mistakes right away and fix them quickly. If you search for my recent posts, you will see I did a small project that uses Node and readline, and the code is freely available on GitHub; I included a link in my responses to that person.

Good luck to you! To repeat myself once more...again...I am not an employee of MongoDB. I'm just trying to help you. 

Thanks so much

Bob

Robert Cochran

Oct 20, 2019, 7:49:44 PM
to mongodb-user
Hi,

I forked this GitHub repository from another person. I didn't touch the master branch. There is a second branch, 'mongodbv4', which contains code whose goal is to read a large csv file using the Node.js readline API. It captures the first line as the header line and then reformats each subsequent line of input into key-value pairs in JSON format suitable for importing into MongoDB. Each element in the header array becomes a key, and it is associated with the correct element in the subsequent line(s) of input so that a key-value pair is formed and made part of a JSON document.

The reformatted JSON output is written to a file, and this file can later be used as input to mongoimport. 

The code in my repository as currently written does have a flaw: it does not check the transaction amount for a null value correctly. I'm working on correcting that flaw in my local repository. At some point in the future after testing, I'll upload the commits to the 'mongodbv4' branch. In your case you seem to be working with string data. There are many key-value pairs that you need to format. A for loop can handle that for you.
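A rough Python equivalent of that idea (the file names here are placeholders): read the header line, pair each following line with it, and write one JSON document per line for mongoimport:

import json

# Placeholder file names, for illustration only.
with open('input.tsv', mode='rt') as src, open('docs.json', mode='wt') as dst:
    header_row = src.readline().rstrip('\n').split('\t')
    for line in src:
        row = line.rstrip('\n').split('\t')
        # Each header element becomes a key paired with the matching field.
        dst.write(json.dumps(dict(zip(header_row, row))) + '\n')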

Thanks so much

Bob    

PlatonB

Oct 21, 2019, 10:13:59 AM
to mongodb-user
Thank you very much for testing and for the very detailed answer!

About the big source file: it was created by a very serious organization, the European Bioinformatics Institute. Such files are designed from the start for script processing. They are perfectly formatted and do not contain unnecessary or difficult-to-parse characters. These tables have been used by many programs for a long time.

>> It is much easier to find and correct problems if you start with small test datasets at first.

As I wrote in the first message, the problem reproduces only after 127 bulk inserts, so I cannot reproduce it on tiny tables. But just in case, I cut out the first 1000 lines with a script I wrote and successfully converted the resulting table into a database. The database works fine:

thgenomes_coll.find_one({'ID': 'rs544472365'})
{'_id': ObjectId('5dadaf6d96b0b87c22032a9b'), '#CHROM': '14', 'POS': '18224734', 'ID': 'rs544472365', 'REF': 'T', 'ALT': 'G', 'QUAL': '100', 'FILTER': 'PASS', 'INFO': 'AC=3;AF=0.000599042;AN=5008;NS=2504;DP=37227;EAS_AF=0;AMR_AF=0;AFR_AF=0.0023;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP;GRCH37_POS=19001211;GRCH37_REF=T;GRCH37_38_REF_STRING_MATCH', 'FORMAT': 'GT', 'HG00096': '0|0', 'HG00097': '0|0', 'HG00099': '0|0', 'HG00100': '0|0', 'HG00101': '0|0', <...>}

>> not a "Ubuntu-based OS", but an actual Ubuntu 18.04 OS
elementary OS is 100% compatible with the latest Ubuntu LTS. As a matter of fact, it's Ubuntu with an alternative DE.

On Monday, October 21, 2019 at 1:28:19 UTC+3, Robert Cochran wrote:

Wan Bachtiar

Nov 20, 2019, 7:34:33 PM
to mongodb-user

>> Exactly after inserting the 127th 10,000-row fragment, an error occurs. Maybe it's a bug?

Hi,

I managed to reproduce the issue that you’re seeing and submitted a PyMongo bug report PYTHON-2055. The MongoDB Python driver team has released a fix in PyMongo v3.10 to address the issue.

The problem was that PyMongo did not factor in the 16-byte message header when batching a compressed bulk write OP_MSG. The 127th 10,000-row batch from your input file happened to be 9 bytes short of the 48 MB OP_MSG compressed limit, which was just enough to trigger this bug. If you reduced the size of the rows, or if the documents were smaller, you would not be able to trigger it.
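As a back-of-the-envelope illustration of that arithmetic (the exact limit value below is an assumption used only for the sake of the example):

# Illustrative arithmetic only; the precise limit value is assumed.
limit = 48 * 1000 * 1000         # OP_MSG size limit the batching checks against
payload = limit - 9              # the 127th batch was 9 bytes short of the limit
header = 16                      # message header bytes the buggy code did not count
print(payload + header > limit)  # True: the assembled message exceeds the limit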

Regards,
Wan.
