400: Bad Request when uploading file

Steve Chang

Jul 22, 2020, 12:24:00 PM
to Dataverse Users Community
We have an instance of Dataverse that we have been using to host data for various projects.  Everything is on local file storage; the main file store is NFS-mounted to the server, and we have set the following JVM options to place everything in that directory:

-Ddataverse.files.directory=/cph-commons-dbs
-Ddataverse.files.file.directory=/cph-commons-dbs
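
(For reference, these are JVM options on the Glassfish domain; they can be added with asadmin roughly like this, followed by a domain restart. The asadmin path shown is just illustrative for a stock install:

/usr/local/glassfish4/bin/asadmin create-jvm-options "-Ddataverse.files.directory=/cph-commons-dbs"
/usr/local/glassfish4/bin/asadmin create-jvm-options "-Ddataverse.files.file.directory=/cph-commons-dbs"
)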

One of the users of the system has reported issues uploading a large (~8GB) data set, which is a tab-separated file.  They were uploading it as a split zip archive and were getting a 400 Bad Request error once the upload completed.

I have been working with this user and their data set, breaking it into smaller pieces to upload and troubleshoot, and have found the following:

These uploads are successful:
- a file with only lines 1-50,000
- a file with only lines 1-60,000
- a file with only lines 1-80,000

When attempting to add the file with lines 1-90,000, the upload finishes but returns a 400 Bad Request error.  However, if I create a file with lines 50,000-100,000, the upload and ingestion complete.  This leads me to believe that it is not the data itself that is the problem.  I do not see anything in the Glassfish server.log that seems related to the error.

I don't believe temporary data file storage is an issue, as I can watch "df" on the system while the upload is in progress, and I do not see any file system usage incrementing.  On the successful uploads, I do see the files appear in /cph-commons-dbs/temp prior to their ingestion, so I believe that the correct temp directory is being utilized.

The upload file size limit was set to 5GB, and the Apache ProxyPass timeout in ssl.conf was extended to 900s.  The 90,000-line file is right around 100MB, which takes about 5 minutes to upload on my home Internet connection.
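
For reference, the ProxyPass change amounts to adding a timeout to the proxy line in ssl.conf, roughly like this (the backend URL here is illustrative; ours forwards to Glassfish):

ProxyPass / ajp://localhost:8009/ timeout=900
ProxyPassReverse / ajp://localhost:8009/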

Any tips or advice would be appreciated!

Thanks,
Steve

James Myers

Jul 22, 2020, 12:56:55 PM
to dataverse...@googlegroups.com

Steve,

I’m pretty sure a split zip file (a single zip archive sent as multiple files) would cause an issue as Dataverse is trying to unzip each file independently. Could that be the issue? There could be problems with ingesting a large tabular file, but I don’t think that results in a 400 error – just the original file being saved without a .tab version.
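
If the split archive is the culprit, one workaround would be to recombine the parts into a single zip before uploading. With Info-ZIP on Linux that's roughly (file names here are just examples):

zip -s 0 dataset.zip --out dataset-single.zip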

 

-- Jim

Steve Chang

Jul 22, 2020, 9:29:34 PM
to Dataverse Users Community
Thanks for the reply Jim!

I was wondering about the split zip too, and I'm going to suggest to the user that we don't split the zips. 

However, I've tested uploading with a subset of the data in a single zip, and I am seeing the same issue once I get past a certain file size or line count.  I also started tinkering with the API to see if that would be any different; it is slightly different but has the same end result.  In the UI, the upload appears to complete and then fails immediately with a 400 once the file is large enough.  Via the API, the transfer is not allowed to complete and errors out immediately:

root@steve-VirtualBox:/media/sf_Downloads# head -10000 training_master_small.tsv > clip_10000.tsv
root@steve-VirtualBox:/media/sf_Downloads#
root@steve-VirtualBox:/media/sf_Downloads# python3 dv_put.py clip_10000.tsv
Connected to OSU DataVerse
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.5M  100   486  100 20.5M     12   542k  0:00:40  0:00:38  0:00:02 42708
root@steve-VirtualBox:/media/sf_Downloads#
root@steve-VirtualBox:/media/sf_Downloads# head -90000 training_master_small.tsv > clip_90000.tsv
root@steve-VirtualBox:/media/sf_Downloads#
root@steve-VirtualBox:/media/sf_Downloads# python3 dv_put.py clip_90000.tsv
Connected to OSU DataVerse
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  216M    0    85    0     0    454      0 --:--:-- --:--:-- --:--:--   454
Traceback (most recent call last):
File "dv_put.py", line 63, in <module>
resp = api.upload_file(dataset, file_path)
File "/usr/local/lib/python3.8/dist-packages/pyDataverse/api.py", line 1035, in upload_file
resp = json.loads(result.stdout)
File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)


Thanks,
Steve

James Myers

Jul 23, 2020, 8:14:36 AM
to dataverse...@googlegroups.com

Steve,

For an immediate error, I can't think of too many possibilities. One would be if your file has gone over the :MaxFileUploadSizeInBytes setting limit. If that's not it: I see from the trace below that pyDataverse is failing to parse the response as JSON, so seeing the raw message from Dataverse might help, e.g. by printing it just before that failure on line 1035 of api.py, or by trying the curl command directly (see http://guides.dataverse.org/en/latest/api/native-api.html#id62).

Steve Chang

Jul 23, 2020, 9:56:30 AM
to Dataverse Users Community
:MaxFileUploadSizeInBytes is set to 5GB.  I had it set to 2GB prior to the user reporting this issue, and increased it since their zipped data set was ~4.5GB.  The smaller samples that are failing are only around 100MB, which should be well below the threshold.
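
For anyone following along, the setting can be checked (and adjusted) via the admin API on the server itself, e.g.:

curl http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes
# and, to change it (value in bytes):
curl -X PUT -d 5368709120 http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes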

It's funny - I started working on the Python script because I was having trouble getting the curl command working, and it turns out the curl was failing because of this same issue.  There's not much in the response that I find helpful.  Here's a good and a bad curl example:

root@steve-VirtualBox:/media/sf_Downloads# export FILENAME='stomp_rocket.pdf'
root@steve-VirtualBox:/media/sf_Downloads# curl -H X-Dataverse-key:$API_TOKEN -X POST -F "file=@$FILENAME" -F 'jsonData={"description":"test123"}' "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"
{"status":"OK","data":{"files":[{"description":"test123","label":"stomp_rocket.pdf","restricted":false,"version":1,"datasetVersionId":128,"dataFile":{"id":328,"persistentId":"","pidURL":"","filename":"stomp_rocket.pdf","contentType":"application/pdf","filesize":265369,"description":"test123","storageIdentifier":"file://1737be7caa6-9717a7909f23","rootDataFileId":-1,"md5":"fdcee63745d4a0cf31960d178418b2bd","checksum":{"type":"MD5","value":"fdcee63745d4a0cf31960d178418b2bd"},"creationDate":"2020-07-23"}}]}}root@steve-VirtualBox:/media/sf_Downloads#
root@steve-VirtualBox:/media/sf_Downloads#
root@steve-VirtualBox:/media/sf_Downloads# export FILENAME='clip_90000.zip'
root@steve-VirtualBox:/media/sf_Downloads# curl -H X-Dataverse-key:$API_TOKEN -X POST -F "file=@$FILENAME" -F 'jsonData={"description":"test123"}' "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"
<html><head><title>Error</title></head><body>
<h2>ERROR: </h2>
<br>


James Myers

Jul 23, 2020, 10:16:18 AM
to dataverse...@googlegroups.com

Yep – not very informative :-)  Could it be something in front of Dataverse with a size limit? Apache servers have a LimitRequestBody directive, and nginx/load balancers could have limits as well. Doing the curl with the -vv option would show you the 'server:' header – if that's not your Apache server, it would tell you the request is being stopped by something else (e.g. I've seen awslb as the 'server:' for errors where an AWS load balancer was stopping the request). If it is Apache/httpd, there might be a clue in the Apache log rather than the Dataverse one.
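
For the Apache case, the kind of thing to look for is a LimitRequestBody directive somewhere in the httpd config (value is in bytes; 0 means unlimited), e.g. a ~100MB cap would look like:

LimitRequestBody 104857600

A quick "grep -ri limitrequestbody /etc/httpd/" would turn it up if it's set anywhere.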

Steve Chang

Jul 23, 2020, 4:23:47 PM
to Dataverse Users Community
Hi Jim - It sounds like Kevin has been keeping you in the loop on this.  I was able to upload the file by hitting Glassfish directly on 8080 from a jumpstation.  I then tried again going through Apache while watching Apache's access_log and ssl_access_log, but didn't see anything, and the error_log had not been written to since yesterday.
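
A direct test like that is essentially the same curl as before, just aimed at port 8080 on the app server instead of the public hostname (the host shown here is illustrative):

curl -H X-Dataverse-key:$API_TOKEN -X POST -F "file=@clip_90000.zip" -F 'jsonData={"description":"test123"}' "http://dataverse-app:8080/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"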

At this point, I'm checking with our IT administrators to see if there's something that might be stopping the transfer before it gets to the system.  I know there's a load balancer, but I'm not sure what else.  I did re-run the curl with both -vv and --trace, but I didn't see an indication of where the error message was coming from.  Here's the message part of that exchange:
> POST /api/datasets/:persistentId/add?persistentId=doi:10.5072/FK2/RO82WG HTTP/1.1
> Host: covid.commons.osu.edu
> User-Agent: curl/7.68.0
> Accept: */*
> Content-Length: 110859643
> Content-Type: multipart/form-data; boundary=------------------------eda22102fb55625c
> Expect: 100-continue
>
* Mark bundle as not supporting multiuse
* HTTP 1.0, assume close after body
< HTTP/1.0 400
< Content-Type: text/html

<
<html><head><title>Error</title></head><body>
<h2>ERROR: </h2>
<br>
* TLSv1.2 (IN), TLS alert, close notify (256):
* Closing connection 1
* TLSv1.2 (OUT), TLS alert, close notify (256):


I'll update again when I hear back from the admins. 

Thanks,
Steve