Uploaded file is empty


max.c...@gmail.com

Aug 10, 2020, 9:02:24 AM
to OpenRefine
I am trying to create a project from an upload via the API. I am uploading a very simple CSV file, and although the creation succeeds and I get a project ID, the project is empty. Has this happened to anyone before?

Tom Morris

Aug 10, 2020, 12:47:55 PM
to openr...@googlegroups.com
On Mon, Aug 10, 2020 at 9:02 AM <max.c...@gmail.com> wrote:
I am trying to create a project from an upload via the API. I am uploading a very simple CSV file, and although the creation succeeds and I get a project ID, the project is empty. Has this happened to anyone before?

OpenRefine's API is for use by its web client and isn't documented/maintained for external use. If you're using someone else's client library, you should report the problem to them (and they'd probably appreciate more detail about the code that you're using, versions of OpenRefine and client library, test setup, error messages, etc, etc.)

Tom

Owen Stephens

Aug 19, 2020, 7:24:04 AM
to OpenRefine
Hi Max,

Did you solve this? How did you upload the file?
If you are looking for a client library to interact with OpenRefine, I think https://github.com/opencultureconsulting/openrefine-client is probably your best bet (a Python client which is being actively maintained).
As Tom says, the API is intended for communication between the OpenRefine web client and backend. Still, if you share more of what you are trying to achieve and what you've tried so far, it might be possible to identify why you are ending up with an empty project.

Owen

Anna Gossen

Jan 24, 2021, 3:18:43 PM
to OpenRefine
Hi,

I'm having the same problem using https://github.com/opencultureconsulting/openrefine-client . Creating a project from a CSV file ends up with an empty project.
Has anybody solved this?

Anna

Felix Lohmeier

Jan 25, 2021, 4:43:55 AM
to OpenRefine
Hi Anna,

On Sunday, 24 January 2021 at 21:18:43 UTC+1 Anna Gossen wrote:
I'm having the same problem using https://github.com/opencultureconsulting/openrefine-client . Creating a project from a CSV file ends up with an empty project.
Has anybody solved this?

If you share the file, I'll be happy to take a look. To be able to reproduce the error, please also tell me which environment you are using (Win/Mac/Linux, OpenRefine version, one-file executable via command line or Python environment, ...).
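
The local part of that information can be gathered with a few lines of Python. This is only a sketch: the helper name `describe_environment` is made up for illustration, and the OpenRefine and client library versions still have to be added by hand.

```python
import platform
import sys


def describe_environment():
    """Collect basic facts about the local machine and Python interpreter."""
    return {
        "os": platform.system(),          # e.g. "Windows", "Linux", "Darwin"
        "os_release": platform.release(),
        "python": platform.python_version(),
        "executable": sys.executable,     # which interpreter is actually running
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("{}: {}".format(key, value))
```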

Best wishes,
Felix

Anna Gossen

Jan 25, 2021, 5:21:09 AM
to OpenRefine
Hi Felix,

It's Windows, OpenRefine 3.4. 
The way I am creating the project is via Python:

        project_format = 'text/line-based/*sv'
        project_options = {}
        PATH_TO_TEST_DATA = os.path.join(os.path.dirname(__file__), 'data')
        project_file = os.path.join(PATH_TO_TEST_DATA, 'duplicates.csv')

        project = orefine.new_project(
            project_file=project_file,
            project_file_name='duplicates.csv',
            project_format=project_format,
            project_name='test',
            **project_options
        )

The token is included. The response looks perfect, includes the new ID of the project. But when I check it on the Refine server it has no name and is empty. No error messages.

Best regards

Anna

Felix Lohmeier

Jan 25, 2021, 8:39:30 AM
to OpenRefine
Hi Anna,

I have not managed to reproduce the error. Can you share the CSV file and the whole Python script?

This is how it worked for me (Win10, Powershell 5.1, Python 2.7.18, OpenRefine 3.4.1):

1. Download test file

wget https://git.io/fj5hF -OutFile duplicates.csv

2. Ensure OpenRefine is running at http://localhost:3333

3. Install openrefine-client 0.3.10

C:\Python27\python.exe -m pip install openrefine-client==0.3.10

4. Start a Python 2.7 environment

C:\Python27\python.exe

4a. via cli function

from google.refine import cli
p1 = cli.create('duplicates.csv')


4b. via "upstream way"

from google.refine import refine
server1 = refine.Refine('http://localhost:3333')
project1 = server1.new_project(project_file='duplicates.csv')


4c. trying parts of your code snippet

# guessing your preps for "orefine"
from google.refine import refine
server = refine.RefineServer()
orefine = refine.Refine(server)


project_format = 'text/line-based/*sv'
project_options = {}

project = orefine.new_project(
    project_file='duplicates.csv',  # changed this line to filename
    project_file_name='duplicates.csv',
    project_format=project_format,
    project_name='test',
    **project_options
)


Best wishes,
Felix

Anna Gossen

Jan 25, 2021, 7:20:49 PM
to OpenRefine
Hi Felix,

the code is exactly as you guessed:


        server = ref.RefineServer()
        orefine = ref.Refine(server)

        project_format = 'text/line-based/*sv'
        project_options = {}
        PATH_TO_TEST_DATA = os.path.join(os.path.dirname(__file__), 'data')
        project_file = os.path.join(PATH_TO_TEST_DATA, 'duplicates.csv')

        project = orefine.new_project(
            project_file=project_file,
            project_file_name='duplicates.csv',
            project_format=project_format,
            project_name='test',
            **project_options
        )

The new_project method:

    def new_project(self, project_file=None, project_url=None, project_name=None, project_format='text/line-based/*sv',
                    project_file_name=None,               
                    encoding='',
                    separator=',',
                    ignore_lines=-1,
                    header_lines=1,
                    skip_data_lines=0,
                    limit=-1,
                    store_blank_rows=True,
                    guess_cell_value_types=False,
                    process_quotes=True,
                    store_blank_cells_as_nulls=True,
                    include_file_sources=False,
                    **opts):

        if (project_file and project_url) or (not project_file and not project_url):
            raise ValueError('One (only) of project_file and project_url must be set')

        def s(opt):
            if isinstance(opt, bool):
                return 'true' if opt else 'false'
            if opt is None:
                return ''
            return str(opt)

        new_style_options = dict(opts, **{
            'encoding': s(encoding),
            'separator': s(separator)
        })
        params = {
            'options': json.dumps(new_style_options),
        }

        # old style options
        options = {
            'format': project_format,
            'ignore-lines': s(ignore_lines),
            'header-lines': s(header_lines),
            'skip-data-lines': s(skip_data_lines),
            'limit': s(limit),
            'guess-value-type': s(guess_cell_value_types),
            'process-quotes': s(process_quotes),
            'store-blank-rows': s(store_blank_rows),
            'store-blank-cells-as-nulls': s(store_blank_cells_as_nulls),
            'include-file-sources': s(include_file_sources)
        }

        if project_url is not None:
            options['url'] = project_url
        elif project_file is not None:
            options['project-file'] = {
                'fd': open(project_file),
                'filename': project_file,
            }
        if project_name is None:
            project_name = (project_file or 'New project').rsplit('.', 1)[0]
            project_name = os.path.basename(project_name)
        options['project-name'] = project_name
        response = self.server.urlopen(
            'create-project-from-upload', options, params
        )

        url_params = urllib.parse.parse_qs(urllib.parse.urlparse(response.url).query)
        if 'project' in url_params:
            project_id = url_params['project'][0]
            return RefineProject(self.server, project_id)
        else:
            raise Exception('Project not created')



The response is a success, the project id is generated. No error messages. 
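
The `new_project` method above only checks that the final response URL carries a `project` query parameter. A minimal sketch of that check in isolation (Python 3; the function name `extract_project_id` and the example URLs are hypothetical) suggests why an ID alone says little about whether any rows were imported:

```python
from urllib.parse import parse_qs, urlparse


def extract_project_id(response_url):
    """Return the value of the 'project' query parameter, or None if absent."""
    query = parse_qs(urlparse(response_url).query)
    values = query.get("project")
    return values[0] if values else None


# Hypothetical URLs, shaped like what the method above expects to parse.
print(extract_project_id("http://localhost:3333/project?project=1234567890"))
print(extract_project_id("http://localhost:3333/"))
```

As far as this check is concerned, a returned ID only confirms that the server answered with a project URL. Verifying that the upload was actually parsed would be a separate step.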
           

What could be wrong?

Best

Anna

Felix Lohmeier

Jan 25, 2021, 8:21:14 PM
to openr...@googlegroups.com
Hi Anna,

If you share the file, I'll be happy to take a look.

Best wishes,
Felix



Anna Gossen

Jan 26, 2021, 2:56:33 AM
to OpenRefine
Hi Felix,


But it's the same thing with any file.

Best

Anna

Felix Lohmeier

Jan 26, 2021, 4:51:27 AM
to OpenRefine
Hi Anna,

Apologies for my duplicate request here ("If you share the file, I'll be happy to take a look.") and the duplicate reply in the other thread. Both were delivered to the group a day late because I had accidentally sent a quick reply from another mail address first (that was not registered for Google Groups).

I have tested again and now also tried to reproduce the "os.path.join" part of your code. Attached is a screenshot.

screenshot_2021-01-26_10-36.png

I tested with Win10, Powershell 5.1, Python 2.7.18, OpenRefine 3.4, openrefine-client 0.3.10. Is anything different in your setup? Could you check whether this minimal script works for you?

#!/usr/bin/env python

import os

from google.refine import refine

server = refine.RefineServer()
orefine = refine.Refine(server)

project_format = 'text/line-based/*sv'
project_options = {}
PATH_TO_TEST_DATA = os.path.join(os.path.dirname(__file__), 'data')
project_file = os.path.join(PATH_TO_TEST_DATA, 'duplicates.csv')

project = orefine.new_project(
    project_file=project_file,
    project_format=project_format,
    project_name='test',
    **project_options
)

Best wishes,
Felix

Anna Gossen

Jan 26, 2021, 4:34:42 PM
to OpenRefine
Hi Felix,

Thanks a lot for trying this out.

I am using the same script as in your example. The difference is that I can only use Python 3.8, so the part that sends the request looks like this (simplified):


        params = {}
        if self.token:
            params['csrf_token'] = self.token


        data = {
            "project-name":"test",
            "project-file": {
                "fd": open("C:/.../data/duplicates.csv", "rb"),
                "filename": "duplicates.csv",
            }
        }
        response = requests.post(url, data=data, params=params)

The response looks good. The OpenRefine log:

                                             refine] GET /command/core/get-csrf-token (45776ms)
21:05:16.933 [                   refine] POST /command/core/create-project-from-upload (60095ms)
21:06:57.635 [                   refine] GET /command/core/get-csrf-token (100702ms)
21:08:32.487 [                   refine] POST /command/core/create-project-from-upload (94852ms)
21:10:06.100 [                   refine] POST /command/core/load-language (93613ms)
21:10:06.115 [                   refine] GET /command/core/get-preference (15ms)
21:10:06.129 [                   refine] POST /command/core/load-language (14ms)
21:10:06.135 [                   refine] POST /command/core/load-language (6ms)
21:10:06.241 [                   refine] POST /command/core/get-importing-configuration (106ms)
21:10:06.266 [                   refine] GET /command/core/get-all-project-tags (25ms)
21:10:06.291 [                   refine] GET /command/core/get-all-project-metadata (25ms)

No error messages, but the created project in OpenRefine is empty ('Untitled' with 0 rows).

Sorry for bothering you with this. You probably cannot help now with Python 3.8, but maybe some other user will give me a hint.

Thanks a lot

Anna

Tom Morris

Jan 26, 2021, 5:05:20 PM
to openr...@googlegroups.com
Hi Anna,

On Tue, Jan 26, 2021 at 4:34 PM Anna Gossen <annag...@gmail.com> wrote:

Sorry for bothering you with this. You probably cannot help now with Python 3.8, but maybe some other user will give me a hint.

Are you using the branch from the Python 3 pull request? If you look at the comments thread you'll see that there were a number of str vs bytes issues with the Python 3 port, which is pretty typical for Python 3 ports, and that's a likely source of all kinds of protocol weirdness.
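
To illustrate the kind of str vs bytes difference meant here, the sketch below (Python 3, standard library only; the temporary file and its contents are made up) reads the same CSV once in text mode, as `open(project_file)` does in the posted `new_project` method, and once in binary mode. The explicit `encoding="utf-8"` is added here only to make the example deterministic.

```python
import os
import tempfile

# Create a small throwaway CSV so the example is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("name,city\nZoë,Köln\n".encode("utf-8"))
    path = tmp.name

with open(path, encoding="utf-8") as text_file:  # text mode -> str
    as_text = text_file.read()

with open(path, "rb") as binary_file:            # binary mode -> bytes
    as_bytes = binary_file.read()

os.remove(path)

print(type(as_text).__name__, len(as_text))    # length in characters
print(type(as_bytes).__name__, len(as_bytes))  # length in bytes on disk
```

An HTTP body is ultimately bytes, so a port that mixes the two types can end up sending something other than the file contents.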

If you want to dig into debugging it, I'd suggest comparing what goes on the wire for a real OpenRefine client request vs what your client is sending. That'll likely give you a hint as to the source of the problem.
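
One thing such a comparison might reveal, sketched here with the standard library only: form-encoding treats a dictionary value as a sequence of its keys, so a nested `project-file` dict like the one in the simplified request would contribute only the strings `fd` and `filename`, not the file contents. Whether `requests` encodes `data=` in exactly this way is an assumption to verify against the actual bytes on the wire; the payload below is a stand-in with made-up file contents.

```python
import io
from urllib.parse import urlencode

# Stand-in for the payload in the simplified request; io.BytesIO replaces
# the real open(..., "rb") handle so the example needs no file on disk.
data = {
    "project-name": "test",
    "project-file": {
        "fd": io.BytesIO(b"name,city\nAnna,Berlin\n"),
        "filename": "duplicates.csv",
    },
}

# doseq=True expands sequence-like values; iterating a dict yields its keys.
body = urlencode(data, doseq=True)
print(body)
```

If the real request body looks similar, the server would receive a project name but no file, which could be consistent with a project ID being returned for a project with 0 rows.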

Tom

Anna Gossen

Jan 26, 2021, 5:24:23 PM
to OpenRefine
Hi Tom,

thanks for the hint. What do you mean by the "real OpenRefine client"? I tried https://github.com/wolfv/openrefine-client/tree/py3_port and https://github.com/daniel-butler/refine-client-py but neither of them worked.

Do you have an example of a working request? 

Best

Anna

Tom Morris

Jan 26, 2021, 6:07:14 PM
to openr...@googlegroups.com, openref...@googlegroups.com
Sorry, I should have been clearer. The standard OpenRefine web client is what I'm talking about. If you create a project from a local file on your computer, it should hit that same API endpoint. If you look at the HTTP requests in the network tab of the developer console in your browser, you can see what it's sending for a payload.
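
For orientation, a browser file upload is normally sent as `multipart/form-data`. The sketch below builds such a body by hand with the standard library so its shape can be compared with what the network tab shows. The helper name `build_multipart`, the field names (taken from the code earlier in this thread), the `text/csv` part type and the sample contents are assumptions for illustration, not a description of what OpenRefine's web client actually sends.

```python
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Return (content_type_header, body_bytes) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields: one part each, value sent as text.
    for name, value in fields.items():
        parts.append(
            '--{}\r\nContent-Disposition: form-data; name="{}"\r\n\r\n{}\r\n'
            .format(boundary, name, value).encode("utf-8")
        )
    # The file part carries a filename and the raw bytes of the file.
    parts.append(
        '--{}\r\nContent-Disposition: form-data; name="{}"; filename="{}"\r\n'
        'Content-Type: text/csv\r\n\r\n'
        .format(boundary, file_field, filename).encode("utf-8")
        + file_bytes + b"\r\n"
    )
    # Closing boundary marks the end of the body.
    parts.append("--{}--\r\n".format(boundary).encode("utf-8"))
    return "multipart/form-data; boundary=" + boundary, b"".join(parts)


content_type, body = build_multipart(
    {"project-name": "test"},
    "project-file",
    "duplicates.csv",
    b"name,city\nAnna,Berlin\n",
)
print(content_type)
print(body.decode("utf-8"))
```

With `requests`, passing the file through the `files=` argument rather than `data=` is the usual way to get a body of this shape.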

Since this is getting kind of "techie", perhaps we can continue it on the developers list, so we don't strain the patience of the OpenRefine users.

Tom 


Thad Guidry

Jan 28, 2021, 1:48:49 PM
to openr...@googlegroups.com, annag...@gmail.com
Hi Anna,

I just checked on our openrefine-dev mailing list access, and you are an allowed member as shown, so posts don't get deleted:

image.png
What was happening was that the spam filter was catching your messages.
I've now added you to the always-approve list, so you shouldn't have problems posting to either mailing list.


Anna Gossen

Jan 28, 2021, 3:20:25 PM
to OpenRefine
Thanks, Thad