Gensim KeyedVector load from s3


Thanos Tasakos

Jun 28, 2023, 2:32:15 PM
to Gensim
Hello all,

Supposedly Gensim uses smart_open to open files, which allows passing S3 URLs such as s3://bucket/file directly. I am using a local MinIO installation and also want to pass the endpoint_url (e.g. http://localhost:9000) and the credentials. The credentials can be read from environment variables or ~/.aws/credentials, but I don't know how to pass the URL.

Gordon Mohr

Jun 28, 2023, 3:15:44 PM
to Gensim
The `smart_open` project README shows examples of supplying credentials inside the URL:


(This is the top result for the Google query [smart_open s3 credentials].)

Have you tried that?

- Gordon

Thanos Tasakos

Jun 29, 2023, 3:49:37 AM
to Gensim

My problem is not the credentials, as boto3 can take them directly from environment variables or from inside the URL in the format "s3://{user}:{pass}@bucket/key".
The problem is how to tell Gensim that the S3 host and port are, e.g., http://localhost:9000. I can do it directly if I use smart_open myself and pass the boto3 client as well, but Gensim does not provide a way of doing so.

KR,
Thanos

Gordon Mohr

Jun 29, 2023, 5:25:38 PM
to Gensim
Aha, sorry for misreading your request!

It seems that either Gensim's calls to `smart_open` should allow passing extra `transport_params` and the like (which they currently don't), or the `boto3` library should let users specify alternate endpoints via its configuration file (which SO answers such as the one at <https://stackoverflow.com/questions/32618216/override-s3-endpoint-using-boto3-configuration-file> suggest it doesn't support).

So, potential workarounds:

Gensim, via smart_open, can often take an already-open file-like object instead of a path/URL. So, depending on exactly which Gensim IO-using functions you're using, you *may* be able to open the file yourself, using smart_open directly, then pass that (single, front-to-back streamable) file-like object to the Gensim function instead of a string path/URL.

Other Gensim IO (like opening a `.save()` model that may be multiple separate files deduced by their one 'main' file) may still require a path/URL. 

In those cases, you *might* be able to work around the current limitations by "monkey-patching" the `smart_open.open` function that Gensim is about to use, so that it adds your extra parameters. This risks side effects (be sure to read up on the risks of monkey-patching if you're unfamiliar with the technique) but *might* be fine in your pattern of use.

Very roughly (untested & my variable-args syntax is rusty):

    import gensim.utils
    import smart_open

    # assumes you've already set up your custom boto3 Session/client into `my_s3_client`
    inner_open = smart_open.open
    def my_open(*args, **kwargs):
        # add the custom client unless the caller already supplied transport_params
        kwargs.setdefault('transport_params', {'client': my_s3_client})
        return inner_open(*args, **kwargs)
    gensim.utils.open = my_open  # clobber the original `open`

    # do your Gensim IO that needs your modification
    # ...

    gensim.utils.open = inner_open  # reset just in case it'd break other later opens
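
If an exception is raised between the patch and the reset, the patched `open` stays in place. One way to guard against that is to wrap the swap in a context manager that always restores the original. The following is only a sketch of that pattern: `patched_open`, `fake_open` and `fake_utils` are hypothetical names, and the stand-in namespace exists purely so the example runs without Gensim or an S3 endpoint. It has not been tried against a real Gensim install.

```python
import contextlib
import types


@contextlib.contextmanager
def patched_open(module, extra_kwargs):
    """Temporarily replace `module.open` with a wrapper that adds default keyword arguments."""
    original = module.open

    def wrapper(*args, **kwargs):
        # Arguments given explicitly by the caller win over the injected defaults.
        merged = {**extra_kwargs, **kwargs}
        return original(*args, **merged)

    module.open = wrapper
    try:
        yield wrapper
    finally:
        # Runs even if the body raises, so the patch never leaks.
        module.open = original


# Hypothetical stand-in for the module being patched; it just echoes its arguments.
def fake_open(uri, mode="r", transport_params=None):
    return (uri, mode, transport_params)


fake_utils = types.SimpleNamespace(open=fake_open)

with patched_open(fake_utils, {"transport_params": {"client": "my_s3_client"}}):
    inside = fake_utils.open("s3://bucket/file", "rb")  # extra parameter is injected

after = fake_utils.open("s3://bucket/file", "rb")  # original behaviour is back
```

In real use, `module` would presumably be the Gensim module whose `open` is being clobbered, and the `client` value would be the boto3 client configured for the custom endpoint rather than a placeholder string.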

Let me know if either of these tactics work for your purposes, or if your specific use runs into other problems. 

- Gordon

Thanos Tasakos

Jun 30, 2023, 12:04:32 PM
to Gensim
What a legend!
I also needed to monkey-patch the numpyio module to use smart_open instead of open, but it worked!

Kind regards,
Thanos