How to restore a vanilla PostgreSQL dump to a PostgreSQL+pg_shard database?


FattahRozzaq

Oct 15, 2015, 3:54:10 AM
to pg_shard users
Hi,

I am currently trying to import a vanilla PostgreSQL 9.2.4 dump using psql.
The dump was made with pg_dump.

The error message is as below:
psql:/home/postgres/backup/20151013-150001/room.dump:206864: invalid command \N


Below is the failing command from the dump, with two of its data rows:
COPY room (id, user_id, partner_id, rm_cr_ts, rm_up_ts, rm_sv_ts, value, last_dlvrd_to_user, last_dlvrd_to_prtnr, last_seen_by_user, last_seen_by_prtnr, num_unseen_by_user, num_unseen_by_prtnr, last_message_id, last_message_ts, legacy_id, legacy_last_message_id, inserted_at, updated_at) FROM stdin;
711675  1391103 2657167 -1429528649798556       1429528649798556        1429528649798556        created 0       0       0       0       1       0       19733451        1429528649798556        AUzWjCFTbRHQVpj5SYnD    AUzWjCR43gMouVUpyttw    2015-09-12 18:56:49     2015-09-12 18:56:49
19      40868   1191438 1426767773822538        1426767773822538        1426767773822538        closed  1426754122000000        0       1426754122000000        2       0       1       \N      0       550a8a4a73656d2e26160000                2015-09-12 14:05:32     2015-09-12 14:05:32


Below is the command I ran to create the table and distribute it:
CREATE TABLE room (
    id integer NOT NULL,
    user_id character varying(255),
    partner_id character varying(255),
    rm_cr_ts bigint,
    rm_up_ts bigint,
    rm_sv_ts bigint,
    value character varying(255),
    last_dlvrd_to_user bigint,
    last_dlvrd_to_prtnr bigint,
    last_seen_by_user bigint,
    last_seen_by_prtnr bigint,
    num_unseen_by_user integer,
    num_unseen_by_prtnr integer,
    last_message_id integer,
    last_message_ts bigint,
    legacy_id text,
    legacy_last_message_id text,
    inserted_at timestamp without time zone NOT NULL,
    updated_at timestamp without time zone NOT NULL
);
SELECT master_create_distributed_table(table_name := 'room', partition_column := 'id');
SELECT master_create_worker_shards(table_name := 'room', shard_count := 16, replication_factor := 2);


Below is the description of my system:
pg_shard: v1.2.2 (obtained via: git clone https://github.com/citusdata/pg_shard.git)

I use 3 servers, all with the same specifications:
PostgreSQL version:  9.4.5
OS: Debian GNU/Linux 7.8 (wheezy)
RAM: 128GB
CPU: 24 cores
SSD: 128GB


So, can we restore pg_dump data from vanilla PostgreSQL 9.2.4 into PostgreSQL 9.4.5+pg_shard?
How do we do it properly?


Thanks in advance,
Fattah

FattahRozzaq

Oct 15, 2015, 5:35:20 AM
to pg_shard users
Dear all,

When I type the COPY command in psql, the error message is as below:
postgres=# COPY room (id, user_id, partner_id, rm_cr_ts, rm_up_ts, rm_sv_ts, value, last_dlvrd_to_user, last_dlvrd_to_prtnr, last_seen_by_user, last_seen_by_prtnr, num_unseen_by_user, num_unseen_by_prtnr, last_message_id, last_message_ts, legacy_id, legacy_last_message_id, inserted_at, updated_at) FROM stdin;
ERROR:  COPY commands on distributed tables are unsupported

I don't have any idea how to solve this.
Please help me...


Regards,
Fattah

Onder Kalaci

Oct 15, 2015, 6:38:11 AM
to pg_shard users
Hi,

In order to import an existing table's data into pg_shard, I can suggest the following method:

pg_shard provides a command-line utility, copy_to_distributed_table, to facilitate loading many rows of data from a file, similar to the functionality provided by PostgreSQL's COPY command. So, in your case the simplest solution would be to copy the original table out to a CSV file via PostgreSQL's COPY command, then use copy_to_distributed_table to load the same data into the distributed table. So basically, you'd do something similar to the following:
  • COPY non_distributed_table TO '/tmp/file.csv'
  • Create and distribute the table as you already suggested above
  • Use copy_to_distributed_table to load the contents of '/tmp/file.csv' into the distributed table (see the sketch below).
Note that you'd need to take care to use the same COPY parameters, such as delimiter and header, along with the other options discussed here. Also, CitusDB's documentation page has an example usage of copy_to_distributed_table at step 4, which might help as well.
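For concreteness, a minimal sketch of those three steps; the database name, file path, and delimiter here are placeholders, so adjust them to your setup:

# 1. On the source (vanilla PostgreSQL) database: dump the table to CSV
psql -d sourcedb -c "COPY room TO '/tmp/file.csv' CSV HEADER"

# 2. On the pg_shard master: create and distribute the table exactly as above
#    (CREATE TABLE room ...; master_create_distributed_table; master_create_worker_shards)

# 3. Load the CSV into the distributed table, matching the CSV/header options from step 1
./copy_to_distributed_table -CH /tmp/file.csv room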

Does that help you?

Thanks,
Onder

Sanjana Sridhar

Oct 15, 2015, 10:44:14 AM
to pg_shard users
Hello,

I have a csv file in /tmp/test.csv and a distributed table that I am trying to copy the data into. I have tried the copy_to_distributed_table command but am getting the following error.

postgres@ae9485e8f833:/usr/lib/postgresql/9.3/bin$ ./copy_to_distributed_table -CH -d '|' -n NULL /tmp/test.csv test_table

ERROR:  function create_insert_proxy_for_table(unknown, unknown) does not exist
LINE 1: SELECT create_insert_proxy_for_table('public.test_table', 'r...
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.


Also, is there any option to specify the name of the database? I did not find anything here: https://groups.google.com/forum/#!searchin/pg_shard-users/copy_to_distributed_table/pg_shard-users/40oxOGqN-6M/tV08X7KPeOkJ


Any help will be greatly appreciated.

Thanks,
Sanjana

Onder Kalaci

Oct 16, 2015, 10:24:38 AM
to pg_shard users
Hi, 

The first thing you should check is whether the pg_shard extension has been successfully created on the master:
  • First, you will need to add pg_shard to shared_preload_libraries in your postgresql.conf:
    • shared_preload_libraries = 'pg_shard' # (change requires restart)
  • Then, connect to the master node via psql and run:
    • CREATE EXTENSION pg_shard;
Also, the copy_to_distributed_table script respects the typical libpq environment variables. So, you can set the database name by running the following in your shell, as in the combined sketch below (there is also an example usage of environment variables here):
  • export PGDATABASE=postgres
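Putting those pieces together, a rough sketch; the database name, port, and file/table names here are just placeholders:

# postgresql.conf on the master (requires a restart)
shared_preload_libraries = 'pg_shard'

# create the extension on the master
psql -d postgres -c "CREATE EXTENSION pg_shard;"

# point the loader at the right database/port via the standard libpq environment variables
export PGDATABASE=postgres
export PGPORT=5432
./copy_to_distributed_table -CH -d '|' -n NULL /tmp/test.csv test_table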
Does that fix your problem?

Thanks,
Onder

FattahRozzaq

Oct 19, 2015, 5:27:18 AM
to pg_shard users, Onder Kalaci, sanjana....@gmail.com
At last, I have a workaround, following the suggestions from all of you.

I exported the table to .CSV format via psql 9.2.4 from the vanilla
PostgreSQL 9.2.4 database:
COPY room TO '/home/postgres/backup/room.csv' DELIMITER ',' CSV HEADER;

I modified the copy_to_distributed_table script to match my port and
database name (and named it copytopgshard.sh, as you can see in my
screenshot).

And then I imported the .CSV file into my database running PostgreSQL
9.4.5+pg_shard:
/opt/postgresql-9.4.5/bin/copytopgshard.sh -CH -d ','
/home/postgres/backup/room.csv

The copy took around 15 hours to finish for a .CSV file containing a
single 3.2 GB table.
I have another table that I have to copy to pg_shard, a single table of
around 17 GB. I think the 17 GB .CSV file will take about 90 hours...

Is this the normal maximum performance of pg_shard?
Or can we do something about it?


Respect//
Fattah
copy-data-to-pgshard.png

Onder Kalaci

Oct 20, 2015, 2:42:21 AM
to pg_shard users, on...@citusdata.com, sanjana....@gmail.com
Hi, 

Running concurrent INSERTs helps the write throughput of pg_shard, and copy_to_distributed_table executes its INSERTs sequentially. So, to get higher throughput with the copy_to_distributed_table script, you need to run several instances of it concurrently. In this post (https://www.citusdata.com/blog/14-marco/164-parallel-data-loading-for-pg-shard), we explain how the copy_to_distributed_table script can be run in parallel; a rough sketch follows below.
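Roughly, the idea looks like this; the chunk count, file paths, and delimiter are only placeholders, and the linked post has the exact recipe:

# split the exported CSV (header removed) into chunks
tail -n +2 /home/postgres/backup/room.csv > /tmp/room_noheader.csv
split -n l/16 /tmp/room_noheader.csv /tmp/room_chunk_

# load the chunks concurrently, one copy_to_distributed_table process per chunk
ls /tmp/room_chunk_* | xargs -P 16 -I{} ./copy_to_distributed_table -C -d ',' {} room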

Does this help you get the throughput you expect?

FattahRozzaq

Oct 20, 2015, 3:46:24 AM
to Onder Kalaci, pg_shard users, sanjana....@gmail.com
Hi Onder,

I haven't tried the suggestion yet.
Was the page cut off somehow? The last sentence at the bottom of the post seems unfinished:
"Pg_shard can also be used in combination with"


Thank you,
Fattah

Onder Kalaci

Oct 20, 2015, 3:50:53 AM
to pg_shard users, on...@citusdata.com, sanjana....@gmail.com
Hi,

Yes, I now realize that the post has an incomplete sentence at the end :) I'll try to fix it soon.
But the post already includes the necessary details for your use case (i.e., there is enough information to run the script in parallel). So, you can try the suggestion.

Thanks

FattahRozzaq

Oct 20, 2015, 9:42:28 AM
to Onder Kalaci, pg_shard users, sanjana....@gmail.com
Hi Onder,

Yes, it works like a charm!
Excellent. Thank you sir.

However, if I set the max-procs to 999: xargs -n 1 -P 999

PostgreSQL returns the following error in the log:
"too many open files in system"

Even after I changed postgresql.conf on all nodes & workers to:
max_files_per_process = 99999

So for now the argument for 'xargs' is only: xargs -n 1 -P 99

Can you help me again with how to work around this error?


Thank you,
Fattah

Onder Kalaci

Oct 21, 2015, 7:57:16 AM
to pg_shard users, on...@citusdata.com, sanjana....@gmail.com
Hi,

There is one more PostgreSQL config that you should increase: max_connections (http://www.postgresql.org/docs/9.4/static/runtime-config-connection.html#GUC-MAX-CONNECTIONS). However, there is an upper limit on open files on most Linux systems, as discussed here (https://cs.uwaterloo.ca/~brecht/servers/openfiles.html), so you cannot open 99999 connections concurrently. A rough sketch of the relevant settings is below.
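For reference, a rough sketch of the knobs involved; the numbers here are only placeholders, so pick values your OS limits and hardware can actually sustain:

# postgresql.conf on the master and all workers (both settings require a restart)
max_connections = 300
max_files_per_process = 4096

# raise the open-file limit for the postgres user, e.g. in /etc/security/limits.conf
postgres  soft  nofile  65536
postgres  hard  nofile  65536

# then keep the xargs concurrency at or below what those limits allow
ls /tmp/room_chunk_* | xargs -P 32 -I{} ./copy_to_distributed_table -C -d ',' {} room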

Lastly, there is an e-mail thread on the pgsql-admin list which might help you as well.

FattahRozzaq

Nov 3, 2015, 1:34:19 AM
to Onder Kalaci, pg_shard users, sanjana....@gmail.com
Hi Onder,

Big thanks for your support.
You really helped us a lot through this mailing list.


Thank you,
Fattah
