Delete/Update - functionality

36 views
Skip to first unread message

armand pirvu

unread,
Aug 12, 2015, 5:08:42 PM8/12/15
to pg_shard users
All


"Once you created your shards, you can start issuing queries against the cluster. Currently, UPDATE and DELETE commands require the partition column in the WHERE clause."

So how do I go about a range ? Do I need to build dynamic queries by passing in the where clause the values per shard so to speak, this way ensuring the range falls on one shard ? That of course using hashtext from what I gathered ?

Is this something that will be addressed somehow ? Can't speak for MySQL sharding but in MemSQL this is not a restriction 

Thanks
-- Armand

Jason Petersen

unread,
Aug 13, 2015, 5:43:07 PM8/13/15
to armand pirvu, pg_shard users
On Aug 12, 2015, at 3:08 PM, armand pirvu <armand...@gmail.com> wrote:

So how do I go about a range ?

Well, you can easily update any particular range of values, so long as that range is on some other column and you also have a WHERE column restricting the query to one shard.

This is clearly quite restrictive, but we have many use cases where sharding is being done by e.g. a customer_id column and queries only operate on one customer at a time.  In those cases, most UPDATEs are possible, as the calling application is restricting everything by customer_id to begin with.

Do I need to build dynamic queries by passing in the where clause the values per shard so to speak, this way ensuring the range falls on one shard ? That of course using hashtext from what I gathered ?

That’s one way of circumventing this restriction, though I must say it sounds arduous. Can you share more about what you’re trying to accomplish? Maybe a tweak to your schema or workflow could make using pg_shard more natural.

Is this something that will be addressed somehow ? Can't speak for MySQL sharding but in MemSQL this is not a restriction 

It’s not something currently in our sights but if your use case is compelling it’s something we could explore. Can you open a GitHub issue for your feature request if this is something that interests you?

--
Jason Petersen
Software Engineer | Citus Data

signature.asc

armand pirvu

unread,
Aug 14, 2015, 2:56:31 PM8/14/15
to pg_shard users, armand...@gmail.com
Hi Jason


Thank you for the explanations

- Well , the idea is that it may be a delete/update by a date range for example. An example is , table myfolks holds participants info for specific events for say 3 years. And before I start the 4th year, I need to move the data for the oldest year. I can do this either by specifying date only in the where clause or by specifying person_id from an IN list generated by a select based on date, but is this supported (did not see but I could have missed it as well) ?  Or is this supported in CitusDB ?  It can be any criteria for real . And speaking of time distribution , can one use a composite shard key ?
- Also since I can not truncate the table, I was using a delete to trash the data which I had no use for. I know not the most efficient way but not a whole lot of choice either


CREATE TABLE person
(
  person_id bigint NOT NULL,
  first_name character varying(64),
  middle_name character varying(64),
  last_name character varying(64),
  title character varying(64),
  prefix character varying(64),
  suffix character varying(64),
  CONSTRAINT person_pkey PRIMARY KEY (person_id)
)
;

4 shards on 2 workers

Thanks
Armand

Jason Petersen

unread,
Aug 17, 2015, 3:51:20 PM8/17/15
to armand pirvu, pg_shard users
On Aug 14, 2015, at 11:56 AM, armand pirvu <armand...@gmail.com> wrote:

Well , the idea is that it may be a delete/update by a date range for example.

It sounds like the best option here would be range partitioning, unfortunately not yet supported by pg_shard…

An example is , table myfolks holds participants info for specific events for say 3 years. And before I start the 4th year, I need to move the data for the oldest year.

It might be best to come up with some sort of function that performs this maintenance step and then run it as a cron job on each of the worker nodes, rather than trying to do it through a pg_shard master. Have you looked into something like pg_partman?

If you’re trying to manage (or “age out”) data based on a time dimension, you could somehow manually set that up on each worker and still use pg_shard to query and manipulate data. It’s unclear precisely what you need to do here so it’s hard to say.

And speaking of time distribution , can one use a composite shard key ?

Yes, though it would need to be one column using a composite type. The pg_shard v1.2 announcement in this mailing list links to a test suite that shows how to use a composite type as your key.

Also since I can not truncate the table, I was using a delete to trash the data which I had no use for. I know not the most efficient way but not a whole lot of choice either

Since each of the worker nodes is “just a PostgreSQL server”, you may find other solutions better fit this case. There’s nothing keeping you from writing a script that runs on each worker that truncates tables locally for you. pg_shard would still work great in such an environment: it composes well with all sorts of custom workflows.
signature.asc

armand pirvu

unread,
Aug 17, 2015, 4:22:38 PM8/17/15
to pg_shard users, armand...@gmail.com
Thank you Jason


Armand
Reply all
Reply to author
Forward
0 new messages