Filespaces, tablespaces

31 views
Skip to first unread message

Heikki Linnakangas

unread,
Dec 19, 2017, 8:32:40 AM12/19/17
to Greenplum Developers
As we switch from the current file replication to WAL-based replication,
and remove all the persistent table stuff and more, let's revisit the
filespaces feature. There would be some work needed to keep it working,
so now is a good time to consider how we'd want it to work. Or if we
could revert all that the way it is in the upstream.

Here's what I propose we do:

* Remove the concept of filespaces.
* Revert tablespaces the way they are in the upstream.
* Cherry-pick commit 16d8e594ac from PostgreSQL 9.2 to remove
spclocation field from pg_tablespace
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=16d8e594acd96661267cb7897834f9cba51a2ffd).

With these changes, we'll have one feature less to maintain. I believe
you can do all the things that users currently do with filespaces, with
the unmodified upstream tablespaces feature. Although the UI and the
details will be different.


If I understand correctly, the idea behind filespaces has been that you
can specify a different "mount point" on each server, for each
tablespace. With the upstream commit, to remove
pg_tablespace.spclocation, you can do that with plain tablespaces,
without the concept of filespaces. The mount point for each tablespace
is stored as a symlink in the data directory, and it can be different on
each server.

In order to make that nicer to use, maybe we need some embellishments.
Like, a UDF to set the symlink, so that you don't need to ssh into each
server and set it up manually. Or some python scripts. Something along
those lines.

Currently, even the location of the data directory on each segment is
stored in the master's pg_filespace_entry table. Do we need that? A lot
of the management tools depend on that currently, so I guess we do. I
propose that as we remove the pg_filespace_entry table, we add a field
to gp_segment_configuration instead, for the data directory's path. But
I don't think the master needs to have the paths of every tablespace on
every segment. When you know the path to the main data directory, you
can dig into the data directory and look at the tablespace's symlink, to
see where it points to.

One question is, what do you do in pg_dumpall, when dumping from an old
server that uses filespaces? If the concept of filespaces goes away
altogether, we can't restore them the way they were. Maybe we can find
some semi-intelligent mapping from filespaces+tablesapces to just
tablespaces, or maybe ask the user to give more information on how to
map it. And the same with pg_upgrade.

Thoughts?

- Heikki

Robert Eckhardt

unread,
Dec 19, 2017, 9:47:37 AM12/19/17
to Heikki Linnakangas, Greenplum Developers
On Tue, Dec 19, 2017 at 8:32 AM, Heikki Linnakangas <hlinna...@pivotal.io> wrote:
As we switch from the current file replication to WAL-based replication, and remove all the persistent table stuff and more, let's revisit the filespaces feature. There would be some work needed to keep it working, so now is a good time to consider how we'd want it to work. Or if we could revert all that the way it is in the upstream.

Here's what I propose we do:

* Remove the concept of filespaces.
* Revert tablespaces the way they are in the upstream.
* Cherry-pick commit 16d8e594ac from PostgreSQL 9.2 to remove spclocation field from pg_tablespace (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=16d8e594acd96661267cb7897834f9cba51a2ffd).

With these changes, we'll have one feature less to maintain. I believe you can do all the things that users currently do with filespaces, with the unmodified upstream tablespaces feature. Although the UI and the details will be different.


If I understand correctly, the idea behind filespaces has been that you can specify a different "mount point" on each server, for each tablespace. With the upstream commit, to remove pg_tablespace.spclocation, you can do that with plain tablespaces, without the concept of filespaces. The mount point for each tablespace is stored as a symlink in the data directory, and it can be different on each server.

The 2 use cases, as I understand them are:
  1. in a cloud environment being able to put your data on ephemeral disk while keeping your catalog data local
  2. Being able to specify hardware that is more suited to a workload. For instance having tables that are highly transactional on SSD while having historical data that is less frequently used on bigger HDD drives. 
 

In order to make that nicer to use, maybe we need some embellishments. Like, a UDF to set the symlink, so that you don't need to ssh into each server and set it up manually. Or some python scripts. Something along those lines.

Currently, even the location of the data directory on each segment is stored in the master's pg_filespace_entry table. Do we need that? A lot of the management tools depend on that currently, so I guess we do. I propose that as we remove the pg_filespace_entry table, we add a field to gp_segment_configuration instead, for the data directory's path. But I don't think the master needs to have the paths of every tablespace on every segment. When you know the path to the main data directory, you can dig into the data directory and look at the tablespace's symlink, to see where it points to.

As is typical, I fully support off loading logic from the master and shifting it to the segment. Let's wait for some responses to see if there are any unintended consequences to doing this. 
 

One question is, what do you do in pg_dumpall, when dumping from an old server that uses filespaces? If the concept of filespaces goes away altogether, we can't restore them the way they were. Maybe we can find some semi-intelligent mapping from filespaces+tablesapces to just tablespaces, or maybe ask the user to give more information on how to map it. And the same with pg_upgrade.

I think that how we deal with this is implementation dependent. Let's get to the point where this is a more impending problem. 
 

Thoughts?

- Heikki



Todd Sedano

unread,
Dec 19, 2017, 12:09:10 PM12/19/17
to Heikki Linnakangas, Greenplum Developers
On Tue, Dec 19, 2017 at 5:32 AM, Heikki Linnakangas <hlinna...@pivotal.io> wrote:
Currently, even the location of the data directory on each segment is stored in the master's pg_filespace_entry table. Do we need that? A lot of the management tools depend on that currently, so I guess we do. I propose that as we remove the pg_filespace_entry table, we add a field to gp_segment_configuration instead, for the data directory's path. But I don't think the master needs to have the paths of every tablespace on every segment. When you know the path to the main data directory, you can dig into the data directory and look at the tablespace's symlink, to see where it points to.

With these kinds of changes, I wonder how to sequence the changes incrementally so that we can go from green-to-green-to-green and avoid protracted states of red "big-bang" re-writes.  For instance, since many of the management tools rely on this table, if it is a read-only relationship, adding a view would enable us to de-couple the proposed removal from management tool changes. 


Jim Doty

unread,
Dec 19, 2017, 12:54:48 PM12/19/17
to Todd Sedano, Heikki Linnakangas, Greenplum Developers
Rob asked me to include the following information from another thread here:

Rob,
I can expand on the three use cases that I know well.
 
1. When running on the cloud, I want my data to survive a node failure, because I know that if any of the hardware fails, the cloud provider will just retire my node.  This means that I need to store all of my data on the "persistent" storage.  On AWS this is called EBS, on GCP - persistent disk, etc...  However, the bandwidth of these disks is limited, so temp/workfiles/spill would benefit from being place placed on a local storage method.  Most providers offer several types of disk, but nodes with local SSD and/or the NVMe card options are particularly well suited for this configuration.
 
2. I have a large table with many partitions (perhaps by date).  Only the newest partitions are accessed frequently for my reporting, but I am required for regulatory reasons to to retain records for a much longer period of time.  I would like in this case to have the newest partitions to be stored on my relatively expensive, performant internal SSD drives (or high RPM HDD's).  I would the like to store my "cold" data on my less expensive direct attach storage device ( something like an MD3460 ).  This approach allows me to scale my storage separately from my compute requirements.
 
3. I am interested in unleashing the full power of GPDB on off the shelf, commodity gear.  I buy a system with three tiers of storage, NVMe - internal SSD's - and direct attach storage.  My goal is to use each type of storage in the most cost efficient manner.  I put my spill files on the NMVe cards, I put my catalogue (and any tables that are likely to see random access patterns) on my internal SSD's, and I put all of my large fact tables on the direct attach storage arrays.
 
Hope this helps,
Jim

While this doesn't help with the how should we work on this question posed by Heiki, perhaps it can help evaluate if the delivered features enable the use cases seen in the wild.

Cheers,
Jim

--
Jim Doty | Toolsmith Team - Data R&D | jd...@pivotal.io

Asim Praveen

unread,
Dec 19, 2017, 2:56:10 PM12/19/17
to Heikki Linnakangas, Greenplum Developers


On Tue, Dec 19, 2017 at 5:32 AM, Heikki Linnakangas <hlinna...@pivotal.io> wrote:
>
> Here's what I propose we do:
>
> * Remove the concept of filespaces.
> * Revert tablespaces the way they are in the upstream.
> * Cherry-pick commit 16d8e594ac from PostgreSQL 9.2 to remove spclocation field from pg_tablespace (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=16d8e594acd96661267cb7897834f9cba51a2ffd).
>

First two points make sense.  Why is the removal spclocation necessary?  I can imagine segments storing the target location in this field in their own copy of pg_tablespace table.  Would that be a problem?


>
> If I understand correctly, the idea behind filespaces has been that you can specify a different "mount point" on each server, for each tablespace. With the upstream commit, to remove pg_tablespace.spclocation, you can do that with plain tablespaces, without the concept of filespaces. The mount point for each tablespace is stored as a symlink in the data directory, and it can be different on each server.

Having a symlink within data directory is a great simplification.  No need for pg_filespace_entry.  A datafile can then be uniquely identified by tablespace OID, database OID and relation OID.

>
> Currently, even the location of the data directory on each segment is stored in the master's pg_filespace_entry table. Do we need that? A lot of the management tools depend on that currently, so I guess we do. I propose that as we remove the pg_filespace_entry table, we add a field to gp_segment_configuration instead, for the data directory's path. But I don't think the master needs to have the paths of every tablespace on every segment. When you know the path to the main data directory, you can dig into the data directory and look at the tablespace's symlink, to see where it points to.

I like the idea of recording data directory in gp_segment_configuration and removing pg_filespace_entry table.  Data directory is all that a utility should need, unless I'm missing something.

>
> One question is, what do you do in pg_dumpall, when dumping from an old server that uses filespaces? If the concept of filespaces goes away altogether, we can't restore them the way they were. Maybe we can find some semi-intelligent mapping from filespaces+tablesapces to just tablespaces, or maybe ask the user to give more information on how to map it. And the same with pg_upgrade.

Not directly related to the above problem but somewhat related question - CreateTablespace() function in upstream creates a TABLESPACE_VERSION_DIRECTORY as a subdirectory within the target tablespace location.  How is that useful during upgrade?

Asim

Scott Kahler

unread,
Dec 19, 2017, 4:03:56 PM12/19/17
to Asim Praveen, Heikki Linnakangas, Greenplum Developers
I am thinking we would probably need to have the call out for all directory locations within the master for any spaces a segment has configured. 

Let's say 
primary 1 on seghost1 /data1
mirror of 1 on seghost2 /data2

decide we need speed so we add additional file/table space as /dataX/fast
primary 1 on seghost1 /data1/primary/gpseg0 (default) /data1/fast/primary/gpseg0 ( fastdisk )
mirror of 1 on seghost2 /data2/mirror/gpseg0 (default) /data2/fast/mirror/gpseg0 ( fastdisk )

if the mirror were to go away and we need to rebuild I don't think we can accurately say where the pointer should be from just primary information. We could jump out to the primary and see the space and where the spaces point, but those don't necessarily reflect where they should point on the mirror



--

Scott Kahler | Pivotal, Greenplum Product Management  | ska...@pivotal.io | 816.237.0610

Heikki Linnakangas

unread,
Dec 20, 2017, 4:53:53 AM12/20/17
to Jim Doty, Todd Sedano, Greenplum Developers
On 19/12/17 19:54, Jim Doty wrote:
> I can expand on the three use cases that I know well.
>
> [three use cases]
>
> While this doesn't help with the how should we work on this question posed
> by Heiki, perhaps it can help evaluate if the delivered features enable the
> use cases seen in the wild.

Thanks! Yep, that's useful. Those use cases are typical for Postgres
tablespaces, we should have no problem supporting them.

- Heikki

Heikki Linnakangas

unread,
Dec 20, 2017, 5:04:43 AM12/20/17
to Asim Praveen, Greenplum Developers
On 19/12/17 21:56, Asim Praveen wrote:
> On Tue, Dec 19, 2017 at 5:32 AM, Heikki Linnakangas <hlinna...@pivotal.io>
> wrote:
>>
>> Here's what I propose we do:
>>
>> * Remove the concept of filespaces.
>> * Revert tablespaces the way they are in the upstream.
>> * Cherry-pick commit 16d8e594ac from PostgreSQL 9.2 to remove spclocation
> field from pg_tablespace (
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=16d8e594acd96661267cb7897834f9cba51a2ffd
> ).
>>
>
> First two points make sense. Why is the removal spclocation necessary? I
> can imagine segments storing the target location in this field in their own
> copy of pg_tablespace table. Would that be a problem?

Hmm, good question. But yes, there's one gotcha: the primary and the
mirror cannot have different contents of the pg_tablespace table. Or any
other table, for that matter. WAL replication works at a physical level,
and will keep all tables in sync. That was why the change was made in
the upstream, too.

Different segments could have different spclocation values in the
catalog table, though.

>> Currently, even the location of the data directory on each segment is
>> stored in the master's pg_filespace_entry table. Do we need that? A lot of
>> the management tools depend on that currently, so I guess we do. I propose
>> that as we remove the pg_filespace_entry table, we add a field to
>> gp_segment_configuration instead, for the data directory's path. But I
>> don't think the master needs to have the paths of every tablespace on every
>> segment. When you know the path to the main data directory, you can dig
>> into the data directory and look at the tablespace's symlink, to see where
>> it points to.
>
> I like the idea of recording data directory in gp_segment_configuration and
> removing pg_filespace_entry table. Data directory is all that a utility
> should need, unless I'm missing something.

I think some of the tools look at other filespaces too, for various
reasons. I don't remember which utility it was, but I saw some code that
copies a data directory, for example. And it looked into all the
filespaces, to check whether there is enough disk space in each
filespace in the target system. Not sure how critical or robust that
check is, but there is stuff like that.

And the tools to manipulate filespaces themselves access them, of
course, but they'll need to be rewritten or removed anyway.

> Not directly related to the above problem but somewhat related question -
> CreateTablespace() function in upstream creates a
> TABLESPACE_VERSION_DIRECTORY as a subdirectory within the target tablespace
> location. How is that useful during upgrade?

Not sure. I don't think the version directory is used for anything by
the system itself, but it helps to reduce confusion.

In GDPB, we should perhaps also embed the 'dbid' in the path. That would
reduce risk of confusion and overwriting data files, if two segments run
on the same host. But can we assume that the dbid is always set
correctly, including when setting up a new cluster? Where is it set, in
the config file, or where? We should perhaps make sure it's permanently
"burned in" at initdb time, i.e. stored in the control file or some
other file in the data directory.

- Heikki

Heikki Linnakangas

unread,
Dec 20, 2017, 7:15:04 AM12/20/17
to Scott Kahler, Asim Praveen, Greenplum Developers
On 19/12/17 23:03, Scott Kahler wrote:
> I am thinking we would probably need to have the call out for all directory
> locations within the master for any spaces a segment has configured.
>
> Let's say
> primary 1 on seghost1 /data1
> mirror of 1 on seghost2 /data2
>
> decide we need speed so we add additional file/table space as /dataX/fast
> primary 1 on seghost1 /data1/primary/gpseg0 (default)
> /data1/fast/primary/gpseg0 ( fastdisk )
> mirror of 1 on seghost2 /data2/mirror/gpseg0 (default)
> /data2/fast/mirror/gpseg0 ( fastdisk )
>
> if the mirror were to go away and we need to rebuild I don't think we can
> accurately say where the pointer should be from just primary information.
> We could jump out to the primary and see the space and where the spaces
> point, but those don't necessarily reflect where they should point on the
> mirror

Hmm, yeah, you're right. Or we require/allow the user to specify the
location, when they create the new mirror. But yes, some kind of a
permanent record would be handy.

We could have a "gp_tablespace_entry" catalog table, to list the path
for each tablespace on each server. It would be similar to the curent
pg_filespace_entry, but it would work directly with tablespaces, instead
of filespaces. That might make for a relatively smooth transition for
the existing tools.

- Heikki

Karen Huddleston

unread,
May 11, 2018, 6:41:50 PM5/11/18
to Heikki Linnakangas, Scott Kahler, Asim Praveen, Greenplum Developers
It looks like we didn't reach a resolution for the backup and restore case between different versions. With the new tablespace syntax in GPDB6, restoring from gpdb4/5 backups to 6 will cause syntax errors. This also could make it more difficult to migrate between clusters with different numbers of segments since we now specify configuration for each segment in the tablespace command, not the filespace (which we didn't previously back up). All of our previous syntax has been forward compatible, even if the statements have no effect. Is there a plan for how users should migrate systems containing these objects?

- Karen and Chris (backup and restore team)

Robert Eckhardt

unread,
May 11, 2018, 8:11:24 PM5/11/18
to Karen Huddleston, Heikki Linnakangas, Scott Kahler, Asim Praveen, Greenplum Developers
The plan was to finish getting upgrade working between 4 -> 5 before
figuring out anything up to 6.

This will be addressed but you are correct, it has not yet been addressed.

-- Rob

Ashwin Agrawal

unread,
May 11, 2018, 8:32:13 PM5/11/18
to Robert Eckhardt, Karen Huddleston, Heikki Linnakangas, Scott Kahler, Asim Praveen, Greenplum Developers

Karen, Would like to understand this more "This also could make it more difficult to migrate between clusters with different numbers of segments since we now specify configuration for each segment in the tablespace command, not the filespace (which we didn't previously back up)." Please elaborate.

Karen Huddleston

unread,
May 14, 2018, 1:27:37 PM5/14/18
to Ashwin Agrawal, Robert Eckhardt, Heikki Linnakangas, Scott Kahler, Asim Praveen, Greenplum Developers
Sure thing. When we say it could be more difficult, we are more considering additional complexity for the backup and restore tool to handle. Previously, all of the logic for where segments map on the filesystem was stored in a FILESPACE object. We didn't back those up, so users would need to recreate one that worked in their new cluster and then restore would "just work". Now that the logic is stored in a TABLESPACE object, the SQL is tied more closely to the underlying segment configuration. If a user moved from a cluster with 100 segments to one with 50 for example, content99 would be undefined in their new cluster, so I assume they would have errors.
Reply all
Reply to author
Forward
0 new messages