Docstring recommendations for SQLAlchemy models?

Samuel Marks

Jul 28, 2020, 3:57:18 AM
to sqlalchemy
I have created a little tool—at the AST level—to translate between docstrings, methods, classes, and argparse. https://github.com/SamuelMarks/doctrans

Now I'm looking at adding SQLAlchemy support.

Using the mock I've been using throughout, does this look like the 'right' kind of SQLAlchemy code?

# Imports assumed by this snippet (Base is the project's declarative base):
from sqlalchemy import Boolean, Column, JSON, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Model(Base):
    """
    Acquire from the official tensorflow_datasets model zoo, or the ophthalmology focussed ml-prepare library

    :cvar dataset_name: name of dataset. Defaults to mnist
    :cvar tfds_dir: directory to look for models in. Defaults to ~/tensorflow_datasets
    :cvar K: backend engine, e.g., `np` or `tf`. Defaults to np
    :cvar as_numpy: Convert to numpy ndarrays
    :cvar data_loader_kwargs: pass this as arguments to data_loader function
    """
    __tablename__ = 'model'

    dataset_name = Column(String, primary_key=True, default='mnist',
                          comment='name of dataset', doc='name of dataset')
    tfds_dir = Column(String, default='~/tensorflow_datasets',
                      comment='directory to look for models in', doc='directory to look for models in')
    K = Column(String, default='np',
               comment='backend engine, e.g., `np` or `tf`', doc='backend engine, e.g., `np` or `tf`')
    as_numpy = Column(Boolean,
                      comment='Convert to numpy ndarrays', doc='Convert to numpy ndarrays')
    data_loader_kwargs = Column('data_loader_kwargs', JSON,
                                comment='pass this as arguments to data_loader function',
                                doc='pass this as arguments to data_loader function')

    # _return_type = 'Train and tests dataset splits. Defaults to (np.empty(0), np.empty(0))'

    def __repr__(self):
        """
        :returns: String representation of constructed object
        :rtype: ```str```
        """
        return '<Model(dataset_name={self.dataset_name!r},' \
               ' tfds_dir={self.tfds_dir!r},' \
               ' K={self.K!r},' \
               ' as_numpy={self.as_numpy!r},' \
               ' data_loader_kwargs={self.data_loader_kwargs!r}' \
               ')>'.format(self=self)

If not, what should it look like?

Thanks for your suggestions

Samuel Marks

Jan 24, 2021, 12:15:11 AM
to sqlal...@googlegroups.com
It would be great to have some insight here. If I'm going to start generating to and from SQLAlchemy models, then I need to get the column descriptions right.


Mike Bayer

Jan 24, 2021, 2:39:44 PM
to noreply-spamdigest via sqlalchemy
hey there, sorry I hadn't responded to this.

Is your tool reformatting Python code? I don't see anything "wrong" with it, other than that the code looks kind of verbose. This is a matter of personal preference, but if it were me I'd want each attribute to have a string description listed only once in the source code, so that it can be edited directly. Then, as far as how it appears in Sphinx and/or DDL, there would be transparent extensions that make that happen.

Samuel Marks

Jan 24, 2021, 5:21:08 PM
to sqlal...@googlegroups.com
Dear Mike,

My tool works at the AST level and converts between:
  • different docstring formats;
  • having types in the docstring or explicitly annotated;
  • argparse parser-augmenting functions, classes [plain old Python classes], and methods/functions.
The next step is to add support for SQLAlchemy models, routes, and tests.
As you saw from my example code above, the duplication in SQLAlchemy is intense.

Columns can be documented in the docstring, and/or on a column itself with `comment` and/or `doc`.

So if I'm going to generate these SQLAlchemy models, and generate classes etc. from these SQLAlchemy models, then I'll need a clean, consistent way of documenting each model.

What is that way?

Mike Bayer

Jan 24, 2021, 6:59:37 PM
to noreply-spamdigest via sqlalchemy


On Sun, Jan 24, 2021, at 5:20 PM, Samuel Marks wrote:
Dear Mike,

My tool works at the AST level and converts between:
  • different docstring formats;
  • having types in the docstring or explicitly annotated;
  • argparse parser-augmenting functions, classes [plain old Python classes], and methods/functions.
The next step is to add support for SQLAlchemy models, routes, and tests.
As you saw from my example code above, the duplication in SQLAlchemy is intense.

Columns can be documented in the docstring, and/or on a column itself with `comment` and/or `doc`.

So if I'm going to generate these SQLAlchemy models, and generate classes etc. from these SQLAlchemy models, then I'll need a clean, consistent way of documenting each model.

What is that way?

the "comment" field applies to the DDL rendered to the database and is separate from the docstring that would be present in the ORM model.    So there is not currently any means for these to be "unified" because they are two different concerns.

Within the docstrings, I note the use of ":cvar:", which IIUC is a "class variable". SQLAlchemy ORM models do make use of class variables at this level, but they represent SQL expressions, so terms like ":cvar K: backend engine, e.g., `np` or `tf`. Defaults to np" don't necessarily make sense unless they are documented as "instance variables". Then I'm not really sure, from a Sphinx point of view, why one would have both ":cvar:" in the top-level docstring as well as per-attribute docstrings, which is what Column(..., doc="doc") does.

My understanding of Sphinx is that it has a feature that extracts docstrings from source code in order to associate class-attribute-level docstrings. So if I wanted ORM models that were documented, I'd want them to look like this:

class Model(Base):
    """
    Acquire from the official tensorflow_datasets model zoo, or the
    ophthalmology focussed ml-prepare library

    """

    __tablename__ = "model"

    dataset_name = Column(String, primary_key=True, default="mnist")
    """name of dataset"""

    tfds_dir = Column(String, default="~/tensorflow_datasets")
    "directory to look for models in"

    K = Column(String, default="np")
    "backend engine, e.g. np or tf"

That is, like a standard Python class, nothing special used. If that isn't working for docs tools, then the issue has to be fixed at that level.

I'm not actually sure why Column() has a "doc" keyword, given that Sphinx should be able to scan these from the source. The "doc" keyword assigns the given docstring to the "__doc__" attribute of the descriptor, but IMO this should not be necessary, unless Sphinx is still buggy in this regard. We added that parameter many years ago, perhaps to work around limitations in Sphinx; not really sure.
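For what it's worth, a quick sketch of what that assignment means at runtime, reusing the Model class from the original post (the strings are the ones passed to doc= and comment= there):

# Given the Model class from the original post:
print(Model.dataset_name.__doc__)                      # 'name of dataset' (from doc=)
print(Model.dataset_name.property.columns[0].doc)      # the same string, on the Column itself
print(Model.dataset_name.property.columns[0].comment)  # 'name of dataset' (from comment=), destined for DDL only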

Samuel Marks

Jan 24, 2021, 7:53:25 PM
to sqlal...@googlegroups.com
Again, my goal isn't related to Sphinx (although generating nice documentation is, of course, a nice-to-have).

One advantage of having the generated SQL code be commented is that I could write parsers that go from SQL files to SQLAlchemy models, complete with docs.

The issue with having loose strings in the middle of a class like you've done is that there is no built-in semantics, and it'll break all existing linters. Sure, I could extend the linters and traverse the body of the class, inferring out the semantics. But that would be incredibly non-standard. I'm trying to generate code that could be considered the standard.

So comment, doc, or an ivar/cvar [Sphinx treats these as the same: https://www.sphinx-doc.org/en/master/usage/restructuredtext/domains.html#info-field-lists] is what I'll generate to/from. I can generate them all, but that would be hard for a human to maintain. The idea with the generated code is that it needs to be human maintainable, as well as machine maintainable.

(all my code generators go both ways, so you can edit the generated [cli] code and generate [class] code from it, and edit the [class] code and generate [cli] code from it)

Mike Bayer

Jan 24, 2021, 10:14:56 PM
to noreply-spamdigest via sqlalchemy


On Sun, Jan 24, 2021, at 7:53 PM, Samuel Marks wrote:
Again, my goal isn't related to Sphinx (although generating nice documentation is of course, a nice-to-have).

One advantage of having the generated SQL code be commented is that I could write parsers that go from SQL files to SQLAlchemy models, complete with docs.

Right now people use sqlacodegen for this: https://pypi.org/project/sqlacodegen/ , although it doesn't start from a plain SQL file; the SQL would have to be run into a database such as SQLite first.
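A rough sketch of that workflow, assuming a plain schema.sql file and a throwaway SQLite database (file names are illustrative):

import sqlite3
import subprocess

# Load the plain SQL schema into a scratch SQLite database...
with sqlite3.connect('scratch.db') as conn:
    with open('schema.sql') as f:
        conn.executescript(f.read())

# ...then point sqlacodegen at it to emit declarative SQLAlchemy models on stdout.
subprocess.run(['sqlacodegen', 'sqlite:///scratch.db'], check=True)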



The issue with having loose strings in the middle of a class like you've done is that there is no built-in semantics, and it'll break all existing linters.

Docstrings beneath attributes are known as "attribute docstrings" and are explicitly mentioned in PEP 257 (https://www.python.org/dev/peps/pep-0257/#what-is-a-docstring):

"String literals occurring elsewhere in Python code may also act as documentation. They are not recognized by the Python bytecode compiler and are not accessible as runtime object attributes (i.e. not assigned to __doc__), but two types of extra docstrings may be extracted by software tools:
  1. String literals occurring immediately after a simple assignment at the top level of a module, class, or __init__ method are called "attribute docstrings".
  2. String literals occurring immediately after another docstring are called "additional docstrings".
"

PEP 257 led to PEP 258 (https://www.python.org/dev/peps/pep-0258/#attribute-docstrings), which was rejected and so never became part of Python, but docutils is the standard tool used for documenting Python and is used to generate Python's own documentation:

"A string literal immediately following an assignment statement is interpreted by the docstring extraction machinery as the docstring of the target of the assignment statement, under the following conditions <etc>"

They are the only technique that is usable in all cases, since the attribute may refer to a value such as ``None``, a singleton that can't carry a per-attribute ``__doc__`` of its own. All Python linters I'm familiar with accept this style of docstring, formatters such as Black will format them, and this is what I use in all my projects, including SQLAlchemy itself. If you have something like "my_constant = 1" as your attribute, it's really your only option other than a pound-sign comment, which seems to be more common from my googling around, but is also not runtime-discoverable.

There is of course the disadvantage that the bytecode doesn't have access to them but this is a limitation of Python itself that most projects I'm familiar with have learned to live with. 

To the extent that people talk about "how should we document attributes?", a string literal below the value is usually what you'll find; I googled a bit and found this style guide for a major observatory, for example: https://developer.lsst.io/python/numpydoc.html#py-docstring-attribute-constants-structure .
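Since the tool in question already works at the AST level, pulling these attribute docstrings out is mechanical; a rough sketch (the helper name is made up for illustration, and only simple assignments are handled):

import ast
from typing import Dict

def attribute_docstrings(class_source: str) -> Dict[str, str]:
    """Map attribute names to the string literal that immediately follows
    their assignment inside a class body (PEP 257 "attribute docstrings")."""
    tree = ast.parse(class_source)
    docs: Dict[str, str] = {}
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        for stmt, nxt in zip(node.body, node.body[1:]):
            if (
                isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)
                and isinstance(nxt, ast.Expr)
                and isinstance(nxt.value, ast.Constant)
                and isinstance(nxt.value.value, str)
            ):
                docs[stmt.targets[0].id] = nxt.value.value
    return docs

# e.g. run against the Model snippet above:
# attribute_docstrings(source) -> {'dataset_name': 'name of dataset', ...}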

Sure, I could extend the linters and traverse the body of the class, inferring out the semantics. But that would be incredibly non-standard. I'm trying to generate code that could be considered the standard.

"attribute docstrings" are the standard as discussed in PEP-257, and there is no competing standard of any kind that I'm aware of.

So comment, doc, or an ivar/cvar [Sphinx treats these as the same: https://www.sphinx-doc.org/en/master/usage/restructuredtext/domains.html#info-field-lists] is what I'll generate to/from. I can generate them all, but that would be hard for a human to maintain. The idea with the generated code is that it needs to be human maintainable, as well as machine maintainable.

If you insist upon having runtime discoverability, then you would use the "doc" parameter of Column and make use of a declarative base class that copies "doc" into "comment" on the Column objects, but from my end I would never call this a "standard". There's no standard here, unfortunately. If I had to pick one, I'd want string literals underneath attributes, and then Python would include a means of collecting these in a similar way to how PEP-526 variable annotations are collected.
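A rough sketch of that copy-doc-into-comment idea, done here with a mapper event rather than a custom base class, assuming 1.x-style declarative (on 2.0 you'd likely target the Mapper class instead of the mapper() function); just one way to do it, not an official recipe:

from sqlalchemy import event
from sqlalchemy.orm import mapper


@event.listens_for(mapper, 'mapper_configured')
def _copy_doc_to_comment(mapper_, cls):
    # As each class is mapped, mirror every Column's doc= into comment= (when no
    # comment was given), so the same text reaches both Sphinx and the emitted DDL.
    for column in mapper_.columns:
        if column.doc is not None and column.comment is None:
            column.comment = column.doc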

The lack of runtime discoverability for attribute docstrings is an "unsolved" problem in Python itself, as evidenced by the fact that Python's very own "class declarative" system, dataclasses, described at PEP 557 (https://www.python.org/dev/peps/pep-0557/), has absolutely nothing to say about how to inline-document declared attributes, nor does the actual field() construct (https://docs.python.org/3/library/dataclasses.html#dataclasses.field) have any notion of a "doc" parameter; though as always, you can put your own docstrings into the "metadata" collection if you wanted, where once again there's no "standard" of any kind.

This is definitely a bit of a problem in Python, and I welcome efforts towards a standard means of runtime discoverability of attribute assignments, but as a wise leader once said, it's "beyond my pay grade" as far as SQLAlchemy is concerned. Anything we propose, such as Column(..., doc=...), is just an API feature that projects may use if they wish for runtime discoverability, but in my own experience runtime discoverability is not necessary, since the two purposes I have for docstrings are a. reading them in source code and b. generating docs with tools that know how to read them.

Samuel Marks

Jan 25, 2021, 1:14:50 AM
to sqlal...@googlegroups.com
Good clarification. Didn't know about those PEPs.

So you're thinking that, for runtime discoverability, I should use `doc` first and fall back to `comment`?

My idea with this project is to translate between:
↔ SQL
↔ Python (public served API, models, tests)
↔ Rust (public served API, models, tests)
↔ TypeScript (web)
↔ Swift (iOS)
↔ Java/Kotlin (Android)

So naturally every layer needs to have sufficient information to recreate the semantics for every other layer. SQL has the strongest types; for the rest, a small new syntax will need to be created, e.g., to specify that a property is a PK of VARCHAR(20). With the exception of this, however, everything else should be standard, and it should be easy for any developer to jump in and develop following best practices in their chosen language(s) and framework(s).

Any further tips here would be appreciated.

Mike Bayer

Jan 25, 2021, 9:39:06 AM
to noreply-spamdigest via sqlalchemy


On Mon, Jan 25, 2021, at 1:14 AM, Samuel Marks wrote:
Good clarification. Didn't know about those PEPs.

So you're thinking that, for runtime discoverability, I should use `doc` first and fall back to `comment`?

My idea with this project is to translate between:
↔ SQL
↔ Python (public served API, models, tests)
↔ Rust (public served API, models, tests)
↔ TypeScript (web)
↔ Swift (iOS)
↔ Java/Kotlin (Android)

So naturally every layer needs to have sufficient information to recreate the semantics for every other layer. SQL has the strongest types; for the rest, a small new syntax will need to be created, e.g., to specify that a property is a PK of VARCHAR(20). With the exception of this, however, everything else should be standard, and it should be easy for any developer to jump in and develop following best practices in their chosen language(s) and framework(s).

I don't totally know what you're targeting as far as those environments, as in, whether those environments would need to be tweaked to work with your tool specifically or not; I would expect that they definitely would be for non-trivial environments. Assumptions like "SQL has the strongest types" you will find do not always hold up, as folks will be doing things like putting JSON or UUID values into VARCHAR on the database, backends like Oracle that have no "boolean" datatype will have an integer in the database, etc.

Certainly, if you're reading a Python SQLAlchemy model for comments, having the users use "doc" would be the most direct means of being able to discover a comment for that attribute. But people will have to know to use this to be compatible with your tool; for example, a model like this: https://github.com/openstack/neutron/blob/master/neutron/db/models/agent.py#L21 uses "#" comments, and you can't read those, etc.

Within the realm of "create a web API for any model", there's a lot of work done on that already, so you'd be in a crowded space; I'd look around at what's been done in that area to see what bases should be covered.

Samuel Marks

Jan 25, 2021, 3:11:04 PM
to sqlal...@googlegroups.com
Thanks for your insights; I'll take them on board and update here when an alpha is ready.

SAMUEL MARKS
Sydney Medical School | Westmead Institute for Medical Research | https://linkedin.com/in/samuelmarks
Director | Offscale.io | Sydney Scientific Pty Ltd
THE UNIVERSITY OF SYDNEY
