Google Groups no longer supports new usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[DRIVER SUBMISSION] DRBD wants to go mainline

5 views
Skip to the first unread message

Lars Ellenberg

unread,
21 Jul 2007, 16:58:3621/07/2007
to Jens Axboe, Andrew Morton, lkml, drbd...@lists.linbit.com

DRBD wants to go mainline.
please have a look at the "for-linus" branch of
git://git.drbd.org/home/git/linux-drbd.git.

Longer story:

Hi there,

for lkml readers who don't know about DRBD yet:
DRBD is the distributed replicated block device,
a stacked block device driver as developed mainly
by Lars Ellenberg and Philipp Reisner
both employed at Linbit.

We implement shared-disk semantics in a shared-nothing cluster.

yes, we even support cluster file systems (OCFS2, GFS2)
[support for this is somewhat clumsy, still,
mainly userland/setup/convenience issues.
but that will improve.]

to actually use it you will need some userland tools
(drbdadm, drbdmeta, drbdsetup).

most unfortunate current limitations:
we support only 2 nodes directly;
but we are working on that, too :)

I'm going to give a presentation/BoF about it
at linux-conf.eu in September.

Currently we are at "stable version 8.0.y" (y == 4 right now,
8.0.5 to be released within a few days.

we have a project home page http://www.drbd.org
we have some mention on the linux-HA project http://www.linux-ha.org/DRBD
(both are slightly out-dated, unfortunately...)
our subversion repository (yeah, sorry, there has been no git back then)
is at http://svn.drbd.org/drbd/branches/drbd-8.0

and, we have a user base...
[drbd-user readers who are in a position to do so:
please help us get this into mainline! ]

Now, finally, we have a git repository of the kernel module part, too:
git://git.drbd.org/home/git/linux-drbd.git

it has two branches currently,
master, which tracks ...torvalds/linux-2.6.git, and
for-linus, which tracks the cleaned up patch meant for inclusin in
mainline.
so if interessted please do
git fetch git://git.drbd.org/home/git/linux-drbd.git +for-linus:drbd-8.0
to get a local branch named drbd-8.0 where you can inspect it.
see below for a diffstat of master..for-linus.

Jens, Andrew, anyone: please review,
and give me advice how to proceed from here.

technically, would you prefer
frequent rebase of the "for-linus" branch,
or rather merges back and forth?

Cheers,

Lars


The only not drbd specific files touched
are listed explicitly here:

=====================
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index a4a3119..76f84bc 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -272,6 +272,8 @@ config BLK_DEV_CRYPTOLOOP
instead, which can be configured to be on-disk compatible with the
cryptoloop device.

+source "drivers/block/drbd/Kconfig"
+
config BLK_DEV_NBD
tristate "Network block device support"
depends on NET
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 819c829..b27e16a 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_BLK_DEV_UB) += ub.o

obj-$(CONFIG_XEN_BLKDEV_FRONTEND) += xen-blkfront.o
obj-$(CONFIG_LGUEST_GUEST) += lguest_blk.o
+obj-$(CONFIG_BLK_DEV_DRBD) += drbd/
diff --git a/include/linux/major.h b/include/linux/major.h
index 0cb9805..2dbd7c1 100644
--- a/include/linux/major.h
+++ b/include/linux/major.h
@@ -145,6 +145,8 @@
#define UNIX98_PTY_MAJOR_COUNT 8
#define UNIX98_PTY_SLAVE_MAJOR (UNIX98_PTY_MASTER_MAJOR+UNIX98_PTY_MAJOR_COUNT)

+#define DRBD_MAJOR 147
+
#define RTF_MAJOR 150
#define RAW_MAJOR 162
=====================

(yes, block major 147 is assigned by LANANA,
as documented in Documentation/devices.txt already)

git-diff master..for-linus | diffstat
drivers/block/Kconfig | 2
drivers/block/Makefile | 1
drivers/block/drbd/Kconfig | 32
drivers/block/drbd/Makefile | 7
drivers/block/drbd/drbd_actlog.c | 1456 ++++++++++++++++
drivers/block/drbd/drbd_bitmap.c | 1184 +++++++++++++
drivers/block/drbd/drbd_buildtag.c | 6
drivers/block/drbd/drbd_int.h | 1894 ++++++++++++++++++++
drivers/block/drbd/drbd_main.c | 3249 +++++++++++++++++++++++++++++++++++
drivers/block/drbd/drbd_nl.c | 1847 ++++++++++++++++++++
drivers/block/drbd/drbd_proc.c | 267 ++
drivers/block/drbd/drbd_receiver.c | 3356 +++++++++++++++++++++++++++++++++++++
drivers/block/drbd/drbd_req.c | 1169 ++++++++++++
drivers/block/drbd/drbd_req.h | 320 +++
drivers/block/drbd/drbd_strings.c | 106 +
drivers/block/drbd/drbd_worker.c | 1046 +++++++++++
drivers/block/drbd/drbd_wrappers.h | 144 +
drivers/block/drbd/lru_cache.c | 370 ++++
drivers/block/drbd/lru_cache.h | 147 +
include/linux/drbd.h | 284 +++
include/linux/drbd_config.h | 55
include/linux/drbd_limits.h | 124 +
include/linux/drbd_nl.h | 105 +
include/linux/drbd_tag_magic.h | 78
include/linux/major.h | 2
25 files changed, 17251 insertions(+)

--
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Jan Engelhardt

unread,
21 Jul 2007, 17:18:0521/07/2007
to Lars Ellenberg, Jens Axboe, Andrew Morton, lkml, drbd...@lists.linbit.com

On Jul 21 2007 22:38, Lars Ellenberg wrote:
>
>We implement shared-disk semantics in a shared-nothing cluster.

If nothing is shared, the disk is not shared, but got shared-disk
semantics? A little confusing.


Jan
--

Jesper Juhl

unread,
21 Jul 2007, 17:35:2921/07/2007
to Jens Axboe, Andrew Morton, lkml, drbd...@lists.linbit.com
On 21/07/07, Lars Ellenberg <la...@linbit.com> wrote:
>
> DRBD wants to go mainline.
> please have a look at the "for-linus" branch of
> git://git.drbd.org/home/git/linux-drbd.git.
>

I just fetched yourt branch and had a (very) quick look.

Some comments.

Try running your patches through scripts/checkpatch.pl - it shows a
lot of style problems. Fixing those up would probably be a good step
towards mainline. Remember, checkpatch.pl is not the law - in some
situations what it complains about can be totally valid, but usually
what it highlights is stuff that it is prefered to clean up (I'd say
especially now prior to inclusion so we don't have to do it post
inclusion).

A few of your files suffer from trailing whitespace at the end of lines.

It's interresting to build your code with -W. It shows up quite a few
signed vs unsigned comparisons, unused parameters, expressions that
are always false, etc. Although I doubt anyone is going to complain
too loudly about stuff that only shows up with "-W", cleaning it up
can't hurt.


--
Jesper Juhl <jespe...@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

Lars Ellenberg

unread,
21 Jul 2007, 18:43:3821/07/2007
to Jan Engelhardt, Jens Axboe, Andrew Morton, lkml
On Sat, Jul 21, 2007 at 11:17:43PM +0200, Jan Engelhardt wrote:
>
> On Jul 21 2007 22:38, Lars Ellenberg wrote:
> >
> >We implement shared-disk semantics in a shared-nothing cluster.
>
> If nothing is shared, the disk is not shared, but got shared-disk
> semantics? A little confusing.

Think of it as RAID1 over TCP.
Typically you have one Node in Primary, the other as Secondary,
replication target only.
But you can also have both Active, for use with a cluster file system.

So the semantics are like you have
two (to come: N) nodes and a shared disk.
only that there is not one shared disk,
but two (N) replicated ones.

btw, regarding linux kernel CodingStyle issues.
scripts/checkpatch.pl reports 2000+ lines of complaints.
I'm working on it now, it is mostly whitespace (lame excuse: phil grew
up with emacs and gnu coding style, sorry for that).

--
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :

__
please use the "List-Reply" function of your email client.

Jens Axboe

unread,
22 Jul 2007, 01:53:0222/07/2007
to Andi Kleen, Lars Ellenberg, Andrew Morton, lkml, drbd...@lists.linbit.com
On Sun, Jul 22 2007, Andi Kleen wrote:
> Lars Ellenberg <la...@linbit.com> writes:
> >
> > Jens, Andrew, anyone: please review,
> > and give me advice how to proceed from here.
>
> The standard procedure would be to post all the source code in logical
> pieces on the list for review. Then iterate until all comments are
> addressed.

Yep, cleanup the style issues (that make sense) from checkpatch and then
psot as a series of patches that can be reviewed. Linking to a git tree
wont get you very far.

--
Jens Axboe

Lars Ellenberg

unread,
22 Jul 2007, 02:09:4722/07/2007
to Andi Kleen, Jens Axboe, Andrew Morton, lkml, drbd...@lists.linbit.com
On Sun, Jul 22, 2007 at 01:50:09AM +0200, Andi Kleen wrote:
> Lars Ellenberg <la...@linbit.com> writes:
> >
> > Jens, Andrew, anyone: please review,
> > and give me advice how to proceed from here.
>
> The standard procedure would be to post all the source code in logical
> pieces on the list for review. Then iterate until all comments are addressed.
>
> -Andi

well.
there is just one logical piece, the drbd module...
it is all in its own subdirectory.
it does not really touch anything else.
I don't think it is feasible to split it further.

Lars

Sam Ravnborg

unread,
22 Jul 2007, 04:51:3522/07/2007
to Andi Kleen, Jens Axboe, Andrew Morton, lkml, drbd...@lists.linbit.com
> there is just one logical piece, the drbd module...
> it is all in its own subdirectory.

For review purposes the split is often file based.
Something along the lines of:

Makefile + Kconfig
include/linux/drdb* <= why 5 files for one module?
drdb_main.c
drdb_req.{c+h}
etc..

So essentially smaller pieces that can be revied but when ready for inclusion
it should go in as _one_ commit to avoid breaking bisect.

But please finish off all the CodingStyle issue before submission so the
review can concentrate on the actual content.

Quickly browsing drdb_int.h shows that there are a few issues yet to sort out.
typedef usage, funny macros, volatile etc.

Sam

Pekka Enberg

unread,
22 Jul 2007, 05:01:0622/07/2007
to Andi Kleen, Jens Axboe, Andrew Morton, lkml, drbd...@lists.linbit.com
Hi Lars,

On 7/22/07, Lars Ellenberg <lars.el...@linbit.com> wrote:
> there is just one logical piece, the drbd module...
> it is all in its own subdirectory.
> it does not really touch anything else.
> I don't think it is feasible to split it further.

I don't know your code at all but the diffstat suggests that "lru
cache", "worker", "receiver", and "bitmap" could possibly be made
independent patches with all Kconfig and Makefile changes appearing as
a final separate patch (so it's git-bisectable).

This is 17 KLOC of new code. You really want to make it easy to review
if you're serious about getting it merged.

Pekka

Pekka Enberg

unread,
22 Jul 2007, 05:05:5222/07/2007
to Sam Ravnborg, Andi Kleen, Jens Axboe, Andrew Morton, lkml, drbd...@lists.linbit.com
Hi Sam,

On 7/22/07, Sam Ravnborg <s...@ravnborg.org> wrote:
> So essentially smaller pieces that can be revied but when ready for inclusion
> it should go in as _one_ commit to avoid breaking bisect.

Or alternatively, put all Kconfig and Makefile changes in the last patch.

Jan Engelhardt

unread,
22 Jul 2007, 05:07:2922/07/2007
to Lars Ellenberg, Jens Axboe, Andrew Morton, lkml

On Jul 22 2007 00:43, Lars Ellenberg wrote:
>On Sat, Jul 21, 2007 at 11:17:43PM +0200, Jan Engelhardt wrote:
>> On Jul 21 2007 22:38, Lars Ellenberg wrote:
>> >
>> >We implement shared-disk semantics in a shared-nothing cluster.
>>
>> If nothing is shared, the disk is not shared, but got shared-disk
>> semantics? A little confusing.
>
>Think of it as RAID1 over TCP.

And what does it do better than raid1-over-NBD? (Which is already N-disk,
and, logically, seems to support cluster filesystems)

>Typically you have one Node in Primary, the other as Secondary,
>replication target only.
>But you can also have both Active, for use with a cluster file system.
>
>So the semantics are like you have
>two (to come: N) nodes and a shared disk.
>only that there is not one shared disk,
>but two (N) replicated ones.
>
>btw, regarding linux kernel CodingStyle issues.
>scripts/checkpatch.pl reports 2000+ lines of complaints.
>I'm working on it now, it is mostly whitespace (lame excuse: phil grew
>up with emacs and gnu coding style, sorry for that).
>

Jan
--

Tomasz Chmielewski

unread,
22 Jul 2007, 05:55:0922/07/2007
to LKML, jen...@computergmbh.de
Jan Engelhardt wrote

> On Jul 22 2007 00:43, Lars Ellenberg wrote:

>>Think of it as RAID1 over TCP.
>
> And what does it do better than raid1-over-NBD? (Which is already N-disk,
> and, logically, seems to support cluster filesystems)

I don't know about DRDB, but NBD doesn't handle network disconnections
at all (well, almost).

So basically, disconnect a NBD-connected system for a while (switch,
cabling problem, operator error etc.), and you need lots of effort,
perhaps restarts, to get the things to a functioning state (devices
offlined, kicked out etc.).

I wouldn't call such raid-over-NBD setup reliable.


A better question would be: what does it do better than raid1-over-iSCSI?

iSCSI can recover from disconnections very well when configured
properly; but when a disconnection is in place, most of the system will
just "hang/freeze" (that is, from the user perspective - the system will
be waiting for the I/O to complete, until the systems are connected again).


A brief reading of "official DRBD FAQ" didn't give me an answer to that
problem.


--
Tomasz Chmielewski
http://wpkg.org

Lars Ellenberg

unread,
22 Jul 2007, 09:58:3622/07/2007
to Jens Axboe, Andi Kleen, Andrew Morton, lkml
On Sun, Jul 22, 2007 at 07:52:36AM +0200, Jens Axboe wrote:
> On Sun, Jul 22 2007, Andi Kleen wrote:
> > Lars Ellenberg <la...@linbit.com> writes:
> > >
> > > Jens, Andrew, anyone: please review,
> > > and give me advice how to proceed from here.
> >
> > The standard procedure would be to post all the source code in logical
> > pieces on the list for review. Then iterate until all comments are
> > addressed.
>
> Yep, cleanup the style issues (that make sense) from checkpatch and then
> psot as a series of patches that can be reviewed. Linking to a git tree
> wont get you very far.

it got me far enough, for the first try, anyways :-)
I did not spam the lkml with patches, and still got some very useful
advice (no idea how I could overlook the checkpatch.pl complaints).

If each patch of a series needs to compile and work,
there will probably only one 17kB patch...
it is difficult to split one module into a series of patches.
Or am I missing something?

may I bother you again to comment on this, please:

I am now down to
5 CHECK: memory barrier without comment
these are directly connected with our homegrown kernel thread stuff.
will vanish as soon as we convert to kthread.h API.

4 CHECK: spinlock_t definition without comment
3 CHECK: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
3 CHECK: if this code is redundant consider removing it
2 CHECK: Use #include <linux/types.h> instead of <asm/types.h>
need to check those, still.

72 ERROR: braces {} are not necessary for single statement blocks
one branch needs them, the othe does not.
what now? CodingStyle and checkpatch.pl disagree.

13 ERROR: no space between function name and open parenthesis '('
this is about our ERR_IF() macro...
suggestions, anyone?
do need I to explicitly write these out?

8 ERROR: Macros with multiple statements should be enclosed in a do - while loop
1 ERROR: Macros with complex values should be enclosed in parenthesis
these are "netlink packet definition language" in drbd_nl.h,
which is sort of clean, and preprocessor magic in drbd_nl.c.
suggestions how to handle this cleanly,
without making it more ugly?
autogenerate code by other means?
write it out by hand, and lose the nice and clean drbd_nl.h?

1 ERROR: Don't use kernel_thread(): see Documentation/feature-removal-schedule.txt
yes. working on that.
will take some days, though.

1 ERROR: Missing Signed-off-by: line(s)

94 WARNING: declaring multiple variables together should be avoided
int snr, enr;
does this really need to become two lines?

33 WARNING: line over 80 characters
hmmm. get more ugly...
probably need some helper functions and temp variables?

4 WARNING: multiple assignments should be avoided
x = y = 0;
is sometimes a convenient initialization.
you don't like it?

--
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :

__
please use the "List-Reply" function of your email client.

Lars Ellenberg

unread,
22 Jul 2007, 10:04:3322/07/2007
to Jan Engelhardt, Jens Axboe, Andrew Morton, lkml
On Sun, Jul 22, 2007 at 11:06:59AM +0200, Jan Engelhardt wrote:
>
> On Jul 22 2007 00:43, Lars Ellenberg wrote:
> >On Sat, Jul 21, 2007 at 11:17:43PM +0200, Jan Engelhardt wrote:
> >> On Jul 21 2007 22:38, Lars Ellenberg wrote:
> >> >
> >> >We implement shared-disk semantics in a shared-nothing cluster.
> >>
> >> If nothing is shared, the disk is not shared, but got shared-disk
> >> semantics? A little confusing.
> >
> >Think of it as RAID1 over TCP.
>
> And what does it do better than raid1-over-NBD? (Which is already N-disk,
> and, logically, seems to support cluster filesystems)

DRBD has built-in logic
to track which copy of the data is the most recent.
DRBD has resource-level-fencing in place (though we can improve on that).
DRBD deals with concurrent writes of the participating nodes correctly.

"N-disk" is not the question,
you can stack drbd on top of md, lvm, anything.
N-nodes is the interessting thing.

--
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :

Sam Ravnborg

unread,
22 Jul 2007, 10:48:3422/07/2007
to Lars Ellenberg, Jens Axboe, Andi Kleen, Andrew Morton, lkml
On Sun, Jul 22, 2007 at 03:58:14PM +0200, Lars Ellenberg wrote:
> On Sun, Jul 22, 2007 at 07:52:36AM +0200, Jens Axboe wrote:
> > On Sun, Jul 22 2007, Andi Kleen wrote:
> > > Lars Ellenberg <la...@linbit.com> writes:
> > > >
> > > > Jens, Andrew, anyone: please review,
> > > > and give me advice how to proceed from here.
> > >
> > > The standard procedure would be to post all the source code in logical
> > > pieces on the list for review. Then iterate until all comments are
> > > addressed.
> >
> > Yep, cleanup the style issues (that make sense) from checkpatch and then
> > psot as a series of patches that can be reviewed. Linking to a git tree
> > wont get you very far.
>
> it got me far enough, for the first try, anyways :-)
> I did not spam the lkml with patches, and still got some very useful
> advice (no idea how I could overlook the checkpatch.pl complaints).
>
> If each patch of a series needs to compile and work,
> there will probably only one 17kB patch...
Thats not needed for reviewing..

> 33 WARNING: line over 80 characters
> hmmm. get more ugly...
> probably need some helper functions and temp variables?

Several people question this check. It gets ugly on text-mode but for
the price of readability for the rest.
So I suggest concentrate on other matters, but keep an eye for the
"many small functions that do exactly _one_ thing and few local variables"
approach.

Sam

Andi Kleen

unread,
22 Jul 2007, 10:56:5722/07/2007
to Lars Ellenberg, Jens Axboe, Andi Kleen, Andrew Morton, lkml
On Sun, Jul 22, 2007 at 03:58:14PM +0200, Lars Ellenberg wrote:
> On Sun, Jul 22, 2007 at 07:52:36AM +0200, Jens Axboe wrote:
> > On Sun, Jul 22 2007, Andi Kleen wrote:
> > > Lars Ellenberg <la...@linbit.com> writes:
> > > >
> > > > Jens, Andrew, anyone: please review,
> > > > and give me advice how to proceed from here.
> > >
> > > The standard procedure would be to post all the source code in logical
> > > pieces on the list for review. Then iterate until all comments are
> > > addressed.
> >
> > Yep, cleanup the style issues (that make sense) from checkpatch and then
> > psot as a series of patches that can be reviewed. Linking to a git tree
> > wont get you very far.
>
> it got me far enough, for the first try, anyways :-)
> I did not spam the lkml with patches, and still got some very useful
> advice (no idea how I could overlook the checkpatch.pl complaints).
>
> If each patch of a series needs to compile and work,

That's not needed for review. The final submission can then be
in a single patch. It's just impossible to read a really large
single patch. Ideally you space the posting also out over time.to not
tax the review resources too much.

Another standard trick for compileable patch series is to only add the
Kconfig at the end.

> 13 ERROR: no space between function name and open parenthesis '('
> this is about our ERR_IF() macro...
> suggestions, anyone?
> do need I to explicitly write these out?

Preferable yes.

>
> 8 ERROR: Macros with multiple statements should be enclosed in a do - while loop
> 1 ERROR: Macros with complex values should be enclosed in parenthesis
> these are "netlink packet definition language" in drbd_nl.h,
> which is sort of clean, and preprocessor magic in drbd_nl.c.
> suggestions how to handle this cleanly,
> without making it more ugly?
> autogenerate code by other means?
> write it out by hand, and lose the nice and clean drbd_nl.h?

No, you can ignore it then.

> 94 WARNING: declaring multiple variables together should be avoided
> int snr, enr;
> does this really need to become two lines?

Probably not.

>
> 33 WARNING: line over 80 characters
> hmmm. get more ugly...
> probably need some helper functions and temp variables?

Yes smaller functions are better in general.

-Andi

Satyam Sharma

unread,
22 Jul 2007, 11:31:3322/07/2007
to Lars Ellenberg, Jens Axboe, Andi Kleen, Andrew Morton, lkml
On 7/22/07, Lars Ellenberg <lars.el...@linbit.com> wrote:
> On Sun, Jul 22, 2007 at 07:52:36AM +0200, Jens Axboe wrote:
> > [...]

> > Yep, cleanup the style issues (that make sense) from checkpatch and then
> > psot as a series of patches that can be reviewed. Linking to a git tree
> > wont get you very far.
>
> it got me far enough, for the first try, anyways :-)
> I did not spam the lkml with patches, and still got some very useful
> advice (no idea how I could overlook the checkpatch.pl complaints).
>
> If each patch of a series needs to compile and work,
> there will probably only one 17kB patch...
> it is difficult to split one module into a series of patches.
> Or am I missing something?

If not too much bother, you could even present the patchset in a way
that reflects how the driver code evolved to its present state in v8.0.4.

> may I bother you again to comment on this, please:
>
> I am now down to
> 5 CHECK: memory barrier without comment
> these are directly connected with our homegrown kernel thread stuff.
> will vanish as soon as we convert to kthread.h API.
>
> 4 CHECK: spinlock_t definition without comment
> 3 CHECK: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
> 3 CHECK: if this code is redundant consider removing it
> 2 CHECK: Use #include <linux/types.h> instead of <asm/types.h>
> need to check those, still.
>
> 72 ERROR: braces {} are not necessary for single statement blocks
> one branch needs them, the othe does not.
> what now? CodingStyle and checkpatch.pl disagree.

Submit a patch to checkpatch.pl (or preferably CodingStyle, I dislike
with the way it blatantly disregards K&R convention in this matter :-)

> 13 ERROR: no space between function name and open parenthesis '('
> this is about our ERR_IF() macro...
> suggestions, anyone?

There shouldn't be any space between function/macro name and open
parenthesis. However, there *must* be a space between C language
keyword (if, while, for) and open parenthesis.

> do need I to explicitly write these out?

Yup, do consider that. Also consider making them functions. Macros
are _generally_ evil and *always* horrible.

> 8 ERROR: Macros with multiple statements should be enclosed in a do - while loop

You don't want to break if-else constructs.

> 1 ERROR: Macros with complex values should be enclosed in parenthesis

I'm not sure know when / why checkpatch.pl reports this. I don't want to
pull your tree so can't check for myself, but note some obvious rules:

1. Macro definition must always include parenthesis around the entire
definition (unless it's a do { somehting; } while (0) kind of macro).
2. Use parenthesis around wherever it uses the passed argument(s).

> these are "netlink packet definition language" in drbd_nl.h,
> which is sort of clean, and preprocessor magic in drbd_nl.c.
> suggestions how to handle this cleanly,
> without making it more ugly?
> autogenerate code by other means?
> write it out by hand, and lose the nice and clean drbd_nl.h?

1. Open-code it if it is trivial enough and there's no point
obfuscating anything.
2. Consider making the macro a function. Probably inline, most probably not.
3. Macros must NOT evaluate the passed argument twice -- consider the
possible damages of someone who doesn't know it's a macro and therefore
passing in an expression that has side-effects (*boom, crash*).

> 1 ERROR: Don't use kernel_thread(): see Documentation/feature-removal-schedule.txt
> yes. working on that.
> will take some days, though.
>
> 1 ERROR: Missing Signed-off-by: line(s)

Patches submitted to LKML should conform to guidelines in:

* Documentation/SubmittingPatches in kernel sources
* http://www.zipworld.com.au/~akpm/linux/patches/stuff/tpp.txt

> 94 WARNING: declaring multiple variables together should be avoided
> int snr, enr;
> does this really need to become two lines?

No, if these are some unimportant temporary/automatic variables in
some function.

Yes, if they are members of some struct (also comment them in
this case -- in fact give full kernel-doc style comments).

> 33 WARNING: line over 80 characters
> hmmm. get more ugly...
> probably need some helper functions and temp variables?

Yes, do consider breaking functions that go significantly beyond
80 lines into smaller ones. If it's just 4-5 columns beyond 80, and
breaking the line would actually hurt readability, don't bother.

> 4 WARNING: multiple assignments should be avoided
> x = y = 0;
> is sometimes a convenient initialization.
> you don't like it?

Yes, we don't appreciate such usage at all. Please fix this.

Satyam

Satyam Sharma

unread,
22 Jul 2007, 11:50:4222/07/2007
to Lars Ellenberg, Jens Axboe, Andi Kleen, Andrew Morton, lkml
On 7/22/07, Satyam Sharma <satyam...@gmail.com> wrote:
> On 7/22/07, Lars Ellenberg <lars.el...@linbit.com> wrote:
> [...]

> > 8 ERROR: Macros with multiple statements should be enclosed in a do - while loop
>
> You don't want to break if-else constructs.
>
> > 1 ERROR: Macros with complex values should be enclosed in parenthesis
>
> I'm not sure know when / why checkpatch.pl reports this. I don't want to
> pull your tree so can't check for myself, but note some obvious rules:
>
> 1. Macro definition must always include parenthesis around the entire
> definition (unless it's a do { somehting; } while (0) kind of macro).
> 2. Use parenthesis around wherever it uses the passed argument(s).
>
> > these are "netlink packet definition language" in drbd_nl.h,
> > which is sort of clean, and preprocessor magic in drbd_nl.c.
> > suggestions how to handle this cleanly,
> > without making it more ugly?
> > autogenerate code by other means?
> > write it out by hand, and lose the nice and clean drbd_nl.h?
>
> 1. Open-code it if it is trivial enough and there's no point
> obfuscating anything.
> 2. Consider making the macro a function. Probably inline, most probably not.
> 3. Macros must NOT evaluate the passed argument twice -- consider the
> possible damages of someone who doesn't know it's a macro and therefore
> passing in an expression that has side-effects (*boom, crash*).


Some more macro tips for you:

If the macro contains multiple statements but must also return a value
to the caller (i.e. say it can appear on RHS of an assignment), use the:

#define foo ({ something; lastthing; })

kind of idiom. So the return value of the last statement ("lastthing;"
above) in there (obviously) becomes the return value of the whole
macro. So if you need to define an internal variable inside the macro,
and the return value of the macro should be that internal variable itself,
use the:
#define bar ({ int __mvar; __mvar = something; otherstuff; __mvar; })
kind of idiom.

Of course, if you really try to pull off the above stunts, we'll probably
ask you to go back and make that a proper function in the first place :-)

Stefan Richter

unread,
22 Jul 2007, 12:15:0222/07/2007
to Satyam Sharma, Lars Ellenberg, Jens Axboe, Andi Kleen, Andrew Morton, lkml
Satyam Sharma wrote:
> On 7/22/07, Lars Ellenberg <lars.el...@linbit.com> wrote:
>> 94 WARNING: declaring multiple variables together should be avoided
>> int snr, enr;
>> does this really need to become two lines?
>
> No, if these are some unimportant temporary/automatic variables in
> some function.
>
> Yes, if they are members of some struct (also comment them in
> this case -- in fact give full kernel-doc style comments).

Don't overdo comments. Comments eventually get out of date and are then
dangerously misleading. Comment API elements. Comment on the Why if it
isn't obvious. It shouldn't be necessary to comment on the How, because
that should be coded in an obvious way in the first place.
--
Stefan Richter
-=====-=-=== -=== =-==-
http://arcgraph.de/sr/

DoktorPZ

unread,
22 Jul 2007, 14:04:5822/07/2007
to

Lars Ellenberg:

> DRBD wants to go mainline.
> please have a look at the "for-linus" branch of
> git://git.drbd.org/home/git/linux-drbd.git.
>
> Longer story:
>
> Hi there,
>
> for lkml readers who don't know about DRBD yet:
> DRBD is the distributed replicated block device,
> a stacked block device driver as developed mainly
> by Lars Ellenberg and Philipp Reisner
> both employed at Linbit.
>

doktor@doktor-mobpc:~/downloads/drbd-8.0.4$ pwd
/home/doktor/downloads/drbd-8.0.4
doktor@doktor-mobpc:~/downloads/drbd-8.0.4$ grep -R "FIXME" * | wc -l
165

I don`t think that DRBD ready for linux kernel...

Kyle Moffett

unread,
22 Jul 2007, 21:39:0522/07/2007
to Lars Ellenberg, Jens Axboe, Andrew Morton, lkml, drbd...@linbit.com
Ok, I didn't have a chance to get through anywhere near all of it, but
here's my comments so far. I didn't really go through things in any
particular order but most of these comments are about your drbd_int.h
header file. Hopefully a lot of the comments will be useful to fix
similar/identical problems in your C files.

First of all, if you could break this up into chunks (even if they
aren't useful individually) just to make it easier to review.
Divisions like "This code does on-disk bitmap management", and "This
code does network protocol encoding/decoding" would be extremely
helpful when digging through this stuff. I ended up only really doing
a cursory review for low-level/style issues, since without the bigger
picture (and without the fixed style issues and macro cleanups) it's
very hard to really give a good high-level review.

Cheers,
Kyle Moffett


+drbd-objs := drbd_buildtag.o drbd_bitmap.o drbd_proc.o \
+ drbd_worker.o drbd_receiver.o drbd_req.o drbd_actlog.o \
+ lru_cache.o drbd_main.o drbd_strings.o drbd_nl.o
+
+obj-$(CONFIG_BLK_DEV_DRBD) += drbd.o

Don't use foo-objs, use foo-y instead.


+#undef DEVICE_NAME
+#define DEVICE_NAME "drbd"

This is never actually defined/redefined anywhere else. Just hardcode the
"drbd" in your printk strings and save yourself 5-6 characters per use.


+/* I don't remember why XCPU ...
+ * This is used to wake the asender,
+ * and to interrupt sending the sending task
+ * on disconnect.
+ */
+#define DRBD_SIG SIGXCPU
+
+/* This is used to stop/restart our threads.
+ * Cannot use SIGTERM nor SIGKILL, since these
+ * are sent out by init on runlevel changes
+ * I choose SIGHUP for now.
+ *
+ * FIXME btw, we should register some reboot notifier.
+ */
+#define DRBD_SIGKILL SIGHUP

Don't use signals between kernel threads, use proper primitives like
notifiers and waitqueues, which means you should also probably switch away
from kernel_thread() to the kthread_*() APIs. Also you should fix this
FIXME or remove it if it no longer applies:-D.


+#ifdef PARANOIA
+# define PARANOIA_BUG_ON(x) BUG_ON(x)
+#else
+# define PARANOIA_BUG_ON(x)
+#endif

This is only ever used in one place for a simple != test:
+ PARANOIA_BUG_ON(w != &mdev->resync_work);

Just delete PARANOIA_BUG_ON and convert the above to a straight BUG_ON()


+#define STATIC
+#define STATIC static

These two lines are found in different files, but the symbol "STATIC" isn't
used anywhere. Just get rid of it.


+/* handy macro: DUMPP(somepointer) */
+#define DUMPP(A) ERR( #A " = %p in %s:%d\n", (A), __FILE__, __LINE__);
[...]
+/* Info: do not remove the spaces around the "," before ##
+ * Otherwise this is not portable from gcc-2.95 to gcc-3.3 */
+#define PRINTK(level, fmt, args...) \
+ printk(level DEVICE_NAME "%d: " fmt, \
+ mdev->minor , ##args)
+
+#define ALERT(fmt, args...) PRINTK(KERN_ALERT, fmt , ##args)
[...]

No more custom debugging macros please, we have plenty of standardized ones
in the kernel already (and all togther too many nonstandardized ones).
Please take a good look at dev_printk, etc to make some of your printk()s
shorter, but don't add more icky macros. Also gcc < 3.1 is now unsupported,
so please remove gcc-2.95 portability comments/cruft (although in this case
the code itself doesn't need changing, just the comments).


+/* see kernel/printk.c:printk_ratelimit
+ * macro, so it is easy do have independend rate limits at different locations
+ * "initializer element not constant ..." with kernel 2.4 :(
+ * so I initialize toks to something large
+ */
+#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \

Any particular reason you can't just use printk_ratelimit for this? Also you
should remove any linux 2.4-related code/comments as they won't apply for
code submitted to 2.6 mainline.


+#ifdef DBG_ASSERTS
+extern void drbd_assert_breakpoint(struct drbd_conf *, char *, char *, int );
+# define D_ASSERT(exp) if (!(exp)) \
+ drbd_assert_breakpoint(mdev, #exp, __FILE__, __LINE__)
+#else
+# define D_ASSERT(exp) if (!(exp)) \
+ ERR("ASSERT( " #exp " ) in %s:%d\n", __FILE__, __LINE__)
+#endif
+#define ERR_IF(exp) if (({ \
+ int _b = (exp) != 0; \
+ if (_b) ERR("%s: (" #exp ") in %s:%d\n", \
+ __func__, __FILE__, __LINE__); \
+ _b; \
+ }))

Yuck, more debugging macros.


+/* Defines to control fault insertion */
+enum {
+ DRBD_FAULT_MD_WR = 0, /* meta data write */
+ DRBD_FAULT_MD_RD, /* read */
+ DRBD_FAULT_RS_WR, /* resync */
+ DRBD_FAULT_RS_RD,
+ DRBD_FAULT_DT_WR, /* data */
+ DRBD_FAULT_DT_RD,
+ DRBD_FAULT_DT_RA, /* data read ahead */
+
+ DRBD_FAULT_MAX,
+};

We have some existing failure-injection code, any chance you could rip out
your custom logic and just plug that? I haven't looked over it, though, so
I can't really offer any useful suggestions about it.


+#include <linux/stringify.h>

Put the include at the top of the file, please?


+/* integer division, round _UP_ to the next integer */
+#define div_ceil(A, B) ( (A)/(B) + ((A)%(B) ? 1 : 0) )
+/* usual integer division */
+#define div_floor(A, B) ( (A)/(B) )

Your div_ceil function looks OK but I think we already have a standard one
somewhere. You might remove that div_floor function, though, as it isn't
used anywhere.

+/*
+ * Compatibility Section
+ *************************/
+
+#define LOCK_SIGMASK(task, flags) spin_lock_irqsave(&task->sighand->siglock, flags)
+#define UNLOCK_SIGMASK(task, flags) spin_unlock_irqrestore(&task->sighand->siglock, flags)
+#define RECALC_SIGPENDING() recalc_sigpending();

These aren't used anywhere, remove please?


+#if defined(DBG_SPINLOCKS) && defined(__SMP__)
+# define MUST_HOLD(lock) if (!spin_is_locked(lock)) { ERR("Not holding lock! in %s\n", __FUNCTION__ ); }
+#else
+# define MUST_HOLD(lock)
+#endif

Why not just open-code "WARN_ON_SMP(!spin_is_locked(lock))" in the few places
this is used? Alternatively if you need this to actually BUG(), you might
either add a BUG_ON_SMP() to include/asm-generic/bug.h or add a helper to
include/linux/spinlock.h:

#if defined(CONFIG_SMP) && defined(CONFIG_DEBUG_SPINLOCK)
# define spin_lock_musthold(lock) BUG_ON(!spin_is_locked(lock))
#else
# define spin_lock_musthold(lock) do { } while(0)
#endif


+#define SET_MDEV_MAGIC(x) \
+ ({ typecheck(struct drbd_conf*, x); \
+ (x)->magic = (long)(x) ^ DRBD_MAGIC; })
+#define IS_VALID_MDEV(x) \
+ ( typecheck(struct drbd_conf*, x) && \
+ ((x) ? (((x)->magic ^ DRBD_MAGIC) == (long)(x)):0))

SET_MDEV_MAGIC is only used in one place, can't you open-code it there? Even
if it were used lots of places a little inline function would handle the
"typecheck" for you and be easier to read. The IS_VALID_MDEV() isn't used
anywhere, so delete it.


+enum Drbd_thread_state {
+ None,
+ Running,
+ Exiting,
+ Restarting
+};
+
+struct Drbd_thread {
+ spinlock_t t_lock;
+ struct task_struct *task;
+ struct completion startstop;
+ enum Drbd_thread_state t_state;
+ int (*function) (struct Drbd_thread *);
+ struct drbd_conf *mdev;
+};

As somebody already mentioned, you need to switch from kernel_thread to
kthread_* helpers.


+/*
+ * Having this as the first member of a struct provides sort of "inheritance".
+ * "derived" structs can be "drbd_queue_work()"ed.
+ * The callback should know and cast back to the descendant struct.
+ * drbd_request and Tl_epoch_entry are descendants of drbd_work.
+ */
+struct drbd_work;
+typedef int (*drbd_work_cb)(struct drbd_conf *, struct drbd_work *, int cancel);
+struct drbd_work {
+ struct list_head list;
+ drbd_work_cb cb;
+};
+
+struct drbd_barrier;
+struct drbd_request {
+ struct drbd_work w;


Eww, please don't do opencoded casting of these. You should use the proper
container_of() macro instead:

> struct foo {
> int a;
> };
> struct bar {
> int b;
> struct foo thefoo;
> };
> void mycallback(struct foo *myfoo)
> {
> struct bar *mybar = container_of(myfoo, struct bar, thefoo);
> process_a_bar(mybar);
> }

It's not *much* better typechecking, but it's at least a little. It also
lets you put a struct drbd_work at a non-zero offset inside of some other
struct, as container_of() does internal subtraction of the offsetof().


+/* THINK maybe we actually want to use the default "event/%s" worker threads
+ * or similar in linux 2.6, which uses per cpu data and threads.
+ *
+ * To be general, this might need a spin_lock member.
+ * For now, please use the mdev->req_lock to protect list_head,
+ * see drbd_queue_work below.
+ */
+struct drbd_work_queue {
+ struct list_head q;
+ struct semaphore s; /* producers up it, worker down()s it */
+ spinlock_t q_lock; /* to protect the list. */
+};

Umm, how about fixing this to actually use proper workqueues or something
instead of this open-coded mess?


+/* for sync_conf and other types... */
+#define PACKET(name, number, fields) struct name { fields };
+#define INTEGER(pn, pr, member) int member;
+#define INT64(pn, pr, member) __u64 member;
+#define BIT(pn, pr, member) unsigned member : 1;
+#define STRING(pn, pr, member, len) \
+ unsigned char member[len]; int member ## _len;
+#include "linux/drbd_nl.h"

The only light at the end of this tunnel is the greedily burning fires of
Macro Hell(TM). Isn't there any better way to do this than all those horrid
macros? I can sorta see what drbd_nl.h is trying to do, but the way it's
included multiple times with all those insane macros defined makes it really
hard to mentally parse.


+#ifdef PARANOIA
+ long magic;
+#endif

Any particular reason you can't just unconditionally use a magic number?
It's little more than an extra word per device and a couple simple integer
tests.


+/* returns 1 if it was successfull,
+ * returns 0 if there was no data socket.
+ * so wherever you are going to use the data.socket, e.g. do
+ * if (!drbd_get_data_sock(mdev))
+ * return 0;
+ * CODE();
+ * drbd_put_data_sock(mdev);
+ */
+static inline int drbd_get_data_sock(struct drbd_conf *mdev)
+{
+ down(&mdev->data.mutex);
+ /* drbd_disconnect() could have called drbd_free_sock()
+ * while we were waiting in down()... */
+ if (unlikely(mdev->data.socket == NULL)) {
+ up(&mdev->data.mutex);
+ return 0;
+ }
+ return 1;
+}

Any reason this can't be open-coded? The extra unlock logic should just be
moved into the cleanup handlers for the appropriate function. It looks like
you are using the semaphore as a simple binary mutex, so please switch this
to "struct mutex" as well.


+#if BITS_PER_LONG == 32
+#define LN2_BPL 5
+#define cpu_to_lel(A) cpu_to_le32(A)
+#define lel_to_cpu(A) le32_to_cpu(A)
+#elif BITS_PER_LONG == 64
+#define LN2_BPL 6
+#define cpu_to_lel(A) cpu_to_le64(A)
+#define lel_to_cpu(A) le64_to_cpu(A)
+#else
+#error "LN2 of BITS_PER_LONG unknown!"
+#endif

Hrm, any reason you can't just make the bitmap an array of __u64 and always
use the cpu_to_le64()/le64_to_cpu() functions?


+/* I want the packet to fit within one page
+ * THINK maybe use a special bitmap header,
+ * including offset and compression scheme and whatnot
+ * Do not use PAGE_SIZE here! Use a architecture agnostic constant!
+ */
+#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof(long))

Yuck. Definitely use PAGE_SIZE here, so at least if it's broken on an arch
with multiple page sizes, somebody can grep for PAGE_SIZE to fix it. It also
means that on archs/configs with 8k or 64k pages you won't waste a bunch of
memory.


+/* Dynamic tracing framework */
+#ifdef ENABLE_DYNAMIC_TRACE
+
+extern int trace_type;
+extern int trace_devs;
[...]

AGH!! Not another dynamic tracing framework! Please use one of the existing
solutions like kprobes or something. Maybe define some static tracepoints
if that would be useful?


+static inline void drbd_tcp_cork(struct socket *sock)
+{
+#if 1
+ mm_segment_t oldfs = get_fs();
+ int val = 1;
+
+ set_fs(KERNEL_DS);
+ tcp_setsockopt(sock->sk, SOL_TCP, TCP_CORK, (char *)&val, sizeof(val) );
+ set_fs(oldfs);
+#else
+ tcp_sk(sock->sk)->nonagle |= TCP_NAGLE_CORK;
+#endif
+}

Yuck, why'd you do it this way? Probably because your tcp_sk(sock->sk) stuff
doesn't have proper locking, I'll bet. You can avoid all the extra wrapper
crap by just looking in "do_tcp_setsockopt" and taking the appropriate lock:

static inline void drbd_tcp_cork(struct socket *sock)
{
struct sock *sk = sock->sk;

lock_sock(sk);
tcp_sk(sk)->nonagle |= TCP_NAGLE_CORK;
release-sock(sk);
}

On the other hand, none of the currently in-tree network block devices or
network filesystems seem to feel the need to try to TCP_CORK their output,
why does drbd?


+#define peer_mask role_mask
+#define pdsk_mask disk_mask
+#define susp_mask 1
+#define user_isp_mask 1
+#define aftr_isp_mask 1
+
+#define NS(T, S) \
+ ({ union drbd_state_t mask; mask.i = 0; mask.T = T##_mask; mask; }), \
+ ({ union drbd_state_t val; val.i = 0; val.T = (S); val; })
+#define NS2(T1, S1, T2, S2) \
+ ({ union drbd_state_t mask; mask.i = 0; mask.T1 = T1##_mask; \
+ mask.T2 = T2##_mask; mask; }), \
+ ({ union drbd_state_t val; val.i = 0; val.T1 = (S1); \
+ val.T2 = (S2); val; })
+#define NS3(T1, S1, T2, S2, T3, S3) \
+ ({ union drbd_state_t mask; mask.i = 0; mask.T1 = T1##_mask; \
+ mask.T2 = T2##_mask; mask.T3 = T3##_mask; mask; }), \
+ ({ union drbd_state_t val; val.i = 0; val.T1 = (S1); \
+ val.T2 = (S2); val.T3 = (S3); val; })
+
+#define _NS(D, T, S) \
+ D, ({ union drbd_state_t ns; ns.i = D->state.i; ns.T = (S); ns; })
+#define _NS2(D, T1, S1, T2, S2) \
+ D, ({ union drbd_state_t ns; ns.i = D->state.i; ns.T1 = (S1); \
+ ns.T2 = (S2); ns; })
+#define _NS3(D, T1, S1, T2, S2, T3, S3) \
+ D, ({ union drbd_state_t ns; ns.i = D->state.i; ns.T1 = (S1); \
+ ns.T2 = (S2); ns.T3 = (S3); ns; })

Grumble. When I earlier said I thought I was in macro hell, well, I was
wrong. *THIS* is macro hell. What the fsck is that supposed to do? And it
doesn't even include a *SINGLE* comment!!!


+static inline void drbd_state_lock(struct drbd_conf *mdev)
+{
+ wait_event(mdev->misc_wait,
+ !test_and_set_bit(CLUSTER_ST_CHANGE, &mdev->flags));
+}

Umm, what's wrong with a mutex here?


+static inline int semaphore_is_locked(struct semaphore *s)
+{
+ if (!down_trylock(s)) {
+ up(s);
+ return 0;
+ }
+ return 1;
+}

This would be better implemented in the mutex/semaphore generic code, instead
of stuffed in a network blockdev. On the other hand, since the only user is
this:
+ D_ASSERT(semaphore_is_locked(&mdev->md_io_mutex));

Can't you just either get rid of the check or open-code it? It would also
give you a chance to substitute D_ASSERT(foo) for a proper BUG_ON(!foo) :-D


+static inline void
+_drbd_queue_work(struct drbd_work_queue *q, struct drbd_work *w)
[...]

Yeah, this stuff should really be fixed to use generic workqueues or
something.


+#define ERR_IF_CNT_IS_NEGATIVE(which) \
+ if (atomic_read(&mdev->which) < 0) \
+ ERR("in %s:%d: " #which " = %d < 0 !\n", \
+ __func__ , __LINE__ , \
+ atomic_read(&mdev->which))

Another macro with an implicit parameter (mdev). Please fix all of these


+#define dec_unacked(mdev) do { \
+ typecheck(struct drbd_conf *, mdev); \
+ atomic_dec(&mdev->unacked_cnt); \
+ ERR_IF_CNT_IS_NEGATIVE(unacked_cnt); } while (0)
[...]

More macros that should be inline functions.


+ if (!have_net_conf) dec_net(mdev);

There's lots of coding-style violations like these in there. You should put
the "dec_net(mdev);" indented on the line *AFTER* the if statememt. Please
read linux/Documentation/CodingStyle for full details.


+ int mxb = 1000000; /* arbitrary limit on open requests */

Arbitrary limits are usually bad. Derive it from system properties, make it
user-configurable, etc.


+/* I'd like to use wait_event_lock_irq,
+ * but I'm not sure when it got introduced,
+ * and not sure when it has 3 or 4 arguments */
+static inline void inc_ap_bio(struct drbd_conf *mdev)
[...]
+ spin_lock_irq(&mdev->req_lock);
+ while (!__inc_ap_bio_cond(mdev)) {
+ prepare_to_wait(&mdev->misc_wait, &wait, TASK_UNINTERRUPTIBLE);
+ spin_unlock_irq(&mdev->req_lock);
+ schedule();
+ finish_wait(&mdev->misc_wait, &wait);
+ spin_lock_irq(&mdev->req_lock);
+ }
+ spin_unlock_irq(&mdev->req_lock);

Since you're submitting for upstream inclusion you can just delete the extra
backwards-compatible opencoded stuff and use the generic helper.


+#define ARRY_SIZE(A) (sizeof(A)/sizeof(A[0]))

There's a generic ARRAY_SIZE(foo) somewhere in the kernel sources. Use that
one please.


+/* this is developers aid only! */
+#define PARANOIA_ENTRY() BUG_ON(test_and_set_bit(__LC_PARANOIA, &lc->flags))
+#define PARANOIA_LEAVE() do { clear_bit(__LC_PARANOIA, &lc->flags); smp_mb__after_clear_bit(); } while (0)
+#define RETURN(x...) do { PARANOIA_LEAVE(); return x ; } while (0)

What is the PARANOIA flag intended to verify? This needs better commenting
or removal. Don't use macros which take implicit arguments (IE: the lc
value is used in the macro without it being passed as a parameter). Since
it's debugging code you could also replace the open-coded bit trylock with a
blocking spinlock. If something hangs, just turn on lockdep and it will
report *MUCH* more useful information about the deadlock. Macros which do a
"return" are discouraged, but since this one is called RETURN() it might be
OK. On the other hand, you should probably rename it as PARANOIA_RETURN().


+int _drbd_md_sync_page_io(struct drbd_conf *mdev,
+ struct drbd_backing_dev *bdev,
+ struct page *page, sector_t sector,
+ int rw, int size)
+{
[...]
+ ok = (bio_add_page(bio, page, size, 0) == size);
+ if (!ok) goto out;

Ewwww, can somebody say "CodingStyle"? Don't put an if() and its result
on a single line, and just use a straight if instead:
> if (bio_add_page(bio, page, size, 0) != size)
> goto out;


+ mask = ( hardsect / MD_HARDSECT ) - 1;
+ D_ASSERT( mask == 1 || mask == 3 || mask == 7 );
+ D_ASSERT( hardsect == (mask+1) * MD_HARDSECT );

Problematic spaces and parentheses. Again, see Documentation/CodingStyle
for the preferred forms. On the other hand, sometimes little code-style
things like that are left to the maintainer if they really strongly prefer
it one particular way.

Christoph Hellwig

unread,
23 Jul 2007, 04:49:2923/07/2007
to Kyle Moffett, Lars Ellenberg, Jens Axboe, Andrew Morton, lkml, drbd...@linbit.com
On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:
> First of all, if you could break this up into chunks (even if they
> aren't useful individually) just to make it easier to review.

No, please don't do that I have no idea why people say that except for
drinking too much "small patches" koolaid. Small patches are the
most optimal form if you can split patches logically. There is absolutely
no point in splitting up a large driver into multiple patches. It just
means you have to download and apply all of them to get the global picture
needed for a proper review again. If a driver is too large for lkml a
pointer to a large patch is much much better for a proper review.

Sam Ravnborg

unread,
23 Jul 2007, 04:59:4523/07/2007
to Christoph Hellwig, Kyle Moffett, Lars Ellenberg, Jens Axboe, Andrew Morton, lkml, drbd...@linbit.com
On Mon, Jul 23, 2007 at 09:49:03AM +0100, Christoph Hellwig wrote:
> On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:
> > First of all, if you could break this up into chunks (even if they
> > aren't useful individually) just to make it easier to review.
>
> No, please don't do that I have no idea why people say that except for
> drinking too much "small patches" koolaid.

It is much more preferred to have code available in the mailer than
being forced to copy it from some big patch somewhere when you
want to point out something.

Obviously the big picture is needed too. But the introductory mail
describing the overall design and functionality should provide most of this.
This submission obviously also failed to provide such an overall design intro.

> If a driver is too large for lkml a
> pointer to a large patch is much much better for a proper review.

Then we should request both...

Sam

Christoph Hellwig

unread,
23 Jul 2007, 05:01:3423/07/2007
to Sam Ravnborg, Christoph Hellwig, Kyle Moffett, Lars Ellenberg, Jens Axboe, Andrew Morton, lkml, drbd...@linbit.com
On Mon, Jul 23, 2007 at 11:00:38AM +0200, Sam Ravnborg wrote:
> On Mon, Jul 23, 2007 at 09:49:03AM +0100, Christoph Hellwig wrote:
> > On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:
> > > First of all, if you could break this up into chunks (even if they
> > > aren't useful individually) just to make it easier to review.
> >
> > No, please don't do that I have no idea why people say that except for
> > drinking too much "small patches" koolaid.
>
> It is much more preferred to have code available in the mailer than
> being forced to copy it from some big patch somewhere when you
> want to point out something.

While that's a valid point I totally agree on for small patches it just
doesn't matter for a complex driver where you need to sit down for an
hour or more to actually do a useful review.

Sam Ravnborg

unread,
23 Jul 2007, 05:18:4823/07/2007
to Christoph Hellwig, Kyle Moffett, Lars Ellenberg, Jens Axboe, Andrew Morton, lkml, drbd...@linbit.com
On Mon, Jul 23, 2007 at 10:01:01AM +0100, Christoph Hellwig wrote:
> On Mon, Jul 23, 2007 at 11:00:38AM +0200, Sam Ravnborg wrote:
> > On Mon, Jul 23, 2007 at 09:49:03AM +0100, Christoph Hellwig wrote:
> > > On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:
> > > > First of all, if you could break this up into chunks (even if they
> > > > aren't useful individually) just to make it easier to review.
> > >
> > > No, please don't do that I have no idea why people say that except for
> > > drinking too much "small patches" koolaid.
> >
> > It is much more preferred to have code available in the mailer than
> > being forced to copy it from some big patch somewhere when you
> > want to point out something.
>
> While that's a valid point I totally agree on for small patches it just
> doesn't matter for a complex driver where you need to sit down for an
> hour or more to actually do a useful review.

We do review on different levels.
I for once will look at the Makefile + Kconfig bits and some trivial
CodingStyle issues. Here it will help.
You on the other hand I assume will look at the functionality and overall
design and then for you the full patch is the best way.

Well I could try to do the same as you but that would take me maybe
10 hours or more just to come up with some limited input on the overall design.
So therefore instead of backing out I give inputs on the smaller scale
which I know is less usefull but in the hope that this can be resolved
so when you or others take your time to review at least the majority
of the simple CodingStyle things are fixed so you do not get sidetracked
and can concentrate on the important bits.

The message is quite clear.
For a driver like this do something like:
- Describe the functionality / the problem is solved
- Describe the overall design (maybe as a file in Documentation/?)
- Make a single patch available
- Submit mails in smaller chunks for on-the-list reviews

This will then be useable bot for the "triviality" reviewers like me
and for the "knowledge able" reviewers like you.

Sam

Lars Ellenberg

unread,
23 Jul 2007, 07:09:2923/07/2007
to Sam Ravnborg, Christoph Hellwig, Kyle Moffett, Jens Axboe, Andrew Morton, lkml, drbd...@lists.linbit.com
On Mon, Jul 23, 2007 at 11:19:40AM +0200, Sam Ravnborg wrote:
> The message is quite clear.
> For a driver like this do something like:
> - Describe the functionality / the problem is solved
> - Describe the overall design (maybe as a file in Documentation/?)

I'm currently writing the paper for linux-conf.eu,
and will make that available to you asap.

> - Make a single patch available

will do.

> - Submit mails in smaller chunks for on-the-list reviews
>
> This will then be useable bot for the "triviality" reviewers like me
> and for the "knowledge able" reviewers like you.

will do.
now working on the remaining "triviality" issues already mentioned.

Lars

Lars Ellenberg

unread,
23 Jul 2007, 09:32:3123/07/2007
to Kyle Moffett, Jens Axboe, Andrew Morton, lkml
On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:
> Ok, I didn't have a chance to get through anywhere near all of it, but
> here's my comments so far. I didn't really go through things in any
> particular order but most of these comments are about your drbd_int.h
> header file. Hopefully a lot of the comments will be useful to fix
> similar/identical problems in your C files.

Thank you for your time.

I want to give some lame excuses, still:

> +#undef DEVICE_NAME
> +#define DEVICE_NAME "drbd"

a left-over.
first prototypes of drbd have been around in linux 2.0 time...
"in the old days", as a block device, you had to define this,
and then include some header after that.
will clean that up.

> +/* I don't remember why XCPU ...
> + * This is used to wake the asender,
> + * and to interrupt sending the sending task
> + * on disconnect.
> + */
> +#define DRBD_SIG SIGXCPU

> Don't use signals between kernel threads, use proper primitives like


> notifiers and waitqueues, which means you should also probably switch away
> from kernel_thread() to the kthread_*() APIs. Also you should fix this
> FIXME or remove it if it no longer applies:-D.

right.
but how to I tell a network thread in tcp_recvmsg to stop early,
without using signals?

> +/* see kernel/printk.c:printk_ratelimit
> + * macro, so it is easy do have independend rate limits at different locations
> + * "initializer element not constant ..." with kernel 2.4 :(
> + * so I initialize toks to something large
> + */
> +#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \
>
> Any particular reason you can't just use printk_ratelimit for this?

I want to be able to do a rate-limit per specific message/code fragment,
without affecting other messages or execution paths.

> +/*
> + * Compatibility Section
> + *************************/
> +
> +#define LOCK_SIGMASK(task, flags) spin_lock_irqsave(&task->sighand->siglock, flags)
> +#define UNLOCK_SIGMASK(task, flags) spin_unlock_irqrestore(&task->sighand->siglock, flags)
> +#define RECALC_SIGPENDING() recalc_sigpending();
>
> These aren't used anywhere,

I just recently cleaned up the use of them.
when the first drbd prototypes have been developed
it was linux 2.0 times...
along the way many things have changed in incompatible ways,
so we had to abstract it in macros.

> remove please?

yes.

> +/* THINK maybe we actually want to use the default "event/%s" worker threads
> + * or similar in linux 2.6, which uses per cpu data and threads.
> + *
> + * To be general, this might need a spin_lock member.
> + * For now, please use the mdev->req_lock to protect list_head,
> + * see drbd_queue_work below.
> + */
> +struct drbd_work_queue {
> + struct list_head q;
> + struct semaphore s; /* producers up it, worker down()s it */
> + spinlock_t q_lock; /* to protect the list. */
> +};
>
> Umm, how about fixing this to actually use proper workqueues or something
> instead of this open-coded mess?

unlikely to happen "right now".
but it is on our todo list...

> +/* I want the packet to fit within one page
> + * THINK maybe use a special bitmap header,
> + * including offset and compression scheme and whatnot
> + * Do not use PAGE_SIZE here! Use a architecture agnostic constant!
> + */
> +#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof(long))
>
> Yuck. Definitely use PAGE_SIZE here, so at least if it's broken on an arch
> with multiple page sizes, somebody can grep for PAGE_SIZE to fix it. It also
> means that on archs/configs with 8k or 64k pages you won't waste a bunch of
> memory.

No. This is not to allocate anything, but defines the chunk size with
which we transmit the bitmap, when we have to.
We need to be able to talk from one arch
(say i586) to some other (say s390, or sparc, or whatever).
The receiving side has a one-page buffer, from which it may or may not
to endian-conversion.
The hardcoded 4096 is the minimal common denominator here.

> +/* Dynamic tracing framework */

guess we have to explain first what this is for.
it probably is not exactly what you think it is...
but maybe I am just too ignorant about what you suggest here.

we'll have to defer discussion about this for when I'm done
doing the trivial fix-ups, and provided an overall design overview.

> +static inline void drbd_tcp_cork(struct socket *sock)

..


> On the other hand, none of the currently in-tree network block devices or
> network filesystems seem to feel the need to try to TCP_CORK their output,
> why does drbd?

it had performance improvements somewhen. I doubt it has still.
maybe we just remove this.

> +#define peer_mask role_mask
> +#define pdsk_mask disk_mask
> +#define susp_mask 1
> +#define user_isp_mask 1
> +#define aftr_isp_mask 1

> +#define NS(T, S) \


> +#define NS2(T1, S1, T2, S2) \

> +#define NS3(T1, S1, T2, S2, T3, S3) \

> +#define _NS(D, T, S) \


> +#define _NS2(D, T1, S1, T2, S2) \

> +#define _NS3(D, T1, S1, T2, S2, T3, S3) \
>

> Grumble. When I earlier said I thought I was in macro hell, well, I was
> wrong. *THIS* is macro hell. What the fsck is that supposed to do? And it
> doesn't even include a *SINGLE* comment!!!

uhm. basically you are right.
Phil will take over to explain why it is done that way...


> + int mxb = 1000000; /* arbitrary limit on open requests */
>
> Arbitrary limits are usually bad. Derive it from system properties, make it
> user-configurable, etc.

it is user configurable.
it is just the upper cap on user configurability.

most other points I just snipped off of this mail
will be handled as you suggested asap.

Lars

Jens Axboe

unread,
23 Jul 2007, 09:37:2123/07/2007
to Lars Ellenberg, Kyle Moffett, Andrew Morton, lkml
On Mon, Jul 23 2007, Lars Ellenberg wrote:
> > +/* THINK maybe we actually want to use the default "event/%s" worker threads
> > + * or similar in linux 2.6, which uses per cpu data and threads.
> > + *
> > + * To be general, this might need a spin_lock member.
> > + * For now, please use the mdev->req_lock to protect list_head,
> > + * see drbd_queue_work below.
> > + */
> > +struct drbd_work_queue {
> > + struct list_head q;
> > + struct semaphore s; /* producers up it, worker down()s it */
> > + spinlock_t q_lock; /* to protect the list. */
> > +};
> >
> > Umm, how about fixing this to actually use proper workqueues or something
> > instead of this open-coded mess?
>
> unlikely to happen "right now".
> but it is on our todo list...

But stuff like that is definitely a merge show stopper, jfyi.

--
Jens Axboe

Satyam Sharma

unread,
23 Jul 2007, 09:41:2423/07/2007
to Lars Ellenberg, Kyle Moffett, Jens Axboe, Andrew Morton, lkml
On 7/23/07, Lars Ellenberg <lars.el...@linbit.com> wrote:
> On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:
> [...]

> > Don't use signals between kernel threads, use proper primitives like
> > notifiers and waitqueues, which means you should also probably switch away
> > from kernel_thread() to the kthread_*() APIs. Also you should fix this
> > FIXME or remove it if it no longer applies:-D.
>
> right.
> but how to I tell a network thread in tcp_recvmsg to stop early,
> without using signals?


Yup, kthreads API cannot handle (properly stop) kernel threads that want
to sleep on possibly-blocking-forever-till-signalled-functions such as
tcp_recvmsg or skb_recv_datagram etc etc.

There are two workarounds:
1. Use sk_rcvtimeo and related while-continue logic
2. force_sig(SIGKILL) to your kernel thread just before kthread_stop
(note that you don't need to allow / write code to handle / etc signals
in your kthread code -- force_sig will work automatically)

> > +/* THINK maybe we actually want to use the default "event/%s" worker threads
> > + * or similar in linux 2.6, which uses per cpu data and threads.
> > + *
> > + * To be general, this might need a spin_lock member.
> > + * For now, please use the mdev->req_lock to protect list_head,
> > + * see drbd_queue_work below.
> > + */
> > +struct drbd_work_queue {
> > + struct list_head q;
> > + struct semaphore s; /* producers up it, worker down()s it */
> > + spinlock_t q_lock; /* to protect the list. */
> > +};
> >
> > Umm, how about fixing this to actually use proper workqueues or something
> > instead of this open-coded mess?
>
> unlikely to happen "right now".
> but it is on our todo list...


It should be easier to do it now (if you defer it for later, the code will
only grow more and more complex). Also, removing this gunk from
your driver will clearly make it smaller, and easier for us to review :-)

> > +#define peer_mask role_mask
> > +#define pdsk_mask disk_mask
> > +#define susp_mask 1
> > +#define user_isp_mask 1
> > +#define aftr_isp_mask 1
> > +#define NS(T, S) \
> > +#define NS2(T1, S1, T2, S2) \
> > +#define NS3(T1, S1, T2, S2, T3, S3) \
> > +#define _NS(D, T, S) \
> > +#define _NS2(D, T1, S1, T2, S2) \
> > +#define _NS3(D, T1, S1, T2, S2, T3, S3) \
> >
> > Grumble. When I earlier said I thought I was in macro hell, well, I was
> > wrong. *THIS* is macro hell. What the fsck is that supposed to do? And it
> > doesn't even include a *SINGLE* comment!!!
>
> uhm. basically you are right.
> Phil will take over to explain why it is done that way...


The simplistic problems I initially thought your driver suffered from turned
out to be not so true, actually -- like Kyle mentioned, your driver is in
proper macro hell. Please get rid of as many as possible, and replace
with functions, as applicable.


Satyam

Jesper Juhl

unread,
23 Jul 2007, 17:00:3723/07/2007
to Jens Axboe, Andrew Morton, lkml, drbd...@lists.linbit.com
On 21/07/07, Lars Ellenberg <la...@linbit.com> wrote:
>
> DRBD wants to go mainline.
> please have a look at the "for-linus" branch of
> git://git.drbd.org/home/git/linux-drbd.git.
>
A few more comments.

In drbd_main.c::after_state_ch() there's this code :

..
if ( mdev->p_uuid ) {
kfree(mdev->p_uuid);
mdev->p_uuid = NULL;
}
..

Since kfree(NULL) if a noop, that should just become :

..
kfree(mdev->p_uuid);
mdev->p_uuid = NULL;
..


Same for these bits in drbd_main.c::drbd_cleanup()

..
if(mdev->app_reads_hash) {
kfree(mdev->app_reads_hash);
mdev->app_reads_hash = NULL;
}
if ( mdev->p_uuid ) {
kfree(mdev->p_uuid);
mdev->p_uuid = NULL;
}
..

They should just become

..
kfree(mdev->app_reads_hash);
mdev->app_reads_hash = NULL;
kfree(mdev->p_uuid);
mdev->p_uuid = NULL;
..

And then there's drbd_main.c::drbd_new_device() :

..
if(mdev->app_reads_hash) kfree(mdev->app_reads_hash);
..

That should just be

..
kfree(mdev->app_reads_hash);
..

Btw; You shouldn't put multiple statements on the same line. That is
seen all over the place - like

mdev = kzalloc(sizeof(drbd_dev),GFP_KERNEL);
if(!mdev) goto Enomem;

That should be

mdev = kzalloc(sizeof(drbd_dev),GFP_KERNEL);
if(!mdev)
goto Enomem;

There are lots of other examples.


Here are a few more pointless NULL checks before kfree() :

drbd_nl.c::drbd_nl_net_conf() :
..
if(new_tl_hash) {
if (mdev->tl_hash) kfree(mdev->tl_hash);
mdev->tl_hash_s = mdev->net_conf->max_epoch_size/8;
mdev->tl_hash = new_tl_hash;
}

if(new_ee_hash) {
if (mdev->ee_hash) kfree(mdev->ee_hash);
mdev->ee_hash_s = mdev->net_conf->max_buffers/8;
mdev->ee_hash = new_ee_hash;
}

if ( mdev->cram_hmac_tfm ) {
crypto_free_hash(mdev->cram_hmac_tfm);
}
mdev->cram_hmac_tfm = tfm;
..

should become :

..
if(new_tl_hash) {
kfree(mdev->tl_hash);
mdev->tl_hash_s = mdev->net_conf->max_epoch_size/8;
mdev->tl_hash = new_tl_hash;
}

if(new_ee_hash) {
kfree(mdev->ee_hash);
mdev->ee_hash_s = mdev->net_conf->max_buffers/8;
mdev->ee_hash = new_ee_hash;
}

crypto_free_hash(mdev->cram_hmac_tfm);
mdev->cram_hmac_tfm = tfm;
..

still in drbd_nl.c::drbd_nl_net_conf() :

..
fail:
if (tfm) crypto_free_hash(tfm);
if (new_tl_hash) kfree(new_tl_hash);
if (new_ee_hash) kfree(new_ee_hash);
if (new_conf) kfree(new_conf);
..

should become

..
fail:
crypto_free_hash(tfm);
kfree(new_tl_hash);
kfree(new_ee_hash);
kfree(new_conf);
..

drbd_nl.c::ensure_mdev() :

..
if(mdev->app_reads_hash) kfree(mdev->app_reads_hash);
.

becomes

..
kfree(mdev->app_reads_hash);
..


In drbd_receiver.c::receive_uuids() we find this :

..
if ( mdev->p_uuid ) kfree(mdev->p_uuid);
..

which should be just

..
kfree(mdev->p_uuid);
..

There is also drbd_receiver.c::drbd_disconnect() - where

..
if ( mdev->p_uuid ) {
kfree(mdev->p_uuid);
mdev->p_uuid = NULL;
}
..

needs to become just

..
kfree(mdev->p_uuid);
mdev->p_uuid = NULL;
..


And in drbd_receiver.c::drbd_do_auth()

..
fail:
if(peers_ch) kfree(peers_ch);
if(response) kfree(response);
if(right_response) kfree(right_response);

..

needs to be turned into

..
fail:
kfree(peers_ch);
kfree(response);
kfree(right_response);
..


drbd_req.c::drbd_make_request_common() also needs to get rid of

..
if (b) kfree(b); /* if someone else has beaten us to it... */
..
fail_and_free_req:
if (b) kfree(b);
..


and just turn it into

..
kfree(b); /* if someone else has beaten us to it... */
..
fail_and_free_req:
kfree(b);
..


--
Jesper Juhl <jespe...@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

Lars Ellenberg

unread,
23 Jul 2007, 17:14:0323/07/2007
to Jens Axboe, Kyle Moffett, Andrew Morton, lkml
On Mon, Jul 23, 2007 at 03:37:04PM +0200, Jens Axboe wrote:
> On Mon, Jul 23 2007, Lars Ellenberg wrote:
> > > +/* THINK maybe we actually want to use the default "event/%s" worker threads
> > > + * or similar in linux 2.6, which uses per cpu data and threads.
> > > + *
> > > + * To be general, this might need a spin_lock member.
> > > + * For now, please use the mdev->req_lock to protect list_head,
> > > + * see drbd_queue_work below.
> > > + */
> > > +struct drbd_work_queue {
> > > + struct list_head q;
> > > + struct semaphore s; /* producers up it, worker down()s it */
> > > + spinlock_t q_lock; /* to protect the list. */
> > > +};
> > >
> > > Umm, how about fixing this to actually use proper workqueues or something
> > > instead of this open-coded mess?
> >
> > unlikely to happen "right now".
> > but it is on our todo list...
>
> But stuff like that is definitely a merge show stopper, jfyi.

I see.
but it is not that easy to do away with kernel threads
(the open-coded mess) in favor of "proper workqueues or something".

Lars

Lars Ellenberg

unread,
23 Jul 2007, 17:20:1023/07/2007
to Satyam Sharma, Kyle Moffett, Jens Axboe, Andrew Morton, lkml
On Mon, Jul 23, 2007 at 07:10:58PM +0530, Satyam Sharma wrote:
> On 7/23/07, Lars Ellenberg <lars.el...@linbit.com> wrote:
> >On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:
> >[...]
> >> Don't use signals between kernel threads, use proper primitives like
> >> notifiers and waitqueues, which means you should also probably switch
> >away
> >> from kernel_thread() to the kthread_*() APIs. Also you should fix this
> >> FIXME or remove it if it no longer applies:-D.
> >
> >right.
> >but how to I tell a network thread in tcp_recvmsg to stop early,
> >without using signals?
>
>
> Yup, kthreads API cannot handle (properly stop) kernel threads that want
> to sleep on possibly-blocking-forever-till-signalled-functions such as
> tcp_recvmsg or skb_recv_datagram etc etc.
>
> There are two workarounds:
> 1. Use sk_rcvtimeo and related while-continue logic
> 2. force_sig(SIGKILL) to your kernel thread just before kthread_stop
> (note that you don't need to allow / write code to handle / etc signals
> in your kthread code -- force_sig will work automatically)

this is not only at stop time.
for example our "drbd_asender" thread
does receive as well as send, and the sending
latency is crucial to performance, while the recv
will not timeout for the next few seconds.

> >> +/* THINK maybe we actually want to use the default "event/%s" worker
> >threads
> >> + * or similar in linux 2.6, which uses per cpu data and threads.
> >> + *
> >> + * To be general, this might need a spin_lock member.
> >> + * For now, please use the mdev->req_lock to protect list_head,
> >> + * see drbd_queue_work below.
> >> + */
> >> +struct drbd_work_queue {
> >> + struct list_head q;
> >> + struct semaphore s; /* producers up it, worker down()s it */
> >> + spinlock_t q_lock; /* to protect the list. */
> >> +};
> >>
> >> Umm, how about fixing this to actually use proper workqueues or something
> >> instead of this open-coded mess?
> >
> >unlikely to happen "right now".
> >but it is on our todo list...
>
> It should be easier to do it now (if you defer it for later, the code will
> only grow more and more complex). Also, removing this gunk from
> your driver will clearly make it smaller, and easier for us to review :-)

and will poison the generic work queues with stuff that might block
somewhere deep in the tcp stack. and where we are not able to cancel it.
not exactly desirable, either.
but maybe I am missing something?

Lars

Kyle Moffett

unread,
23 Jul 2007, 20:48:3523/07/2007
to Lars Ellenberg, Jens Axboe, Andrew Morton, LKML Kernel, net...@vger.kernel.org
For the guys on netdev, would you please look at the tcp_recvmsg-
threading and TCP_NAGLE_CORK issues below and give opinions on the
best way to proceed?

One thing to remember, you don't necessarily have to merge every
feature right away. As long as the new code is configured "off" by
default with an "(EXPERIMENTAL)" warning, you can start getting the
core parts and the cleanups upstream before you have to resolve all
the issues with low-latency, dynamic-tracing-frameworks, etc.

On Jul 23, 2007, at 09:32:02, Lars Ellenberg wrote:
> On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:
>> +/* I don't remember why XCPU ...
>> + * This is used to wake the asender,
>> + * and to interrupt sending the sending task
>> + * on disconnect.
>> + */
>> +#define DRBD_SIG SIGXCPU
>
>> Don't use signals between kernel threads, use proper primitives
>> like notifiers and waitqueues, which means you should also
>> probably switch away from kernel_thread() to the kthread_*()
>> APIs. Also you should fix this FIXME or remove it if it no longer
>> applies:-D.
>
> right.
> but how to I tell a network thread in tcp_recvmsg to stop early,
> without using signals?

I'm not really a kernel-networking guy, so I can't answer this
definitively, but I'm pretty sure the problem has been solved in many
network filesystems and such, so I've added a netdev CC. The way I'd
do it in userspace is with nonblocking IO and epoll(), that way I
don't actually have to "stop" or "signal" the thread, I can just add
a socket to epoll fd when I want to pay attention to it, and remove
it from my epoll fd when I'm done with it. I'd assume there's some
equivalent way in kernelspace based around the "struct kiocb *iocb"
and "int nonblock" parameters to the tcp_recvmsg() kernel function.

>> +/* see kernel/printk.c:printk_ratelimit
>> + * macro, so it is easy do have independend rate limits at
>> different locations
>> + * "initializer element not constant ..." with kernel 2.4 :(
>> + * so I initialize toks to something large
>> + */
>> +#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \
>>
>> Any particular reason you can't just use printk_ratelimit for this?
>
> I want to be able to do a rate-limit per specific message/code
> fragment, without affecting other messages or execution paths.

Ok, so could you change your patch to modify __printk_ratelimit() to
also accept a "struct printk_rate" datastructure and make
printk_ratelimit() call "__printk_ratelimit(&global_printk_rate);"??

Typically if $KERNEL_FEATURE is insufficient for your needs you
should fix $KERNEL_FEATURE instead of duplicating a replacement in
your driver. This applies to basically all of the things I'm talking
about, kernel-threads, workqueues (BTW: I believe you can make your
own custom workqueue thread(s) instead of using the default "events/
*" ones), debugging macros, fault-insertion, integer math, lock-
checking, dynamic tracing, etc. If you find some reason that some
generic code won't work for you, please try to fix it first so we can
all benefit from it.

>> Umm, how about fixing this to actually use proper workqueues or
>> something instead of this open-coded mess?
>
> unlikely to happen "right now". but it is on our todo list...

Unfortunately problems like these need to be fixed before a mainline
merge. Merging duplicated code is a big no-no, and historically
there have been problems with people who merge code and never
properly maintain it once it's in tree. As a result the rule is your
code has to be easily maintainable before anybody will even
*consider* merging it.

>> +/* I want the packet to fit within one page
>> + * THINK maybe use a special bitmap header,
>> + * including offset and compression scheme and whatnot
>> + * Do not use PAGE_SIZE here! Use a architecture agnostic constant!
>> + */
>> +#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof
>> (long))
>>
>> Yuck. Definitely use PAGE_SIZE here, so at least if it's broken
>> on an arch with multiple page sizes, somebody can grep for
>> PAGE_SIZE to fix it. It also means that on archs/configs with 8k
>> or 64k pages you won't waste a bunch of memory.
>
> No. This is not to allocate anything, but defines the chunk size
> with which we transmit the bitmap, when we have to. We need to be
> able to talk from one arch (say i586) to some other (say s390, or
> sparc, or whatever). The receiving side has a one-page buffer,
> from which it may or may not to endian-conversion. The hardcoded
> 4096 is the minimal common denominator here.

Ahhh. Please replace the constant "4096" with:
/* This is the maximum amount of bitmap we will send per packet */
# define MAX_BITMAP_CHUNK_SIZE 4096
# define BM_PACKET_WORDS \
((MAX_BITMAP_CHUNK_SIZE - sizeof(struct Drbd_Header))/sizeof(long))

It's more text but dramatically improves the readability by
eliminating more magic numbers. This is a much milder case than I've
seen in the past, so it's not that big of a deal.


>> +/* Dynamic tracing framework */
> guess we have to explain first what this is for. it probably is
> not exactly what you think it is... but maybe I am just too
> ignorant about what you suggest here.
>
> we'll have to defer discussion about this for when I'm done doing
> the trivial fix-ups, and provided an overall design overview.

From just poking at the code it looks vaguely similar to blktrace or
something. We do have a lot of things in the kernel which fall under
the name of "Dynamic tracing framework", so if you could look those
over and see if you can't reuse them (or just remove the code for now
and merge it later), that would be really useful.

>> +static inline void drbd_tcp_cork(struct socket *sock)

>> +{
>> +#if 1
>> + mm_segment_t oldfs = get_fs();
>> + int val = 1;
>> +
>> + set_fs(KERNEL_DS);
>> + tcp_setsockopt(sock->sk, SOL_TCP, TCP_CORK, (char *)&val,
>> sizeof(val) );
>> + set_fs(oldfs);
>> +#else
>> + tcp_sk(sock->sk)->nonagle |= TCP_NAGLE_CORK;
>> +#endif
>> +}
>>
>> Yuck, why'd you do it this way? Probably because your tcp_sk(sock-
>> >sk) stuff
>> doesn't have proper locking, I'll bet. You can avoid all the
>> extra wrapper
>> crap by just looking in "do_tcp_setsockopt" and taking the
>> appropriate lock:
>>

>> static inline void drbd_tcp_cork(struct socket *sock)

>> {
>> struct sock *sk = sock->sk;
>>
>> lock_sock(sk);
>> tcp_sk(sk)->nonagle |= TCP_NAGLE_CORK;
>> release-sock(sk);
>> }
>

> it had performance improvements somewhen. I doubt it has still.
> maybe we just remove this.

NOTE: netdev guys, any chance you could comment on whether or not a
RAID1-over-IP block device should be using TCP_NAGLE_CORK? If not,
what should they be using instead?

>> +#define peer_mask role_mask
>> +#define pdsk_mask disk_mask
>> +#define susp_mask 1
>> +#define user_isp_mask 1
>> +#define aftr_isp_mask 1
>> +#define NS(T, S) \
>> +#define NS2(T1, S1, T2, S2) \
>> +#define NS3(T1, S1, T2, S2, T3, S3) \
>> +#define _NS(D, T, S) \
>> +#define _NS2(D, T1, S1, T2, S2) \
>> +#define _NS3(D, T1, S1, T2, S2, T3, S3) \
>>
>> Grumble. When I earlier said I thought I was in macro hell, well,
>> I was
>> wrong. *THIS* is macro hell. What the fsck is that supposed to
>> do? And it
>> doesn't even include a *SINGLE* comment!!!
>
> uhm. basically you are right. Phil will take over to explain why
> it is done that way...

Thanks, I'm looking forward to a good bout of hysterical
laughter ;-) :-D.


> most other points I just snipped off of this mail will be handled
> as you suggested asap.

Great! Thanks. I'll take another deep dive into your git tree in a
week or so when I've got more free time, but it would really help to
have some smaller-ish patches which do any necessary preparatory
cleanups. Good luck with your submission!

Cheers,
Kyle Moffett

Jens Axboe

unread,
24 Jul 2007, 03:37:0124/07/2007
to Lars Ellenberg, Satyam Sharma, Kyle Moffett, Andrew Morton, lkml
On Mon, Jul 23 2007, Lars Ellenberg wrote:

Create your own (single threaded) work queue using
create_singlethread_workqueue().

--
Jens Axboe

Satyam Sharma

unread,
24 Jul 2007, 19:12:2624/07/2007
to Lars Ellenberg, Kyle Moffett, Jens Axboe, Andrew Morton, lkml
Hi Lars,


On 7/24/07, Lars Ellenberg <lars.el...@linbit.com> wrote:
> On Mon, Jul 23, 2007 at 07:10:58PM +0530, Satyam Sharma wrote:
> > On 7/23/07, Lars Ellenberg <lars.el...@linbit.com> wrote:
> > >On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:
> > >[...]
> > >> Don't use signals between kernel threads, use proper primitives like
> > >> notifiers and waitqueues, which means you should also probably switch
> > >away
> > >> from kernel_thread() to the kthread_*() APIs. Also you should fix this
> > >> FIXME or remove it if it no longer applies:-D.
> > >
> > >right.
> > >but how to I tell a network thread in tcp_recvmsg to stop early,
> > >without using signals?
> >
> > Yup, kthreads API cannot handle (properly stop) kernel threads that want
> > to sleep on possibly-blocking-forever-till-signalled-functions such as
> > tcp_recvmsg or skb_recv_datagram etc etc.
> >
> > There are two workarounds:
> > 1. Use sk_rcvtimeo and related while-continue logic
> > 2. force_sig(SIGKILL) to your kernel thread just before kthread_stop
> > (note that you don't need to allow / write code to handle / etc signals
> > in your kthread code -- force_sig will work automatically)
>
> this is not only at stop time.
> for example our "drbd_asender" thread
> does receive as well as send,

That's normal -- in fact it would've been surprising if your kthread only
did recvs but no sends!

But where does the "send" come into the picture over here -- a send
won't block forever, so I don't foresee any issues whatsoever w.r.t.
kthreads conversion for that. [ BTW I hope you're *not* using any
signals-based interface for your kernel thread _at all_. Kthreads
disallow (ignore) all signals by default, as they should, and you really
shouldn't need to write any logic to handle or do-certain-things-on-seeing
a signal in a well designed kernel thread. ]

> and the sending
> latency is crucial to performance, while the recv
> will not timeout for the next few seconds.

Again, I don't see what sending latency has to do with a kernel_thread
to kthread conversion. Or with signals, for that matter. Anyway, as
Kyle Moffett mentioned elsewhere, you could probably look at other
examples (say cifs_demultiplexer_thread() in fs/cifs/connect.c).

[ I didn't really want to give that example, because I get a nervous
breakdown when looking at that code myself, and would actively
like to save other fellow developers from a similar fate. To know
what I'm talking about, set your xterm to display 40 rows, and
then look at the line numbers 3139-3218 in that file, especially
3190-3212. Yes, what you see there is a map of Sulawesi [1]
subliminally hidden in Linux kernel code :-) ]

Anyway, cifs_demultiplexer_thread() is just your normal kthread that:

(1) Ignores all signals
(2) Calls perma-blocking-till-signalled functions such as tcp_recvmsg
(via kernel_recvmsg)
(3) Calls send-to-socket kind of functions

Hence, it could get into trouble when the umount(2) code wants to stop
it with kthread_stop() and it happens to be blocked in tcp_recvmsg()
with noblock = 0 (hence sk_rcvtimeo == MAX_SCHEDULE_TIMEOUT), thus
would handle the wake_up_process() internally, and not break out, hence
not check kthread_should_stop() which it should -- all this ensuring that
the kthread never gets killed, kthread_stop() hangs, and the umount(2)
from userspace never returns ...

But they've solved it as follows (as I suggested earlier):

(1) First, set sock->sk_rcvtimeo to some "magical value" in your code
that sets up the socket params after socket->proto_ops->connect().
See ipv4_connect(), f.e. in CIFS they've set it up to 7 seconds. But
that's arbitrarily chosen -- this'll ensure your tcp_recvmsg() isn't
perma-blocking in the first place, but will unblock/return every 7 secs,
and thus get a chance to check kthread_should_stop().

(2) From the code that wants to kill/stop the kthread (module exit, or
umount(2) most probably), just ensure you make a call to force_sig()
before kthread_stop() on that kthread -- see cifs_umount() in the
same file. This'll ensure that even if the kthread is currently sleeping
in tcp_recvmsg(), it'll be signalled to break out from there, and thus
check kthread_should_stop().

(3) Note that not a single line of code needs to be written extra in the
kthread itself for this to work -- nothing to allow / handle signals ...

Just this, should be enough for a smooth conversion to kthreads, IMHO.


> > >> +/* THINK maybe we actually want to use the default "event/%s" worker
> > >threads
> > >> + * or similar in linux 2.6, which uses per cpu data and threads.
> > >> + *
> > >> + * To be general, this might need a spin_lock member.
> > >> + * For now, please use the mdev->req_lock to protect list_head,
> > >> + * see drbd_queue_work below.
> > >> + */
> > >> +struct drbd_work_queue {
> > >> + struct list_head q;
> > >> + struct semaphore s; /* producers up it, worker down()s it */
> > >> + spinlock_t q_lock; /* to protect the list. */
> > >> +};
> > >>
> > >> Umm, how about fixing this to actually use proper workqueues or something
> > >> instead of this open-coded mess?
> > >
> > >unlikely to happen "right now".
> > >but it is on our todo list...
> >
> > It should be easier to do it now (if you defer it for later, the code will
> > only grow more and more complex). Also, removing this gunk from
> > your driver will clearly make it smaller, and easier for us to review :-)
>
> and will poison the generic work queues

You could create your own workqueue as Jens Axboe suggested.

> with stuff that might block
> somewhere deep in the tcp stack. and where we are not able to cancel it.
> not exactly desirable, either.
> but maybe I am missing something?

What would that workqueue be for, in the first place? Are we talking of
the same (which is presently the) kernel_thread() here? Sorry, but I'm
only looking at the struct / snippet above, and reading words like
"worker" / "producer" and although you're calling that struct a
drbd_work_queue, it isn't quite obvious to me you're actually /using/
it as one.

Frankly, a high-level design document is a must, here (with lower
level implementation details in the individual changelogs of the
patches you finally post to this list). Working that out from 17 kloc
would otherwise be too difficult for any reviewer.

Thanks,
Satyam

[1] Sulawesi is the oddly-shaped island in the Indonesian archipelago:
http://upload.wikimedia.org/wikipedia/commons/1/16/Sulawesi_map.PNG

Lars Ellenberg

unread,
25 Jul 2007, 05:47:0625/07/2007
to Satyam Sharma, Kyle Moffett, Jens Axboe, Andrew Morton, lkml

the basic problem, and what we use signals for, is:

it is waiting in recv, waiting for the peer to say something.
but I want it to stop recv, and go send something "right now".
I don't want to have two threads for that.

yes we have timeo in place, anyways: we need to detect a failed peer
node in time. we even aim for "sub-second failover" sometimes (which is
not exactly feasible; but failover times of 15 seconds and less are
requirement for useable HA-iSCSI deployments).
but that does not cut it, timeo is seconds.
you don't want seconds latency for IO operations.

so I signal it, it breaks out of recv, then sends, and goes back to recv.

in-kernel epoll would probably solve this.
I don't know how to do that properly, though.

> >> only grow more and more complex). Also, removing this gunk from
> >> your driver will clearly make it smaller, and easier for us to review :-)
> >
> >and will poison the generic work queues
>
> You could create your own workqueue as Jens Axboe suggested.

will do.
I was not aware of the "create_singlethread_workqueue",
it does fit our useage good enough.

> Frankly, a high-level design document is a must, here (with lower
> level implementation details in the individual changelogs of the
> patches you finally post to this list). Working that out from 17 kloc
> would otherwise be too difficult for any reviewer.

sure. will be available soon.
this was just a "bust in and see what lkml does about it",
I don't expect to be merged within days :)
I think it is realistic to be merged this year, though...
[as chrismas present, maybe :-)]

--
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :

Satyam Sharma

unread,
25 Jul 2007, 08:12:4425/07/2007
to Lars Ellenberg, Satyam Sharma, Kyle Moffett, Jens Axboe, Andrew Morton, lkml
On 7/25/07, Lars Ellenberg <lars.el...@linbit.com> wrote:
> On Wed, Jul 25, 2007 at 04:41:53AM +0530, Satyam Sharma wrote:
> > [...]

> >
> > But where does the "send" come into the picture over here -- a send
> > won't block forever, so I don't foresee any issues whatsoever w.r.t.
> > kthreads conversion for that. [ BTW I hope you're *not* using any
> > signals-based interface for your kernel thread _at all_. Kthreads
> > disallow (ignore) all signals by default, as they should, and you really
> > shouldn't need to write any logic to handle or do-certain-things-on-seeing
> > a signal in a well designed kernel thread. ]
> >
> > >and the sending
> > >latency is crucial to performance, while the recv
> > >will not timeout for the next few seconds.
> >
> > Again, I don't see what sending latency has to do with a kernel_thread
> > to kthread conversion. Or with signals, for that matter. Anyway, as
> > Kyle Moffett mentioned elsewhere, you could probably look at other
> > examples (say cifs_demultiplexer_thread() in fs/cifs/connect.c).
>
> the basic problem, and what we use signals for, is:
>
> it is waiting in recv, waiting for the peer to say something.
> but I want it to stop recv, and go send something "right now".

That's ... weird. Most (all?) communication between any two parties
would follow a protocol where someone recv's stuff, does something
with it, and sends it back ... what would you send "right now" if you
didn't receive anything?

> I don't want to have two threads for that.

I really think you should -- you clearly should. From the above, it does
appear that you're mixing in multiple kinds of stuff into a single thread,
and thus mucking up the entire design (and implementation).

> yes we have timeo in place, anyways: we need to detect a failed peer
> node in time. we even aim for "sub-second failover" sometimes (which is
> not exactly feasible; but failover times of 15 seconds and less are
> requirement for useable HA-iSCSI deployments).
> but that does not cut it, timeo is seconds.
> you don't want seconds latency for IO operations.

Ok, that's a reasonable goal.

> so I signal it, it breaks out of recv, then sends, and goes back to recv.
>
> in-kernel epoll would probably solve this.
> I don't know how to do that properly, though.

Hmm, probably I don't understand what you're doing, how you're doing etc.
Will wait for the design & implementation docs.

Thanks,
Satyam

da...@lang.hm

unread,
25 Jul 2007, 22:06:0725/07/2007
to Satyam Sharma, Lars Ellenberg, Kyle Moffett, Jens Axboe, Andrew Morton, lkml

becouse even though you didn't receive anything you now have something
important to send.

remember that both sides can be sitting in receive mode. this puts them
both in a position to respond to the other if the other has something to
say.

David Lang

Kyle Moffett

unread,
25 Jul 2007, 23:44:2825/07/2007
to da...@lang.hm, Satyam Sharma, Lars Ellenberg, Jens Axboe, Andrew Morton, LKML Kernel, net...@vger.kernel.org

Why not just have 2 threads, one for "sending" and one for
"receiving". When your receiving thread gets data it takes
appropriate locks and processes it, then releases the locks and goes
back to waiting for packets. Your sending thread would take
appropriate locks, generate data to send, release locks, and transmit
packets. You don't have to interrupt the receive thread to send
packets, so where's the latency problem, exactly?

If I were writing that in userspace I would have:

(A) The pool of IO-generating threads (IE: What would ordinarily be
userspace)
(B) One or a small number of data-reception threads.
(C) One or a small number of data-transmission threads.

When you get packets to process in your network-reception thread(s),
you queue appropriate disk IOs and any appropriate responses with
your transmission thread(s). You can basically just sit in a loop on
tcp_recvmsg=>demultiplex=>do-stuff. When your IO-generators actually
make stuff to send you queue such data for disk IO, then packetize it
and hand it off to your data-transmission threads.

If you made all your sockets and inter-thread pipes nonblocking then
in userspace you would just epoll_wait() on the sockets and pipes and
be easily able to react to any IO from anywhere.

In kernel space there are similar nonblocking interfaces, although it
would probably be easier just to use a couple threads.

Cheers,
Kyle Moffett

Evgeniy Polyakov

unread,
26 Jul 2007, 05:19:3926/07/2007
to Kyle Moffett, da...@lang.hm, Satyam Sharma, Lars Ellenberg, Jens Axboe, Andrew Morton, LKML Kernel, net...@vger.kernel.org
Hi Kyle.

On Wed, Jul 25, 2007 at 11:43:38PM -0400, Kyle Moffett (mrlin...@mac.com) wrote:
> If you made all your sockets and inter-thread pipes nonblocking then
> in userspace you would just epoll_wait() on the sockets and pipes and
> be easily able to react to any IO from anywhere.
>
> In kernel space there are similar nonblocking interfaces, although it
> would probably be easier just to use a couple threads.

There are no such interfaces in kernel - one must create own state
machine on top of ->poll() callback, or use sys_poll()/epoll, but likely
it is not what one wants for high-performance in-kernel event processing
engine. Having two threads does not solve the problem - eventually one
needs to send a header before receiving data. So, the solution would
either to use always-blocking mode like in NBD, or create own state
machine using ->poll() callbacks.

> Cheers,
> Kyle Moffett

--
Evgeniy Polyakov

Pavel Machek

unread,
27 Jul 2007, 14:47:2227/07/2007
to Lars Ellenberg, Jan Engelhardt, Jens Axboe, Andrew Morton, lkml
Hi!

> > >We implement shared-disk semantics in a shared-nothing cluster.
> >
> > If nothing is shared, the disk is not shared, but got shared-disk
> > semantics? A little confusing.
>
> Think of it as RAID1 over TCP.
> Typically you have one Node in Primary, the other as Secondary,
> replication target only.

I guess TCP means people should not swap over it?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Jan Engelhardt

unread,
30 Jul 2007, 15:42:0630/07/2007
to Lars Ellenberg, Pavel Machek, Jens Axboe, Andrew Morton, lkml

On Jul 30 2007 21:35, Lars Ellenberg wrote:

>On Fri, Jul 27, 2007 at 06:46:17PM +0000, Pavel Machek wrote:
>> Hi!
>>
>> > > >We implement shared-disk semantics in a shared-nothing cluster.
>> > >
>> > > If nothing is shared, the disk is not shared, but got shared-disk
>> > > semantics? A little confusing.
>> >
>> > Think of it as RAID1 over TCP.

Getting back at that... How does it compare (in terms of "do we need it?")
to RAIF1 over NFS?


Jan
--

Lars Ellenberg

unread,
30 Jul 2007, 16:12:4130/07/2007
to Pavel Machek, Jan Engelhardt, Jens Axboe, Andrew Morton, lkml
On Fri, Jul 27, 2007 at 06:46:17PM +0000, Pavel Machek wrote:
> Hi!
>
> > > >We implement shared-disk semantics in a shared-nothing cluster.
> > >
> > > If nothing is shared, the disk is not shared, but got shared-disk
> > > semantics? A little confusing.
> >
> > Think of it as RAID1 over TCP.
> > Typically you have one Node in Primary, the other as Secondary,
> > replication target only.
>
> I guess TCP means people should not swap over it?

people should not swap over DRBD,
because it would not be useful.

DRBD is to have applicaction data available for more than one node,
without a single point of failure; when the node the app currently runs
on crashes, the data is there so some other node can take over from there.

what would you do with the swap of a crashed node, apart from,
well, crash analysis?
you don't need it to be highly available for that.

besides, yes, when you have network io in the block io path,
with linux (and probably most other OSes),
there is the posibility of vm starvation or even deadlock.

situation is improving, though - iirc, there has been talk to
have a emergency memory pool very low level in the network stack,
and some special "I am doing block-io" socket flag;
what is the status of that, anyone?

I belive DRBD behaves very good even in oom situations.
we considered these things from the very beginning.

that said,
I did not see a DRBD cluster hanging hard in OOM, yet.
and we do operate quite a few busy
database/mail/web/file/application/iSCSI/whatever clusters.

Lars

0 new messages