Suspend to {mem,disk} broken in 2.6.15-rc6/rc7 on my T42

Jules Villard

26 Dec 2005, 15:00:11

Hi all,

I'm afraid I've been quite a naughty boy this year because Santa
brought me what looks like a kernel bug for Christmas ;)

I use basic ACPI support to suspend my box. For example, I run
# echo mem > /sys/power/state
to suspend to RAM.
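
The same write can be done from a program. What follows is a minimal C sketch, not part of the original report: the helper name write_power_state and its path parameter are assumptions made for illustration, so that it can be pointed at a scratch file rather than the real /sys/power/state (writing "mem" there as root actually suspends the machine).

```c
#include <stdio.h>

/* Hypothetical helper: write a sleep-state keyword such as "mem" or
 * "disk", followed by a newline, to the given file. This mirrors
 * `echo mem > /sys/power/state`. Returns 0 on success, -1 on failure. */
int write_power_state(const char *path, const char *state)
{
	FILE *f = fopen(path, "w");
	if (f == NULL)
		return -1;

	int ok = fprintf(f, "%s\n", state) > 0;

	/* stdio buffers the data, so a failed write may only surface
	 * when fclose() flushes it. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Calling write_power_state("/sys/power/state", "mem") would be the equivalent of the shell command above, under the same privileges.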

Resuming from a suspend on my ThinkPad T42 is broken in both -rc6 and
-rc7 releases. When X is not launched, everything goes fine, but when
resuming a running X, X looks frozen. I can ssh to my box and the
sysrq keys are still working, but I'm unable to kill the X process.
If I suspend from a vt (but still with X running), the resume goes
fine until I switch back from the vt to X.
Nothing shows up in dmesg (anyway as I said, everything still works
fine when I ssh to my box, I just can't use my computer directly), nor
in the Xorg logs (I use Xorg 6.8.2).

Everything was ok with -rc5.

Please find my .config and the lspci output attached (my graphics card
is an AGP ATI Radeon Mobility 7500 and I use the "radeon"
driver from Xorg).

Best regards,

Jules

lspci

.config

Jules Villard

26 Dec 2005, 16:30:10

On Mon, 26 Dec 2005 12:04:54 -0800, Linus Torvalds wrote:

>
>
> On Mon, 26 Dec 2005, Jules Villard wrote:
> >
> > Resuming from a suspend on my ThinkPad T42 is broken in both -rc6 and
> > -rc7 releases. When X is not launched, everything goes fine, but when
> > resuming a running X, X looks frozen. I can ssh to my box and the
> > sysrq keys are still working, but I'm unable to kill the X process.
> > If I suspend from a vt (but still with a X running), the resume goes
> > fine until I switch back from the vt to X.
>

> Since you have sysrq working, can you do SysRQ-T and send us the output?
> With CONFIG_KALLSYMS (which is on by default unless you do something
> really strange).
>
> At least that should tell _where_ X is frozen, assuming it is frozen in
> the kernel (which is not necessarily a safe assumption, of course).

Attached.

Investigating a bit further, I found out that resume is quite innocent
about all this: what hangs X is switching from a vt to X. Moreover, when I
launch X just by typing "X" in a vt, switching back and forth makes
the box hang hard (i.e. no sysrq), so I had to use startx to get a call
trace with sysrq-t (I know, it may sound like black art).

Regards,

Jules

PS: Sorry for messing up the lkml email address...

syslog_sysrq_t_rc7_startx_frozen

Jules Villard

26 Dec 2005, 20:30:20

On Mon, 26 Dec 2005 15:55:21 -0800, Linus Torvalds wrote:
>
>
> On Mon, 26 Dec 2005, Jules Villard wrote:
> >

> > Please find my .config and the lspci output attached (my graphic card
> > is a AGP plugged ATI Radeon Mobility 7500 and I use the "radeon"
> > driver from xorg).
>

> Ok, from the sysrq-T stuff it _looks_ like X is just busy-looping in user
> space. So it's probably some disagreement between radeonfb and X.org
>
> The fact that everything was ok in -rc5 would imply that it's likely one
> of the radeon aperture size issue patches.

>
> > Investigating a bit further, I found out that resume is quite innocent
> > about all this: what hangs X is switching from a vt to X.
>

> I'm cc'ing BenH and DaveA, but in the meantime, while waiting for the
> professionals, can you try to revert the two attachments (revert "diff-1"
> first, try that, and revert "diff-2" after that if it didn't start
> working after the first revert).

The first revert wasn't enough, but the second one did it! Everything is
working now.

Thanks,

Jules

Benjamin Herrenschmidt

26 Dec 2005, 20:40:10

> First revert wasn't enough, but the second one made it! Everything is
> working now.

That is not good. See my other mail. I need more info to understand
what's going on.

Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Benjamin Herrenschmidt

26 Dec 2005, 20:50:07

> > Also, does it work if you don't use radeonfb ? radeonfb shouldn't touch
> > MC_AGP_LOCATION and the DRM change only affects that, so I'm a bit
> > surprised.
> >
> > Ben.
> >
>
> Do you still want me to try that now that reverting the two patches
> made the job?

Definitely, and we need to figure out why the patch causes a regression.
Those patches fix a serious issue on a number of machines.

The problem is very nasty as all the various parties involved (radeonfb,
X radeon driver, radeon DRM, etc...) all try to reconfigure the card
memory map in differently bogus ways...

Can you add printk's to the kernel to check the values in
CONFIG_MEMSIZE, CONFIG_APER_SIZE, priv->fb_location and the value
calculated for gart_vm_start? Then tell me what those printk's show on X
start and when switching consoles.

Thanks,

Jules Villard

26 Dec 2005, 20:50:10

On Tue, 27 Dec 2005 11:00:18 +1100, Benjamin Herrenschmidt wrote:

> On Mon, 2005-12-26 at 15:55 -0800, Linus Torvalds wrote:
> >
> > On Mon, 26 Dec 2005, Jules Villard wrote:
> > >

> > > Please find my .config and the lspci output attached (my graphic card
> > > is a AGP plugged ATI Radeon Mobility 7500 and I use the "radeon"
> > > driver from xorg).
> >

> > Ok, from the sysrq-T stuff it _looks_ like X is just busy-looping in user
> > space. So it's probably some disagreement between radeonfb and X.org
> >
> > The fact that everything was ok in -rc5 would imply that it's likely one
> > of the radeon aperture size issue patches.
> >
> > > Investigating a bit further, I found out that resume is quite innocent
> > > about all this: what hangs X is switching from a vt to X.
> >
> > I'm cc'ing BenH and DaveA, but in the meantime, while waiting for the
> > professionals, can you try to revert the two attachments (revert "diff-1"
> > first, try that, and revert "diff-2" after that if it didn't start
> > working after the first revert).
>

> Also, does it work if you don't use radeonfb ? radeonfb shouldn't touch
> MC_AGP_LOCATION and the DRM change only affects that, so I'm a bit
> surprised.
>
> Ben.
>

Do you still want me to try that now that reverting the two patches
did the job?

Jules

Benjamin Herrenschmidt

26 Dec 2005, 20:50:11

On Mon, 2005-12-26 at 15:55 -0800, Linus Torvalds wrote:
>
> On Mon, 26 Dec 2005, Jules Villard wrote:
> >

> > Please find my .config and the lspci output attached (my graphic card
> > is a AGP plugged ATI Radeon Mobility 7500 and I use the "radeon"
> > driver from xorg).
>

> Ok, from the sysrq-T stuff it _looks_ like X is just busy-looping in user
> space. So it's probably some disagreement between radeonfb and X.org
>
> The fact that everything was ok in -rc5 would imply that it's likely one
> of the radeon aperture size issue patches.
>
> > Investigating a bit further, I found out that resume is quite innocent
> > about all this: what hangs X is switching from a vt to X.
>
> I'm cc'ing BenH and DaveA, but in the meantime, while waiting for the
> professionals, can you try to revert the two attachments (revert "diff-1"
> first, try that, and revert "diff-2" after that if it didn't start
> working after the first revert).

Also, does it work if you don't use radeonfb? radeonfb shouldn't touch
MC_AGP_LOCATION and the DRM change only affects that, so I'm a bit
surprised.

Ben.


Marc Koschewski

27 Dec 2005, 00:10:10

* Jules Villard <jvil...@ens-lyon.fr> [2005-12-26 22:23:39 +0100]:

Did you use the nVidia module? Several people reported machine hangs when
doing the vt <-> X switching. This, however, should be fixed with the latest
drivers.

I had the same problem some time ago. Though I knew I had reached a console
where I was logged in, the keyboard seemed to deny any service when coming from X.
Since I upgraded X to some CVS version and the nVidia driver 8174 (8178 works
as well), everything is OK.

Marc

Marc Koschewski

27 Dec 2005, 00:10:07

* Marc Koschewski <ma...@osknowledge.org> [2005-12-27 03:58:48 +0100]:

> Did you use the nVidia module? Several people reported machine hangs when
> doing the vt <-> X switching. This, however, should be fixed with the latest
> drivers.
>
> I had the same problem some time ago. Though I knew a have reached a console
> where I was logged in the keyboard seems to deny any service when coming from X.
> Since i upgraded X to some CVS version and the nVidia driver 8174 (8178 working
> as well) anything is OK.
>
> Marc

Doh! Damn tooooo late over here... I just managed to find some 'radeonfb'-like
string in your mail. :)

Good night,

Benjamin Herrenschmidt

27 Dec 2005, 07:20:11

Also, while we are at it, can you try this patch on top of current
-git? What I _think_ might be happening is that the X server is also
trying to muck around with the card memory map and is forcing it back
into a wrong setting that also happens to no longer match what the DRM
wants to do, and things blow up. There are bugs all over the place in
that code (and still some bugs in the DRM as well anyway). This patch
attempts to avoid that by using the larger of the two values, which I
think will cause it to behave as it used to for you and will still fix
the problem with machines that have an aperture size smaller than the
video memory.

That might be good enough until I fully fix X and the DRM (work in progress
but there are other "issues").

Index: linux-work/drivers/char/drm/radeon_cp.c
===================================================================
--- linux-work.orig/drivers/char/drm/radeon_cp.c 2005-12-24 10:07:22.000000000 +1100
+++ linux-work/drivers/char/drm/radeon_cp.c 2005-12-27 12:48:02.000000000 +1100
@@ -1312,7 +1312,7 @@
 static int radeon_do_init_cp(drm_device_t * dev, drm_radeon_init_t * init)
 {
 	drm_radeon_private_t *dev_priv = dev->dev_private;
-	unsigned int mem_size;
+	unsigned int mem_size, aper_size;

 	DRM_DEBUG("\n");

@@ -1527,7 +1527,9 @@
 	mem_size = RADEON_READ(RADEON_CONFIG_MEMSIZE);
 	if (mem_size == 0)
 		mem_size = 0x800000;
-	dev_priv->gart_vm_start = dev_priv->fb_location + mem_size;
+	aper_size = max(RADEON_READ(RADEON_CONFIG_APER_SIZE), mem_size);
+
+	dev_priv->gart_vm_start = dev_priv->fb_location + aper_size;

 #if __OS_HAS_AGP
 	if (!dev_priv->is_pci)
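
To make the effect of the patch concrete, here is a small user-space C sketch of the same arithmetic. It is not kernel code: the function names gart_vm_start_old and gart_vm_start_new and their plain integer parameters are assumptions standing in for the RADEON_READ() register reads in radeon_do_init_cp(). With the values Jules reports elsewhere in this thread (fb_location e0000000, mem_size 2000000, aper_size 4000000), the removed line yields e2000000 and the max() version yields e4000000.

```c
#include <stdint.h>

/* Fallback the driver uses when CONFIG_MEMSIZE reads back as zero. */
#define MEMSIZE_FALLBACK 0x800000u

/* Line removed by the patch: GART starts right after video memory. */
uint32_t gart_vm_start_old(uint32_t fb_location, uint32_t mem_size)
{
	if (mem_size == 0)
		mem_size = MEMSIZE_FALLBACK;
	return fb_location + mem_size;
}

/* Lines added by the patch: offset by the larger of the aperture size
 * and the video memory size. */
uint32_t gart_vm_start_new(uint32_t fb_location, uint32_t mem_size,
			   uint32_t aper_size)
{
	if (mem_size == 0)
		mem_size = MEMSIZE_FALLBACK;

	uint32_t offset = aper_size > mem_size ? aper_size : mem_size;
	return fb_location + offset;
}
```

When the aperture is smaller than video memory (for instance aper_size 4000000 with mem_size 8000000, values chosen purely for illustration), both functions return fb_location + 8000000, so the max() version only changes the result on machines where the aperture is the larger of the two.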

Jules Villard

27 Dec 2005, 07:30:12

Yes, (2.6.15-rc7-git1 + this patch) fixes it.

Jules Villard

27 Dec 2005, 08:10:13

On Tue, 27 Dec 2005 12:27:08 +1100, Benjamin Herrenschmidt wrote:
>
> > > Also, does it work if you don't use radeonfb ? radeonfb shouldn't touch
> > > MC_AGP_LOCATION and the DRM change only affects that, so I'm a bit
> > > surprised.
> > >
> > > Ben.
> > >
> >
> > Do you still want me to try that now that reverting the two patches
> > made the job?
>
> Definitely, and we need to figure out why the patch cause a regression.
> Those patches fixes a serious issues with a number of machines.

Removing radeonfb from the kernel only makes things worse: the box
gets completely frozen when reproducing the bug (no more ssh access or
sysrq).

>
> The problem is very nasty as all the various parties involved (radeonfb,
> X radeon driver, radeon DRM, etc...) all try to reconfigure the card
> memory map in differently bogus ways...
>
> Can you add printk's to the kernel to check the values in
> CONFIG_MEMSIZE, CONFIG_APER_SIZE, priv->fb_location and the values
> calculated for gart_vm_start ? Then tell me what that printk gets on X
> start and when switching consoles.

I get these figures when I first start X:
[ 104.399101] ### fb_location is now e0000000
[ 104.399104] ### mem_size is 2000000
[ 104.399107] ### aper_size is 4000000
[ 104.399109] ### gart_vm_start is e2000000

The sad thing is that it looks like the crash occurs *before* entering
the radeon_do_init_cp function, assuming it should enter it again when
I switch back from a tty to X (I've put some printk's at the
beginning of the function but didn't see them in dmesg although other
things showed up), so I don't know where to put the printk's in order
to get other figures...

Thanks,

Jules

Benjamin Herrenschmidt

27 Dec 2005, 18:30:12

On Tue, 2005-12-27 at 13:55 +0100, Jules Villard wrote:

> The sad thing is that it looks like the crash occurs *before* entering
> the radeon_do_init_cp function, assuming it should enter it again when
> I switch back from a tty to X (I've put some printk's at the
> beginning of the function but didn't see them in dmesg although other
> things showed up), so I don't know where to put the printk's in order
> to get other figures...

I think the problem is actually a bug in the X server that we are
triggering indirectly. It's very difficult to fix things properly
because of various bugs in X and the DRM that depend on each other's
side effects. I may have to back it all out for now and add some version
test to both the DRM and X so that they only try to "do the right thing"
once they detect that the other side has been fixed too...

Let's see if the latest patch I posted that fixes things for you also
helps others though.
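
The version test mentioned above could look roughly like the following C sketch. Everything in it is hypothetical: the thread does not say which version numbers would be used or how each side would query the other, so FIXED_MEMMAP_VERSION and use_fixed_memmap are illustrative names only.

```c
#include <stdbool.h>

/* Hypothetical interface version from which a component programs the
 * memory map correctly; no real number was decided in the thread. */
#define FIXED_MEMMAP_VERSION 2

/* Use the corrected memory-map setup only when both the DRM and the X
 * driver are known to be fixed; otherwise keep the old behaviour, which
 * is at least consistent between the two sides. */
bool use_fixed_memmap(int drm_version, int x_driver_version)
{
	return drm_version >= FIXED_MEMMAP_VERSION &&
	       x_driver_version >= FIXED_MEMMAP_VERSION;
}
```

The point of gating on both versions is that a fixed DRM paired with an unfixed X server (or the reverse) falls back to the old code path instead of mixing two different ideas of the memory map.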

Benjamin Herrenschmidt

28 Dec 2005, 01:20:08

Linus, please back out those 2 DRM patches of mine for 2.6.15. It seems
that they cause more problems than they solve due to bugs in the X
server. I need to rethink the solution. I'll come up with new patches
after 2.6.15 along with matching X.org patches that will check each
other's driver version to avoid problems from mismatched versions. In the
meantime, users will have to live with the current problem (typically
some machines not waking up from sleep).

Linus Torvalds

28 Dec 2005, 17:00:16

On Wed, 28 Dec 2005, Benjamin Herrenschmidt wrote:
>
> Linus, please back out those 2 DRM patches of me for 2.6.15. It seems
> that they cause more problems than they solve due to bugs in the X
> server. I need to rethink the solution.

Hmm.. How many other problem reports do we have? Jules reported that your
patch to use the max() of the aperture size and memsize fixed the problem
for him (and I merged it). Does it have other downsides?

Linus

Benjamin Herrenschmidt

28 Dec 2005, 18:20:14

On Wed, 2005-12-28 at 13:49 -0800, Linus Torvalds wrote:
>
> On Wed, 28 Dec 2005, Benjamin Herrenschmidt wrote:
> >
> > Linus, please back out those 2 DRM patches of me for 2.6.15. It seems
> > that they cause more problems than they solve due to bugs in the X
> > server. I need to rethink the solution.
>
> Hmm.. How many other problem reports do we have? Jules reported that your
> patch to use the max() of the aperture size and memsize fixed the problem
> for him (and I merged it). Does it have other downsides?

It doesn't, but I've got one confirmed report of a failure that isn't
fixed by the latest patch and two others that are still dubious.

I'm not entirely sure what's going on yet. On console switch (EnterVT()
in the X driver), it will restore the mode and set back the wrong value
in MC_AGP_LOCATION. It will then re-enable AGP and call the "resume"
ioctl to the DRM which should then "fix" MC_AGP_LOCATION to the
"correct" value we calculated. However, it's possible that the chip
dislikes those constant changes of these memory controller settings
especially while it's currently pumping pixels out.

Also, if using dual head, it's possible that the X server radeon driver
goes back to writing the wrong value _again_ after the first head has been
re-initialized, and while the engine is actively pumping commands from
AGP, which would be deadly. The radeon driver in X is one of the worst
messes I've ever dealt with so far...

So I think at this point, the best option is to keep the old bogus code
that at least is consistent with the bug in the server. I'm working on a
big patch to X that reworks the memory map stuff completely and fixes
those issues on the server side. I'll do a DRM patch matching this X fix
as well so that the memory map is only ever set in one place and with
what I hope is a correct algorithm...

Ben.