From this thread, it seems like turning off locking completely won't
work, as we rely on it.
Some more ideas.
1) Sledgehammer: refuse to run on NFS. Aside from this bug, network
file systems hurt performance in all sorts of unexpected ways.
(See e.g. "https connections are very slow to establish (5-30
seconds)" http://code.google.com/p/chromium/issues/detail?id=48585 )
This would force some small fraction of our users to manually modify
their profiles to be on local disk, which is pretty brutal.
On the other hand, the sort of user who has an NFS homedir is more
likely the sort of user who can understand how to do this.
Probably a non-starter but it is tempting. I hate that for some of
our users we just have slow+unreliable performance without them having
any way of knowing the reason (or that it is in fact a problem).
Here's a real unprompted quote from a coworker on this subject: "I
just moved Chrome's profile directory and oh my God is it zippy. I
never knew..."
2) Attempt to clean up locks when we detect we're in a hosed state,
either automatically at startup or by amending our --diagnostics mode
to do the above unlocking manually (and have our "your profile is
corrupt" dialog recommend running in --diagnostics mode). This would
also allow us to recommend moving the profile (and NSS database, for
that matter) to local disk at that point.
2a) Using the existing locking mechanism: if you're on NFS,
recursively copy your profile to a temporary location; delete your
profile; move the copy back to the original location. If you're on
another file system, we've got something else going wrong.
2b) Use an alternative locking mechanism that allows us to clean up
the locks. sqlite supports creating a dotfile on the side whenever it
needs to access a file. It says this makes performance much worse
(repeatedly creating and deleting a file every time you access the
database, and you can't have any simultaneous readers), but perhaps
that is acceptable.
3) Implement some complex locking structure of our own on the side
like what Scott suggested. Realistically I'm not going to do this; I
estimate the fraction of our users to be affected by the above
problems to be somewhere around 0.01%, I have other things to worry
about, and we need less code, not more.
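For concreteness, here is a minimal shell sketch of the copy/delete/move
dance in 2a. It is only an illustration under stated assumptions: the
function name refresh_profile is made up, no Chrome process may be using
the profile while it runs, and the copy is staged next to the profile
(rather than in /tmp) so the final rename stays on one file system. It
also renames the old profile aside before removing it, a small variation
on "delete, then move back" so that a failure halfway leaves a backup.

```shell
# Hypothetical helper (not part of Chrome): rebuild a profile directory from
# a fresh copy so the new files are new inodes with no stale locks attached.
# Assumes nothing is using the profile while this runs.
refresh_profile() {
    profile=$1
    parent=$(dirname -- "$profile")

    # Stage the copy beside the profile so the final mv is a plain rename.
    tmp=$(mktemp -d "$parent/profile-copy.XXXXXX") || return 1

    # -a keeps modes, timestamps and symlinks; the trailing "/." also
    # picks up dotfiles inside the profile.
    cp -a -- "$profile/." "$tmp/" || { rm -rf -- "$tmp"; return 1; }

    # Move the original aside first so a failed swap still leaves a backup.
    mv -- "$profile" "$profile.stale" || return 1
    mv -- "$tmp" "$profile" || return 1

    # Only now drop the old files (and whatever locks were stuck on them).
    rm -rf -- "$profile.stale"
}
```

Something like `refresh_profile ~/.config/google-chrome/Default` would be
the manual equivalent of what a --diagnostics step might do.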
My current plan is:
- Along the lines of (1), add an UMA stat related to network file
systems, just to evaluate how much of a problem this really is for
users.
- Implement 2a under the diagnostics mode.
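To make the "is this profile on a network file system" check concrete,
here is a rough shell sketch. It assumes GNU coreutils stat (the flags
differ on BSD and macOS), the function names are invented for
illustration, and the list of network file system types is a guess at
the common ones rather than a complete list. The real check would live
in Chrome's own code, not in a script.

```shell
# Print the file system type name for a path, e.g. "nfs" or "ext2/ext3".
# Assumes GNU coreutils stat.
fs_type() {
    stat -f -c %T -- "$1"
}

# Classify a type name as "network" or "local".
# The pattern list is illustrative, not exhaustive.
fs_class() {
    case $1 in
        nfs*|cifs|smb*|afs|coda) echo network ;;
        *) echo local ;;
    esac
}

# Hypothetical convenience wrapper: classify the file system holding a path.
profile_fs_class() {
    fs_class "$(fs_type "$1")"
}
```

Running `profile_fs_class ~/.config/google-chrome` prints either
"network" or "local", which is roughly the one bit the stat would record.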
On Thu, Aug 19, 2010 at 3:29 PM, Evan Martin <ev...@chromium.org> wrote:
> 1) Sledgehammer: refuse to run on NFS. Aside from this bug, network
> file systems hurt performance in all sorts of unexpected ways.
> (See e.g. "https connections are very slow to establish (5-30
> seconds)" http://code.google.com/p/chromium/issues/detail?id=48585 )
> This would force some small fraction of our users to manually modify
> their profiles to be on local disk, which is pretty brutal.
> On the other hand, the sort of user who has an NFS homedir is more
> likely the sort of user who can understand how to do this.
Could we have an infobar yell at the user on every startup? "Hello,
you are using NFS and NFS is broken. Move your profile off, or your
profile will be corrupted periodically?"
-- Elliot
As an aside, I don't think winning converts from Firefox is the goal.
But certainly we should emulate the good things Firefox does.
> Regarding the underlying problem, *every single time* my desktop at home
> (which mounts my home directory from my server via NFS) has been shutdown
> uncleanly (usually extended power outage, once a video driver bug), my
> Chrome profile is totally corrupted and I have to delete it and start fresh
> (not shared with any other Chrome instance, not leftover Chrome processes
> running, etc). I have taken to making known-good copies of it to avoid
> losing everything. I think in the 5 years of using Firefox, I have had a
> profile corrupted in that way exactly once.
> If Oracle supports running an RDBMS over NFS on a NetApp, surely you can
> store profile information.
You would think so! The frustrating thing about this is that sqlite
is actually very carefully constructed to be resilient against these
kinds of failures, so it is unlikely that it is actually corrupted.
So could you elaborate on what you mean by "totally corrupted"?
The next time this happens, try this recipe:
sudo apt-get install sqlite3
cd ~/.config/google-chrome/Default  # or wherever it lives
# Find every SQLite file in the profile and run an integrity check on each.
find . -type f -print0 | xargs -0 file | grep -i ': *sqlite' | cut -d: -f1 |
while IFS= read -r f; do
  printf 'Checking %s... ' "$f"
  sqlite3 "$f" "pragma integrity_check"
done
That might point at the culprit.
The problem this thread is describing seems to be just a bug in an NFS
implementation. At least on the machine I tested at Google, the
profile file was reported as locked, and no amount of fiddling would
fix it (including rebooting the client machine) -- it seemed like the
NFS server was hanging onto a lock, and as far as I understand the
POSIX locking API we can't tell it to let go. I suggested upthread
(proposal 2b) one way to work around this, but it seems kind of
pathetic to me.
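For anyone who wants to try the dot-file approach by hand, here is a
small sketch using the sqlite3 command-line shell. As far as I know the
stock Unix build ships a VFS named "unix-dotfile" that the shell can
select with -vfs, and the lock shows up as an entry named "<db>.lock"
beside the database; treat both as assumptions to verify against your
sqlite version. The function names are made up. The point is only that a
stale lock of this kind is something we can see and remove, unlike a
lock stuck on the NFS server.

```shell
# Run one SQL statement against a database using SQLite's dot-file locking
# VFS instead of POSIX advisory locks.
dotfile_sql() {
    sqlite3 -vfs unix-dotfile "$1" "$2"
}

# Remove a leftover lock entry for the given database.
# Assumes the lock lives at "<db>.lock" and that no live process holds it.
clear_stale_lock() {
    rm -rf -- "$1.lock"
}
```

For example, `dotfile_sql History "pragma integrity_check"` would check a
database under dot-file locking, and `clear_stale_lock History` would
clean up after a crash that left the lock behind.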
Hopefully you're encountering some other sort of corruption, because
this type is nearly out of our hands. (I would love to be corrected if
I'm wrong.)