SSD persistent disk issue


Rafael Campos Las Heras

Apr 16, 2015, 6:13:13 PM
to gce-dis...@googlegroups.com
Hi folks!

We have been testing GCE for deploying Aerospike instances.

We set up a virtual machine with an SSD Persistent Disk.

Yesterday it stopped working, and now we have an invalid partition on this disk; we are unable to mount it or recover the data.

Are there any issues concerning SSD Persistent Disk in us-central1-a?
There is a mail [1] from the gce-operations group about "Degraded network performance"; could this be related?

This is a big concern for us, as we are going to put most of our data on those GCE instances.

[1] https://groups.google.com/forum/#!topic/gce-operations/_GqhyTFEXUs

Kind Regards,
--
Rafael Campos Las Heras
GetWi

David Newgas

Apr 17, 2015, 12:13:13 PM
to gce-dis...@googlegroups.com
Hi Rafael,

We don't have any ongoing SSD PD issue, but we would like to investigate. Could you let me know (off list if you want) what project and disk you are seeing the issue on?

Thanks,
David Newgas
Google Cloud Platform

David Newgas

Apr 17, 2015, 1:59:09 PM
to gce-dis...@googlegroups.com
Hi Rafael,

Thanks for the details. It would be great to have some more background on the issue, for example:
  • What error message do you get when you mount the disk?
  • Where do you see "invalid partition"?
  • Have you resized or repartitioned the disk at all? If so what commands did you use?
  • What are you trying to do to recover data, and what errors are you encountering?
Thanks,
David

Rafael Campos Las Heras

Apr 17, 2015, 3:03:49 PM
to David Newgas, gce-dis...@googlegroups.com
Hi David,

See my comments inline.

On Fri, 17 Apr 2015 at 14:59 'David Newgas' via gce-discussion <gce-dis...@googlegroups.com> wrote:
Hi Rafael,

Thanks for the details. It would be great to have some more background on the issue, for example:
  • What error message do you get when you mount the disk?
We are not able to mount the disk because of an invalid partition table.
  • Where do you see "invalid partition"?
You can see the message at boot:
Mounting local filesystems...[    7.494773] EXT4-fs (sdb): VFS: Can't find ext4 filesystem
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

And if you run dmesg, you can see this relevant information:
[    2.215515] sd 0:0:1:0: [sda] Attached SCSI disk
[    2.216651]  sdb: unknown partition table

  • Have you resized or repartitioned the disk at all? If so what commands did you use?
We created the Compute Engine instance and the SSD disk; afterwards we neither resized nor repartitioned the SSD.
We installed Aerospike Community Edition to start migrating our database and began running some tests.
Sometime yesterday afternoon (BRT) it stopped working, and we started analysing what happened.
  • What are you trying to do to recover data, and what errors are you encountering?
We are not taking any action to recover the data, as we could lose the integrity of the SSD disk.
The instance was a test instance, so it is not critical.

We are evaluating the architecture before putting any production data on it.
 

--
© 2014 Google Inc. 1600 Amphitheatre Parkway, Mountain View, CA 94043
 
Email preferences: You received this email because you signed up for the Google Compute Engine Discussion Google Group (gce-dis...@googlegroups.com) to participate in discussions with other members of the Google Compute Engine community and the Google Compute Engine Team.
---
You received this message because you are subscribed to a topic in the Google Groups "gce-discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gce-discussion/NPD5-pF8tfE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gce-discussio...@googlegroups.com.
To post to this group, send email to gce-dis...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gce-discussion/CAJZK_bZorGbrYb7_qPaSW9XPNnhi1SWy0QgJ0U-OFeq1vGqsAA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

David Newgas

Apr 17, 2015, 3:32:37 PM
to Rafael Campos Las Heras, gce-dis...@googlegroups.com
+cc gce-discussion (oops)

Also, could you let me know the exact mount command you ran?

On Fri, Apr 17, 2015 at 12:31 PM, David Newgas <dne...@google.com> wrote:
OK, our team has not seen signs of a fault on the PD side. There are numerous ways disks can become unmountable, so I'd like to collect more info for diagnostics. Specifically:
  • Please send the contents of /etc/fstab
  • Please run "for disk in /dev/sda*; do echo ***$disk; sudo fsck -nf $disk; done" and send the output
  • Please run "sudo fdisk -l /dev/sdb" and send the output
  • Please run "sudo dd if=/dev/sdb bs=512 count=2 of=/tmp/sdb_dump" and then attach /tmp/sdb_dump
  • For each partition listed by fdisk please note the "start" number. For each partition then run "sudo dd if=/dev/sdb bs=512 count=4 skip=<START_NUMBER> of=/tmp/sdb_part_<PARTITION_NUMBER>". Please send all the /tmp/sdb_part_* files.
  • If errors occur with any of the above commands, please include them.
  • Aerospike has a page on SSD setup that advises disk partitioning. Did you do such partitioning?
Apologies for the flood of questions.

David

David Newgas

Apr 17, 2015, 3:37:25 PM
to Rafael Campos Las Heras, gce-dis...@googlegroups.com
In fact, could you also use count=3 instead of count=2 in my 4th bullet?

David Newgas

Apr 17, 2015, 3:44:55 PM
to Rafael Campos Las Heras, gce-dis...@googlegroups.com
One final request: Please try snapshotting your disk. This will help us test whether the error is in the PD service itself or on the VM.  You can do this by running the command:

gcloud compute disks snapshot your_disk_name --snapshot-names test_snapshot

David Newgas

Apr 23, 2015, 1:50:30 PM
to gce-dis...@googlegroups.com, raf...@getwi.com
Hi Rafael,

Are you still having trouble with your disk? We are glad to try and help. Our engineering team and I need a little more info to narrow down the issue; here is a list of data to collect that will help us:
  • Can you snapshot the disk? You can do this by running the command: gcloud compute disks snapshot your_disk_name --snapshot-names test_snapshot
  • Please send the contents of /etc/fstab
  • Please run "for disk in /dev/sda*; do echo ***$disk; sudo fsck -nf $disk; done" and send the output
  • Please run "sudo fdisk -l /dev/sdb" and send the output
  • Please run "sudo dd if=/dev/sdb bs=512 count=3 of=/tmp/sdb_dump" and then attach /tmp/sdb_dump
  • For each partition listed by fdisk please note the "start" number. For each partition then run "sudo dd if=/dev/sdb bs=512 count=4 skip=<START_NUMBER> of=/tmp/sdb_part_<PARTITION_NUMBER>". Please send all the /tmp/sdb_part_* files.
  • If errors occur with any of the above commands, please include them.
  • Aerospike has a page on SSD setup that advises disk partitioning. Did you do such partitioning?
    Thanks,
    David

    David Newgas

    Apr 27, 2015, 9:21:47 PM
    to gce-dis...@googlegroups.com, Rafael Campos Las Heras
    Hi Rafael,

    Thanks for the additional data (off list). Here are the conclusions I can draw:
    • There is no PD corruption, as this would result in the snapshot failing.
    • Your fstab implies the disk was previously configured to have the filesystem directly on the disk (with no partition table). In this configuration the "sdb: unknown partition table" message is perfectly normal.
    • The filesystem that the fstab implies should be present is not there: the dump you sent me is not a valid filesystem.
    • The dump you sent looks like a valid Aerospike data file. I think your system is configured to have Aerospike read and write directly to the disk without a filesystem in between. This is described in the "Recipe for an SSD Storage Engine" at https://www.aerospike.com/docs/operations/configure/namespace/storage/.
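
The distinction drawn in these conclusions (partition table vs. filesystem directly on the disk vs. raw application data) can be checked by looking for well-known signature bytes near the start of the device. The following is only a minimal sketch, not the procedure used in this thread: the function names read_hex and probe_signatures are made up for illustration, and the check covers only the MBR boot signature (bytes 55 AA at offset 510) and the ext2/3/4 superblock magic (0xEF53, stored little-endian at byte offset 1080). It only reads from the target.

```shell
#!/bin/sh
# Sketch: report which well-known signatures appear at the start of a
# block device or disk image. Read-only; helper names are illustrative.

# Print COUNT bytes starting at byte OFFSET of FILE as lowercase hex.
# usage: read_hex FILE OFFSET COUNT
read_hex() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# Print one line per signature found, or "no-known-signature".
# usage: probe_signatures FILE_OR_DEVICE
probe_signatures() {
    target=$1
    found=none
    # An MBR partition table ends sector 0 with the bytes 55 AA (offset 510).
    if [ "$(read_hex "$target" 510 2)" = "55aa" ]; then
        echo "mbr-signature"
        found=yes
    fi
    # The ext2/3/4 superblock starts at byte 1024; its magic 0xEF53 sits at
    # offset 56 inside it (absolute offset 1080), stored little-endian.
    if [ "$(read_hex "$target" 1080 2)" = "53ef" ]; then
        echo "ext-superblock"
        found=yes
    fi
    if [ "$found" = none ]; then
        echo "no-known-signature"
    fi
}

# When run as a script with an argument, probe that path.
if [ "$#" -gt 0 ]; then
    probe_signatures "$1"
fi
```

Saved under a name of your choosing and run with root privileges against the data disk (for example /dev/sdb), a result of "no-known-signature" would be consistent with a disk that an application writes to directly, although it cannot by itself confirm what the data is.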
    There are three things you should do:
    1. Ignore "unknown partition table" messages.
    2. Remove the fstab entry for your data disk.
    3. Ensure the Aerospike configuration still points at the correct location for your disk. You mentioned it moved between sdb and sdc; you should probably update the configuration to refer to the disk by its /dev/disk/by-id path.
    You should then find Aerospike successfully uses your data. You will still not find any mountable filesystem on that drive, but that is by design.
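
To illustrate the /dev/disk/by-id suggestion above: kernel names such as sdb and sdc can change between boots, while the symlinks under /dev/disk/by-id stay tied to the disk. The sketch below lists each stable name next to the kernel device it currently resolves to. The function name list_stable_names is made up for illustration, and the assumption that Compute Engine disks show up there with a google- prefix should be verified on your own instance.

```shell
#!/bin/sh
# Sketch: map stable /dev/disk/by-id names to the kernel device names
# (sdb, sdc, ...) they currently point at. The optional directory argument
# exists only so the function can be tried against a scratch directory.
list_stable_names() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "$link" "$(readlink -f "$link")"
    done
}

# When run as a script, list the real directory (or the one given).
list_stable_names "$@"
```

The left-hand path for the data disk is the one to put wherever the Aerospike configuration names the raw device, so that the setting keeps working if the disk is attached as sdc on the next boot.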

    Yours,
    David


    Rafael Campos Las Heras

    Apr 30, 2015, 10:54:37 AM
    to David Newgas, gce-dis...@googlegroups.com
    Hi David,

    Sorry for the late response; we have been quite busy.
    Thank you very much for helping us with this issue.
    We learned some things along the way.

    We are going to keep testing Compute Engine.


    Kind Regards,
    --
    Rafael Campos Las Heras
    GetWi

