Frequent OS crashes - looking for troubleshooting ideas


    [SOLVED] Frequent OS crashes - looking for troubleshooting ideas

    Hello all, for a while now I have been experiencing sudden OS crashes. Sometimes the screen just turns black and no other action is possible; I can't even switch to another tty. Sometimes applications die one by one and Plasma freezes, and when I switch to another tty I see error messages indicating that the "/" drive is not writable or not available at all. In all cases the CPUs seem to go to 100% load, as the laptop heats up quickly with all fans on, and the only way out is a hard reset.

    I may get these crashes multiple times a day, or not at all for a day or two. I keep on checking the syslog after every crash but there is nothing which looks suspicious and it's always something different which shows as the last entry before the crash. With the crash just now OneDrive shows up as last entry before the regular boot entries and no error messages anywhere close.

    Code:
    Sep 23 09:55:36 hermes org.kde.kpasswdserver[2422]: message repeated 2 times: [ org.kde.kio.kpasswdserver: User = "" , WindowId = 0]
    Sep 23 09:55:42 hermes onedrive[1108]: Syncing changes from OneDrive ...
    Sep 23 10:01:51 hermes kernel: [    0.000000] microcode: microcode updated early to revision 0xc6, date = 2018-04-17
    Sep 23 10:01:51 hermes kernel: [    0.000000] Linux version 4.15.0-34-generic (buildd@lgw01-amd64-047) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #37-Ubuntu SMP Mon A$
    I had the BIOS scan the hardware and checked the SMART values of my SSD, but found nothing obvious.
    For a while I suspected that VirtualBox on btrfs was the reason, so I changed everything recommended here and elsewhere: moved to a static image, disabled CoW and, after none of that helped, installed a second drive with ext4 for the VMs.
    Well, I had another crash just now, which I guess rules out VirtualBox on btrfs.

    I am now running out of ideas as to what to change next and what to do to identify the root cause, so if anyone has got any thoughts/ideas as to what to test I would really appreciate if you could share.

    Given that it's always a full OS crash, I suspect that whatever is causing the issue runs at kernel level, so next on my list are changing the graphics driver and removing drive encryption.

    Laptop specs:
    Kubuntu 18.04, current
    Dell Precision 7510 with 512 GB SSD (crypt + btrfs) + 2TB HDD (crypt + ext4)
    Last edited by Thomas00; Sep 23, 2018, 02:47 AM. Reason: Typo

    #2
    A laptop. When was the last time you had it cleaned? Dust builds up inside, and can seriously interfere with cooling. Also, the CPU in a laptop has a heatsink, and the paste that secures it to the CPU can degrade and the heatsink loosen. An overheating CPU can bring a PC to a halt and/or cause any number of issues with the OS and/or running applications.
    Windows no longer obstructs my view.
    Using Kubuntu Linux since March 23, 2007.
    "It is a capital mistake to theorize before one has data." - Sherlock Holmes



      #3
      what does

      Code:
      df -h
      show

      VINNY

      EDIT: + what @Snowhog asked/said
      i7 4core HT 8MB L3 2.9GHz
      16GB RAM
      Nvidia GTX 860M 4GB RAM 1152 cuda cores



        #4
        Ya, that's what I was thinking -- SSD filling up. Too many old extents. When was the last time a trim was done?
        "A nation that is afraid to let its people judge the truth and falsehood in an open market is a nation that is afraid of its people.”
        – John F. Kennedy, February 26, 1962.



          #5
          Thanks guys for jumping in, I really appreciate the ideas and the fact that you act as a sounding board!
          A full disk was my first thought as well but this looks good to me?

          Code:
          Filesystem                  Type      Size  Used Avail Use% Mounted on
          udev                        devtmpfs  7.8G     0  7.8G   0% /dev
          tmpfs                       tmpfs     1.6G  1.7M  1.6G   1% /run
          /dev/mapper/nvme0n1p3_crypt btrfs     452G  309G  142G  69% /
          tmpfs                       tmpfs     7.8G   60M  7.8G   1% /dev/shm
          tmpfs                       tmpfs     5.0M  4.0K  5.0M   1% /run/lock
          tmpfs                       tmpfs     7.8G     0  7.8G   0% /sys/fs/cgroup
          /dev/loop0                  squashfs  142M  142M     0 100% /snap/slack/8
          /dev/loop1                  squashfs   88M   88M     0 100% /snap/core/5328
          /dev/loop2                  squashfs   87M   87M     0 100% /snap/core/4486
          /dev/mapper/nvme0n1p3_crypt btrfs     452G  309G  142G  69% /mnt
          /dev/nvme0n1p2              ext4      705M  150M  505M  23% /boot
          /dev/nvme0n1p1              vfat      511M  6.1M  505M   2% /boot/efi
          /dev/mapper/nvme0n1p3_crypt btrfs     452G  309G  142G  69% /home
          tmpfs                       tmpfs     1.6G   12K  1.6G   1% /run/user/1000
          /dev/sda1                   fuseblk    55G   25G   30G  45% /media/thomas/Windows10
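A side note on reading this output, offered as a hedged sketch rather than anything posted in the thread: the squashfs loop mounts report 100% because snap images are read-only, so they say nothing about a full disk. The helper below filters them out and prints only mount points at or above a threshold. It assumes `df -hT`-style columns like the ones above, and the name `full_filesystems` is made up for illustration. On btrfs, `df` also does not show how much space is allocated to chunks; `sudo btrfs filesystem usage /` gives that view.

```shell
# Hypothetical helper: print mount points at or above a usage threshold,
# skipping squashfs images (snaps are read-only and always report 100%).
# Expects `df -hT`-style input on stdin:
#   Filesystem Type Size Used Avail Use% Mounted-on
full_filesystems() {
    threshold="$1"
    awk -v t="$threshold" '
        NR == 1 { next }              # skip the header row
        $2 == "squashfs" { next }     # snap images are always 100% full
        {
            pct = $6
            sub(/%/, "", pct)         # "69%" -> "69"
            if (pct + 0 >= t + 0) print $7
        }'
}

# Usage: df -hT | full_filesystems 90
```

Mount points containing spaces would be cut short by this field-based parsing; for a quick check on a single machine that is probably acceptable.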
          As for trim, I have indeed never run it; I thought the OS would take care of this?
          Ran the below just now from the command line:
          Code:
          thomas@hermes:~$ sudo fstrim / -v
          /: 105.8 GiB (113556119552 bytes) trimmed
          thomas@hermes:~$ sudo fstrim /home -v
          /home: 100.8 GiB (108206579712 bytes) trimmed
          Well, that could indeed mean I did run out of space without noticing.
          I just went through this https://turriebuntu.wordpress.com/ubuntu-pages/precise-specific-pages/using-fstrim-to-trim-your-ssd-instead-of-delete-in-fstab to make sure trim is taken care of and from now on will run the script once a week.
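For reference, a weekly trim job along these lines could look like the sketch below. It is not the script from the linked article; the function name `trim_and_log`, the `TRIM_CMD` override and the log path in the usage line are assumptions made for illustration. It appends a timestamp header followed by the `fstrim -v` output for each mount point. Ubuntu 18.04 should also ship a systemd `fstrim.timer`, so `systemctl status fstrim.timer` is worth checking before adding a second job.

```shell
#!/bin/sh
# Sketch of a weekly trim job: append a timestamp header to a log file,
# then the `fstrim -v` output for every mount point given.
# TRIM_CMD is a hypothetical override so the function can be tried without root.
trim_and_log() {
    log="$1"
    shift
    printf '*** %s ***\n' "$(date -R)" >> "$log"
    for mp in "$@"; do
        # Left unquoted on purpose so "fstrim -v" splits into command and flag.
        ${TRIM_CMD:-fstrim -v} "$mp" >> "$log" 2>&1
    done
}

# Usage (as root): trim_and_log /var/log/trim.log / /home
```

Logging the output makes it easy to compare trimmed amounts from week to week.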

          Cleaning the laptop is also a good idea, although I'd say it's less likely to be my issue, as the heating up always happens after things have gone bad.

          Again, really good input! Now that the trim ran, maybe things are sorted already? I'll keep you posted!

          -----------------------------
          Edit: Well, three days later and not a single crash! I think I can safely say the trim fixed it! Again, thanks a lot, guys, for helping me straighten things out!
          Last edited by Thomas00; Sep 26, 2018, 12:56 PM. Reason: Final update



            #6
            Well, I rejoiced too soon. Not a single crash for more than a week, but 4 crashes today.
            The first one while downloading an Ubuntu server ISO, the second when plugging in a USB stick, and the others more or less out of the blue as usual.

            My last fstrim ran this afternoon:

            *** Wed, 03 Oct 2018 13:29:16 +0200 ***
            /: 193.2 GiB (207453040640 bytes) trimmed
            /home: 181.8 GiB (195147513856 bytes) trimmed

            Got a couple of screenshots, well photos, in case that's of help: https://1drv.ms/f/s!Avv04SyB_Fr_jelaDjdX5sn9ZQz0SQ



              #7
              Saw your photos.

              You were shutting down and the target had almost been reached when encryption failed at 117 seconds. That locked the system into read-only mode, and none of the subsequent attempts to save journals succeeded.

              Time to roll back to your most recent @ and @home snapshot. Then take a look at your encryption protocols.
              Last edited by GreyGeek; Oct 03, 2018, 02:22 PM.


                #8
                The encryption one is just one example. Actually I wasn't shutting down; all of a sudden my apps crashed one after another, which made me switch to tty2. When I saw the errors scrolling by I knew there was no way to rescue the situation, so I hit the power button, which led to the shutdown and the encryption message.

                The other crashes today were, e.g., a sudden freeze of the UI, where switching to tty2 just gave me a blinking cursor.
                It's never like a power outage though; power is always there, and with all crashes the CPUs seem to go to 100% load, as the laptop heats up and the fans start blowing.

                Something seems to happen which makes the file system switch to ro or disappear completely, with the effect that no logs show anything. I checked dmesg after previous crashes and found only standard stuff... and then the boot sequence.

                I wonder if I am dealing with an issue related to running btrfs and encryption together. Maybe I have to simplify things step by step...



                  #9
                  So, you don't have snapshot backups?


                    #10
                    I have snapshots and could roll back, but simply rolling back won't fix my issue, as I might do what I did again?
                    Nothing big happened, like an installation or a setting gone wrong, just email, browser, and a couple of new files here and there.

                    The only significant disk-related change I can think of is that I created a folder for my production VM, switched off CoW and copied the 80 GB file onto my SSD. I haven't even opened it since.
                    I might give it today to see if the crashes continue, and if so move the folder back onto the second drive.



                      #11
                      Originally posted by Thomas00
                      I have snapshots and could roll back, but simply rolling back won't fix my issue, as I might do what I did again?
                      Is there an alien forcing you to repeat previous mistakes? Rolling back may not fix your issues if the snapshots you have were not made before you made the mistakes you say you can't avoid. In that case, a fresh install wouldn't help either.

                      But, assuming at least one set of your previous snapshots is pristine, then how can you lose? You'd still have the pristine snapshots to replace your next bungle up until you finally get your mind wrapped around what btrfs is and how it works. Rome wasn't built overnight, but btrfs is a lot easier to learn than rsync.

                      Originally posted by Thomas00
                      Nothing big happened, like an installation or a setting gone wrong, just email, browser, and a couple of new files here and there.

                      The only significant disk-related change I can think of is that I created a folder for my production VM, switched off CoW and copied the 80 GB file onto my SSD. I haven't even opened it since.
                      I might give it today to see if the crashes continue, and if so move the folder back onto the second drive.
                      If you are going to switch off CoW, then why use btrfs? Also, shutting off CoW as a means of avoiding "btrfs send ... | btrfs receive" won't work because the copy command can't copy the subvolume correctly. It doesn't handle extents. The send & receive commands include a "-f" switch to write the send stream to a regular file (and read it back) for the purposes you had intended. Using "btrfs receive -f somesub.txt <ROOT_FS>" will recreate the subvolume if the target is a btrfs filesystem; otherwise it will fail.


                        #12
                        Originally posted by GreyGeek
                        Is there an alien forcing you to repeat previous mistakes? Rolling back may not fix your issues if the snapshots you have were not made before you made the mistakes you say you can't avoid. In that case, a fresh install wouldn't help either.

                        But, assuming at least one set of your previous snapshots is pristine, then how can you lose? You'd still have the pristine snapshots to replace your next bungle up until you finally get your mind wrapped around what btrfs is and how it works. Rome wasn't built overnight, but btrfs is a lot easier to learn than rsync.

                        If you are going to switch off CoW, then why use btrfs? Also, shutting off CoW as a means of avoiding "btrfs send ... | btrfs receive" won't work because the copy command can't copy the subvolume correctly. It doesn't handle extents. The send & receive commands include a "-f" switch to write the send stream to a regular file (and read it back) for the purposes you had intended. Using "btrfs receive -f somesub.txt <ROOT_FS>" will recreate the subvolume if the target is a btrfs filesystem; otherwise it will fail.
                        No alien at work here, my point was that I don't know if and what mistake I made. And if I don't know what I did wrong how can I avoid doing it again?
                        At the moment I suspect that my problem is more likely related to "hardware/driver" or "encryption + btrfs" or something of that nature, and you are right even a fresh install won't help as I will end up with the same faulty setup over and over again.

                        I am still trying to narrow down what may cause these problems, e.g. no crash at all yesterday, only difference is that the machine was in the docking station for most of the day. Can't find a pattern, really strange...

                        I switched off CoW for the VirtualBox folder only, as apparently this is required to avoid VM corruption in btrfs environments, I also switched my VM from dynamic to fixed size which is the second btrfs specific recommendation I read about.
                        But you raise a very important point here, I didn't see the connection between CopyOnWrite and btrfs send. I currently use "btrfs send -f" as my backup target is not (yet) on btrfs, does that mean the folder which has got CoW switched off will not be part of the backup.txt file?

                        By the way, I sense some frustration in your post here and in the other thread where both of us are active. Please don't get me wrong, you have already helped me a great deal and I am not ignoring any of your comments, at least not knowingly :-). The reason why I am hesitating to roll back/reinstall is in both cases that I'd like to understand what went wrong first. To learn from it and to avoid stepping into the same trap again.



                          #13
                          https://btrfs.wiki.kernel.org/index...._encryption.3F
                          [quote]
                          Btrfs itself does not support native file encryption (yet), and there's nobody actively working on it ...

                          As an alternative, it is possible to use a stacked filesystem (eg. ecryptfs) with btrfs. In this mode, the stacked encryption layer is mounted over a portion of a btrfs volume and transparently applies the security before the data is sent to btrfs. Another similar option is to use the fuse-based filesystem encfs as a encrypting layer on top of btrfs.
                          Note that a stacked encryption layer (especially using fuse) may be slow, and because the encryption happens before btrfs sees the data, btrfs compression won't save space (encrypted data is too scrambled). From the point of view of btrfs, the user is just writing files full of noise.
                          Also keep in mind that if you use partition level encryption and btrfs RAID on top of multiple encrypted partitions, the partition encryption will have to individually encrypt each copy. This may result in somewhat reduced performance compared to a traditional RAID setup where the encryption might be done on top of RAID. Whether the encryption has a significant impact depends on the workload, and note that many newer CPUs have hardware encryption support.
                          [/quote]


                            #14
                            Yes, I saw that, and in fact I went for the first of the three options mentioned in your link:
                            • It can operate on top of an encrypted partition (dm-crypt / LUKS) scheme.
                            • It can be used as a component of a stacked approach (eg. ecryptfs) where a layer above the filesystem transparently provides the encryption.
                            • It can natively attempt to encrypt file data and associated information such as the file name.
                            Loosely following the advice here: https://albertodonato.net/blog/posts/full-disk-encryption-with-btrfs-on-ubuntu-xenial.html
                            He's a Canonical engineer, so I thought he would know what he's doing :-)



                              #15
                              Thought I'd give you an update on my quest for a stable system. I did not make much progress, I am afraid, and slowly but steadily frustration is setting in.

                              Over the last couple of weeks I have tried to peel the onion and simplify layer by layer to see if my crashes go away or if at least I can get to a reproducible behaviour. Unfortunately without success. If anything the crashes occur more frequently now.
                              My laptop will run for half a day during which I purposely run backup after backup, zip large files, start and close VMs, basically put as much stress on the system as I can while still doing some work in parallel. Nothing will crash the system but when I e.g. open up the file menu in Kate the system will crash with the menu half way painted.
                              It will crash after running without an issue for 2 days, and it will crash maybe 1 minute after logging in after a reboot. Totally unpredictable...

                              I saw that someone in a different thread had issues with a NVM2 drive and btrfs, I meanwhile removed btrfs.
                              Elsewhere I read that switching the bios to AHCI may help, mine was on AHCI.
                              Another hint was checking the power-saving flags in the BIOS, which I couldn't find in mine, and I had plenty of crashes while I was active anyway.

                              As far as potentially critical, kernel level services are concerned:
                              - Virtualbox, I am using VB since day one on Linux but I can't see a correlation between VB launched and the crashes
                              - nVidia drivers, not installed any longer
                              That's it, the rest is all higher level stuff

                              With all the installs and changes I made I think I can conclude
                              - It's not btrfs, crashes happen on ext4 as well
                              - It's not encryption, I am running without encryption and had crashes
                              - It's not nVidia, had crashes with native Kubuntu drivers as well
                              - It's not related to system temperature, had crashes right after boot and can happily work when most CPUs are at 100% and with active file transfers in and out
                              - It's not related to power management, had crashes with power plugged in and on battery
                              - It's not related to the apps I use, had crashes before managing to logon, in the middle of almost every application I use day in day out, as well as walking up to a crashed system after leaving it alone for an hour.

                              So what's left?
                              - NVMe SSD, driver or hardware? I spent hours looking for updated firmware, but I now know that because it's a Samsung OEM drive (SM961 NVMe SAMSUNG 512GB) the update needs to come from Dell, and there is none.
                              - Kernel / Driver incompatibility with my Precision 7510 hardware? Not impossible, but things were stable for two years until 2+ months ago

                              Sometimes I can still get to my Konsole when Plasma freezes, at least for a while. I can't run sudo, though, as it gives me an I/O error, but when I run mount I can see that / and /home are set to ro and other mounts like @ in /mnt have disappeared. My machine has a second drive mounted at /media/data; this mount also disappears. Other mounts like loop? and

                              I have now started to always open another session in tty2 and become root with sudo -i right away, so I can do some checks whenever the system freezes again.
                              Is there anything I could check? The log files on disk are no good as with the disks switching to ro no more logs are recorded. Maybe there is something kept in memory which I could check? Something which would indicate why the filesystems went ro?
                              Is there a debug mode for the kernel or systemd which I could turn on?
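One hedged idea for the root shell kept open in tty2: the kernel ring buffer lives in RAM, so `dmesg` should still be readable after the disks have gone read-only, and its output can be photographed or saved to a device that is still writable (for example a USB stick) before the hard reset. The helper below lists which mounts have flipped to `ro`. It is a minimal sketch that assumes /proc/mounts-style input, and the name `ro_mounts` is made up for illustration.

```shell
# Hypothetical helper: list read-only mounts from /proc/mounts-style input,
# ignoring squashfs (snap images are always mounted ro).
# Fields per line: device mountpoint fstype options dump pass
ro_mounts() {
    awk '
        $3 == "squashfs" { next }
        {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") {     # exact match, so errors=remount-ro is not counted
                    print $2 " (" $3 ")"
                    break
                }
        }'
}

# Usage from the root shell after a freeze:
#   ro_mounts < /proc/mounts
#   dmesg | tail -n 200     # most recent kernel messages, read from memory
```

If `/` or `/home` shows up in that list, the kernel messages immediately before the remount are the ones most likely to name the failing device or driver.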

                              Things I was hoping to avoid but am considering next
                              - Resizing the HDD partition so I can put another Kubuntu instance on it and work entirely from the HDD, with the NVMe drive disabled
                              - Disabling the NVMe drive and reinstalling the OS on a USB-attached SSD
                              - Downgrade to Kubuntu 17.10 and take things up with Dell
                              - Install Windows

                              Sorry for venting my frustration here, but sometimes writing things down helps clarify things for oneself...

                              Code:
                              08.11. After early morning crash, no issues for 12 hours and counting...
                              08.11. Undocked, no power, fresh start after an all-night system check, machine ran for 5 minutes that day, started to work inside VM, nothing special. After reboot the update indicator was shown
                              07.11. Ran BIOS system check again all night, no issues found
                              07.11. Undocked, crash during backup to homeserver, nothing special. After reboot the discovery update indicator was shown
                              07.11. Undocked, crash while machine was unattended. 2 more crashes after that 
                              06.11. New install, ext4 instead of btrfs
                              06.11. Switched off Legacy boot in BIOS
                              05.11. Fresh install without encrypted root partition / after series of crashes, no more nVidia drivers from here on
                              05.11. Undocked, crash after fresh install was completed. Crash right after @ was copied to backup drive 
                              
                              04.11. Undocked, nothing special, crash when Slack message popped up, followed by 3 crashes in a row. All btrfs / crypts mount points gone 
                              03.11. added swap partition
                              29.10. Undocked, crash a few minutes after coming out of hibernation, when changing audio volume
                              29.10. BIOS upgrade
                              29.10. Second crash right after reboot, while updating this document -> Updated firmware afterwards
                              
                              24.10. Undocked, one more crash, during backup to homeserver
                              24.10. Undocked, crashed while left alone. Apps missing/dying as it came back out of powersave mode
                              08.10. First crash in a while, ran all day, crashed after suspend/undock. Also saw an error message from SquashFS? 
                              05.10. no issues all day, crash right after restarting after undocking
                              04.10. no crash, docked all day
                              03.10. 4 crashes, undocked all day. One crash even before able to type in credentials, one when putting in a USB drive
                              ReadOnly mount and strange ls -al after a crash:
                              [Attached images: ro.jpg (123.4 KB), ls.png (947.8 KB)]
