This may well have been stated already, but with issues like the ones you are describing, especially in a laptop, the things to suspect are: dirty/dusty air intake/exhaust ports; faulty RAM; an overheating CPU/GPU (heatsinks separating or losing contact).
Frequent OS crashes - looking for troubleshooting ideas
-
Windows no longer obstructs my view.
Using Kubuntu Linux since March 23, 2007.
"It is a capital mistake to theorize before one has data." - Sherlock Holmes
-
Originally posted by Snowhog: May/likely have already been stated, but, issues like you are describing, especially in a laptop, the things to suspect are: dirty/dusty air intake/exhaust ports; faulty RAM; overheating CPU/GPU (heatsinks separating/losing contact).
CPU is at 80° right now
Since writing the message above I have had 3 more crashes, after the system had run all day:
08.11. Next crash 2 minutes after reboot. Not able to switch to tty2. The drive must have disappeared again right after the boot loader ran: it complained about a missing cryptodrive, yet the code showing that message must itself have been loaded from the same drive.
08.11. Next crash 2 minutes after reboot. Was thrown over to tty2, which was already frozen; the 80GB backup was running again.
08.11. Evening; the system ran all day after the initial crash this morning. The crash happened during an rsync of an 80GB VM to the second disk. No longer able to switch to tty2.
...
-
Originally posted by SpecialEd: How about a swap issue, like none or not enough?
How OLD is this laptop?
-
How many errors does your SSD report?
Code:
$ sudo smartctl -a /dev/sda
[sudo] password for jerry:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.18.0-11-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: *****************GB
Serial Number: **************
LU WWN Device Id: 5 002538 e4032b7df
Firmware Version: RVT01B6Q
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: Unknown(0x09fc), ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Nov 8 12:19:17 2018 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 0) seconds.
Offline data collection capabilities: (0x53) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 85) minutes.
SCT capabilities: (0x003d) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  -           0
  9 Power_On_Hours          0x0032 099   099   000    Old_age  Always  -           1161
 12 Power_Cycle_Count       0x0032 099   099   000    Old_age  Always  -           155
177 Wear_Leveling_Count     0x0013 099   099   000    Pre-fail Always  -           4
179 Used_Rsvd_Blk_Cnt_Tot   0x0013 100   100   010    Pre-fail Always  -           0
181 Program_Fail_Cnt_Total  0x0032 100   100   010    Old_age  Always  -           0
182 Erase_Fail_Count_Total  0x0032 100   100   010    Old_age  Always  -           0
183 Runtime_Bad_Block       0x0013 100   100   010    Pre-fail Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0032 074   047   000    Old_age  Always  -           26
195 Hardware_ECC_Recovered  0x001a 200   200   000    Old_age  Always  -           0
199 UDMA_CRC_Error_Count    0x003e 100   100   000    Old_age  Always  -           0
235 Unknown_Attribute       0x0012 099   099   000    Old_age  Always  -           6
241 Total_LBAs_Written      0x0032 099   099   000    Old_age  Always  -           4625517805

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
https://www.virten.net/2016/12/ssd-t...ten-calculator
Last edited by GreyGeek; Nov 08, 2018, 12:26 PM.
"A nation that is afraid to let its people judge the truth and falsehood in an open market is a nation that is afraid of its people."
– John F. Kennedy, February 26, 1962.
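For anyone who wants to answer the "how many errors" question without reading the whole table by eye, here is a minimal Python sketch. It assumes the ten-column attribute layout shown in the smartctl output above; the watch list `ERROR_ATTRIBUTES` is a hypothetical selection made for illustration, not something smartctl defines, and the byte conversion assumes one LBA equals the 512-byte logical sector size the drive reports.

```python
# Rows copied from the attribute table above; columns follow smartctl's header:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033 100 100 010 Pre-fail Always - 0
177 Wear_Leveling_Count     0x0013 099 099 000 Pre-fail Always - 4
187 Reported_Uncorrect      0x0032 100 100 000 Old_age  Always - 0
199 UDMA_CRC_Error_Count    0x003e 100 100 000 Old_age  Always - 0
241 Total_LBAs_Written      0x0032 099 099 000 Old_age  Always - 4625517805
"""

# Hypothetical watch list: counters where a non-zero raw value deserves a second look.
ERROR_ATTRIBUTES = {"Reallocated_Sector_Ct", "Reported_Uncorrect", "UDMA_CRC_Error_Count"}


def parse_attributes(text):
    """Map attribute name -> raw value for rows with the ten expected columns."""
    raw = {}
    for line in text.splitlines():
        cols = line.split()
        # Skip headers and any row whose raw value is not a plain integer.
        if len(cols) == 10 and cols[0].isdigit() and cols[9].isdigit():
            raw[cols[1]] = int(cols[9])
    return raw


attrs = parse_attributes(SAMPLE)
suspicious = {k: v for k, v in attrs.items() if k in ERROR_ATTRIBUTES and v > 0}

# Assumption: one LBA equals the 512-byte logical sector size reported above.
bytes_written = attrs["Total_LBAs_Written"] * 512

print(suspicious)
print(f"{bytes_written / 1e12:.2f} TB written")
```

For the drive above, the watch list comes back empty, which matches the "No Errors Logged" line in the SMART error log.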
-
Guys, I really appreciate your help, feels good to see I am not alone :-)
Opened the bugger, and yes, there was quite a lot of dust in there. The temperature looks lower now at 35°; I can't remember seeing anything below 40° before, and the machine is quieter too. I'll keep a close eye on this.
It's these slow, creeping changes that are easy to miss!
Just pulled out the invoice: I purchased the machine in March last year.
smartctl doesn't seem to work for NVMe drives:
thomas@hermes:~$ sudo smartctl -a /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.18.0-10-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SM961 NVMe SAMSUNG 512GB
Serial Number: S34YNX0HC01903
Firmware Version: CXA74D0Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 512.110.190.592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512.110.190.592 [512 GB]
Namespace 1 Utilization: 304.360.701.952 [304 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Mon Nov 5 19:09:13 2018 CET
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning Comp. Temp. Threshold: 70 Celsius
Critical Comp. Temp. Threshold: 73 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.80W - - 0 0 0 0 0 0
1 + 4.90W - - 1 1 1 1 0 0
2 + 3.20W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1500
4 - 0.0050W - - 4 4 4 4 2200 6000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x2002
thomas@hermes:~$
thomas@hermes:~$ sudo nvme -list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S34YNX0HC01903 SM961 NVMe SAMSUNG 512GB 1 344,74 GB / 512,11 GB 512 B + 0 B CXA74D0Q
thomas@hermes:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 32 C
available_spare : 100%
available_spare_threshold : 50%
percentage_used : 1%
data_units_read : 180.612.670
data_units_written : 39.295.503
host_read_commands : 922.103.392
host_write_commands : 427.185.726
controller_busy_time : 2.012
power_cycles : 3.376
power_on_hours : 1.451
unsafe_shutdowns : 250
media_errors : 0
num_err_log_entries : 406
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 32 C
Temperature Sensor 2 : 36 C
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
thomas@hermes:~$
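The smart-log output above can be screened mechanically. The following Python sketch is only an illustration: it assumes the `name : value` layout shown above, treats "." and "," inside numbers as thousands separators (as this locale appears to print them), and uses a hypothetical watch list `WATCH` chosen for this example. A non-zero counter is a prompt to look closer, not proof of a fault.

```python
import re

# Lines copied from the smart-log output above.
SAMPLE = """\
critical_warning : 0
temperature : 32 C
available_spare : 100%
percentage_used : 1%
power_cycles : 3.376
power_on_hours : 1.451
unsafe_shutdowns : 250
media_errors : 0
num_err_log_entries : 406
"""

# Hypothetical watch list of counters worth a second look when non-zero.
WATCH = ("critical_warning", "media_errors", "unsafe_shutdowns", "num_err_log_entries")


def parse_smart_log(text):
    """Return {field: int} for every 'name : value' line whose value starts with a number."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if not sep:
            continue
        match = re.match(r"\s*(\d[\d.,]*)", value)
        if match:
            # Assumption: '.' and ',' are thousands separators, never decimal points.
            fields[name.strip()] = int(re.sub(r"[.,]", "", match.group(1)))
    return fields


def nonzero_counters(fields, names=WATCH):
    """Keep only the watched counters that are greater than zero."""
    return {n: fields[n] for n in names if fields.get(n, 0) > 0}


stats = parse_smart_log(SAMPLE)
flagged = nonzero_counters(stats)
print(flagged)
```

With the values above this flags `unsafe_shutdowns` and `num_err_log_entries`, while `media_errors` and `critical_warning` stay at zero.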
-
Dust in desktop units isn't as much of a problem (not that it isn't one, just less of one) because there is so much air space. In a laptop, air space is extremely limited (it has to be), so obstructions (dust/dirt/lint/etc.) blocking the intake/exhaust ports can have a significant impact on performance and operation. If they are user accessible, the heat sinks are another thing to inspect. The thermal paste between them and the CPU/GPU sometimes loses contact after many on/off cycles (expansion and contraction from heating up and cooling down). Often, either inferior thermal paste is used, or not enough of it is applied to make full contact between the heat sink and the surface of the CPU/GPU, leaving spots that heat up more than the rest. Bottom line: laptops need regular, proper maintenance to get the longest use out of them.
-
Thanks Snowhog, I knew about this... in theory. I started my laptop journey with this massive Compaq toaster https://en.wikipedia.org/wiki/Compaq_Portable_386 and have had laptops ever since, but for some reason I never ran into anything like this. Maybe I replaced the machines more frequently when I was traveling globally, and now that I work from home they may collect more dust. In any case, the backup is running with an 80GB zip file being created in parallel and the CPU is showing ~60°; that's a good 15° less than before...
I really hope that was it...
-
I would look into these:
unsafe_shutdowns : 250
media_errors : 0
num_err_log_entries : 406
-
No idea :-)
I would expect the NVMe driver to manage this, but in my case that may not have been possible. So far no crash since I blew out the innards of my machine... maybe, hopefully, this was it.
I will in any case check the counters a few times to see whether they go up.
num_err_log_entries went up to 410, the rest remained the same.
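Checking "whether the number goes up" amounts to comparing two snapshots of the counters. A small Python sketch of that comparison, using the three values quoted in this thread; the helper name `changed_counters` is made up for illustration.

```python
def changed_counters(before, after):
    """Return {name: (old, new)} for counters whose value differs between two snapshots."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


# Values from the two readings reported in this thread.
before = {"unsafe_shutdowns": 250, "media_errors": 0, "num_err_log_entries": 406}
after = {"unsafe_shutdowns": 250, "media_errors": 0, "num_err_log_entries": 410}

delta = changed_counters(before, after)
print(delta)  # {'num_err_log_entries': (406, 410)}
```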
-
These are the two kinds of entries, repeated many times, that I get from sudo nvme error-log /dev/nvme0:
Entry[60]
...
error_count : 350
sqid : 0
cmdid : 0x3b
status_field : 0x4212(INVALID_LOG_PAGE: The log page indicated is invalid. This error condition is also returned if a reserved log page is requested)
parm_err_loc : 0x28
lba : 0
nsid : 0xffffffff
vs : 0
cs : 0
...
Entry[61]
...
error_count : 349
sqid : 0
cmdid : 0x35
status_field : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0x2c
lba : 0
nsid : 0
vs : 0
cs : 0
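With several hundred entries in the error log, a tally by status name shows at a glance which kinds dominate. A minimal Python sketch, assuming every entry carries a `status_field : 0x....(NAME: ...)` line like the two shown above; the sample contains only those two entries.

```python
import re
from collections import Counter

# The two entries quoted above, reduced to the lines the tally needs.
SAMPLE = """\
Entry[60]
error_count : 350
status_field : 0x4212(INVALID_LOG_PAGE: The log page indicated is invalid. This error condition is also returned if a reserved log page is requested)
Entry[61]
error_count : 349
status_field : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
"""

# Captures the hex status code and the symbolic name that follows the opening parenthesis.
STATUS_RE = re.compile(r"status_field\s*:\s*(0x[0-9a-fA-F]+)\((\w+)")


def tally_status(text):
    """Count error-log entries per symbolic status name."""
    return Counter(name for _code, name in STATUS_RE.findall(text))


tally = tally_status(SAMPLE)
for name, count in tally.most_common():
    print(f"{name}: {count}")
```

To run this against a full log, the complete output of the error-log command would be passed to `tally_status` in place of `SAMPLE`.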
-
It looks to me as if the firmware on your NVMe drive does not match the structure of the drive. IOW, the NVMe drive has a buggy interface. Time to take this to Dell or the SSD manufacturer.
-
Originally posted by GreyGeek: It looks to me that the firmware on your NVME is not matching the structure of the NVME. IOW, the NVME has a buggy interface. Time to take this to Dell or the SSD manufacturer.
To rule out the NVMe drive I installed Kubuntu on a USB-attached HDD; thanks to btrfs it was very easy to create an exact copy of my NVMe system. It was also a good test of my backup routine.
Unfortunately the USB-based system crashed as well. Same story: root becomes read-only (ro) and the OS falls over.
Now, I can't completely switch off the NVMe drive, as the system boots from the UEFI partition on the NVMe before it hands over to the external HDD, but after the system has booted there should be no access to the NVMe drive. The NVMe driver is still in memory though, I guess?
I think that means the NVMe drive is off the radar now.
I also installed Win10 on the second internal drive and tried to put some stress on it; I haven't managed to make it fall over so far.
That means I am probably back to a software issue rather than hardware? Maybe I need to abandon my backup and try a minimal install next, adding must-have services one after another.
I have also opened a call with Dell now; we will see what they come up with.
By the way, the system is significantly cooler now that I blew the dust out, so that was definitely an exercise worth doing.
Edit: Almost 6 months after I started this thread I finally have the solution.
It turned out to be a hardware problem with the Samsung NVMe drive, as described here:
- https://askubuntu.com/questions/905710/ext4-fs-error-after-ubuntu-17-04-upgrade#comment1422199_905710
- https://bugs.launchpad.net/ubuntu/+s...x/+bug/1678184
None of the solutions described there worked, but at least it was clear what the problem was. Today I received a new (Toshiba) drive from Dell.
Boy, what a journey, but I learned a lot along the way.