Power Management: An nVidia Story
This is a short chronicle of the "suffering" I experienced on my formerly Ubuntu, but now PopOS, System76 laptop. As far as I can tell, the problem was identical to the ones reported by other Linux users on the nVidia forums here and in similar threads.
TL;DR
If you have an nVidia card and resuming from sleep leaves you with a black, entirely unresponsive screen, try adding mem_sleep_default=deep to your kernel parameters. On PopOS this should be as simple as sudo kernelstub -a mem_sleep_default=deep, but the procedure will differ on other distributions.
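After rebooting, it's worth confirming that the parameter actually reached the running kernel. The following is a minimal sketch, assuming a Linux system that exposes the kernel command line at /proc/cmdline; the function name has_deep_sleep_param is my own invention and not part of any tool.

```shell
# Hypothetical helper: succeeds if the kernel command line contains
# mem_sleep_default=deep as a whole parameter. Reads /proc/cmdline
# unless another file is passed as the first argument.
has_deep_sleep_param() {
    cmdline_file="${1:-/proc/cmdline}"
    # Put each parameter on its own line, then look for an exact match.
    tr -s '[:space:]' '\n' < "$cmdline_file" | grep -qxF 'mem_sleep_default=deep'
}

if has_deep_sleep_param; then
    echo "mem_sleep_default=deep is active"
else
    echo "mem_sleep_default=deep is missing from the kernel command line"
fi
```

If it reports the parameter as missing, the bootloader configuration probably didn't pick up the change.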
Background
To summarize, whenever I closed the lid on my laptop or manually suspended it (whatever the mechanism), reopening the laptop or pressing the power button to resume would usually greet me with a backlit black screen and no evidence, according to the hardware activity lights, that any of the gnomes that keep the computer operating were home. Every once in a while the system would resume, but it was incredibly hit or miss. When the system was unresponsive, the only way to reset it was a hard power-off by holding the power button.
Experience
Needless to say, this was very frustrating. To some extent I stopped using my laptop because the very first thing I'd be doing when grabbing it is turning it off and on again. And while this was not very time consuming it was always aggravating. I would periodically resolve to try and fix this issue, scouring the above forums and askubuntu posts for insights into the issue. No matter what changes I made I seemed to be no closer to a solution. Unfortunately too, I have nowhere near the technical chops of this lad to solve this, nor a system configuration that would allow it – this might be a perfect case for something like NixOS though.
A few months ago I went so far as to decide "fuck it" and backed up everything so I could switch from Ubuntu 22.04 to PopOS 22.04, figuring that System76's distribution would have a better chance of working without hiccups on a System76 laptop. Alas, I had no better luck with this than before, but it at least led me to the understanding that this was not likely a problem of packages and distributions. Rather, at some point it clicked for me that there must be some elusive commingling of the nVidia driver, the kernel, and the configuration of them both. After actual years of dealing with this problem, the pieces started to fall into place mentally, and not too long after I found some insight in what I recall to be a launchpad.net bug report. Namely, there was a reference to documentation stored in /usr/share/doc/nvidia-driver-<version>/html/, which led me to the power management documentation (/usr/share/doc/nvidia-driver-515/html/powermanagement.html on my system, to be specific). Way down at the bottom they speak of a known issue where the system may not resume properly and, quite fortunately, a workaround.
What exactly does it say?
On some systems, where the default suspend mode is "s2idle", the system may not resume properly due to a known timing issue in the kernel. The suspend mode can be verified by reading the contents of the file /sys/power/mem_sleep. The following upstream kernel changes have been proposed to fix the issue:
https://lore.kernel.org/linux-pci/20190927090202.1468-1-drake@endlessm.com/
https://lore.kernel.org/linux-pci/20190821124519.71594-1-mika.westerberg@linux.intel.com/
In the interim, the default suspend mode on the affected systems should be set to "deep" using the kernel command line parameter "mem_sleep_default" -
mem_sleep_default=deep
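The check mentioned in the quoted documentation can be sketched as follows. The kernel lists the available suspend modes in /sys/power/mem_sleep and wraps the currently selected one in square brackets (for example, s2idle [deep]). The function name active_sleep_mode is hypothetical, and the snippet assumes the file follows that bracketed format.

```shell
# Hypothetical helper: prints the currently selected suspend mode, i.e. the
# entry wrapped in square brackets. Reads /sys/power/mem_sleep unless another
# file is passed as the first argument.
active_sleep_mode() {
    mem_sleep_file="${1:-/sys/power/mem_sleep}"
    # Keep only the text between the square brackets.
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$mem_sleep_file"
}

# Only query the real file where it exists (it is absent on some systems).
if [ -r /sys/power/mem_sleep ]; then
    echo "Current suspend mode: $(active_sleep_mode)"
fi
```

If this prints s2idle, the system is in the situation the documentation describes. If it prints deep after a reboot, the kernel parameter took effect.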
Resolution
After doing this, my woes were finally over. Or at least I hope they are. While my system is still relatively clean after the migration to PopOS, it's not exactly a paragon of idempotency and there are still various changes I made hoping to fix this issue lying around that may have contributed to the overall resolution. Should anyone find this post and the above workaround work for them, please email me at rue@ruethedev.com and let me know. Thanks for reading thus far and happy hacking!
Update (August 11, 2022)
For whatever reason this doesn't appear to be a reliable fix. After updating packages the issue returned, but it seems as if the only requirement to re-fix it is that I run sudo update-initramfs -c -k all to rebuild the initramfs. I find that strange as I would expect that the package manager would update the initramfs more or less the same way I am. I will have to continue to monitor this issue and see if this is a permanent workaround or simply a red herring.