Should I struggle through constant crashes to get my 7900 GRE with 16 GB of VRAM working, possibly through the headache of ONNX? Can anyone report their own success or offer advice? AMD on Linux is generally lovely; SD with AMD on Linux, not so much. It was much better with my RTX 2080 on Linux, but gaming was horrible with the NVIDIA drivers. I feel I could do more with the 16 GB AMD card if stability weren't so bad. I currently have both cards running, to the horror of my PSU. A1111 does NOT want to see the NVIDIA card, only the AMD one. Something about the version of PyTorch? More work to be done there.
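One thing that can help in a dual-card setup is pinning each venv to a single card before PyTorch is imported, so A1111 only ever sees the GPU you intend. A minimal sketch, assuming the standard CUDA/ROCm visibility environment variables; the `pin_gpu` helper itself is hypothetical:

```python
# Sketch: restrict GPU visibility per process/venv before torch is imported.
# CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES are the real env vars honored
# by the CUDA and ROCm runtimes respectively; pin_gpu is a made-up helper.
import os

def pin_gpu(vendor, index=0):
    """Make only one card visible for the given vendor ('nvidia' or 'amd')."""
    var = {"nvidia": "CUDA_VISIBLE_DEVICES", "amd": "HIP_VISIBLE_DEVICES"}[vendor]
    os.environ[var] = str(index)
    return var

# e.g. in the NVIDIA venv's launch wrapper, before importing torch:
print(pin_gpu("nvidia", 0))
```

The same effect can be had by exporting the variable in each venv's activation script, which keeps the webui code untouched.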
- Having a much better time back on Cinnamon default instead of Wayland. Oops!
** It heard me. Crashed again on an x/y plot, but because I was off Wayland I was able to see the terminal dump: amdgpu thermal overload! Shutdown initiated! That'll do it! Finally something easy to fix. Wonder why thermal throttling isn't kicking in to control the runaway? Will stress it once more and clock the temps this time.
Temps were exceeding 115 °C, phew! No idea why the default amdgpu driver had no fan control going, but the fans are ripping like they should now. Monitoring temps has restored system stability. Using multiple AMD/NVIDIA dedicated venv folders and careful driver choice/installation were the keys to multi-GPU success.
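For anyone who wants to watch temps the same way: a minimal sketch that polls the amdgpu sensor through sysfs. The `card0` path is an assumption (the card index and hwmon layout vary per system), and `millideg_to_c` is a made-up helper:

```python
# Sketch: read the amdgpu temperature sensor from sysfs.
# Assumption: the AMD card is card0; adjust the glob for your system.
import glob

def millideg_to_c(raw):
    """sysfs temp*_input files report millidegrees Celsius."""
    return raw / 1000.0

for path in glob.glob("/sys/class/drm/card0/device/hwmon/hwmon*/temp1_input"):
    with open(path) as f:
        print(f"{millideg_to_c(int(f.read()))} C")
```

Running that in a loop (or just `watch -n1 sensors`) while stressing the card makes a thermal runaway obvious long before the driver hits its shutdown threshold.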
I have run Stable Diffusion models successfully on my ancient Vega 64 with 8 GB of VRAM. However, it does occasionally run out of memory and crash when I run models that want all 8 gigs. I have to run it without a proper DE (Openbox, Falkon browser with one tab only) if I don't want it to crash frequently.
How bad are your crashes? Mine will either freeze the system entirely or crash the current lightdm session, sometimes recovering, sometimes freezing anyway; it needs a power cycle to rescue it. What is the DE you speak of? Openbox?
the reset situation may improve in the not too distant future: https://www.phoronix.com/news/AMDGPU-Per-Ring-Resets
That seems strange. Perhaps you should stress-test your GPU/system to see if it’s a hardware problem.
I had that concern as well, with it being a new card. It performs fine in gaming as well as in every glmark benchmark so far. I have it chalked up to AMD support being in experimental status on Linux/SD. Any other stress tests you'd recommend while I'm in the return window!? lol
Docker seemed the easiest
I might take the docker route for the ease of troubleshooting if nothing else. So very sick of hard system freezes/crashes while kludging through the troubleshooting process. Any words of wisdom?
If you don’t understand how models use host caching, start there.
Outside of that, you’re asking people to simplify everything into a quick answer, and there is none.
ONNX is the "universal" standard. Make sure you didn't accidentally convert the input model into something else, but more importantly, make sure that when you run it and it automatically converts, the work is actually being done on the GPU. ONNX Runtime defaults to the CPU.
I started reading into the ONNX business here: https://rocm.blogs.amd.com/artificial-intelligence/stable-diffusion-onnx-runtime/README.html It didn't take long to see that was beyond me. Has anyone distilled an easy-to-use model converter/conversion process? One I saw required an HF token for the process, yeesh.
No.
As I said, you're trying to distill people's profession into an easy-to-digest guide about "make it work". Nothing like that exists.
Same way you can’t just get a job doing “doctor stuff”, or “build junk”.
Since only one of us is feeling helpful, here is a 6 minute video for the rest of us to enjoy https://www.youtube.com/watch?v=lRBsmnBE9ZA
Well, I finally got the NVIDIA card working to some extent. On the recommended driver it only works in lowvram; medvram maxes out VRAM too easily on this driver/CUDA version for whatever reason. Does anyone know the current best NVIDIA driver for SD on Linux? Perhaps 470, the other one provided by the LM Driver Manager…?
SD works fine for me with: Driver Version: 525.147.05 CUDA Version: 12.0
I use this docker container: https://github.com/AbdBarho/stable-diffusion-webui-docker
You will also need to install the nvidia container toolkit if you use docker containers: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
Thank you!! I may rely on this heavily. Too many different drivers to try willy-nilly. I am in the process of attempting with this guide/driver for now. Will report back with my luck or misfortunes https://hub.tcno.co/ai/stable-diffusion/automatic1111-fast/