VM with PCIe passthrough is immediately paused

« on: 27. 02. 2025, 21:34:22 »
Good evening,


I have a bit of a curiosity regarding PCIe passthrough (KVM, libvirt).


When I try to start a VM that has a Broadcom / LSI MegaRAID SAS 2208 passed through via PCI passthrough, the VM does not start: it goes straight into the paused state and dmesg spits out errors from pcieport.

Code:
[ 4939.169052] pcieport 0000:00:1b.4: Intel SPT PCH root port ACS workaround enabled
[ 4939.169294] pcieport 0000:00:1b.4: AER: Multiple Correctable error message received from 0000:03:00.0
[ 4939.169307] pcieport 0000:00:1b.4: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
[ 4939.169308] pcieport 0000:00:1b.4:   device [8086:a2eb] error status/mask=00001000/00002000
[ 4939.169310] pcieport 0000:00:1b.4:    [12] Timeout
[ 4939.169324] vfio-pci 0000:03:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
[ 4939.169325] vfio-pci 0000:03:00.0:   device [1000:005b] error status/mask=00000001/00002000
[ 4939.169327] vfio-pci 0000:03:00.0:    [ 0] RxErr                  (First)
[ 4939.169328] vfio-pci 0000:03:00.0: AER:   Error of this Agent is reported first
[ 4939.318287] pcieport 0000:00:1b.4: Intel SPT PCH root port ACS workaround enabled
[ 4939.330493] vfio-pci 0000:03:00.0: enabling device (0000 -> 0003)
[ 4941.548543] pcieport 0000:00:1b.4: AER: Uncorrectable (Non-Fatal) error message received from 0000:00:1b.4
[ 4941.548551] pcieport 0000:00:1b.4: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Completer ID)
[ 4941.548554] pcieport 0000:00:1b.4:   device [8086:a2eb] error status/mask=00008000/00010000
[ 4941.548556] pcieport 0000:00:1b.4:    [15] CmpltAbrt              (First)
[ 4941.548558] pcieport 0000:00:1b.4: AER:   TLP Header: 00000000 00000000 00000000 00000000
[ 4941.548610] pcieport 0000:00:1b.4: AER: device recovery successful

1000:005b is the ID of the controller.
8086:a2eb is the ID of the PCH root port on the board (00:1b.4).
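A minimal Python sketch for reading the status/mask pairs in the log. It assumes, as the output suggests, that the kernel lists the bits that are set in the status register and not set in the mask; the bit table is only a subset of the AER registers, using the short labels the kernel prints. `decode_aer` is a hypothetical helper, not a kernel or library API.

```python
# Hypothetical helper: decode AER status/mask pairs as printed in dmesg.
# Only a subset of bits is listed; labels follow the kernel's short names.
CORRECTABLE = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
}
UNCORRECTABLE = {
    4: "DLP",           # Data Link Protocol Error
    5: "SDES",          # Surprise Down Error
    12: "TLP",          # Poisoned TLP
    14: "CmpltTO",      # Completion Timeout
    15: "CmpltAbrt",    # Completer Abort
    16: "UnxCmplt",     # Unexpected Completion
    18: "MalfTLP",      # Malformed TLP
    20: "UnsupReq",     # Unsupported Request
}


def decode_aer(severity: str, status: int, mask: int) -> list:
    """Return labels for bits set in `status` that are not masked by `mask`."""
    table = CORRECTABLE if severity == "correctable" else UNCORRECTABLE
    active = status & ~mask
    return [f"[{bit:2d}] {table.get(bit, 'Unknown')}"
            for bit in range(32) if active >> bit & 1]


if __name__ == "__main__":
    # Values copied from the dmesg output in the post.
    print(decode_aer("correctable", 0x00001000, 0x00002000))    # root port 00:1b.4
    print(decode_aer("correctable", 0x00000001, 0x00002000))    # controller 03:00.0
    print(decode_aer("uncorrectable", 0x00008000, 0x00010000))  # root port 00:1b.4
```

Decoded this way, the correctable reports are a Replay Timer Timeout on root port 00:1b.4 and a Receiver Error on the controller, while the uncorrectable one is a Completer Abort reported by the root port itself.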


Notably, if I wait long enough (say 5-10 minutes) and try to start the AlmaLinux VM again, it goes through 99% of the time.
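Since a later retry usually works, the start can be scripted as a stopgap. A minimal sketch, assuming the guest is managed by libvirt and that `virsh domstate` reports `paused` in the failing case; the domain name `almalinux8`, the 5 s settle time and the 300 s wait are hypothetical placeholders. This only automates the workaround; it does not address the cause.

```python
import subprocess
import time
from typing import Callable


def virsh(*args: str) -> str:
    """Run `virsh <args>` and return its stripped stdout; raises on a non-zero exit."""
    result = subprocess.run(["virsh", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def start_with_retry(domain: str, attempts: int = 3, wait_s: float = 300.0,
                     settle_s: float = 5.0,
                     run: Callable[..., str] = virsh,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Hypothetical helper: start `domain`; if it is not running, destroy it, wait, retry."""
    for attempt in range(1, attempts + 1):
        run("start", domain)
        sleep(settle_s)                      # give QEMU a moment to hit the error
        if run("domstate", domain) == "running":
            return True
        run("destroy", domain)               # force off the paused guest
        if attempt < attempts:
            sleep(wait_s)                    # the post reports that waiting helps
    return False

# Usage (hypothetical domain name):
#   start_with_retry("almalinux8")
```

The `run` and `sleep` parameters are only there so the retry logic can be exercised without a real libvirt host.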



Hardware:

MSI Z270 PC MATE / i7-8700 / 64 GB
The board's BIOS is modded to support Coffee Lake CPUs (one of the required patches concerns PCIe, but only the lanes from the CPU, not those from the chipset).
CSM is disabled; the machine runs UEFI-only.

The host OS is Kubuntu; kernel 6.8.0-54-generic.

The ACS patch for IOMMU groups is active.



I run two VMs there:

Windows (via PCI passthrough: the GTX 1070 GPU and its audio function, a dual-port NIC, the SATA controller; plus the Intel iGPU via MDEV)
AlmaLinux 8 (via PCI passthrough: the Dell PERC H710 controller, i.e. the SAS 2208 above)


In the past, both VMs worked fine on the first try.


Starting the AlmaLinux VM without the passed-through controller always works (which, given the log above, is not very surprising).
The AlmaLinux VM is arch='x86_64' machine='pc-q35-8.2' and there is nothing obscure in its XML, so I don't think it's necessary to post it here (?).


lspci -tv
Code:
-[0000:00]-+-00.0  Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers
           +-01.0-[01]--+-00.0  NVIDIA Corporation GP104 [GeForce GTX 1070]                  >>> PCI passthrough Windows
           |            \-00.1  NVIDIA Corporation GP104 High Definition Audio Controller  >>> PCI passthrough Windows
           +-02.0  Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630]                    >>> MDEV passthrough Windows
           +-08.0  Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
           +-14.0  Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller
           +-14.2  Intel Corporation 200 Series PCH Thermal Subsystem
           +-16.0  Intel Corporation 200 Series PCH CSME HECI #1
           +-17.0  Intel Corporation 200 Series PCH SATA controller [AHCI mode]                 >>> PCI passthrough Windows
           +-1b.0-[02]--
           +-1b.4-[03]----00.0  Broadcom / LSI MegaRAID SAS 2208 [Thunderbolt]              >>> PCI passthrough AlmaLinux (broken)
           +-1c.0-[04]--
           +-1c.4-[05]----00.0  ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller
           +-1c.6-[06-07]----00.0-[07]--
           +-1c.7-[08]--+-00.0  Intel Corporation 82576 Gigabit Network Connection     >>> PCI passthrough Windows
           |            \-00.1  Intel Corporation 82576 Gigabit Network Connection      >>> PCI passthrough Windows
           +-1d.0-[09]----00.0  Toshiba Corporation XG4 NVMe SSD Controller
           +-1f.0  Intel Corporation 200 Series PCH LPC Controller (Z270)
           +-1f.2  Intel Corporation 200 Series/Z370 Chipset Family Power Management Controller
           +-1f.3  Intel Corporation 200 Series PCH HD Audio
           +-1f.4  Intel Corporation 200 Series/Z370 Chipset Family SMBus Controller
           \-1f.6  Intel Corporation Ethernet Connection (2) I219-V
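With the ACS patch active, it is worth confirming which IOMMU group the controller at 0000:03:00.0 lands in and what else shares it. A short Python sketch that walks the standard sysfs layout `/sys/kernel/iommu_groups/<N>/devices/<BDF>`; `iommu_groups` and `group_of` are hypothetical helpers.

```python
from pathlib import Path
from typing import Dict, List, Optional


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> Dict[int, List[str]]:
    """Hypothetical helper: map each IOMMU group number to its sorted PCI addresses."""
    base = Path(root)
    if not base.is_dir():          # IOMMU disabled, or not running on Linux
        return {}
    groups: Dict[int, List[str]] = {}
    for group_dir in base.iterdir():
        devices = group_dir / "devices"
        if devices.is_dir():
            groups[int(group_dir.name)] = sorted(d.name for d in devices.iterdir())
    return dict(sorted(groups.items()))


def group_of(bdf: str, groups: Dict[int, List[str]]) -> Optional[int]:
    """Hypothetical helper: IOMMU group containing `bdf`, or None if not listed."""
    for number, members in groups.items():
        if bdf in members:
            return number
    return None


if __name__ == "__main__":
    groups = iommu_groups()
    for number, members in groups.items():
        print(number, " ".join(members))
    # Ideally the SAS 2208 sits alone in its group.
    g = group_of("0000:03:00.0", groups)
    if g is not None:
        print("0000:03:00.0 is in group", g, "with", groups[g])
```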

/proc/cmdline
Code:
BOOT_IMAGE=/boot/vmlinuz-6.8.0-54-generic root=UUID=d8cf64fc-763a-4d17-8203-04b2e6d04c07 ro intel_iommu=on iommu=pt vfio-pci.ids=10de:1b81,10de:10f0,1000:005b earlymodules=vfio-pci i915.enable_guc=0 quiet splash vt.handoff=7
1000:005b is the ID of that controller; the 10de... IDs are NVIDIA's.
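The controller ID 1000:005b is handed to vfio-pci both on the kernel command line and in /etc/modprobe.d/vfio.conf. A small sketch that pulls the parameters out of /proc/cmdline (falling back to the line posted here) so the two lists can be compared; `parse_cmdline` and `vfio_ids` are hypothetical helpers, and the parsing is a simplification that splits on whitespace and ignores quoting.

```python
from pathlib import Path
from typing import Dict, List

# Command line as posted in the thread; used when /proc/cmdline is unavailable.
SAMPLE = ("BOOT_IMAGE=/boot/vmlinuz-6.8.0-54-generic "
          "root=UUID=d8cf64fc-763a-4d17-8203-04b2e6d04c07 ro intel_iommu=on iommu=pt "
          "vfio-pci.ids=10de:1b81,10de:10f0,1000:005b earlymodules=vfio-pci "
          "i915.enable_guc=0 quiet splash vt.handoff=7")


def parse_cmdline(cmdline: str) -> Dict[str, str]:
    """Hypothetical helper: split a kernel command line into {param: value}.
    Bare flags map to ''; quoted values containing spaces are not handled."""
    params: Dict[str, str] = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")   # split on the first '=' only
        params[key] = value
    return params


def vfio_ids(params: Dict[str, str]) -> List[str]:
    """Hypothetical helper: IDs given to vfio-pci ('-' and '_' are equivalent here)."""
    raw = params.get("vfio-pci.ids") or params.get("vfio_pci.ids") or ""
    return [i for i in raw.split(",") if i]


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    line = proc.read_text() if proc.exists() else SAMPLE
    params = parse_cmdline(line)
    print("IOMMU:", params.get("intel_iommu"), params.get("iommu"))
    print("vfio-pci ids:", vfio_ids(params))
```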

/etc/modprobe.d/vfio.conf
Code:
options vfio-pci ids=10de:1b81,10de:10f0,1000:005b disable_vga=1
/etc/modprobe.d/blacklist.conf
Code:
...
blacklist megaraid_sas

cat /etc/modules-load.d/kvm-gvt-d.conf
Code:
kvmgt
vfio-iommu-type1
vfio-mdev
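With megaraid_sas blacklisted and 1000:005b claimed by vfio-pci, the controller should be bound to vfio-pci on the host (the `vfio-pci 0000:03:00.0` prefix in dmesg suggests it is). `lspci -nnk -s 03:00.0` shows this as "Kernel driver in use"; the sketch below reads the same information from the `driver` symlink in sysfs. `bound_driver` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Optional


def bound_driver(bdf: str, sysfs: str = "/sys/bus/pci/devices") -> Optional[str]:
    """Hypothetical helper: driver bound to PCI device `bdf`, or None if unbound/absent."""
    link = Path(sysfs) / bdf / "driver"
    if not link.is_symlink():
        return None
    # The link points at .../bus/pci/drivers/<name>; the last component is the name.
    return Path(os.readlink(link)).name


if __name__ == "__main__":
    # Expected on this host: vfio-pci for the controller, pcieport for the root port.
    for bdf in ("0000:03:00.0", "0000:00:1b.4"):
        print(bdf, "->", bound_driver(bdf))
```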



Recently I had been working on getting Intel iGPU passthrough going, which I eventually managed, so I'm not sure whether I broke something in the process...


Also, when I take megaraid_sas off the blacklist, I can access the array normally and everything works fine. The pcieport error shows up only when I specifically try to pass the controller through.



Am I doing something obviously wrong?


Thanks.


RDa

Re: VM with PCIe passthrough is immediately paused
« Reply #1 on: 27. 02. 2025, 22:28:16 »
When I was passing my own (FPGA-based) PCIe card into a VM, I had a problem with D3hot sleep, which VFIO cycles the device through between VM reboots.

It looks to me like the SAS controller isn't too happy with that power management either, judging by where the error occurs:

[ 4939.330493] vfio-pci 0000:03:00.0: enabling device (0000 -> 0003)
[ 4941.548543] pcieport 0000:00:1b.4: AER: Uncorrectable (Non-Fatal) error message received from 0000:00:1b.4
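The timing pointed at here can be measured from the dmesg timestamps. A minimal sketch that returns the seconds between vfio-pci enabling a device and the next uncorrectable AER report; `seconds_to_first_uncorrectable` is a hypothetical helper that matches on the message texts seen in this thread. For the two lines above it prints 2.218 s; the gap shows the correlation only, not the cause.

```python
import re
from typing import Iterable, Optional

# "[ 4939.330493] message" -> (timestamp in seconds, message)
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def seconds_to_first_uncorrectable(lines: Iterable[str], bdf: str) -> Optional[float]:
    """Hypothetical helper: seconds from 'enabling device' for `bdf`
    to the next uncorrectable AER message, or None if not found."""
    enabled_at = None
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        if f"{bdf}: enabling device" in msg:
            enabled_at = ts
        elif enabled_at is not None and "AER: Uncorrectable" in msg:
            return ts - enabled_at
    return None


LOG = """\
[ 4939.330493] vfio-pci 0000:03:00.0: enabling device (0000 -> 0003)
[ 4941.548543] pcieport 0000:00:1b.4: AER: Uncorrectable (Non-Fatal) error message received from 0000:00:1b.4
"""

if __name__ == "__main__":
    gap = seconds_to_first_uncorrectable(LOG.splitlines(), "0000:03:00.0")
    print(f"{gap:.3f} s")
```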

So I would try a couple of things: disable ASPM in the BIOS, and then adjust VFIO so that it doesn't put the idle device into that state:

Edit the file in /etc/modprobe.d where you have configured vfio-pci and append the following to the options vfio-pci line:
disable_idle_d3=1
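To check whether the option was actually picked up, the current value of a vfio_pci parameter can be read from sysfs (the module directory uses an underscore). An assumption worth verifying: if vfio-pci is loaded from the initramfs, a change in /etc/modprobe.d may only take effect after regenerating it (`sudo update-initramfs -u` on Ubuntu) and rebooting; if vfio-pci is built into the kernel, the option would have to go on the kernel command line as `vfio-pci.disable_idle_d3=1`. `vfio_pci_param` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def vfio_pci_param(name: str,
                   root: str = "/sys/module/vfio_pci/parameters") -> Optional[str]:
    """Hypothetical helper: current value of a vfio_pci module parameter
    ('Y'/'N' for booleans), or None if the module or parameter is absent."""
    try:
        return (Path(root) / name).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Expect 'Y' for disable_idle_d3 if the option took effect.
    for name in ("disable_vga", "disable_idle_d3"):
        print(name, "=", vfio_pci_param(name))
```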

Re: VM with PCIe passthrough is immediately paused
« Reply #2 on: 03. 03. 2025, 22:26:58 »
Thanks for the idea!
Unfortunately, the BIOS has no PCIe ASPM options at all.

I made the change in the vfio config, but it made no difference.

Fortunately it's not a big problem, since a second attempt to start the VM goes through fine. It was just nagging at me... :-)