audk/OvmfPkg/PlatformPei/MemDetect.c

799 lines
24 KiB
C
Raw Normal View History

/**@file
Memory Detection for Virtual Machines.
Copyright (c) 2006 - 2016, Intel Corporation. All rights reserved.<BR>
SPDX-License-Identifier: BSD-2-Clause-Patent
Module Name:
MemDetect.c
**/
//
// The package level header files this module uses
//
OvmfPkg/PlatformPei: support >=1TB high RAM, and discontiguous high RAM In OVMF we currently get the upper (>=4GB) memory size with the GetSystemMemorySizeAbove4gb() function. The GetSystemMemorySizeAbove4gb() function is used in two places: (1) It is the starting point of the calculations in GetFirstNonAddress(). GetFirstNonAddress() in turn - determines the placement of the 64-bit PCI MMIO aperture, - provides input for the GCD memory space map's sizing (see AddressWidthInitialization(), and the CPU HOB in MiscInitialization()), - influences the permanent PEI RAM cap (the DXE core's page tables, built in permanent PEI RAM, grow as the RAM to map grows). (2) In QemuInitializeRam(), GetSystemMemorySizeAbove4gb() determines the single memory descriptor HOB that we produce for the upper memory. Respectively, there are two problems with GetSystemMemorySizeAbove4gb(): (1) It reads a 24-bit count of 64KB RAM chunks from the CMOS, and therefore cannot return a larger value than one terabyte. (2) It cannot express discontiguous high RAM. Starting with version 1.7.0, QEMU has provided the fw_cfg file called "etc/e820". Refer to the following QEMU commits: - 0624c7f916b4 ("e820: pass high memory too.", 2013-10-10), - 7d67110f2d9a ("pc: add etc/e820 fw_cfg file", 2013-10-18) - 7db16f2480db ("pc: register e820 entries for ram", 2013-10-10) Ever since these commits in v1.7.0 -- with the last QEMU release being v2.9.0, and v2.10.0 under development --, the only two RAM entries added to this E820 map correspond to the below-4GB RAM range, and the above-4GB RAM range. And, the above-4GB range exactly matches the CMOS registers in question; see the use of "pcms->above_4g_mem_size": pc_q35_init() | pc_init1() pc_memory_init() e820_add_entry(0x100000000ULL, pcms->above_4g_mem_size, E820_RAM); pc_cmos_init() val = pcms->above_4g_mem_size / 65536; rtc_set_memory(s, 0x5b, val); rtc_set_memory(s, 0x5c, val >> 8); rtc_set_memory(s, 0x5d, val >> 16); Therefore, remedy the above OVMF limitations as follows: (1) Start off GetFirstNonAddress() by scanning the E820 map for the highest exclusive >=4GB RAM address. Fall back to the CMOS if the E820 map is unavailable. Base all further calculations (such as 64-bit PCI MMIO aperture placement, GCD sizing etc) on this value. At the moment, the only difference this change makes is that we can have more than 1TB above 4GB -- given that the sole "high RAM" entry in the E820 map matches the CMOS exactly, modulo the most significant bits (see above). However, Igor plans to add discontiguous (cold-plugged) high RAM to the fw_cfg E820 RAM map later on, and then this scanning will adapt automatically. (2) In QemuInitializeRam(), describe the high RAM regions from the E820 map one by one with memory HOBs. Fall back to the CMOS only if the E820 map is missing. Again, right now this change only makes a difference if there is at least 1TB high RAM. Later on it will adapt to discontiguous high RAM (regardless of its size) automatically. -*- Implementation details: introduce the ScanOrAdd64BitE820Ram() function, which reads the E820 entries from fw_cfg, and finds the highest exclusive >=4GB RAM address, or produces memory resource descriptor HOBs for RAM entries that start at or above 4GB. The RAM map is not read in a single go, because its size can vary, and in PlatformPei we should stay away from dynamic memory allocation, for the following reasons: - "Pool" allocations are limited to ~64KB, are served from HOBs, and cannot be released ever. - "Page" allocations are seriously limited before PlatformPei installs the permanent PEI RAM. Furthermore, page allocations can only be released in DXE, with dedicated code (so the address would have to be passed on with a HOB or PCD). - Raw memory allocation HOBs would require the same freeing in DXE. Therefore we process each E820 entry as soon as it is read from fw_cfg. -*- Considering the impact of high RAM on the DXE core: A few years ago, installing high RAM as *tested* would cause the DXE core to inhabit such ranges rather than carving out its home from the permanent PEI RAM. Fortunately, this was fixed in the following edk2 commit: 3a05b13106d1, "MdeModulePkg DxeCore: Take the range in resource HOB for PHIT as higher priority", 2015-09-18 which I regression-tested at the time: http://mid.mail-archive.com/55FC27B0.4070807@redhat.com Later on, OVMF was changed to install its high RAM as tested (effectively "arming" the earlier DXE core change for OVMF), in the following edk2 commit: 035ce3b37c90, "OvmfPkg/PlatformPei: Add memory above 4GB as tested", 2016-04-21 which I also regression-tested at the time: http://mid.mail-archive.com/571E8B90.1020102@redhat.com Therefore adding more "tested memory" HOBs is safe. Cc: Jordan Justen <jordan.l.justen@intel.com> Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1468526 Contributed-under: TianoCore Contribution Agreement 1.1 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2017-07-08 01:28:37 +02:00
#include <IndustryStandard/E820.h>
#include <IndustryStandard/Q35MchIch9.h>
#include <PiPei.h>
//
// The Library classes this module consumes
//
#include <Library/BaseLib.h>
#include <Library/BaseMemoryLib.h>
#include <Library/DebugLib.h>
#include <Library/HobLib.h>
#include <Library/IoLib.h>
#include <Library/PcdLib.h>
#include <Library/PciLib.h>
#include <Library/PeimEntryPoint.h>
#include <Library/ResourcePublicationLib.h>
#include <Library/MtrrLib.h>
OvmfPkg: PlatformPei: determine the 64-bit PCI host aperture for X64 DXE The main observation about the 64-bit PCI host aperture is that it is the highest part of the useful address space. It impacts the top of the GCD memory space map, and, consequently, our maximum address width calculation for the CPU HOB too. Thus, modify the GetFirstNonAddress() function to consider the following areas above the high RAM, while calculating the first non-address (i.e., the highest inclusive address, plus one): - the memory hotplug area (optional, the size comes from QEMU), - the 64-bit PCI host aperture (we set a default size). While computing the first non-address, capture the base and the size of the 64-bit PCI host aperture at once in PCDs, since they are natural parts of the calculation. (Similarly to how PcdPciMmio32* are not rewritten on the S3 resume path (see the InitializePlatform() -> MemMapInitialization() condition), nor are PcdPciMmio64*. Only the core PciHostBridgeDxe driver consumes them, through our PciHostBridgeLib instance.) Set 32GB as the default size for the aperture. Issue#59 mentions the NVIDIA Tesla K80 as an assignable device. According to nvidia.com, these cards may have 24GB of memory (probably 16GB + 8GB BARs). As a strictly experimental feature, the user can specify the size of the aperture (in MB) as well, with the QEMU option -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536 The "X-" prefix follows the QEMU tradition (spelled "x-" there), meaning that the property is experimental, unstable, and might go away any time. Gerd has proposed heuristics for sizing the aperture automatically (based on 1GB page support and PCPU address width), but such should be delayed to a later patch (which may very well back out "X-PciMmio64Mb" then). For "everyday" guests, the 32GB default for the aperture size shouldn't impact the PEI memory demand (the size of the page tables that the DXE IPL PEIM builds). Namely, we've never reported narrower than 36-bit addresses; the DXE IPL PEIM has always built page tables for 64GB at least. For the aperture to bump the address width above 36 bits, either the guest must have quite a bit of memory itself (in which case the additional PEI memory demand shouldn't matter), or the user must specify a large aperture manually with "X-PciMmio64Mb" (and then he or she is also responsible for giving enough RAM to the VM, to satisfy the PEI memory demand). Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Jordan Justen <jordan.l.justen@intel.com> Cc: Marcel Apfelbaum <marcel@redhat.com> Cc: Thomas Lamprecht <t.lamprecht@proxmox.com> Ref: https://github.com/tianocore/edk2/issues/59 Ref: http://www.nvidia.com/object/tesla-servers.html Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2016-03-04 19:30:45 +01:00
#include <Library/QemuFwCfgLib.h>
#include "Platform.h"
#include "Cmos.h"
UINT8 mPhysMemAddressWidth;
STATIC UINT32 mS3AcpiReservedMemoryBase;
STATIC UINT32 mS3AcpiReservedMemorySize;
STATIC UINT16 mQ35TsegMbytes;
VOID
Q35TsegMbytesInitialization (
VOID
)
{
UINT16 ExtendedTsegMbytes;
RETURN_STATUS PcdStatus;
if (mHostBridgeDevId != INTEL_Q35_MCH_DEVICE_ID) {
DEBUG ((
DEBUG_ERROR,
"%a: no TSEG (SMRAM) on host bridge DID=0x%04x; "
"only DID=0x%04x (Q35) is supported\n",
__FUNCTION__,
mHostBridgeDevId,
INTEL_Q35_MCH_DEVICE_ID
));
ASSERT (FALSE);
CpuDeadLoop ();
}
//
// Check if QEMU offers an extended TSEG.
//
// This can be seen from writing MCH_EXT_TSEG_MB_QUERY to the MCH_EXT_TSEG_MB
// register, and reading back the register.
//
// On a QEMU machine type that does not offer an extended TSEG, the initial
// write overwrites whatever value a malicious guest OS may have placed in
// the (unimplemented) register, before entering S3 or rebooting.
// Subsequently, the read returns MCH_EXT_TSEG_MB_QUERY unchanged.
//
// On a QEMU machine type that offers an extended TSEG, the initial write
// triggers an update to the register. Subsequently, the value read back
// (which is guaranteed to differ from MCH_EXT_TSEG_MB_QUERY) tells us the
// number of megabytes.
//
PciWrite16 (DRAMC_REGISTER_Q35 (MCH_EXT_TSEG_MB), MCH_EXT_TSEG_MB_QUERY);
ExtendedTsegMbytes = PciRead16 (DRAMC_REGISTER_Q35 (MCH_EXT_TSEG_MB));
if (ExtendedTsegMbytes == MCH_EXT_TSEG_MB_QUERY) {
mQ35TsegMbytes = PcdGet16 (PcdQ35TsegMbytes);
return;
}
DEBUG ((
DEBUG_INFO,
"%a: QEMU offers an extended TSEG (%d MB)\n",
__FUNCTION__,
ExtendedTsegMbytes
));
PcdStatus = PcdSet16S (PcdQ35TsegMbytes, ExtendedTsegMbytes);
ASSERT_RETURN_ERROR (PcdStatus);
mQ35TsegMbytes = ExtendedTsegMbytes;
}
OvmfPkg/PlatformPei: support >=1TB high RAM, and discontiguous high RAM In OVMF we currently get the upper (>=4GB) memory size with the GetSystemMemorySizeAbove4gb() function. The GetSystemMemorySizeAbove4gb() function is used in two places: (1) It is the starting point of the calculations in GetFirstNonAddress(). GetFirstNonAddress() in turn - determines the placement of the 64-bit PCI MMIO aperture, - provides input for the GCD memory space map's sizing (see AddressWidthInitialization(), and the CPU HOB in MiscInitialization()), - influences the permanent PEI RAM cap (the DXE core's page tables, built in permanent PEI RAM, grow as the RAM to map grows). (2) In QemuInitializeRam(), GetSystemMemorySizeAbove4gb() determines the single memory descriptor HOB that we produce for the upper memory. Respectively, there are two problems with GetSystemMemorySizeAbove4gb(): (1) It reads a 24-bit count of 64KB RAM chunks from the CMOS, and therefore cannot return a larger value than one terabyte. (2) It cannot express discontiguous high RAM. Starting with version 1.7.0, QEMU has provided the fw_cfg file called "etc/e820". Refer to the following QEMU commits: - 0624c7f916b4 ("e820: pass high memory too.", 2013-10-10), - 7d67110f2d9a ("pc: add etc/e820 fw_cfg file", 2013-10-18) - 7db16f2480db ("pc: register e820 entries for ram", 2013-10-10) Ever since these commits in v1.7.0 -- with the last QEMU release being v2.9.0, and v2.10.0 under development --, the only two RAM entries added to this E820 map correspond to the below-4GB RAM range, and the above-4GB RAM range. And, the above-4GB range exactly matches the CMOS registers in question; see the use of "pcms->above_4g_mem_size": pc_q35_init() | pc_init1() pc_memory_init() e820_add_entry(0x100000000ULL, pcms->above_4g_mem_size, E820_RAM); pc_cmos_init() val = pcms->above_4g_mem_size / 65536; rtc_set_memory(s, 0x5b, val); rtc_set_memory(s, 0x5c, val >> 8); rtc_set_memory(s, 0x5d, val >> 16); Therefore, remedy the above OVMF limitations as follows: (1) Start off GetFirstNonAddress() by scanning the E820 map for the highest exclusive >=4GB RAM address. Fall back to the CMOS if the E820 map is unavailable. Base all further calculations (such as 64-bit PCI MMIO aperture placement, GCD sizing etc) on this value. At the moment, the only difference this change makes is that we can have more than 1TB above 4GB -- given that the sole "high RAM" entry in the E820 map matches the CMOS exactly, modulo the most significant bits (see above). However, Igor plans to add discontiguous (cold-plugged) high RAM to the fw_cfg E820 RAM map later on, and then this scanning will adapt automatically. (2) In QemuInitializeRam(), describe the high RAM regions from the E820 map one by one with memory HOBs. Fall back to the CMOS only if the E820 map is missing. Again, right now this change only makes a difference if there is at least 1TB high RAM. Later on it will adapt to discontiguous high RAM (regardless of its size) automatically. -*- Implementation details: introduce the ScanOrAdd64BitE820Ram() function, which reads the E820 entries from fw_cfg, and finds the highest exclusive >=4GB RAM address, or produces memory resource descriptor HOBs for RAM entries that start at or above 4GB. The RAM map is not read in a single go, because its size can vary, and in PlatformPei we should stay away from dynamic memory allocation, for the following reasons: - "Pool" allocations are limited to ~64KB, are served from HOBs, and cannot be released ever. - "Page" allocations are seriously limited before PlatformPei installs the permanent PEI RAM. Furthermore, page allocations can only be released in DXE, with dedicated code (so the address would have to be passed on with a HOB or PCD). - Raw memory allocation HOBs would require the same freeing in DXE. Therefore we process each E820 entry as soon as it is read from fw_cfg. -*- Considering the impact of high RAM on the DXE core: A few years ago, installing high RAM as *tested* would cause the DXE core to inhabit such ranges rather than carving out its home from the permanent PEI RAM. Fortunately, this was fixed in the following edk2 commit: 3a05b13106d1, "MdeModulePkg DxeCore: Take the range in resource HOB for PHIT as higher priority", 2015-09-18 which I regression-tested at the time: http://mid.mail-archive.com/55FC27B0.4070807@redhat.com Later on, OVMF was changed to install its high RAM as tested (effectively "arming" the earlier DXE core change for OVMF), in the following edk2 commit: 035ce3b37c90, "OvmfPkg/PlatformPei: Add memory above 4GB as tested", 2016-04-21 which I also regression-tested at the time: http://mid.mail-archive.com/571E8B90.1020102@redhat.com Therefore adding more "tested memory" HOBs is safe. Cc: Jordan Justen <jordan.l.justen@intel.com> Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1468526 Contributed-under: TianoCore Contribution Agreement 1.1 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2017-07-08 01:28:37 +02:00
/**
Iterate over the RAM entries in QEMU's fw_cfg E820 RAM map that start outside
of the 32-bit address range.
Find the highest exclusive >=4GB RAM address, or produce memory resource
descriptor HOBs for RAM entries that start at or above 4GB.
@param[out] MaxAddress If MaxAddress is NULL, then ScanOrAdd64BitE820Ram()
produces memory resource descriptor HOBs for RAM
entries that start at or above 4GB.
Otherwise, MaxAddress holds the highest exclusive
>=4GB RAM address on output. If QEMU's fw_cfg E820
RAM map contains no RAM entry that starts outside of
the 32-bit address range, then MaxAddress is exactly
4GB on output.
@retval EFI_SUCCESS The fw_cfg E820 RAM map was found and processed.
@retval EFI_PROTOCOL_ERROR The RAM map was found, but its size wasn't a
whole multiple of sizeof(EFI_E820_ENTRY64). No
RAM entry was processed.
@return Error codes from QemuFwCfgFindFile(). No RAM
entry was processed.
**/
STATIC
EFI_STATUS
ScanOrAdd64BitE820Ram (
OUT UINT64 *MaxAddress OPTIONAL
)
{
EFI_STATUS Status;
FIRMWARE_CONFIG_ITEM FwCfgItem;
UINTN FwCfgSize;
EFI_E820_ENTRY64 E820Entry;
UINTN Processed;
Status = QemuFwCfgFindFile ("etc/e820", &FwCfgItem, &FwCfgSize);
if (EFI_ERROR (Status)) {
return Status;
}
if (FwCfgSize % sizeof E820Entry != 0) {
return EFI_PROTOCOL_ERROR;
}
if (MaxAddress != NULL) {
*MaxAddress = BASE_4GB;
}
QemuFwCfgSelectItem (FwCfgItem);
for (Processed = 0; Processed < FwCfgSize; Processed += sizeof E820Entry) {
QemuFwCfgReadBytes (sizeof E820Entry, &E820Entry);
DEBUG ((
DEBUG_VERBOSE,
"%a: Base=0x%Lx Length=0x%Lx Type=%u\n",
__FUNCTION__,
E820Entry.BaseAddr,
E820Entry.Length,
E820Entry.Type
));
if (E820Entry.Type == EfiAcpiAddressRangeMemory &&
E820Entry.BaseAddr >= BASE_4GB) {
if (MaxAddress == NULL) {
UINT64 Base;
UINT64 End;
//
// Round up the start address, and round down the end address.
//
Base = ALIGN_VALUE (E820Entry.BaseAddr, (UINT64)EFI_PAGE_SIZE);
End = (E820Entry.BaseAddr + E820Entry.Length) &
~(UINT64)EFI_PAGE_MASK;
if (Base < End) {
AddMemoryRangeHob (Base, End);
DEBUG ((
DEBUG_VERBOSE,
"%a: AddMemoryRangeHob [0x%Lx, 0x%Lx)\n",
__FUNCTION__,
Base,
End
));
}
} else {
UINT64 Candidate;
Candidate = E820Entry.BaseAddr + E820Entry.Length;
if (Candidate > *MaxAddress) {
*MaxAddress = Candidate;
DEBUG ((
DEBUG_VERBOSE,
"%a: MaxAddress=0x%Lx\n",
__FUNCTION__,
*MaxAddress
));
}
}
}
}
return EFI_SUCCESS;
}
UINT32
GetSystemMemorySizeBelow4gb (
VOID
)
{
UINT8 Cmos0x34;
UINT8 Cmos0x35;
//
// CMOS 0x34/0x35 specifies the system memory above 16 MB.
// * CMOS(0x35) is the high byte
// * CMOS(0x34) is the low byte
// * The size is specified in 64kb chunks
// * Since this is memory above 16MB, the 16MB must be added
// into the calculation to get the total memory size.
//
Cmos0x34 = (UINT8) CmosRead8 (0x34);
Cmos0x35 = (UINT8) CmosRead8 (0x35);
return (UINT32) (((UINTN)((Cmos0x35 << 8) + Cmos0x34) << 16) + SIZE_16MB);
}
STATIC
UINT64
GetSystemMemorySizeAbove4gb (
)
{
UINT32 Size;
UINTN CmosIndex;
//
// CMOS 0x5b-0x5d specifies the system memory above 4GB MB.
// * CMOS(0x5d) is the most significant size byte
// * CMOS(0x5c) is the middle size byte
// * CMOS(0x5b) is the least significant size byte
// * The size is specified in 64kb chunks
//
Size = 0;
for (CmosIndex = 0x5d; CmosIndex >= 0x5b; CmosIndex--) {
Size = (UINT32) (Size << 8) + (UINT32) CmosRead8 (CmosIndex);
}
return LShiftU64 (Size, 16);
}
/**
Return the highest address that DXE could possibly use, plus one.
**/
STATIC
UINT64
GetFirstNonAddress (
VOID
)
{
UINT64 FirstNonAddress;
OvmfPkg: PlatformPei: determine the 64-bit PCI host aperture for X64 DXE The main observation about the 64-bit PCI host aperture is that it is the highest part of the useful address space. It impacts the top of the GCD memory space map, and, consequently, our maximum address width calculation for the CPU HOB too. Thus, modify the GetFirstNonAddress() function to consider the following areas above the high RAM, while calculating the first non-address (i.e., the highest inclusive address, plus one): - the memory hotplug area (optional, the size comes from QEMU), - the 64-bit PCI host aperture (we set a default size). While computing the first non-address, capture the base and the size of the 64-bit PCI host aperture at once in PCDs, since they are natural parts of the calculation. (Similarly to how PcdPciMmio32* are not rewritten on the S3 resume path (see the InitializePlatform() -> MemMapInitialization() condition), nor are PcdPciMmio64*. Only the core PciHostBridgeDxe driver consumes them, through our PciHostBridgeLib instance.) Set 32GB as the default size for the aperture. Issue#59 mentions the NVIDIA Tesla K80 as an assignable device. According to nvidia.com, these cards may have 24GB of memory (probably 16GB + 8GB BARs). As a strictly experimental feature, the user can specify the size of the aperture (in MB) as well, with the QEMU option -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536 The "X-" prefix follows the QEMU tradition (spelled "x-" there), meaning that the property is experimental, unstable, and might go away any time. Gerd has proposed heuristics for sizing the aperture automatically (based on 1GB page support and PCPU address width), but such should be delayed to a later patch (which may very well back out "X-PciMmio64Mb" then). For "everyday" guests, the 32GB default for the aperture size shouldn't impact the PEI memory demand (the size of the page tables that the DXE IPL PEIM builds). Namely, we've never reported narrower than 36-bit addresses; the DXE IPL PEIM has always built page tables for 64GB at least. For the aperture to bump the address width above 36 bits, either the guest must have quite a bit of memory itself (in which case the additional PEI memory demand shouldn't matter), or the user must specify a large aperture manually with "X-PciMmio64Mb" (and then he or she is also responsible for giving enough RAM to the VM, to satisfy the PEI memory demand). Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Jordan Justen <jordan.l.justen@intel.com> Cc: Marcel Apfelbaum <marcel@redhat.com> Cc: Thomas Lamprecht <t.lamprecht@proxmox.com> Ref: https://github.com/tianocore/edk2/issues/59 Ref: http://www.nvidia.com/object/tesla-servers.html Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2016-03-04 19:30:45 +01:00
UINT64 Pci64Base, Pci64Size;
CHAR8 MbString[7 + 1];
EFI_STATUS Status;
FIRMWARE_CONFIG_ITEM FwCfgItem;
UINTN FwCfgSize;
UINT64 HotPlugMemoryEnd;
RETURN_STATUS PcdStatus;
OvmfPkg/PlatformPei: support >=1TB high RAM, and discontiguous high RAM In OVMF we currently get the upper (>=4GB) memory size with the GetSystemMemorySizeAbove4gb() function. The GetSystemMemorySizeAbove4gb() function is used in two places: (1) It is the starting point of the calculations in GetFirstNonAddress(). GetFirstNonAddress() in turn - determines the placement of the 64-bit PCI MMIO aperture, - provides input for the GCD memory space map's sizing (see AddressWidthInitialization(), and the CPU HOB in MiscInitialization()), - influences the permanent PEI RAM cap (the DXE core's page tables, built in permanent PEI RAM, grow as the RAM to map grows). (2) In QemuInitializeRam(), GetSystemMemorySizeAbove4gb() determines the single memory descriptor HOB that we produce for the upper memory. Respectively, there are two problems with GetSystemMemorySizeAbove4gb(): (1) It reads a 24-bit count of 64KB RAM chunks from the CMOS, and therefore cannot return a larger value than one terabyte. (2) It cannot express discontiguous high RAM. Starting with version 1.7.0, QEMU has provided the fw_cfg file called "etc/e820". Refer to the following QEMU commits: - 0624c7f916b4 ("e820: pass high memory too.", 2013-10-10), - 7d67110f2d9a ("pc: add etc/e820 fw_cfg file", 2013-10-18) - 7db16f2480db ("pc: register e820 entries for ram", 2013-10-10) Ever since these commits in v1.7.0 -- with the last QEMU release being v2.9.0, and v2.10.0 under development --, the only two RAM entries added to this E820 map correspond to the below-4GB RAM range, and the above-4GB RAM range. And, the above-4GB range exactly matches the CMOS registers in question; see the use of "pcms->above_4g_mem_size": pc_q35_init() | pc_init1() pc_memory_init() e820_add_entry(0x100000000ULL, pcms->above_4g_mem_size, E820_RAM); pc_cmos_init() val = pcms->above_4g_mem_size / 65536; rtc_set_memory(s, 0x5b, val); rtc_set_memory(s, 0x5c, val >> 8); rtc_set_memory(s, 0x5d, val >> 16); Therefore, remedy the above OVMF limitations as follows: (1) Start off GetFirstNonAddress() by scanning the E820 map for the highest exclusive >=4GB RAM address. Fall back to the CMOS if the E820 map is unavailable. Base all further calculations (such as 64-bit PCI MMIO aperture placement, GCD sizing etc) on this value. At the moment, the only difference this change makes is that we can have more than 1TB above 4GB -- given that the sole "high RAM" entry in the E820 map matches the CMOS exactly, modulo the most significant bits (see above). However, Igor plans to add discontiguous (cold-plugged) high RAM to the fw_cfg E820 RAM map later on, and then this scanning will adapt automatically. (2) In QemuInitializeRam(), describe the high RAM regions from the E820 map one by one with memory HOBs. Fall back to the CMOS only if the E820 map is missing. Again, right now this change only makes a difference if there is at least 1TB high RAM. Later on it will adapt to discontiguous high RAM (regardless of its size) automatically. -*- Implementation details: introduce the ScanOrAdd64BitE820Ram() function, which reads the E820 entries from fw_cfg, and finds the highest exclusive >=4GB RAM address, or produces memory resource descriptor HOBs for RAM entries that start at or above 4GB. The RAM map is not read in a single go, because its size can vary, and in PlatformPei we should stay away from dynamic memory allocation, for the following reasons: - "Pool" allocations are limited to ~64KB, are served from HOBs, and cannot be released ever. - "Page" allocations are seriously limited before PlatformPei installs the permanent PEI RAM. Furthermore, page allocations can only be released in DXE, with dedicated code (so the address would have to be passed on with a HOB or PCD). - Raw memory allocation HOBs would require the same freeing in DXE. Therefore we process each E820 entry as soon as it is read from fw_cfg. -*- Considering the impact of high RAM on the DXE core: A few years ago, installing high RAM as *tested* would cause the DXE core to inhabit such ranges rather than carving out its home from the permanent PEI RAM. Fortunately, this was fixed in the following edk2 commit: 3a05b13106d1, "MdeModulePkg DxeCore: Take the range in resource HOB for PHIT as higher priority", 2015-09-18 which I regression-tested at the time: http://mid.mail-archive.com/55FC27B0.4070807@redhat.com Later on, OVMF was changed to install its high RAM as tested (effectively "arming" the earlier DXE core change for OVMF), in the following edk2 commit: 035ce3b37c90, "OvmfPkg/PlatformPei: Add memory above 4GB as tested", 2016-04-21 which I also regression-tested at the time: http://mid.mail-archive.com/571E8B90.1020102@redhat.com Therefore adding more "tested memory" HOBs is safe. Cc: Jordan Justen <jordan.l.justen@intel.com> Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1468526 Contributed-under: TianoCore Contribution Agreement 1.1 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2017-07-08 01:28:37 +02:00
//
// set FirstNonAddress to suppress incorrect compiler/analyzer warnings
//
FirstNonAddress = 0;
//
// If QEMU presents an E820 map, then get the highest exclusive >=4GB RAM
// address from it. This can express an address >= 4GB+1TB.
//
// Otherwise, get the flat size of the memory above 4GB from the CMOS (which
// can only express a size smaller than 1TB), and add it to 4GB.
//
Status = ScanOrAdd64BitE820Ram (&FirstNonAddress);
if (EFI_ERROR (Status)) {
FirstNonAddress = BASE_4GB + GetSystemMemorySizeAbove4gb ();
}
OvmfPkg: PlatformPei: determine the 64-bit PCI host aperture for X64 DXE The main observation about the 64-bit PCI host aperture is that it is the highest part of the useful address space. It impacts the top of the GCD memory space map, and, consequently, our maximum address width calculation for the CPU HOB too. Thus, modify the GetFirstNonAddress() function to consider the following areas above the high RAM, while calculating the first non-address (i.e., the highest inclusive address, plus one): - the memory hotplug area (optional, the size comes from QEMU), - the 64-bit PCI host aperture (we set a default size). While computing the first non-address, capture the base and the size of the 64-bit PCI host aperture at once in PCDs, since they are natural parts of the calculation. (Similarly to how PcdPciMmio32* are not rewritten on the S3 resume path (see the InitializePlatform() -> MemMapInitialization() condition), nor are PcdPciMmio64*. Only the core PciHostBridgeDxe driver consumes them, through our PciHostBridgeLib instance.) Set 32GB as the default size for the aperture. Issue#59 mentions the NVIDIA Tesla K80 as an assignable device. According to nvidia.com, these cards may have 24GB of memory (probably 16GB + 8GB BARs). As a strictly experimental feature, the user can specify the size of the aperture (in MB) as well, with the QEMU option -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536 The "X-" prefix follows the QEMU tradition (spelled "x-" there), meaning that the property is experimental, unstable, and might go away any time. Gerd has proposed heuristics for sizing the aperture automatically (based on 1GB page support and PCPU address width), but such should be delayed to a later patch (which may very well back out "X-PciMmio64Mb" then). For "everyday" guests, the 32GB default for the aperture size shouldn't impact the PEI memory demand (the size of the page tables that the DXE IPL PEIM builds). Namely, we've never reported narrower than 36-bit addresses; the DXE IPL PEIM has always built page tables for 64GB at least. For the aperture to bump the address width above 36 bits, either the guest must have quite a bit of memory itself (in which case the additional PEI memory demand shouldn't matter), or the user must specify a large aperture manually with "X-PciMmio64Mb" (and then he or she is also responsible for giving enough RAM to the VM, to satisfy the PEI memory demand). Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Jordan Justen <jordan.l.justen@intel.com> Cc: Marcel Apfelbaum <marcel@redhat.com> Cc: Thomas Lamprecht <t.lamprecht@proxmox.com> Ref: https://github.com/tianocore/edk2/issues/59 Ref: http://www.nvidia.com/object/tesla-servers.html Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2016-03-04 19:30:45 +01:00
//
// If DXE is 32-bit, then we're done; PciBusDxe will degrade 64-bit MMIO
// resources to 32-bit anyway. See DegradeResource() in
// "PciResourceSupport.c".
//
#ifdef MDE_CPU_IA32
if (!FeaturePcdGet (PcdDxeIplSwitchToLongMode)) {
return FirstNonAddress;
}
#endif
//
// Otherwise, in order to calculate the highest address plus one, we must
// consider the 64-bit PCI host aperture too. Fetch the default size.
//
Pci64Size = PcdGet64 (PcdPciMmio64Size);
//
// See if the user specified the number of megabytes for the 64-bit PCI host
// aperture. The number of non-NUL characters in MbString allows for
// 9,999,999 MB, which is approximately 10 TB.
//
// As signaled by the "X-" prefix, this knob is experimental, and might go
// away at any time.
//
Status = QemuFwCfgFindFile ("opt/ovmf/X-PciMmio64Mb", &FwCfgItem,
&FwCfgSize);
if (!EFI_ERROR (Status)) {
if (FwCfgSize >= sizeof MbString) {
DEBUG ((EFI_D_WARN,
"%a: ignoring malformed 64-bit PCI host aperture size from fw_cfg\n",
__FUNCTION__));
} else {
QemuFwCfgSelectItem (FwCfgItem);
QemuFwCfgReadBytes (FwCfgSize, MbString);
MbString[FwCfgSize] = '\0';
Pci64Size = LShiftU64 (AsciiStrDecimalToUint64 (MbString), 20);
}
}
if (Pci64Size == 0) {
if (mBootMode != BOOT_ON_S3_RESUME) {
DEBUG ((EFI_D_INFO, "%a: disabling 64-bit PCI host aperture\n",
__FUNCTION__));
PcdStatus = PcdSet64S (PcdPciMmio64Size, 0);
ASSERT_RETURN_ERROR (PcdStatus);
OvmfPkg: PlatformPei: determine the 64-bit PCI host aperture for X64 DXE The main observation about the 64-bit PCI host aperture is that it is the highest part of the useful address space. It impacts the top of the GCD memory space map, and, consequently, our maximum address width calculation for the CPU HOB too. Thus, modify the GetFirstNonAddress() function to consider the following areas above the high RAM, while calculating the first non-address (i.e., the highest inclusive address, plus one): - the memory hotplug area (optional, the size comes from QEMU), - the 64-bit PCI host aperture (we set a default size). While computing the first non-address, capture the base and the size of the 64-bit PCI host aperture at once in PCDs, since they are natural parts of the calculation. (Similarly to how PcdPciMmio32* are not rewritten on the S3 resume path (see the InitializePlatform() -> MemMapInitialization() condition), nor are PcdPciMmio64*. Only the core PciHostBridgeDxe driver consumes them, through our PciHostBridgeLib instance.) Set 32GB as the default size for the aperture. Issue#59 mentions the NVIDIA Tesla K80 as an assignable device. According to nvidia.com, these cards may have 24GB of memory (probably 16GB + 8GB BARs). As a strictly experimental feature, the user can specify the size of the aperture (in MB) as well, with the QEMU option -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536 The "X-" prefix follows the QEMU tradition (spelled "x-" there), meaning that the property is experimental, unstable, and might go away any time. Gerd has proposed heuristics for sizing the aperture automatically (based on 1GB page support and PCPU address width), but such should be delayed to a later patch (which may very well back out "X-PciMmio64Mb" then). For "everyday" guests, the 32GB default for the aperture size shouldn't impact the PEI memory demand (the size of the page tables that the DXE IPL PEIM builds). Namely, we've never reported narrower than 36-bit addresses; the DXE IPL PEIM has always built page tables for 64GB at least. For the aperture to bump the address width above 36 bits, either the guest must have quite a bit of memory itself (in which case the additional PEI memory demand shouldn't matter), or the user must specify a large aperture manually with "X-PciMmio64Mb" (and then he or she is also responsible for giving enough RAM to the VM, to satisfy the PEI memory demand). Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Jordan Justen <jordan.l.justen@intel.com> Cc: Marcel Apfelbaum <marcel@redhat.com> Cc: Thomas Lamprecht <t.lamprecht@proxmox.com> Ref: https://github.com/tianocore/edk2/issues/59 Ref: http://www.nvidia.com/object/tesla-servers.html Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2016-03-04 19:30:45 +01:00
}
//
// There's nothing more to do; the amount of memory above 4GB fully
// determines the highest address plus one. The memory hotplug area (see
// below) plays no role for the firmware in this case.
//
return FirstNonAddress;
}
//
// The "etc/reserved-memory-end" fw_cfg file, when present, contains an
// absolute, exclusive end address for the memory hotplug area. This area
// starts right at the end of the memory above 4GB. The 64-bit PCI host
// aperture must be placed above it.
//
Status = QemuFwCfgFindFile ("etc/reserved-memory-end", &FwCfgItem,
&FwCfgSize);
if (!EFI_ERROR (Status) && FwCfgSize == sizeof HotPlugMemoryEnd) {
QemuFwCfgSelectItem (FwCfgItem);
QemuFwCfgReadBytes (FwCfgSize, &HotPlugMemoryEnd);
DEBUG ((DEBUG_VERBOSE, "%a: HotPlugMemoryEnd=0x%Lx\n", __FUNCTION__,
HotPlugMemoryEnd));
OvmfPkg: PlatformPei: determine the 64-bit PCI host aperture for X64 DXE The main observation about the 64-bit PCI host aperture is that it is the highest part of the useful address space. It impacts the top of the GCD memory space map, and, consequently, our maximum address width calculation for the CPU HOB too. Thus, modify the GetFirstNonAddress() function to consider the following areas above the high RAM, while calculating the first non-address (i.e., the highest inclusive address, plus one): - the memory hotplug area (optional, the size comes from QEMU), - the 64-bit PCI host aperture (we set a default size). While computing the first non-address, capture the base and the size of the 64-bit PCI host aperture at once in PCDs, since they are natural parts of the calculation. (Similarly to how PcdPciMmio32* are not rewritten on the S3 resume path (see the InitializePlatform() -> MemMapInitialization() condition), nor are PcdPciMmio64*. Only the core PciHostBridgeDxe driver consumes them, through our PciHostBridgeLib instance.) Set 32GB as the default size for the aperture. Issue#59 mentions the NVIDIA Tesla K80 as an assignable device. According to nvidia.com, these cards may have 24GB of memory (probably 16GB + 8GB BARs). As a strictly experimental feature, the user can specify the size of the aperture (in MB) as well, with the QEMU option -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536 The "X-" prefix follows the QEMU tradition (spelled "x-" there), meaning that the property is experimental, unstable, and might go away any time. Gerd has proposed heuristics for sizing the aperture automatically (based on 1GB page support and PCPU address width), but such should be delayed to a later patch (which may very well back out "X-PciMmio64Mb" then). For "everyday" guests, the 32GB default for the aperture size shouldn't impact the PEI memory demand (the size of the page tables that the DXE IPL PEIM builds). Namely, we've never reported narrower than 36-bit addresses; the DXE IPL PEIM has always built page tables for 64GB at least. For the aperture to bump the address width above 36 bits, either the guest must have quite a bit of memory itself (in which case the additional PEI memory demand shouldn't matter), or the user must specify a large aperture manually with "X-PciMmio64Mb" (and then he or she is also responsible for giving enough RAM to the VM, to satisfy the PEI memory demand). Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Jordan Justen <jordan.l.justen@intel.com> Cc: Marcel Apfelbaum <marcel@redhat.com> Cc: Thomas Lamprecht <t.lamprecht@proxmox.com> Ref: https://github.com/tianocore/edk2/issues/59 Ref: http://www.nvidia.com/object/tesla-servers.html Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2016-03-04 19:30:45 +01:00
ASSERT (HotPlugMemoryEnd >= FirstNonAddress);
FirstNonAddress = HotPlugMemoryEnd;
}
//
// SeaBIOS aligns both boundaries of the 64-bit PCI host aperture to 1GB, so
// that the host can map it with 1GB hugepages. Follow suit.
//
Pci64Base = ALIGN_VALUE (FirstNonAddress, (UINT64)SIZE_1GB);
Pci64Size = ALIGN_VALUE (Pci64Size, (UINT64)SIZE_1GB);
//
// The 64-bit PCI host aperture should also be "naturally" aligned. The
// alignment is determined by rounding the size of the aperture down to the
// next smaller or equal power of two. That is, align the aperture by the
// largest BAR size that can fit into it.
//
Pci64Base = ALIGN_VALUE (Pci64Base, GetPowerOfTwo64 (Pci64Size));
if (mBootMode != BOOT_ON_S3_RESUME) {
//
// The core PciHostBridgeDxe driver will automatically add this range to
// the GCD memory space map through our PciHostBridgeLib instance; here we
// only need to set the PCDs.
//
PcdStatus = PcdSet64S (PcdPciMmio64Base, Pci64Base);
ASSERT_RETURN_ERROR (PcdStatus);
PcdStatus = PcdSet64S (PcdPciMmio64Size, Pci64Size);
ASSERT_RETURN_ERROR (PcdStatus);
OvmfPkg: PlatformPei: determine the 64-bit PCI host aperture for X64 DXE The main observation about the 64-bit PCI host aperture is that it is the highest part of the useful address space. It impacts the top of the GCD memory space map, and, consequently, our maximum address width calculation for the CPU HOB too. Thus, modify the GetFirstNonAddress() function to consider the following areas above the high RAM, while calculating the first non-address (i.e., the highest inclusive address, plus one): - the memory hotplug area (optional, the size comes from QEMU), - the 64-bit PCI host aperture (we set a default size). While computing the first non-address, capture the base and the size of the 64-bit PCI host aperture at once in PCDs, since they are natural parts of the calculation. (Similarly to how PcdPciMmio32* are not rewritten on the S3 resume path (see the InitializePlatform() -> MemMapInitialization() condition), nor are PcdPciMmio64*. Only the core PciHostBridgeDxe driver consumes them, through our PciHostBridgeLib instance.) Set 32GB as the default size for the aperture. Issue#59 mentions the NVIDIA Tesla K80 as an assignable device. According to nvidia.com, these cards may have 24GB of memory (probably 16GB + 8GB BARs). As a strictly experimental feature, the user can specify the size of the aperture (in MB) as well, with the QEMU option -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536 The "X-" prefix follows the QEMU tradition (spelled "x-" there), meaning that the property is experimental, unstable, and might go away any time. Gerd has proposed heuristics for sizing the aperture automatically (based on 1GB page support and PCPU address width), but such should be delayed to a later patch (which may very well back out "X-PciMmio64Mb" then). For "everyday" guests, the 32GB default for the aperture size shouldn't impact the PEI memory demand (the size of the page tables that the DXE IPL PEIM builds). Namely, we've never reported narrower than 36-bit addresses; the DXE IPL PEIM has always built page tables for 64GB at least. For the aperture to bump the address width above 36 bits, either the guest must have quite a bit of memory itself (in which case the additional PEI memory demand shouldn't matter), or the user must specify a large aperture manually with "X-PciMmio64Mb" (and then he or she is also responsible for giving enough RAM to the VM, to satisfy the PEI memory demand). Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Jordan Justen <jordan.l.justen@intel.com> Cc: Marcel Apfelbaum <marcel@redhat.com> Cc: Thomas Lamprecht <t.lamprecht@proxmox.com> Ref: https://github.com/tianocore/edk2/issues/59 Ref: http://www.nvidia.com/object/tesla-servers.html Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2016-03-04 19:30:45 +01:00
DEBUG ((EFI_D_INFO, "%a: Pci64Base=0x%Lx Pci64Size=0x%Lx\n",
__FUNCTION__, Pci64Base, Pci64Size));
}
//
// The useful address space ends with the 64-bit PCI host aperture.
//
FirstNonAddress = Pci64Base + Pci64Size;
return FirstNonAddress;
}
/**
Initialize the mPhysMemAddressWidth variable, based on guest RAM size.
**/
VOID
AddressWidthInitialization (
VOID
)
{
UINT64 FirstNonAddress;
//
// As guest-physical memory size grows, the permanent PEI RAM requirements
// are dominated by the identity-mapping page tables built by the DXE IPL.
// The DXL IPL keys off of the physical address bits advertized in the CPU
// HOB. To conserve memory, we calculate the minimum address width here.
//
FirstNonAddress = GetFirstNonAddress ();
mPhysMemAddressWidth = (UINT8)HighBitSet64 (FirstNonAddress);
//
// If FirstNonAddress is not an integral power of two, then we need an
// additional bit.
//
if ((FirstNonAddress & (FirstNonAddress - 1)) != 0) {
++mPhysMemAddressWidth;
}
//
// The minimum address width is 36 (covers up to and excluding 64 GB, which
// is the maximum for Ia32 + PAE). The theoretical architecture maximum for
// X64 long mode is 52 bits, but the DXE IPL clamps that down to 48 bits. We
// can simply assert that here, since 48 bits are good enough for 256 TB.
//
if (mPhysMemAddressWidth <= 36) {
mPhysMemAddressWidth = 36;
}
ASSERT (mPhysMemAddressWidth <= 48);
}
/**
Calculate the cap for the permanent PEI memory.
**/
STATIC
UINT32
GetPeiMemoryCap (
VOID
)
{
BOOLEAN Page1GSupport;
UINT32 RegEax;
UINT32 RegEdx;
UINT32 Pml4Entries;
UINT32 PdpEntries;
UINTN TotalPages;
//
// If DXE is 32-bit, then just return the traditional 64 MB cap.
//
#ifdef MDE_CPU_IA32
if (!FeaturePcdGet (PcdDxeIplSwitchToLongMode)) {
return SIZE_64MB;
}
#endif
//
// Dependent on physical address width, PEI memory allocations can be
// dominated by the page tables built for 64-bit DXE. So we key the cap off
// of those. The code below is based on CreateIdentityMappingPageTables() in
// "MdeModulePkg/Core/DxeIplPeim/X64/VirtualMemory.c".
//
Page1GSupport = FALSE;
if (PcdGetBool (PcdUse1GPageTable)) {
AsmCpuid (0x80000000, &RegEax, NULL, NULL, NULL);
if (RegEax >= 0x80000001) {
AsmCpuid (0x80000001, NULL, NULL, NULL, &RegEdx);
if ((RegEdx & BIT26) != 0) {
Page1GSupport = TRUE;
}
}
}
if (mPhysMemAddressWidth <= 39) {
Pml4Entries = 1;
PdpEntries = 1 << (mPhysMemAddressWidth - 30);
ASSERT (PdpEntries <= 0x200);
} else {
Pml4Entries = 1 << (mPhysMemAddressWidth - 39);
ASSERT (Pml4Entries <= 0x200);
PdpEntries = 512;
}
TotalPages = Page1GSupport ? Pml4Entries + 1 :
(PdpEntries + 1) * Pml4Entries + 1;
ASSERT (TotalPages <= 0x40201);
//
// Add 64 MB for miscellaneous allocations. Note that for
// mPhysMemAddressWidth values close to 36, the cap will actually be
// dominated by this increment.
//
return (UINT32)(EFI_PAGES_TO_SIZE (TotalPages) + SIZE_64MB);
}
/**
Publish PEI core memory
@return EFI_SUCCESS The PEIM initialized successfully.
**/
EFI_STATUS
PublishPeiMemory (
VOID
)
{
EFI_STATUS Status;
EFI_PHYSICAL_ADDRESS MemoryBase;
UINT64 MemorySize;
UINT32 LowerMemorySize;
UINT32 PeiMemoryCap;
LowerMemorySize = GetSystemMemorySizeBelow4gb ();
if (FeaturePcdGet (PcdSmmSmramRequire)) {
//
// TSEG is chipped from the end of low RAM
//
LowerMemorySize -= mQ35TsegMbytes * SIZE_1MB;
}
//
// If S3 is supported, then the S3 permanent PEI memory is placed next,
// downwards. Its size is primarily dictated by CpuMpPei. The formula below
// is an approximation.
//
if (mS3Supported) {
mS3AcpiReservedMemorySize = SIZE_512KB +
mMaxCpuCount *
PcdGet32 (PcdCpuApStackSize);
mS3AcpiReservedMemoryBase = LowerMemorySize - mS3AcpiReservedMemorySize;
LowerMemorySize = mS3AcpiReservedMemoryBase;
}
if (mBootMode == BOOT_ON_S3_RESUME) {
MemoryBase = mS3AcpiReservedMemoryBase;
MemorySize = mS3AcpiReservedMemorySize;
} else {
PeiMemoryCap = GetPeiMemoryCap ();
DEBUG ((EFI_D_INFO, "%a: mPhysMemAddressWidth=%d PeiMemoryCap=%u KB\n",
__FUNCTION__, mPhysMemAddressWidth, PeiMemoryCap >> 10));
//
// Determine the range of memory to use during PEI
//
OvmfPkg: decompress FVs on S3 resume if SMM_REQUIRE is set If OVMF was built with -D SMM_REQUIRE, that implies that the runtime OS is not trusted and we should defend against it tampering with the firmware's data. One such datum is the PEI firmware volume (PEIFV). Normally PEIFV is decompressed on the first boot by SEC, then the OS preserves it across S3 suspend-resume cycles; at S3 resume SEC just reuses the originally decompressed PEIFV. However, if we don't trust the OS, then SEC must decompress PEIFV from the pristine flash every time, lest we execute OS-injected code or work with OS-injected data. Due to how FVMAIN_COMPACT is organized, we can't decompress just PEIFV; the decompression brings DXEFV with itself, plus it uses a temporary output buffer and a scratch buffer too, which even reach above the end of the finally installed DXEFV. For this reason we must keep away a non-malicious OS from DXEFV too, plus the memory up to PcdOvmfDecomprScratchEnd. The delay introduced by the LZMA decompression on S3 resume is negligible. If -D SMM_REQUIRE is not specified, then PcdSmmSmramRequire remains FALSE (from the DEC file), and then this patch has no effect (not counting some changed debug messages). If QEMU doesn't support S3 (or the user disabled it on the QEMU command line), then this patch has no effect also. Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@19037 6f19259b-4bc3-4df7-8a09-765794883524
2015-11-30 19:41:24 +01:00
// Technically we could lay the permanent PEI RAM over SEC's temporary
// decompression and scratch buffer even if "secure S3" is needed, since
// their lifetimes don't overlap. However, PeiFvInitialization() will cover
// RAM up to PcdOvmfDecompressionScratchEnd with an EfiACPIMemoryNVS memory
// allocation HOB, and other allocations served from the permanent PEI RAM
// shouldn't overlap with that HOB.
//
MemoryBase = mS3Supported && FeaturePcdGet (PcdSmmSmramRequire) ?
PcdGet32 (PcdOvmfDecompressionScratchEnd) :
PcdGet32 (PcdOvmfDxeMemFvBase) + PcdGet32 (PcdOvmfDxeMemFvSize);
MemorySize = LowerMemorySize - MemoryBase;
if (MemorySize > PeiMemoryCap) {
MemoryBase = LowerMemorySize - PeiMemoryCap;
MemorySize = PeiMemoryCap;
}
}
//
// Publish this memory to the PEI Core
//
Status = PublishSystemMemory(MemoryBase, MemorySize);
ASSERT_EFI_ERROR (Status);
return Status;
}
/**
Peform Memory Detection for QEMU / KVM
**/
STATIC
VOID
QemuInitializeRam (
VOID
)
{
UINT64 LowerMemorySize;
UINT64 UpperMemorySize;
OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam() At the moment we work with a UC default MTRR type, and set three memory ranges to WB: - [0, 640 KB), - [1 MB, LowerMemorySize), - [4 GB, 4 GB + UpperMemorySize). Unfortunately, coverage for the third range can fail with a high likelihood. If the alignment of the base (ie. 4 GB) and the alignment of the size (UpperMemorySize) differ, then MtrrLib creates a series of variable MTRR entries, with power-of-two sized MTRR masks. And, it's really easy to run out of variable MTRR entries, dependent on the alignment difference. This is a problem because a Linux guest will loudly reject any high memory that is not covered my MTRR. So, let's follow the inverse pattern (loosely inspired by SeaBIOS): - flip the MTRR default type to WB, - set [0, 640 KB) to WB -- fixed MTRRs have precedence over the default type and variable MTRRs, so we can't avoid this, - set [640 KB, 1 MB) to UC -- implemented with fixed MTRRs, - set [LowerMemorySize, 4 GB) to UC -- should succeed with variable MTRRs more likely than the other scheme (due to less chaotic alignment differences). Effects of this patch can be observed by setting DEBUG_CACHE (0x00200000) in PcdDebugPrintErrorLevel. Cc: Maoming <maoming.maoming@huawei.com> Cc: Huangpeng (Peter) <peter.huangpeng@huawei.com> Cc: Wei Liu <wei.liu2@citrix.com> Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Tested-by: Maoming <maoming.maoming@huawei.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@17722 6f19259b-4bc3-4df7-8a09-765794883524
2015-06-26 18:09:52 +02:00
MTRR_SETTINGS MtrrSettings;
EFI_STATUS Status;
DEBUG ((EFI_D_INFO, "%a called\n", __FUNCTION__));
//
// Determine total memory size available
//
LowerMemorySize = GetSystemMemorySizeBelow4gb ();
UpperMemorySize = GetSystemMemorySizeAbove4gb ();
if (mBootMode == BOOT_ON_S3_RESUME) {
//
// Create the following memory HOB as an exception on the S3 boot path.
//
// Normally we'd create memory HOBs only on the normal boot path. However,
// CpuMpPei specifically needs such a low-memory HOB on the S3 path as
// well, for "borrowing" a subset of it temporarily, for the AP startup
// vector.
//
// CpuMpPei saves the original contents of the borrowed area in permanent
// PEI RAM, in a backup buffer allocated with the normal PEI services.
// CpuMpPei restores the original contents ("returns" the borrowed area) at
// End-of-PEI. End-of-PEI in turn is emitted by S3Resume2Pei before
// transferring control to the OS's wakeup vector in the FACS.
//
// We expect any other PEIMs that "borrow" memory similarly to CpuMpPei to
// restore the original contents. Furthermore, we expect all such PEIMs
// (CpuMpPei included) to claim the borrowed areas by producing memory
// allocation HOBs, and to honor preexistent memory allocation HOBs when
// looking for an area to borrow.
//
AddMemoryRangeHob (0, BASE_512KB + BASE_128KB);
} else {
//
// Create memory HOBs
//
AddMemoryRangeHob (0, BASE_512KB + BASE_128KB);
OvmfPkg: PlatformPei: account for TSEG size with PcdSmmSmramRequire set PlatformPei calls GetSystemMemorySizeBelow4gb() in three locations: - PublishPeiMemory(): on normal boot, the permanent PEI RAM is installed so that it ends with the RAM below 4GB, - QemuInitializeRam(): on normal boot, memory resource descriptor HOBs are created for the RAM below 4GB; plus MTRR attributes are set (independently of S3 vs. normal boot) - MemMapInitialization(): an MMIO resource descriptor HOB is created for PCI resource allocation, on normal boot, starting at max(RAM below 4GB, 2GB). The first two of these is adjusted for the configured TSEG size, if PcdSmmSmramRequire is set: - In PublishPeiMemory(), the permanent PEI RAM is kept under TSEG. - In QemuInitializeRam(), we must keep the DXE out of TSEG. One idea would be to simply trim the [1MB .. LowerMemorySize] memory resource descriptor HOB, leaving a hole for TSEG in the memory space map. The SMM IPL will however want to massage the caching attributes of the SMRAM range that it loads the SMM core into, with gDS->SetMemorySpaceAttributes(), and that won't work on a hole. So, instead of trimming this range, split the TSEG area off, and report it as a cacheable reserved memory resource. Finally, since reserved memory can be allocated too, pre-allocate TSEG in InitializeRamRegions(), after QemuInitializeRam() returns. (Note that this step alone does not suffice without the resource descriptor HOB trickery: if we omit that, then the DXE IPL PEIM fails to load and start the DXE core.) - In MemMapInitialization(), the start of the PCI MMIO range is not affected. We choose the largest option (8MB) for the default TSEG size. Michael Kinney pointed out that the SMBASE relocation in PiSmmCpuDxeSmm consumes SMRAM proportionally to the number of CPUs. From the three options available, he reported that 8MB was both necessary and sufficient for the SMBASE relocation to succeed with 255 CPUs: - http://thread.gmane.org/gmane.comp.bios.edk2.devel/3020/focus=3137 - http://thread.gmane.org/gmane.comp.bios.edk2.devel/3020/focus=3177 Cc: Michael Kinney <michael.d.kinney@intel.com> Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Michael Kinney <michael.d.kinney@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@19039 6f19259b-4bc3-4df7-8a09-765794883524
2015-11-30 19:41:33 +01:00
if (FeaturePcdGet (PcdSmmSmramRequire)) {
UINT32 TsegSize;
TsegSize = mQ35TsegMbytes * SIZE_1MB;
OvmfPkg: PlatformPei: account for TSEG size with PcdSmmSmramRequire set PlatformPei calls GetSystemMemorySizeBelow4gb() in three locations: - PublishPeiMemory(): on normal boot, the permanent PEI RAM is installed so that it ends with the RAM below 4GB, - QemuInitializeRam(): on normal boot, memory resource descriptor HOBs are created for the RAM below 4GB; plus MTRR attributes are set (independently of S3 vs. normal boot) - MemMapInitialization(): an MMIO resource descriptor HOB is created for PCI resource allocation, on normal boot, starting at max(RAM below 4GB, 2GB). The first two of these is adjusted for the configured TSEG size, if PcdSmmSmramRequire is set: - In PublishPeiMemory(), the permanent PEI RAM is kept under TSEG. - In QemuInitializeRam(), we must keep the DXE out of TSEG. One idea would be to simply trim the [1MB .. LowerMemorySize] memory resource descriptor HOB, leaving a hole for TSEG in the memory space map. The SMM IPL will however want to massage the caching attributes of the SMRAM range that it loads the SMM core into, with gDS->SetMemorySpaceAttributes(), and that won't work on a hole. So, instead of trimming this range, split the TSEG area off, and report it as a cacheable reserved memory resource. Finally, since reserved memory can be allocated too, pre-allocate TSEG in InitializeRamRegions(), after QemuInitializeRam() returns. (Note that this step alone does not suffice without the resource descriptor HOB trickery: if we omit that, then the DXE IPL PEIM fails to load and start the DXE core.) - In MemMapInitialization(), the start of the PCI MMIO range is not affected. We choose the largest option (8MB) for the default TSEG size. Michael Kinney pointed out that the SMBASE relocation in PiSmmCpuDxeSmm consumes SMRAM proportionally to the number of CPUs. From the three options available, he reported that 8MB was both necessary and sufficient for the SMBASE relocation to succeed with 255 CPUs: - http://thread.gmane.org/gmane.comp.bios.edk2.devel/3020/focus=3137 - http://thread.gmane.org/gmane.comp.bios.edk2.devel/3020/focus=3177 Cc: Michael Kinney <michael.d.kinney@intel.com> Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Michael Kinney <michael.d.kinney@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@19039 6f19259b-4bc3-4df7-8a09-765794883524
2015-11-30 19:41:33 +01:00
AddMemoryRangeHob (BASE_1MB, LowerMemorySize - TsegSize);
AddReservedMemoryBaseSizeHob (LowerMemorySize - TsegSize, TsegSize,
TRUE);
} else {
AddMemoryRangeHob (BASE_1MB, LowerMemorySize);
}
OvmfPkg/PlatformPei: support >=1TB high RAM, and discontiguous high RAM In OVMF we currently get the upper (>=4GB) memory size with the GetSystemMemorySizeAbove4gb() function. The GetSystemMemorySizeAbove4gb() function is used in two places: (1) It is the starting point of the calculations in GetFirstNonAddress(). GetFirstNonAddress() in turn - determines the placement of the 64-bit PCI MMIO aperture, - provides input for the GCD memory space map's sizing (see AddressWidthInitialization(), and the CPU HOB in MiscInitialization()), - influences the permanent PEI RAM cap (the DXE core's page tables, built in permanent PEI RAM, grow as the RAM to map grows). (2) In QemuInitializeRam(), GetSystemMemorySizeAbove4gb() determines the single memory descriptor HOB that we produce for the upper memory. Respectively, there are two problems with GetSystemMemorySizeAbove4gb(): (1) It reads a 24-bit count of 64KB RAM chunks from the CMOS, and therefore cannot return a larger value than one terabyte. (2) It cannot express discontiguous high RAM. Starting with version 1.7.0, QEMU has provided the fw_cfg file called "etc/e820". Refer to the following QEMU commits: - 0624c7f916b4 ("e820: pass high memory too.", 2013-10-10), - 7d67110f2d9a ("pc: add etc/e820 fw_cfg file", 2013-10-18) - 7db16f2480db ("pc: register e820 entries for ram", 2013-10-10) Ever since these commits in v1.7.0 -- with the last QEMU release being v2.9.0, and v2.10.0 under development --, the only two RAM entries added to this E820 map correspond to the below-4GB RAM range, and the above-4GB RAM range. And, the above-4GB range exactly matches the CMOS registers in question; see the use of "pcms->above_4g_mem_size": pc_q35_init() | pc_init1() pc_memory_init() e820_add_entry(0x100000000ULL, pcms->above_4g_mem_size, E820_RAM); pc_cmos_init() val = pcms->above_4g_mem_size / 65536; rtc_set_memory(s, 0x5b, val); rtc_set_memory(s, 0x5c, val >> 8); rtc_set_memory(s, 0x5d, val >> 16); Therefore, remedy the above OVMF limitations as follows: (1) Start off GetFirstNonAddress() by scanning the E820 map for the highest exclusive >=4GB RAM address. Fall back to the CMOS if the E820 map is unavailable. Base all further calculations (such as 64-bit PCI MMIO aperture placement, GCD sizing etc) on this value. At the moment, the only difference this change makes is that we can have more than 1TB above 4GB -- given that the sole "high RAM" entry in the E820 map matches the CMOS exactly, modulo the most significant bits (see above). However, Igor plans to add discontiguous (cold-plugged) high RAM to the fw_cfg E820 RAM map later on, and then this scanning will adapt automatically. (2) In QemuInitializeRam(), describe the high RAM regions from the E820 map one by one with memory HOBs. Fall back to the CMOS only if the E820 map is missing. Again, right now this change only makes a difference if there is at least 1TB high RAM. Later on it will adapt to discontiguous high RAM (regardless of its size) automatically. -*- Implementation details: introduce the ScanOrAdd64BitE820Ram() function, which reads the E820 entries from fw_cfg, and finds the highest exclusive >=4GB RAM address, or produces memory resource descriptor HOBs for RAM entries that start at or above 4GB. The RAM map is not read in a single go, because its size can vary, and in PlatformPei we should stay away from dynamic memory allocation, for the following reasons: - "Pool" allocations are limited to ~64KB, are served from HOBs, and cannot be released ever. - "Page" allocations are seriously limited before PlatformPei installs the permanent PEI RAM. Furthermore, page allocations can only be released in DXE, with dedicated code (so the address would have to be passed on with a HOB or PCD). - Raw memory allocation HOBs would require the same freeing in DXE. Therefore we process each E820 entry as soon as it is read from fw_cfg. -*- Considering the impact of high RAM on the DXE core: A few years ago, installing high RAM as *tested* would cause the DXE core to inhabit such ranges rather than carving out its home from the permanent PEI RAM. Fortunately, this was fixed in the following edk2 commit: 3a05b13106d1, "MdeModulePkg DxeCore: Take the range in resource HOB for PHIT as higher priority", 2015-09-18 which I regression-tested at the time: http://mid.mail-archive.com/55FC27B0.4070807@redhat.com Later on, OVMF was changed to install its high RAM as tested (effectively "arming" the earlier DXE core change for OVMF), in the following edk2 commit: 035ce3b37c90, "OvmfPkg/PlatformPei: Add memory above 4GB as tested", 2016-04-21 which I also regression-tested at the time: http://mid.mail-archive.com/571E8B90.1020102@redhat.com Therefore adding more "tested memory" HOBs is safe. Cc: Jordan Justen <jordan.l.justen@intel.com> Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1468526 Contributed-under: TianoCore Contribution Agreement 1.1 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2017-07-08 01:28:37 +02:00
//
// If QEMU presents an E820 map, then create memory HOBs for the >=4GB RAM
// entries. Otherwise, create a single memory HOB with the flat >=4GB
// memory size read from the CMOS.
//
Status = ScanOrAdd64BitE820Ram (NULL);
if (EFI_ERROR (Status) && UpperMemorySize != 0) {
AddMemoryBaseSizeHob (BASE_4GB, UpperMemorySize);
}
}
OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam() At the moment we work with a UC default MTRR type, and set three memory ranges to WB: - [0, 640 KB), - [1 MB, LowerMemorySize), - [4 GB, 4 GB + UpperMemorySize). Unfortunately, coverage for the third range can fail with a high likelihood. If the alignment of the base (ie. 4 GB) and the alignment of the size (UpperMemorySize) differ, then MtrrLib creates a series of variable MTRR entries, with power-of-two sized MTRR masks. And, it's really easy to run out of variable MTRR entries, dependent on the alignment difference. This is a problem because a Linux guest will loudly reject any high memory that is not covered my MTRR. So, let's follow the inverse pattern (loosely inspired by SeaBIOS): - flip the MTRR default type to WB, - set [0, 640 KB) to WB -- fixed MTRRs have precedence over the default type and variable MTRRs, so we can't avoid this, - set [640 KB, 1 MB) to UC -- implemented with fixed MTRRs, - set [LowerMemorySize, 4 GB) to UC -- should succeed with variable MTRRs more likely than the other scheme (due to less chaotic alignment differences). Effects of this patch can be observed by setting DEBUG_CACHE (0x00200000) in PcdDebugPrintErrorLevel. Cc: Maoming <maoming.maoming@huawei.com> Cc: Huangpeng (Peter) <peter.huangpeng@huawei.com> Cc: Wei Liu <wei.liu2@citrix.com> Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Tested-by: Maoming <maoming.maoming@huawei.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@17722 6f19259b-4bc3-4df7-8a09-765794883524
2015-06-26 18:09:52 +02:00
//
// We'd like to keep the following ranges uncached:
// - [640 KB, 1 MB)
// - [LowerMemorySize, 4 GB)
//
// Everything else should be WB. Unfortunately, programming the inverse (ie.
// keeping the default UC, and configuring the complement set of the above as
// WB) is not reliable in general, because the end of the upper RAM can have
// practically any alignment, and we may not have enough variable MTRRs to
// cover it exactly.
//
if (IsMtrrSupported ()) {
MtrrGetAllMtrrs (&MtrrSettings);
//
// MTRRs disabled, fixed MTRRs disabled, default type is uncached
//
ASSERT ((MtrrSettings.MtrrDefType & BIT11) == 0);
ASSERT ((MtrrSettings.MtrrDefType & BIT10) == 0);
ASSERT ((MtrrSettings.MtrrDefType & 0xFF) == 0);
//
// flip default type to writeback
//
SetMem (&MtrrSettings.Fixed, sizeof MtrrSettings.Fixed, 0x06);
ZeroMem (&MtrrSettings.Variables, sizeof MtrrSettings.Variables);
MtrrSettings.MtrrDefType |= BIT11 | BIT10 | 6;
MtrrSetAllMtrrs (&MtrrSettings);
OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam() At the moment we work with a UC default MTRR type, and set three memory ranges to WB: - [0, 640 KB), - [1 MB, LowerMemorySize), - [4 GB, 4 GB + UpperMemorySize). Unfortunately, coverage for the third range can fail with a high likelihood. If the alignment of the base (ie. 4 GB) and the alignment of the size (UpperMemorySize) differ, then MtrrLib creates a series of variable MTRR entries, with power-of-two sized MTRR masks. And, it's really easy to run out of variable MTRR entries, dependent on the alignment difference. This is a problem because a Linux guest will loudly reject any high memory that is not covered my MTRR. So, let's follow the inverse pattern (loosely inspired by SeaBIOS): - flip the MTRR default type to WB, - set [0, 640 KB) to WB -- fixed MTRRs have precedence over the default type and variable MTRRs, so we can't avoid this, - set [640 KB, 1 MB) to UC -- implemented with fixed MTRRs, - set [LowerMemorySize, 4 GB) to UC -- should succeed with variable MTRRs more likely than the other scheme (due to less chaotic alignment differences). Effects of this patch can be observed by setting DEBUG_CACHE (0x00200000) in PcdDebugPrintErrorLevel. Cc: Maoming <maoming.maoming@huawei.com> Cc: Huangpeng (Peter) <peter.huangpeng@huawei.com> Cc: Wei Liu <wei.liu2@citrix.com> Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Tested-by: Maoming <maoming.maoming@huawei.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@17722 6f19259b-4bc3-4df7-8a09-765794883524
2015-06-26 18:09:52 +02:00
//
// Set memory range from 640KB to 1MB to uncacheable
//
Status = MtrrSetMemoryAttribute (BASE_512KB + BASE_128KB,
BASE_1MB - (BASE_512KB + BASE_128KB), CacheUncacheable);
ASSERT_EFI_ERROR (Status);
OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam() At the moment we work with a UC default MTRR type, and set three memory ranges to WB: - [0, 640 KB), - [1 MB, LowerMemorySize), - [4 GB, 4 GB + UpperMemorySize). Unfortunately, coverage for the third range can fail with a high likelihood. If the alignment of the base (ie. 4 GB) and the alignment of the size (UpperMemorySize) differ, then MtrrLib creates a series of variable MTRR entries, with power-of-two sized MTRR masks. And, it's really easy to run out of variable MTRR entries, dependent on the alignment difference. This is a problem because a Linux guest will loudly reject any high memory that is not covered my MTRR. So, let's follow the inverse pattern (loosely inspired by SeaBIOS): - flip the MTRR default type to WB, - set [0, 640 KB) to WB -- fixed MTRRs have precedence over the default type and variable MTRRs, so we can't avoid this, - set [640 KB, 1 MB) to UC -- implemented with fixed MTRRs, - set [LowerMemorySize, 4 GB) to UC -- should succeed with variable MTRRs more likely than the other scheme (due to less chaotic alignment differences). Effects of this patch can be observed by setting DEBUG_CACHE (0x00200000) in PcdDebugPrintErrorLevel. Cc: Maoming <maoming.maoming@huawei.com> Cc: Huangpeng (Peter) <peter.huangpeng@huawei.com> Cc: Wei Liu <wei.liu2@citrix.com> Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Tested-by: Maoming <maoming.maoming@huawei.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@17722 6f19259b-4bc3-4df7-8a09-765794883524
2015-06-26 18:09:52 +02:00
//
// Set memory range from the "top of lower RAM" (RAM below 4GB) to 4GB as
// uncacheable
OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam() At the moment we work with a UC default MTRR type, and set three memory ranges to WB: - [0, 640 KB), - [1 MB, LowerMemorySize), - [4 GB, 4 GB + UpperMemorySize). Unfortunately, coverage for the third range can fail with a high likelihood. If the alignment of the base (ie. 4 GB) and the alignment of the size (UpperMemorySize) differ, then MtrrLib creates a series of variable MTRR entries, with power-of-two sized MTRR masks. And, it's really easy to run out of variable MTRR entries, dependent on the alignment difference. This is a problem because a Linux guest will loudly reject any high memory that is not covered my MTRR. So, let's follow the inverse pattern (loosely inspired by SeaBIOS): - flip the MTRR default type to WB, - set [0, 640 KB) to WB -- fixed MTRRs have precedence over the default type and variable MTRRs, so we can't avoid this, - set [640 KB, 1 MB) to UC -- implemented with fixed MTRRs, - set [LowerMemorySize, 4 GB) to UC -- should succeed with variable MTRRs more likely than the other scheme (due to less chaotic alignment differences). Effects of this patch can be observed by setting DEBUG_CACHE (0x00200000) in PcdDebugPrintErrorLevel. Cc: Maoming <maoming.maoming@huawei.com> Cc: Huangpeng (Peter) <peter.huangpeng@huawei.com> Cc: Wei Liu <wei.liu2@citrix.com> Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Tested-by: Maoming <maoming.maoming@huawei.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@17722 6f19259b-4bc3-4df7-8a09-765794883524
2015-06-26 18:09:52 +02:00
//
Status = MtrrSetMemoryAttribute (LowerMemorySize,
SIZE_4GB - LowerMemorySize, CacheUncacheable);
OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam() At the moment we work with a UC default MTRR type, and set three memory ranges to WB: - [0, 640 KB), - [1 MB, LowerMemorySize), - [4 GB, 4 GB + UpperMemorySize). Unfortunately, coverage for the third range can fail with a high likelihood. If the alignment of the base (ie. 4 GB) and the alignment of the size (UpperMemorySize) differ, then MtrrLib creates a series of variable MTRR entries, with power-of-two sized MTRR masks. And, it's really easy to run out of variable MTRR entries, dependent on the alignment difference. This is a problem because a Linux guest will loudly reject any high memory that is not covered my MTRR. So, let's follow the inverse pattern (loosely inspired by SeaBIOS): - flip the MTRR default type to WB, - set [0, 640 KB) to WB -- fixed MTRRs have precedence over the default type and variable MTRRs, so we can't avoid this, - set [640 KB, 1 MB) to UC -- implemented with fixed MTRRs, - set [LowerMemorySize, 4 GB) to UC -- should succeed with variable MTRRs more likely than the other scheme (due to less chaotic alignment differences). Effects of this patch can be observed by setting DEBUG_CACHE (0x00200000) in PcdDebugPrintErrorLevel. Cc: Maoming <maoming.maoming@huawei.com> Cc: Huangpeng (Peter) <peter.huangpeng@huawei.com> Cc: Wei Liu <wei.liu2@citrix.com> Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Tested-by: Maoming <maoming.maoming@huawei.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@17722 6f19259b-4bc3-4df7-8a09-765794883524
2015-06-26 18:09:52 +02:00
ASSERT_EFI_ERROR (Status);
}
}
/**
Publish system RAM and reserve memory regions
**/
VOID
InitializeRamRegions (
VOID
)
{
if (!mXen) {
QemuInitializeRam ();
} else {
XenPublishRamRegions ();
}
if (mS3Supported && mBootMode != BOOT_ON_S3_RESUME) {
//
// This is the memory range that will be used for PEI on S3 resume
//
BuildMemoryAllocationHob (
mS3AcpiReservedMemoryBase,
mS3AcpiReservedMemorySize,
EfiACPIMemoryNVS
);
//
// Cover the initial RAM area used as stack and temporary PEI heap.
//
// This is reserved as ACPI NVS so it can be used on S3 resume.
//
BuildMemoryAllocationHob (
PcdGet32 (PcdOvmfSecPeiTempRamBase),
PcdGet32 (PcdOvmfSecPeiTempRamSize),
EfiACPIMemoryNVS
);
OvmfPkg: PlatformPei: protect SEC's GUIDed section handler table thru S3 OVMF's SecMain is unique in the sense that it links against the following two libraries *in combination*: - IntelFrameworkModulePkg/Library/LzmaCustomDecompressLib/ LzmaCustomDecompressLib.inf - MdePkg/Library/BaseExtractGuidedSectionLib/ BaseExtractGuidedSectionLib.inf The ExtractGuidedSectionLib library class allows decompressor modules to register themselves (keyed by GUID) with it, and it allows clients to decompress file sections with a registered decompressor module that matches the section's GUID. BaseExtractGuidedSectionLib is a library instance (of type BASE) for this library class. It has no constructor function. LzmaCustomDecompressLib is a compatible decompressor module (of type BASE). Its section type GUID is gLzmaCustomDecompressGuid == EE4E5898-3914-4259-9D6E-DC7BD79403CF When OVMF's SecMain module starts, the LzmaCustomDecompressLib constructor function is executed, which registers its LZMA decompressor with the above GUID, by calling into BaseExtractGuidedSectionLib: LzmaDecompressLibConstructor() [GuidedSectionExtraction.c] ExtractGuidedSectionRegisterHandlers() [BaseExtractGuidedSectionLib.c] GetExtractGuidedSectionHandlerInfo() PcdGet64 (PcdGuidedExtractHandlerTableAddress) -- NOTE THIS Later, during a normal (non-S3) boot, SecMain utilizes this decompressor to get information about, and to decompress, sections of the OVMF firmware image: SecCoreStartupWithStack() [OvmfPkg/Sec/SecMain.c] SecStartupPhase2() FindAndReportEntryPoints() FindPeiCoreImageBase() DecompressMemFvs() ExtractGuidedSectionGetInfo() [BaseExtractGuidedSectionLib.c] ExtractGuidedSectionDecode() [BaseExtractGuidedSectionLib.c] Notably, only the extraction depends on full-config-boot; the registration of LzmaCustomDecompressLib occurs unconditionally in the SecMain EFI binary, triggered by the library constructor function. This is where the bug happens. BaseExtractGuidedSectionLib maintains the table of GUIDed decompressors (section handlers) at a fixed memory location; selected by PcdGuidedExtractHandlerTableAddress (declared in MdePkg.dec). The default value of this PCD is 0x1000000 (16 MB). This causes SecMain to corrupt guest OS memory during S3, leading to random crashes. Compare the following two memory dumps, the first taken right before suspending, the second taken right after resuming a RHEL-7 guest: crash> rd -8 -p 1000000 0x50 1000000: c0 00 08 00 02 00 00 00 00 00 00 00 00 00 00 00 ................ 1000010: d0 33 0c 00 00 c9 ff ff c0 10 00 01 00 88 ff ff .3.............. 1000020: 0a 6d 57 32 0f 00 00 00 38 00 00 01 00 88 ff ff .mW2....8....... 1000030: 00 00 00 00 00 00 00 00 73 69 67 6e 61 6c 6d 6f ........signalmo 1000040: 64 75 6c 65 2e 73 6f 00 00 00 00 00 00 00 00 00 dule.so......... vs. crash> rd -8 -p 1000000 0x50 1000000: 45 47 53 49 01 00 00 00 20 00 00 01 00 00 00 00 EGSI.... ....... 1000010: 20 01 00 01 00 00 00 00 a0 01 00 01 00 00 00 00 ............... 1000020: 98 58 4e ee 14 39 59 42 9d 6e dc 7b d7 94 03 cf .XN..9YB.n.{.... 1000030: 00 00 00 00 00 00 00 00 73 69 67 6e 61 6c 6d 6f ........signalmo 1000040: 64 75 6c 65 2e 73 6f 00 00 00 00 00 00 00 00 00 dule.so......... The "EGSI" signature corresponds to EXTRACT_HANDLER_INFO_SIGNATURE declared in MdePkg/Library/BaseExtractGuidedSectionLib/BaseExtractGuidedSectionLib.c. Additionally, the gLzmaCustomDecompressGuid (quoted above) is visible at guest-phys offset 0x1000020. Fix the problem as follows: - Carve out 4KB from the 36KB gap that we currently have between PcdOvmfLockBoxStorageBase + PcdOvmfLockBoxStorageSize == 8220 KB and PcdOvmfSecPeiTempRamBase == 8256 KB. - Point PcdGuidedExtractHandlerTableAddress to 8220 KB (0x00807000). - Cover the area with an EfiACPIMemoryNVS type memalloc HOB, if S3 is supported and we're not currently resuming. The 4KB size that we pick is an upper estimate for BaseExtractGuidedSectionLib's internal storage size. The latter is calculated as follows (see GetExtractGuidedSectionHandlerInfo()): sizeof(EXTRACT_GUIDED_SECTION_HANDLER_INFO) + // 32 PcdMaximumGuidedExtractHandler * ( sizeof(GUID) + // 16 sizeof(EXTRACT_GUIDED_SECTION_DECODE_HANDLER) + // 8 sizeof(EXTRACT_GUIDED_SECTION_GET_INFO_HANDLER) // 8 ) OVMF sets PcdMaximumGuidedExtractHandler to 16 decimal (which is the MdePkg default too), yielding 32 + 16 * (16 + 8 + 8) == 544 bytes. Regarding the lifecycle of the new area: (a) when and how it is initialized after first boot of the VM The library linked into SecMain finds that the area lacks the signature. It initializes the signature, plus the rest of the structure. This is independent of S3 support. Consumption of the area is also limited to SEC (but consumption does depend on full-config-boot). (b) how it is protected from memory allocations during DXE It is not, in the general case; and we don't need to. Nothing else links against BaseExtractGuidedSectionLib; it's OK if DXE overwrites the area. (c) how it is protected from the OS When S3 is enabled, we cover it with AcpiNVS in InitializeRamRegions(). When S3 is not supported, the range is not protected. (d) how it is accessed on the S3 resume path Examined by the library linked into SecMain. Registrations update the table in-place (based on GUID matches). (e) how it is accessed on the warm reset path If S3 is enabled, then the OS won't damage the table (due to (c)), hence see (d). If S3 is unsupported, then the OS may or may not overwrite the signature. (It likely will.) This is identical to the pre-patch status. Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@15433 6f19259b-4bc3-4df7-8a09-765794883524
2014-04-05 23:26:09 +02:00
//
// SEC stores its table of GUIDed section handlers here.
//
BuildMemoryAllocationHob (
PcdGet64 (PcdGuidedExtractHandlerTableAddress),
PcdGet32 (PcdGuidedExtractHandlerTableSize),
EfiACPIMemoryNVS
);
#ifdef MDE_CPU_X64
//
// Reserve the initial page tables built by the reset vector code.
//
// Since this memory range will be used by the Reset Vector on S3
// resume, it must be reserved as ACPI NVS.
//
BuildMemoryAllocationHob (
(EFI_PHYSICAL_ADDRESS)(UINTN) PcdGet32 (PcdOvmfSecPageTablesBase),
(UINT64)(UINTN) PcdGet32 (PcdOvmfSecPageTablesSize),
EfiACPIMemoryNVS
);
#endif
}
if (mBootMode != BOOT_ON_S3_RESUME) {
if (!FeaturePcdGet (PcdSmmSmramRequire)) {
//
// Reserve the lock box storage area
//
// Since this memory range will be used on S3 resume, it must be
// reserved as ACPI NVS.
//
// If S3 is unsupported, then various drivers might still write to the
// LockBox area. We ought to prevent DXE from serving allocation requests
// such that they would overlap the LockBox storage.
//
ZeroMem (
(VOID*)(UINTN) PcdGet32 (PcdOvmfLockBoxStorageBase),
(UINTN) PcdGet32 (PcdOvmfLockBoxStorageSize)
);
BuildMemoryAllocationHob (
(EFI_PHYSICAL_ADDRESS)(UINTN) PcdGet32 (PcdOvmfLockBoxStorageBase),
(UINT64)(UINTN) PcdGet32 (PcdOvmfLockBoxStorageSize),
mS3Supported ? EfiACPIMemoryNVS : EfiBootServicesData
);
}
OvmfPkg: PlatformPei: account for TSEG size with PcdSmmSmramRequire set PlatformPei calls GetSystemMemorySizeBelow4gb() in three locations: - PublishPeiMemory(): on normal boot, the permanent PEI RAM is installed so that it ends with the RAM below 4GB, - QemuInitializeRam(): on normal boot, memory resource descriptor HOBs are created for the RAM below 4GB; plus MTRR attributes are set (independently of S3 vs. normal boot) - MemMapInitialization(): an MMIO resource descriptor HOB is created for PCI resource allocation, on normal boot, starting at max(RAM below 4GB, 2GB). The first two of these is adjusted for the configured TSEG size, if PcdSmmSmramRequire is set: - In PublishPeiMemory(), the permanent PEI RAM is kept under TSEG. - In QemuInitializeRam(), we must keep the DXE out of TSEG. One idea would be to simply trim the [1MB .. LowerMemorySize] memory resource descriptor HOB, leaving a hole for TSEG in the memory space map. The SMM IPL will however want to massage the caching attributes of the SMRAM range that it loads the SMM core into, with gDS->SetMemorySpaceAttributes(), and that won't work on a hole. So, instead of trimming this range, split the TSEG area off, and report it as a cacheable reserved memory resource. Finally, since reserved memory can be allocated too, pre-allocate TSEG in InitializeRamRegions(), after QemuInitializeRam() returns. (Note that this step alone does not suffice without the resource descriptor HOB trickery: if we omit that, then the DXE IPL PEIM fails to load and start the DXE core.) - In MemMapInitialization(), the start of the PCI MMIO range is not affected. We choose the largest option (8MB) for the default TSEG size. Michael Kinney pointed out that the SMBASE relocation in PiSmmCpuDxeSmm consumes SMRAM proportionally to the number of CPUs. From the three options available, he reported that 8MB was both necessary and sufficient for the SMBASE relocation to succeed with 255 CPUs: - http://thread.gmane.org/gmane.comp.bios.edk2.devel/3020/focus=3137 - http://thread.gmane.org/gmane.comp.bios.edk2.devel/3020/focus=3177 Cc: Michael Kinney <michael.d.kinney@intel.com> Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Michael Kinney <michael.d.kinney@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@19039 6f19259b-4bc3-4df7-8a09-765794883524
2015-11-30 19:41:33 +01:00
if (FeaturePcdGet (PcdSmmSmramRequire)) {
UINT32 TsegSize;
//
// Make sure the TSEG area that we reported as a reserved memory resource
// cannot be used for reserved memory allocations.
//
TsegSize = mQ35TsegMbytes * SIZE_1MB;
OvmfPkg: PlatformPei: account for TSEG size with PcdSmmSmramRequire set PlatformPei calls GetSystemMemorySizeBelow4gb() in three locations: - PublishPeiMemory(): on normal boot, the permanent PEI RAM is installed so that it ends with the RAM below 4GB, - QemuInitializeRam(): on normal boot, memory resource descriptor HOBs are created for the RAM below 4GB; plus MTRR attributes are set (independently of S3 vs. normal boot) - MemMapInitialization(): an MMIO resource descriptor HOB is created for PCI resource allocation, on normal boot, starting at max(RAM below 4GB, 2GB). The first two of these is adjusted for the configured TSEG size, if PcdSmmSmramRequire is set: - In PublishPeiMemory(), the permanent PEI RAM is kept under TSEG. - In QemuInitializeRam(), we must keep the DXE out of TSEG. One idea would be to simply trim the [1MB .. LowerMemorySize] memory resource descriptor HOB, leaving a hole for TSEG in the memory space map. The SMM IPL will however want to massage the caching attributes of the SMRAM range that it loads the SMM core into, with gDS->SetMemorySpaceAttributes(), and that won't work on a hole. So, instead of trimming this range, split the TSEG area off, and report it as a cacheable reserved memory resource. Finally, since reserved memory can be allocated too, pre-allocate TSEG in InitializeRamRegions(), after QemuInitializeRam() returns. (Note that this step alone does not suffice without the resource descriptor HOB trickery: if we omit that, then the DXE IPL PEIM fails to load and start the DXE core.) - In MemMapInitialization(), the start of the PCI MMIO range is not affected. We choose the largest option (8MB) for the default TSEG size. Michael Kinney pointed out that the SMBASE relocation in PiSmmCpuDxeSmm consumes SMRAM proportionally to the number of CPUs. From the three options available, he reported that 8MB was both necessary and sufficient for the SMBASE relocation to succeed with 255 CPUs: - http://thread.gmane.org/gmane.comp.bios.edk2.devel/3020/focus=3137 - http://thread.gmane.org/gmane.comp.bios.edk2.devel/3020/focus=3177 Cc: Michael Kinney <michael.d.kinney@intel.com> Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Michael Kinney <michael.d.kinney@intel.com> git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@19039 6f19259b-4bc3-4df7-8a09-765794883524
2015-11-30 19:41:33 +01:00
BuildMemoryAllocationHob (
GetSystemMemorySizeBelow4gb() - TsegSize,
TsegSize,
EfiReservedMemoryType
);
}
}
}