[U-Boot] [PATCH 0/8] x86: quark: Convert to driver model

This series converts PCI/USB/ETH to driver model on the Intel Galileo board.
Note this series depends on Simon's patch [1] below, which fixes an issue where PCI enumeration is not automatically triggered when using the PCI config read/write APIs.
Boot time performance degradation is observed with the conversion to use dm pci. Intel Quark SoC has a low-end x86 processor with only 400MHz frequency, and the most time-consuming part is the MRC. Each MRC register programming requires indirect access via the pci bus. With dm pci, accessing pci configuration space has some overhead. Unfortunately this single access overhead accumulates over the whole MRC process, and finally doubles the boot time (25s, up from 12s).
To speed up the boot, create an optimized version of pci config read/write routines without bothering to go through driver model. Now it only takes about 3 seconds to finish MRC, which is really fast (8 times faster than dm pci, or 4 times faster than before).
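For context, these optimized routines come down to the classic x86 Type 1 configuration mechanism on I/O ports 0xcf8/0xcfc. A minimal dword-only sketch, using the PCI_REG_ADDR/PCI_REG_DATA/PCI_CFG_EN constants from <asm/pci.h> as in patch 8 (PCI_CFG_EN is the bit-31 enable flag); this is an illustration, not the exact patch code:

#include <pci.h>
#include <asm/io.h>
#include <asm/pci.h>

/* Dword-only Type 1 config write: one address write, one data write,
 * with no size handling, offset masking or device lookup at all.
 */
static void cfg_write_dword(pci_dev_t dev, int offset, u32 value)
{
	outl(dev | offset | PCI_CFG_EN, PCI_REG_ADDR);
	outl(value, PCI_REG_DATA);
}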
[1]: http://patchwork.ozlabs.org/patch/509736/
Bin Meng (8):
  x86: quark: Make host bridge (b.d.f=0.0.0) visible
  dm: pci: Allow skipping device configuration
  x86: Convert to use driver model pci on quark/galileo
  x86: galileo: Convert to use CONFIG_DM_USB
  net: designware: Fix build warnings
  net: designware: Add support to PCI designware devices
  x86: Convert to use driver model eth on quark/galileo
  x86: quark: Optimize MRC execution time
 arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
 arch/x86/cpu/quark/pci.c      | 46 +--------------------------------
 arch/x86/cpu/quark/quark.c    | 27 +++-----------------
 arch/x86/dts/galileo.dts      |  8 ++++--
 configs/galileo_defconfig     |  5 +++-
 drivers/net/designware.c      | 47 +++++++++++++++++++++++++++++++---
 drivers/pci/pci-uclass.c      |  3 +++
 include/configs/galileo.h     | 12 --------
 8 files changed, 98 insertions(+), 109 deletions(-)

The Quark-specific pci_skip_dev() hides the host bridge, but since all other x86 boards do not hide the host bridge (b.d.f=0.0.0) now, remove this limitation for Quark too.
Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
---

 arch/x86/cpu/quark/pci.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/x86/cpu/quark/pci.c b/arch/x86/cpu/quark/pci.c
index 354e15a..a1685ad 100644
--- a/arch/x86/cpu/quark/pci.c
+++ b/arch/x86/cpu/quark/pci.c
@@ -61,10 +61,8 @@ int pci_skip_dev(struct pci_controller *hose, pci_dev_t dev)
 	 * For now we just skip these two devices, and this needs to
 	 * be revisited later.
 	 */
-	if (dev == QUARK_HOST_BRIDGE ||
-	    dev == QUARK_PCIE0 || dev == QUARK_PCIE1) {
+	if (dev == QUARK_PCIE0 || dev == QUARK_PCIE1)
 		return 1;
-	}
 
 	return 0;
 }

On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
The Quark-specific pci_skip_dev() hides the host bridge, but since all other x86 boards do not hide the host bridge (b.d.f=0.0.0) now, remove this limitation for Quark too.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 arch/x86/cpu/quark/pci.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
Acked-by: Simon Glass sjg@chromium.org

The non-dm pci driver can skip configuration of specified devices. With the dm conversion, this capability is lost. Add it back for dm pci.
Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
---

 drivers/pci/pci-uclass.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/drivers/pci/pci-uclass.c b/drivers/pci/pci-uclass.c
index b25298f..0e64244 100644
--- a/drivers/pci/pci-uclass.c
+++ b/drivers/pci/pci-uclass.c
@@ -562,6 +562,9 @@ int pci_bind_bus_devices(struct udevice *bus)
 		struct udevice *dev;
 		ulong class;
 
+		if (pci_skip_dev(bus->uclass_priv, bdf))
+			continue;
+
 		if (PCI_FUNC(bdf) && !found_multi)
 			continue;
 
 		/* Check only the first access, we don't expect problems */

Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
The non-dm pci driver can skip configuration of specified devices. With the dm conversion, this capability is lost. Add it back for dm pci.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 drivers/pci/pci-uclass.c | 3 +++
 1 file changed, 3 insertions(+)
I really am not keen on these callback functions and would like to avoid them for driver model. Can we check a device tree property instead?
Regards, Simon
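For illustration, a property-based check might look like the sketch below. The property name "u-boot,skip-config" is purely hypothetical here (no such binding exists at this point), while fdtdec_get_bool() is the stock U-Boot helper for boolean device tree properties:

#include <common.h>
#include <fdtdec.h>

/* Return true if the device's node asks to be skipped during
 * enumeration (hypothetical property name, for illustration only).
 */
static bool dev_skip_config(const void *blob, int node)
{
	return fdtdec_get_bool(blob, node, "u-boot,skip-config");
}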

Hi Simon,
On Mon, Aug 31, 2015 at 9:21 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
The non-dm pci driver can skip configuration of specified devices. With the dm conversion, this capability is lost. Add it back for dm pci.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 drivers/pci/pci-uclass.c | 3 +++
 1 file changed, 3 insertions(+)
I really am not keen on these callback functions and would like to avoid them for driver model. Can we check a device tree property instead?
Yes, I think so. Will try in v2.
Regards, Bin

Move to driver model pci for Intel quark/galileo.
Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
---

 arch/x86/cpu/quark/pci.c   | 42 ------------------------------------------
 arch/x86/cpu/quark/quark.c |  8 ++++----
 arch/x86/dts/galileo.dts   |  8 ++++++--
 configs/galileo_defconfig  |  1 +
 include/configs/galileo.h  | 12 ------------
 5 files changed, 11 insertions(+), 60 deletions(-)
diff --git a/arch/x86/cpu/quark/pci.c b/arch/x86/cpu/quark/pci.c
index a1685ad..ffef7f2 100644
--- a/arch/x86/cpu/quark/pci.c
+++ b/arch/x86/cpu/quark/pci.c
@@ -6,50 +6,8 @@
 
 #include <common.h>
 #include <pci.h>
-#include <asm/pci.h>
 #include <asm/arch/device.h>
 
-DECLARE_GLOBAL_DATA_PTR;
-
-void board_pci_setup_hose(struct pci_controller *hose)
-{
-	hose->first_busno = 0;
-	hose->last_busno = 0;
-
-	/* PCI memory space */
-	pci_set_region(hose->regions + 0,
-		       CONFIG_PCI_MEM_BUS,
-		       CONFIG_PCI_MEM_PHYS,
-		       CONFIG_PCI_MEM_SIZE,
-		       PCI_REGION_MEM);
-
-	/* PCI IO space */
-	pci_set_region(hose->regions + 1,
-		       CONFIG_PCI_IO_BUS,
-		       CONFIG_PCI_IO_PHYS,
-		       CONFIG_PCI_IO_SIZE,
-		       PCI_REGION_IO);
-
-	pci_set_region(hose->regions + 2,
-		       CONFIG_PCI_PREF_BUS,
-		       CONFIG_PCI_PREF_PHYS,
-		       CONFIG_PCI_PREF_SIZE,
-		       PCI_REGION_PREFETCH);
-
-	pci_set_region(hose->regions + 3,
-		       0,
-		       0,
-		       gd->ram_size,
-		       PCI_REGION_MEM | PCI_REGION_SYS_MEMORY);
-
-	hose->region_count = 4;
-}
-
-int board_pci_post_scan(struct pci_controller *hose)
-{
-	return 0;
-}
-
 int pci_skip_dev(struct pci_controller *hose, pci_dev_t dev)
 {
 	/*
diff --git a/arch/x86/cpu/quark/quark.c b/arch/x86/cpu/quark/quark.c
index 12ac376..e4a7c50 100644
--- a/arch/x86/cpu/quark/quark.c
+++ b/arch/x86/cpu/quark/quark.c
@@ -84,7 +84,6 @@ static void quark_enable_legacy_seg(void)
 
 int arch_cpu_init(void)
 {
-	struct pci_controller *hose;
 	int ret;
 
 	post_code(POST_CPU_INIT);
@@ -96,10 +95,11 @@ int arch_cpu_init(void)
 	if (ret)
 		return ret;
 
-	ret = pci_early_init_hose(&hose);
-	if (ret)
-		return ret;
+	return 0;
+}
 
+int arch_cpu_init_dm(void)
+{
 	/*
 	 * Quark SoC has some non-standard BARs (excluding PCI standard BARs)
 	 * which need be initialized with suggested values
diff --git a/arch/x86/dts/galileo.dts b/arch/x86/dts/galileo.dts
index d77ff8a..f119bf7 100644
--- a/arch/x86/dts/galileo.dts
+++ b/arch/x86/dts/galileo.dts
@@ -54,8 +54,11 @@
 	pci {
 		#address-cells = <3>;
 		#size-cells = <2>;
-		compatible = "intel,pci";
-		device_type = "pci";
+		compatible = "pci-x86";
+		u-boot,dm-pre-reloc;
+		ranges = <0x02000000 0x0 0x90000000 0x90000000 0 0x20000000
+			  0x42000000 0x0 0xb0000000 0xb0000000 0 0x20000000
+			  0x01000000 0x0 0x2000 0x2000 0 0xe000>;
 
 		pciuart0: uart@14,5 {
 			compatible = "pci8086,0936.00",
@@ -63,6 +66,7 @@
 				     "pciclass,070002",
 				     "pciclass,0700",
 				     "x86-uart";
+			u-boot,dm-pre-reloc;
 			reg = <0x0000a500 0x0 0x0 0x0 0x0
 			       0x0200a510 0x0 0x0 0x0 0x0>;
 			reg-shift = <2>;
diff --git a/configs/galileo_defconfig b/configs/galileo_defconfig
index 6ef1090..358e87b 100644
--- a/configs/galileo_defconfig
+++ b/configs/galileo_defconfig
@@ -11,6 +11,7 @@ CONFIG_BOOTSTAGE=y
 CONFIG_BOOTSTAGE_REPORT=y
 CONFIG_CMD_BOOTSTAGE=y
 CONFIG_OF_CONTROL=y
+CONFIG_DM_PCI=y
 CONFIG_SPI_FLASH=y
 CONFIG_NETDEVICES=y
 CONFIG_ETH_DESIGNWARE=y
diff --git a/include/configs/galileo.h b/include/configs/galileo.h
index 3c3c6e9..b7ec279 100644
--- a/include/configs/galileo.h
+++ b/include/configs/galileo.h
@@ -20,18 +20,6 @@
 /* ns16550 UART is memory-mapped in Quark SoC */
 #undef CONFIG_SYS_NS16550_PORT_MAPPED
 
-#define CONFIG_PCI_MEM_BUS	0x90000000
-#define CONFIG_PCI_MEM_PHYS	CONFIG_PCI_MEM_BUS
-#define CONFIG_PCI_MEM_SIZE	0x20000000
-
-#define CONFIG_PCI_PREF_BUS	0xb0000000
-#define CONFIG_PCI_PREF_PHYS	CONFIG_PCI_PREF_BUS
-#define CONFIG_PCI_PREF_SIZE	0x20000000
-
-#define CONFIG_PCI_IO_BUS	0x2000
-#define CONFIG_PCI_IO_PHYS	CONFIG_PCI_IO_BUS
-#define CONFIG_PCI_IO_SIZE	0xe000
-
 #define CONFIG_SYS_EARLY_PCI_INIT
 #define CONFIG_PCI_PNP
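As an aside, the first cell of each new "ranges" entry above encodes the PCI address-space type defined by the standard PCI device tree binding: 0x01000000 is I/O space, 0x02000000 is 32-bit memory, and 0x42000000 is 32-bit memory with the prefetchable bit (0x40000000) set. A small decoding sketch (the helper name is invented for illustration):

static const char *pci_range_type(u32 first_cell)
{
	/* Bits 24-25 select the space type; bit 30 marks prefetchable */
	switch (first_cell & 0x03000000) {
	case 0x01000000:
		return "I/O";
	case 0x02000000:
		return (first_cell & 0x40000000) ? "prefetchable memory"
						 : "memory";
	default:
		return "config/other";
	}
}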

On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
Move to driver model pci for Intel quark/galileo.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 arch/x86/cpu/quark/pci.c   | 42 ------------------------------------------
 arch/x86/cpu/quark/quark.c |  8 ++++----
 arch/x86/dts/galileo.dts   |  8 ++++++--
 configs/galileo_defconfig  |  1 +
 include/configs/galileo.h  | 12 ------------
 5 files changed, 11 insertions(+), 60 deletions(-)
Acked-by: Simon Glass sjg@chromium.org

On 31 August 2015 at 18:32, Simon Glass sjg@chromium.org wrote:
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
Move to driver model pci for Intel quark/galileo.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 arch/x86/cpu/quark/pci.c   | 42 ------------------------------------------
 arch/x86/cpu/quark/quark.c |  8 ++++----
 arch/x86/dts/galileo.dts   |  8 ++++++--
 configs/galileo_defconfig  |  1 +
 include/configs/galileo.h  | 12 ------------
 5 files changed, 11 insertions(+), 60 deletions(-)
Acked-by: Simon Glass sjg@chromium.org
Applied to u-boot-x86, thanks!

Move to driver model for USB on Intel Galileo.
Note: USB is not working on Galileo, the same as before. This just prepares for the later CONFIG_DM_ETH conversion, as without CONFIG_DM_USB the build won't pass.
Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
---

 configs/galileo_defconfig | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/configs/galileo_defconfig b/configs/galileo_defconfig
index 358e87b..55a002f 100644
--- a/configs/galileo_defconfig
+++ b/configs/galileo_defconfig
@@ -15,6 +15,8 @@ CONFIG_DM_PCI=y
 CONFIG_SPI_FLASH=y
 CONFIG_NETDEVICES=y
 CONFIG_ETH_DESIGNWARE=y
+CONFIG_USB=y
+CONFIG_DM_USB=y
 CONFIG_DM_RTC=y
 CONFIG_USE_PRIVATE_LIBGCC=y
 CONFIG_SYS_VSNPRINTF=y

On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
Move to driver model for USB on Intel Galileo.
Note: USB is not working on Galileo, the same as before. This just prepares for the later CONFIG_DM_ETH conversion, as without CONFIG_DM_USB the build won't pass.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 configs/galileo_defconfig | 2 ++
 1 file changed, 2 insertions(+)
Acked-by: Simon Glass sjg@chromium.org

When building the dm version of the designware eth driver on a platform with a 64-bit phys_addr_t, the compiler reports the following warnings:
drivers/net/designware.c: In function 'designware_eth_probe':
drivers/net/designware.c:599:2: warning: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'phys_addr_t' [-Wformat]
drivers/net/designware.c:600:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
drivers/net/designware.c:601:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
This commit fixes the build warnings.
Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
---

 drivers/net/designware.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/designware.c b/drivers/net/designware.c
index d9cb507..ae78d21 100644
--- a/drivers/net/designware.c
+++ b/drivers/net/designware.c
@@ -562,12 +562,12 @@ static int designware_eth_probe(struct udevice *dev)
 {
 	struct eth_pdata *pdata = dev_get_platdata(dev);
 	struct dw_eth_dev *priv = dev_get_priv(dev);
+	u32 iobase = pdata->iobase;
 	int ret;
 
-	debug("%s, iobase=%lx, priv=%p\n", __func__, pdata->iobase, priv);
-	priv->mac_regs_p = (struct eth_mac_regs *)pdata->iobase;
-	priv->dma_regs_p = (struct eth_dma_regs *)(pdata->iobase +
-			DW_DMA_BASE_OFFSET);
+	debug("%s, iobase=%x, priv=%p\n", __func__, iobase, priv);
+	priv->mac_regs_p = (struct eth_mac_regs *)iobase;
+	priv->dma_regs_p = (struct eth_dma_regs *)(iobase + DW_DMA_BASE_OFFSET);
 	priv->interface = pdata->phy_interface;
 
 	dw_mdio_init(dev->name, priv->mac_regs_p);

On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
When building the dm version of the designware eth driver on a platform with a 64-bit phys_addr_t, the compiler reports the following warnings:
drivers/net/designware.c: In function 'designware_eth_probe':
drivers/net/designware.c:599:2: warning: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'phys_addr_t' [-Wformat]
drivers/net/designware.c:600:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
drivers/net/designware.c:601:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
This commit fixes the build warnings.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 drivers/net/designware.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
Reviewed-by: Simon Glass sjg@chromium.org

The Designware ethernet controller is also seen on the PCI bus, e.g. on the Intel Quark SoC. Add this support to the DM version of the driver.
Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
---

 drivers/net/designware.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
diff --git a/drivers/net/designware.c b/drivers/net/designware.c
index ae78d21..06d6f6a 100644
--- a/drivers/net/designware.c
+++ b/drivers/net/designware.c
@@ -14,6 +14,7 @@
 #include <errno.h>
 #include <miiphy.h>
 #include <malloc.h>
+#include <pci.h>
 #include <linux/compiler.h>
 #include <linux/err.h>
 #include <asm/io.h>
@@ -558,6 +559,20 @@ static int designware_eth_write_hwaddr(struct udevice *dev)
 	return _dw_write_hwaddr(priv, pdata->enetaddr);
 }
 
+static int designware_eth_bind(struct udevice *dev)
+{
+	static int num_cards;
+	char name[20];
+
+	/* Create a unique device name for PCI type devices */
+	if (device_get_uclass_id(dev->parent) == UCLASS_PCI) {
+		sprintf(name, "eth_designware#%u", num_cards++);
+		device_set_name(dev, name);
+	}
+
+	return 0;
+}
+
 static int designware_eth_probe(struct udevice *dev)
 {
 	struct eth_pdata *pdata = dev_get_platdata(dev);
@@ -565,6 +580,22 @@ static int designware_eth_probe(struct udevice *dev)
 	u32 iobase = pdata->iobase;
 	int ret;
 
+	/*
+	 * If we are on PCI bus, either directly attached to a PCI root port,
+	 * or via a PCI bridge, fill in platdata before we probe the hardware.
+	 */
+	if (device_get_uclass_id(dev->parent) == UCLASS_PCI) {
+		pci_dev_t bdf;
+
+		bdf = pci_get_bdf(dev);
+		pci_read_config_dword(bdf, PCI_BASE_ADDRESS_0, &iobase);
+		iobase &= PCI_BASE_ADDRESS_MEM_MASK;
+		iobase = pci_mem_to_phys(bdf, iobase);
+
+		pdata->iobase = iobase;
+		pdata->phy_interface = PHY_INTERFACE_MODE_RMII;
+	}
+
 	debug("%s, iobase=%x, priv=%p\n", __func__, iobase, priv);
 	priv->mac_regs_p = (struct eth_mac_regs *)iobase;
 	priv->dma_regs_p = (struct eth_dma_regs *)(iobase + DW_DMA_BASE_OFFSET);
@@ -617,10 +648,18 @@ U_BOOT_DRIVER(eth_designware) = {
 	.id	= UCLASS_ETH,
 	.of_match = designware_eth_ids,
 	.ofdata_to_platdata = designware_eth_ofdata_to_platdata,
+	.bind	= designware_eth_bind,
 	.probe	= designware_eth_probe,
 	.ops	= &designware_eth_ops,
 	.priv_auto_alloc_size = sizeof(struct dw_eth_dev),
 	.platdata_auto_alloc_size = sizeof(struct eth_pdata),
 	.flags = DM_FLAG_ALLOC_PRIV_DMA,
 };
+
+static struct pci_device_id supported[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_QRK_EMAC) },
+	{ }
+};
+
+U_BOOT_PCI_DEVICE(eth_designware, supported);
 #endif

Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
The Designware ethernet controller is also seen on the PCI bus, e.g. on the Intel Quark SoC. Add this support to the DM version of the driver.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 drivers/net/designware.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
diff --git a/drivers/net/designware.c b/drivers/net/designware.c
index ae78d21..06d6f6a 100644
--- a/drivers/net/designware.c
+++ b/drivers/net/designware.c
@@ -14,6 +14,7 @@
 #include <errno.h>
 #include <miiphy.h>
 #include <malloc.h>
+#include <pci.h>
 #include <linux/compiler.h>
 #include <linux/err.h>
 #include <asm/io.h>
@@ -558,6 +559,20 @@ static int designware_eth_write_hwaddr(struct udevice *dev)
 	return _dw_write_hwaddr(priv, pdata->enetaddr);
 }
+static int designware_eth_bind(struct udevice *dev)
+{
+	static int num_cards;
+	char name[20];
+
+	/* Create a unique device name for PCI type devices */
+	if (device_get_uclass_id(dev->parent) == UCLASS_PCI) {
+		sprintf(name, "eth_designware#%u", num_cards++);
+		device_set_name(dev, name);

Hmm this is the second time it has come up. This is OK, but I wonder if we should add a generic function that names devices based on the number seen for each driver? Could be tricky though. Maybe when we get to 3?

+	}
+
+	return 0;
+}
 static int designware_eth_probe(struct udevice *dev)
 {
 	struct eth_pdata *pdata = dev_get_platdata(dev);
@@ -565,6 +580,22 @@ static int designware_eth_probe(struct udevice *dev)
 	u32 iobase = pdata->iobase;
 	int ret;
+	/*
+	 * If we are on PCI bus, either directly attached to a PCI root port,
+	 * or via a PCI bridge, fill in platdata before we probe the hardware.
+	 */
+	if (device_get_uclass_id(dev->parent) == UCLASS_PCI) {

To make it easier to adjust the logic here, can you please create something like device_is_on_pci_bus() to hold this logic? I still think we might change to using UCLASS_BRIDGE for bridges.

+		pci_dev_t bdf;
+
+		bdf = pci_get_bdf(dev);
+		pci_read_config_dword(bdf, PCI_BASE_ADDRESS_0, &iobase);
+		iobase &= PCI_BASE_ADDRESS_MEM_MASK;
+		iobase = pci_mem_to_phys(bdf, iobase);
+
+		pdata->iobase = iobase;
+		pdata->phy_interface = PHY_INTERFACE_MODE_RMII;
+	}
debug("%s, iobase=%x, priv=%p\n", __func__, iobase, priv); priv->mac_regs_p = (struct eth_mac_regs *)iobase; priv->dma_regs_p = (struct eth_dma_regs *)(iobase + DW_DMA_BASE_OFFSET);
@@ -617,10 +648,18 @@ U_BOOT_DRIVER(eth_designware) = {
 	.id	= UCLASS_ETH,
 	.of_match = designware_eth_ids,
 	.ofdata_to_platdata = designware_eth_ofdata_to_platdata,
+	.bind	= designware_eth_bind,
 	.probe	= designware_eth_probe,
 	.ops	= &designware_eth_ops,
 	.priv_auto_alloc_size = sizeof(struct dw_eth_dev),
 	.platdata_auto_alloc_size = sizeof(struct eth_pdata),
 	.flags = DM_FLAG_ALLOC_PRIV_DMA,
};
+
+static struct pci_device_id supported[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_QRK_EMAC) },
+	{ }
+};
+
+U_BOOT_PCI_DEVICE(eth_designware, supported);
#endif
1.8.2.1

Hi Simon,
On Tue, Sep 1, 2015 at 8:33 AM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
The Designware ethernet controller is also seen on the PCI bus, e.g. on the Intel Quark SoC. Add this support to the DM version of the driver.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 drivers/net/designware.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
diff --git a/drivers/net/designware.c b/drivers/net/designware.c
index ae78d21..06d6f6a 100644
--- a/drivers/net/designware.c
+++ b/drivers/net/designware.c
@@ -14,6 +14,7 @@
 #include <errno.h>
 #include <miiphy.h>
 #include <malloc.h>
+#include <pci.h>
 #include <linux/compiler.h>
 #include <linux/err.h>
 #include <asm/io.h>
@@ -558,6 +559,20 @@ static int designware_eth_write_hwaddr(struct udevice *dev)
 	return _dw_write_hwaddr(priv, pdata->enetaddr);
 }
+static int designware_eth_bind(struct udevice *dev)
+{
+	static int num_cards;
+	char name[20];
+
+	/* Create a unique device name for PCI type devices */
+	if (device_get_uclass_id(dev->parent) == UCLASS_PCI) {
+		sprintf(name, "eth_designware#%u", num_cards++);
+		device_set_name(dev, name);

Hmm this is the second time it has come up. This is OK, but I wonder if we should add a generic function that names devices based on the number seen for each driver?

Maybe not based on the number seen for each driver, as some drivers can support both PCI and non-PCI devices. What we need is just for PCI devices. For non-PCI devices, the device tree already provides the name, although it might not be unique (depending on how the device tree is written).

Could be tricky though. Maybe when we get to 3?

Yes, it is tricky. It is a per-driver setting, but depends on where the device resides.

+	}
+
+	return 0;
+}
 static int designware_eth_probe(struct udevice *dev)
 {
 	struct eth_pdata *pdata = dev_get_platdata(dev);
@@ -565,6 +580,22 @@ static int designware_eth_probe(struct udevice *dev)
 	u32 iobase = pdata->iobase;
 	int ret;
+	/*
+	 * If we are on PCI bus, either directly attached to a PCI root port,
+	 * or via a PCI bridge, fill in platdata before we probe the hardware.
+	 */
+	if (device_get_uclass_id(dev->parent) == UCLASS_PCI) {

To make it easier to adjust the logic here, can you please create something like device_is_on_pci_bus() to hold this logic? I still think we might change to using UCLASS_BRIDGE for bridges.

Yes, I can add device_is_on_pci_bus(). Do you think we should put it into pci-uclass.c, or keep it local in this designware driver?

+		pci_dev_t bdf;
+
+		bdf = pci_get_bdf(dev);
+		pci_read_config_dword(bdf, PCI_BASE_ADDRESS_0, &iobase);
+		iobase &= PCI_BASE_ADDRESS_MEM_MASK;
+		iobase = pci_mem_to_phys(bdf, iobase);
+
+		pdata->iobase = iobase;
+		pdata->phy_interface = PHY_INTERFACE_MODE_RMII;
+	}
debug("%s, iobase=%x, priv=%p\n", __func__, iobase, priv); priv->mac_regs_p = (struct eth_mac_regs *)iobase; priv->dma_regs_p = (struct eth_dma_regs *)(iobase + DW_DMA_BASE_OFFSET);
@@ -617,10 +648,18 @@ U_BOOT_DRIVER(eth_designware) = {
 	.id	= UCLASS_ETH,
 	.of_match = designware_eth_ids,
 	.ofdata_to_platdata = designware_eth_ofdata_to_platdata,
+	.bind	= designware_eth_bind,
 	.probe	= designware_eth_probe,
 	.ops	= &designware_eth_ops,
 	.priv_auto_alloc_size = sizeof(struct dw_eth_dev),
 	.platdata_auto_alloc_size = sizeof(struct eth_pdata),
 	.flags = DM_FLAG_ALLOC_PRIV_DMA,
};
+
+static struct pci_device_id supported[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_QRK_EMAC) },
+	{ }
+};
+
+U_BOOT_PCI_DEVICE(eth_designware, supported);
#endif
Regards, Bin

Hi Bin,
On 31 August 2015 at 19:19, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Tue, Sep 1, 2015 at 8:33 AM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
The Designware ethernet controller is also seen on the PCI bus, e.g. on the Intel Quark SoC. Add this support to the DM version of the driver.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 drivers/net/designware.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
diff --git a/drivers/net/designware.c b/drivers/net/designware.c
index ae78d21..06d6f6a 100644
--- a/drivers/net/designware.c
+++ b/drivers/net/designware.c
@@ -14,6 +14,7 @@
 #include <errno.h>
 #include <miiphy.h>
 #include <malloc.h>
+#include <pci.h>
 #include <linux/compiler.h>
 #include <linux/err.h>
 #include <asm/io.h>
@@ -558,6 +559,20 @@ static int designware_eth_write_hwaddr(struct udevice *dev)
 	return _dw_write_hwaddr(priv, pdata->enetaddr);
 }
+static int designware_eth_bind(struct udevice *dev)
+{
+	static int num_cards;
+	char name[20];
+
+	/* Create a unique device name for PCI type devices */
+	if (device_get_uclass_id(dev->parent) == UCLASS_PCI) {
+		sprintf(name, "eth_designware#%u", num_cards++);
+		device_set_name(dev, name);

Hmm this is the second time it has come up. This is OK, but I wonder if we should add a generic function that names devices based on the number seen for each driver?

Maybe not based on the number seen for each driver, as some drivers can support both PCI and non-PCI devices. What we need is just for PCI devices. For non-PCI devices, the device tree already provides the name, although it might not be unique (depending on how the device tree is written).

Could be tricky though. Maybe when we get to 3?

Yes, it is tricky. It is a per-driver setting, but depends on where the device resides.

OK, well what you have is OK for now.

+	}
+
+	return 0;
+}
 static int designware_eth_probe(struct udevice *dev)
 {
 	struct eth_pdata *pdata = dev_get_platdata(dev);
@@ -565,6 +580,22 @@ static int designware_eth_probe(struct udevice *dev)
 	u32 iobase = pdata->iobase;
 	int ret;
+	/*
+	 * If we are on PCI bus, either directly attached to a PCI root port,
+	 * or via a PCI bridge, fill in platdata before we probe the hardware.
+	 */
+	if (device_get_uclass_id(dev->parent) == UCLASS_PCI) {

To make it easier to adjust the logic here, can you please create something like device_is_on_pci_bus() to hold this logic? I still think we might change to using UCLASS_BRIDGE for bridges.

Yes, I can add device_is_on_pci_bus(). Do you think we should put it into pci-uclass.c, or keep it local in this designware driver?

If it is static inline then pci.h would be OK. I'm not sure it would work if you put it in pci-uclass.c, since some drivers might be built without PCI support enabled.
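For reference, the helper being discussed can indeed be a one-line static inline; a sketch of what it might look like in a header (names as floated in this thread, not yet in the tree at this point):

static inline bool device_is_on_pci_bus(struct udevice *dev)
{
	/* True when the immediate parent is a PCI bus device */
	return device_get_uclass_id(dev->parent) == UCLASS_PCI;
}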
+		pci_dev_t bdf;
+
+		bdf = pci_get_bdf(dev);
+		pci_read_config_dword(bdf, PCI_BASE_ADDRESS_0, &iobase);
+		iobase &= PCI_BASE_ADDRESS_MEM_MASK;
+		iobase = pci_mem_to_phys(bdf, iobase);
+
+		pdata->iobase = iobase;
+		pdata->phy_interface = PHY_INTERFACE_MODE_RMII;
+	}
debug("%s, iobase=%x, priv=%p\n", __func__, iobase, priv); priv->mac_regs_p = (struct eth_mac_regs *)iobase; priv->dma_regs_p = (struct eth_dma_regs *)(iobase + DW_DMA_BASE_OFFSET);
@@ -617,10 +648,18 @@ U_BOOT_DRIVER(eth_designware) = {
 	.id	= UCLASS_ETH,
 	.of_match = designware_eth_ids,
 	.ofdata_to_platdata = designware_eth_ofdata_to_platdata,
+	.bind	= designware_eth_bind,
 	.probe	= designware_eth_probe,
 	.ops	= &designware_eth_ops,
 	.priv_auto_alloc_size = sizeof(struct dw_eth_dev),
 	.platdata_auto_alloc_size = sizeof(struct eth_pdata),
 	.flags = DM_FLAG_ALLOC_PRIV_DMA,
};
+
+static struct pci_device_id supported[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_QRK_EMAC) },
+	{ }
+};
+
+U_BOOT_PCI_DEVICE(eth_designware, supported);
#endif
Regards, Bin
Regards, Simon

Convert to use DM version of Designware ethernet driver on Intel quark/galileo.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
---
 arch/x86/cpu/quark/quark.c | 19 -------------------
 configs/galileo_defconfig  |  2 +-
 2 files changed, 1 insertion(+), 20 deletions(-)
diff --git a/arch/x86/cpu/quark/quark.c b/arch/x86/cpu/quark/quark.c
index e4a7c50..773e529 100644
--- a/arch/x86/cpu/quark/quark.c
+++ b/arch/x86/cpu/quark/quark.c
@@ -6,8 +6,6 @@
 
 #include <common.h>
 #include <mmc.h>
-#include <netdev.h>
-#include <phy.h>
 #include <asm/io.h>
 #include <asm/irq.h>
 #include <asm/pci.h>
@@ -132,23 +130,6 @@ int cpu_mmc_init(bd_t *bis)
 			    ARRAY_SIZE(mmc_supported));
 }
 
-int cpu_eth_init(bd_t *bis)
-{
-	u32 base;
-	int ret0, ret1;
-
-	pci_read_config_dword(QUARK_EMAC0, PCI_BASE_ADDRESS_0, &base);
-	ret0 = designware_initialize(base, PHY_INTERFACE_MODE_RMII);
-
-	pci_read_config_dword(QUARK_EMAC1, PCI_BASE_ADDRESS_0, &base);
-	ret1 = designware_initialize(base, PHY_INTERFACE_MODE_RMII);
-
-	if (ret0 < 0 && ret1 < 0)
-		return -1;
-	else
-		return 0;
-}
-
 void cpu_irq_init(void)
 {
 	struct quark_rcba *rcba;
diff --git a/configs/galileo_defconfig b/configs/galileo_defconfig
index 55a002f..481b964 100644
--- a/configs/galileo_defconfig
+++ b/configs/galileo_defconfig
@@ -13,7 +13,7 @@ CONFIG_CMD_BOOTSTAGE=y
 CONFIG_OF_CONTROL=y
 CONFIG_DM_PCI=y
 CONFIG_SPI_FLASH=y
-CONFIG_NETDEVICES=y
+CONFIG_DM_ETH=y
 CONFIG_ETH_DESIGNWARE=y
 CONFIG_USB=y
 CONFIG_DM_USB=y

On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
Convert to use DM version of Designware ethernet driver on Intel quark/galileo.
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 arch/x86/cpu/quark/quark.c | 19 -------------------
 configs/galileo_defconfig  |  2 +-
 2 files changed, 1 insertion(+), 20 deletions(-)
Acked-by: Simon Glass sjg@chromium.org

Boot time performance degradation is observed with the conversion to use dm pci. Intel Quark SoC has a low-end x86 processor with only 400MHz frequency, and the most time-consuming part is the MRC. Each MRC register programming requires indirect access via the pci bus. With dm pci, accessing pci configuration space has some overhead. Unfortunately this single access overhead accumulates over the whole MRC process, and finally doubles the boot time (25 seconds, up from 12 seconds).
To speed up the boot, create an optimized version of pci config read/write routines without bothering to go through driver model. Now it only takes about 3 seconds to finish MRC, which is really fast (8 times faster than dm pci, or 4 times faster than before).
Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
---

 arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 22 deletions(-)
diff --git a/arch/x86/cpu/quark/msg_port.c b/arch/x86/cpu/quark/msg_port.c
index 31713e3..a75ef23 100644
--- a/arch/x86/cpu/quark/msg_port.c
+++ b/arch/x86/cpu/quark/msg_port.c
@@ -5,34 +5,49 @@
  */
 
 #include <common.h>
-#include <pci.h>
 #include <asm/arch/device.h>
 #include <asm/arch/msg_port.h>
+#include <asm/io.h>
+#include <asm/pci.h>
+
+/* Optimized pci config write dword routine */
+static void qrk_pci_write_config_dword(pci_dev_t dev, int offset, u32 value)
+{
+	outl(dev | offset | PCI_CFG_EN, PCI_REG_ADDR);
+	outl(value, PCI_REG_DATA);
+}
+
+/* Optimized pci config read dword routine */
+static void qrk_pci_read_config_dword(pci_dev_t dev, int offset, u32 *valuep)
+{
+	outl(dev | offset | PCI_CFG_EN, PCI_REG_ADDR);
+	*valuep = inl(PCI_REG_DATA);
+}
 
 void msg_port_setup(int op, int port, int reg)
 {
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_REG,
-			       (((op) << 24) | ((port) << 16) |
-			       (((reg) << 8) & 0xff00) | MSG_BYTE_ENABLE));
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_REG,
+				   (((op) << 24) | ((port) << 16) |
+				   (((reg) << 8) & 0xff00) | MSG_BYTE_ENABLE));
 }
 
 u32 msg_port_read(u8 port, u32 reg)
 {
 	u32 value;
 
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_READ, port, reg);
-	pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
+	qrk_pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
 
 	return value;
 }
 
 void msg_port_write(u8 port, u32 reg, u32 value)
 {
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_WRITE, port, reg);
 }
 
@@ -40,19 +55,19 @@ u32 msg_port_alt_read(u8 port, u32 reg)
 {
 	u32 value;
 
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_ALT_READ, port, reg);
-	pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
+	qrk_pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
 
 	return value;
 }
 
 void msg_port_alt_write(u8 port, u32 reg, u32 value)
 {
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_ALT_WRITE, port, reg);
 }
 
@@ -60,18 +75,18 @@ u32 msg_port_io_read(u8 port, u32 reg)
 {
 	u32 value;
 
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_IO_READ, port, reg);
-	pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
+	qrk_pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
 
 	return value;
 }
 
 void msg_port_io_write(u8 port, u32 reg, u32 value)
 {
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_IO_WRITE, port, reg);
 }
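For reference, a typical MRC access composes these helpers as below; the port and register numbers are made up for illustration (the real values come from the Quark MRC code):

/* Read register 0x08 of message port 0x05. Under the hood this is
 * three Type 1 config cycles on b.d.f 0.0.0 (QUARK_HOST_BRIDGE):
 * write MSG_CTRL_EXT_REG, write MSG_CTRL_REG via msg_port_setup(),
 * then read MSG_DATA_REG.
 */
u32 val = msg_port_read(0x05, 0x08);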

Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
Boot time performance degradation is observed with the conversion to use dm pci. Intel Quark SoC has a low-end x86 processor with only 400MHz frequency, and the most time-consuming part is the MRC. Each MRC register programming requires indirect access via the pci bus. With dm pci, accessing pci configuration space has some overhead. Unfortunately this single access overhead accumulates over the whole MRC process, and finally doubles the boot time (25 seconds, up from 12 seconds).
To speed up the boot, create an optimized version of pci config read/write routines without bothering to go through driver model. Now it only takes about 3 seconds to finish MRC, which is really fast (8 times faster than dm pci, or 4 times faster than before).
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 22 deletions(-)

Before I delve into the patch - with driver model we are using the I/O method - see pci_x86_read_config(). Is that the source of the slowdown, or is it just general driver model overhead?
If the former then perhaps we should change this. If the latter then we have work to do...
Regards, Simon

Hi Simon,
On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
Boot time performance degradation is observed with the conversion to use dm pci. Intel Quark SoC has a low-end x86 processor with only 400MHz frequency, and the most time-consuming part is the MRC. Each MRC register programming requires indirect access via the pci bus. With dm pci, accessing pci configuration space has some overhead. Unfortunately this single access overhead accumulates over the whole MRC process, and finally doubles the boot time (25 seconds, up from 12 seconds).
To speed up the boot, create an optimized version of pci config read/write routines without bothering to go through driver model. Now it only takes about 3 seconds to finish MRC, which is really fast (8 times faster than dm pci, or 4 times faster than before).
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 22 deletions(-)

Before I delve into the patch - with driver model we are using the I/O method - see pci_x86_read_config(). Is that the source of the slowdown, or is it just general driver model overhead?

The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program the DDR controller. Inside msg_port.c, pci_write_config_dword() and pci_read_config_dword() are called.
With driver model, the overhead is:
pci_write_config_dword() -> pci_write_config32() -> pci_write_config() -> uclass_get_device_by_seq(), then pci_bus_write_config(), which finally calls pci_x86_write_config().

Without driver model, there is still some overhead (so previously the MRC time was about 12 seconds):
pci_write_config_dword() -> pci_hose_write_config_dword() -> TYPE1_PCI_OP(write, dword, u32, outl, 0)
With my optimized version, pci_write_config_dword() directly performs a hardcoded dword-size pci config access, with no need to consider offset masking or to dereference hose->cfg_addr/hose->cfg_data.
If the former then perhaps we should change this. If the latter then we have work to do...
Regards, Bin
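For illustration of the overhead being discussed: the generic path has to handle all access sizes. Roughly (a sketch, not the exact U-Boot code), a generic Type 1 config read looks like this:

static int type1_read_config(pci_dev_t bdf, uint offset, ulong *valuep,
			     enum pci_size_t size)
{
	/* The offset masking and size switch below are exactly what
	 * the dword-only Quark helpers avoid.
	 */
	outl(bdf | (offset & 0xfc) | PCI_CFG_EN, PCI_REG_ADDR);
	switch (size) {
	case PCI_SIZE_8:
		*valuep = inb(PCI_REG_DATA + (offset & 3));
		break;
	case PCI_SIZE_16:
		*valuep = inw(PCI_REG_DATA + (offset & 2));
		break;
	default:
		*valuep = inl(PCI_REG_DATA);
		break;
	}

	return 0;
}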

Hi Bin,
On 31 August 2015 at 07:43, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
Boot time performance degradation is observed with the conversion to use dm pci. Intel Quark SoC has a low-end x86 processor with only 400MHz frequency, and the most time-consuming part is the MRC. Each MRC register programming requires indirect access via the pci bus. With dm pci, accessing pci configuration space has some overhead. Unfortunately this single access overhead accumulates over the whole MRC process, and finally doubles the boot time (25 seconds, up from 12 seconds).
To speed up the boot, create an optimized version of pci config read/write routines without bothering to go through driver model. Now it only takes about 3 seconds to finish MRC, which is really fast (8 times faster than dm pci, or 4 times faster than before).
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 22 deletions(-)

Before I delve into the patch - with driver model we are using the I/O method - see pci_x86_read_config(). Is that the source of the slowdown, or is it just general driver model overhead?

The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program the DDR controller. Inside msg_port.c, pci_write_config_dword() and pci_read_config_dword() are called.
With driver model, the overhead is:
pci_write_config_dword() -> pci_write_config32() -> pci_write_config() -> uclass_get_device_by_seq(), then pci_bus_write_config(), which finally calls pci_x86_write_config().

Without driver model, there is still some overhead (so previously the MRC time was about 12 seconds):
pci_write_config_dword() -> pci_hose_write_config_dword() -> TYPE1_PCI_OP(write, dword, u32, outl, 0)
With my optimized version, pci_write_config_dword() directly performs a hardcoded dword-size pci config access, with no need to consider offset masking or to dereference hose->cfg_addr/hose->cfg_data.

What about using dm_pci_read_config32()? We should try to move PCI accesses to driver model to avoid the uclass_get_device_by_seq() everywhere.
If the former then perhaps we should change this. If the latter then we have work to do...
Regards, Simon

Hi Simon,
On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 07:43, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
Boot time performance degradation is observed with the conversion to use dm pci. Intel Quark SoC has a low-end x86 processor with only 400MHz frequency, and the most time-consuming part is the MRC. Each MRC register programming requires indirect access via the pci bus. With dm pci, accessing pci configuration space has some overhead. Unfortunately this single access overhead accumulates over the whole MRC process, and finally doubles the boot time (25 seconds, up from 12 seconds).
To speed up the boot, create an optimized version of pci config read/write routines without bothering to go through driver model. Now it only takes about 3 seconds to finish MRC, which is really fast (8 times faster than dm pci, or 4 times faster than before).
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 22 deletions(-)

Before I delve into the patch - with driver model we are using the I/O method - see pci_x86_read_config(). Is that the source of the slowdown, or is it just general driver model overhead?

The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program the DDR controller. Inside msg_port.c, pci_write_config_dword() and pci_read_config_dword() are called.
With driver model, the overhead is:
pci_write_config_dword() -> pci_write_config32() -> pci_write_config() -> uclass_get_device_by_seq(), then pci_bus_write_config(), which finally calls pci_x86_write_config().

Without driver model, there is still some overhead (so previously the MRC time was about 12 seconds):
pci_write_config_dword() -> pci_hose_write_config_dword() -> TYPE1_PCI_OP(write, dword, u32, outl, 0)
With my optimized version, pci_write_config_dword() directly performs a hardcoded dword-size pci config access, with no need to consider offset masking or to dereference hose->cfg_addr/hose->cfg_data.

What about using dm_pci_read_config32()? We should try to move PCI accesses to driver model to avoid the uclass_get_device_by_seq() everywhere.

I don't think that helps. dm_pci_read_config32() requires a dm driver. MRC is just something that programs a bunch of registers with pci config read/write calls.
If the former then perhaps we should change this. If the latter then we have work to do...
Also for this patch, I just realized that not only does it help to reduce the boot time, but it also helps to support the PCIe root ports in future patches.

By checking the Quark SoC manual, I found something that needs to be done to get the two PCIe root ports on the Quark SoC to work. In order to get the PCIe root ports to show up in the PCI configuration space, we need to program some registers in the message bus (using the APIs in arch/x86/cpu/quark/msg_port.c). With driver model, if we still call pci_write_config_dword() in msg_port.c, we end up triggering the PCI enumeration process before we actually write anything to the configuration space; however, the enumeration will hang when it scans the PCIe root ports if we haven't properly initialized the message bus registers. This is a chicken-and-egg problem.
Regards, Bin

Hi Bin,
On 31 August 2015 at 08:04, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 07:43, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
Boot time performance degradation is observed with the conversion to use dm pci. Intel Quark SoC has a low-end x86 processor with only 400MHz frequency, and the most time-consuming part is the MRC. Each MRC register programming requires indirect access via the pci bus. With dm pci, accessing pci configuration space has some overhead. Unfortunately this single access overhead accumulates over the whole MRC process, and finally doubles the boot time (25 seconds, up from 12 seconds).
To speed up the boot, create an optimized version of pci config read/write routines without bothering to go through driver model. Now it only takes about 3 seconds to finish MRC, which is really fast (8 times faster than dm pci, or 4 times faster than before).
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 22 deletions(-)

Before I delve into the patch - with driver model we are using the I/O method - see pci_x86_read_config(). Is that the source of the slowdown, or is it just general driver model overhead?

The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program the DDR controller. Inside msg_port.c, pci_write_config_dword() and pci_read_config_dword() are called.
With driver model, the overhead is:
pci_write_config_dword() -> pci_write_config32() -> pci_write_config() -> uclass_get_device_by_seq(), then pci_bus_write_config(), which finally calls pci_x86_write_config().

Without driver model, there is still some overhead (so previously the MRC time was about 12 seconds):
pci_write_config_dword() -> pci_hose_write_config_dword() -> TYPE1_PCI_OP(write, dword, u32, outl, 0)
With my optimized version, pci_write_config_dword() directly performs a hardcoded dword-size pci config access, with no need to consider offset masking or to dereference hose->cfg_addr/hose->cfg_data.

What about using dm_pci_read_config32()? We should try to move PCI accesses to driver model to avoid the uclass_get_device_by_seq() everywhere.

I don't think that helps. dm_pci_read_config32() requires a dm driver. MRC is just something that programs a bunch of registers with pci config read/write calls.
My question is really what takes the time? It's not clear whether it is the driver model overhead or something else. The code you add in qrk_pci_write_config_dword() looks very similar to pci_x86_read_config().
If the former then perhaps we should change this. If the latter then we have work to do...
Also for this patch, I just realized that not only does it help to reduce the boot time, but it also helps to support the PCIe root ports in future patches.

By checking the Quark SoC manual, I found something that needs to be done to get the two PCIe root ports on the Quark SoC to work. In order to get the PCIe root ports to show up in the PCI configuration space, we need to program some registers in the message bus (using the APIs in arch/x86/cpu/quark/msg_port.c). With driver model, if we still call pci_write_config_dword() in msg_port.c, we end up triggering the PCI enumeration process before we actually write anything to the configuration space; however, the enumeration will hang when it scans the PCIe root ports if we haven't properly initialized the message bus registers. This is a chicken-and-egg problem.
Sure, although I see that as a separate problem.
We can't have a driver model implementation that is really slow, so I'd like to clear that up first. Of course your patch makes sense for other reasons, but I don't want to play whack-a-mole here.
Also we should start to move things away from the non-driver-model pci functions.
Regards, Simon

Hi Simon,
On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 08:04, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 07:43, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
Boot time performance degradation is observed with the conversion to use dm pci. Intel Quark SoC has a low-end x86 processor with only 400MHz frequency, and the most time-consuming part is the MRC. Each MRC register programming requires indirect access via the pci bus. With dm pci, accessing pci configuration space has some overhead. Unfortunately this single access overhead accumulates over the whole MRC process, and finally doubles the boot time (25 seconds, up from 12 seconds).
To speed up the boot, create an optimized version of pci config read/write routines without bothering to go through driver model. Now it only takes about 3 seconds to finish MRC, which is really fast (8 times faster than dm pci, or 4 times faster than before).
Signed-off-by: Bin Meng bmeng.cn@gmail.com
 arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 22 deletions(-)

Before I delve into the patch - with driver model we are using the I/O method - see pci_x86_read_config(). Is that the source of the slowdown, or is it just general driver model overhead?

The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program the DDR controller. Inside msg_port.c, pci_write_config_dword() and pci_read_config_dword() are called.
With driver model, the overhead is:
pci_write_config_dword() -> pci_write_config32() -> pci_write_config() -> uclass_get_device_by_seq(), then pci_bus_write_config(), which finally calls pci_x86_write_config().

Without driver model, there is still some overhead (so previously the MRC time was about 12 seconds):
pci_write_config_dword() -> pci_hose_write_config_dword() -> TYPE1_PCI_OP(write, dword, u32, outl, 0)
With my optimized version, pci_write_config_dword() directly performs a hardcoded dword-size pci config access, with no need to consider offset masking or to dereference hose->cfg_addr/hose->cfg_data.

What about using dm_pci_read_config32()? We should try to move PCI accesses to driver model to avoid the uclass_get_device_by_seq() everywhere.

I don't think that helps. dm_pci_read_config32() requires a dm driver. MRC is just something that programs a bunch of registers with pci config read/write calls.
My question is really what takes the time? It's not clear whether it is the driver model overhead or something else. The code you add in qrk_pci_write_config_dword() looks very similar to pci_x86_read_config().
It is the driver model overhead. In order to get to pci_x86_read_config(), we need to go through a bunch of function calls (see above). Yes, my version is very similar to pci_x86_read_config(), but it is simpler as it only needs to deal with dword size, so there is no need to do offset masking and a switch/case. If you look at the Quark MRC code, there are thousands of calls to msg_port_read() and msg_port_write().
If the former then perhaps we should change this. If the latter then we have work to do...
Also for this patch, I just realized that not only does it help to reduce the boot time, but it also helps to support the PCIe root ports in future patches.

By checking the Quark SoC manual, I found something that needs to be done to get the two PCIe root ports on the Quark SoC to work. In order to get the PCIe root ports to show up in the PCI configuration space, we need to program some registers in the message bus (using the APIs in arch/x86/cpu/quark/msg_port.c). With driver model, if we still call pci_write_config_dword() in msg_port.c, we end up triggering the PCI enumeration process before we actually write anything to the configuration space; however, the enumeration will hang when it scans the PCIe root ports if we haven't properly initialized the message bus registers. This is a chicken-and-egg problem.
Sure, although I see that as a separate problem.
Yes, it is a separate problem. It came to me when I was reading the manual after I submitted the patch.
We can't have a driver model implementation that is really slow, so I'd like to clear that up first. Of course your patch makes sense for other reasons, but I don't want to play whack-a-mole here.
Agreed. So far the driver model PCI is used on x86 boards and sandbox. The performance issue was not obvious on these targets, but it is quite noticeable on Intel Quark. These PCI config read/write routines will go through lots of function calls before we actually touch the I/O ports 0xcf8 and 0xcfc, especially when the device is not on the root bus (it needs to go through its parent followed by its parent's parent).
But anyway I think this optimization for Quark is needed. I doubt we can optimize driver model pci to such an extent.
Also we should start to move things away from the non-driver-model pci functions.
Regards, Bin

Hi Bin,
On 31 August 2015 at 19:40, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 08:04, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 07:43, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote:
> Boot time performance degradation is observed with the conversion
> to use dm pci. Intel Quark SoC has a low-end x86 processor with
> only 400MHz frequency, and the most time-consuming part is the MRC.
> Each MRC register programming requires indirect access via the pci
> bus. With dm pci, accessing pci configuration space has some
> overhead. Unfortunately this single access overhead accumulates
> over the whole MRC process, and finally doubles the boot time
> (25 seconds, up from 12 seconds).
>
> To speed up the boot, create an optimized version of pci config
> read/write routines without bothering to go through driver model.
> Now it only takes about 3 seconds to finish MRC, which is really
> fast (8 times faster than dm pci, or 4 times faster than before).
>
> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
> ---
>
> arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
> 1 file changed, 37 insertions(+), 22 deletions(-)

Before I delve into the patch - with driver model we are using the I/O method - see pci_x86_read_config(). Is that the source of the slowdown, or is it just general driver model overhead?

The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program the DDR controller. Inside msg_port.c, pci_write_config_dword() and pci_read_config_dword() are called.
With driver model, the overhead is:
pci_write_config_dword() -> pci_write_config32() -> pci_write_config() -> uclass_get_device_by_seq(), then pci_bus_write_config(), which finally calls pci_x86_write_config().

Without driver model, there is still some overhead (so previously the MRC time was about 12 seconds):
pci_write_config_dword() -> pci_hose_write_config_dword() -> TYPE1_PCI_OP(write, dword, u32, outl, 0)
With my optimized version, pci_write_config_dword() directly performs a hardcoded dword-size pci config access, with no need to consider offset masking or to dereference hose->cfg_addr/hose->cfg_data.

What about using dm_pci_read_config32()? We should try to move PCI accesses to driver model to avoid the uclass_get_device_by_seq() everywhere.

I don't think that helps. dm_pci_read_config32() requires a dm driver. MRC is just something that programs a bunch of registers with pci config read/write calls.
My question is really what takes the time? It's not clear whether it is the driver model overhead or something else. The code you add in qrk_pci_write_config_dword() looks very similar to pci_x86_read_config().
It is the driver model overhead. In order to get to pci_x86_read_config(), we need to go through a bunch of function calls (see above). Yes, my version is very similar to pci_x86_read_config(), but it is simpler as it only needs to deal with dword size, so there is no need to do offset masking and a switch/case. If you look at the Quark MRC code, there are thousands of calls to msg_port_read() and msg_port_write().
If the former then perhaps we should change this. If the latter then we have work to do...
Also for this patch, I just realized that not only does it help to reduce the boot time, but it also helps to support the PCIe root ports in future patches.

By checking the Quark SoC manual, I found something that needs to be done to get the two PCIe root ports on the Quark SoC to work. In order to get the PCIe root ports to show up in the PCI configuration space, we need to program some registers in the message bus (using the APIs in arch/x86/cpu/quark/msg_port.c). With driver model, if we still call pci_write_config_dword() in msg_port.c, we end up triggering the PCI enumeration process before we actually write anything to the configuration space; however, the enumeration will hang when it scans the PCIe root ports if we haven't properly initialized the message bus registers. This is a chicken-and-egg problem.
Sure, although I see that as a separate problem.
Yes, it is a separate problem. It came to me when I was reading the manual after I submitted the patch.
We can't have a driver model implementation that is really slow, so I'd like to clear that up first. Of course your patch makes sense for other reasons, but I don't want to play whack-a-mole here.
Agreed. So far the driver model PCI is used on x86 boards and sandbox. The performance issue was not obvious on these targets, but it is quite noticeable on Intel Quark. These PCI config read/write routines will go through lots of function calls before we actually touch the I/O ports 0xcf8 and 0xcfc, especially when the device is not on the root bus (it needs to go through its parent followed by its parent's parent).
But anyway I think this optimization for Quark is needed. I doubt we can optimize driver model pci to such an extent.
Can we use this as an opportunity to try a few things? If we use the dm_pci functions that should cut out some overhead. Can you try an experiment to see how much difference it makes?
dm_pci_read_config32()
  pci_get_bdf()
  pci_read_config()
    for loop, which probably terminates immediately
    pci_bus_read_config()
      read_config(), which is pci_x86_read_config()
So it's not great but it doesn't look too bad.
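A sketch of the suggested experiment, assuming pci_bus_find_bdf() and dm_pci_read_config32() from the dm PCI API: look the device up once, then loop over config accesses so only the read path itself is being timed (the loop count is arbitrary, for illustration):

struct udevice *dev;
u32 value;

/* One-time lookup replaces the per-access uclass_get_device_by_seq() */
if (!pci_bus_find_bdf(QUARK_HOST_BRIDGE, &dev)) {
	int i;

	for (i = 0; i < 1000; i++)
		dm_pci_read_config32(dev, MSG_DATA_REG, &value);
}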
Also, is this an Intel Galileo gen 2 development board? I'm thinking of getting one.
Also we should start to move things away from the non-driver-model pci functions.
Regards, Bin
Regards, Simon

Hi Simon,
On Tue, Sep 1, 2015 at 10:57 AM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 19:40, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 08:04, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass sjg@chromium.org wrote:
Hi Bin,
On 31 August 2015 at 07:43, Bin Meng bmeng.cn@gmail.com wrote:
Hi Simon,
On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass sjg@chromium.org wrote: > Hi Bin, > > On 31 August 2015 at 03:52, Bin Meng bmeng.cn@gmail.com wrote: >> Boot time performance degradation is observed with the conversion >> to use dm pci. Intel Quark SoC has a low end x86 processor with >> only 400MHz frequency and the most time consuming part is with MRC. >> Each MRC register programming requires indirect access via pci bus. >> With dm pci, accessing pci configuration space has some overhead. >> Unfortunately this single access overhead gets accumulated in the >> whole MRC process, and finally leads to twice boot time (25 seconds) >> than before (12 seconds). >> >> To speed up the boot, create an optimized version of pci config >> read/write routines without bothering to go through driver model. >> Now it only takes about 3 seconds to finish MRC, which is really >> fast (8 times faster than dm pci, or 4 times faster than before). >> >> Signed-off-by: Bin Meng bmeng.cn@gmail.com >> --- >> >> arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++---------------- >> 1 file changed, 37 insertions(+), 22 deletions(-) > > Before I delve into the patch - with driver model we are using the I/O > method - see pci_x86_read_config(). Is that the source of the slowdown > or is it just general driver model overhead.
The MRC calls the APIs in arch/x86/cpu/quark/msg_port.c to program the DDR controller. Inside msg_port.c, pci_write_config_dword() and pci_read_config_dword() are called.
With driver model, the overhead is:
pci_write_config_dword() -> pci_write_config32() -> pci_write_config() -> uclass_get_device_by_seq(), then pci_bus_write_config(), which finally calls pci_x86_write_config().
Without driver model there is still some overhead (which is why the MRC time was previously about 12 seconds):
pci_write_config_dword() -> pci_hose_write_config_dword() -> TYPE1_PCI_OP(write, dword, u32, outl, 0)
With my optimized version, pci_write_config_dword() directly performs a hardcoded dword-size PCI config access, with no need to handle offset and mask, and no dereferencing of hose->cfg_addr/hose->cfg_data.
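The shape of the optimized routines is roughly this (macro names approximate, PCI_REG_ADDR/PCI_REG_DATA being 0xcf8/0xcfc; offset is assumed dword-aligned):

#define PCI_CFG_EN	0x80000000	/* config space enable bit */

static void qrk_pci_write_config_dword(pci_dev_t dev, int offset, u32 value)
{
	/* two port accesses, nothing else: no size switch, no hose lookup */
	outl(dev | offset | PCI_CFG_EN, PCI_REG_ADDR);
	outl(value, PCI_REG_DATA);
}

static void qrk_pci_read_config_dword(pci_dev_t dev, int offset, u32 *valuep)
{
	outl(dev | offset | PCI_CFG_EN, PCI_REG_ADDR);
	*valuep = inl(PCI_REG_DATA);
}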
What about if we use dm_pci_read_config32()? We should try to move PCI access to driver model to avoid the uclass_get_device_by_seq() everywhere.
I don't think that helps. dm_pci_read_config32() requires a dm driver, and the MRC is just code that programs a bunch of registers with PCI config read/write calls.
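The two prototypes show the mismatch; roughly as declared in include/pci.h:

/* dm API: needs a bound struct udevice, which bare MRC code does not have */
int dm_pci_read_config32(struct udevice *dev, int offset, u32 *valuep);

/* legacy API: takes a raw b.d.f, callable from anywhere */
int pci_read_config_dword(pci_dev_t dev, int offset, u32 *valuep);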
My question is really what takes the time? It's not clear whether it is the driver model overhead or something else. The code you add in qrk_pci_write_config_dword() looks very similar to pci_x86_read_config().
It is the driver model overhead. In order to get to pci_x86_read_config() we have to go through a bunch of function calls (see above). Yes, my version is very similar to pci_x86_read_config(), but it is simpler, since it only has to deal with dword size, so there is no offset masking and no switch/case. If you look at the Quark MRC code, there are thousands of calls to msg_port_read() and msg_port_write().
Also, for this patch, I just realized that not only does it help to reduce the boot time, it also helps to support the PCIe root ports in future patches.
> dm_pci_read_config32()
We can't use this API as MRC is not a dm driver.
Yes, this is an Intel Galileo gen2 development board. There was a gen1 board in the past, and the same u-boot.rom boots on both gen1 and gen2 boards, but Intel is now shipping only the gen2 board.
Regards, Bin

Hi Bin,
On 31 August 2015 at 21:04, Bin Meng bmeng.cn@gmail.com wrote:
> We can't use this API as MRC is not a dm driver.
OK, probably I need to dig in and understand this a little better. Is it running pre-relocation with the early PCI stuff? We could make a driver with UCLASS_RAM perhaps.
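Something shaped like this, perhaps; a purely hypothetical, untested skeleton (the ram uclass ops and all names here are assumptions):

#include <common.h>
#include <dm.h>
#include <ram.h>

DECLARE_GLOBAL_DATA_PTR;

static int quark_dram_probe(struct udevice *dev)
{
	/* the existing MRC would run here */
	return 0;
}

static int quark_dram_get_info(struct udevice *dev, struct ram_info *info)
{
	info->base = 0;
	info->size = gd->ram_size;

	return 0;
}

static struct ram_ops quark_dram_ops = {
	.get_info = quark_dram_get_info,
};

U_BOOT_DRIVER(quark_dram) = {
	.name = "quark_dram",
	.id = UCLASS_RAM,
	.ops = &quark_dram_ops,
	.probe = quark_dram_probe,
};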
> Yes, this is an Intel Galileo gen2 development board.
OK I've ordered one to try out.
Regards, Simon

Hi Simon,
On Tue, Sep 1, 2015 at 11:12 AM, Simon Glass sjg@chromium.org wrote:
> OK, probably I need to dig in and understand this a little better. Is it running pre-relocation with the early PCI stuff? We could make a driver with UCLASS_RAM perhaps.
Yes, it is running pre-relocation with the early PCI stuff. But I doubt the need for a UCLASS_RAM on x86 targets, as most x86 targets use FSP to initialize RAM. The best candidate I can think of for implementing UCLASS_RAM right now is the Freescale DDR controller driver on PowerPC, and recently on ARM. It supports both SPD and memory-down. On ARM, most targets' DDR initialization is memory-down, I believe.
Regards, Bin

Hi Simon,
On Tue, Sep 1, 2015 at 11:22 AM, Bin Meng bmeng.cn@gmail.com wrote:
> Yes, it is running pre-relocation with the early PCI stuff.
Some updates from today, while trying to support the PCIe root ports in v2:
I moved qrk_pci_write_config_dword() and qrk_pci_read_config_dword() from msg_port.c to quark.c and updated all the code in quark.c to call these two routines, to avoid the chicken-and-egg problem. With this change, I noticed that the MRC execution time went from 3 seconds to 4 seconds, i.e. 1 additional second. I disassembled U-Boot and found that in v1, since qrk_pci_write_config_dword() and qrk_pci_read_config_dword() were declared static in msg_port.c, the compiler inlined them into the APIs in msg_port.c. With the v2 changes they are no longer inlined but are normal function calls, which causes the additional 1 second.
So even inlining alone makes a measurable improvement for MRC, not to mention what we could gain by avoiding the multiple call chains in the driver model PCI APIs.
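One way to get the inlining back while still sharing the helpers between msg_port.c and quark.c would be to declare them static inline in a header; a sketch (header location assumed):

/* in arch/x86/include/asm/arch/quark.h (location assumed) */
static inline void qrk_pci_write_config_dword(pci_dev_t dev, int offset,
					      u32 value)
{
	outl(dev | offset | PCI_CFG_EN, PCI_REG_ADDR);
	outl(value, PCI_REG_DATA);
}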
Regards, Bin

Hi Bin,
On 1 September 2015 at 04:29, Bin Meng bmeng.cn@gmail.com wrote:
> So even inlining alone makes a measurable improvement for MRC, not to mention what we could gain by avoiding the multiple call chains in the driver model PCI APIs.
OK so at this point I think we should give up and go with your original code. I hope that in future we can figure out a way to make this more efficient, but for now, let's run with it.
Regards, Simon
participants (2): Bin Meng, Simon Glass