[U-Boot] boot-up time optimization. Where to start?

Hi there,
I just started to work on my bachelor thesis. It is about "Linux boot-up time optimization". I have spent the past few days analyzing what consumes the most time in the boot process.
I found that u-boot takes pretty much as long as the whole Linux kernel (the one we are using).
I started digging into the source and I think I have the big picture of what is going on. I already learned from the mailing list that it is a good idea to start a discussion early if you plan to change something and want it upstream. At this point of my thesis I am free to choose where I start. The only string attached is that if it is platform specific, it has to be TI OMAP3.
So here is my question: Where do you see the most potential to optimize u-boot?
I already have two bullets on my list (just some ideas - maybe totally unrealistic *g*):
- Use Hardware specific copy commands
- build the checksum while moving the kernel to RAM
Thanks! Simon

On Wed, Apr 27, 2011 at 8:39 AM, Simon Schwarz simonschwarzcor@googlemail.com wrote:
So here is my question: Where do you see the most potential to optimize u-boot?
I'm sure many of the timeouts could be optimized.

On Wed, Apr 27, 2011 at 03:39:18PM +0200, Simon Schwarz wrote:
So here is my question: Where do you see the most potential to optimize u-boot?
I already have two bullets on my list (just some ideas - maybe totally unrealistic *g*):
- Use Hardware specific copy commands
- build the checksum while moving the kernel to RAM
This shouldn't be a matter of opinion/guess work. You should instrument U-Boot first and find out where the time is really being spent. Otherwise you run the classic risk of optimizing something that isn't really a problem.
Good luck, I look forward to seeing the results of your analysis.

On Wed, Apr 27, 2011 at 09:39, Simon Schwarz wrote:
I found that u-boot takes pretty much as long as the whole Linux kernel (the one we are using).
I started digging into the source and I think I have a big picture of what is going on. I already learned from the mailing list that it is a good idea to start a discussion early if you plan to change something and want it upstream. At this point of my thesis I'am free to choose where I start - only string attached is that if it is platform specific it has to be TI OMAP3.
make sure caches are enabled. then do as Eric suggested. -mike

Hi,
Am Mittwoch, 27. April 2011, 15:39:18 schrieb Simon Schwarz:
I just started to work on my bachelor thesis. It is about "Linux boot-up time optimization". The past days I spend analyzing what consumes the most time in the boot process.
I found that u-boot takes pretty much as long as the whole Linux kernel (the one we are using).
I started digging into the source and I think I have a big picture of what is going on. I already learned from the mailing list that it is a good idea to start a discussion early if you plan to change something and want it upstream. At this point of my thesis I'am free to choose where I start - only string attached is that if it is platform specific it has to be TI OMAP3.
So here is my question: Where do you see the most potential to optimize u-boot?
Setting stdin, stdout and stderr takes a lot of time (IIRC ~500ms). Which IMO is useless on a bootloader without LCD support.
Regards, Alexander

On Wed, Apr 27, 2011 at 11:59, Alexander Stein wrote:
Am Mittwoch, 27. April 2011, 15:39:18 schrieb Simon Schwarz:
I just started to work on my bachelor thesis. It is about "Linux boot-up time optimization". The past days I spend analyzing what consumes the most time in the boot process.
I found that u-boot takes pretty much as long as the whole Linux kernel (the one we are using).
I started digging into the source and I think I have a big picture of what is going on. I already learned from the mailing list that it is a good idea to start a discussion early if you plan to change something and want it upstream. At this point of my thesis I'am free to choose where I start - only string attached is that if it is platform specific it has to be TI OMAP3.
So here is my question: Where do you see the most potential to optimize u-boot?
Setting stdin, stdout and stderr takes a lot of time (IIRC ~500ms). Which IMO is useless on a bootloader without LCD support.
eh ? those are still used with serial consoles. unless you're talking about some driver that is specific to the OMAP3 and/or a board. -mike

Dear Alexander Stein,
In message 201104271759.11818.alexander.stein@systec-electronic.com you wrote:
Setting stdin, stdout and stderr takes a lot of time (IIRC ~500ms). Which IMO is useless on a bootloader without LCD support.
Statements like this are completely worthless if you don't say exactly on which architecture and board, and with which exact version of U-Boot, such numbers have been measured.
Best regards,
Wolfgang Denk

Dear Wolfgang,
Am Mittwoch, 27. April 2011, 21:08:50 schrieb Wolfgang Denk:
In message 201104271759.11818.alexander.stein@systec-electronic.com you
wrote:
Setting stdin, stdout and stderr takes a lot of time (IIRC ~500ms). Which IMO is useless on a bootloader without LCD support.
Statements like this are completely worhtless if you don;t tell exactly on which architecture and board, and with which exact version of U-Boot such numbers have been measured.
Ok, let me be more precise on this. We used U-Boot v2010.09 on a custom board running on an I.MX35 (ARM1136). We noticed the following code snippet took relatively long. From common/console.c in console_init_r(void):
    /* Setting environment variables */
    for (i = 0; i < 3; i++) {
        setenv(stdio_names[i], stdio_devices[i]->name);
    }
We added pin toggling around this part of the code and measured something >100ms. A colleague said it was ~100ms, I remembered ~500ms. Dunno who is right.
Regards, Alexander

Hi Alexander,
Dear Wolfgang,
Am Mittwoch, 27. April 2011, 21:08:50 schrieb Wolfgang Denk:
In message 201104271759.11818.alexander.stein@systec-electronic.com you
wrote:
Setting stdin, stdout and stderr takes a lot of time (IIRC ~500ms). Which IMO is useless on a bootloader without LCD support.
Statements like this are completely worhtless if you don;t tell exactly on which architecture and board, and with which exact version of U-Boot such numbers have been measured.
Ok, let me be more precise on this. We used U-Boot v2010.09 on a custom board running on an I.MX35 (ARM1136). We noticed the following code snippet took relatively long.
From common/console.c in console_init_r(void):
    /* Setting environment variables */
    for (i = 0; i < 3; i++) {
        setenv(stdio_names[i], stdio_devices[i]->name);
    }
We added PIN toggling around this part of code and measured something >100ms. A collegue said it was ~100ms, I remembered ~500ms. Dunno who is right.
It doesn't really matter who is right - 100ms is way off for setting these variables. Looking into common/cmd_nvedit.c, these variables get special handling and there are ifdefs involved, so it's not straightforward to read. You should really find out where in there the time is spent for your board and fix the problem ;)
Cheers Detlev

Hi Detlev,
Am Montag, 2. Mai 2011, 17:31:15 schrieb Detlev Zundel:
Hi Alexander,
Dear Wolfgang,
Am Mittwoch, 27. April 2011, 21:08:50 schrieb Wolfgang Denk:
In message 201104271759.11818.alexander.stein@systec-electronic.com you
wrote:
Setting stdin, stdout and stderr takes a lot of time (IIRC ~500ms). Which IMO is useless on a bootloader without LCD support.
Statements like this are completely worhtless if you don;t tell exactly on which architecture and board, and with which exact version of U-Boot such numbers have been measured.
Ok, let me be more precise on this. We used U-Boot v2010.09 on a custom board running on an I.MX35 (ARM1136). We noticed the following code snippet took relatively long.
From common/console.c in console_init_r(void):
    /* Setting environment variables */
    for (i = 0; i < 3; i++) {
        setenv(stdio_names[i], stdio_devices[i]->name);
    }
We added pin toggling around this part of the code and measured something >100ms. A colleague said it was ~100ms, I remembered ~500ms. Dunno who is right.
It doesn't really matter who is right - 100ms is way off for setting these variables. Looking into common/cmd_nvedit.c, these variables have a special handling and there are ifdef's involved, so its not straightforward to read. You should really find out, where in there the time is spent for your board and fix the problem ;)
Our 'fix' was to remove those lines altogether.
Regards, Alexander

Hi Alexander,
[...]
Ok, let me be more precise on this. We used U-Boot v2010.09 on a custom board running on an I.MX35 (ARM1136). We noticed the following code snippet took relatively long.
From common/console.c in console_init_r(void):
    /* Setting environment variables */
    for (i = 0; i < 3; i++) {
        setenv(stdio_names[i], stdio_devices[i]->name);
    }
We added pin toggling around this part of the code and measured something >100ms. A colleague said it was ~100ms, I remembered ~500ms. Dunno who is right.
It doesn't really matter who is right - 100ms is way off for setting these variables. Looking into common/cmd_nvedit.c, these variables have a special handling and there are ifdef's involved, so its not straightforward to read. You should really find out, where in there the time is spent for your board and fix the problem ;)
Our 'fix' was removing the stated lines at all.
That is not a fix but simply ignoring a problem. Maybe if you find out why these lines have such an unexpected run-time, you will solve more problems also.
We have a mantra here on the mailing list, so let me introduce you to it:
Solve the problems one by one in the order that you encounter them. Every ignored problem will come back later and catch you. Really, it will.
Now repeat after me ;)
Cheers Detlev

Dear Alexander Stein,
In message 201105021807.52142.alexander.stein@systec-electronic.com you wrote:
Our 'fix' was removing the stated lines at all.
Did you understand what you were doing?
Best regards,
Wolfgang Denk

Dear Alexander Stein,
In message 201105021640.27241.alexander.stein@systec-electronic.com you wrote:
Ok, let me be more precise on this. We used U-Boot v2010.09 on a custom board running on an I.MX35 (ARM1136).
Let's summarize known facts:
1. You are talking about an out-of-tree port, i. e. code which is completely unknown here, so it is basically impossible to comment. We can only speculate.
2. This is an ARM board.
3. This is an old version before cache support for ARM was added.
We noticed the following code snippet took relatively long. From common/console.c in console_init_r(void):
    /* Setting environment variables */
    for (i = 0; i < 3; i++) {
        setenv(stdio_names[i], stdio_devices[i]->name);
    }
We added PIN toggling around this part of code and measured something >100ms. A collegue said it was ~100ms, I remembered ~500ms. Dunno who is right.
Both numbers are way off.
Let me speculate:
(I) you have a _huge_ environment allocated for your board, probably > 100 KiB;
(II) you are loading it from a slow storage device, probably NAND flash;
(III) you are running on a narrow system bus (16 bit) with non-optimal RAM timings;
(IV) you do all this with caches turned off;
(V) you measure some numbers but you don't understand what they mean.
I bet some beer that at least 3 of these speculations hit the point.
Best regards,
Wolfgang Denk

Dear Wolfgang,
Am Montag, 2. Mai 2011, 19:00:47 schrieb Wolfgang Denk:
In message 201105021640.27241.alexander.stein@systec-electronic.com you
wrote:
Ok, let me be more precise on this. We used U-Boot v2010.09 on a custom board running on an I.MX35 (ARM1136).
Let's summarize known facts:
- You are talking about an out-of-tree port, i. e. code which is completely unknown here, so it is basicly impossible to comment. We can barely speculate.
I'm aware of that. But on the other hand, cmd_nvedit.c and console.c are both generic parts. As Detlev has written, being out-of-tree is not the cause of the problem. Let's see if I find time for this again.
This is an ARM board.
This is an old version before cache support for ARM was added.
This specific version was selected due to relocation problems on ARM. But I expect the dcache doesn't have that big influence on the named code part as the environment is already in RAM.
We noticed the following code snippet took relatively long.
From common/console.c in console_init_r(void):
/* Setting environment variables */ for (i = 0; i < 3; i++) {
setenv(stdio_names[i], stdio_devices[i]->name);
}
We added PIN toggling around this part of code and measured something >100ms. A collegue said it was ~100ms, I remembered ~500ms. Dunno who is right.
Both numbers are way off.
Let me speculate: (I) you have a _huge_ environment allocated for your board, probably > 100 KiB or more;
Environment size: 2098/131067 bytes
So, no.
(II) you are loading it from a slow storage device, probably NAND flash;
The environment is stored in NOR-Flash. So, no.
(III) you are running on a narrow system bus (16 bit) with non-optimal RAM timings;
It is using a 32-Bit RAM-Bus. So, no.
(IV) you do all this with caches turned off;
dcaches should be off, while icaches are on. So yes and no.
(V) you measure some numbers but you don;t understand what they mean.
These numbers show me that this part of the code increases the start time by a considerable amount. The workaround resulted in a faster startup without noticeable side effects. I'm aware this is not a fix for the problem. So yes and no.
I bet some beer that at least 3 of these speculations hit the point.
You'd better not bet here :-)
Regards, Alexander

On 05/03/2011 08:48 AM, Alexander Stein wrote:
Hi Alexander,
Am Montag, 2. Mai 2011, 19:00:47 schrieb Wolfgang Denk:
In message 201105021640.27241.alexander.stein@systec-electronic.com you
wrote:
Ok, let me be more precise on this. We used U-Boot v2010.09 on a custom board running on an I.MX35 (ARM1136).
The i.MX35 is supported in u-boot mainline. I suppose you started from the mx35pdk to port your code, so you probably have a good chance of using the latest u-boot code. And then of sending patches for your board, too... ;-)
This is an ARM board.
This is an old version before cache support for ARM was added.
This specific version was selected due to relocation problems on ARM.
The mx35pdk board is supported in u-boot mainline using relocation.
But I expect the dcache doesn't have that big influence on the named code part as the environment is already in RAM.
Check the mailing list: the numbers reported by people who measured the influence of the cache show a big difference, especially when big chunks of code are copied, such as when copying the kernel from storage before booting.
Best regards, Stefano Babic

Dear Alexander Stein,
In message 201105030848.17576.alexander.stein@systec-electronic.com you wrote:
This specific version was selected due to relocation problems on ARM. But I expect the dcache doesn't have that big influence on the named code part as the environment is already in RAM.
Your expectation is most likely completely wrong. Reading from / writing to uncached RAM is painfully slow compared to a system with caches turned on. And if you - as I speculate - need to checksum a huge amount of data, this will delay things without need.
Are you also still using the old environment code in your port, or the new, hash-table-based one? With the old code, there are additional penalties for using a needlessly big environment, as each call to setenv() will recalculate the checksum.
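To make that penalty concrete, here is a minimal sketch of an environment that re-checksums its whole region on every update. It is a toy model, not U-Boot's code: `toy_env`, `toy_env_update` and `TOY_ENV_SIZE` are hypothetical names, the region is kept at 4 KiB to stay small, and a plain bitwise CRC-32 (zlib polynomial) is assumed for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ENV_SIZE 4096 /* hypothetical; the board in this thread reserves ~128 KiB */

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320 (zlib variant). */
uint32_t crc32_simple(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}

/* Toy environment: a checksum, the data region, and a byte counter
 * that exists only to show how much work the checksum does. */
struct toy_env {
    uint32_t crc;
    uint8_t  data[TOY_ENV_SIZE];
    uint64_t bytes_checksummed;
};

/* Change one byte, then re-checksum the WHOLE region, used or not.
 * This mirrors the behaviour described for the old environment code. */
void toy_env_update(struct toy_env *env, size_t offset, uint8_t value)
{
    if (offset < TOY_ENV_SIZE)
        env->data[offset] = value;
    env->crc = crc32_simple(env->data, TOY_ENV_SIZE);
    env->bytes_checksummed += TOY_ENV_SIZE;
}
```

Three updates (think stdin, stdout, stderr) walk three times the full region, so the cost depends on the reserved size rather than on the few KiB actually in use.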
Let me speculate: (I) you have a _huge_ environment allocated for your board, probably > 100 KiB or more;
Environment size: 2098/131067 bytes
So, no.
So, yes! You cannot even read your own numbers correctly.
131067 bytes is just under 128 KiB (131072 bytes), which _is_ > 100 KiB.
(II) you are loading it from a slow storage device, probably NAND flash;
The environment is stored in NOR-Flash. So, no.
Especially on NOR flash there is no reason to use an environment size of 128 KiB when you only use 2 KiB of it.
(III) you are running on a narrow system bus (16 bit) with non-optimal RAM timings;
It is using a 32-Bit RAM-Bus. So, no.
And your NOR flash?
And your memory timings?
(IV) you do all this with caches turned off;
dcaches should be off, while icaches are on. So yes and no.
DC off makes things awfully slow. See the comments of commits c3330e9, 95c6f6d and 7e4a9e6 - for plain RAM bound operations like copying/uncompressing an image from RAM to RAM, switching on the DC can accelerate the system by a factor of up to >15.
(V) you measure some numbers but you don;t understand what they mean.
These numbers show me that this part of code increases the start time of a considerable amount.
You don't even understand that you have > 100 KiB of environment size which gets checksummed without need.
The workaround resulted in a faster startup without notable side effect. I'm aware this is not the fix of the problem. So yes and no.
I bet some beer that at least 3 of these speculations hit the point.
You better not want to bet here :-)
It appears I was right on counts (I), (IV) and (V) ;-) And (III) is not clear yet.
Fact is, the code that you claim takes 100 (or 500) ms to run has no potential for such a long run time unless your system is seriously misconfigured. I guess it runs at least 100 times faster on all systems I have access to.
Best regards,
Wolfgang Denk

Dear Wolfgang,
Am Donnerstag, 5. Mai 2011, 07:32:20 schrieb Wolfgang Denk:
In message 201105030848.17576.alexander.stein@systec-electronic.com you
wrote:
This specific version was selected due to relocation problems on ARM. But I expect the dcache doesn't have that big influence on the named code part as the environment is already in RAM.
Your expectation is most likely completely wrong. Reading from / writing to uncached RAM is painfully slow compared to a system with caches turned on. And if you - as I speculate - need to checksum a huge amount of data, this will delay things without need.
Are you also still using the old environment code in your port, or is the new, hash table based one? When using the old code, there are additional penalties for using a needlessly big environment as each call to setenv() will recalculate the checksum.
I dug into this problem for a short time. And yes, the CRC checksum calculation takes about 25ms per run. setenv is called once each for stdin, stdout and stderr, which sums up to ~75ms. So you're right, this is the old environment code. Here a dcache will speed up the execution of course. But our standard startup just starts U-Boot, copies the Linux kernel into RAM and starts it. There is not much use for the dcache during the copy here.
(III) you are running on a narrow system bus (16 bit) with non-optimal RAM timings;
It is using a 32-Bit RAM-Bus. So, no.
And your NOR flash?
It is connected as 16-bit, which is all most devices support, but it is set up to use page read mode.
And your memory timings?
Should be pretty good.
(IV) you do all this with caches turned off;
dcaches should be off, while icaches are on. So yes and no.
DC of makes things awfully slow. See comments of commits c3330e9, 95c6f6d and 7e4a9e6 - for plain RAM bound operations like copying/uncompressing an image from RAM to RAM switchign on the DC can accelerate the system by a factor of up to >15.
Yes, from RAM to RAM, dcache will help a lot. But we neither copy from RAM to RAM nor do any uncompressing.
(V) you measure some numbers but you don;t understand what they mean.
These numbers show me that this part of code increases the start time of a considerable amount.
You don;t even understand that you have > 100 KiB of environment size which gets checksummed without need.
Mh, this might be an option for further ports.
Fact is, the code that you claim takes 100 (or 500) ms to run has no potential for such a long run time unless your system is seriously misconfigured. I guess it runs at least 100 times faster on all systems I have access to.
Well, as already said, this is related to the CRC calculation of the environment. I did a quick port to v2011.03 and setenv is a lot faster, which is due to the new env code base. But I also noticed that kernel_entry is called about 30ms later after reset than on the old code base. I didn't investigate any further to see what caused this. AFAICS the new U-Boot code doesn't enable the dcache on ARM1136 either.
Regards, Alexander

Dear Alexander Stein,
In message 201105050906.35834.alexander.stein@systec-electronic.com you wrote:
Are you also still using the old environment code in your port, or is the new, hash table based one? When using the old code, there are additional penalties for using a needlessly big environment as each call to setenv() will recalculate the checksum.
I was digging into this problem for a short time. And yes, the CRC checksumcalculation takes about 25ms each run. So setenv is called for each stdin,stdout and stderr. which sums up to ~75ms. So you're right this is the old environment code. Here a dcache will speed up the execution of course.
Reducing the environment size to some reasonable value would help even more. Currently you are using some 2 KiB, so say you set the environment size to 8 KiB. This would be 1/16 of your current size, which means the ~75ms would shrink to less than 5 ms. You are wasting 70 ms (only here - there are other places which will add to this figure) just because of this inappropriate configuration.
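The estimate above can be reproduced with a one-line model, assuming the checksum time scales linearly with the size of the reserved region (an assumption of this sketch, not a measurement). The function name `scaled_crc_time_ms` is made up for illustration.

```c
/* Estimated checksum time after resizing the environment region,
 * assuming cost is proportional to the number of bytes checksummed. */
double scaled_crc_time_ms(double measured_ms, double current_kib, double new_kib)
{
    return measured_ms * new_kib / current_kib;
}
```

With the numbers from this thread, `scaled_crc_time_ms(75.0, 128.0, 8.0)` gives 4.6875 ms, which matches "less than 5 ms" and leaves roughly 70 ms saved.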
But our standard startup just stars U-Boot and copies the Linux kernel into RAM and starts it. There is not much use of dcache during copy here.
You are wrong. There is a huge difference between performing a copy operation in single write cycles to uncached RAM versus writing to a cached area where the cache flushes will operate in burst mode. Also, the U-Boot code itself will run faster, so copying and decompression are much faster.
You repeat the same mistake again: you make assumptions about what may or may not be fast or slow on your system without actually measuring it. Donald Knuth is right again: "Premature optimization is the root of all evil."
It is using a 32-Bit RAM-Bus. So, no.
And your NOR flash?
It is connected 16-bit like most devices only support, but it is setup to use page read mode.
Well, many systems use two 16 bit chips in parallel to give a 32 bit bus.
DC of makes things awfully slow. See comments of commits c3330e9, 95c6f6d and 7e4a9e6 - for plain RAM bound operations like copying/uncompressing an image from RAM to RAM switchign on the DC can accelerate the system by a factor of up to >15.
Yes, from RAM to RAM, dcache will help a lot. But we neither copy from RAM to RAM nor do we uncompressing.
There is still a huge difference in memory bandwidth between using plain single write cycles versus burst mode accesses.
Don't speculate. Measure yourself!
Best regards,
Wolfgang Denk

On 05/05/2011 09:06 AM, Alexander Stein wrote:
Well, as already said this is related to CRC calculation of environment. I did a fast port to v2011.03 and the setenv is a lot faster, which is due the new env code base. But I also noticed the time until kernel_entry is called is about 30ms later after reset than on the old code base. But I didn't investigate any time further to see what caused this. But AFAICS also the new U-Boot code doesn't enable dcache on ARM1136 either.
This is wrong. The dcache is enabled on several MX31 boards. On the mx35pdk, it is not enabled.
Add CONFIG_CMD_CACHE to your configuration file, and you will get the dcache <on|off> command.
Best regards, Stefano Babic

On Thursday 05 May 2011 17:32:20 Wolfgang Denk wrote:
Dear Alexander Stein,
In message 201105030848.17576.alexander.stein@systec-electronic.com you
wrote:
This specific version was selected due to relocation problems on ARM. But I expect the dcache doesn't have that big influence on the named code part as the environment is already in RAM.
Your expectation is most likely completely wrong. Reading from / writing to uncached RAM is painfully slow compared to a system with caches turned on. And if you - as I speculate - need to checksum a huge amount of data, this will delay things without need.
Caching has a huge effect on **all** code and is the first thing I'd play with in trying to speed things up.
I have been doing some stuff to speed omap3 booting. It was taking approx 4 seconds from power up until the kernel started spewing boot messages. That is now down to less than 2 secs (including the funky omap3 romboot time, loading uboot from NAND and then loading the kernel from NAND). Only difference was turning on caching in uboot using the caching commands.
-- Charles

On Thu, May 5, 2011 at 11:10 AM, Charles Manning manningc2@actrix.gen.nz wrote:
On Thursday 05 May 2011 17:32:20 Wolfgang Denk wrote:
Dear Alexander Stein,
In message 201105030848.17576.alexander.stein@systec-electronic.com
you wrote:
This specific version was selected due to relocation problems on ARM. But I expect the dcache doesn't have that big influence on the named code part as the environment is already in RAM.
Your expectation is most likely completely wrong. Reading from / writing to uncached RAM is painfully slow compared to a system with caches turned on. And if you - as I speculate - need to checksum a huge amount of data, this will delay things without need.
Caching has a huge effect on **all** code and is the first thing I'd play with in trying to speed things up.
Yes agreed. Running U-Boot without caching is a great way to slow boot time. In fact we should turn on the L2 cache if there is one.
I have been doing some stuff to speed omap3 booting. It was taking approx 4 seconds from power up until the kernel started spewing boot messages. That is now down to less than 2 secs (including the funky omap3 romboot time, loading uboot from NAND and then loading the kernel from NAND). Only difference was turning on caching in uboot using the caching commands.
I have a Seaboard (Tegra2) network booting using a USB dongle in about 5s with console output, about 3s of which is USB and PHY delay. Turning off caching adds several seconds, mainly because the tftp is so much slower. The same would apply for any boot medium. I have a patch which displays boot time in microseconds if it is of interest to anyone:
Timer summary in microseconds:
       Mark    Elapsed  Stage
          0          0  awake
    193,298    193,298  usb_start
  1,342,411  1,149,113  eth_start
  3,767,039  2,424,628  bootp_start
  3,790,121     23,082  bootp_stop
  3,790,293        172  tftp start
  4,761,459    971,166  tftp done
  4,761,489         30  bootm_start
  4,892,145    130,656  start_kernel
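The patch itself is not shown in this thread. As a rough illustration of how a summary like this can be produced, the sketch below records named marks with absolute timestamps and derives the elapsed column from consecutive marks. All names (`boot_mark_add`, `boot_mark_elapsed`, `boot_mark_report`, `MAX_MARKS`) are hypothetical and not taken from the patch or from U-Boot; timestamps are passed in by the caller, where a real port would read a free-running hardware timer.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_MARKS 16 /* hypothetical table size */

struct boot_mark {
    const char *stage;   /* label such as "usb_start" */
    uint64_t    time_us; /* microseconds since reset */
};

static struct boot_mark marks[MAX_MARKS];
static size_t mark_count;

/* Record a stage; returns 0 on success, -1 if the table is full. */
int boot_mark_add(const char *stage, uint64_t time_us)
{
    if (mark_count >= MAX_MARKS)
        return -1;
    marks[mark_count].stage = stage;
    marks[mark_count].time_us = time_us;
    mark_count++;
    return 0;
}

/* Time between mark i-1 and mark i; mark 0 is measured from reset. */
uint64_t boot_mark_elapsed(size_t i)
{
    if (i >= mark_count)
        return 0;
    if (i == 0)
        return marks[0].time_us;
    return marks[i].time_us - marks[i - 1].time_us;
}

/* Print the table in the same three-column layout as above. */
void boot_mark_report(void)
{
    printf("%12s %12s  %s\n", "Mark", "Elapsed", "Stage");
    for (size_t i = 0; i < mark_count; i++)
        printf("%12llu %12llu  %s\n",
               (unsigned long long)marks[i].time_us,
               (unsigned long long)boot_mark_elapsed(i),
               marks[i].stage);
}
```

Feeding in the first three marks from the table (0, 193298, 1342411) yields elapsed values of 0, 193298 and 1149113, the same figures as in the second column.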
Regards, Simon
-- Charles
U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot

Hi Simon,
[...]
I have a patch which displays boot time in microseconds if it is of interest to anyone:
Timer summary in microseconds:
       Mark    Elapsed  Stage
          0          0  awake
    193,298    193,298  usb_start
  1,342,411  1,149,113  eth_start
  3,767,039  2,424,628  bootp_start
  3,790,121     23,082  bootp_stop
  3,790,293        172  tftp start
  4,761,459    971,166  tftp done
  4,761,489         30  bootm_start
  4,892,145    130,656  start_kernel
Of course we are interested, please show us the code. Such code, even if it does not get into mainline can be a good starter for other people benchmarking their port.
Thanks Detlev

How could I miss this. Of course I am interested!
Thank you, Simon
2011/5/9 Detlev Zundel dzu@denx.de
Hi Simon,
[...]
I have a patch which displays boot time in microseconds if it is of interest to anyone:
Timer summary in microseconds:
       Mark    Elapsed  Stage
          0          0  awake
    193,298    193,298  usb_start
  1,342,411  1,149,113  eth_start
  3,767,039  2,424,628  bootp_start
  3,790,121     23,082  bootp_stop
  3,790,293        172  tftp start
  4,761,459    971,166  tftp done
  4,761,489         30  bootm_start
  4,892,145    130,656  start_kernel
Of course we are interested, please show us the code. Such code, even if it does not get into mainline can be a good starter for other people benchmarking their port.
Thanks Detlev
-- "One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.
-- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-40 Fax: (+49)-8142-66989-80 Email: dzu@denx.de

Hi,
I will post a patch. Also we have a 'time' command we could upstream.
Regards, Simon
On Mon, May 9, 2011 at 3:49 AM, Detlev Zundel dzu@denx.de wrote:
Hi Simon,
[...]
I have a patch which displays boot time in microseconds if it is of interest to anyone:
Timer summary in microseconds:
       Mark    Elapsed  Stage
          0          0  awake
    193,298    193,298  usb_start
  1,342,411  1,149,113  eth_start
  3,767,039  2,424,628  bootp_start
  3,790,121     23,082  bootp_stop
  3,790,293        172  tftp start
  4,761,459    971,166  tftp done
  4,761,489         30  bootm_start
  4,892,145    130,656  start_kernel
Of course we are interested, please show us the code. Such code, even if it does not get into mainline can be a good starter for other people benchmarking their port.
Thanks Detlev

Dear Simon Schwarz,
In message BANLkTi=HfOsg73_GCC=4aVoi_XUNbcf9nQ@mail.gmail.com you wrote:
I just started to work on my bachelor thesis. It is about "Linux boot-up time optimization". The past days I spend analyzing what consumes the most time in the boot process.
A lot of effort has already been spent in this area. A ton of specialized solutions exist, some of them even documented. I highly recommend studying the available information first. A good starting point might be this page: http://www.elinux.org/Boot_Time
I found that u-boot takes pretty much as long as the whole Linux kernel (the one we are using).
This depends on a lot of things: architecture, boot device, memory architecture, etc. etc.
I started digging into the source and I think I have a big picture of what is going on. I already learned from the mailing list that it is a good idea to start a discussion early if you plan to change something and want it upstream. At this point of my thesis I'am free to choose where I start - only string attached is that if it is platform specific it has to be TI OMAP3.
So here is my question: Where do you see the most potential to optimize u-boot?
I already have two bullets on my list (just some ideas - maybe totally unrealistic *g*):
- Use Hardware specific copy commands
- build the checksum while moving the kernel to RAM
Eventually you will find that these are not _that_ important.
A few thoughts, mostly unsorted:
- Booting fast means running fast, so make sure your CPU can run as fast as possible: choose an optimized clock configuration; choose optimized memory timings; make sure all caches are enabled.
- Embedded processors usually have tiny caches, so performance is often limited by the memory bandwidth: a device with a 32 bit wide memory bus will be faster than one with a 16 bit bus.
- Memory technology is important: DDR3 is much faster than plain old SDRAM.
- Choose the storage device wisely: using NOR flash as boot device allows direct execution of the code from the NOR (often using a 32 bit wide bus); if you use NAND flash as storage you will have a relatively slow download and a way more complicated protocol for booting.
- Benchmark your system before you make assumptions. For example, it is not possible to decide in general if a compressed or an uncompressed Linux kernel image is a better choice - if you have a fast bus and a slow CPU, then the CPU performance for uncompressing the code may be the bottleneck, and an uncompressed kernel may be way faster; if you have a fast CPU and a relatively slow storage device it may be much more efficient to load the smaller, compressed kernel image, and the CPU will still be mostly idle uncompressing it.
- Keep in mind that you are always trading effort and costs, boot time, and reliability/security against each other. You always get a maximum of two items, never all three (good, cheap, fast). If you optimize for speed, you may have to accept not only higher development effort, but for example also higher production costs (like when needing bigger boot devices for uncompressed images or file systems), and you may have to give up security/reliability, for example by switching off the checksum protection of the images.
- Optimize your configuration: U-Boot in its general configuration is more of a bring-up and debug tool, combined with a powerful machine for implementing fancy features like software updates and such, but not optimized at all for fast booting.
Removing unused code (both in U-Boot and Linux) makes the memory footprint smaller and thus loading faster.
- Apply the old Antoine de Saint-Exupery aphorism: "Perfection is reached, not when there is no longer anything to add, but when there is no longer anything to take away." For example when booting from NAND flash, we have a number of steps: some ROM code loads a small block of code from NAND into some (usually limited) memory; this code (in U-Boot terms the nand_spl code) will load the rest of U-Boot and start it; U-Boot will then initialize more of the system and relocate itself to the upper end of the RAM; then U-Boot will load the Linux kernel and start it.
This allows for a powerful and very flexible system, but how much of this power and flexibility do you really need on a system where your primary goal is to boot fast, i. e. to make U-Boot get out of the way as fast as possible?
If you want to boot fast, why do you go through all these steps, then? Why does not your nand_spl code load the Linux kernel directly and start it, instead of wasting time with all these other steps and phases? [Note that this is the approach taken by all these systems that report sub-second boot times.]
- Don't stop the optimization with the loading of the Linux kernel. The choice of the root file system type, the strategy for driver initialization (immediate with statically linked drivers versus lazy with dynamically loaded modules) and the strategy for starting software services are at least as important.
Keep in mind that on most systems it's trivial (with standard U-Boot) to have Linux starting the first user space code within 5 or 6 seconds after power-on. But on many systems the application startup will take much longer than that.
- And on each step of your path benchmark again, and focus on the hot spots. It makes little sense to spend a week of effort on reducing the execution time of function foo() to 5% of its original value, when this function contributes only 1% to the total boot time.
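Two of the points in this list can be put into numbers. The sketch below is a deliberately simple model with hypothetical function names: `image_ready_seconds` adds transfer time and decompression time (it ignores any overlap between loading and decompressing), and `boot_time_saved` gives the fraction of total boot time saved when one function is sped up.

```c
/* Seconds until a kernel image is in RAM and ready to run:
 * transfer time from storage plus decompression time
 * (pass 0.0 for an uncompressed image). */
double image_ready_seconds(double image_mib, double storage_mib_per_s,
                           double decompress_seconds)
{
    return image_mib / storage_mib_per_s + decompress_seconds;
}

/* Fraction of total boot time saved when a function that accounts for
 * `share` of the boot is reduced to `remaining` of its original run time. */
double boot_time_saved(double share, double remaining)
{
    return share * (1.0 - remaining);
}
```

With made-up figures, a 2 MiB compressed image that takes 0.5 s to unpack beats a 4 MiB uncompressed one on 2 MiB/s storage (1.5 s versus 2.0 s), but loses on 40 MiB/s storage (0.55 s versus 0.1 s). For the foo() example, `boot_time_saved(0.01, 0.05)` is 0.0095: a week of work buys less than 1% of the boot time. Real numbers have to come from measurement on the target.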
Hope this helps.
Best regards,
Wolfgang Denk
participants (10)
- Alexander Stein
- Charles Manning
- Detlev Zundel
- Eric Cooper
- Mike Frysinger
- Simon Glass
- Simon Schwarz
- Stefano Babic
- Tabi Timur-B04825
- Wolfgang Denk