On ARM conditional instruction execution
Brane2 wrote:
Actually, nothing is lost except perhaps code space to encode instructions in, and it does have some rather interesting ways of simplifying code.
I don't see how "nothing is lost". Dead instructions still have to be decoded, if nothing else. So, you have to have a decoder faster than the rest of the chip. That is, the decoder has to be able to decode more instructions per clock than the rest of the chip is able to retire.
And that can go only so far. Maybe one or two instructions extra, if even that. So you can't have many "dead by condition" instructions in the loop, or your pipeline will start to see some serious bubbles.
Hm. I really don't see where this 'dead instruction' idea comes from, as most of the time instructions are live, not dead, i.e. the condition for execution is 'Always'. This just means the bit field denoting the condition is mostly a constant within each instruction word.
There is absolutely no reason not to decode such an instruction, as the condition evaluation takes place during instruction execution, namely before the end result is written (either truly or effectively, depending on the actual core implementation). This is just like decoding a condition test or a conditional jump: it has to be decoded anyway for the condition to be evaluated. When looking at the basics of the architecture, one should look at the first implementation - so no condition prediction, no instruction folding, no superscalarity etc, just a basic fetch - decode - execute pipeline, where 'execute' may be further divided into more pipeline stages. The condition must be evaluated at instruction execution to ensure the flags it depends on have been correctly set by any previous instructions, hence there is really no difference in pipeline flow whether the condition is met or not - if it is not, the instruction effectively becomes a NOP. Obviously, in this simple implementation, there is an energy penalty, as the instruction is really executed; the result is just either stored where it is supposed to go, or discarded.
Again, in this simple model, any deviation from pure linear program flow is to be avoided for best efficiency. When conditional instruction execution is used in clever code, unlike even the simplest 'skip next instruction', the program flow remains linear, preventing any pipeline stall or flushing of pipelined instructions that would not be executed if a condition is met and a skip occurs. From the standpoint of a simple RISC, it really does not matter how far the PC moves: the moment it does anything other than simply increment, you are stuck doing 'garbage collection' on your pipeline.
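To make the 'mostly live' point concrete, here is a minimal sketch (classic 32-bit ARM syntax; the registers and the operation are chosen purely for illustration) of a small if/else compiled entirely with conditional execution, so the program flow never leaves the straight line:
        CMP     r0, #0             ; set the flags from r0
        ADDEQ   r1, r1, #1         ; executes only if r0 == 0
        ADDNE   r2, r2, #1         ; executes only if r0 != 0
; no branch anywhere - whichever instruction fails its condition simply
; passes down the pipeline as a NOP, so there is never anything to flush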
Also, one should remember that there are various (usually compiler-implemented) tricks that can be used to prevent a pipeline stall or flush. A classic example would be a loop that breaks when a condition is met. Basically, it ends with a 'jump back to the beginning of the loop if the break condition is not met', and at that point you have N more instructions waiting in the pipeline, which would normally have to be thrown away. Certainly, if handling the jump's effect on the pipeline could be avoided, it would simplify instruction fetch and decode a lot. So, assume it is not handled in hardware at all, and that N=3. The proper way of encoding this loop would be the following:
; Loop start
Instruction 1
Instruction 2
Instruction 3
Loop re-entry:
<remainder of the loop>
Jump to loop re-entry if condition for break not met
Instruction 1 if condition for break not met
Instruction 2 if condition for break not met
Instruction 3 if condition for break not met
; End of loop
What happens here?
Since N=3, 3 instructions after the jump instruction are already pre-fetched into the pipeline at the point where the jump is executed. In order to correctly execute the loop, the first 3 instructions of the loop are duplicated after the jump instruction, so that at the point the jump is executed they will already be pre-fetched and can be executed just as they would be in a traditional CPU if the jump went back to the loop start. Instead, it goes back to 3 instructions after the start - exactly the same 3 that are already pre-fetched - so nothing needs to be done to the pipeline despite the jump.
However, these 3 instructions are conditional, and the condition is the same as the condition to jump. This is done because once the loop breaks, the 3 instructions following the jump are again pre-fetched and will execute, even though they shouldn't, as they are really part of the loop code. However, since the condition for the loop break has been met, these instructions actually become NOPs, and although they use up some clocks, again nothing needs to be done to the pipeline.
The premise is, of course, that the loop goes round many times while it breaks only once, so this additional 3-clock overhead sits outside the loop and consumes negligible time compared to the loop itself.
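To make the listing above a little more concrete, here is a sketch of the same scheme in ARM-like syntax for a loop that accumulates words from memory. The registers, the choice of instructions 1-3 and the 3-deep pipeline are assumptions for illustration, and as above the premise is a core that does not clean up the pipeline after a taken jump at all - so this illustrates the scheme, not what any particular real ARM does:
; r0 = running sum, r1 = data pointer, r2 = iteration count (all assumed set up beforehand)
loop_start:
        LDR     r3, [r1], #4       ; instruction 1: fetch the next word, advance the pointer
        ADD     r0, r0, r3         ; instruction 2: accumulate
        SUBS    r2, r2, #1         ; instruction 3: decrement the count, set the flags
loop_reentry:
        <remainder of the loop>
        BNE     loop_reentry       ; jump back while the count has not yet reached zero
        LDRNE   r3, [r1], #4       ; duplicate of instruction 1, conditional on 'no break'
        ADDNE   r0, r0, r3         ; duplicate of instruction 2, conditional on 'no break'
        SUBNES  r2, r2, #1         ; duplicate of instruction 3 (SUBSNE in UAL syntax)
; end of loop
While the loop keeps going, the three conditional duplicates after the BNE are exactly the instructions already sitting in the pipeline, so they are simply allowed to complete; once the count reaches zero the NE condition fails and they retire as NOPs. (The remainder of the loop is assumed not to disturb the flags set by the SUBS.)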
Now, if it were me designing the CPU, and I found that I could just about squeeze the condition codes into every instruction (perhaps losing a bit or two that could maybe get me a few more seldom-used instructions into a single word), I would find it a great trade-off if it meant my pipeline remains simple and needs no hardware to deal with stalls or flushes.
As you can see, the decode unit is not required to run any faster than the rest of the CPU; in fact, things stay nice and in sync. At least until an interrupt comes along...
Another thing to keep in mind here: some of these actually rather simple techniques are the IP of certain CPU manufacturers, and in some cases chip designers had to work around this. One interesting piece of trickery like that is used by the SPARC processor (and others, I'm told) in order to keep instructions executing at a 1 clock per instruction rate and keep the pipeline 'flat'. 'Load 32-bit constant into register' is the instruction of choice here.
Normally, this would be an op-code followed by a 32-bit constant, which is of course a 'special case', considering all other instructions take only one instruction word with all the data contained within. Well, some clever guy at Sun figured to abolish this instruction altogether and instead have a 'load high half' instruction, because the instruction code and a 16-bit constant fit into a single 32-bit instruction word.
So, 'load Rn with #$12345678' becomes 2 instructions: LOAD.H Rn, #$1234 (the high half) followed by LOAD.L Rn, #$5678 (the low half).
The nice thing being that each instruction fits in one 32-bit instruction word and executes in 1 clock - no special cases, and no need to meddle with the pipeline.
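For comparison, much later ARM cores (from ARMv6T2 onwards, so well after the first-generation parts discussed here) ended up with essentially the same pair of instructions; a small sketch with an arbitrary register and constant:
        MOVW    r0, #0x5678        ; r0 = 0x00005678 - 16-bit immediate into the low half, top half cleared
        MOVT    r0, #0x1234        ; r0 = 0x12345678 - 16-bit immediate into the top half, low half kept
Each one is a single 32-bit instruction word with the halfword constant embedded in it, so nothing odd-sized ever enters the pipeline.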
RISC code abounds with things like this, which would not be immediately clear to a regular assembly programmer; nor would a 'feature' intended to implement them be immediately obvious until one looks at some real code. In fact, where the manufacturer provides a means to program the CPU in assembler, there are pseudo-instructions or macros that actually convert to the proper instructions and hide the trickery even at the assembly level.
One which we actually found very useful is saturation arithmetic, normally a rather bothersome thing for standard CPUs.
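As a minimal sketch of why (classic 32-bit ARM again, registers chosen arbitrarily), an unsigned saturating add needs nothing more than a conditional move after the add:
        ADDS    r0, r0, r1         ; add and set the flags; a carry out means the sum wrapped around
        MVNCS   r0, #0             ; if the carry is set, clamp the result to 0xFFFFFFFF
Two linear instructions, no compare-and-branch around a clamp value - which is exactly the kind of flow change the whole scheme tries to avoid.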
Yes, but:
1. You can always find that one instruction that makes your life in that particular case easier.
True, but that's exactly the point I meant with the 'feature' actually being a side-effect rather than something intended for clever programming, even though it can be used for that sort of thing. Keep in mind we are really talking 1st-gen RISC here, and they were designed to be simple hardware-wise, not clever for programming.
2. Saturation logic is a discipline for FPGAs, or at least vector computing. If you want to intensely massage some bigger fields, you usually get yourself some chip with a decent vector unit.
3. Your example uses conditional execution of one instruction in the loop. This offers very limited possibilities, and yet eats precious bits in every instruction word...
See above - besides, keep in mind the history of the ARM. Its first implementation was intended to do pretty much everything in a computer, rather reminiscent of Sir Clive's designs. No fancy new technologies then, but you still needed to manipulate bits and do a sort of DSP for graphics and sound, and IIRC, it did a rather good job, too.
1. With MIPS you just conditionally jump over the next instruction, with a similar result.
2. Other designs can detect a branch to either the same cache line or the one prefetched after it, with a similar result.
Again, see above - the idea behind this is way older than branch prediction and is in fact geared towards simplifying the operation of the pipeline by avoiding 'traditional' jump implementations, in order to keep the pipeline full. In essence, the idea is to avoid the need to clean up the pipeline (hence no need for any hardware to do it) by avoiding actual program flow changes whenever possible.
Yes and no - it depends on how you go about the whole thing. If you are counting on various fast memory units (which includes caches) inside the chip itself in order to speed up critical code execution, then you are fresh out of luck with this approach, as you get your RAM pre-organized by the manufacturer of the chip.
Yes, but you could modify your approach for that. The CPU can usually signal the nature of a fetch. So, if it is an instruction fetch and you sense that in a 16-byte group, say, the first two longwords are native code and the last two are not, you could simply insert TRAP instructions there and fill the appropriate table with the original opcodes of the replaced instructions.
If the CPU then executes only legitimate code, everything will go "as usual". If it tries to execute trapped code, it will trap, read from the table what the trapped opcode was, and then decide what to do.
True, but in fact you are to an extent duplicating the trap mechanism, and not exactly with trivial hardware. This sort of thing would be a bonus if you were building a machine primarily intended to emulate other machines or CPUs. With 6.5 as the approach, the idea is to eventually migrate as much as possible to native code, hence your new programs (or routines, modules etc.) would in the end all start with TRAP (native code) and not get out of that mode until they end - and here we could just as well be talking about a whole application.
On the other hand, basing the OS API on TRAPs (or equivalent) provides a well-defined line where the application (user) crosses into OS (supervisor/privileged) mode. This line is absolutely crucial to QDOS/SMSQ and is in fact the way the most basic and IMHO most important idea of that OS is implemented - using the lowest possible hardware-based mechanism in the CPU itself to provide atomicity, time slicing and real-time functions. This is in fact the way QDOS/SMSQ avoids the use of semaphores and other traditional arbitration schemes, effectively putting them into the domain of hardware, where they cannot be unduly influenced by software. From this standpoint, the overhead of the TRAPs is not only unavoidable but necessary.
One thing this mechanism offers that is not very obvious is based on the fact that a TRAP implies a machine state change. Mostly we look at it as a change from user to supervisor mode, but if one looks a bit deeper, the idea is based on changing from one virtual CPU to another. This, not so obviously, implies that the concept can be expanded so that the changeover does not have to be between virtual CPUs but between real ones. And further, why assume the CPUs should be the same, since it is really only the API definition that tells us how parameters and results (i.e. information) are passed from one state to the other? This 'portal' of sorts is the basic foundation of QDOS/SMSQ, and as such has been there from the very beginning, which makes this OS much more easily adaptable to multiprocessing, be it virtual or real.
Funnily enough, the absence of an MMU also makes this easier: memory management in multiprocessor systems is no small feat, and doing it across several different CPUs, or even different virtual ones (such as a 'native' and an 'emulated' CPU), is harder still.
It has been my experience in designing several platforms since early 2003 that development systems are consistently getting worse in quality, although they try to peddle quantity (of options... which mostly do not work!) as a selling point.
Also, because synthesis tools are largely re-used, there is less and less user control over how logic is actually compiled and fitted into an FPGA, and sometimes the actual (not the marketing department's!) performance of a compiler is abysmal - for instance, it may be unable to properly use the very extra features that make your chosen FPGA the best fit for the task at hand, namely implementing a CPU. Unfortunately, there is really little one can do about this.
Just out of curiosity - wouldn't some open-source IDE solve this problem? I know FPGA manufacturers keep the bitstream structures secret. But bitstream generators are usually available as a separate executable inside their tool. All I would need for simple projects is some graphical tool with the chip's layout so I could preset LUTs etc.
IOW, I could use FPGA schematics with predefined native elements (without the switch matrix) and then place and route them inside the FPGA. With all elements predefined, I could then just employ the closed-source bitstream generator and get myself a final bitstream...
Manufacturers also unfortunately keep routing resource details and configuration bit mapping a secret. So... no go, at least as far as I have been able to see. Simply put, if you want it fast, you need to pay for the next bigger and faster FPGA. Then when you finish development, hope and pray it compiles for a smaller FPGA.
Older CPLDs normally had a much simpler structure, so you could 'massage' things into them with a few properly placed directives and locks, but for truly complex logic (and not just lots of simple copy-paste stuff), FPGAs are it, and you are stuck with whatever tools you can get for the money at your disposal.