Branch Target Alignment in VLIW Assembler
Variable length instruction formats with EOP (End of Packet) bit or similar techniques will be implemented in VLIW architectures for reducing code size. This article describes the potential problem that arises with branch target non-alignment with physical memory boundary. The paper [1] describes the problem and the technique to resolve it. I have provided my interpretation of problem understanding and the solution with examples.
The VLIW assembler of is responsible for making branch target alignment decisions while laying out the code in memory. Enforcing all branch targets to be aligned to packet boundaries may blow up the code size by a large amount due to wasted bits. On the other hand, for frequently visited branch targets outside the sequential flow of control, the accumulated stall penalty due to the target instruction not being contained within one instruction packet may be substantial.
Look at the code with one branch instruction and 9 instructions between branch and branch target instruction. Assuming the VLIW architecture has 4-slot and variable length instructions.
The VLIW assembler of is responsible for making branch target alignment decisions while laying out the code in memory. Enforcing all branch targets to be aligned to packet boundaries may blow up the code size by a large amount due to wasted bits. On the other hand, for frequently visited branch targets outside the sequential flow of control, the accumulated stall penalty due to the target instruction not being contained within one instruction packet may be substantial.
Look at the code with one branch instruction and 9 instructions between branch and branch target instruction. Assuming the VLIW architecture has 4-slot and variable length instructions.
In PMBATA (Program Memory with Boundary Aligned Branch Targets) the nops were inserted by the assembler in memory to place Target at the boundary. Whereas in PMBNABT (Program Memory with Boundary Not Aligned Branch Targets), the nops were eliminated reducing the code size. But in PMBNABT the Target and instruction slots 11, 12, 13 which should be executed in parallel (as per the assembly program) were spread across two words in program memory which takes two clocks to load rising a stall. This stall effect will be predominant for most frequently visited targets those falls under this case.
Heuristic Approach to Reduce Stall Penalty [1]: A profile driven heuristic can be used to ensure that the most frequently executed targets do not cross a packet boundary. This eliminates the branch penalty for such targets requiring minimal increase in code size. In this approach, the compiler needs to profile the program and annotate the assembly code with the frequency with which each branch target is visited via a taken branch. (If compiler is not there user can write a script file that can scan the user assembly code and give this information) Before starting the actual process of assembly, the assembler first sorts all branch targets from the highest to the lowest dynamic frequency and then classifies them one-by-one as not permitted to cross a packet boundary (non-crossing) by keeping track of two cumulative values: dynamic fraction of targets already classified as non-crossing (initially 0.0) and static fraction of targets that may potentially cross a packet boundary (initially 1.0). At the point when the dynamic fraction of non-crossing targets is equal or larger to the static fraction of potentially crossing targets, the process stops and all the remaining unprocessed targets are classified as potentially crossing a packet boundary.
Example: (This is my interpretation of the Approach. Please cross verify once)
Let us say an assembly program has 5 branch targets and their frequency is as shown below.
(Frequency= 4 means this is a target for four different branches)
-------------------------
Heuristic Approach to Reduce Stall Penalty [1]: A profile driven heuristic can be used to ensure that the most frequently executed targets do not cross a packet boundary. This eliminates the branch penalty for such targets requiring minimal increase in code size. In this approach, the compiler needs to profile the program and annotate the assembly code with the frequency with which each branch target is visited via a taken branch. (If compiler is not there user can write a script file that can scan the user assembly code and give this information) Before starting the actual process of assembly, the assembler first sorts all branch targets from the highest to the lowest dynamic frequency and then classifies them one-by-one as not permitted to cross a packet boundary (non-crossing) by keeping track of two cumulative values: dynamic fraction of targets already classified as non-crossing (initially 0.0) and static fraction of targets that may potentially cross a packet boundary (initially 1.0). At the point when the dynamic fraction of non-crossing targets is equal or larger to the static fraction of potentially crossing targets, the process stops and all the remaining unprocessed targets are classified as potentially crossing a packet boundary.
Example: (This is my interpretation of the Approach. Please cross verify once)
Let us say an assembly program has 5 branch targets and their frequency is as shown below.
(Frequency= 4 means this is a target for four different branches)
Target Frequency
-------------------------
BT-1 2
BT-2 3
BT-3 4
BT-4 1
BT-5 6
-------------------------
After sorting the targets will be rearranged as,
-------------------------
Target Frequency
-------------------------
BT-5 6
BT-3 4
BT-2 3
BT-3 2
BT-4 1
-------------------------
Now the fractions df-non-crossing (dynamic fraction of targets already classified as non-crossing) and sf-crossing (static fraction of targets that may potentially cross a packet boundary) will be initialized to 0.0 and 1.0 respectively. Now the following tables show the results of iterations.
Iteration-0: df-non-crossing = 0.0 and sf-crossing = 1.0 df-non-crossing >= sf-crossing is false mark the current instruction as non-crossing.
Iteration-0: df-non-crossing = 0.0 and sf-crossing = 1.0 df-non-crossing >= sf-crossing is false mark the current instruction as non-crossing.
---------------------------------------------
Target Frequency Result
---------------------------------------------
BT-5 6 non-crossing
BT-3 4
BT-2 3
BT-3 2
BT-4 1
---------------------------------------------
Update fractions: df-non-crossing = (6/16 = 0.375) and sf-crossing = (4/5 = 0.8)
Iteration-1: df-non-crossing = 0.375 and sf-crossing = 0.8 df-non-crossing >= sf-crossing is false mark the current instruction as non-crossing.
---------------------------------------------
Target Frequency Result
---------------------------------------------
BT-5 6 non-crossing
BT-3 4 non-crossing
BT-2 3
BT-3 2
BT-4 1
---------------------------------------------
Update fractions: df-non-crossing = (10/16 = 0.625) and sf-crossing = (3/5 = 0.6)
Iteration-2: df-non-crossing = 0.625 and sf-crossing = 0.6 df-non-crossing >= sf-crossing is true mark the current instruction as non-crossing.
---------------------------------------------
Target Frequency Result
---------------------------------------------
BT-5 6 non-crossing
BT-3 4 non-crossing
BT-2 3 crossing
BT-3 2 crossing
BT-4 1 crossing
---------------------------------------------
Next, the assembler assembles the code assigning addresses to each instruction in sequential order. If a branch target instruction that has been classified as non-crossing is found to cross a packet boundary, it is aligned to the next packet boundary and the EOP bit is set in the preceding instruction. This heuristic essentially tries to achieve an even balance between dynamic fraction of non-crossing targets (causing increase in code size) and static fraction of potentially crossing targets (causing branch penalty) to effectively trade off branch penalty and code size.
[1] Automatic Design of VLIW and EPIC Instruction Formats
Shail Aditya, B. Ramakrishna Rau, Richard Johnson*Compiler and Architecture Research HP laboratories Palo Alto
Comments
Post a Comment