



or1200 execution units
by Unknown on Feb 16, 2004
I'm currently reviewing the gcc target backend. I'm looking at the
function unit definitions. I'm trying to work out how they correspond to the hardware, and I want to make sure I understand the hardware right. Here's how I understand the or1200 implementation, please correct and enhance. There are these independent units (optional ones in[]) : - ALU (shift, add, compare, move, logic, extend [, mul, mac]) - LSU (load, store) - SYSTEM (mfspr, mtspr) One instruction per cycle is decoded. All units are able to accept one instruction per cycle. All SYSTEM and ALU instructions except for "mul" have their results ready after one cycle. "mul" and "mac" instructions have their results ready after three cycles. "load" instructions are complete after 2 cycles on a cache hit. ???: On a miss (or no cache), the load will complete after (Mem access time + 1) cycles. "store" instructions are complete after 1 cycle on hit, or (Mem access time) cycles on a miss. Things I'm not clear on: - How does the MMU affect LSU execution time? - Can the LSU accept another load/store while a load/store is currently in execution? (Would a store instruction have to wait, like, 10 cycles when a load instruction missed?) Based on these assumptions, this is the gcc function unit definition I'm proposing: (define_function_unit "alu" 1 0 (eq_attr "type" "shift,add,logic,extend,move,compare") 1 1) (define_function_unit "alu" 1 0 (eq_attr "type" "mul") 1 3) (define_function_unit "lsu" 1 0 (eq_attr "type" "load") 2 2) (define_function_unit "lsu" 1 0 (eq_attr "type" "store") 1 1) I'm also pondering how to implement an option to freely specify LSU latencies from the gcc command line for cache-less implementations. Heiko |
or1200 execution units
by Unknown on Feb 17, 2004
Pretty much everything you said in your original email is correct.

> "mul" and "mac" instructions have their results ready after three cycles.

"mac" effectively takes one clock cycle if there is no data hazard; otherwise it takes three.

> Things I'm not clear on:
> - How does the MMU affect LSU execution time?

It doesn't, except in some rare cases when you cross from one page to the next during fetch. This happens rarely, and the 8KB pages help here. When a page crossing occurs, it takes one additional clock cycle to fetch the first instruction on the next page.
> - Can the LSU accept another load/store while a load/store is currently
>   in execution? (Would a store instruction have to wait, like, 10 cycles
>   when a load instruction missed?)

Unfortunately this is a scalar design, and as such every instruction has to execute in order (with the exception of "mac"). So if a load is delayed in execution, everything behind the load has to wait.

> Based on these assumptions, this is the gcc function unit definition
> I'm proposing: [...]

> I'm also pondering how to implement an option to freely specify LSU
> latencies from the gcc command line for cache-less implementations.

This would be nice, although I don't think it would be of much use: code scheduled with the above rules would already be optimal for cache-less systems anyway. So what is the point?

BTW, what would be excellent is if you had a look at code size. It seems the current toolchain is not very efficient when it comes to code size. I know that folks have used a MIPS compiler in the past and mapped the generated code to openrisc; the two architectures are similar enough that this is possible. I hear that the MIPS compiler generates smaller code than GCC. So investing in the direction of code size would be very valuable (I know a couple of openrisc-based ASIC projects where code size is very important...).

regards,
Damjan
_______________________________________________
http://www.opencores.org/mailman/listinfo/openrisc



