Hi, I could not find anything on how data should be presented with respect to the phi1 clock. Your document seems to say that data enters the design on a phi2 clock edge, but the provided testbench uses a phi1 process to drive the data inputs (with no input delay). In my testbench, the design works correctly only when the data input changes on the phi2 rising edge with a small input_delay. Can you give me some clues here?
Here are the warnings reported by the SDC parser of my FPGA tool:
WARNING LOAD_SDC: Block "datain0" is not related to clock "phi1". Please check the SDC file validity.
... ... (the same warning is repeated for every bit of the data bus)
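
To make the question concrete, this is roughly the kind of input constraint I would expect if the data is meant to be launched relative to phi2 rather than phi1. It is only a sketch to illustrate the question, not the constraint file shipped with the design: the 100 ns period, the delay values, and the `datain*` port pattern are my assumptions.

```tcl
# Hypothetical sketch: period, delays and port names are placeholders,
# not values taken from the design or its documentation.
create_clock -name phi2 -period 100.0 [get_ports phi2]

# Constrain the data bus inputs relative to phi2 instead of phi1.
set_input_delay -clock phi2 -max 5.0 [get_ports {datain*}]
set_input_delay -clock phi2 -min 1.0 [get_ports {datain*}]
```

If the data bus is really supposed to be related to phi1, it would help to know which edge it is sampled on, so that the input delay can be referenced to the right clock.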