For the parallel implementation, the Figure 2 timing diagram in the PDF file shows that loading data_i (n bits wide) takes n clock cycles. Shouldn't the n-bit data_i be loaded in a single clock, with the parallel CRC produced in the next clock cycle?
Oh, I think I see my confusion now. It looks like the diagram shows MULTIPLE n-bit data_i words being clocked in one after another, with the CRC available after the last word has been clocked in.
You are absolutely right!
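For anyone else who trips over the same diagram, here is a rough Python model of that reading: one n-bit word is absorbed per clock, and the CRC is read after the last word. It is only a sketch under my own assumptions, not the generator's actual HDL. The function names are made up for illustration, and the polynomial 0x1021 with a zero initial value (CRC-16/XMODEM) is just a convenient example, not something taken from the PDF.

```python
def crc_serial_step(state, bit, poly, width):
    """Advance the CRC register by one input bit (one clock of a serial design)."""
    msb = (state >> (width - 1)) & 1
    state = (state << 1) & ((1 << width) - 1)
    if msb ^ bit:
        state ^= poly
    return state


def crc_parallel_step(state, word, n, poly, width):
    """One clock of an n-bit parallel design.

    Hardware flattens these n serial steps into a single block of XOR logic,
    so the whole word is absorbed in one clock. The loop here only models
    that combinational logic; it does not represent n clocks.
    """
    for i in reversed(range(n)):  # most significant bit first (an assumption)
        state = crc_serial_step(state, (word >> i) & 1, poly, width)
    return state


def crc_of_words(words, n, poly, width, init=0):
    """Clock in several n-bit words, one per clock, and return the final CRC.

    The trace records the register value after each clock; only the value
    after the last word is the CRC of the whole message.
    """
    state = init
    trace = []
    for clock, word in enumerate(words, start=1):
        state = crc_parallel_step(state, word, n, poly, width)
        trace.append((clock, state))
    return state, trace


if __name__ == "__main__":
    message = b"123456789"  # nine 8-bit words, so nine clocks
    crc, trace = crc_of_words(list(message), n=8, poly=0x1021, width=16)
    for clock, value in trace:
        print(f"clock {clock}: crc register = 0x{value:04X}")
    print(f"final CRC after {len(trace)} clocks: 0x{crc:04X}")
```

So a message of k words takes k clocks in total, which is why the diagram shows several data_i beats before the CRC is marked valid. Feeding the same bits as 4-bit words would take twice as many clocks and end at the same CRC.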