Architecture
by gil_savir on Sep 29, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021
Hi all,
Let us discuss architecture issues on this thread.

from "HW/SW partitioning" thread:
Then we may have agreements on the following points:
...
2) the basic encoding unit for HW is a macroblock. Should we encode several macroblocks (the number of MBs can be configured by SW) at one time to enhance throughput? A different QP for different MBs may improve coding efficiency, but at the cost of more complex HW.

This might be tricky since some of the prediction algorithms may require the previous macroblock to be already encoded. It will require some kind of "MB hazard" prevention mechanism.

Maybe we should begin to draft out an architecture definition including the following:

1) basic parameters (technology process, expected frequency, throughput, max supported picture size...)
2) register definition
3) encoder operation flow (how HW and SW cooperate)

4) OpenRISC CPU specifications (I/D memory sizes, I/D Caches, interrupt controller, etc.)
5) ...
Architecture suggestion
by gil_savir on Oct 1, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021
Hi all,
I have an idea for an initial architecture. Since we are to implement a large portion of x264 in HW, it seems to me that in the first stage the HW IP core should be designed to be as independent as possible. This way, extending it and implementing it with different processors would be possible. The second stage should be the implementation of the H.264 Encoder IP core within an SoC based on the OpenRISC processor and some more relevant OpenCores peripherals (see attached figure).

For HW/SW i/f I suggest using descriptors and interrupts. The descriptor will contain info such as raw data start-address and size, destination address for encoded data, etc.

This will work as follows:
1. CPU: writes descriptor in memory.
2. CPU: writes descriptor address on the core (through WB slave).
3. CPU: asserts register bit "encode now" on the core (through WB slave).
4. CORE: fetches next raw MB/slice/frame from memory (through WB master) according to start address from the descriptor.
5. CORE: encodes MB/slice/frame and wraps it in NAL format.
6. CORE: writes the encoded MB/slice/frame to memory (through WB master) according to destination address from the descriptor.
7. CORE: repeats steps 4-6 until raw data is consumed.
8. CORE: updates relevant descriptor fields (encoding OK/encoding error, bytes written, etc.).
9. CORE: generates interrupt to inform CPU that the raw data is consumed.
10. repeat steps 1-9.

note that CPU steps 1 and 2 can be "pipelined" during steps 4-9.
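To make this a bit more concrete, here is a minimal C sketch of the CPU side, assuming a hypothetical register map and descriptor layout (none of the names, offsets or fields below are decided yet):

#include <stdint.h>

/* Placeholder descriptor layout and register map -- illustration only. */
struct h264_desc {
    uint32_t raw_addr;   /* start address of the raw YUV data          */
    uint32_t raw_size;   /* size of the raw data in bytes              */
    uint32_t dst_addr;   /* where the core writes the encoded NAL data */
    uint32_t status;     /* filled in by the core in step 8            */
};

#define H264_BASE            0x90000000u   /* hypothetical WB slave base address */
#define H264_DESC_ADDR       (*(volatile uint32_t *)(H264_BASE + 0x04))
#define H264_CTRL            (*(volatile uint32_t *)(H264_BASE + 0x08))
#define H264_CTRL_ENCODE_NOW 0x1u

void encode_chunk(struct h264_desc *d)
{
    /* step 1: the descriptor has already been written to memory by the caller */
    H264_DESC_ADDR = (uint32_t)(uintptr_t)d; /* step 2: tell the core where it is */
    H264_CTRL     |= H264_CTRL_ENCODE_NOW;   /* step 3: kick off encoding         */
    /* steps 4-9 run in the core; the interrupt handler would later read
       d->status and pass the encoded buffer on to its consumer. */
}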

please comment, add, fix, improve, disapprove, ...

- gil
Architecture cont'
by gil_savir on Oct 1, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021
Attached below is an initial block scheme for the design, with the following blocks:

- Data bus master: i/f with data bus.

- master memory ctrl: (probably) state machine for fetching raw data from data bus and
transmitting encoded data to the data-bus.

- H.264 encoder core: receives raw data as input. Performs inter/intra prediction, transform (T), quantization (Q), inverse quantization (Q'), inverse transform (T'), and deblocking filtering. Maintains the decoded-picture (reference) buffer. Outputs its data to the entropy encoder.

- entropy encoder: gets data from the H.264 encoder core and performs entropy encoding on it.

- NAL wrapper: receives encoded data from the entropy encoder and relevant info from the H.264 encoder core, and wraps it in NAL format. Passes its output to the master memory ctrl for writing to memory.

- encoder control: (probably) a state machine to control and synchronize general processes in the IP core (e.g., signaling the interrupt unit when raw data is consumed).

- interrupt: generates required interrupts and updates interrupt registers.

- Config bus slave: i/f with the configuration bus.

- slave memory ctrl: (probably) state machine for handling read/write requests from the configuration bus.

- registers: contains register file with all relevant registers. e.g., status, descriptor address, "encode now", interrupt registers, ...
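Just to give an idea of what the registers block could expose on the config bus, a hypothetical register map might look like this (offsets and names are placeholders only, not a proposal we have agreed on):

/* Hypothetical register map of the "registers" block -- placeholders only. */
#define H264_REG_STATUS     0x00  /* idle / busy / error flags                  */
#define H264_REG_DESC_ADDR  0x04  /* address of the current descriptor          */
#define H264_REG_CTRL       0x08  /* bit 0: "encode now"                        */
#define H264_REG_INT_STATUS 0x0C  /* raw-data-consumed, encode-error, ...       */
#define H264_REG_INT_MASK   0x10  /* per-source interrupt enable                */
#define H264_REG_INT_CLR    0x14  /* write-1-to-clear for interrupt status bits */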

please comment, ...

- gil
RE: Architecture
by ckavalipati on Oct 2, 2009
ckavalipati
Posts: 19
Joined: Aug 3, 2009
Last seen: Oct 30, 2012

2. CPU: writes descriptor address on the core (through WB slave).
3. CPU: asserts register bit "encode now" on the core (through WB slave).

- gil


Can we have a descriptor queue (a circular queue) that FW writes to? This queue will have a PI (producer index) and a CI (consumer index). FW or any other block can be the producer of the descriptors and update the PI, and the Core updates the CI after processing a descriptor. We can have an interrupt message queue where the roles are reversed in terms of producer and consumer.
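Something like this minimal sketch, for example (the depth, names and full/empty convention are just placeholders): the queue lives in shared memory, FW advances PI when it submits a descriptor, and the core advances CI once it has processed one.

#include <stdint.h>

#define QUEUE_DEPTH 16                     /* placeholder depth, power of two      */

struct h264_desc;                          /* descriptor layout defined elsewhere  */

struct desc_queue {
    struct h264_desc *slots[QUEUE_DEPTH];  /* descriptor pointers                  */
    volatile uint32_t pi;                  /* producer index, written by FW        */
    volatile uint32_t ci;                  /* consumer index, written by the core  */
};

/* FW side: returns 0 on success, -1 if the queue is full. */
int queue_submit(struct desc_queue *q, struct h264_desc *d)
{
    uint32_t next = (q->pi + 1) % QUEUE_DEPTH;
    if (next == q->ci)
        return -1;                         /* full: the core has not caught up yet */
    q->slots[q->pi] = d;
    q->pi = next;                          /* publish the new descriptor           */
    return 0;
}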
RE: Architecture
by gil_savir on Oct 3, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021

2. CPU: writes descriptor address on the core (through WB slave).
3. CPU: asserts register bit "encode now" on the core (through WB slave).

- gil


Can we have a descriptor queue (a circular queue) that FW writes to? This queue will have a PI (producer index) and a CI (consumer index). FW or any other block can be the producer of the descriptors and update the PI, and the Core updates the CI after processing a descriptor. We can have an interrupt message queue where the roles are reversed in terms of producer and consumer.


Please correct me if I'm wrong:
- By FW, do you mean the CPU?
- And you'd like the option for the h264-encoder core to be able to handle more than one stream to be encoded?
- If each such stream is originated by a 'producer', then what is the consumer?
- And then the core should do something like "context switching" when it switches between producers?

It might be clearer if you give an example of such a flow of events.

- gil
RE: Architecture
by ckavalipati on Oct 3, 2009
ckavalipati
Posts: 19
Joined: Aug 3, 2009
Last seen: Oct 30, 2012


- What you mean by FW is CPU?


Yes. I meant the FW, or software running on the CPU, which produces the descriptors.


- and you'd like the option for the h264-encoder core to be able to handle more than one streams to be encoded?
- each of such stream is originated by a 'producer', then what is a consumer?


No. Only one stream. I thought we have only one Camera in the system.

By "producer" I mean any entity that creates the descriptors and submits them to H.264 Encoder on its descriptor queue. In our case, I think it is the FW in the SoC.

The "consumer" of these descriptors is H.264 Encoder core, which encodes the frames represented by the descriptors.

Does this make sense? I am still learning the encoding algorithm and I spent very little time with x264 code.





RE: Architecture
by ckavalipati on Oct 4, 2009
ckavalipati
Posts: 19
Joined: Aug 3, 2009
Last seen: Oct 30, 2012
I am a bit confused about the overall goal of the project.

The original proposal, at least as I interpreted it, was about developing an SoC that encodes video captured through a camera and streams it over one of the communication interfaces. (Actually, the SoC diagram shows only a USB 2.0 interface. Maybe we need an Ethernet interface as well.)

Are we planning on building this SoC eventually?

I think we should understand the high level architecture first before defining the individual cores.

In the final SoC, I am thinking that Camera controller streams frames to H.264 Encoder, which encodes them and wraps them into NALUs and they are streamed over a communication interface. This path is mostly HW and with SW (or FW) playing very little role as necessary.

Comments?
RE: Architecture
by gil_savir on Oct 5, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021
I am a bit confused about the overall goal of the project.

The original proposal, at least as I interpreted it, was about developing an SoC that encodes video captured through a camera and streams it over one of the communication interfaces. (Actually, the SoC diagram shows only a USB 2.0 interface. Maybe we need an Ethernet interface as well.)

Are we planning on building this SoC eventually?
Yes, we are. However, we should first implement the H.264 encoder (IP core) itself, and then use it within an SoC that might contain a camera controller providing the raw stream and an Ethernet controller used to stream the encoded data in NAL form (I assume NALU = NAL unit?). The way I see it (and, if I'm not mistaken, the way Jack sees it too), this project should have two main stages:
1. develop (and verify) an H.264-encoder IP core that performs the encoding (receives raw data as input and outputs encoded data, most probably in NAL format).
2. implement (and test) an SoC using the H264-encoder IP core with OR1K as its host CPU, one (or more) raw-stream producer(s), one (or more) encoded-stream consumer(s), and any other peripheral(s) we see necessary.
I think we should understand the high level architecture first before defining the individual cores.
I agree.
In the document OC_avc_r0.02.pdf (from the SVN repository, under oc-h264-encoder/trunk/doc/misc/), the last paragraph (dealing with optimization) presents two possible approaches for the encoder architecture:
1. Using a set of small HW accelerators, each replacing a small (computation-intensive) portion of the h264 code. This way the entire encoding is successively switched between the CPU and the accelerators.
2. Using a single h264 encoder core that performs a great portion of the encoding process and encapsulates many (computation-intensive) functions.

I think that the first approach will not be very modular and will create a high degree of dependency between the CPU and the HW; implementations with other CPUs might not be straightforward. The 2nd approach, as I see it, is more modular and provides higher portability and extensibility.

In the final SoC, I am thinking that Camera controller streams frames to H.264 Encoder, which encodes them and wraps them into NALUs and they are streamed over a communication interface. This path is mostly HW and with SW (or FW) playing very little role as necessary.


I think that eventually, the SoC developed in the 2nd part of our project will look pretty similar to what you describe here. However, I think that the (raw data) producer(s) and the (encoded data) consumer(s) shouldn't be hardwired to the H264 encoder IP core. Such an implementation is very rigid and lacks flexibility. Instead, I think that the encoder should get raw data from one area of the memory (allocated by the FW/CPU) and store the encoded data in another area of the memory (also allocated by the FW/CPU). This way, the H264-encoder IP core logic doesn't need to be aware of the (raw data) producer(s) or the (encoded data) consumer(s). The CPU/FW will control, synchronize, and allocate memory for the raw-data and encoded-data streams. This also means that in the first stage of our project, we don't need to be concerned with the nature of the producers and consumers.

Anyhow, this is only a proposal for the architecture. Please feel free to comment on it and to present any other architecture proposal you see fit.

- gil
RE: Architecture
by gijoprems on Oct 5, 2009
gijoprems
Posts: 13
Joined: Jun 17, 2008
Last seen: Jan 8, 2014
I think that the encoder should get raw data from one area of the memory (allocated by the FW/CPU) and store the encoded data in another area of the memory (also allocated by the FW/CPU). This way, the H264-encoder IP core logic doesn't need to be aware of the (raw data) producer(s) or the (encoded data) consumer(s). The CPU/FW will control, synchronize, and allocate memory for the raw-data and encoded-data streams.


I think this is how it should be. Encoder should only be aware of the source and destination along with the encoding parameters.

Btw, could you please brief a bit on descriptors? What are all the contents of that?

-Gijo
RE: Architecture
by jackoc on Oct 5, 2009
jackoc
Posts: 13
Joined: Sep 5, 2009
Last seen: May 16, 2010
Hi all,
I have an idea for an initial architecture. Since we are to implement a large portion of x264 in HW, it seems to me that in the first stage the HW IP core should be designed to be as independent as possible. This way, extending it and implementing it with different processors would be possible. The second stage should be the implementation of the H.264 Encoder IP core within an SoC based on the OpenRISC processor and some more relevant OpenCores peripherals (see attached figure).

For HW/SW i/f I suggest using descriptors and interrupts. The descriptor will contain info such as raw data start-address and size, destination address for encoded data, etc.

This will work as follows:
1. CPU: writes descriptor in memory.
2. CPU: writes descriptor address on the core (through WB slave).
3. CPU: asserts register bit "encode now" on the core (through WB slave).
4. CORE: fetches next raw MB/slice/frame from memory (through WB master) according to start address from the descriptor.
5. CORE: encodes MB/slice/frame and wraps it in NAL format.
6. CORE: writes the encoded MB/slice/frame to memory (through WB master) according to destination address from the descriptor.
7. CORE: repeats steps 4-6 until raw data is consumed.
8. CORE: updates relevant descriptor fields (encoding OK/encoding error, bytes written, etc.).
9. CORE: generates interrupt to inform CPU that the raw data is consumed.
10. repeat steps 1-9.

note that CPU steps 1 and 2 can be "pipelined" during steps 4-9.

please comment, add, fix, improve, disapprove, ...

- gil



Hi, gil

I agree a lot with your ideas; it's great. Let me restate it using my understanding. Suppose the raw YUV data is one frame; then the SW/HW cooperation flow will be:

1) SW writes the descriptors (YUV address, bitstream address, picture info, slice info, ...) to the H264 HW, and then asserts the register bit "encode now".

2) The H264 HW starts encoding one macroblock, writes out the bitstream, stores neighbour info (for intra MBs) or MVs (for inter MBs) to internal buffers (or external memory), updates the status info, and then asserts an interrupt to the CPU.

3) repeat step 2 until all MBs are encoded.

Am I right?

Maybe we should clarify the basic unit that the HW encodes when "encode now" is asserted by SW: an MB, a slice, several MBs within a slice, or even one whole picture? Any ideas?

I suggest that the basic unit (the number of MBs to be encoded at one time) should be configurable by SW.

Regards,

Jack


RE: Architecture
by jackoc on Oct 5, 2009
jackoc
Posts: 13
Joined: Sep 5, 2009
Last seen: May 16, 2010



Hi, gil

I suggest that all the descriptors should be implemented as registers within the H264 encoder core, to enhance performance and reduce external memory bandwidth.

Jack
RE: Architecture
by gil_savir on Oct 5, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021
I think that the encoder should get raw data from one area of the memory (allocated by the FW/CPU) and store the encoded data in another area of the memory (also allocated by the FW/CPU). This way, the H264-encoder IP core logic doesn't need to be aware of the (raw data) producer(s) or the (encoded data) consumer(s). The CPU/FW will control, synchronize, and allocate memory for the raw-data and encoded-data streams.


I think this is how it should be. Encoder should only be aware of the source and destination along with the encoding parameters.

Btw, could you please brief a bit on descriptors? What are all the contents of that?

-Gijo


Hi Gijo,

A descriptor can be seen as a data structure that resides in memory shared between two devices; typically these are the CPU/FW and another module (e.g., a SATA/USB/Ethernet controller, a WiFi radio transmitter, etc.). A descriptor can be compared to a C struct that contains several fields and bits. Descriptors are used for communication between the host CPU/FW and the device; both may read and write descriptor fields.

In our case, descriptor should contain (at least) the following fields:
- raw data start address and size allocated by CPU/FW.
- start address and size of memory (allocated by CPU/FW) for encoded data to be written.

Other fields should be determined according to the nature and the extent of the h264-encoder. These could be:
- slice info (boundaries, type,...)
- frame info (size, start, end)
- memory location allocated for reference frames (in case the h264-encoder core will not contain dedicated storage for reference frames)
- next descriptor address (if we want the possibility of descriptors linked-list)
- "encoding succeeded bit" that the encoder can assert when encoding was terminated with no errors/exceptions.
- ...

Anyway, the descriptor's fields should be determined according to the nature of the encoder and communication required between the CPU/FW and the encoder.
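Purely as an illustration (the field names and widths below are placeholders, not a decided format), such a descriptor could look roughly like this:

#include <stdint.h>

/* Hypothetical descriptor layout in shared memory -- illustration only. */
struct h264_desc {
    uint32_t raw_addr;    /* raw-data start address (set by CPU/FW)              */
    uint32_t raw_size;    /* raw-data size in bytes (set by CPU/FW)              */
    uint32_t enc_addr;    /* start of the buffer allocated for encoded data      */
    uint32_t enc_size;    /* size of that buffer                                 */
    uint32_t slice_info;  /* slice boundaries, type, ...                         */
    uint32_t frame_info;  /* frame size, start/end flags                         */
    uint32_t ref_addr;    /* reference-frame area, if kept in main memory        */
    uint32_t next_desc;   /* address of next descriptor (0 = end of linked list) */
    uint32_t status;      /* written by the core: OK/error, bytes written, ...   */
};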

I think we need to debate the extent of the encoder and its internal architecture. One important issue to discuss is at what level the encoder should operate, i.e., what unit it should be aware of. I think the 3 most obvious options are frame/slice/MB.

Another important issue to decide is whether we should implement reference-frame storage in main memory or in dedicated storage inside the encoder core. This issue is tied to the decision on the largest frame size the encoder should be able to process. I'm afraid that in the case of large frames, dedicated reference storage will be infeasible.
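Just for a rough sense of the numbers (assuming 8-bit 4:2:0 video): a single reference frame takes width x height x 1.5 bytes, i.e. about 1.4 MB at 1280x720 and about 3.1 MB at 1920x1080, so dedicated on-chip reference storage only seems realistic for fairly small frame sizes.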

please comment,

- gil
RE: Architecture
by gijoprems on Oct 5, 2009
gijoprems
Posts: 13
Joined: Jun 17, 2008
Last seen: Jan 8, 2014

Thanks Gil, for the clarification. As I understand, it's similar to DMA descriptors.
Do we intend to allow chaining of encoding requests, and is that why we need descriptors? Otherwise, it could have been done with a set of registers only; correct me if I am wrong.


at what level the encoder should operate, i.e., what unit it should be aware of. I think the 3 most obvious options are frame/slice/MB.


I think it should be slice: slice-level encoding solves issues like the dependency on previous MBs, and it reduces the video latency that full-frame encoding would introduce.


Another important issue to decide is whether we should implement reference-frame storage in main memory or in dedicated storage inside the encoder core.

I would suggest using external memory for all the temporary storage. So, during the encoder configuration, we can give the scratch start address as an input parameter to the encoder.

-Gijo
RE: Architecture
by gil_savir on Oct 5, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021

Thanks Gil, for the clarification. As I understand, it's similar to DMA descriptors.
Do we intend to allow chaining of encoding requests, and is that why we need descriptors? Otherwise, it could have been done with a set of registers only; correct me if I am wrong.

We can do it with registers instead of descriptors. We can also provide both options (registers and descriptors) to coexist side by side.
In principle, CPU-encoder communication through registers can be faster, especially if we provide "shadow registers" that hold a copy of the original registers' contents, so that the CPU can write the next input-stream parameters into the registers while the previous raw-data chunk is still being encoded.
However, since the encoder core is meant for an SoC, it won't have all the CPU time for itself. Providing an option for CPU-encoder communication through descriptor chaining may help distribute the CPU load.
Providing both options will not be difficult. It might also be offered as a core instantiation option.
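As a very rough sketch of how the shadow-register option could look from the SW side (all register names, offsets and bits below are placeholders): the CPU programs the next chunk's parameters into a shadow set while the core is still encoding the current chunk, and the core copies the shadow set into its working registers when it starts the next chunk.

#include <stdint.h>

#define H264_BASE            0x90000000u   /* hypothetical WB slave base          */
#define REG(off)             (*(volatile uint32_t *)(H264_BASE + (off)))

#define H264_CTRL            0x08
#define H264_SHDW_RAW_ADDR   0x20          /* shadow copy of raw-data address     */
#define H264_SHDW_RAW_SIZE   0x24          /* shadow copy of raw-data size        */
#define H264_SHDW_DST_ADDR   0x28          /* shadow copy of destination address  */
#define H264_CTRL_NEXT_READY 0x2u          /* placeholder "next chunk queued" bit */

void queue_next_chunk(uint32_t raw_addr, uint32_t raw_size, uint32_t dst_addr)
{
    /* This can run while the previous chunk is still being encoded. */
    REG(H264_SHDW_RAW_ADDR) = raw_addr;
    REG(H264_SHDW_RAW_SIZE) = raw_size;
    REG(H264_SHDW_DST_ADDR) = dst_addr;
    REG(H264_CTRL) |= H264_CTRL_NEXT_READY;
}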


at what level the encoder should operate, i.e., what unit it should be aware of. I think the 3 most obvious options are frame/slice/MB.

I think it should be slice: slice-level encoding solves issues like the dependency on previous MBs, and it reduces the video latency that full-frame encoding would introduce.

I agree.


Another important issue to decide is whether we should implement reference-frame storage in main memory or in dedicated storage inside the encoder core.

I would suggest using external memory for all the temporary storage. So, during the encoder configuration, we can give the scratch start address as an input parameter to the encoder.

I agree.
RE: Architecture
by apurvbhatt on Oct 5, 2009
apurvbhatt
Posts: 1
Joined: Jun 30, 2009
Last seen: Jul 23, 2013
All,

I think reading raw YUV data and writing output encoded stream into external memory can take a lot of memory bandwidth.

Can we have some agent between the encoder IP core and memory (something like a DMA engine) that can read/write data from memory upon request from the encoder IP core?

I believe the processing done by the encoder IP would be much faster in this case. Please correct me if I am wrong here.

Thanks & Regards,
apurv