RE: Architecture
by dnareshkumar on Oct 5, 2009
Hi,
I think this project is meant for low power. By adding a DMA we introduce more power dissipation. Moreover, bandwidth is not a major constraint here (memory is needed before displaying anyway, and nowadays memory is very cheap, so I don't think it is a big deal).
RE: Architecture
by marcus.erlandsson on Oct 5, 2009
> Hi, I think this project is meant for low power. By adding a DMA we introduce more power dissipation. Moreover, bandwidth is not a major constraint here (memory is needed before displaying anyway, and nowadays memory is very cheap, so I don't think it is a big deal).

Many designs today use FPGAs, so low power isn't really the issue. I would say that performance is more critical.

BR, Marcus
RE: Architecture
by ckavalipati on Oct 5, 2009
> We can do it with registers instead of descriptors. We can provide both options (registers and descriptors) to coexist side by side as well. In principle, CPU-encoder communication through registers can be faster, especially if we provide "shadow registers" that copy the original register contents, so that the CPU could write the input stream parameters for the next raw data chunk into the registers while the previous chunk is still being encoded. However, since the encoder core is meant for an SoC, it won't have all the CPU time for itself. Providing an option for CPU-encoder communication through descriptor chaining may help distribute the CPU load. Providing both options will not be difficult. It might also be implemented as a core instantiation option.

My original suggestion was to make the Encoder aware of a descriptor queue implemented in a common shared memory accessible to the SoC cores. The raw frames from the Camera can be converted into descriptors and posted to the Encoder's input descriptor queue. The Encoder writes the NAL units back to the shared memory (probably on an outbound NALU queue). This is a very high-level view; I probably need to put together a diagram to explain this better. The Encoder has a well-defined interface as to what input queues it expects and how it generates the encoded packets on an output queue.
RE: Architecture
by ckavalipati on Oct 5, 2009
> I think that eventually, the SoC developed in the 2nd part of our project will look pretty similar to what you describe here. However, I think that the (raw data) producer(s) and the (encoded data) consumer(s) shouldn't be hardwired to the H264 encoder IP core. Such an implementation is very rigid and lacks flexibility.

Agreed. This is not what I meant.

> Instead, I think that the encoder should get raw data from one area of the memory (allocated by the FW/CPU) and store the encoded data in another area of the memory (also allocated by the FW/CPU).

Agreed. The descriptor represents this, does it not?

> This way, the H264-encoder IP core logic shouldn't be aware of the (raw data) producer(s) or the (encoded data) consumer(s).

Agreed. Can't the Camera controller generate these descriptors and provide them to the Encoder? I am glossing over a lot of details here. But from a very high-level point of view, am I correct to expect that the Streaming Video Encoding System being built in this project works this way: the CPU/FW will control, synchronize, and allocate memory for the raw-data and encoded-data streams?

> This way, at the first stage of our project, we shouldn't be concerned with the nature of producers and consumers.

I agree. But the path of video capture and encoding should move to HW as much as possible. It helps if we pay attention to this overall goal from the beginning.
RE: Architecture
by ckavalipati on Oct 5, 2009
> In our case, the descriptor should contain (at least) the following fields:
> - raw data start address and size, allocated by CPU/FW
> - start address and size of the memory (allocated by CPU/FW) where the encoded data is to be written
> Other fields should be determined according to the nature and extent of the h264-encoder. These could be:
> - slice info (boundaries, type, ...)
> - frame info (size, start, end)
> - memory location allocated for reference frames (in case the h264-encoder core will not contain dedicated storage for reference frames)
> - next descriptor address (if we want the possibility of a descriptor linked-list)
> - an "encoding succeeded" bit that the encoder can assert when encoding terminated with no errors/exceptions
> At what level should the encoder be "independent" (i.e., what should it be aware of)? I think the 3 most obvious options are frame/slice/MB. I think it should be slice, because if we use slices we solve issues like the dependency on previous MBs, and we avoid the video latency of full-frame encoding (if we go with frame-level encoding).

I agree. So the FW/CPU breaks the raw video frames/pictures into slices, creates a descriptor chain, and submits this to the Encoder. Correct? Maybe I am confused, but don't we need some prediction information to break pictures into slices? How does this work when the Camera controller is streaming frames in real time? Does the FW/CPU buffer the frames, create the descriptor chain, and write it to the Encoder? I am trying to think of a scheme where raw frames/pictures are given to the Encoder with little FW intervention.
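To make the discussion a bit more concrete, here is a minimal sketch of how such a slice descriptor could look as a C structure visible to FW. All field names and widths are my own assumptions for illustration, not an agreed layout:

#include <stdint.h>

/* Hypothetical slice descriptor as seen by FW in shared memory.
 * Field names and widths are illustrative assumptions only. */
struct h264enc_descriptor {
    uint32_t raw_addr;    /* start address of raw YUV slice data (CPU/FW allocated)  */
    uint32_t raw_size;    /* size of the raw data in bytes                           */
    uint32_t enc_addr;    /* start address for the encoded output (CPU/FW allocated) */
    uint32_t enc_size;    /* size of the output buffer in bytes                      */
    uint32_t slice_info;  /* slice boundaries/type, encoding TBD                     */
    uint32_t frame_info;  /* frame size, start/end flags, encoding TBD               */
    uint32_t ref_addr;    /* reference-frame storage, if not internal to the core    */
    uint32_t next_desc;   /* address of the next descriptor (0 = end of chain)       */
    uint32_t status;      /* bit 0: "encoding succeeded", set by the encoder         */
};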
RE: Architecture
by gil_savir on Oct 6, 2009
> I think reading raw YUV data and writing the output encoded stream into external memory can take a lot of memory bandwidth. Can we have some agent in between the encoder IP core and memory (something like a DMA) which can read/write data from memory upon request from the encoder IP core? I believe the processing done by the encoder IP would be much faster in this case. Please correct me if I am wrong here.

We intend to use direct memory access through a shared Wishbone (WB) bus. The h264-encoder core will contain a WB master interface for accessing the memory bus directly. No CPU intervention will be required.

> I think this project is meant for low power. By adding a DMA we introduce more power dissipation. Moreover, bandwidth is not a major constraint here (memory is needed before displaying anyway, and nowadays memory is very cheap, so I don't think it is a big deal).

Memory SIZE is indeed very cheap nowadays. However, memory BANDWIDTH is still an issue when implementing an SoC. The SoC architecture will have at least one (WB) bus that will be used to connect the memory (controller) with the CPU, with the h264-encoder core, and with other cores that will be decided later (e.g., Ethernet, USB, SATA, camera controllers, etc.). Most of these cores, if not all of them, will have to access the memory. Therefore, memory bandwidth has to be handled with care. Low power is one goal of this project, yet I can't see how we can implement it without DMA.

- gil
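To put rough numbers on the bandwidth argument, a back-of-the-envelope calculation (the 720p/30 fps figures are only an assumed example, not a project requirement):

#include <stdio.h>

int main(void)
{
    /* Assumed example: 720p at 30 fps, planar YUV 4:2:0 (1.5 bytes/pixel). */
    const double width = 1280.0, height = 720.0, fps = 30.0;

    double raw_read  = width * height * 1.5 * fps / 1e6;  /* encoder reads raw input   */
    double ref_write = raw_read;                           /* reconstructed frame write */

    printf("raw input read       : %.1f MB/s\n", raw_read);   /* ~41.5 MB/s */
    printf("reference frame write: %.1f MB/s\n", ref_write);

    /* Motion estimation re-reads reference data several times over, and the
     * encoded bitstream is written out as well, so the sustained bus load is
     * a multiple of these figures - on a bus shared with the CPU and the
     * other cores. */
    return 0;
}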
RE: Architecture
by gil_savir on Oct 6, 2009
> In our case, the descriptor should contain (at least) the following fields: raw data start address and size (allocated by CPU/FW); start address and size of the memory (allocated by CPU/FW) where the encoded data is to be written; slice info (boundaries, type, ...); frame info (size, start, end); memory location allocated for reference frames (in case the h264-encoder core will not contain dedicated storage for reference frames); next descriptor address (if we want the possibility of a descriptor linked-list); an "encoding succeeded" bit that the encoder can assert when encoding terminated with no errors/exceptions. At what level should the encoder be "independent"? I think it should be slice, because slices solve issues like the dependency on previous MBs and the video latency of full-frame encoding.
>
> I agree. So the FW/CPU breaks the raw video frames/pictures into slices, creates a descriptor chain, and submits this to the Encoder. Correct? Maybe I am confused, but don't we need some prediction information to break pictures into slices? How does this work when the Camera controller is streaming frames in real time? Does the FW/CPU buffer the frames, create the descriptor chain, and write it to the Encoder? I am trying to think of a scheme where raw frames/pictures are given to the Encoder with little FW intervention.

I'm trying to understand how slicing is done in the x264 code. The H.264 standard does not specify it (since it only defines decoding). If someone knows how slicing is done, or where to find such information, please share. Understanding slicing is crucial for deciding whether it should be done in FW or HW.
RE: Architecture
by gil_savir on Oct 6, 2009
> Can't the Camera controller generate these descriptors and provide them to the Encoder? I am glossing over a lot of details here.

The camera controller shouldn't be aware of the h264-encoder core. To be more specific, it's not the camera controller but a camera controller, since there are many camera controllers available. It can also be any other source of a raw YUV (4:2:0) stream (e.g., Ethernet, HDMI controller, etc.). Therefore, the FW/CPU is the only entity that should "know how to talk" to the h264-encoder core (through descriptors/registers). You can't expect every possible raw-stream producer to know how to talk to the encoder.
RE: Architecture
by ckavalipati on Oct 6, 2009
Maybe we should let the Encoder do the slicing and provide it with frames.
What do most camera controllers provide? Frames? The input to the Encoder should be in a form that is common across camera controllers. This avoids the FW pre-processing the data.
RE: Architecture
by gil_savir on Oct 7, 2009
If we want the encoder to work at a level higher than MB, then slice, frame, or GOP might be the choices. Since a slice may contain at most an entire frame, the encoder must be able to deal with a whole frame.

It appears that until one month ago, x264 supported single-slice frames only. The x264 version we used for profiling is, by coincidence, the first to enable slicing. However, the default is no slicing. It seems that there are few scenarios where slicing is required. When slicing is done, it is usually done artificially, i.e., the user specifies the number of slices per frame (4 is common), and the encoder simply divides the frame into 4 even-sized rectangular slices. More advanced slicing schemes are done with multi-pass encoding. However, frame slicing reduces video quality. It is used for error resilience and for enabling (in some cases) faster decoding.

That being said, I suggest:
1. We'll (initially) encode using 1 slice per frame. We'll then add support for simple slicing (the user/FW decides how many rectangular slices per frame).
2. The descriptors will provide one slice at a time. This way, the descriptor size is constant.

- gil
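As a rough illustration of the simple, even slicing scheme described above, here is a sketch that splits a frame into N slices of whole macroblock rows; the row-based split and the example figures are my assumptions, not something taken from the x264 source:

#include <stdio.h>

#define MB_SIZE 16   /* H.264 macroblocks are 16x16 pixels */

int main(void)
{
    /* Assumed example: 720p frame, 4 slices chosen by the user/FW. */
    int frame_height = 720;
    int mb_rows      = (frame_height + MB_SIZE - 1) / MB_SIZE;   /* 45 MB rows */
    int num_slices   = 4;

    for (int s = 0; s < num_slices; s++) {
        int first = s * mb_rows / num_slices;            /* first MB row of slice s */
        int last  = (s + 1) * mb_rows / num_slices - 1;  /* last MB row of slice s  */
        printf("slice %d: MB rows %d..%d\n", s, first, last);
    }
    return 0;
}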
RE: Architecture
by ckavalipati on Oct 8, 2009
> The camera controller shouldn't be aware of the h264-encoder core. To be more specific, it's not the camera controller but a camera controller, since there are many camera controllers available. It can also be any other source of a raw YUV (4:2:0) stream (e.g., Ethernet, HDMI controller, etc.). Therefore, the FW/CPU is the only entity that should "know how to talk" to the h264-encoder core (through descriptors/registers). You can't expect every possible raw-stream producer to know how to talk to the encoder.

This is fair. We can have the FW be the glue between the video source and the encoder. But the "slice-descriptor-chain" seems a bit cumbersome to me. How long should this chain be, and how can the FW build it in real time? Having said this, I must add that this is a great idea for a start. If I can think of any other possible interface I will propose it. (I am thinking about a Circular Frame Buffer in memory which the FW can write to and the Encoder can read from. The idea is half-baked and I have to work out some more details before I can discuss it here.)
RE: Architecture
by gil_savir on Oct 8, 2009
the "slice-descriptor-chain" seems a bit cumbersome to me. How long this chain should be and how in real time FW can build it?
There are many SoCs out there where FW builds descriptor-chains in real time. I used to work with 2 kinds myself (WiFi radio controllers, SATA controllers).Anyhow, I think we should also provide alternative for descriptors. i.e., direct encoder-registers manipulation by FW. We would leave to the user (at the stage of IP core configuration for instantiation) and/or to FW to choose communication method. I think that our approach at this stage of the project should be a design of H264-encoder IP core, as general as possible. In the end, it shouldn't be only useful in our SoC. it should be made as easy as possible for anyone to use/extend it for his/hers purpose.
(I am thinking about a Circular Frame Buffer in memory where FW can write to and Encoder can read from. The idea is half-baked and I have to work on some more details before I can discuss it here. )
This is example of the generality of the design. If the encoder is only aware of start address and size of the raw data, a circular buffer can be implemented in few different ways. the attached sketch shows it. The leftmost is a naive implementation, using one descriptor only, where FW has to updates the descriptor only when accessible. the two other gives a 3-staged circular buffer. Personally, I guess we will (eventually) use the rightmost circular buffer, with 2 or 3 stages, since raw stream will be copied once to a continuous chunk of the memory.
The point is, IP core should be implemented as general as possible, leaving many options and flexibility for higher levels (FW, users, other designers, ...). |
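As a small code sketch of the multi-stage variant described above: a hypothetical ring of three descriptors covering one contiguous raw-data region. The field names and the owner handshake are assumptions, not an agreed interface:

#include <stdint.h>

/* Hypothetical, stripped-down descriptor: only the fields needed for the ring. */
struct desc {
    uint32_t raw_addr;   /* start of this stage's raw-data region                 */
    uint32_t raw_size;   /* size of the stage in bytes                            */
    uint32_t next_desc;  /* bus address of the next descriptor in the ring        */
    uint32_t owner;      /* 0 = owned by FW, 1 = owned by encoder (assumed flag)  */
};

/* Build a 3-stage circular buffer: three descriptors whose next pointers form
 * a ring over one contiguous raw-data region. desc_addr[] holds the bus
 * addresses at which the descriptors themselves live. */
void build_ring(struct desc d[3], const uint32_t desc_addr[3],
                uint32_t raw_base, uint32_t stage_size)
{
    for (int i = 0; i < 3; i++) {
        d[i].raw_addr  = raw_base + (uint32_t)i * stage_size;
        d[i].raw_size  = stage_size;
        d[i].next_desc = desc_addr[(i + 1) % 3];  /* last stage wraps to the first   */
        d[i].owner     = 0;                        /* FW fills it before handing over */
    }
}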
RE: Architecture
by gil_savir on Oct 9, 2009
> What do most camera controllers provide? Frames?

Most camera controllers (sometimes called "frame grabbers") provide the raw video stream in frames.

> The input to the Encoder should be in a form that is common across camera controllers. This avoids the FW pre-processing the data.

I think Chander is right. There are painfully many formats in use (see www.fourcc.org). There is the RGB family and the YCbCr (sometimes wrongly called YUV) family. The YCbCr formats are divided into packed and planar formats. As far as I know, the IYUV/I420 (planar) format is pretty popular. I suggest we start with support for this format, and later we can add support for more formats if we find it necessary (in FW or HW).
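For reference, a small sketch of how an IYUV/I420 frame is laid out in memory; the helper name is made up for illustration, but the 1.5-bytes-per-pixel planar layout itself is standard for 4:2:0:

#include <stddef.h>

/* Plane offsets of an IYUV/I420 (planar YUV 4:2:0) frame stored contiguously:
 * a full-resolution Y plane, then U, then V, each chroma plane subsampled by
 * 2 in both directions. Width and height are assumed even. */
struct i420_layout {
    size_t y_off, u_off, v_off, total;
};

struct i420_layout i420_offsets(size_t width, size_t height)
{
    struct i420_layout l;
    size_t y_size = width * height;
    size_t c_size = (width / 2) * (height / 2);   /* one chroma plane */

    l.y_off = 0;
    l.u_off = y_size;
    l.v_off = y_size + c_size;
    l.total = y_size + 2 * c_size;                /* 1.5 bytes per pixel */
    return l;
}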
- gil
RE: Architecture
by ckavalipati on Oct 10, 2009
I have a proposal for how the flow between FW and Encoder could work, from a high-level view. To simplify things I outline only the inbound flow into the Encoder. I wish I could make a diagram to explain this better, but at this time I only have a textual description. If anything is not clear please let me know.

Overview: There will be a "Video Stream Buffer" in memory that the FW can write to and the Encoder can read from. The location of this Video Stream Buffer is determined by the FW (as part of the SoC configuration), and the FW programs its location and size into the Encoder through registers. The Encoder may place certain restrictions on alignment, minimum size, etc. (TBD). The FW keeps writing frames or slices of video data into this buffer in memory, and the Encoder keeps reading video slices from it. We can have a Slice Length Register to control the size of a slice. Synchronization between FW and Encoder happens through two pointers called the producer index and the consumer index. The FW owns the producer index and the Encoder owns the consumer index. There will be registers representing these indexes.

Flow (I have omitted a lot of details to simplify the description):
1. At initialization time the FW programs the Video Stream Buffer's address and size registers. (The FW programs a lot more registers at initialization: it takes the core out of reset, programs the video stream properties that need to be defined, such as color space, image size, etc., the encoding properties, if any, and other configuration settings.)
2. When the FW receives video data it writes it to the Video Stream Buffer and updates the "Video Stream Buffer Producer Index Register".
3. The Encoder then starts reading video data in slices and encoding it. As it consumes video data (slices) it keeps updating the consumer index at a programmed location in memory (we can have a "Video Stream Buffer Consumer Index Address Register") and generates interrupts. The consumer index gets updated every time an interrupt is generated. The Encoder will support interrupt avoidance features to throttle interrupts; the FW will fine-tune them for optimal performance. (TBD)
4. The FW processes the interrupts, writes some more data, and updates the producer index (through the register). If the producer index reaches the end of the buffer it wraps to the beginning of the buffer. The consumer index works the same way. When the producer index and the consumer index point to the same location, the buffer is considered empty; this is the reset state. If the producer index reaches a position where advancing it by another slice length would make it equal the consumer index, this indicates a buffer-full condition. If the buffer is full the FW backs off and waits for the Encoder to consume video data, i.e., waits for an interrupt before writing more data.

Steps 2, 3, and 4 are repeated by the FW during video streaming.
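For illustration, a minimal C sketch of the empty/full test and the FW-side write step described in steps 2 and 4 above; the function names and the byte-based indices are assumptions, not a committed interface:

#include <stdbool.h>
#include <stdint.h>

/* Indices and sizes are in bytes; buf_size is assumed to be a multiple of
 * slice_len. One slice slot is deliberately left unused so that "full" can
 * be distinguished from "empty". */
bool buffer_empty(uint32_t prod, uint32_t cons)
{
    return prod == cons;                         /* reset state: nothing queued */
}

bool buffer_full(uint32_t prod, uint32_t cons,
                 uint32_t slice_len, uint32_t buf_size)
{
    return (prod + slice_len) % buf_size == cons;
}

/* FW side of step 4: write one slice if there is room, then publish the new
 * producer index (in HW this would be a write to the hypothetical
 * "Video Stream Buffer Producer Index Register"). */
bool fw_write_slice(uint32_t *prod, uint32_t cons,
                    uint32_t slice_len, uint32_t buf_size)
{
    if (buffer_full(*prod, cons, slice_len, buf_size))
        return false;                            /* back off, wait for an interrupt */
    /* ... copy slice data into the Video Stream Buffer at offset *prod ... */
    *prod = (*prod + slice_len) % buf_size;
    return true;
}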
RE: Architecture
by gijoprems on Oct 10, 2009
Sounds good; but we should implement Gil's idea of descriptor-based configuration in parallel, I guess.
-Gijo