VNU Journal of Science: Comp. Science & Com. Eng., Vol. 35, No. 2 (2019) 1-22
Original Article
A Survey of High-Efficiency Context-Adaptive Binary
Arithmetic Coding Hardware Implementations
in High-Efficiency Video Coding Standard
Dinh-Lam Tran, Viet-Huong Pham, Hung K. Nguyen, Xuan-Tu Tran*
Key Laboratory for Smart Integrated Systems (SISLAB),
VNU University of Engineering and Technology, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
Received 18 April 2019
Revised 07 July 2019; Accepted 20 August 2019
Abstract: High-Efficiency Video Coding (HEVC), also known as H.265 and MPEG-H Part 2, is the newest video coding standard, developed to address the increasing demand for higher resolutions and frame rates. In comparison to its predecessor H.264/AVC, HEVC achieves almost double the compression performance and is capable of processing high-quality video sequences (UHD 4K, 8K; high frame rates) in a wide range of applications. Context-Adaptive Binary Arithmetic Coding (CABAC) is the only entropy coding method in HEVC, and its principal algorithm is inherited from its predecessor. However, several aspects of the way the method is exploited in HEVC are different, so HEVC CABAC supports better coding efficiency. Pipelining and parallelism in CABAC hardware architectures are promising methods for the implementation of high-performance CABAC designs. However, the high data dependence and the serial nature of bin-to-bin processing in the CABAC algorithm pose many challenges for hardware designers. This paper provides an overview of CABAC hardware implementations for HEVC targeting high-quality, low-power video applications, addresses the challenges of exploiting CABAC in different application scenarios, and then recommends several predicted research trends for the future.
Keywords: HEVC, CABAC, hardware implementation, high throughput, power saving.
_______
* Corresponding author. E-mail address: tutx@vnu.edu.vn
https://doi.org/10.25073/2588-1086/vnucsce.233

1. Introduction

ITU-T/VCEG and ISO/IEC-MPEG are the two dominant international organizations that have developed video coding standards [1]. The ITU-T produced H.261 and H.263, while the ISO/IEC produced MPEG-1 and MPEG-4 Visual; these two organizations then jointly produced the H.262/MPEG-2 Video and H.264/MPEG-4 Advanced Video Coding (AVC) standards. The two jointly-developed standards have had a particularly strong impact and have found their way into a wide variety of products that are increasingly prevalent in our daily lives. As the diversity of services and the
popularity of HD and beyond-HD video formats (e.g., 4K×2K or 8K×4K resolutions) have become an emerging trend, it is necessary to have higher coding efficiency than that of H.264/MPEG-4 AVC. This resulted in the newest video coding standard, called High Efficiency Video Coding (H.265/HEVC), developed by the Joint Collaborative Team on Video Coding (JCT-VC) [2]. The HEVC standard has been designed to achieve multiple goals, including coding efficiency, ease of transport system integration, and data loss resilience. The new video coding standard offers a much more efficient level of compression than its predecessor H.264, and is particularly suited to higher-resolution video streams, where the bandwidth savings of HEVC are about 50% [3, 4]. Besides maintaining coding efficiency, processing speed, power consumption and area cost also need to be considered in the development of HEVC, to meet the demands for higher resolutions, higher frame rates, and battery-based applications.

Context Adaptive Binary Arithmetic Coding (CABAC), which is one of the entropy coding methods in H.264/AVC, is the only form of entropy coding exploited in HEVC [7]. Compared to other forms of entropy coding, such as context adaptive variable length coding (CAVLC), HEVC CABAC provides considerably higher coding gain. However, due to several tight feedback loops in its architecture, CABAC is a well-known throughput bottleneck in the HEVC architecture, as it is difficult to parallelize and pipeline. In addition, this also leads to high computation and hardware complexity during the development of CABAC architectures for targeted HEVC applications. Since the standard was published, numerous studies worldwide have proposed hardware architectures for HEVC CABAC that trade off multiple goals, including coding efficiency, high throughput performance, hardware resources, and low power consumption.

This paper provides an overview of HEVC CABAC and of the state-of-the-art works relating to the development of high-efficiency hardware implementations that provide high throughput performance and low power consumption. Moreover, the key techniques and corresponding design strategies used in CABAC implementations to achieve the above objectives are summarized.

Following this introductory section, the remaining part of this paper is organized as follows: Section 2 is a brief introduction to the HEVC standard, the CABAC principle and its general architecture. Section 3 reviews state-of-the-art CABAC hardware architecture designs and assesses these works in detail from different aspects. Section 4 presents the evaluation and prediction of forthcoming research trends in CABAC implementation. Some conclusions and remarks are given in Section 5.

2. Background of high-efficiency video coding and context-adaptive binary arithmetic coding

2.1. High-efficiency video coding - coding principle and architecture, enhanced features and supported tools

2.1.1. High-efficiency video coding principle

As a successor of H.264/AVC in the development process of video coding standardization, HEVC's video coding layer design is based on conventional block-based hybrid video coding concepts, but with some important differences compared to prior standards [3]. These differences are the method of partitioning image pixels into the basic processing unit, more prediction block partitions, more intra-prediction modes, an additional SAO filter, and additional high-performance supporting coding tools (Tile, WPP). The block diagram of the HEVC architecture is shown in Figure 1.
Figure 1. General architecture of the HEVC encoder [1].

The process of HEVC encoding to generate a compliant bit-stream is typically as follows:

- Each incoming frame is partitioned into square blocks of pixels ranging from 64×64 to 8×8. While coding blocks of the first picture in a video sequence (and of the first picture at each clean random-access point into a video sequence) are intra-prediction coded (i.e., using the spatial correlations of adjacent blocks), for all remaining pictures of the sequence, or between random-access points, inter-prediction coding modes (using the temporal correlations of blocks between frames) are typically used for most blocks. The residual data of the inter-prediction coding mode is generated by selecting reference pictures and motion vectors (MVs) to be applied for predicting the samples of each block. By applying intra- and inter-predictions, the residual data (i.e., the differences between the original block and its prediction) is transformed by a linear spatial transform, which produces transform coefficients. These coefficients are then scaled, quantized and entropy coded to produce coded bit strings. These coded bit strings, together with prediction information, are packed and transmitted in a bit-stream format.

- In the HEVC architecture, the block-wise processes and quantization are the main causes of artifacts in reconstructed samples. Two loop filters are therefore applied to alleviate the impact of these artifacts on the reference data for better predictions.

- The final picture representation (a duplicate of the output of the decoder) is stored in a decoded picture buffer to be used for the prediction of subsequent pictures.

Because the HEVC encoding architecture contains decoding processes identical to those of the decoder to reconstruct the reference data for prediction, and the residual data along with its prediction information is transmitted to the decoding side, the prediction versions generated by the encoder and the decoder are identical.

2.1.2. Enhancement features and supported tools

a. Basic processing unit

Instead of the Macro-block (16×16 pixels) of H.264/AVC, the core coding unit in the HEVC standard is the Coding Tree Unit (CTU), with a maximum size of up to 64×64 pixels. The size of the CTU is variable and selected by the encoder, resulting in better efficiency for encoding higher resolution video formats. Each CTU consists of Coding Tree Blocks (CTBs), each of which includes luma and chroma Coding Blocks (CBs) and associated syntaxes. Each CTB, whose size is variable, is partitioned into CUs, which consist of a luma CB and chroma CBs. In addition, the coding tree structure is also partitioned into Prediction Units (PUs) and Transform Units (TUs). An example of block partitioning of video data is depicted in Figure 2. An image is partitioned into rows of CTUs of 64×64 pixels, which are further partitioned into CUs of different sizes (8×8 to 32×32). The size of the CUs depends on the level of detail of the image [5].

Figure 2. Example of CTU structure in HEVC.
b. Inter-prediction

The major changes in the inter-prediction of HEVC compared with H.264/AVC are in prediction block (PB) partitioning and fractional sample interpolation. HEVC supports more PB partition shapes for inter-picture-predicted CBs, as shown in Figure 3 [6].

In Figure 3, the partitioning modes PART−2N×2N, PART−2N×N, and PART−N×2N (with M=N/2) indicate the cases when the CB is not split, split into two equal-size PBs horizontally, and split into two equal-size PBs vertically, respectively. PART−N×N specifies that the CB is split into four equal-size PBs, but this mode is only supported when the CB size is equal to the smallest allowed CB size.

Figure 3. Symmetric and asymmetric prediction block partitioning.

Besides that, PBs in HEVC can be asymmetric motion partitions (AMPs), in which each CB is split into two different-sized PBs such as PART-2N×nU, PART-2N×nD, PART-nL×2N, and PART-nR×2N [1]. The flexible splitting of PBs enables HEVC to support higher compression performance compared to H.264/AVC.

c. Intra-prediction

HEVC uses block-based intra-prediction to take advantage of spatial correlation within a picture, and it follows the basic idea of angular intra-prediction. However, HEVC has 35 luma intra-prediction modes, compared with 9 in H.264/AVC, thus providing more flexibility and coding efficiency than its predecessor [7]; see Figure 4.

Figure 4. Comparison of intra prediction in HEVC (33 angular modes plus 0: Planar and 1: DC) and H.264/AVC [7].

d. Sample Adaptive Offset filter

The SAO (Sample Adaptive Offset) filter is a new coding tool of HEVC in comparison with H.264/AVC. Unlike the de-blocking filter, which removes artifacts based on block boundaries, SAO mitigates artifacts of samples due to transformation and quantization operations. This tool supports a better quality of reconstructed pictures, hence providing higher compression performance [7].

e. Tile and Wave-front Parallel Processing

Tile is the ability to split a picture into rectangular regions, which helps increase the capability of parallel processing, as shown in Figure 5 [5]. This is because tiles are encoded with some shared header information and are decoded independently. Each tile consists of an integer number of CTUs. The CTUs are processed in raster scan order within each tile, and the tiles themselves are processed in the same way. Prediction based on neighboring tiles is disabled, thus the processing of each tile is independent [5, 7].
Figure 5. Tiles in an HEVC frame [5].

Wave-front Parallel Processing (WPP) is a tool that allows re-initializing CABAC at the beginning of each line of CTUs. To increase the adaptability of CABAC to the content of the video frame, the coder is initialized once the statistics from the decoding of the second CTU in the previous row are available. Re-initialization of the coder at the start of each row makes it possible to begin decoding a row before the processing of the preceding row has been completed. The ability to start coding a row of CTUs before completing the previous one enhances CABAC coding efficiency. As illustrated in Figure 7, a picture is processed by a four-thread scheme, which speeds up the encoding time for high-throughput implementation. To maintain the coding dependencies required for each CTU, such that each one can be encoded correctly once its left, top-left, top and top-right neighbors are already encoded, CABAC should start encoding CTUs in the current row after at least two CTUs of the previous row have finished (Figure 6).

Figure 7. Representation of WPP to enhance coding efficiency.

2.2. Context-adaptive binary arithmetic coding for high-efficiency video coding (principle, architecture) and its differences from the one for H.264

2.2.1. Context-adaptive binary arithmetic coding's principle and architecture

While H.264/AVC uses two entropy coding methods (CABAC and CAVLC), HEVC specifies CABAC as its only entropy coding method. Figure 8 describes the block diagram of the HEVC CABAC encoder. The principal algorithm of CABAC has remained the same as in its predecessor; however, the method used to exploit it in HEVC differs in several aspects (discussed below). As a result, HEVC CABAC supports a higher throughput than that of H.264/AVC, particularly through coding efficiency enhancement and parallel processing capability [1, 8, 9]. This alleviates the throughput bottleneck existing in H.264/AVC, and HEVC therefore becomes the newest video coding standard that can be applied to high-resolution video formats (4K and beyond) and real-time video transmission applications. The most important improvements concern Binarization, Context Selection and Binary Arithmetic Encoding [8].

Figure 8. CABAC encoder block diagram [6].
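As a rough sketch of the dataflow in Figure 8 (our own simplification, with hypothetical function names standing in for the real engines), each syntax element is binarized and every resulting bin is routed by its mode flag to either the regular (context-coded) engine or the bypass engine:

```python
# Hypothetical top-level loop illustrating the regular/bypass mode switch.
# binarize yields (bin_value, is_bypass, context_index) triples;
# encode_regular / encode_bypass stand in for the two coding engines and
# return the bits they emit (possibly none, since arithmetic coding can
# represent several bins in one bit).

def cabac_encode(syntax_elements, binarize, encode_regular, encode_bypass):
    bitstream = []
    for se in syntax_elements:
        for bin_val, is_bypass, ctx_idx in binarize(se):
            if is_bypass:
                bitstream += encode_bypass(bin_val)
            else:
                bitstream += encode_regular(bin_val, ctx_idx)
    return bitstream
```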
Binarization: This is the process of mapping syntax elements into binary symbols (bins). Various binarization forms, such as Exp-Golomb, fixed length, truncated unary and custom formats, are used in HEVC. Combinations of different binarizations are also allowed, where the prefix and suffix are binarized differently, such as truncated Rice (a truncated unary - fixed length combination) or the truncated unary - Exp-Golomb combination [7].

Context Selection: Context modeling and selection are used to accurately model the probability of each bin. The probability of a bin depends on the type of syntax element it belongs to, the bin index within the syntax element (e.g., most significant bin or least significant bin), and the properties of spatially neighboring coding units. HEVC utilizes several hundred different context models, so a large Finite State Machine (FSM) is needed for accurate context selection of each bin. In addition, the estimated probability of the selected context model is updated after each bin is encoded or decoded [7].

Binary Arithmetic Encoding (BAE): The BAE compresses bins into bits (i.e., multiple bins can be represented by a single bit); this allows syntax elements to be represented by a fractional number of bits, which improves coding efficiency. In order to generate bit-streams from bins, BAE involves several processes such as recursive sub-interval division and range and offset updates. The encoded bits represent an offset that, when converted to a binary fraction, selects one of the two sub-intervals, which indicates the value of the decoded bin. After every decoded bin, the range is updated to equal the selected sub-interval, and the interval division process repeats itself. In order to effectively compress the bins to bits, the probability of the bins must be accurately estimated [7].

2.2.2. General CABAC hardware architecture

The CABAC algorithm includes three main functional blocks: the Binarizer, the Context Modeler, and the Arithmetic Encoder (Figure 9). Different hardware architectures of CABAC can be found in [10-14].

Figure 9. General hardware architecture of a CABAC encoder [10].

Besides the three main blocks above, the architecture also comprises several other functional modules such as buffers (FIFOs) and data routers (multiplexers and de-multiplexers). Syntax elements (SEs) from the other processes in the HEVC architecture (residual coefficients, SAO parameters, prediction mode, …) have to be buffered at the input of the CABAC encoder before feeding the Binarizer. The general hardware architecture of the Binarizer in CABAC is characterized in Figure 10.

Based on the SE value and type, the Analyzer & Controller selects an appropriate binarization process, which produces the bin string and bin length accordingly. The HEVC standard defines several basic binarization processes, such as FL (Fixed Length), TU (Truncated Unary), TR (Truncated Rice), and EGk (k-th order Exponential Golomb), for most SEs. Some other SEs, such as CALR (coeff_abs_level_remaining) and QP_Delta (cu_qp_delta_abs), utilize combinations (prefix and suffix) of two or more of these basic binarization processes [15, 16]. There are also simplified custom binarization formats, mainly based on LUTs, for other SEs like Inter Pred Mode, Intra Pred Mode, and Part Mode.
Figure 10. General hardware architecture of a binarizer.

The output bin strings and their bin lengths are temporarily stored in the bins FIFO. Depending on the bin type (regular bins or bypass bins), the de-multiplexer separates and routes them to the context bin encoder or the bypass bin encoder. While bypass bins are encoded in a simpler manner that does not require estimating their probability, regular bins need to have their appropriate probability models determined for encoding. The output bins are put into the Bit Generator to form the output bit-stream of the encoder.

The general hardware architecture of the CABAC context modeler is illustrated in Figure 12. At the beginning of each coding process, it is necessary to initialize the context for CABAC according to the standard specifications, at which point the context table is loaded from ROM. Depending on the syntax element data, the bin string from the binarizer, and neighbor data, the controller calculates the appropriate address to access and load the corresponding probability model from the context memory for encoding the current bin. Once the encoding of the current bin is completed, the context model is updated and written back to the context RAM for encoding the next bin (Figure 11).

Figure 12. General hardware architecture of a context modeller [7].

The Binary Arithmetic Encoder (BAE) is the last process in the CABAC architecture; it generates encoded bits based on the input bin from the Binarizer and the corresponding probability model from the Context Modeler. As illustrated in Figure 9 (CABAC architecture), depending on the bin type (bypass or regular), the current bin is routed into the bypass coding engine or the context coding engine. The former is implemented much more simply, without context selection and range updating. The coding algorithm of the latter is depicted in Figure 13.

Figure 13. Encoding algorithm of a regular coded bin (recommended by ITU-T).

Figure 14 presents our proposed BAE architecture with multiple bypass bin processing to improve efficiency. The process of the BAE
can be divided into four stages: sub-interval division (stage 1 - packet information extraction and rLPS look-up), range updating (stage 2 - range renormalization and pre-multiple bypass bin multiplication), low updating (stage 3 - low renormalization and outstanding bit look-up), and bits output (stage 4 - coded bit construction and calculation of the number of valid coded bits). The inputs to our architecture are encapsulated into packets in order to enable multiple-bypass-bin processing. Each packet can be a regular or terminate bin, or even a group of bypass bins. The detailed implementation of these stages can be found in our previous work [17].

Figure 14. Hardware implementation of regular bin encoding [17].

2.2.3. Differences between context-adaptive binary arithmetic coding in high-efficiency video coding and the one in H.264/AVC

In terms of the CABAC algorithm, binary arithmetic coding in HEVC is the same as in H.264: it is based on recursive sub-interval division to generate output coded bits for input bins [7]. However, because HEVC exploits several new coding tools and throughput-improvement-oriented techniques, the statistics of bin types are significantly changed compared to H.264, as shown in Table 1.

Table 1. Statistics of bin types in HEVC and H.264/AVC standards [8]

Standard     Common condition configuration   Context (%)   Bypass (%)   Terminate (%)
H.264/AVC    Hierarchical B                   80.5          13.6         5.9
H.264/AVC    Hierarchical P                   79.4          12.2         8.4
HEVC         Intra                            67.9          32.0         0.1
HEVC         Low delay P                      78.2          20.8         1.0
HEVC         Low delay B                      78.2          20.8         1.0
HEVC         Random access                    73.0          26.4         0.6

Obviously, in most common condition configurations, HEVC shows a smaller portion of context coded bins and terminate bins, whereas bypass bins occupy a considerable portion of the total number of input bins. HEVC also uses fewer contexts (154) than H.264/AVC (441) [1, 8]; hence HEVC consumes less memory for context storage than H.264/AVC, which leads to better hardware cost. Coefficient level syntax elements, which represent residual data, occupy up to 25% of the total bins in CABAC. While H.264/AVC utilizes the TrU+EGk binarization method for this type of syntax element, HEVC uses TrU+FL (Truncated Rice), which generates fewer bins (15 vs. 53) [7, 8]. This alleviates the workload of binary arithmetic encoding, which contributes to enhancing CABAC throughput performance. The method of characterizing syntax elements for coefficient levels in HEVC is also different from that of H.264/AVC, which makes it possible to group the same context coded bins and to group bypass bins together for throughput enhancement, as illustrated in Figure 15 [8].
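To make the regular-bin flow of Figure 13 concrete, the following sketch models one coding step in software. The tables here are small illustrative stand-ins (our own, not the 64-state tables of the standard), so the state values are hypothetical; the control flow, however, follows the figure: rLPS look-up from the probability state and two range bits, MPS/LPS interval selection, MPS flip at state 0, state transition, then renormalization.

```python
# Illustrative 4-state stand-in for the HEVC rLPS table, indexed by
# probability state and the two quantizer bits of range ((range >> 6) & 3).
RLPS = [
    [128, 167, 197, 227],   # state 0: LPS still likely
    [80, 104, 123, 141],
    [40, 52, 62, 70],
    [2, 2, 2, 2],           # state 3: LPS very unlikely
]
NEXT_MPS = [1, 2, 3, 3]     # state transition after coding an MPS
NEXT_LPS = [0, 0, 1, 2]     # state transition after coding an LPS

def encode_regular(rng, low, state, val_mps, bin_val):
    """One regular-bin step; renormalization keeps rng in [256, 510].
    Output-bit emission during renormalization is omitted for clarity."""
    r_lps = RLPS[state][(rng >> 6) & 3]
    r_mps = rng - r_lps
    if bin_val != val_mps:          # LPS path: take the small sub-interval
        low += r_mps
        rng = r_lps
        if state == 0:              # at state 0 the MPS value flips
            val_mps = 1 - val_mps
        state = NEXT_LPS[state]
    else:                           # MPS path: take the large sub-interval
        rng = r_mps
        state = NEXT_MPS[state]
    while rng < 256:                # renormalization
        rng <<= 1
        low <<= 1
    return rng, low, state, val_mps
```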
Figure 15. Grouping of same regular bins and bypass bins to increase throughput.

This arrangement of bins gives better chances to propose parallelized and pipelined CABAC architectures. The overall differences between HEVC and H.264/AVC in terms of input workload and memory usage are shown in Table 2.

Table 2. Reduction in workload and memory of HEVC over H.264/AVC [8]

Metric                    H.264/AVC      HEVC          Reduction
Max regular coded bins    7825           882           9x
Max bypass bins           13056          13417         1x
Max total bins            20882          14301         1.5x
Number of contexts        441            154           3x
Line buffer for 4K×2K     30720          1024          30x
Coefficient storage       8×8×9-bits     4×4×3-bits    12x
Initialization table      1746×16-bits   442×8-bits    8x

3. High-efficiency video coding context-adaptive binary arithmetic coding implementations: state-of-the-art

3.1. High throughput design strategies

In HEVC, all components of CABAC have been modified, in terms of both algorithms and architectures, for throughput improvement. For the Binarization and Context Selection processes, there are commonly four techniques to improve the throughput of CABAC in HEVC: reducing the number of context coded bins, grouping bypass bins together, grouping bins with the same context together, and reducing the total number of bins [7]. These techniques have strong impacts on the architectural design strategies of the BAE in particular, and of the whole CABAC as well, for throughput improvement targeting 4K and 8K UHD video applications.

a) Reducing the number of context coded bins

The HEVC algorithm significantly reduces the number of context coded bins for syntax elements such as motion vectors and coefficient levels. The underlying cause of this reduction is the relative proportion of context coded bins and bypass coded bins. While H.264/AVC uses a large number of context coded bins for syntax elements, HEVC uses context coding only for the first few bins, and the remaining bins are bypass coded. Table 3 summarizes the reduction in context coded bins for various syntax elements.

Table 3. Reduction in the number of context coded bins [9]

Syntax element                        AVC    HEVC
Motion vector difference              9      2
Coefficient level                     14     1 or 2
Reference index                       31     2
Delta QP                              53     5
Remainder of intra prediction mode    3      0

b) Grouping of bypass bins

Once the number of context coded bins is reduced, bypass bins occupy a significant portion of the total bins in HEVC. Therefore, the overall CABAC throughput can be notably improved by applying a technique called "grouping of bypass bins" [9]. The underlying principle is to process multiple bypass bins per cycle. Multiple bypass bins can only be processed in the same cycle if they appear consecutively in the bin stream [7]. Thus, long runs of bypass bins result in higher throughput than frequent switching between bypass and context coded bins. Table 4 summarizes the syntax elements where bypass grouping is used.

Table 4. Syntax elements for grouping of bypass bins [9]

Syntax element                        Nbr of SEs
Motion vector difference              2
Coefficient level                     16
Coefficient sign                      16
Remainder of intra prediction mode    4
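Why consecutive bypass bins can be folded into one operation is easy to see in a small model (our own illustration, not a specific paper's design): bypass coding simply doubles the interval per bin, so a run of n bypass bins collapses into one shift-and-add, which is exactly what multi-bypass-bin hardware exploits:

```python
def bypass_one(low, rng, bit):
    """Encode a single bypass bin (renormalization/output omitted):
    the interval doubles and the LPS half is added when bit == 1."""
    return (low << 1) + bit * rng

def bypass_group(low, rng, bits):
    """Encode a whole run of bypass bins in one step: algebraically
    identical to folding bypass_one over the run."""
    v = 0
    for b in bits:
        v = (v << 1) | b        # the run read as one binary number
    return (low << len(bits)) + v * rng

# One-step group encoding matches four single-bin steps:
low, rng = 0x123, 0x1AB
step = low
for b in [1, 0, 1, 1]:
    step = bypass_one(step, rng, b)
assert step == bypass_group(low, rng, [1, 0, 1, 1])
```

The equivalence holds because each doubling shifts the earlier contributions left, so the per-bin additions accumulate into `v * rng`.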
c) Grouping bins with the same context

Processing multiple context coded bins in the same cycle is another method to improve CABAC throughput. This often requires speculative calculations for context selection. The amount of speculative computation, which causes critical path delay, increases if bins using different contexts and context selection logic are interleaved. Thus, to reduce speculative computation, and hence critical path delay, bins should be reordered such that bins with the same contexts and context selection logic are grouped together, so that they are likely to be processed in the same cycle [4, 8, 9]. This also reduces context switching, resulting in fewer memory accesses, which increases throughput and reduces power consumption.

d) Reducing the total number of bins

The throughput of CABAC can be enhanced by reducing its workload, i.e., decreasing the total number of bins that it needs to process. For this technique, the total number of bins was reduced by modifying the binarization algorithm of coefficient levels. Coefficient levels account for a significant portion, on average 15 to 25%, of the total number of bins [18]. In the binarization process, unlike the combined TrU + EGk of AVC, HEVC uses a combined TrU + FL, which produces a much smaller number of output bins, especially for coefficient values above 12. As a result, the total number of bins in HEVC was reduced by 1.5x on average compared to AVC [18].

The Binary Arithmetic Encoder is considered the main cause of the throughput bottleneck, as it contains several loops due to data dependencies and critical path delays. Fortunately, by analyzing and exploiting statistical features and the serial relations between the BAE and the other CABAC components to alleviate these dependencies and delays, the throughput performance can be substantially improved [4]. This was the result of a series of modifications in BAE architectures and hardware implementations, such as parallel multiple BAEs, pipelined BAE architectures, multiple-bin single BAE cores and high-speed BAE cores [19].

The objective of these solutions is to increase the product of the number of processed bins per clock cycle and the clock speed. In hardware designs for high-performance purposes, these two criteria (bins/clock and clock speed) have to be traded off for each specific circumstance, as depicted in the example of Figure 16.

Figure 16. Relationship between throughput, clock frequency and bins/cycle [19].

Over the past five-year period, there has been a significant effort from various research groups worldwide focusing on hardware solutions to improve the throughput performance of HEVC codecs in general, and of CABAC in particular. Table 5 and Figure 18 show highlighted works in CABAC hardware design for high performance.

Throughput performance and hardware design cost are the two main design criteria in the above works. Obviously, they are contrary and have to be traded off during design for specific applications. The chart shows that some works achieved high throughput with large area cost [14, 19] and vice versa [11-13]. Some others [20-22] achieved very high throughput but consumed moderate, or even low, area. This does not conflict with the above conclusion, because these works only focused on BAE design, thus consuming less area than those addressing the whole CABAC implementation. These designs usually achieve significant throughput improvements because the BAE is the main throughput bottleneck in the
CABAC algorithm and architecture. Therefore, The key techniques and strategies that
its improvement has huge effects on the exploited in this work are based on analyzing
overall design (Figure 17). statistics and characteristics of residual Syntax
Peng et al. [11] proposed a CABAC hardware Elements (SE). These residual data bins occupy a
architecture, as shown in Figure 19 which not significant portion in total bins of CABAC, thus
only supports high throughput by a parallel an efficient coding method of this type of SE will
strategy but also reduce hardware cost. contribute to the whole CABAC implementation.
j
Table 5. State-of-the-art high-performance CABAC implementations

Work (year)    Bins/clk  Max freq. (MHz)  Max throughput (Mbins/s)  Tech. (nm)  Area (kGates)  Design strategy
[11] (2013)    1.18      357              439                       130         48.94          Parallel CM (whole CABAC design)
[12] (2015)    2.37      380              900                       130         31.1           Area-efficient multi-bin binarizer and parallel BAE
[13] (2015)    1         158              158                       180         45.1           Fully pipelined CABAC
[23] (2017)    3.99      625              2499                      65          11.2           Combined parallel/pipelined BAE
[21] (2018)    4.94      537              2653                      65          33             8-stage pipelined multi-bin BAE
[20] (2016)    4.07      436.7            1777                      40          20.39          High-speed multi-bin BAE
[22] (2016)    1.99      1110             2219                      65          5.68           4-stage pipelined BAE architecture
[19] (2014)    4.37      420              1836                      90          111            Combined parallel/pipelined BAE and CM
[14] (2013)    4.4       402              1769                      65          148            High-speed multi-bin pipelined CABAC
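As a quick consistency check on Table 5, the maximum-throughput column is simply bins-per-clock multiplied by the maximum frequency. For most rows the product matches the reported Mbins/s to within rounding; a few rows (e.g. [11], [23]) deviate somewhat more, presumably because the published bins/clock figures are themselves rounded. The snippet below recomputes four rows, with the numbers copied from the table:

```python
# Table 5 sanity check: max throughput (Mbins/s) = bins/clk x max
# frequency (MHz). Values copied from the table above.
rows = [  # (work, bins/clk, max freq MHz, reported Mbins/s)
    ("[12]", 2.37, 380, 900),
    ("[21]", 4.94, 537, 2653),
    ("[19]", 4.37, 420, 1836),
    ("[14]", 4.40, 402, 1769),
]
for work, bpc, freq, reported in rows:
    print(work, bpc * freq, reported)
```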
The authors propose a method of rearranging the SE structure, context selection and binarization to support the parallel architecture and hardware reduction. Firstly, the SEs representing residual data [6] (last_significant_coeff_x, last_significant_coeff_y, coeff_abs_level_greater1_flag, coeff_abs_level_greater2_flag, coeff_abs_level_remaining and coeff_sign_flag) in a coded sub-block are grouped by their types, as they have independent context selection. Then the context-coded and bypass-coded bins are separated.

Figure 18. High performance CABAC hardware implementations.

The rearranged structure of SEs for residual data is depicted in Figure 20. This proposed technique allows context selections to be parallelized, improving context-selection throughput by 1.3x on average. Because the bypass-coded bins are grouped together, they are encoded in parallel, which contributes to throughput improvement as well. A PISO (Parallel In Serial Out) buffer is inserted in the CABAC architecture to harmonize the processing-speed differences between the CABAC sub-modules.

Figure 19. CABAC encoder with proposed parallel CM [11].
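The role of such a PISO buffer can be captured in a minimal behavioral model: the binarizer writes groups of bins in parallel, and the serial BAE drains one bin per cycle. The depth and stall policy below are assumptions for illustration, not details taken from [11].

```python
# Minimal model of a PISO (Parallel In Serial Out) buffer that decouples
# a multi-bin binarizer from a serial BAE. Depth/stall policy assumed.
from collections import deque

class PISO:
    def __init__(self, depth):
        self.buf = deque()
        self.depth = depth

    def push(self, bins):
        """Parallel write from the binarizer; caller stalls on overflow."""
        if len(self.buf) + len(bins) > self.depth:
            raise OverflowError("PISO full: binarizer must stall")
        self.buf.extend(bins)

    def pop(self):
        """Serial read: one bin per BAE clock cycle (None when empty)."""
        return self.buf.popleft() if self.buf else None
```

The buffer absorbs bursts from the parallel front end, so the binarizer only stalls when the BAE falls far enough behind to fill the queue.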
Figure 20. Syntax structure of residual data in CABAC encoder [11].

For hardware cost reduction, the design of a context-based adaptive CALR binarization hardware architecture can save hardware resources while maintaining throughput performance. The bin length is adaptively updated in accordance with the cRice parameter (cRiceParam). The hardware solution for the CALR binarization process applied in the CABAC design is shown in Figure 21.

Figure 21. Adaptive binarization implementation of CALR [11].

D. Zhou et al. [19] focus on the design of an ultra-high-throughput VLSI CABAC encoder that supports UHDTV applications. By analyzing the CABAC algorithms and data statistics, the authors propose and implement in hardware a series of throughput-improvement techniques (pre-normalization, Hybrid Path Coverage, look-ahead rLPS, bypass bin splitting and State Dual Transition). The objectives of these techniques and design solutions are both critical-path delay reduction and an increase in the number of bins processed per clock cycle, thus improving CABAC throughput. To support multi-bin BAE, they proposed a cascaded 4-bin BAE, as shown in Figure 22.

Figure 22. Proposed hardware architecture of cascaded 4-bin BAE [19].

In Figure 22, because of the bin-to-bin dependency and the critical delay in stage 2 of the BAE process, the cascaded architecture further expands this delay, which degrades the clock speed and hence the throughput performance. Two techniques (pre-norm and HPC) are applied to solve this issue: pre-norm shortens the critical delay of stage 2, and HPC reduces the cascaded 4-bin processing time. The pre-norm implementation in Figure 23(a) shows the original stage 2 of the BAE architecture, while in Figure 23(b) the normalization is moved from stage 2 to stage 1, which incurs much less processing delay.

To further support the cascaded 4-bin BAE architecture, they proposed LH-rLPS to alleviate the critical delay of range updating in this multi-bin architecture. The conventional architecture is illustrated in Figure 24, where the cascaded 2-bin range updating undergoes two LUTs.
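The serial-LUT problem and the look-ahead fix can be modeled in a few lines. This is an illustrative sketch, not the circuit of [19]: the table contents are placeholders, and a real design prefetches the second-stage candidates in parallel hardware rather than in a list comprehension.

```python
# Behavioral model of look-ahead rLPS (LH-rLPS) for 2 cascaded bins.
# Conventionally the 2nd table lookup must wait for the 1st range
# update; look-ahead fetches all candidate 2nd-stage rLPS values up
# front, so the 1st bin's result only drives a 4-to-1 select.
# Placeholder table, not the HEVC rLPS table.
RLPS = [[(16 + s) * (q + 1) for q in range(4)] for s in range(64)]

def rlps(state, rng):
    return RLPS[state][(rng >> 6) & 3]

def two_bin_serial(rng, s1, s2):
    r1 = rng - rlps(s1, rng)       # 1st LUT
    return r1 - rlps(s2, r1)       # 2nd LUT waits for r1 (critical path)

def two_bin_lookahead(rng, s1, s2):
    r1 = rng - rlps(s1, rng)
    cand = [RLPS[s2][q] for q in range(4)]   # prefetched candidates
    return r1 - cand[(r1 >> 6) & 3]          # r1 only selects one
```

Both paths compute the same final range; the point of the transformation is that a multiplexer select is much faster in hardware than a second full table lookup on the critical path.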
A BPBS scheme is proposed to split the bypass bins from the bin stream; together with their relative positions, they are forwarded through a dedicated channel, a PIPO (Parallel In Parallel Out) storage, until being re-merged into the bin stream before the low-update stage, as shown in Figure 27. This will result in improved BAE throughput, as it becomes possible to …
In the proposed CABAC hardware architecture, the binarization module is designed as shown in Figure 31, where multiple SEs can be processed in parallel for the following pipeline stage of the whole architecture.
… working at high speed. This pipelined architecture is also applied in the two-bin BAE architecture to improve throughput performance (Figures 35, 33).
In the work by B. Vizzotto et al. [12], for the purpose of encoding UHD video contents, an area-efficient and high-throughput CABAC encoder architecture is presented. To achieve the desired objectives, two design strategies are proposed to modify the CABAC hardware architecture. Firstly, a parallel binarization architecture is designed to meet the requirement of high throughput, as depicted in Figure 38. The proposed architecture supports encoding multiple SEs for a multi-bin CABAC architecture. However, instead of a parallel binarizer for each format, which would consume large hardware area, a heterogeneous eight-functional-core binarizer is presented to save area cost. This heterogeneous architecture consists of eight cores that can process up to 6 SEs per clock cycle due to the duplication of the Custom, TU and EGk cores in the design (Figure 5).

Figure 38. Parallel binarization architecture [12].

Figure 39. Renormalization architecture RenormE in BAE [12].

The second solution focuses on speeding up the renormalization process of the BAE. Based on the Leading Zero Detector (LDZ) proposal, …
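The LDZ idea can be sketched in software. The reference renormalization (RenormE in the HEVC specification) shifts ivlCurrRange one bit per loop iteration until it returns to [256, 511], whereas a leading-zero detector yields the whole shift amount in a single step. The snippet below is a behavioral model, not the RenormE datapath of [12]; carry propagation and bit output on ivlLow are omitted.

```python
# Renormalization of a 9-bit range value in [1, 511]: iterative
# (reference-style) vs. single-shot using a leading-zero count.

def renorm_iterative(rng):
    shift = 0
    while rng < 256:      # one shift per iteration, as in RenormE
        rng <<= 1
        shift += 1
    return rng, shift

def renorm_ldz(rng):
    # leading zeros above bit 8 give the full shift amount at once
    shift = 9 - rng.bit_length()
    return rng << shift, shift
```

Replacing the data-dependent loop with one shift of a computed width is what lets the renormalization fit into a single pipeline stage.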
Once the bypass bins are separated from the regular bins to avoid the critical processing in the range update, it becomes possible to process multiple bypass bins in the following stage, i.e. low updating, for throughput improvement. To utilize MBBP in the 8-stage pipeline BAE core without degrading the operating frequency, it is necessary to separate and pipeline the low-update process for bypass bins, as in the proposed architecture shown in Figure 40. In this hardware implementation, the low-update algorithm for bypass bins is separated into two sub-stages realizing two multiply operations, which balances the processing delay against the other stages of the pipeline. In addition, these multiply operations are replaced by combinations of adders, shifters and multiplexers to reduce critical delay and area cost.

In this 8-stage pipeline BAE architecture, all the advanced throughput-improvement techniques (PN rLPS, LH rLPS and BPBS) are integrated. The first stage is rLPS pre-selection using the PN rLPS technique. The second is regular-bin range updating that applies LH rLPS in a seven-core scheme, followed by PIPO write/read of the regular buffer before re-merging with the bypass stream at the fourth stage. As mentioned, the next two stages cater for the low-updating process with a five-core scheme. The seventh stage is a five-core OB (Output Bits) stage, separated from low updating to reduce processing delay. The last stage performs final bit generation.

3.2. Low power design strategies

The application of low-power techniques (clock gating, power gating, DVFS) to the design of HEVC hardware architectures for low power consumption is quite diverse and complicated, …
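Returning to the multi-bypass-bin processing discussed above: it rests on a simple identity. Bypass bins split the interval evenly, so n serial low updates collapse into one shift plus one multiply. The sketch below is a behavioral model only; carry handling and bit output of a real BAE are omitted.

```python
# MBBP identity: serially, low = 2*low (+ range if bin == 1) per bypass
# bin; for an n-bit value v this collapses to low = (low << n) + v*range.

def bypass_serial(low, rng, bits):
    for b in bits:
        low = (low << 1) + (rng if b else 0)   # one bin per step
    return low

def bypass_multi(low, rng, bits):
    v = int("".join(map(str, bits)), 2) if bits else 0
    return (low << len(bits)) + v * rng        # all n bins at once
```

This is why the pipelined low-update stages above need only two multiplies (later reduced to adder/shifter/multiplexer networks) to absorb a whole group of bypass bins per cycle.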
Figure 43. Four-core parallel binarization architecture [24].

For high-throughput requirements, a four-Binarization-Core (BC) architecture is proposed (Figure 43), in which the hardware architecture of each BC is shown in Figure 42. Based on the statistical analysis presented above, an AND-based operand isolation technique is inserted into each BC for low-power purposes, as shown in Figure 44. Except for the FL format, which occupies a significant portion of the binarization workload, the proposed low-power technique is embedded into the inputs of all other binarization processes. The less frequently used binarization formats are then deactivated and isolated. As a result, the power consumption of the binarizer is reduced by 20% on average.

Figure 44. AND-based operand isolation for low-power binarization architecture [24].

The BAE is another sub-module of CABAC to which low-power techniques can be applied for energy saving in hardware design. Unlike the binarizer, which exploits the statistical analysis of SE types to derive appropriate low-power solutions, the BAE works with bin types (regular, bypass and MPS/LPS). Using the same method, the statistical analysis concluded that regular bins occur more frequently than bypass ones, and that MPS regular bins tend to occur in longer bursts than LPS regular bins. Therefore, when bypass bins and regular bins are separated by the BPBS technique and the bypass bins are grouped together, it is possible to turn off the bypass-bin processing path while long bursts of regular bins are processed, and vice versa. In the regular-bin processing path, it is also possible, as a proposed power saving, to turn off the LPS-bin processing part of the BAE architecture. These power-saving solutions in the BAE architecture can be realized by exploiting the clock gating technique.

Figure 45. Clock Gating for low-power BAE architecture [26].

Ramos et al. [26] proposed a novel Multiple Bypass Bins Processing (MBBP) low-power BAE architecture that supports 8K UHD video. They proposed a multiple-bin architecture that can process several bins per clock, which makes it possible to minimize the clock frequency while still satisfying 8K UHD video applications: the lower the applied frequency, the lower the power consumption. Moreover, this is also a four-stage pipeline BAE, in which the clock gating technique can be applied to the pipeline stage registers based on the data path for each kind of bin, as stated above. These clock-gated pipeline registers contribute to energy saving through an appropriate control mechanism. As a result, the BAE architecture shown in Figure 45 is capable of processing 8K UHD video sequences at the minimum clock frequency, which leads to a power saving of 14.26%.

4. New research trends

Since the compression efficiency of HEVC is almost double that of H.264/AVC, HEVC is considered the promising standard for today's video applications. However, the improvement in compression efficiency comes at the cost of increased computational complexity, large processing delays, and higher resource and energy consumption compared to H.264/AVC. These fundamental problems significantly affect the realization of the HEVC standard in today's video applications. The emerging challenges lie in exploiting the HEVC standard in widespread video applications, where real-time, high-quality UHD video is transmitted over limited-bandwidth wireless media (broadcast TV, mobile networks, satellite communication and TV) and network delivery [27]. In these video services, it is necessary to transmit a large amount of real-time video data over a bandwidth-limited transmission medium with unstable channel quality, which affects video quality. In addition, most of the terminal devices (mobile phones, tablets, camcorders…) in these video transmission systems are resource-constrained in terms of storage, computation and processing capacity, energy (battery) and network bandwidth, which is a further hindrance to the realization of the HEVC standard [28, 29].

Recently, both the academic and industrial sectors have focused on exploring solutions to overcome the above challenges, and it can be predicted that these research trends will continue as long as the HEVC standard has not been fully adopted into modern video services [30]. It is obvious that these challenges are posed to transmission media operators and video service providers as well as terminal holders. Thus, a complete solution is accomplished by contributions from all of the above, and research trends are categorized accordingly. For the objective of utilizing the HEVC standard in video transmission systems, the ongoing research trends can be divided into the following themes:

- Optimizing algorithms and hardware architectures for HEVC codecs to meet the demands of resource-constrained devices.
- Developing encoding schemes that adapt to network conditions when applied to live broadcasting and real-time streaming over wireless networks.
- Proposing reconfigurable video coding (RVC) architectures that are able to effectively adopt the HEVC standard into existing systems and infrastructures, where previous standards have already been integrated.

The first theme is the ongoing conventional research direction and can be considered the underlying foundation on which the others are based. Mobile video applications have started to dominate global mobile data traffic in recent years. In most mobile video communication systems, users are equipped with mobile devices that are typically resource-constrained in terms of storage, computation and processing capacity, energy (battery) and network bandwidth [28, 29]. This issue has already been addressed and has drawn tremendous research since the first version of the HEVC standard. As mentioned in previous sections, to support resource-constrained applications, most of the components of the HEVC architecture have been assessed and amended to enhance performance. However, such improvement is always in high demand to better support future video applications. Applying convolutional neural networks and machine learning to propose adaptation algorithms, e.g., QP adaptation, could be a potential method for this research direction [31, 32]. Additionally, more flexible and adaptive encoding schemes for integrating HEVC into existing infrastructure at
any transmitting-media condition are also promising directions. Thus, there should be adaptive algorithms to estimate the HEVC encoder model parameters and perform online optimization of the encoder coding configuration [33].

For the second research direction, the advance in coding efficiency of the HEVC standard makes it a candidate for limited-bandwidth video communication systems, particularly over wireless media such as TV broadcasting, satellite communication and mobile networks. In these wireless media, the quality of service depends not only on the channel bandwidth but also heavily on variations in environmental conditions [29]. The challenge arises in the application scenarios of live broadcasting and real-time streaming over wireless networks. These services impose very high rates and large amounts of data traversing the networks. The feasible solution to this challenge is to dynamically adapt encoding schemes to network conditions for a better trade-off between quality of service and efficiency in exploiting encoding and network capabilities.

Rate control has always been a potential research area supporting dynamically adaptive encoding solutions in wireless live video services. However, there has been insufficient research on rate control for HEVC/H.265, and the algorithmic and computational complexities of HEVC rate-control schemes are higher than those of previous standards. This obstacle has been hindering the adoption of HEVC/H.265 for real-time streaming over mobile wireless networks. Therefore, it is necessary to develop low-complexity and highly efficient rate-control schemes for H.265/HEVC to improve its network adaptability and enable its application to various mobile wireless streaming scenarios [34]. Besides rate-control methods, content-aware segment-length optimization (at the GOP level) and tile-based encoding schemes allow effective deployment of HEVC in MPEG DASH (Dynamic Adaptive Streaming over HTTP) [35]. This type of HEVC application is an emerging research area because of the increasingly high traffic of high-quality live video over the Internet. However, there have been few related studies on this theme of HEVC for MPEG DASH applications.

The last theme of future research trends in HEVC can be predicted as follows. There are numerous video coding standards that are mutually incompatible (in syntax and encoded data stream). Nevertheless, most of the coding tools supported by these standards are the same; the sole difference is the input parameters used for each of these tools in a specific standard. Hence, there may be a change of paradigm in the development of future codecs, named RVC. This new design paradigm allows implementing a set of common video coding functional blocks; then, depending on the requirements of a given standard, appropriate parameters are chosen for each block. Consequently, the encoded bit-stream will include descriptive information about these blocks for decoding [30].

5. Conclusion

In this survey, an overview of the latest video coding standard, HEVC/H.265, is given, and its advancements, especially the detailed developments of CABAC, are discussed and summarized. The doubled coding efficiency of HEVC compared with its predecessor is the result of a series of modifications in most components and leads to a prospective integration of the standard into modern video applications. However, significant increases in computational requirements, processing delay and, consequently, energy consumption have hindered this progression. A literature review of hardware architectures and implementation strategies for highly efficient CABAC targeting high-quality, UHD-resolution, real-time applications is provided. Challenges in the design and application of HEVC will always exist, as video applications diversify with human demand for a progressively better visual experience. Thus, the survey also addresses the challenges of utilizing HEVC in different video application areas and predicts several future research trends.