VNU Journal of Science: Comp. Science & Com. Eng., Vol. 35, No. 2 (2019) 1-22
Original Article
A Survey of High-Efficiency Context-Adaptive Binary
Arithmetic Coding Hardware Implementations
in High-Efficiency Video Coding Standard
Dinh-Lam Tran, Viet-Huong Pham, Hung K. Nguyen, Xuan-Tu Tran*
Key Laboratory for Smart Integrated Systems (SISLAB),
VNU University of Engineering and Technology, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
Received 18 April 2019
Revised 07 July 2019; Accepted 20 August 2019
Abstract: High-Efficiency Video Coding (HEVC), also known as H.265 and MPEG-H Part 2, is the newest video coding standard, developed to address the increasing demand for higher resolutions and frame rates. In comparison to its predecessor H.264/AVC, HEVC achieves almost double the compression performance and is capable of processing high-quality video sequences (UHD 4K, 8K; high frame rates) in a wide range of applications. Context-Adaptive Binary Arithmetic Coding (CABAC) is the only entropy coding method in HEVC, and its principal algorithm is inherited from its predecessor. However, several aspects of the way the method is exploited in HEVC are different, so HEVC CABAC supports better coding efficiency. Pipelining and parallelism in CABAC hardware architectures are promising methods for the implementation of high-performance CABAC designs. However, the high data dependence and the serial nature of bin-to-bin processing in the CABAC algorithm pose many challenges for hardware designers. This paper provides an overview of CABAC hardware implementations for HEVC targeting high-quality, low-power video applications, addresses the challenges of exploiting CABAC in different application scenarios, and then recommends several predicted research trends for the future.
Keywords: HEVC, CABAC, hardware implementation, high throughput, power saving.
_______
* Corresponding author. E-mail address: tutx@vnu.edu.vn
https://doi.org/10.25073/2588-1086/vnucsce.233

1. Introduction

ITU-T/VCEG and ISO/IEC-MPEG are the two dominant international organizations that have developed video coding standards [1]. The ITU-T produced H.261 and H.263, while the ISO/IEC produced MPEG-1 and MPEG-4 Visual; these two organizations then jointly produced the H.262/MPEG-2 Video and H.264/MPEG-4 Advanced Video Coding (AVC) standards. The two jointly-developed standards have had a particularly strong impact and have found their way into a wide variety of products that are increasingly prevalent in our daily lives. As the diversity of services and the
popularity of HD and beyond-HD video formats (e.g., 4K×2K or 8K×4K resolutions) have become an emerging trend, it is necessary to have higher coding efficiency than that of H.264/MPEG-4 AVC. This resulted in the newest video coding standard, called High Efficiency Video Coding (H.265/HEVC), developed by the Joint Collaborative Team on Video Coding (JCT-VC) [2]. The HEVC standard has been designed to achieve multiple goals, including coding efficiency, ease of transport system integration, and data loss resilience. The new video coding standard offers a much more efficient level of compression than its predecessor H.264, and is particularly suited to higher-resolution video streams, where the bandwidth savings of HEVC are about 50% [3, 4]. Besides maintaining coding efficiency, processing speed, power consumption and area cost also need to be considered in the development of HEVC, to meet the demands for higher resolutions, higher frame rates, and battery-based applications.

Context Adaptive Binary Arithmetic Coding (CABAC), which is one of the entropy coding methods in H.264/AVC, is the only form of entropy coding exploited in HEVC [7]. Compared to other forms of entropy coding, such as context adaptive variable length coding (CAVLC), HEVC CABAC provides considerably higher coding gain. However, due to several tight feedback loops in its architecture, CABAC is a well-known throughput bottleneck in the HEVC architecture, as it is difficult to parallelize and pipeline. In addition, this also leads to high computation and hardware complexity during the development of CABAC architectures for targeted HEVC applications. Since the standard was published, numerous studies worldwide have proposed hardware architectures for HEVC CABAC that trade off multiple goals, including coding efficiency, high throughput performance, hardware resources, and low power consumption.

This paper provides an overview of HEVC CABAC and of the state-of-the-art works relating to the development of high-efficiency hardware implementations that provide high throughput performance and low power consumption. Moreover, the key techniques and corresponding design strategies used in CABAC implementations to achieve the above objectives are summarized.

Following this introductory section, the remaining part of this paper is organized as follows: Section 2 is a brief introduction to the HEVC standard, the CABAC principle and its general architecture. Section 3 reviews state-of-the-art CABAC hardware architecture designs and assesses these works in detail from different aspects. Section 4 presents the evaluation and prediction of forthcoming research trends in CABAC implementation. Some conclusions and remarks are given in Section 5.

2. Background of high-efficiency video coding and context-adaptive binary arithmetic coding

2.1. High-efficiency video coding - coding principle and architecture, enhanced features and supported tools

2.1.1. High-efficiency video coding principle

As a successor of H.264/AVC in the development process of video coding standardization, HEVC's video coding layer design is based on conventional block-based hybrid video coding concepts, but with some important differences compared to prior standards [3]. These differences are the method of partitioning image pixels into the basic processing unit, more prediction block partitions, more intra-prediction modes, an additional SAO filter, and additional high-performance supporting coding tools (Tile, WPP). The block diagram of the HEVC architecture is shown in Figure 1.
Figure 1. General architecture of the HEVC encoder [1].

The process of HEVC encoding to generate a compliant bit-stream is typically as follows:

- Each incoming frame is partitioned into square blocks of pixels ranging from 64×64 to 8×8. While coding blocks of the first picture in a video sequence (and of the first picture at each clean random-access point into a video sequence) are intra-prediction coded (i.e., using the spatial correlations of adjacent blocks), for all remaining pictures of the sequence, or between random-access points, inter-prediction coding modes (using the temporal correlations of blocks between frames) are typically used for most blocks. The residual data of the inter-prediction coding mode is generated by selecting reference pictures and motion vectors (MVs) to be applied for predicting the samples of each block. By applying intra- and inter-predictions, the residual data (i.e., the differences between the original block and its prediction) is transformed by a linear spatial transform, which produces transform coefficients. These coefficients are then scaled, quantized and entropy coded to produce coded bit strings. These coded bit strings, together with prediction information, are packed and transmitted in a bit-stream format.

- In the HEVC architecture, the block-wise processes and quantization are the main causes of artifacts in reconstructed samples. Two loop filters are therefore applied to alleviate the impact of these artifacts on the reference data for better predictions.

- The final picture representation (a duplicate of the output of the decoder) is stored in a decoded picture buffer to be used for the prediction of subsequent pictures.

Because the HEVC encoding architecture contains decoding processes identical to those of the decoder to reconstruct the reference data for prediction, and the residual data along with its prediction information is transmitted to the decoding side, the prediction versions generated by the encoder and the decoder are identical.

2.1.2. Enhancement features and supported tools

a. Basic processing unit

Instead of the Macro-block (16×16 pixels) of H.264/AVC, the core coding unit in the HEVC standard is the Coding Tree Unit (CTU), with a maximum size of up to 64×64 pixels. The size of the CTU is variable and selected by the encoder, resulting in better efficiency for encoding higher resolution video formats. Each CTU consists of Coding Tree Blocks (CTBs), each of which includes luma and chroma Coding Blocks (CBs) and associated syntaxes. Each CTB, whose size is variable, is partitioned into CUs, which consist of a luma CB and chroma CBs. In addition, the coding tree structure is also partitioned into Prediction Units (PUs) and Transform Units (TUs). An example of block partitioning of video data is depicted in Figure 2. An image is partitioned into rows of CTUs of 64×64 pixels, which are further partitioned into CUs of different sizes (8×8 to 32×32). The size of the CUs depends on the level of detail of the image [5].

Figure 2. Example of CTU structure in HEVC.
b. Inter-prediction

The major changes in the inter-prediction of HEVC compared with H.264/AVC are in prediction block (PB) partitioning and fractional sample interpolation. HEVC supports more PB partition shapes for inter-picture-predicted CBs, as shown in Figure 3 [6].

In Figure 3, the partitioning modes PART−2N×2N, PART−2N×N, and PART−N×2N (with M=N/2) indicate the cases when the CB is not split, split into two equal-size PBs horizontally, and split into two equal-size PBs vertically, respectively. PART−N×N specifies that the CB is split into four equal-size PBs, but this mode is only supported when the CB size is equal to the smallest allowed CB size.

Figure 3. Symmetric and asymmetric prediction block partitioning.

Besides that, PBs in HEVC can be asymmetric motion partitions (AMPs), in which each CB is split into two different-sized PBs such as PART-2N×nU, PART-2N×nD, PART-nL×2N, and PART-nR×2N [1]. The flexible splitting of PBs enables HEVC to support higher compression performance compared to H.264/AVC.

c. Intra-prediction

HEVC uses block-based intra-prediction to take advantage of spatial correlation within a picture, and it follows the basic idea of angular intra-prediction. However, HEVC has 35 luma intra-prediction modes, compared with 9 in H.264/AVC, thus providing more flexibility and coding efficiency than its predecessor [7]; see Figure 4.

Figure 4. Comparison of intra prediction in HEVC (33 angular modes plus 0: Planar and 1: DC) and H.264/AVC [7].

d. Sample Adaptive Offset filter

The SAO (Sample Adaptive Offset) filter is a new coding tool of HEVC in comparison with H.264/AVC. Unlike the de-blocking filter, which removes artifacts based on block boundaries, SAO mitigates artifacts of samples due to transformation and quantization operations. This tool supports a better quality of reconstructed pictures, hence providing higher compression performance [7].

e. Tile and Wave-front Parallel Processing

Tile is the ability to split a picture into rectangular regions, which helps increase the capability of parallel processing, as shown in Figure 5 [5]. This is because tiles are encoded with some shared header information and are decoded independently. Each tile consists of an integer number of CTUs. The CTUs are processed in raster scan order within each tile, and the tiles themselves are processed in the same way. Prediction based on neighboring tiles is disabled, thus the processing of each tile is independent [5, 7].
Figure 5. Tiles in an HEVC frame [5].

Wave-front Parallel Processing (WPP) is a tool that allows re-initializing CABAC at the beginning of each line of CTUs. To increase the adaptability of CABAC to the content of the video frame, the coder is initialized once the statistics from the decoding of the second CTU in the previous row are available. Re-initialization of the coder at the start of each row makes it possible to begin decoding a row before the processing of the preceding row has been completed. The ability to start coding a row of CTUs before completing the previous one enhances CABAC coding efficiency. As illustrated in Figure 7, a picture is processed by a four-thread scheme, which speeds up the encoding time for high-throughput implementation. To maintain the coding dependencies required for each CTU, such that each one can be encoded correctly once its left, top-left, top and top-right neighbors are already encoded, CABAC should start encoding CTUs in the current row after at least two CTUs of the previous row have finished (Figure 6).

Figure 7. Representation of WPP to enhance coding efficiency.

2.2. Context-adaptive binary arithmetic coding for high-efficiency video coding (principle, architecture) and its differences from the one for H.264

2.2.1. Context-adaptive binary arithmetic coding's principle and architecture

While H.264/AVC uses two entropy coding methods (CABAC and CAVLC), HEVC specifies CABAC as its only entropy coding method. Figure 8 describes the block diagram of the HEVC CABAC encoder. The principal algorithm of CABAC has remained the same as in its predecessor; however, the method used to exploit it in HEVC differs in several aspects (discussed below). As a result, HEVC CABAC supports a higher throughput than that of H.264/AVC, particularly through coding efficiency enhancement and parallel processing capability [1, 8, 9]. This alleviates the throughput bottleneck existing in H.264/AVC, and HEVC therefore becomes the newest video coding standard that can be applied to high-resolution video formats (4K and beyond) and real-time video transmission applications. The most important improvements concern Binarization, Context Selection and Binary Arithmetic Encoding [8].

Figure 8. CABAC encoder block diagram [6].
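As a rough sketch of the dataflow in Figure 8 (our own simplification, with hypothetical function names standing in for the real engines), each syntax element is binarized and every resulting bin is routed by its mode flag to either the regular (context-coded) engine or the bypass engine:

```python
# Hypothetical top-level loop illustrating the regular/bypass mode switch.
# binarize yields (bin_value, is_bypass, context_index) triples;
# encode_regular / encode_bypass stand in for the two coding engines and
# return the bits they emit (possibly none, since arithmetic coding can
# represent several bins in one bit).

def cabac_encode(syntax_elements, binarize, encode_regular, encode_bypass):
    bitstream = []
    for se in syntax_elements:
        for bin_val, is_bypass, ctx_idx in binarize(se):
            if is_bypass:
                bitstream += encode_bypass(bin_val)
            else:
                bitstream += encode_regular(bin_val, ctx_idx)
    return bitstream
```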
Binarization: This is the process of mapping syntax elements into binary symbols (bins). Various binarization forms, such as Exp-Golomb, fixed length, truncated unary and custom formats, are used in HEVC. Combinations of different binarizations are also allowed, where the prefix and suffix are binarized differently, such as truncated Rice (a truncated unary - fixed length combination) or the truncated unary - Exp-Golomb combination [7].

Context Selection: Context modeling and selection are used to accurately model the probability of each bin. The probability of a bin depends on the type of syntax element it belongs to, the bin index within the syntax element (e.g., most significant bin or least significant bin), and the properties of spatially neighboring coding units. HEVC utilizes several hundred different context models, so a large Finite State Machine (FSM) is needed for accurate context selection of each bin. In addition, the estimated probability of the selected context model is updated after each bin is encoded or decoded [7].

Binary Arithmetic Encoding (BAE): The BAE compresses bins into bits (i.e., multiple bins can be represented by a single bit); this allows syntax elements to be represented by a fractional number of bits, which improves coding efficiency. In order to generate bit-streams from bins, BAE involves several processes such as recursive sub-interval division and range and offset updates. The encoded bits represent an offset that, when converted to a binary fraction, selects one of the two sub-intervals, which indicates the value of the decoded bin. After every decoded bin, the range is updated to equal the selected sub-interval, and the interval division process repeats itself. In order to effectively compress the bins to bits, the probability of the bins must be accurately estimated [7].

2.2.2. General CABAC hardware architecture

The CABAC algorithm includes three main functional blocks: the Binarizer, the Context Modeler, and the Arithmetic Encoder (Figure 9). Different hardware architectures of CABAC can be found in [10-14].

Figure 9. General hardware architecture of a CABAC encoder [10].

Besides the three main blocks above, the architecture also comprises several other functional modules such as buffers (FIFOs) and data routers (multiplexers and de-multiplexers). Syntax elements (SEs) from the other processes in the HEVC architecture (residual coefficients, SAO parameters, prediction mode, …) have to be buffered at the input of the CABAC encoder before feeding the Binarizer. The general hardware architecture of the Binarizer in CABAC is characterized in Figure 10.

Based on the SE value and type, the Analyzer & Controller selects an appropriate binarization process, which produces the bin string and bin length accordingly. The HEVC standard defines several basic binarization processes, such as FL (Fixed Length), TU (Truncated Unary), TR (Truncated Rice), and EGk (k-th order Exponential Golomb), for most SEs. Some other SEs, such as CALR (coeff_abs_level_remaining) and QP_Delta (cu_qp_delta_abs), utilize combinations (prefix and suffix) of two or more of these basic binarization processes [15, 16]. There are also simplified custom binarization formats, mainly based on LUTs, for other SEs like Inter Pred Mode, Intra Pred Mode, and Part Mode.
Figure 10. General hardware architecture of a binarizer.

The output bin strings and their bin lengths are temporarily stored in the bins FIFO. Depending on the bin type (regular bins or bypass bins), the de-multiplexer separates and routes them to the context bin encoder or the bypass bin encoder. While bypass bins are encoded in a simpler manner that does not require estimating their probability, regular bins need to have their appropriate probability models determined for encoding. The output bins are put into the Bit Generator to form the output bit-stream of the encoder.

The general hardware architecture of the CABAC context modeler is illustrated in Figure 12. At the beginning of each coding process, it is necessary to initialize the context for CABAC according to the standard specifications, at which point the context table is loaded from ROM. Depending on the syntax element data, the bin string from the binarizer, and neighbor data, the controller calculates the appropriate address to access and load the corresponding probability model from the context memory for encoding the current bin. Once the encoding of the current bin is completed, the context model is updated and written back to the context RAM for encoding the next bin (Figure 11).

Figure 12. General hardware architecture of a context modeller [7].

The Binary Arithmetic Encoder (BAE) is the last process in the CABAC architecture; it generates encoded bits based on the input bin from the Binarizer and the corresponding probability model from the Context Modeler. As illustrated in Figure 9 (CABAC architecture), depending on the bin type (bypass or regular), the current bin is routed into the bypass coding engine or the context coding engine. The former is implemented much more simply, without context selection and range updating. The coding algorithm of the latter is depicted in Figure 13.

Figure 13. Encoding algorithm of a regular coded bin (recommended by ITU-T).

Figure 14 presents our proposed BAE architecture with multiple bypass bin processing to improve efficiency. The process of the BAE
can be divided into four stages: sub-interval division (stage 1 - packet information extraction and rLPS look-up), range updating (stage 2 - range renormalization and pre-multiple bypass bin multiplication), low updating (stage 3 - low renormalization and outstanding bit look-up), and bits output (stage 4 - coded bit construction and calculation of the number of valid coded bits). The inputs to our architecture are encapsulated into packets in order to enable multiple-bypass-bin processing. Each packet can be a regular or terminate bin, or even a group of bypass bins. The detailed implementation of these stages can be found in our previous work [17].

Figure 14. Hardware implementation of regular bin encoding [17].

2.2.3. Differences between context-adaptive binary arithmetic coding in high-efficiency video coding and the one in H.264/AVC

In terms of the CABAC algorithm, binary arithmetic coding in HEVC is the same as in H.264: it is based on recursive sub-interval division to generate output coded bits for input bins [7]. However, because HEVC exploits several new coding tools and throughput-improvement-oriented techniques, the statistics of bin types are significantly changed compared to H.264, as shown in Table 1.

Table 1. Statistics of bin types in HEVC and H.264/AVC standards [8]

Standard     Common condition configuration   Context (%)   Bypass (%)   Terminate (%)
H.264/AVC    Hierarchical B                   80.5          13.6         5.9
H.264/AVC    Hierarchical P                   79.4          12.2         8.4
HEVC         Intra                            67.9          32.0         0.1
HEVC         Low delay P                      78.2          20.8         1.0
HEVC         Low delay B                      78.2          20.8         1.0
HEVC         Random access                    73.0          26.4         0.6

Obviously, in most common condition configurations, HEVC shows a smaller portion of context coded bins and terminate bins, whereas bypass bins occupy a considerable portion of the total number of input bins. HEVC also uses fewer contexts (154) than H.264/AVC (441) [1, 8]; hence HEVC consumes less memory for context storage than H.264/AVC, which leads to better hardware cost. Coefficient level syntax elements, which represent residual data, occupy up to 25% of the total bins in CABAC. While H.264/AVC utilizes the TrU+EGk binarization method for this type of syntax element, HEVC uses TrU+FL (Truncated Rice), which generates fewer bins (15 vs. 53) [7, 8]. This alleviates the workload of binary arithmetic encoding, which contributes to enhancing CABAC throughput performance. The method of characterizing syntax elements for coefficient levels in HEVC is also different from that of H.264/AVC, which makes it possible to group the same context coded bins and to group bypass bins together for throughput enhancement, as illustrated in Figure 15 [8].
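To make the regular-bin flow of Figure 13 concrete, the following sketch models one coding step in software. The tables here are small illustrative stand-ins (our own, not the 64-state tables of the standard), so the state values are hypothetical; the control flow, however, follows the figure: rLPS look-up from the probability state and two range bits, MPS/LPS interval selection, MPS flip at state 0, state transition, then renormalization.

```python
# Illustrative 4-state stand-in for the HEVC rLPS table, indexed by
# probability state and the two quantizer bits of range ((range >> 6) & 3).
RLPS = [
    [128, 167, 197, 227],   # state 0: LPS still likely
    [80, 104, 123, 141],
    [40, 52, 62, 70],
    [2, 2, 2, 2],           # state 3: LPS very unlikely
]
NEXT_MPS = [1, 2, 3, 3]     # state transition after coding an MPS
NEXT_LPS = [0, 0, 1, 2]     # state transition after coding an LPS

def encode_regular(rng, low, state, val_mps, bin_val):
    """One regular-bin step; renormalization keeps rng in [256, 510].
    Output-bit emission during renormalization is omitted for clarity."""
    r_lps = RLPS[state][(rng >> 6) & 3]
    r_mps = rng - r_lps
    if bin_val != val_mps:          # LPS path: take the small sub-interval
        low += r_mps
        rng = r_lps
        if state == 0:              # at state 0 the MPS value flips
            val_mps = 1 - val_mps
        state = NEXT_LPS[state]
    else:                           # MPS path: take the large sub-interval
        rng = r_mps
        state = NEXT_MPS[state]
    while rng < 256:                # renormalization
        rng <<= 1
        low <<= 1
    return rng, low, state, val_mps
```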
Figure 15. Grouping of same regular bins and bypass bins to increase throughput.

This arrangement of bins gives better chances to propose parallelized and pipelined CABAC architectures. The overall differences between HEVC and H.264/AVC in terms of input workload and memory usage are shown in Table 2.

Table 2. Reduction in workload and memory of HEVC over H.264/AVC [8]

Metric                    H.264/AVC      HEVC          Reduction
Max regular coded bins    7825           882           9x
Max bypass bins           13056          13417         1x
Max total bins            20882          14301         1.5x
Number of contexts        441            154           3x
Line buffer for 4K×2K     30720          1024          30x
Coefficient storage       8×8×9-bits     4×4×3-bits    12x
Initialization table      1746×16-bits   442×8-bits    8x

3. High-efficiency video coding context-adaptive binary arithmetic coding implementations: state-of-the-art

3.1. High throughput design strategies

In HEVC, all components of CABAC have been modified, in terms of both algorithms and architectures, for throughput improvement. For the Binarization and Context Selection processes, there are commonly four techniques to improve the throughput of CABAC in HEVC: reducing the number of context coded bins, grouping bypass bins together, grouping bins with the same context together, and reducing the total number of bins [7]. These techniques have strong impacts on the architectural design strategies of the BAE in particular, and of the whole CABAC as well, for throughput improvement targeting 4K and 8K UHD video applications.

a) Reducing the number of context coded bins

The HEVC algorithm significantly reduces the number of context coded bins for syntax elements such as motion vectors and coefficient levels. The underlying cause of this reduction is the relative proportion of context coded bins and bypass coded bins. While H.264/AVC uses a large number of context coded bins for syntax elements, HEVC uses context coding only for the first few bins, and the remaining bins are bypass coded. Table 3 summarizes the reduction in context coded bins for various syntax elements.

Table 3. Reduction in the number of context coded bins [9]

Syntax element                        AVC    HEVC
Motion vector difference              9      2
Coefficient level                     14     1 or 2
Reference index                       31     2
Delta QP                              53     5
Remainder of intra prediction mode    3      0

b) Grouping of bypass bins

Once the number of context coded bins is reduced, bypass bins occupy a significant portion of the total bins in HEVC. Therefore, the overall CABAC throughput can be notably improved by applying a technique called "grouping of bypass bins" [9]. The underlying principle is to process multiple bypass bins per cycle. Multiple bypass bins can only be processed in the same cycle if they appear consecutively in the bin stream [7]. Thus, long runs of bypass bins result in higher throughput than frequent switching between bypass and context coded bins. Table 4 summarizes the syntax elements where bypass grouping is used.

Table 4. Syntax elements for grouping of bypass bins [9]

Syntax element                        Nbr of SEs
Motion vector difference              2
Coefficient level                     16
Coefficient sign                      16
Remainder of intra prediction mode    4
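Why consecutive bypass bins can be folded into one operation is easy to see in a small model (our own illustration, not a specific paper's design): bypass coding simply doubles the interval per bin, so a run of n bypass bins collapses into one shift-and-add, which is exactly what multi-bypass-bin hardware exploits:

```python
def bypass_one(low, rng, bit):
    """Encode a single bypass bin (renormalization/output omitted):
    the interval doubles and the LPS half is added when bit == 1."""
    return (low << 1) + bit * rng

def bypass_group(low, rng, bits):
    """Encode a whole run of bypass bins in one step: algebraically
    identical to folding bypass_one over the run."""
    v = 0
    for b in bits:
        v = (v << 1) | b        # the run read as one binary number
    return (low << len(bits)) + v * rng

# One-step group encoding matches four single-bin steps:
low, rng = 0x123, 0x1AB
step = low
for b in [1, 0, 1, 1]:
    step = bypass_one(step, rng, b)
assert step == bypass_group(low, rng, [1, 0, 1, 1])
```

The equivalence holds because each doubling shifts the earlier contributions left, so the per-bin additions accumulate into `v * rng`.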
c) Grouping bins with the same context

Processing multiple context coded bins in the same cycle is another method to improve CABAC throughput. This often requires speculative calculations for context selection. The amount of speculative computation, which causes critical path delay, increases if bins using different contexts and context selection logic are interleaved. Thus, to reduce speculative computation, and hence critical path delay, bins should be reordered such that bins with the same contexts and context selection logic are grouped together, so that they are likely to be processed in the same cycle [4, 8, 9]. This also reduces context switching, resulting in fewer memory accesses, which increases throughput and reduces power consumption.

d) Reducing the total number of bins

The throughput of CABAC can be enhanced by reducing its workload, i.e., decreasing the total number of bins that it needs to process. For this technique, the total number of bins was reduced by modifying the binarization algorithm of coefficient levels. Coefficient levels account for a significant portion, on average 15 to 25%, of the total number of bins [18]. In the binarization process, unlike the combined TrU + EGk of AVC, HEVC uses a combined TrU + FL, which produces a much smaller number of output bins, especially for coefficient values above 12. As a result, the total number of bins in HEVC was reduced by 1.5x on average compared to AVC [18].

The Binary Arithmetic Encoder is considered the main cause of the throughput bottleneck, as it contains several loops due to data dependencies and critical path delays. Fortunately, by analyzing and exploiting statistical features and the serial relations between the BAE and the other CABAC components to alleviate these dependencies and delays, the throughput performance can be substantially improved [4]. This was the result of a series of modifications in BAE architectures and hardware implementations, such as parallel multiple BAEs, pipelined BAE architectures, multiple-bin single BAE cores and high-speed BAE cores [19].

The objective of these solutions is to increase the product of the number of processed bins per clock cycle and the clock speed. In hardware designs for high-performance purposes, these two criteria (bins/clock and clock speed) have to be traded off for each specific circumstance, as depicted in the example of Figure 16.

Figure 16. Relationship between throughput, clock frequency and bins/cycle [19].

Over the past five-year period, there has been a significant effort from various research groups worldwide focusing on hardware solutions to improve the throughput performance of HEVC codecs in general, and of CABAC in particular. Table 5 and Figure 18 show highlighted works in CABAC hardware design for high performance.

Throughput performance and hardware design cost are the two main design criteria in the above works. Obviously, they are contrary and have to be traded off during design for specific applications. The chart shows that some works achieved high throughput with large area cost [14, 19] and vice versa [11-13]. Some others [20-22] achieved very high throughput but consumed moderate, or even low, area. This does not conflict with the above conclusion, because these works only focused on BAE design, thus consuming less area than those addressing the whole CABAC implementation. These designs usually achieve significant throughput improvements because the BAE is the main throughput bottleneck in the
CABAC algorithm and architecture. Therefore, The key techniques and strategies that
its improvement has huge effects on the exploited in this work are based on analyzing
overall design (Figure 17). statistics and characteristics of residual Syntax
Peng et al. [11] proposed a CABAC hardware Elements (SE). These residual data bins occupy a
architecture, as shown in Figure 19 which not significant portion in total bins of CABAC, thus
only supports high throughput by a parallel an efficient coding method of this type of SE will
strategy but also reduce hardware cost. contribute to the whole CABAC implementation.
j
Table 5. State-of-the-art high-performance CABAC implementations

Work (year)    Bins/clk  Max freq. (MHz)  Max throughput (Mbins/s)  Tech. (nm)  Area (kGates)  Design strategy
[11] (2013)    1.18      357              439                       130         48.94          Parallel CM (whole CABAC design)
[12] (2015)    2.37      380              900                       130         31.1           Area-efficient multi-bin binarizer and parallel BAE
[13] (2015)    1         158              158                       180         45.1           Fully pipelined CABAC
[23] (2017)    3.99      625              2499                      65          11.2           Combined parallel/pipelined BAE
[21] (2018)    4.94      537              2653                      65          33             8-stage pipelined multi-bin BAE
[20] (2016)    4.07      436.7            1777                      40          20.39          High-speed multi-bin BAE
[22] (2016)    1.99      1110             2219                      65          5.68           4-stage pipelined BAE architecture
[19] (2014)    4.37      420              1836                      90          111            Combined parallel/pipelined BAE and CM
[14] (2013)    4.4       402              1769                      65          148            High-speed multi-bin pipelined CABAC
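As a quick consistency check on Table 5, the maximum-throughput column is simply bins-per-clock multiplied by the maximum frequency. For most rows the product matches the reported Mbins/s to within rounding; a few rows (e.g. [11], [23]) deviate somewhat more, presumably because the published bins/clock figures are themselves rounded. The snippet below recomputes four rows, with the numbers copied from the table:

```python
# Table 5 sanity check: max throughput (Mbins/s) = bins/clk x max
# frequency (MHz). Values copied from the table above.
rows = [  # (work, bins/clk, max freq MHz, reported Mbins/s)
    ("[12]", 2.37, 380, 900),
    ("[21]", 4.94, 537, 2653),
    ("[19]", 4.37, 420, 1836),
    ("[14]", 4.40, 402, 1769),
]
for work, bpc, freq, reported in rows:
    print(work, bpc * freq, reported)
```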
The authors propose a method of rearranging the SE structure, context selection and binarization to support the parallel architecture and hardware reduction. Firstly, the SEs representing residual data [6] (last_significant_coeff_x, last_significant_coeff_y, coeff_abs_level_greater1_flag, coeff_abs_level_greater2_flag, coeff_abs_level_remaining and coeff_sign_flag) in a coded sub-block are grouped by their types, as they have independent context selection. Then the context-coded and bypass-coded bins are separated.

Figure 18. High performance CABAC hardware implementations.

The rearranged structure of SEs for residual data is depicted in Figure 20. This proposed technique allows context selections to be parallelized, improving context-selection throughput by 1.3x on average. Because the bypass-coded bins are grouped together, they are encoded in parallel, which contributes to throughput improvement as well. A PISO (Parallel In Serial Out) buffer is inserted in the CABAC architecture to harmonize the processing-speed differences between the CABAC sub-modules.

Figure 19. CABAC encoder with proposed parallel CM [11].
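The role of such a PISO buffer can be captured in a minimal behavioral model: the binarizer writes groups of bins in parallel, and the serial BAE drains one bin per cycle. The depth and stall policy below are assumptions for illustration, not details taken from [11].

```python
# Minimal model of a PISO (Parallel In Serial Out) buffer that decouples
# a multi-bin binarizer from a serial BAE. Depth/stall policy assumed.
from collections import deque

class PISO:
    def __init__(self, depth):
        self.buf = deque()
        self.depth = depth

    def push(self, bins):
        """Parallel write from the binarizer; caller stalls on overflow."""
        if len(self.buf) + len(bins) > self.depth:
            raise OverflowError("PISO full: binarizer must stall")
        self.buf.extend(bins)

    def pop(self):
        """Serial read: one bin per BAE clock cycle (None when empty)."""
        return self.buf.popleft() if self.buf else None
```

The buffer absorbs bursts from the parallel front end, so the binarizer only stalls when the BAE falls far enough behind to fill the queue.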
Figure 20. Syntax structure of residual data in CABAC encoder [11].

For hardware cost reduction, the design of a context-based adaptive CALR binarization hardware architecture can save hardware resources while maintaining throughput performance. The bin length is adaptively updated in accordance with the cRice parameter (cRiceParam). The hardware solution for the CALR binarization process applied in the CABAC design is shown in Figure 21.

Figure 21. Adaptive binarization implementation of CALR [11].

D. Zhou et al. [19] focus on the design of an ultra-high-throughput VLSI CABAC encoder that supports UHDTV applications. By analyzing the CABAC algorithms and data statistics, the authors propose and implement in hardware a series of throughput-improvement techniques (pre-normalization, Hybrid Path Coverage, look-ahead rLPS, bypass bin splitting and State Dual Transition). The objectives of these techniques and design solutions are both critical-path delay reduction and an increase in the number of bins processed per clock cycle, thus improving CABAC throughput. To support multi-bin BAE, they proposed a cascaded 4-bin BAE, as shown in Figure 22.

Figure 22. Proposed hardware architecture of cascaded 4-bin BAE [19].

In Figure 22, because of the bin-to-bin dependency and the critical delay in stage 2 of the BAE process, the cascaded architecture further expands this delay, which degrades the clock speed and hence the throughput performance. Two techniques (pre-norm and HPC) are applied to solve this issue: pre-norm shortens the critical delay of stage 2, and HPC reduces the cascaded 4-bin processing time. The pre-norm implementation in Figure 23(a) shows the original stage 2 of the BAE architecture, while in Figure 23(b) the normalization is moved from stage 2 to stage 1, which incurs much less processing delay.

To further support the cascaded 4-bin BAE architecture, they proposed LH-rLPS to alleviate the critical delay of range updating in this multi-bin architecture. The conventional architecture is illustrated in Figure 24, where the cascaded 2-bin range updating undergoes two LUTs.
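The serial-LUT problem and the look-ahead fix can be modeled in a few lines. This is an illustrative sketch, not the circuit of [19]: the table contents are placeholders, and a real design prefetches the second-stage candidates in parallel hardware rather than in a list comprehension.

```python
# Behavioral model of look-ahead rLPS (LH-rLPS) for 2 cascaded bins.
# Conventionally the 2nd table lookup must wait for the 1st range
# update; look-ahead fetches all candidate 2nd-stage rLPS values up
# front, so the 1st bin's result only drives a 4-to-1 select.
# Placeholder table, not the HEVC rLPS table.
RLPS = [[(16 + s) * (q + 1) for q in range(4)] for s in range(64)]

def rlps(state, rng):
    return RLPS[state][(rng >> 6) & 3]

def two_bin_serial(rng, s1, s2):
    r1 = rng - rlps(s1, rng)       # 1st LUT
    return r1 - rlps(s2, r1)       # 2nd LUT waits for r1 (critical path)

def two_bin_lookahead(rng, s1, s2):
    r1 = rng - rlps(s1, rng)
    cand = [RLPS[s2][q] for q in range(4)]   # prefetched candidates
    return r1 - cand[(r1 >> 6) & 3]          # r1 only selects one
```

Both paths compute the same final range; the point of the transformation is that a multiplexer select is much faster in hardware than a second full table lookup on the critical path.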
A BPBS scheme is proposed to split the bypass bins from the bin stream; together with their relative positions, they are forwarded through a dedicated channel, a PIPO (Parallel In Parallel Out) storage, until being re-merged into the bin stream before the low-update stage, as shown in Figure 27. This will result in improved BAE throughput, as it becomes possible to …
In the proposed CABAC hardware architecture, the binarization module is designed as shown in Figure 31, where multiple SEs can be processed in parallel for the following pipeline stage of the whole architecture.
… working at high speed. This pipelined architecture is also applied in the two-bin BAE architecture to improve throughput performance (Figures 35, 33).
In the work by B. Vizzotto et al. [12], for the purpose of encoding UHD video contents, an area-efficient and high-throughput CABAC encoder architecture is presented. To achieve the desired objectives, two design strategies are proposed to modify the CABAC hardware architecture. Firstly, a parallel binarization architecture is designed to meet the requirement of high throughput, as depicted in Figure 38. The proposed architecture supports encoding multiple SEs for a multi-bin CABAC architecture. However, instead of a parallel binarizer for each format, which would consume large hardware area, a heterogeneous eight-functional-core binarizer is presented to save area cost. This heterogeneous architecture consists of eight cores that can process up to 6 SEs per clock cycle due to the duplication of the Custom, TU and EGk cores in the design (Figure 5).

Figure 38. Parallel binarization architecture [12].

Figure 39. Renormalization architecture RenormE in BAE [12].

The second solution focuses on speeding up the renormalization process of the BAE. Based on the Leading Zero Detector (LDZ) proposal, …
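The LDZ idea can be sketched in software. The reference renormalization (RenormE in the HEVC specification) shifts ivlCurrRange one bit per loop iteration until it returns to [256, 511], whereas a leading-zero detector yields the whole shift amount in a single step. The snippet below is a behavioral model, not the RenormE datapath of [12]; carry propagation and bit output on ivlLow are omitted.

```python
# Renormalization of a 9-bit range value in [1, 511]: iterative
# (reference-style) vs. single-shot using a leading-zero count.

def renorm_iterative(rng):
    shift = 0
    while rng < 256:      # one shift per iteration, as in RenormE
        rng <<= 1
        shift += 1
    return rng, shift

def renorm_ldz(rng):
    # leading zeros above bit 8 give the full shift amount at once
    shift = 9 - rng.bit_length()
    return rng << shift, shift
```

Replacing the data-dependent loop with one shift of a computed width is what lets the renormalization fit into a single pipeline stage.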
Once the bypass bins are separated from the regular bins to avoid the critical processing in the range update, it becomes possible to process multiple bypass bins in the following stage, i.e. low updating, for throughput improvement. To utilize MBBP in the 8-stage pipeline BAE core without degrading the operating frequency, it is necessary to separate and pipeline the low-update process for bypass bins, as in the proposed architecture shown in Figure 40. In this hardware implementation, the low-update algorithm for bypass bins is separated into two sub-stages realizing two multiply operations, which balances the processing delay against the other stages of the pipeline. In addition, these multiply operations are replaced by combinations of adders, shifters and multiplexers to reduce critical delay and area cost.

In this 8-stage pipeline BAE architecture, all the advanced throughput-improvement techniques (PN rLPS, LH rLPS and BPBS) are integrated. The first stage is rLPS pre-selection using the PN rLPS technique. The second is regular-bin range updating that applies LH rLPS in a seven-core scheme, followed by PIPO write/read of the regular buffer before re-merging with the bypass stream at the fourth stage. As mentioned, the next two stages cater for the low-updating process with a five-core scheme. The seventh stage is a five-core OB (Output Bits) stage, separated from low updating to reduce processing delay. The last stage performs final bit generation.

3.2. Low power design strategies

The application of low-power techniques (clock gating, power gating, DVFS) to the design of HEVC hardware architectures for low power consumption is quite diverse and complicated, …
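Returning to the multi-bypass-bin processing discussed above: it rests on a simple identity. Bypass bins split the interval evenly, so n serial low updates collapse into one shift plus one multiply. The sketch below is a behavioral model only; carry handling and bit output of a real BAE are omitted.

```python
# MBBP identity: serially, low = 2*low (+ range if bin == 1) per bypass
# bin; for an n-bit value v this collapses to low = (low << n) + v*range.

def bypass_serial(low, rng, bits):
    for b in bits:
        low = (low << 1) + (rng if b else 0)   # one bin per step
    return low

def bypass_multi(low, rng, bits):
    v = int("".join(map(str, bits)), 2) if bits else 0
    return (low << len(bits)) + v * rng        # all n bins at once
```

This is why the pipelined low-update stages above need only two multiplies (later reduced to adder/shifter/multiplexer networks) to absorb a whole group of bypass bins per cycle.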
Figure 43. Four-core parallel binarization architecture [24].

For high-throughput requirements, a four-Binarization-Core (BC) architecture is proposed (Figure 43), in which the hardware architecture of each BC is shown in Figure 42. Based on the statistical analysis presented above, an AND-based operand isolation technique is inserted into each BC for low-power purposes, as shown in Figure 44. Except for the FL format, which occupies a significant portion of the binarization workload, the proposed low-power technique is embedded into the inputs of all other binarization processes. The less frequently used binarization formats are then deactivated and isolated. As a result, the power consumption of the binarizer is reduced by 20% on average.

Figure 44. AND-based operand isolation for low-power binarization architecture [24].

The BAE is another sub-module of CABAC to which low-power techniques can be applied for energy saving in hardware design. Unlike the binarizer, which exploits the statistical analysis of SE types to derive appropriate low-power solutions, the BAE works with bin types (regular, bypass and MPS/LPS). Using the same method, the statistical analysis concluded that regular bins occur more frequently than bypass ones, and that MPS regular bins tend to occur in longer bursts than LPS regular bins. Therefore, when bypass bins and regular bins are separated by the BPBS technique and the bypass bins are grouped together, it is possible to turn off the bypass-bin processing path while long bursts of regular bins are processed, and vice versa. In the regular-bin processing path, it is also possible, as a proposed power saving, to turn off the LPS-bin processing part of the BAE architecture. These power-saving solutions in the BAE architecture can be realized by exploiting the clock gating technique.

Figure 45. Clock Gating for low-power BAE architecture [26].

Ramos et al. [26] proposed a novel Multiple Bypass Bins Processing (MBBP) low-power BAE architecture that supports 8K UHD video. They proposed a multiple-bin architecture that can process several bins per clock, which makes it possible to minimize the clock frequency while still satisfying 8K UHD video applications: the lower the applied frequency, the lower the power consumption. Moreover, this is also a four-stage pipeline BAE, in which the clock gating technique can be applied to the pipeline stage registers based on the data path for each kind of bin, as stated above. These clock-gated pipeline registers contribute to energy saving through an appropriate control mechanism. As a result, the BAE architecture shown in Figure 45 is capable of processing 8K UHD video sequences at the minimum clock frequency, which leads to a power saving of 14.26%.

4. New research trends

Since the compression efficiency of HEVC is almost double that of H.264/AVC, HEVC is considered the promising standard for today's video applications. However, the improvement in compression efficiency comes at the cost of increased computational complexity, large processing delays, and higher resource and energy consumption compared to H.264/AVC. These fundamental problems significantly affect the realization of the HEVC standard in today's video applications. The emerging challenges lie in exploiting the HEVC standard in widespread video applications, where real-time, high-quality UHD video is transmitted over limited-bandwidth wireless media (broadcast TV, mobile networks, satellite communication and TV) and network delivery [27]. In these video services, it is necessary to transmit a large amount of real-time video data over a bandwidth-limited transmission medium with unstable channel quality, which affects video quality. In addition, most of the terminal devices (mobile phones, tablets, camcorders…) in these video transmission systems are resource-constrained in terms of storage, computation and processing capacity, energy (battery) and network bandwidth, which is a further hindrance to the realization of the HEVC standard [28, 29].

Recently, both the academic and industrial sectors have focused on exploring solutions to overcome the above challenges, and it can be predicted that these research trends will continue as long as the HEVC standard has not been fully adopted into modern video services [30]. It is obvious that these challenges are posed to transmission media operators and video service providers as well as terminal holders. Thus, a complete solution is accomplished by contributions from all of the above, and research trends are categorized accordingly. For the objective of utilizing the HEVC standard in video transmission systems, the ongoing research trends can be divided into the following themes:

- Optimizing algorithms and hardware architectures for HEVC codecs to meet the demands of resource-constrained devices.
- Developing encoding schemes that adapt to network conditions when applied to live broadcasting and real-time streaming over wireless networks.
- Proposing reconfigurable video coding (RVC) architectures that are able to effectively adopt the HEVC standard into existing systems and infrastructures, where previous standards have already been integrated.

The first theme is the ongoing conventional research direction and can be considered the underlying foundation on which the others are based. Mobile video applications have started to dominate global mobile data traffic in recent years. In most mobile video communication systems, users are equipped with mobile devices that are typically resource-constrained in terms of storage, computation and processing capacity, energy (battery) and network bandwidth [28, 29]. This issue has already been addressed and has drawn tremendous research since the first version of the HEVC standard. As mentioned in previous sections, to support resource-constrained applications, most of the components of the HEVC architecture have been assessed and amended to enhance performance. However, such improvement is always in high demand to better support future video applications. Applying convolutional neural networks and machine learning to propose adaptation algorithms, e.g., QP adaptation, could be a potential method for this research direction [31, 32]. Additionally, more flexible and adaptive encoding schemes for integrating HEVC into existing infrastructure at
any transmitting-media condition are also promising directions. Thus, there should be adaptive algorithms to estimate the HEVC encoder model parameters and perform online optimization of the encoder coding configuration [33].

For the second research direction, the advance in coding efficiency of the HEVC standard makes it a candidate for limited-bandwidth video communication systems, particularly over wireless media such as TV broadcasting, satellite communication and mobile networks. In these wireless media, the quality of service depends not only on the channel bandwidth but also heavily on variations in environmental conditions [29]. The challenge arises in the application scenarios of live broadcasting and real-time streaming over wireless networks. These services impose very high rates and large amounts of data traversing the networks. The feasible solution to this challenge is to dynamically adapt encoding schemes to network conditions for a better trade-off between quality of service and efficiency in exploiting encoding and network capabilities.

Rate control has always been a potential research area supporting dynamically adaptive encoding solutions in wireless live video services. However, there has been insufficient research on rate control for HEVC/H.265, and the algorithmic and computational complexities of HEVC rate-control schemes are higher than those of previous standards. This obstacle has been hindering the adoption of HEVC/H.265 for real-time streaming over mobile wireless networks. Therefore, it is necessary to develop low-complexity and highly efficient rate-control schemes for H.265/HEVC to improve its network adaptability and enable its application to various mobile wireless streaming scenarios [34]. Besides rate-control methods, content-aware segment-length optimization (at the GOP level) and tile-based encoding schemes allow effective deployment of HEVC in MPEG DASH (Dynamic Adaptive Streaming over HTTP) [35]. This type of HEVC application is an emerging research area because of the increasingly high traffic of high-quality live video over the Internet. However, there have been few related studies on this theme of HEVC for MPEG DASH applications.

The last theme of future research trends in HEVC can be predicted as follows. There are numerous video coding standards that are mutually incompatible (in syntax and encoded data stream). Nevertheless, most of the coding tools supported by these standards are the same; the sole difference is the input parameters used for each of these tools in a specific standard. Hence, there may be a change of paradigm in the development of future codecs, named RVC. This new design paradigm allows implementing a set of common video coding functional blocks; then, depending on the requirements of a given standard, appropriate parameters are chosen for each block. Consequently, the encoded bit-stream will include descriptive information about these blocks for decoding [30].

5. Conclusion

In this survey, an overview of the latest video coding standard, HEVC/H.265, is given, and its advancements, especially the detailed developments of CABAC, are discussed and summarized. The doubled coding efficiency of HEVC compared with its predecessor is the result of a series of modifications in most components and leads to a prospective integration of the standard into modern video applications. However, significant increases in computational requirements, processing delay and, consequently, energy consumption have hindered this progression. A literature review of hardware architectures and implementation strategies for highly efficient CABAC targeting high-quality, UHD-resolution, real-time applications is provided. Challenges in the design and application of HEVC will always exist, as video applications diversify with human demand for a progressively better visual experience. Thus, the survey also addresses the challenges of utilizing HEVC in different video application areas and predicts several future research trends.