# A 36-Gb/s 1.3-mW/Gb/s Duobinary-Signal Transmitter Exploiting Power-Efficient Cross-Quadrature Clocking Multiplexers With Maximized Timing Margin

Yong Chen, Member, IEEE, Pui-In Mak, Senior Member, IEEE, Chirn Chye Boon, Senior Member, IEEE, and Rui P. Martins, Fellow, IEEE

Abstract—For wireline transmitters delivering a high-speed multi-level signal, such as pulse-amplitude-modulation-4 or duobinary, a high-performance multiplexer (MUX) is critical to serialize the low-speed parallel data into one full-speed output. To enhance the power efficiency and data eye's opening, this paper proposes a universal 2-to-1 MUX, featuring a cross-quadrature clocking technique to enlarge the timing margin, and a simplified three-latch topology without delay buffers to boost the internal bandwidth (BW). The MUX ratios are extendable to 4-to-2 and 4-to-1, and their benefits are exemplified via a duobinary-signal transmitter. It further includes an output driver unifying the MUX-and-SUM operation, a BW-extended single-to-differential converter, and an active-inductor-embedded clock buffer for swing enhancement. Also, a predictive method for estimating the duobinary-signal data-dependent jitter according to the load capacitance of the output driver is developed. Fabricated in 65-nm CMOS, the transmitter exhibits a figure-of-merit of 1.3 mW/Gb/s at 36 Gb/s, while occupying a compact die area of  $0.037 \text{ mm}^2$ .

*Index Terms*—Bandwidth (BW), cross-quadrature clocking, data-dependent jitter (DDJ), duobinary, multilevel signaling, CMOS, multiplexer (MUX), figure-of-merit (FOM), timing margin, latch, D-type flip-flop (DFF), selector.

Y. Chen is with the State-Key Laboratory of Analog and Mixed-Signal VLSI, University of Macau, Macau 999078, China (e-mail: ychen@umac.mo).

P.-I. Mak is with the State-Key Laboratory of Analog and Mixed-Signal VLSI, University of Macau, Macau 999078, China, and also with the Faculty of Science and Technology, Department of ECE, University of Macau, Macau 999078, China (e-mail: pimak@umac.mo).

C. C. Boon is with Nanyang Technological University, Singapore 639798. R. P. Martins is with the State-Key Laboratory of Analog and Mixed-Signal VLSI, University of Macau, Macau 999078, China, and also with the Department of ECE, Faculty of Science and Technology, University of Macau, Macau 999078, China, on leave from the Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisboa, Portugal (e-mail: rmartins@umac.mo).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2018.2829725

## I. INTRODUCTION

HIGH-SPEED wireline transmitters rely on their multiplexers (MUXs) to serialize the low-speed parallel data from the sub-branches into one full-rate data stream at the output. As multi-level signals like pulse-amplitude-modulation-4 (PAM4) [1]–[4] and duobinary [5]–[12] emerge as more bandwidth (BW)-efficient alternatives to the conventional non-return-to-zero (NRZ) [13], [14], both 4-to-2 and 4-to-1 MUXs become the critical blocks during the serialization of high-speed data, to squeeze the power consumption while securing a wide opening of the data eye. In fact, the timing margin of each multiplexing step should be maximized, regardless of whether the steps operate at the same or at different clock frequencies along the data-serialization path. Maximizing the timing margin of high-speed MUXs remains an important topic for wireline systems [15], [16].

This paper introduces a universal 2-to-1 MUX that features a cross-quadrature clocking technique to maximize the timing margin, and a simplified three-latch topology without delay buffers to improve the internal BW. The techniques are then extended to the design of 4-to-2 and 4-to-1 MUXs for developing a duobinary-signal transmitter. To further enhance the performance, an output driver unifying the MUX-and-SUM operation is proposed. As the BW of a transmitter's data path is mainly limited at the output stage, the data-dependent jitter (DDJ) caused there by the inter-symbol interference (ISI) effect dominates the total data timing jitter, whether for NRZ, duobinary or PAM4. Herein, a predictive method for estimating the duobinary-signal DDJ according to the load capacitance is developed. For the clock path, a BW-extended single-to-differential converter (S2D) and an active-inductor-embedded clock buffer are introduced. The transmitter measures a figure-of-merit (FOM) of 1.3 mW/Gb/s at 36 Gb/s, which compares favorably with the state-of-the-art.

Section II describes the existing and proposed 2-to-1 MUXs. The technique is extendable to 4-to-2/1 MUXs as discussed in Section III. Section IV focuses on the output driver and its related duobinary-signal DDJ analysis. Section V details the simulation and experimental results of the entire duobinary-signal transmitter. The conclusions are drawn in Section VI.

1549-8328 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.

Manuscript received November 9, 2017; revised March 12, 2018; accepted April 21, 2018. Date of publication June 7, 2018; date of current version August 3, 2018. This work was supported in part by the University of Macau under Grant MYRG2017-00167-AMSV and in part by VIRTUS, Nanyang Technological University, Singapore. This paper was recommended by Associate Editor I. F. Chen. (*Corresponding author: Yong Chen.*)

Fig. 1. A typical 2-to-1 MUX. (a) The timing adjuster is to generate a timing delay  $(t_{dd})$  in the upper branch; (b) timing chart of (a) without a timing adjuster; (c) timing parameters; (d) theoretical timing margin.

## II. EXISTING AND PROPOSED 2-TO-1 MUXs

This section describes the background of 2-to-1 MUX design, and discusses the design complexity, timing margin and current consumption of the existing and proposed 2-to-1 MUXs.

### A. Background

The basic function of a 2-to-1 MUX [Fig. 1(a)] is to route two independent random inputs ($D_1$ and $D_2$), each at a half rate of f/2-(bit/s), into one serialized output ($D_{out}$) at a full rate of f-(bit/s), according to the half-rate [f/2-(Hz)] control signal ($CK_s$). Before the level-sensitive selector, a timing adjuster is required to generate a timing delay ($t_{dd}$) between the two inputs such that there is adequate timing margin for both the left edge of $D_1$ and the right edge of $D_2$ [Fig. 1(b)], reducing the data jitter of the serialized data eye. To study the timing margin, a conceptual model is built [Fig. 1(c)] in which the timing margin ($t_{mx}$) is the time slot between the original data edge and the selected data edge. When $CK_s$ is low and holds for half a unit interval (UI), the selector routes $b_1$ of $D_2$ into $D_{out}$ with two timing margins, $t_{m1}$ and $t_{m2}$, which are given by $t_{cd}$ and $0.5 - t_{cd}$ (in UI), respectively. With $D_2$ fixed as the reference, one can sweep the clock-to-data delay ($t_{cd}$) from 0 to 0.5UI [Fig. 1(d)]. At $t_{cd} = 0$, $t_{m1}$ and $t_{m2}$ are equal to 0 and 0.5UI, respectively. As the clock delay moves toward 0.25UI, $t_{m1}$ increases and $t_{m2}$ decreases linearly, and both converge to 0.25UI at $t_{cd} = 0.25$UI. Within the range of 0.25 to 0.5UI, the two timing margins diverge and reach the other extreme ($t_{m1} = 0.5$UI and $t_{m2} = 0$). As a result, the maximum timing margin (0.25UI) is achieved when the low level of $CK_s$ is centered on the data.

When $CK_s$ is high, $a_1$ of $D_{1d}$ (the delayed version of $D_1$) is sent to $D_{out}$ with two timing margins ($t_{m3}$, $t_{m4}$). They are derived as $0.5 + t_{cd} - t_{dd}$ and $t_{dd} - t_{cd}$, respectively. We can initially preset $t_{dd}$, and shift $t_{cd}$ to search for the simultaneous maximum of the four timing margins ($t_{m1-4}$) [Fig. 1(d)]. When $t_{dd}$ increases from 0 to 1UI, the simultaneous maximum steadily climbs to a peak of 0.25UI at $t_{dd} = 0.5$UI (with $t_{cd} = 0.25$UI) and declines to 0 afterwards. Hence, the theoretical maximum timing margins appear when the two-way inputs are staggered by a 0.5UI data delay and the clock delay is 0.25UI.
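The search for the simultaneous maximum can be reproduced with a short numerical sweep. The sketch below is only an illustration of the four margin expressions given in the text; the function names and the sweep resolution are assumptions, not part of the original design flow.

```python
def margins(t_cd, t_dd):
    """Return (t_m1, t_m2, t_m3, t_m4) in UI for the conceptual 2-to-1 MUX model."""
    return (t_cd, 0.5 - t_cd, 0.5 + t_cd - t_dd, t_dd - t_cd)


def best_margin(t_dd, steps=200):
    """Sweep t_cd over [0, 0.5] UI; return (worst-case margin, t_cd) at the optimum."""
    best_worst, best_t_cd = float("-inf"), 0.0
    for i in range(steps + 1):
        t_cd = 0.5 * i / steps
        worst = min(margins(t_cd, t_dd))  # all four margins must hold at once
        if worst > best_worst:
            best_worst, best_t_cd = worst, t_cd
    return best_worst, best_t_cd


if __name__ == "__main__":
    # Preset t_dd from 0 to 1 UI and report the best achievable margin.
    for k in range(11):
        t_dd = k / 10
        worst, t_cd = best_margin(t_dd)
        print(f"t_dd = {t_dd:.1f} UI: margin = {worst:.3f} UI at t_cd = {t_cd:.3f} UI")
```

Under these expressions the sweep peaks at a data delay of 0.5UI with a clock delay of 0.25UI, and falls back to zero at both ends of the range.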

### B. Existing Five-Latch 2-to-1 MUX

The five-latch 2-to-1 MUX of [1] and [15]–[17] is shown in Fig. 2(a). The timing adjuster in front of the selector (SEL<sub>1</sub>) consists of an array of a master-slave D flip-flop (DFF) and a D-type latch in one branch, and a standalone master-slave DFF in the other. Their differential clocks of f/2-(Hz) are generated from a divide-by-2 (DIV2) circuit with clock buffers. Regardless of whether the falling edge of the clock  $(CK_s)$  is aligned with the data center [Fig. 2(b)], the D-type latch generates a 0.5UI phase delay (i.e.,  $t_{dd} = 0.5$UI) between Ad and Bd. The two-way f/2-(bit/s) inputs are simultaneously delayed by 0.5UI in Fig. 2(b). Both Ad and Bd are selected in the middle of the 1UI data slot by the 0.25UI-delayed clock (CK<sub>s</sub>), which can be generated by a fixed or tunable delay buffer, to construct the f-(bit/s) serialized output data. Hence, the theoretical maximum timing margins are realized between the data and clock. In fact, both the five-latch [18] and six-latch [19] 2-to-1 MUXs employing a multi-phase clock architecture target a maximum timing margin, but the delay buffer is still essential for timing control.

### C. Existing Three-Latch 2-to-1 MUX

The three-latch 2-to-1 MUX [Fig. 2(c)] eliminates one DFF and reduces the capacitive load of the clock buffer [20]. Yet, it still entails a phase delay of 0.5UI between Ad and Bd.

TABLE I

Comparison of Different Implementations of 2-to-1 MUXs

|                                   | No. of Selector | Latch Clock / Selector Clock    | No. of Delay Latch | Delay Buffer | Clock Buffer | Phase Margin     | Total Power |
|-----------------------------------|-----------------|---------------------------------|--------------------|--------------|--------------|------------------|-------------|
| Five-latch 2-to-1 MUX             | 1×I             | CK<sub>d</sub> / CK<sub>s</sub> | 5×I                | 0.5UI / 1×I  | 6×I          | Maximum (0.25UI) | 12×I        |
| Three-latch 2-to-1 MUX            | 1×I             | CK<sub>d</sub> / CK<sub>s</sub> | 3×I                | 0.5UI / 1×I  | 4×I          | Maximum (0.25UI) | 9×I         |
| Three-latch 2-to-1 MUX (Proposed) | 1×I             | IP / QP                         | 3×I                | No           | 4×I          | Maximum (0.25UI) | 8×I         |



Fig. 2. Timing margin of the 2-to-1 MUX under different implementations: (a) five-latch topology and (b) its timing chart when the clock falling edge aligns with the data center; (c) three-latch topology and (d) its timing chart when the clock falling edge aligns with the data center; (e) proposed three-latch topology and (f) its timing chart when the clock falling edge aligns with the data center.

The phase of Ad leads that of Bd by 0.5UI, regardless of whether the falling edge of the clock  $(CK_d)$  is located at the data center [Fig. 2(d)]. This 2-to-1 MUX still requires a delay buffer to maximize the timing margin, similar to the five-latch 2-to-1 MUX.

### D. Proposed Three-Latch 2-to-1 MUX

The proposed three-latch 2-to-1 MUX [Fig. 2(e)] employs a cross-quadrature clocking technique to maximize the timing margin while eliminating the delay buffers. The quadrature clocks (IP, QP) in the half-rate [f/2-(Hz)] domain are driven by the clock buffer, and are intrinsically generated by a DIV2 in the full-rate [f-(Hz)] domain. The DFF + latch array is controlled by IP, aiming at a timing delay of 0.5UI between Ad and Bd with the data rate of f/2-(bit/s). Ad and Bd are selected by a selector (SEL<sub>1</sub>) using the output QP directly. As a result,  $t_{dd} = 0.5$UI and  $t_{cd} = 0.25$UI can be satisfied simultaneously to maximize the timing margin [Fig. 2(f)]. In fact, such a cross-quadrature clocking technique is also applicable to the five-latch 2-to-1 MUX.

Compared with the two retiming functions in the five-latch 2-to-1 MUX [Fig. 2(a)], the three-latch 2-to-1 MUXs in Fig. 2(c) and (e) avoid an additional DFF for the alignment. If we design the lower level of serialization (e.g., starting from 64-to-1 or 128-to-1), the five-latch-based MUX can be utilized at the previous stage for the alignment, while consuming low power. To further reduce the power, we can use the three-latch 2-to-1 MUX to implement the adjacent two-stage serialization. At a data rate of <20 Gb/s, its compact layout (e.g., <25 × 25  $\mu$m<sup>2</sup>) can minimize the misalignment between A and B.

### E. Power Consumption

The power consumption of the discussed 2-to-1 MUXs is summarized in Table I. The current consumption of a unit differential gate is denoted as  $1 \times I$, under the same supply voltage (e.g., 1 V) in the f/2-(Hz) domain. The five-latch 2-to-1 MUX consumes  $12 \times I$  (i.e.,  $5 \times I$  for the latch + DFF array,  $6 \times I$  for the clock buffer and  $1 \times I$  for the delay buffer). Although the existing three-latch topology already saves power, the delay buffers are still required. The proposed three-latch topology consumes only  $8 \times I$  and eliminates the delay buffer, while preserving a maximum timing margin of 0.25UI.

## III. CONVENTIONAL AND PROPOSED 4-TO-2 AND 4-TO-1 MUXs

With two five-latch 2-to-1 MUXs [Fig. 3(a)], a 4-way quarter-rate data input of f/4-(bit/s) can be serialized into a 2-way data (E, F) of f/2-(bit/s) with the same phase, under the same clock. The pair of E and F goes through a five-latch array triggered by the clock of f/2-(Hz), resulting in two data streams (Ed, Fd) with a phase difference of 0.5UI at f/2-(bit/s). This scheme can be considered as a 4-to-2 MUX with time-interleaved 2-way outputs. Ed and Fd are added in the current mode to directly generate the duobinary signal at f-(bit/s). In addition, the selector (SEL<sub>3</sub>) multiplexes them into an f-(bit/s) NRZ signal at the middle of their time slot by the clock, eventually forming a 4-to-1 MUX [21], [22]. The current consumption of a unit differential gate in the quarter-rate clock domain of f/4-(Hz) is assumed to be  $0.5 \times I$. A 4-to-2 MUX with the five-latch array consumes  $24 \times I$  in total, as shown in Table II. The current consumption can be reduced to  $16 \times I$  by replacing each five-latch array with a three-latch array. This low-power alternative is only feasible if the maximum timing margin at every multiplexing operation is preserved, as shown in the timing charts [Fig. 3(b)].

Fig. 3. Timing-margin maximization of every MUX stage under different implementations: (a) typical 4-to-2/1 MUX and (b) its timing chart when the clock falling edge aligns with the data center; (c) hardware-reduced 4-to-2/1 MUX and (d) its timing chart when the clock falling edge aligns with the data center; (e) hardware-minimized 4-to-2/1 MUX (proposed) and (f) its timing chart when the clock falling edge aligns with the data center.

TABLE II

COMPARISON OF POWER CONSUMPTION OF DIFFERENT 4-TO-2 MUXS

|                                                      | No. of Selector | Latch Clock / Selector Clock                     | No. of Delay Latch | Delay Buffer        | Clock Buffer   | Phase Margin     | Total Power |
|------------------------------------------------------|-----------------|--------------------------------------------------|--------------------|---------------------|----------------|------------------|-------------|
| Five-latch 4-to-2 MUX + Differential Clock           | [0.5×2]×I       | Up-MUX: $CK_d / CK_s$<br>Down-MUX: $CK_d / CK_s$ | [0.5×10+5]         | 0.5UI / [0.5×2]×I   | [0.5×12+6]×I   | Maximum (0.25UI) | 24×I        |
| Three-latch 4-to-2 MUX + Differential Clock          | [0.5×2]×I       | Up-MUX: $CK_d / CK_s$<br>Down-MUX: $CK_d / CK_s$ | [0.5×6+3]          | 0.5UI / [0.5×2]×I   | [0.5×8+4]×I    | Maximum (0.25UI) | 16×I        |
| Three-latch 4-to-2 MUX + Quadrature Clock            | [0.5×2]×I       | Up-MUX: IP / IP<br>Down-MUX: QP / QP             | [0.5×6+0]          | 0.5UI / [0.5×2]×I   | [0.5×8+0]×I    | Maximum (0.25UI) | 9×I         |
| Three-latch 4-to-2 MUX + Quadrature Clock (Proposed) | [0.5×2]×I       | Up-MUX: IP / QP<br>Down-MUX: QP / IP             | [0.5×6+0]          | No                  | [0.5×8+0]×I    | Maximum (0.25UI) | 8×I         |

A conventional time-interleaved 4-to-2/1 MUX [23] is shown in Fig. 3(c), which exploits the intrinsic quadrature feature of the clock. A three-latch 2-to-1 MUX using IP of f/4-(Hz) serializes A and B into E of f/2-(bit/s). Conversely, C and D are routed to F of f/2-(bit/s) using QP of f/4-(Hz). E and F have a phase difference of 0.5UI at f/2-(bit/s), without the DFF + latch array in the f/2-(Hz) domain. They are combined to form a duobinary or NRZ signal at f-(bit/s). Two extra delay buffers are required so that each multiplexing operation attains the maximum timing margin, as shown in Fig. 3(d). The current consumption is reduced to  $9 \times I$.

The proposed time-interleaved 4-to-2/1 MUX [Fig. 3(e)] with cross-quadrature clocking aims to lower current consumption. In the upper branch, a three-latch array and a selector (SEL<sub>1</sub>) use IP of f/4-(Hz) and QP of f/4-(Hz), respectively. Conversely, a three-latch array and a selector (SEL<sub>2</sub>) in the lower branch use QP of f/4-(Hz) and IP of f/4-(Hz), respectively. The serialized data E leads F by a 0.5UI delay at f/2-(bit/s). The generation of the f-(bit/s) duobinary or NRZ signal is the same as that in Fig. 2(e). The maximum timing margin is preserved, as shown in Fig. 3(f), while consuming a lower current of  $8 \times I$  by averting the delay buffer.
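The totals quoted for the four 4-to-2 MUX options can be cross-checked by summing the per-block entries of Table II. The sketch below simply re-adds those entries; the dictionary layout and the function name are illustrative assumptions.

```python
# Per-block currents in units of I, copied from Table II in the order
# (selector, delay latch, delay buffer, clock buffer).
# Gates clocked at f/4 are weighted 0.5 x I; gates clocked at f/2 are weighted 1 x I.
TABLE_II = {
    "five-latch, differential clock": (0.5 * 2, 0.5 * 10 + 5, 0.5 * 2, 0.5 * 12 + 6),
    "three-latch, differential clock": (0.5 * 2, 0.5 * 6 + 3, 0.5 * 2, 0.5 * 8 + 4),
    "three-latch, quadrature clock": (0.5 * 2, 0.5 * 6 + 0, 0.5 * 2, 0.5 * 8 + 0),
    "three-latch, cross-quadrature clock (proposed)": (0.5 * 2, 0.5 * 6 + 0, 0.0, 0.5 * 8 + 0),
}


def total_current(name):
    """Sum the selector, latch, delay-buffer and clock-buffer currents (units of I)."""
    return sum(TABLE_II[name])


if __name__ == "__main__":
    for name in TABLE_II:
        print(f"{name}: {total_current(name):g} x I")
```

The sums agree with the "Total Power" column, and the only difference between the last two rows is the delay-buffer entry removed by cross-quadrature clocking.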

A precise delay or phase aligner [15] is required in the clock path to maximize the timing margin of the final selector (SEL<sub>3</sub>), which converts the 4-to-2 MUX into a 4-to-1 MUX [Fig. 3(a), (c) and (e)].



Fig. 4. Two MUXs as the output driver and their operation details.

## IV. PROPOSED MUX-AND-SUM OUTPUT DRIVER

### A. Architecture

Fig. 5. Data jitter caused by ISI effects at the MUX-and-SUM output stage: (a) simplified analysis model, (b) simulated and calculated data jitter versus the total parasitic capacitance ( $C_L$ ) at 36 Gb/s under  $R_o = 150 \Omega$  and  $C_o = 17.5$  fF, (c) analysis process for the duobinary-signal testing and (d) analysis process for the NRZ-signal testing.

The output driver consists of two 2-to-1 MUXs (Fig. 4) and an adder fully implemented with current-mode logic (CML). As the half-rate serialization and full-rate summation are performed in the current domain simultaneously, no buffer or combiner is required after the MUXs, i.e., the MUXs directly output the duobinary signal. Also, no buffer is placed before the MUXs, which reduces the number of blocks in the signal path and enables a wider internal BW. When the 4-way f/4-(bit/s) PRBS signals pass through the MUXs, the f/4-(Hz) quadrature clocks IP/IN/QP/QN are employed to control the MUXs, such that the outputs are two streams of f/2-(bit/s) signals with a 0.5UI phase difference [1UI at f-(bit/s)]. At the final output, the data from the two streams are combined in a time-interleaved manner, where the present bit of one stream sums with either the previous (leading) bit or the next (lagging) bit of the other stream, resulting in intentional ISI, namely, the duobinary signal. Supposing QP and IN are high, the voltage signals  $a_1$  and  $c_1$  are selected by the respective MUX, which also offers voltage-to-current conversion. The resultant current signals are combined at the output node directly, avoiding the extra current-to-voltage and voltage-to-current conversions that are otherwise required in the traditional topology cascading a MUX and an adder [12]. The combined signal,  $a_1 + c_1$, is only converted back to a voltage at the final output. By averting the BW-demanding voltage-mode stage, signal combining in the current domain relaxes the BW requirement of the system. In the next clock phase, QN and IN are high, indicating that the current flows via the path to MUX\_Q in the upper branch and the path to MUX\_I in the lower branch, and thus the output is  $b_1 + c_1$. Compared with the pass-gate MUX in [12], the CML circuits here provide a sharper transition while mitigating the effect of serialization on the data rate. By merging the MUX and adder in one current-mode stage, the delay due to the clk-to-Q of the MUX is alleviated, and the peak-to-peak voltage degradation resulting from the transmission gate is also averted.
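The time-interleaved summation can be illustrated behaviorally: two half-rate streams that each hold a bit for 2UI and are offset by 1UI add up to the same three-level sequence as a delay-and-add operation on the full-rate bit stream. The sketch below is a bit-level illustration only; the function names and the zero initial state are assumptions, and no circuit behavior (bandwidth, swing) is modeled.

```python
def delay_and_add(bits):
    """Duobinary coding: each output is the present bit plus the previous bit."""
    out, prev = [], 0  # assumed initial state: previous bit = 0
    for b in bits:
        out.append(b + prev)
        prev = b
    return out


def interleaved_sum(bits):
    """Sum of two half-rate streams, each holding a bit for 2 UI, offset by 1 UI."""
    e = f = 0  # E carries even-indexed bits, F carries odd-indexed bits
    out = []
    for n, b in enumerate(bits):
        if n % 2 == 0:
            e = b  # E updates on even UIs and holds through the next UI
        else:
            f = b  # F updates on odd UIs and holds through the next UI
        out.append(e + f)  # summation of the two streams at the output node
    return out


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1]
    print(delay_and_add(bits))
    print(interleaved_sum(bits))
```

Both functions return the same sequence for any input, which is why no separate delay-and-add block is needed once the two streams are staggered by 1UI.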

### B. DDJ Analysis

The output driver of a transmitter has to deal with a large capacitive load  $(C_L)$  due to the ESD and on/off-chip parasitics. The induced ISI effect causes DDJ that dominates the total data timing jitter [24], [25]. This section shows a predictive method to estimate the duobinary-signal DDJ with respect to  $C_{L}$.

To study the duobinary-signal DDJ, a simplified model of the MUX-and-SUM output stage is developed [Fig. 5(a)]. In the data path, the dominant pole is located at the output node ( $V_{outp}$ ), with a time constant  $\tau_{OD}$ . The secondary pole with a time constant  $\tau_{ID}$  is at the driven node Ad of the latch, since the time constants at the latch's and DFF's outputs are comparable. In the clock path, one pole with a time constant  $\tau_{IC}$  is formed by the clock buffer. Considering the small clock jitter at QP [Fig. 10(b)] provided by the off-chip clock, and the improvement of the eye quality [Fig. 11(a) and (b)] to be described in Section V, the impact of the clock path on the DDJ is minor. The performance of the data path dominates the DDJ as presented below.

When  $C_L$  is large, more ISI is induced and the output data jitter is penalized [Fig. 5(b)]. One can use a 1<sup>st</sup>-order model to predict the DDJ, with the impulse response given by

$$h_1(t) = \frac{e^{-t/\tau_{OD}}}{\tau_{OD}} \tag{1}$$

where  $\tau_{OD} = (25//0.5R_O) \times (C_L + 4C_O)$ , in which  $R_O$  and  $C_O$  are the channel turn-on resistance and parasitic capacitance of the driving transistor in the data path, respectively. A combinational pulse signal  $p_{DB}(t)$  [Fig. 5(c), left], as the input test signal, is represented by

$$p_{DB}(t) = u(t) + u(t - T_B) - u(t - 2T_B) - u(t - 3T_B) \tag{2}$$

where u(t) is the unit step function modeling the rising edge, and  $T_B$  is the period of one bit at  $V_{outp}$ . The pulse response  $y_{DB1}(t)$  of the 1<sup>st</sup>-order model can be derived as

$$y_{DB1}(t) = p_{DB}(t) * h_1(t) \tag{3}$$

Let  $T_{1\_DB}$  and  $T_{2\_DB}$  denote the threshold crossing times for the amplitude of 1.5 at the adjacent rising and falling edges [Fig. 5(c), middle], respectively. In (3), we can solve  $T_{1\_DB}$  and  $T_{2\_DB}$  from  $y_{DB1}(T_{1\_DB}) = 1.5$  and  $y_{DB1}(T_B + T_{2\_DB}) = 1.5$, respectively. The closed-form solution of the ISI jitter is written as

$$\Delta T_{DB1} = T_{1\_DB} - T_{2\_DB} = \tau_{OD} \ln \frac{e^{T_B/\tau_{OD}} + 1}{e^{T_B/\tau_{OD}} - e^{-T_B/\tau_{OD}} - 1} \tag{4}$$

By extracting the parasitics from simulations, the ISI jitter [Fig. 5 (b)] can be estimated easily by (4).
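As a sanity check on (4), the closed form can be compared against a direct numerical solution of the two threshold crossings of (3). The sketch below uses $R_O = 150\ \Omega$ and $C_O = 17.5$ fF from the caption of Fig. 5; the $C_L$ values, the reading of the constant 25 as ohms, and the bisection helper are assumptions made for illustration.

```python
import math

T_B = 1 / 36e9               # bit period at 36 Gb/s
R_O, C_O = 150.0, 17.5e-15   # values quoted in the caption of Fig. 5


def tau_od(c_l):
    """tau_OD = (25 // 0.5*R_O) * (C_L + 4*C_O); the 25 is assumed to be in ohms."""
    r = 25.0 * 0.5 * R_O / (25.0 + 0.5 * R_O)
    return r * (c_l + 4 * C_O)


def y_db1(t, tau):
    """Response of the 1st-order model to the test signal p_DB(t) of (2)."""
    def step(x):
        return 1.0 - math.exp(-x / tau) if x > 0 else 0.0
    return step(t) + step(t - T_B) - step(t - 2 * T_B) - step(t - 3 * T_B)


def crossing(f, lo, hi, level, iters=100):
    """Bisection for f(t) = level on [lo, hi]; assumes a single crossing inside."""
    rising = f(lo) < level
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) < level) == rising:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def jitter_db1_closed(tau):
    """Closed-form ISI jitter of (4)."""
    x = math.exp(T_B / tau)
    return tau * math.log((x + 1) / (x - 1 / x - 1))


def jitter_db1_numeric(tau):
    """T_1_DB - T_2_DB from the 1.5-level crossings of y_DB1(t)."""
    t_rise = crossing(lambda t: y_db1(t, tau), T_B, 2 * T_B, 1.5)
    t_fall = crossing(lambda t: y_db1(t, tau), 2 * T_B, 3 * T_B, 1.5)
    return t_rise - (t_fall - T_B)


if __name__ == "__main__":
    for c_l in (100e-15, 200e-15, 400e-15):  # hypothetical load capacitances
        tau = tau_od(c_l)
        print(f"C_L = {c_l * 1e15:.0f} fF: closed form = {jitter_db1_closed(tau):.3e} s, "
              f"numeric = {jitter_db1_numeric(tau):.3e} s")
```

The two results should coincide to numerical precision as long as both crossings exist, i.e., as long as $\tau_{OD}$ is small enough relative to $T_B$ for the response to pass through 1.5 within each bit slot.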

If  $\tau_{ID}$  is also taken into account, the impulse response of the 2<sup>nd</sup>-order model can be derived as

$$h_2(t) = \frac{e^{-t/\tau_{ID}} - e^{-t/\tau_{OD}}}{\tau_{ID} - \tau_{OD}} \tag{5}$$

where  $\tau_{ID} = R_I C_{pd}$, in which  $R_I$  and  $C_{pd}$  are the total parasitic resistance and capacitance at Ad, respectively; both can be estimated by simulations. Further, the pulse response  $y_{DB2}(t)$  of the 2<sup>nd</sup>-order model can be deduced as

$$y_{DB2}(t) = p_{DB}(t) * h_2(t) \tag{6}$$

From (6), one can solve  $y_{DB2}(T_{1\_DB}) = 1.5$  and  $y_{DB2}(T_{2\_DB} + T_B) = 1.5$  to obtain  $T_{1\_DB}$  and  $T_{2\_DB}$. The two equations are rearranged as (7) and (8). Note that (7) and (8) have no closed-form solutions. Instead, a numerical root finder (e.g., the fzero function in MATLAB) can be used to estimate  $T_{1\_DB}$  and  $T_{2\_DB}$. Thus, the calculated ISI jitter  $\Delta T_{DB2} = T_{1\_DB} - T_{2\_DB}$  based on the 2<sup>nd</sup>-order model is obtained, and plotted in Fig. 5(b) together with the simulated jitter.
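The numerical step can be sketched without MATLAB. The example below solves the two 1.5-level crossings of $y_{DB2}(t)$ with a plain bisection search standing in for fzero; the time constants used in the demonstration are hypothetical, and all function names are illustrative.

```python
import math

T_B = 1 / 36e9  # bit period at 36 Gb/s


def step2(t, tau_od, tau_id):
    """Unit-step response of h_2(t) in (5); requires tau_id != tau_od."""
    if t <= 0:
        return 0.0
    num = tau_id * math.exp(-t / tau_id) - tau_od * math.exp(-t / tau_od)
    return 1.0 - num / (tau_id - tau_od)


def y_db2(t, tau_od, tau_id):
    """Response of the 2nd-order model to p_DB(t), i.e., (6)."""
    return (step2(t, tau_od, tau_id) + step2(t - T_B, tau_od, tau_id)
            - step2(t - 2 * T_B, tau_od, tau_id) - step2(t - 3 * T_B, tau_od, tau_id))


def crossing(f, lo, hi, level, iters=100):
    """Bisection for f(t) = level on [lo, hi]; assumes a single crossing inside."""
    rising = f(lo) < level
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) < level) == rising:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def jitter_db2(tau_od, tau_id):
    """Delta T_DB2 = T_1_DB - T_2_DB from the 2nd-order model."""
    f = lambda t: y_db2(t, tau_od, tau_id)
    t_rise = crossing(f, T_B, 2 * T_B, 1.5)
    t_fall = crossing(f, 2 * T_B, 3 * T_B, 1.5)
    return t_rise - (t_fall - T_B)


def jitter_db1(tau_od):
    """1st-order closed form (4), kept here as a reference."""
    x = math.exp(T_B / tau_od)
    return tau_od * math.log((x + 1) / (x - 1 / x - 1))


if __name__ == "__main__":
    tau_od, tau_id = 8e-12, 4e-12  # hypothetical time constants, not extracted values
    print(f"2nd-order jitter: {jitter_db2(tau_od, tau_id):.3e} s")
    print(f"1st-order jitter: {jitter_db1(tau_od):.3e} s")
```

A useful consistency check is that the 2nd-order result collapses to (4) when $\tau_{ID}$ is made negligibly small.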

Two  $2T_B$  data with a phase difference of  $T_B$  are combined into the duobinary signal with  $T_B$  [Fig. 5(c), right], in which the upper and lower 'NRZ' signals are symmetrical about 1 [Fig. 5(c), middle]. Therefore, we employ the analysis method of the NRZ-signal DDJ, in which a unit pulse  $p_{NRZ}(t)$  [Fig. 5(d), left] serves as the input test signal to estimate the NRZ-signal DDJ [Fig. 5(d), middle].

$$p_{NRZ}(t) = u(t - T_B) - u(t - 2T_B) \tag{9}$$

NRZ with  $T_B$  can be routed from two  $2T_B$  data with a phase difference of  $T_B$  by a selector [Fig. 5(d), right]. The parasitic resistance is  $R_O$. Namely, the parasitic resistance in the turn-on/off behavior [Fig. 5(c), right] is approximately half of that in the summation behavior [Fig. 5(d), right]. In the following analysis, one can replace  $\tau_{OD}$  by  $\tau_{OD\_NRZ} = (25//R_O) \times (C_L + 2C_O)$  in both (1) and (5). Further, the two pulse responses  $y_{NRZ1}(t)$  and  $y_{NRZ2}(t)$  can be written as

$$y_{NRZ1}(t) = p_{NRZ}(t) * h_1(t) \tag{10}$$

$$y_{NRZ2}(t) = p_{NRZ}(t) * h_2(t) \tag{11}$$

Here we denote  $T_{1\_NRZ}$  and  $T_{2\_NRZ}$  as the threshold crossing times for the amplitude of 0.5 at the adjacent rising and falling edges [Fig. 5(d), middle], respectively.  $T_{1\_NRZ}$  can be solved from  $y_{NRZ1}(T_{1\_NRZ}) = 0.5$ , whereas  $T_{2\_NRZ}$  can be solved from  $y_{NRZ1}(T_B + T_{2\_NRZ}) = 0.5$ . The closed-form solution of the ISI jitter is derived below and plotted in Fig. 5(b).

$$\Delta T_{NRZ1} = T_{1\_NRZ} - T_{2\_NRZ} = -\tau_{OD\_NRZ} \ln \left(1 - e^{-T_B/\tau_{OD\_NRZ}}\right) \tag{12}$$

From (11), one can solve  $y_{NRZ2}(T_{1\_NRZ}) = 0.5$  and  $y_{NRZ2}(T_{2\_NRZ}+T_B) = 0.5$  to obtain  $T_{1\_NRZ}$  and  $T_{2\_NRZ}$. The two equations are rearranged as (13) and (14). The fzero function can be applied to (13) and (14) to estimate  $T_{1\_NRZ}$  and  $T_{2\_NRZ}$. Finally, the ISI jitter  $\Delta T_{NRZ2} = T_{1\_NRZ} - T_{2\_NRZ}$  can be calculated, as plotted in Fig. 5(b).
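The NRZ counterpart follows the same recipe with the unit pulse of (9) and a threshold of 0.5. The sketch below evaluates the closed form (12) and solves the crossings numerically for either model order; the load capacitance, the value of $\tau_{ID}$, the reading of the constant 25 as ohms, and the bisection helper (used in place of fzero) are assumptions.

```python
import math

T_B = 1 / 36e9               # bit period at 36 Gb/s
R_O, C_O = 150.0, 17.5e-15   # values quoted in the caption of Fig. 5


def tau_od_nrz(c_l):
    """tau_OD_NRZ = (25 // R_O) * (C_L + 2*C_O); the 25 is assumed to be in ohms."""
    return (25.0 * R_O / (25.0 + R_O)) * (c_l + 2 * C_O)


def step(t, tau_od, tau_id=None):
    """Unit-step response of h_1(t) (tau_id is None) or h_2(t) (tau_id given)."""
    if t <= 0:
        return 0.0
    if tau_id is None:
        return 1.0 - math.exp(-t / tau_od)
    num = tau_id * math.exp(-t / tau_id) - tau_od * math.exp(-t / tau_od)
    return 1.0 - num / (tau_id - tau_od)


def y_nrz(t, tau_od, tau_id=None):
    """Response to the unit pulse p_NRZ(t) of (9), i.e., (10) or (11)."""
    return step(t - T_B, tau_od, tau_id) - step(t - 2 * T_B, tau_od, tau_id)


def crossing(f, lo, hi, level, iters=100):
    """Bisection for f(t) = level on [lo, hi]; assumes a single crossing inside."""
    rising = f(lo) < level
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) < level) == rising:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def jitter_nrz(tau_od, tau_id=None):
    """T_1_NRZ - T_2_NRZ from the 0.5-level crossings."""
    f = lambda t: y_nrz(t, tau_od, tau_id)
    t_rise = crossing(f, T_B, 2 * T_B, 0.5)
    t_fall = crossing(f, 2 * T_B, 3 * T_B, 0.5)
    return t_rise - (t_fall - T_B)


def jitter_nrz1_closed(tau_od):
    """Closed-form 1st-order ISI jitter of (12)."""
    return -tau_od * math.log(1.0 - math.exp(-T_B / tau_od))


if __name__ == "__main__":
    tau = tau_od_nrz(400e-15)  # hypothetical C_L of 400 fF
    print(f"(12), closed form : {jitter_nrz1_closed(tau):.3e} s")
    print(f"1st order, numeric: {jitter_nrz(tau):.3e} s")
    print(f"2nd order, numeric: {jitter_nrz(tau, tau_id=4e-12):.3e} s")  # tau_ID assumed
```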

Compared with the simulated jitters [Fig. 5(b)], the jitters calculated using (4) and (12) based on the 1<sup>st</sup>-order model are handy yet less accurate. The 2<sup>nd</sup>-order calculated ISI jitters ( $\Delta T_{DB2}$  and  $\Delta T_{NRZ2}$ ) are more precise, since the shapes between region A [Fig. 5(c), middle] and region B [Fig. 5(d), middle] are different. The former is impacted by the 0-to-1 transition along with the 1-to-2 transition, but the latter is only affected by the 0-to-1 transition.

## V. A 36-Gb/s DUOBINARY-SIGNAL TRANSMITTER PROTOTYPE

$$\frac{\tau_{ID}e^{-T_{1\_DB}/\tau_{ID}}(e^{T_B/\tau_{ID}}+1) - \tau_{OD}e^{-T_{1\_DB}/\tau_{OD}}(e^{T_B/\tau_{OD}}+1)}{\tau_{ID} - \tau_{OD}} = 0.5 \tag{7}$$

$$\frac{\tau_{ID}e^{-T_{2\_DB}/\tau_{ID}}(e^{T_B/\tau_{ID}} - e^{-T_B/\tau_{ID}} - 1) - \tau_{OD}e^{-T_{2\_DB}/\tau_{OD}}(e^{T_B/\tau_{OD}} - e^{-T_B/\tau_{OD}} - 1)}{\tau_{ID} - \tau_{OD}} = 0.5 \tag{8}$$

To deliver higher serial data rates, both PAM4 and duobinary formats are 2-fold more BW-efficient than the typical NRZ counterpart [13], [14]. Yet, these multi-level signals have a modulation penalty [MP =  $10 \times \log_{10}(M - 1)$], which is 3 dB for duobinary with 3 levels (M = 3), and 4.8 dB for PAM4 with 4 levels (M = 4) [26]. In fact, the intrinsic advantage of duobinary can be over 1.8 dB, due to the lower hardware complexity of the demodulation when compared with that of PAM4.
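The modulation-penalty figures quoted for duobinary and PAM4 follow directly from the formula. A minimal check is given below; the function name is an illustrative assumption.

```python
import math


def modulation_penalty_db(levels):
    """MP = 10 * log10(M - 1) for an M-level signal."""
    return 10 * math.log10(levels - 1)


if __name__ == "__main__":
    mp_db = modulation_penalty_db(3)    # duobinary, M = 3
    mp_pam4 = modulation_penalty_db(4)  # PAM4, M = 4
    print(f"duobinary: {mp_db:.2f} dB, PAM4: {mp_pam4:.2f} dB, "
          f"difference: {mp_pam4 - mp_db:.2f} dB")
```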

A duobinary signal can be generated in three different ways. 1) The duobinary stream can be created at the transmitter output by passing a one-bit NRZ stream through a delay-and-add block [5], [6], in order to construct the filtering function  $1 + z^{-1}$. 2) This filtering function can be fitted by combining the equalization at the transmitter or receiver with the channel response [7]–[11], achieving the duobinary stream at the channel output, but this approach is more susceptible to channel variation. Generally, the comparator-XOR demodulation at the receiver side is exploited to achieve duobinary-to-NRZ signal conversion. Yet, such conversion can be simplified by the full-rate decision-feedback equalization (DFE) [11]. 3) A full-rate DFE receiver demodulates the f-(bit/s) duobinary signal, which carries a two-bit f/2-(bit/s) NRZ signal produced by resorting to the time-interleaved 4-to-2 MUX [12]. This work selects the duobinary-signal transmitter with a target data rate of over 30 Gb/s for demonstrating the effectiveness of our proposed MUXs. The key feature is a cross-quadrature clocking scheme.

### A. Architecture

The transmitter architecture [Fig. 6(a)] features two branches for its CML DFF + latch array, with one latch and one DFF in each, performing 4-way data retiming. In the upper branch, data path A is retimed to Ad by the latch clocked with IP, while data path B is retimed to Bd by a DFF under the same clock. As a result, the data transition of Ad occurs at the rising edge of IP and that of Bd at the falling edge, leading to a phase difference of 2UI between the two paths. Similarly, data C and data D in the lower branch are retimed by QP to generate a phase difference of 2UI between them. Considering the inherent 1UI phase difference between IP and QP, the retimed data paths Ad, Cd, Bd and Dd develop four different phases with 1UI phase separation. A 4-to-2 MUX controlled by QP and IN is then connected to serialize the four f/4-(bit/s) data into two f/2-(bit/s) data in the upper and lower branches, respectively. By properly matching the four-phase clocks of f/4-(Hz) to the four-phase data, the data are only valid during their 2<sup>nd</sup> and 3<sup>rd</sup> quarter phases. In [12], 2UI is placed before the validation period, and zero UI after it. Here, in contrast, the timing margin is improved by 2UI (one before and one after the validation period). When summing up the two streams (E and F) in the current mode, time-interleaving is naturally formed due to the 1UI phase difference between E and F. As a result, the duobinary signal with 1UI appears at SUM. The timing diagram exhibiting the maximum timing margin at every MUX operation is shown in Fig. 6(b). The load uses the passive shunt-peaking  $(L_1)$  technique to enhance the BW.

Fig. 6. (a) Architecture of the proposed duobinary-signal transmitter. It features a 2-latch + 2-DFF array for 4-way data retiming, a quadrature clocking scheme and a shunt-peaking load. (b) Its timing diagram showing the maximum timing margin at every MUX stage.

Fig. 7. (a) Proposed two-stage S2D converter and (b) the equivalent model of the proposed BW-enhancement technique (L = 60 nm).

# B. S2D Converter

Since only a single-ended clock generator up to 18 GHz is available in our lab, the transmitter integrates an S2D converter to drive the differential CML DIV2 circuit as depicted in Fig. 6(a). Compared with the typical topology [13], our S2D converter [Fig. 7(a)] features a grounded active inductor (GAI) [27] with a negative transconductance of  $-g_{m,Mn}$  from  $M_n$  to inject extra current into the output nodes ( $V_{o1p,n}$  and  $V_{o2p,n}$ ), enhancing the high-frequency bandpass response of an LC-tank in the simplified model [Fig. 7(b)]. The concept is similar to the loss compensation (i.e., a Q factor of ~2) of the LC-tank in a voltage-controlled oscillator. As there is another equivalent

$$\frac{\tau_{ID}e^{-T_{1\_NRZ}/\tau_{ID}} - \tau_{OD\_NRZ}e^{-T_{1\_NRZ}/\tau_{OD\_NRZ}}}{\tau_{ID} - \tau_{OD\_NRZ}} = 0.5$$
(13)  
$$\frac{\tau_{ID}e^{-T_{2\_NRZ}/\tau_{ID}}(1 - e^{-T_B/\tau_{ID}}) - \tau_{OD\_NRZ}e^{-T_{2\_NRZ}/\tau_{OD\_NRZ}}(1 - e^{-T_B/\tau_{OD\_NRZ}})}{\tau_{ID} - \tau_{OD\_NRZ}} = 0.5$$
(14)



Fig. 8. (a) Simulated gain responses at  $V_{o2p}$  and  $V_{o2n}$ ; (b) their gain mismatch under different Ibn; (c) single-ended average swing [i.e.,  $(V_{o2p} + V_{o2n})/2$ ] and (d) their swing mismatch [i.e.,  $2(V_{o2p} - V_{o2n})/(V_{o2p} + V_{o2n})$ ] versus frequency when the input single-ended swing is set at 100 mV<sub>pp</sub>.

impedance  $(Z_L)$  in the voltage-to-current path that directly loads the LC-tank with a loss  $R_p$ , careful sizing can ensure max  $\{g_{m,Mn}\} \times (R_p || \text{ Re } \{Z_L\}) < 1$  to prevent instability (also confirmed by simulations).
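As a quick numerical illustration of the sizing condition above, the following sketch evaluates max{g<sub>m,Mn</sub>} × (R<sub>p</sub> || Re{Z<sub>L</sub>}) for a set of component values. The function names and all numbers are hypothetical and are not taken from the design; the sketch only shows how the inequality would be checked.

```python
def gai_loop_gain(gm_max, r_p, z_l):
    """Return gm_max * (R_p || Re{Z_L}) for the stability check.

    gm_max : maximum transconductance of M_n in siemens
    r_p    : equivalent loss resistance of the LC-tank in ohms
    z_l    : complex equivalent load impedance Z_L in ohms
    """
    re_zl = z_l.real
    r_parallel = r_p * re_zl / (r_p + re_zl)  # R_p in parallel with Re{Z_L}
    return gm_max * r_parallel


def is_stable(gm_max, r_p, z_l):
    """True when the product stays below unity, as the text requires."""
    return gai_loop_gain(gm_max, r_p, z_l) < 1.0


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    gm_max = 2e-3             # 2 mS
    r_p = 600.0               # ohms
    z_l = complex(300.0, 50.0)
    print(gai_loop_gain(gm_max, r_p, z_l), is_stable(gm_max, r_p, z_l))
```

In practice Z<sub>L</sub> is frequency dependent, so the check would be repeated over the band of interest with the extracted impedance.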

The proposed S2D converter enhances the high-frequency gain from 5.53 to 10.63 dB, and extends the  $f_{-3dB}$  BW from 27.7 to 32 GHz as the bias current (I<sub>bn</sub>) of the GAI increases [Fig. 8(a)]. The gain mismatch is <1 dB from dc to 40 GHz [Fig. 8(b)]. The transient simulations show that the single-ended average swing [Fig. 8(c)] at V<sub>o2p,n</sub> increases monotonically with I<sub>bn</sub>, and is doubled at 25 GHz as I<sub>bn</sub> rises from 0.1 to 1.5 mA. The single-ended swing mismatch [Fig. 8(d)] in the time domain is <10% up to 25 GHz. The swing mismatch can be further absorbed by the subsequent CML DIV2.

# C. Clock Buffer

The clock buffer [Fig. 9(a)] is critical when balancing the output swing and power consumption. The proposed clock buffer with the GAI [Fig. 9(b)] offers peaking at the expected clock frequency, due to the occurrence of one zero  $(z_1 = g_{m,Mn}/C_n)$  inside the GAI [see Fig. 7(b)], and two high-Q complex poles:  $\omega_0 = \sqrt{[g_{m,Mn}/(R_{L1}C_{p1}C_n)]}$  and  $Q = \sqrt{\{(g_{m,Mn}R_{L1}C_{p1}C_{n})/[g_{m,Mn}R_{L1}C_{p1} + C_{n}(1 - g_{m,Mn}R_{L1})]\}}$, where C<sub>p1</sub> denotes the parasitic load capacitance. Under the same dc gain and power budget  $(I_{b1} = I_{b2} + 2I_{bn})$ , the gain peaking at  $f_1$  first rises and then drops as the current ratio ( $\alpha = 2I_{bn}/I_{b2}$ ) between the GAI and input stage increases; the optimum peaking is located at  $\alpha$  of 0.81 [Fig. 9(c)]. The simulated clock swing increases by  $\sim 1.3 \times$  at 9 GHz [Fig. 9(d)] when compared with the conventional clock buffer. Yet, a phase variation appears in association with the magnitude peaking, resulting in an input-output time delay ( $\Delta t_{pro}$ )



Fig. 9. Clock buffer (a) conventional; (b) proposed; (c) the optimal gain peaking against  $\alpha = 2I_{bn}/I_{b2}$ ; (d) swing enhancement and (e) time delay  $\Delta t$  versus frequency when the input single-ended swing is set at 280 mV<sub>pp</sub>.

as given by

$$\Delta t_{pro} = \frac{1}{\omega} \arctan \frac{Q\omega^3 + (z_1\omega_0 - Q\omega_0^2)\omega}{(\omega_0 - Qz_1)\omega^2 + Qz_1\omega_0^2}$$
(15)

The simulated  $\Delta t_{\text{pro}}$  of ~2.7 ps [Fig. 9(e)] is higher than that (0.15 ps) of the conventional clock buffer, but is acceptable here when compared with our targeted clock period (111 ps). In addition, the output random jitter [28] related to the voltage noise inside the clock buffer can be calculated as  $\Delta t_{\text{pro}}^2 = \Delta v_n^2 [C_{\text{p1}}/(I_{\text{b2}} + I_{\text{bn}})]^2$ , in which  $\Delta v_n^2$  is the total output noise voltage. As the frequency varies from 6.5 to 10 GHz, the simulated and calculated time delays are consistent when the output conductance and parasitic capacitance are considered in (15). Out of the total simulated  $\Delta t_{\text{pro}}^2$  of 8.1 fs, 70% is due to the clock buffer before applying the GAI.
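The delay expression (15) can be evaluated numerically as sketched below. The function follows (15) literally, including the use of a plain arctangent of the ratio; the parameter values in the usage lines are hypothetical placeholders and are not fitted to the reported design or to Fig. 9(e).

```python
import math


def delta_t_pro(omega, z1, omega0, q):
    """Input-output time delay of the GAI clock buffer per (15).

    omega  : angular frequency of interest (rad/s)
    z1     : zero of the GAI (rad/s)
    omega0 : natural frequency of the complex pole pair (rad/s)
    q      : quality factor of the complex pole pair
    """
    num = q * omega**3 + (z1 * omega0 - q * omega0**2) * omega
    den = (omega0 - q * z1) * omega**2 + q * z1 * omega0**2
    return math.atan(num / den) / omega


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    z1 = 2 * math.pi * 30e9
    omega0 = 2 * math.pi * 10e9
    q = 2.0
    for f_ghz in (1.0, 6.5, 9.0):
        w = 2 * math.pi * f_ghz * 1e9
        print(f_ghz, "GHz:", delta_t_pro(w, z1, omega0, q) * 1e12, "ps")
```

As a sanity check on the expression as written, for frequencies well below both  $\omega_0$  and  $z_1$  the delay tends to  $1/(Q\omega_0) - 1/z_1$ . The sketch raises a division error if the denominator of the arctangent argument is exactly zero, a case a production script would need to handle.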

# D. Simulation and Experimental Results

Under a  $2^7 - 1$  pseudo-random binary sequence (PRBS) signal, extensive transistor-level simulations are essential to confirm the robustness of the transmitter. The 36-Gb/s differential output data eye [Fig. 10(a)] shows the 3-level duobinary signaling, exhibiting horizontal openings of 0.932/0.929UI and vertical openings of 42.2%/41.7%, where each vertical opening (VO<sub>1</sub> and VO<sub>2</sub>) is expressed as a ratio to the total differential swing (V<sub>out,pp</sub>). These performances are attributed



Fig. 10. Simulated transient eyes of the proposed duobinary-signal transmitter in Fig. 4(a): (a) 36-Gb/s differential output data ( $V_{out} = V_{outp} - V_{outn}$ ) and (b) 9-GHz differential clocks.

to the merged MUX-and-SUM output stage that directly steers the 4-way retimed data streams into one final high-speed duobinary data stream. Since the corresponding differential 9-GHz clock passes through the single-to-differential clock path (i.e., the off-chip clock runs through the S2D, the DIV2, and then the clock buffer), high-quality matching of the amplitude and phase between IP-IN and QP-QN [Fig. 10(b)] is achieved. Also, the swing-enhanced clocks allow effective switching of the IP/IN/QP/QN-controlled transistors in the MUX-and-SUM output stage, reducing the impact of their turn-on resistance on the output data. Comparing the vertical [Fig. 11(a)] and horizontal [Fig. 11(b)] openings with and without the GAI, VO<sub>1</sub>/VO<sub>2</sub> and HO<sub>1</sub>/HO<sub>2</sub> are improved by  $\sim 10\%$  and  $\sim 0.05$  UI up to 40 Gb/s, respectively. At 45 Gb/s, such improvements extend to 15% and 0.15UI, respectively. The simulated clock-to-Q ( $t_{cq}$ ) [27] of ~17 ps in each data-retiming path can be optimized for the serialized duobinary signal with the data rate of 36 Gb/s, where the data jitter is minimum.

The influence of process variations on the vertical [Fig. 12(a)] and horizontal [Fig. 12(b)] openings was assessed under a fixed power budget. The simulated VO<sub>1</sub>/VO<sub>2</sub> and HO<sub>1</sub>/HO<sub>2</sub> are minimally 40.8% and 0.89 UI, respectively. Comparing the results of the two extreme process corners, SS and FF, the horizontal opening improves from 0.9/0.89UI to 0.95/0.946UI.

The transmitter in 65-nm CMOS occupies a tiny area of 0.037 mm<sup>2</sup> (Fig. 13). Out of the 46.8-mW total power from a 1.2-V supply, 23% is consumed by the DFF + latch array, 20% by the MUXs, 16% by the DIV2, and 41% by the clock buffers at 36 Gb/s. The high-speed data and clock signals are injected through the RF probes and cables. We included the S-parameters of the off-chip connection in the simulation environment. The f/4-(bit/s) PRBS testing signal of  $2^7 - 1$  was generated by a J-BERT 4903B. We used the MSOV334A oscilloscope to capture the measured eye diagram [Fig. 14(a)] and time waveform [Fig. 14(b)] at 12 Gb/s, in which the



Fig. 11. Simulated (a) vertical opening and (b) horizontal opening of the transmitter in Fig. 4(a) under different clock buffers: with and without the GAI.



Fig. 12. Simulated (a) vertical opening and (b) horizontal opening of the transmitter in Fig. 4(a) under different process corners: Fast-Fast (FF), Slow-Slow (SS), Typical-Typical (TT), Fast-NMOS-Slow-PMOS (FS) and Slow-NMOS-Fast-PMOS (SF).



Fig. 13. Die photo of the fabricated duobinary-signal transmitter.

slow edges are caused by the loss from the non-ideal on-chip transmission lines and auxiliary connections. As shown in Fig. 15(a), the measured eye widths and heights at 36 Gb/s are 0.5/0.49 UI and 52/40 mV, respectively. They are worse than those in Fig. 10(a), due to the absence of equalization in this prototype. The measured eye at 36 Gb/s is similar to the simulated one with the estimated loss included (but not the random jitter). The measured and simulated eyes



Fig. 14. The measured (a) eye diagram and (b) time waveform of the transmitter at 12 Gb/s.



Fig. 15. (a) The simulated (left) and measured (right) eye diagrams of the duobinary-signal transmitter at 36 Gb/s and (b) the measured time waveform and eye diagram of the clock at 9 GHz.



Fig. 16. The measured (a) horizontal and (b) vertical openings against data rates.

are dominated by the ISI jitter. For the measured eye, the random jitter is induced by the supply noise and the phase noise of the RF signal equipment. The measured 9-GHz clock [Fig. 15(b)], captured with the input data turned off, exhibits a ~50% duty cycle owing to the compact (i.e., clock metal line <75  $\mu$m) and symmetric clock layout. The measured and simulated results for the horizontal [Fig. 16(a)] and vertical [Fig. 16(b)] openings are consistent under different data rates.

TABLE III Performance Summary and Comparison

|                                                                                  | This Work                 | ISSCC'14 [1]              | ISSCC'17 [4]              | ISSCC'13 [12]             |
|----------------------------------------------------------------------------------|---------------------------|---------------------------|---------------------------|---------------------------|
| Technology                                                                       | 65nm CMOS                 | 65nm CMOS                 | 40nm CMOS                 | 28nm CMOS                 |
| Transmitter Architecture                                                         | 4 : 1                     | 2:1                       | 4 : 1                     | 8:1                       |
| Driver Topology                                                                  | CML +<br>Shunt peaking    | CML +<br>Shunt Peaking    | CML                       | CML                       |
| Modulation                                                                       | Duobinary                 | PAM4                      | PAM4                      | Duobinary                 |
| Data Rate (Gb/s)                                                                 | 36                        | 60                        | 56                        | 32                        |
| Clock Type                                                                       | Quarter-Rate              | Half-Rate                 | Quarter-Rate              | Quarter-Rate              |
| Vertical Opening VO<sub>1</sub> / VO<sub>2</sub> / VO<sub>3</sub> (mV) <sup>4</sup> | 52 / 40                   | 50 / 50 / 50 <sup>1</sup> | 200 / 200 / 200 <sup>1</sup> | 20 / 20 <sup>1</sup>      |
| Full Magnitude                                                                   | 366 mV<sub>pp,diff</sub> | 250 mV<sub>pp,sing</sub> | 600 mV<sub>pp,sing</sub> | 500 mV<sub>pp,diff</sub> |
| Horizontal Opening HO<sub>1</sub> / HO<sub>2</sub> / HO<sub>3</sub> (UI) <sup>4</sup> | 0.5 / 0.49                | 0.6 / 0.4 / 0.4 <sup>1</sup> | 0.7 / 0.62 / 0.7 <sup>1</sup> | 0.6 / 0.6 <sup>1</sup>    |
| V <sub>DD</sub> (V)                                                              | 1.2                       | 1.2                       | 1/1.5                     | 0.9                       |
| Power <sup>3</sup> (mW)                                                          | 46.8                      | 261                       | 200                       | 97.9                      |
| Die Size (mm <sup>2</sup> )                                                      | 0.037                     | 0.48 <sup>2</sup>         | 0.8 <sup>2</sup>          | 0.16                      |
| FOM (mW/Gb/s)                                                                    | 1.3                       | 4.35                      | 3.57                      | 3.1                       |

<sup>1</sup> Extracted values from plots.

<sup>2</sup> Die size is estimated from the corresponding reference.

<sup>3</sup> Power includes retimer, divider, clock buffer and main transmitter.

<sup>4</sup> VO<sub>3</sub> and HO<sub>3</sub> denote the third vertical and horizontal openings for PAM4.

Benchmarking with the prior art in Table III, this work succeeds in improving the FOM and data eye opening by averting several power-hungry blocks, while operating the transmitter in the current domain to maximize the internal BW. The power consumption, especially in the clock path, can be reduced further by migrating this work to a more advanced process (e.g., 28-nm CMOS). Also, replacing the 2-latch + 2-DFF array with high-speed digital circuits and employing a power-efficient divide-by-2 circuit [30] are prospective.
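The FOM row of Table III and the power breakdown quoted earlier follow from simple arithmetic, reproduced in the sketch below. The power, data-rate, and percentage figures are copied from the text and Table III; the dictionary layout and names are only illustrative.

```python
# (power in mW, data rate in Gb/s), as listed in Table III
TABLE_III = {
    "This work": (46.8, 36.0),
    "ISSCC'14": (261.0, 60.0),
    "ISSCC'17": (200.0, 56.0),
    "ISSCC'13": (97.9, 32.0),
}

# Share of the 46.8-mW total power at 36 Gb/s, as stated in the text
POWER_SHARE = {
    "DFF + latch array": 0.23,
    "MUXs": 0.20,
    "DIV2": 0.16,
    "Clock buffers": 0.41,
}


def fom_mw_per_gbps(power_mw, rate_gbps):
    """Energy-efficiency figure-of-merit: power divided by data rate."""
    return power_mw / rate_gbps


def power_breakdown_mw(total_mw, shares):
    """Convert fractional shares of the total power into milliwatts."""
    return {block: total_mw * share for block, share in shares.items()}


if __name__ == "__main__":
    for name, (p_mw, rate) in TABLE_III.items():
        print(f"{name}: {fom_mw_per_gbps(p_mw, rate):.2f} mW/Gb/s")
    for block, p_mw in power_breakdown_mw(46.8, POWER_SHARE).items():
        print(f"{block}: {p_mw:.1f} mW")
```

With these inputs the clock buffers account for roughly 19 mW of the 46.8 mW, which is why the clock path is singled out as the main target for further power reduction.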

#### VI. CONCLUSIONS

This paper has proposed a power-efficient cross-quadrature-clocking 2-to-1 MUX, and its variants, the 4-to-2 and 4-to-1 MUXs. They feature simplified hardware without the need for delay buffers, while preserving a maximum timing margin in each multiplexing step. The MUXs are embedded in the design of a 36-Gb/s duobinary-signal transmitter, which is further developed with other power- and BW-efficient circuit techniques (e.g., single-stage MUX-and-SUM operation) to improve the performance. Fabricated in 65-nm CMOS, the transmitter achieves an FOM of 1.3 mW/Gb/s at 36 Gb/s. The proposed 2-to-1 MUX is also applicable to PAM4-signal transmitters to improve the power efficiency.

#### ACKNOWLEDGEMENT

The authors would like to thank Z. Yang, H. Guo, L. Kong, and C. Li.

#### REFERENCES

- [1] P.-C. Chiang, H.-W. Hung, H.-Y. Chu, G.-S. Chen, and J. Lee, "60 Gb/s NRZ and PAM4 transmitters for 400 GbE in 65 nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 42–43.
- [2] A. Nazemi et al., "A 36 Gb/s PAM4 transmitter using an 8 b 18 GS/S DAC in 28 nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 58–59.
- [3] J. Kim et al., "A 16-to-40 Gb/s quarter-rate NRZ/PAM4 dual-mode transmitter in 14 nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 60–61.
- [4] P. J. Peng, J. F. Li, L. Y. Chen, and J. Lee, "A 56Gb/s PAM-4/NRZ transceiver in 40nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 110–111.
- [5] G. Miao, P. Ju, D. Ng, J. Khoury, and K. Lakshmikumar, "A fully-integrated 10.5 to 13.5 Gbps transceiver in 0.13 μm CMOS," in *Proc. IEEE CICC*, Sep. 2003, pp. 595–598.
- [6] R. Tao and M. Berroth, "Monolithically integrated 5 Gb/s CMOS duobinary transmitter for optical communication systems," in *Proc. IEEE RFIC*, Jun. 2004, pp. 21–24.
- [7] J. H. Sinsky, M. Duelk, and A. Adamiecki, "High-speed electrical backplane transmission using duobinary signaling," *IEEE Trans. Microw. Theory Techn.*, vol. 53, no. 1, pp. 152–160, Jan. 2005.
- [8] K. Yamaguchi et al., "12 Gb/s duobinary signaling with ×2 oversampled edge equalization," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2005, pp. 70–71.
- [9] J. H. Sinsky, A. Konczykowska, A. Adamiecki, F. Jorge, and M. Duelk, "39.4 Gb/s data transmission over 24.4 meters of coaxial cable using duobinary signaling," in *IEEE MTT-S Int. Microw. Symp. Dig.*, Atlanta, GA, USA, Jun. 2008, pp. 197–200.
- [10] Y. Ban et al., "A wide-band, 5-tap transversal filter with improved testability for equalization up to 84 Gb/s," *IEEE Microw. Wireless Compon. Lett.*, vol. 25, no. 11, pp. 739–741, Nov. 2015.
- [11] Y.-M. Ying, I.-T. Lee, and S.-I. Liu, "A 20 Gb/s adaptive duobinary transceiver," in *Proc. IEEE Asian Solid State Circuits Conf. (A-SSCC)*, Nov. 2012, pp. 129–132.
- [12] Y. Ogata et al., "32 Gb/s 28 nm CMOS time-interleaved transmitter compatible with NRZ receiver with DFE," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2013, pp. 40–41.
- [13] J. Lee, M.-S. Chen, and H.-D. Wang, "Design and comparison of three 20-Gb/s backplane transceivers for duobinary, PAM4, and NRZ data," *IEEE J. Solid-State Circuits*, vol. 43, no. 9, pp. 2120–2133, Sep. 2008.
- [14] B. Min, K. Lee, and S. Palermo, "A 20 Gb/s triple-mode (PAM-2, PAM-4, and duobinary) transmitter," *Microelectron. J.*, vol. 43, no. 10, pp. 687–696, Oct. 2012.
- [15] J. Lee *et al.*, "Design of 56 Gb/s NRZ and PAM4 SerDes transceivers in CMOS technologies," *IEEE J. Solid-State Circuits*, vol. 50, no. 9, pp. 2061–2073, Sep. 2015.
- [16] J. Cao *et al.*, "OC-192 transmitter and receiver in standard 0.18-μm CMOS," *IEEE J. Solid-State Circuits*, vol. 37, no. 12, pp. 1768–1780, Dec. 2002.
- [17] A. Yazdi and M. M. Green, "A 40-Gb/s full-rate 2:1 MUX in 0.18-μm CMOS," *IEEE Trans. Microw. Theory Techn.*, vol. 59, no. 11, pp. 2879–2887, Nov. 2011.
- [18] H. Tao *et al.*, "40–43-Gb/s OC-768 16:1 MUX/CMU chipset with SFI-5 compliance," *IEEE J. Solid-State Circuits*, vol. 38, no. 12, pp. 2169–2180, Dec. 2003.
- [19] H. Lu, C. Su, and C. N. J. Liu, "A tree-topology multiplexer for multiphase clock system," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 1, pp. 124–131, Jan. 2009.
- [20] T. Suzuki *et al.*, "A 50-Gbit/s 450-mW full-rate 4:1 multiplexer with multiphase clock architecture in 0.13-μm InP HEMT technology," *IEEE J. Solid-State Circuits*, vol. 42, no. 3, pp. 637–646, Mar. 2007.
- [21] M. Meghelli, A. V. Rylyakov, and L. Shan, "50-Gb/s SiGe BiCMOS 4:1 multiplexer and 1:4 demultiplexer for serial communication systems," *IEEE J. Solid-State Circuits*, vol. 37, no. 12, pp. 1790–1794, Dec. 2002.
- [22] T. Masuda *et al.*, "SiGe-HBT-based 54-Gb/s 4:1 multiplexer IC with full-rate clock for serial communication systems," *IEEE J. Solid-State Circuits*, vol. 40, no. 3, pp. 791–795, Mar. 2005.
- [23] T. Suzuki et al., "Under 0.5 W 50 Gb/s full-rate 4:1 MUX and 1:4 DEMUX in 0.13 μm InP HEMT technology," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2004, pp. 234–235.
- [24] B. Analui, J. F. Buckwalter, and A. Hajimiri, "Data-dependent jitter in serial communications," *IEEE Trans. Microw. Theory Techn.*, vol. 53, no. 11, pp. 3388–3397, Nov. 2005.
- [25] Y. Chen, P.-I. Mak, H. Yu, C. C. Boon, and R. Martins, "An area-efficient and tunable bandwidth-extension technique for a wideband CMOS amplifier handling 50+ Gb/s signaling," *IEEE Trans. Microw. Theory Techn.*, vol. 65, no. 12, pp. 4960–4975, Dec. 2017.
- [26] C. Cole, *Ideal SNR Penalties*, document IEEE 802.3 400 Gb/s Ethernet Task Force, Sep. 2014.

- [27] Y. Chen, P.-I. Mak, C. C. Boon, and R. P. Martins, "A 27-Gb/s time-interleaved duobinary transmitter achieving 1.44-mW/Gb/s FOM in 65-nm CMOS," *IEEE Microw. Wireless Compon. Lett.*, vol. 27, no. 9, pp. 839–841, Sep. 2017.
- [28] T. C. Weigandt, "Low-phase-noise, low-timing-jitter design techniques for delay cell based VCOs and frequency synthesizers," Ph.D. dissertation, EECS Dept., UC Berkeley, Berkeley, CA, USA, 1998.
- [29] A. A. Hafez, M.-S. Chen, and C.-K. Yang, "A 32–48 Gb/s serializing transmitter using multiphase serialization in 65 nm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 50, no. 3, pp. 763–775, Mar. 2015.
- [30] J. W. Jung and B. Razavi, "A 25-Gb/s 5-mW CMOS CDR/deserializer," *IEEE J. Solid-State Circuits*, vol. 48, no. 3, pp. 684–697, Mar. 2013.



**Yong Chen** (S'10–M'11) received the B.Eng. degree in electronic and information engineering from the Communication University of China, Beijing, China, in 2005, and the Ph.D. degree in engineering (microelectronics and solid-state electronics) from the Institute of Microelectronics of Chinese Academy of Sciences, Beijing, in 2010.

From 2010 to 2013, he was a Post-Doctoral Researcher with the Institute of Microelectronics, Tsinghua University, Beijing, China. From 2013 to 2016, he was a Research Fellow with VIRTUS/EEE, Nanyang Technological University, Singapore, where he was responsible for high-speed (over 40 Gb/s) wireline communication and the Low Energy Electronic Systems Project under the Singapore-MIT Alliance for Research and Technology on RF CMOS transceivers. He has been an Assistant Professor with the State Key Laboratory of Analog and Mixed-Signal VLSI, University of Macau, Macao, China, since 2016.

His research interests include analog/biomedical detection and RF integrated circuit, mm-wave system and circuit, and high-speed on-chip and chip-to-chip electrical/optical interconnects.



**Pui-In Mak** (S'00–M'08–SM'11) received the Ph.D. degree from the University of Macau (UM), Macao, China, in 2006. He is currently a Full Professor with the Faculty of Science and Technology-ECE, UM, and an Associate Director (Research) with the State Key Laboratory of Analog and Mixed-Signal VLSI, UM. His research interests include analog and radio-frequency (RF) circuits and systems for wireless, and multidisciplinary innovations. His involvements with the IEEE include an Editorial Board Member of the IEEE Press from 2014

to 2016 and a member of the Board-of-Governors of the IEEE Circuits and Systems Society from 2009 to 2011. He is/was the TPC Vice Co-Chair of ASP-DAC in 2016, a TPC Member of A-SSCC from 2013 to 2016, ESSCIRC since 2016, and ISSCC since 2016. He is/was a Distinguished Lecturer of the IEEE Circuits and Systems Society from 2014 to 2015 and the IEEE Solid-State Circuits Society from 2017 to 2018. He is the Senior Editor of the IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS from 2014 to 2015, the Guest Editor of the IEEE RFIC VIRTUAL JOURNAL in 2014, and the IEEE JOURNAL OF SOLID-STATE CIRCUITS and SYSTEMS I from 2010 to 2011 and from 2014 to 2015, and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II from 2010 to 2013.

Prof. Mak received the Honorary Title of Value for Scientific Merits by the Macau Government in 2005. He co-received the DAC/ISSCC Student Paper Award in 2005, the CASS Outstanding Young Author Award in 2010, the National Scientific and Technological Progress Award in 2011, the Best Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II from 2012 to 2013, the A-SSCC Distinguished Design Award in 2015, and the ISSCC Silkroad Award in 2016.



**Chirn Chye Boon** (M'09–SM'10) received the B.E. (Hons.) and Ph.D. degrees in electrical engineering from Nanyang Technological University (NTU), Singapore, in 2000 and 2004, respectively.

He was with Advanced RFIC, NTU, where he was a Senior Engineer. Since 2005, he has been with NTU, where he is currently an Associate Professor. Since 2010, he has been the Program Director of the RF and mm-wave research in the S\$50 Million Research Center Of Excellence, VIRTUS, NTU. He is the Principal Investigator for Industry/Government Research Grants of S\$8,646,178.22. He has authored over 100 refereed publications in the fields of RF and mm-wave. He has authored the book: *Design of CMOS RF Integrated Circuits and Systems* (2010). He is involved in radio frequency & mm-wave circuits and systems design for biomedical and communications applications. He has conceptualized, designed, and silicon-verified 80 circuits/chips for biomedical and communication applications.

Dr. Boon serves as a committee member for various conferences. He was a recipient of the year-2 Teaching Excellence Award and the Commendation Award for Excellent Teaching Performance from the School of Electrical and Electronic Engineering, NTU. He is an Associate Editor of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS and a Golden Reviewer for the IEEE ELECTRON DEVICES LETTERS.



**Rui P. Martins** (M'88–SM'99–F'08) was born in 1957. He received the bachelor's (five years), master's, and Ph.D. degrees, and the Habilitation degree for Full Professor in electrical engineering and computers from the Department of Electrical and Computer Engineering, Instituto Superior Técnico (IST), TU of Lisbon, Portugal, in 1980, 1985, 1992, and 2001, respectively. He has been with the Department of Electrical and Computer Engineering, IST, TU of Lisbon, since 1980.

Since 1992, he has been on leave from IST, TU of Lisbon (now University of Lisbon since 2013), and also with the Department of Electrical and Computer Engineering, Faculty of Science and Technology (FST), University of Macau (UM), Macau, China, where he has been a Chair Professor since 2013. In FST, he was the Dean of the Faculty from 1994 to 1997, and he has been a Vice Rector with the University of Macau since 1997. In 2008, after the reform of the UM Charter, he was nominated after open international recruitment, and reappointed in 2013 as a Vice Rector (Research) until 2018. Within the scope of his teaching and research activities, he has taught 21 bachelor and master courses in UM, and has supervised (or co-supervised) 40 theses: 19 Ph.D. theses and 21 master's theses. He has co-authored six books and nine book chapters; 18 patents (16 USA patents and two Taiwan patents); 377 papers (111 in scientific journals and 266 in conference proceedings); and 60 other academic works, for a total of 470 publications. He was a Co-Founder of Chipidea Microelectronics, Macau (now Synopsys) in 2001/2002. In 2003, he created the Analog and Mixed-Signal VLSI Research Laboratory, UM, which was elevated to the State Key Laboratory of China in 2011 (the first in Engineering in Macao), being its Founding Director.

Dr. Martins was a member of the IEEE Circuits and Systems Society (CASS) Fellow Evaluation Committee in 2013 and 2014, respectively, a Circuits and Systems (CAS) Society Representative in the Nominating Committee, for the election in 2014, the Division I (CASS/EDS/SSCS), and the Director of the IEEE. He was a Nominations Committee Member in 2016. He was the Founding Chairman of the IEEE Macau Section from 2003 to 2005 and the IEEE Macau Joint-Chapter on CAS/Communications from 2005 to 2008, the World Chapter of the Year of the IEEE CAS Society in 2009. He was the General Chair of the 2008 IEEE Asia-Pacific Conference on CAS APCCAS 2008 and the Vice President for Region 10 (Asia, Australia, and the Pacific) of the IEEE CASS from 2009 to 2011. Since 2011, he has been the Vice President of (World) Regional Activities and Membership of the IEEE CASS (2012 to 2013). He was an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS from 2010 to 2013, and was nominated the Best Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II from 2012 to 2013. He was a recipient of two government decorations: the Medal of Professional Merit from the Macao Government (Portuguese Administration) in 1999 and the Honorary Title of Value from the Macao SAR Government (Chinese Administration) in 2001. He was the General Chair of the ACM/IEEE Asia South Pacific Design Automation Conference in 2016. He is currently the Chair of the IEEE Fellow Evaluation Committee (class of 2018), both of the IEEE CASS. In 2010, he was elected, unanimously, as a Corresponding Member of the Portuguese Academy of Sciences, Lisbon, being the only Portuguese Academician living in Asia.