# Multilevel Half-Rate Phase Detector for Clock and Data Recovery Circuits

Cecilia Gimeno, Member, IEEE, David Bol, Member, IEEE, and Denis Flandre, Senior Member, IEEE

Abstract—In this brief, a half-rate (HR) bang-bang (BB) phase detector (PD) with multiple decision levels is proposed for clock and data recovery (CDR) circuits. The combination allows the oscillator to run at half the input data rate while providing information about both the sign and the magnitude of the phase shift between the PD inputs. This allows finer control of the oscillator frequency in the phase-locked loop (PLL) of the CDR circuit, which results in up to 30% less output clock jitter than with a conventional two-level HR BB PD. As a result, the bit error rate can be decreased by up to $5\times$ in a 5-Gb/s CDR circuit. The proposed topology was implemented in a 28-nm FDSOI CMOS technology, with an average power consumption below 76 $\mu$W at a supply voltage of 1 V. Although multilevel (ML) BB PDs have already been proposed for some PLL-based CDRs with very interesting results, an HR system requires a specific PD design. This brief provides the first ML-HR-BBPD.

*Index Terms*—Clock and data recovery (CDR) circuits, half-rate (HR) phase detector (PD), multilevel (ML).

#### I. INTRODUCTION

Clock and data recovery (CDR) is a key function in many serial communication systems, from optical to electrical communications, but especially for high-speed signaling [1].

The performance of the clock recovery is crucial for the reliability of the communication system, because the recovered clock drives synchronous operations such as the retiming and demodulation of the input data. Jitter in the clock, defined as the uncertainty in the edge placement in the clock waveform, results in distortion of the data signal waveforms [2]. This jitter corresponds to a deviation of the oscillator phase from its ideal value, which appears as phase noise.

Although other systems such as delay-locked loops or phase interpolator-based CDRs are used in some cases, phase-locked loops (PLLs) are the most widespread systems to implement a reference-less CDR. Fig. 1 shows the general block diagram of a PLL-based CDR. It is composed of a voltage-controlled oscillator (VCO), which generates the required clock, a phase detector (PD), which compares the phase of the generated clock to that of the randomized input data, and a charge pump (CP), which charges or discharges a loop filter (LF) to generate the required control signal for the VCO.

The PD is one of the critical blocks of the CDR, as it determines the phase error between the input data and the clock, which sets the control voltage of the VCO and, therefore, the alignment between the clock and data edges.

Although a linear PD is sometimes used [3], a binary or bang-bang (BB) PD is usually preferred in high-speed CDRs due to its simplicity, good phase adjustment, high-speed operation, and low power. The BBPD provides a binary output, which gives information about the sign of the phase shift between its inputs, i.e., whether the clock is lagging or leading the input data [4].

Manuscript received November 29, 2017; revised February 27, 2018; accepted April 6, 2018. This work was supported by F.R.S.-F.N.R.S. of Belgium. (*Corresponding author: Cecilia Gimeno.*)

The authors are with the ICTEAM Institute, Université catholique de Louvain, 1348 Louvain–La-Neuve, Belgium (e-mail: cecilia.gimenogasca@uclouvain.be).

Color versions of one or more of the figures in this brief are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2018.2826440

[Fig. 1 graphic; legible labels: DIN [5 Gb/s], PD, CP, Early, Late, 100p, 10p, 1 k$\Omega$]

Fig. 1. Block diagram of a PLL-based CDR showing the parameters used in our Verilog-AMS model.

The Alexander PD [5] and its variations, such as the inverse Alexander PD, in which the Early and Late outputs are swapped [6], are the most commonly used PDs in high-speed designs. Other topologies have been presented in [7], at the cost of increased complexity. All these Alexander-based PDs work at a full-rate clock frequency, which means that the oscillation frequency of the VCO is equal to the data rate of the input data.

At high speed, a half-rate PD (HR-PD) is very useful to reduce the requirements of the VCO and increase the throughput of the system [1], [8], [9]. CDRs implemented with an HR-PD sense the input data at full rate but use a VCO running at half the input rate. This technique also relaxes the speed requirement of the PD.

In this brief, we propose a new multilevel HR BB PD (ML-HR-BBPD). Thanks to the ML operation, which provides information about both the sign and the magnitude of the phase difference between the PD inputs, the bit error rate (BER) of the output data and the jitter of the clock generated with a PLL-based CDR are improved compared to the conventional two-level HR-PD. Although ML-BBPDs have already been proposed for some PLL-based CDRs with very interesting results [10], to the best of our knowledge, they have never been proposed for an HR system.

The main objective of this brief is, therefore, to provide an ML alternative to the conventional HR-PD and to compare the two topologies. To that end, the two PDs have been included in a PLL-based CDR system that is used as a testbench for comparison. The brief is organized as follows. Section II reviews the conventional topology of single-level HR PDs. Section III presents the proposed ML HR PD, and Section IV details its subblocks. Section V describes the PLL-based CDR testbench, and Section VI provides the main performance results of the proposed detector in a 5-Gb/s HR CDR circuit in 28-nm FDSOI and compares them with those of the conventional detector. Section VII concludes the brief.

#### II. CONVENTIONAL HALF-RATE PHASE DETECTOR

Fig. 2 shows the block diagram of the conventional BB-HR-PD [8]. It is based on the Alexander PD but uses three samples of the incoming data. The rising edges of $\Phi_0$ and $\Phi_{180}$ sample the incoming data to generate edge samples *E*0 and *E*1, and the rising edge of $\Phi_{90}$ is used to generate data sample *D*0. Combinational logic (two XOR gates, one inverter, and two AND gates) generates the Early (E) and Late (L) decisions. Note that an additional synchronization signal ($\Phi_{SYN}$), with a phase between $\Phi_0$ and $\Phi_{90}$, is required to ensure the correct E/L decision by properly sampling data and edges.

1063-8210 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS



Fig. 2. Block diagram of the conventional HR-BBPD [9].



Fig. 3. Operation principle of the conventional HR-BBPD [9].

Fig. 3 shows the samples taken when the clock lags or leads the data. In the locked state, the samples taken by $\Phi_0$ and $\Phi_{180}$ coincide with the data transitions, while the sample taken by $\Phi_{90}$ occurs at the middle of the data bit. If the samples taken by $\Phi_0$ and $\Phi_{180}$ are different but the samples taken by $\Phi_{180}$ and $\Phi_{90}$ are equal (transition between $\Phi_0$ and $\Phi_{90}$), the clock is early and the clock frequency must be decreased. Vice versa, if both the samples taken by $\Phi_0$ and $\Phi_{180}$ and the samples taken by $\Phi_{90}$ and $\Phi_{180}$ are different (transition between $\Phi_{90}$ and $\Phi_{180}$), the clock is late and the clock frequency must be increased. If there is no transition between $\Phi_0$ and $\Phi_{180}$, that is, no data transition, both Early and Late are equal to 0 and no action is taken. This is summarized as

$$\text{Early: } E0 \oplus E1 = 1,\ E1 \oplus D0 = 0 \rightarrow \text{Clk frequency} \downarrow \tag{1}$$

$$\text{Late: } E0 \oplus E1 = 1,\ E1 \oplus D0 = 1 \rightarrow \text{Clk frequency} \uparrow \tag{2}$$

$$\text{Others: } E0 \oplus E1 = 0 \rightarrow \text{Clk not adjusted.} \tag{3}$$
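As an illustration only, the decision rules (1)-(3) can be written as a small truth-table sketch. The function name `hr_bbpd` and the bit-level modeling are assumptions made here for clarity; they are not part of the transistor-level design.

```python
from itertools import product


def hr_bbpd(e0: int, d0: int, e1: int) -> tuple:
    """Return (early, late) for one set of samples, following (1)-(3)."""
    transition = e0 ^ e1                  # E0 xor E1 = 1: a data edge lies between Phi_0 and Phi_180
    after_d0 = e1 ^ d0                    # E1 xor D0 = 1: that edge lies between Phi_90 and Phi_180
    early = transition & (after_d0 ^ 1)   # (1): clock early, decrease frequency
    late = transition & after_d0          # (2): clock late, increase frequency
    return early, late                    # (3): both 0 when there is no transition


if __name__ == "__main__":
    for e0, d0, e1 in product((0, 1), repeat=3):
        print(f"E0={e0} D0={d0} E1={e1} -> (Early, Late)={hr_bbpd(e0, d0, e1)}")
```

Early and Late are never asserted together in this model, since they differ only in the value required of $E1 \oplus D0$.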

#### III. PROPOSED MULTILEVEL HALF-RATE PHASE DETECTOR

As a tradeoff between pure linear and pure BB HR-PD, we propose an ML-HR-BBPD whose schematic is given in Fig. 4. The digital nature of the BB HR-PD is not altered but we have further levels of quantization to measure the phase difference. This results in a reduction of jitter because of the finer corrections to the VCO frequency when the system has locked in phase.

As shown in Fig. 4, the proposed phase detection scheme uses more samples of the data: in addition to the edge samples E0 and E1 (generated by the rising edges of $\Phi_0$ and $\Phi_{180}$) and the data sample D0 (generated by $\Phi_{90}$), as in the standard topology, additional mid-samples M0 and M1 (generated by $\Phi_{45}$ and $\Phi_{135}$) are provided. A modified combinational logic, less sensitive to the different delays of the blocks, generates the Early (1 and 2) and Late (1 and 2) decisions explained in (4)–(8).
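To make the sampling instants concrete, the following sketch converts the five clock phases into time offsets for the 5-Gb/s data rate and 2.5-GHz HR clock considered in this brief. The helper name `phase_offsets_ps` is hypothetical, and ideal (skew-free) phases are assumed.

```python
def phase_offsets_ps(data_rate_bps: float, phases_deg=(0, 45, 90, 135, 180)) -> dict:
    """Map each clock phase (degrees) to its time offset in ps for a half-rate clock."""
    ui_ps = 1e12 / data_rate_bps      # one unit interval (bit period) in ps
    clk_period_ps = 2.0 * ui_ps       # the half-rate clock period spans two bits
    return {deg: deg / 360.0 * clk_period_ps for deg in phases_deg}


if __name__ == "__main__":
    for deg, t_ps in phase_offsets_ps(5e9).items():
        print(f"Phi_{deg}: {t_ps:.0f} ps")
```

Under these assumptions, $\Phi_0$ and $\Phi_{180}$ are one UI (200 ps) apart, $\Phi_{90}$ falls at mid-bit, and the mid-samples from $\Phi_{45}$ and $\Phi_{135}$ fall a quarter UI (50 ps) on either side of it.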

Fig. 4. Schematic of the proposed ML-HR-BBPD.

Fig. 5. Operation principle of the proposed ML-HR-BBPD. Relation between the clock phases and the input data for different operation conditions (left). Digital output as a function of the phase shift between the clock and the data (right).

In this case (see Fig. 5), if the samples taken by $\Phi_{45}$ and $\Phi_{90}$ are equal but different from that taken by $\Phi_0$, the clock is early (Early1) and the clock frequency must be decreased. Vice versa, if the samples taken by $\Phi_{90}$ and $\Phi_{135}$ are equal but different from that taken by $\Phi_{180}$, the clock is late (Late1) and the clock frequency must be increased. In addition, if the samples taken by $\Phi_0$ and $\Phi_{45}$ are equal but different from that taken by $\Phi_{90}$, the clock is early and far from the lock condition (Early2) and the clock frequency must be decreased more. Vice versa, if the samples taken by $\Phi_0$, $\Phi_{45}$, and $\Phi_{90}$ are equal but different from those taken by $\Phi_{135}$ and $\Phi_{180}$, the clock is late and far from the lock condition (Late2) and the clock frequency must be increased more. This is summarized as

$$\text{Early1: } E0 \oplus E1 = 1,\ E0 \oplus D0 = 1 \rightarrow \text{Clk frequency} \downarrow \tag{4}$$

$$\text{Early2: } M0 \oplus M1 = 1,\ M0 \oplus D0 = 1 \rightarrow \text{Clk frequency} \downarrow\downarrow \tag{5}$$

$$\text{Late1: } E0 \oplus E1 = 1,\ E1 \oplus D0 = 1 \rightarrow \text{Clk frequency} \uparrow \tag{6}$$

$$\text{Late2: } M0 \oplus M1 = 1,\ M1 \oplus D0 = 1 \rightarrow \text{Clk frequency} \uparrow\uparrow \tag{7}$$

$$\text{Others: } \rightarrow \text{Clk not adjusted.} \tag{8}$$
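The rules (4)-(8) can be checked with a short behavioral sketch. The helper `sample_single_edge`, which models one 0-to-1 data transition at a given clock phase, is a hypothetical test stimulus introduced here, not part of the proposed circuit.

```python
def ml_hr_bbpd(e0: int, m0: int, d0: int, m1: int, e1: int) -> tuple:
    """Return (early1, early2, late1, late2) following (4)-(8)."""
    edge = e0 ^ e1                 # transition somewhere between Phi_0 and Phi_180
    mid = m0 ^ m1                  # transition between Phi_45 and Phi_135
    early1 = edge & (e0 ^ d0)      # (4)
    early2 = mid & (m0 ^ d0)       # (5)
    late1 = edge & (e1 ^ d0)       # (6)
    late2 = mid & (m1 ^ d0)        # (7)
    return early1, early2, late1, late2   # (8): all zero means no adjustment


def sample_single_edge(edge_deg: float, phases=(0, 45, 90, 135, 180)) -> tuple:
    """Samples (E0, M0, D0, M1, E1) of one 0->1 data edge located at edge_deg."""
    return tuple(int(p > edge_deg) for p in phases)


if __name__ == "__main__":
    for edge_deg in (20, 60, 110, 160):
        print(edge_deg, ml_hr_bbpd(*sample_single_edge(edge_deg)))
```

In this model, a transition close to an edge sample asserts only Early1 or Late1, whereas a transition close to the data sample asserts both the level-1 and level-2 outputs of the same sign.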

The operation of the proposed phase detection scheme comprises four levels of quantization (Fig. 5): positive high phase shift, positive low phase shift, negative low phase shift, and negative high phase shift. In this way, it provides finer control of the frequency of the VCO in the CDR circuit, which results in reduced jitter compared to purely binary phase detection schemes, as will be shown in Section VI.

Fig. 6. Schematic design of the DFF.

Fig. 7. (a) Top-level and (b) schematic design of the XOR gates.

#### IV. SCHEMATIC DESIGN

To implement the proposed ML-HR-PD, three different blocks are needed: D flip-flops (DFFs), XOR gates, and AND gates.

The schematic design of the DFF is shown in Fig. 6. It has been implemented as a basic true-single-phase-clock edge-triggered CMOS DFF, which provides the flip-flop operation at high speed with low power consumption (typically 64.7 $\mu$W at a 10-GHz clock).

Full-static CMOS logic is used to implement both the XOR and AND gates, in preference to source-coupled logic (SCL) or dynamic logic. Although SCL can provide an advantage in speed, full-static CMOS gates are fast enough to reach a few Gb/s in sub-90-nm CMOS technologies, and they present a rail-to-rail output swing and no static power consumption. Although dynamic CMOS logic can provide increased speed and, in some cases, a reduced implementation area, full-static CMOS is the dominant logic style in modern digital technologies (after the $0.35$-$\mu$m era), especially considering power consumption, noise, and clock skew problems.

The AND gate is formed by a standard NAND gate followed by an inverter: only when both A and B are equal to logic 1 are both $P_1$ and $P_2$ OFF and both $N_0$ and $N_1$ ON, which generates a Z equal to logic 0 and, therefore, an output equal to logic 1.

To implement the XOR gates, a topology based on transmission gates has been used (Fig. 7). As shown in Fig. 7(a), the top-level circuit implements an XOR function with two CMOS transmission gates and four inverters. Therefore

$$OUT(XOR) = \overline{Z} = \overline{\overline{A} \oplus B} = A \oplus B. \tag{9}$$
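Identity (9) can be verified exhaustively with a bit-level sketch. The way the two transmission gates are modeled here (B selecting between A and its complement) is an assumption about the role of the gates in Fig. 7, chosen because it reproduces $Z = \overline{A} \oplus B$; the function names are hypothetical.

```python
def inverter(x: int) -> int:
    """Ideal CMOS inverter."""
    return x ^ 1


def transmission_gate_pair(a: int, a_bar: int, sel: int) -> int:
    """Two complementary transmission gates: pass `a` when sel = 1, `a_bar` when sel = 0."""
    return a if sel else a_bar


def xor_tg(a: int, b: int) -> int:
    """XOR built from a transmission-gate pair followed by an output inverter."""
    z = transmission_gate_pair(a, inverter(a), b)   # Z = NOT(A) xor B
    return inverter(z)                              # OUT = NOT(Z) = A xor B


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_tg(a, b))
```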



Fig. 8. CP necessary for (a) conventional PD and (b) proposed ML-HR-PD.

Fig. 7(b) shows the schematic implementation. Transistors $P_4$, $P_5$, $N_4$, and $N_5$ form the transmission gates, and $P_1-P_3$ and $N_1-N_3$ constitute the inverters of the XOR operation. $P_i$ and $N_i$ form the last inverter.

#### V. SIMULATION TESTBENCH: PLL-BASED CDR

To study the performance of the proposed PD and compare it to the conventional HR one, both have been included in a PLL-based CDR like the one shown in Fig. 1. The two PDs as well as the CPs have been implemented in a 28-nm FDSOI CMOS technology. The other blocks that constitute the CDR are implemented in Verilog-AMS with behavioral models including the resulting nonidealities on the input signals of the PD, i.e., jitter and duty cycle of both the input data and the clocks.

The two PDs have been simulated in a CDR with the same loop parameters. The values used are given in Fig. 1. Random jitter with a normal distribution has been included in the input data. A pseudorandom bit sequence is generated by using a random number generator that returns a 32-bit signed integer.
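A possible way to produce such a stimulus is sketched below in Python rather than Verilog-AMS. The function name `jittered_prbs`, the use of the least significant bit of each 32-bit integer, the default 0.05-UI rms jitter, and the seed are all assumptions made for illustration; the brief does not specify them.

```python
import random


def jittered_prbs(n_bits: int, ui_ps: float = 200.0, rms_jitter_ui: float = 0.05, seed: int = 1):
    """Return (bits, edge_times_ps) for a random NRZ stream with Gaussian edge jitter."""
    rng = random.Random(seed)
    bits, edges = [], []
    previous = None
    for k in range(n_bits):
        word = rng.getrandbits(32) - (1 << 31)   # 32-bit signed integer in [-2**31, 2**31 - 1]
        bit = word & 1                            # keep the least significant bit
        if previous is not None and bit != previous:
            # A transition nominally occurs at the bit boundary k * UI.
            edges.append(k * ui_ps + rng.gauss(0.0, rms_jitter_ui * ui_ps))
        bits.append(bit)
        previous = bit
    return bits, edges


if __name__ == "__main__":
    bits, edges = jittered_prbs(16)
    print(bits)
    print([round(t, 1) for t in edges])
```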

A second-order low-pass filter has been used as an LF, as shown in Fig. 1.

A VCO with a typical gain of 0.5 GHz/V generates the required clock signal. As described before, the proposed ML-HR-PD topology needs four phases of the clock $(0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ})$ plus the negated one (180°), while the conventional topology needs two phases (the clock at 0° and its quadrature replica at 90°) and their respective opposite (180°) plus an extra delayed phase. Although coupled LC-VCOs could be used to generate the required phases, a four-stage differential ring oscillator is preferred in applications where phase noise is not critical, as it offers lower power consumption and area as well as direct generation of the required clock phases. The same number of stages is required for both PD topologies, so the proposed topology adds no extra difficulty or power. State-of-the-art VCOs have an estimated power consumption as low as 180 $\mu$W at frequencies higher than 5 GHz [11]. When included in a closed-loop system, the jitter in the clock is reduced to 1–2 ps [12], [13].
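The phase availability can be checked with a short sketch based on the standard property that an N-stage differential ring oscillator provides 2N output phases spaced by 180°/N. The function name `ring_phases_deg` is hypothetical, and ideal, equal stage delays are assumed.

```python
def ring_phases_deg(n_stages: int) -> list:
    """Phases (degrees) available from an ideal n-stage differential ring oscillator."""
    step = 180.0 / n_stages                  # phase shift contributed by each stage
    return [k * step for k in range(2 * n_stages)]


if __name__ == "__main__":
    available = ring_phases_deg(4)
    needed = {0.0, 45.0, 90.0, 135.0, 180.0}   # phases used by the ML-HR-PD
    print(available)
    print(needed.issubset(available))
```

With four stages, the 45° spacing covers all five phases used by the ML-HR-PD.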

Fig. 8 shows the implementation of the CP for the conventional PD [Fig. 8(a)] and for the proposed topology [Fig. 8(b)]. In the case of the ML-HR-PD, an extra current branch is used so that only one branch is on when the delay is small, while both branches are on at the same time when the delay is larger, which generates the different levels of quantization. Although an extra branch is added, the speed of the switches in the new topology is not increased, so no extra difficulty is added in this block either. To achieve a fair comparison between the two PDs, the transistor implementation of the CPs has been included in the simulations.
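The resulting quantization levels can be illustrated with a behavioral sketch of the two-branch CP. Equal branch currents, the normalized unit current `i_unit`, and the sign convention (Late sources current, Early sinks it) are assumptions made here; the brief does not give the branch sizing.

```python
def cp_net_current(early1: int, early2: int, late1: int, late2: int, i_unit: float = 1.0) -> float:
    """Net current delivered to the loop filter, in units of one branch current."""
    up = late1 + late2        # branches that raise the VCO frequency
    down = early1 + early2    # branches that lower the VCO frequency
    return i_unit * (up - down)


if __name__ == "__main__":
    for decisions in [(1, 1, 0, 0), (1, 0, 0, 0), (0, 0, 0, 0), (0, 0, 1, 0), (0, 0, 1, 1)]:
        print(decisions, cp_net_current(*decisions))
```

A small phase error turns on one branch and a large one turns on both, so the loop filter sees four nonzero correction levels instead of two.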

#### VI. PERFORMANCE

As previously mentioned, the jitter of state-of-the-art clocks implemented in a CDR is on the order of 1–2 ps, which comes from the noise of the delay stages themselves. However, the input data might be affected by many nonidealities due to, for example, the transmission channel. As a result, the nonidealities coming from the input data are more significant than those coming from the clock itself.

Fig. 9. Cycle-to-cycle jitter of output clock versus input data rms jitter in UI units [UI].

Fig. 10. BER performance versus input data rms jitter.

Fig. 9 shows the rms cycle-to-cycle jitter of the generated clock as a function of the Gaussian jitter of the random input data. Even when there is no jitter in the input data, the jitter in the output clock is 30% lower with the proposed PD (ML-HR-PD) than with the standard HR-PD. This is because of the fine tuning allowed by the intermediate level in the phase detection. Although not shown in Fig. 9, similar results are obtained for the period jitter, with an improvement of up to 31%.

Fig. 10 shows the BER performance of a CDR implemented with both the conventional and the ML HR PDs as a function of the input jitter. With low or no input jitter, we expect the BER of both systems to be equal. This is because, in the absence of nonidealities in the input data, the 30% higher jitter generated by the conventional HR-PD does not prevent the data from being sampled in the middle of the bit. In fact, we obtain no errors in the simulation of $10^7$ input bits, which results in a BER lower than $10^{-7}$. Because of computation limits in time and memory, a simulation longer than $10^7$ bits was not possible with our Verilog-AMS simulation framework. Since we obtain no errors in the transmission of $10^7$ bits, we consider it an error-free transmission. However, when jitter is included in the input data, the higher jitter of the clock leads to a higher BER. Our ML-HR-PD then achieves up to 5 times better BER than the conventional HR-PD.

Regarding the duty cycle of the input data, we analyzed an input signal with an alternating pattern, that is, with no two consecutive logic 1s or logic 0s. Different duty cycles from 0.7 to 1.3 unit intervals (UI) have been studied. With the proposed ML HR PD, we obtain a constant 30% improvement in the jitter of the recovered clock versus the conventional HR PD.
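The scale of this simulation can be made explicit with a small arithmetic sketch: the smallest nonzero BER that a run of N bits can resolve is 1/N, and the simulated time is N bit periods. The function names are hypothetical, and the 5-Gb/s rate is the one used throughout this brief.

```python
def ber_resolution(n_bits: int) -> float:
    """Smallest nonzero BER observable in a run of n_bits (one error in n_bits)."""
    return 1.0 / n_bits


def simulated_time_s(n_bits: int, data_rate_bps: float) -> float:
    """Duration of the simulated transmission in seconds."""
    return n_bits / data_rate_bps


if __name__ == "__main__":
    n = 10**7
    print(ber_resolution(n))           # resolution of the BER measurement
    print(simulated_time_s(n, 5e9))    # simulated time at 5 Gb/s, in seconds
```

At 5 Gb/s, $10^7$ bits correspond to 2 ms of simulated time.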

The proposed ML-HR-PD is also more robust to delay mismatch between the clock phases: it provides 4 times lower BER in the recovery of the data when a random delay with a normal distribution of up to 20-ps rms is included in each of the clock phases.

Fig. 11. Waveforms of the DFF, XOR, and AND gates.

The proposed ML-HR-PD has also been implemented in a 28-nm FDSOI CMOS technology using the cells shown in Section IV. With 5-Gb/s input data and an HR clock running at 2.5 GHz, the whole ML-HR-PD has an average power consumption lower than 76 $\mu$W and a peak current of 1.2 mA with a 1-V power supply. The standard HR-PD has an average power consumption of around 65 $\mu$W and a peak current of 0.9 mA under the same conditions. Therefore, although the proposed topology consumes more power, the increase is not significant for the overall power consumption of the CDR, which is on the order of a few milliwatts [9]. In fact, we estimate that, because an HR VCO is used, the VCO power is reduced by approximately 2.5 times while keeping the same phase noise and area.
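The power figures above can be compared with a few lines of arithmetic. The sketch treats the quoted "lower than 76 $\mu$W" and "around 65 $\mu$W" values as point estimates and interprets the quoted 1.2 mA and 0.9 mA peaks as supply currents at 1 V; both are simplifying assumptions.

```python
def relative_increase_pct(new: float, ref: float) -> float:
    """Relative increase of `new` over `ref`, in percent."""
    return 100.0 * (new - ref) / ref


P_AVG_ML_W = 76e-6    # average power of the ML-HR-PD (upper bound quoted in the text)
P_AVG_HR_W = 65e-6    # average power of the conventional HR-PD (approximate)
V_SUPPLY_V = 1.0
I_PEAK_ML_A = 1.2e-3
I_PEAK_HR_A = 0.9e-3

if __name__ == "__main__":
    print(f"Average power increase: {relative_increase_pct(P_AVG_ML_W, P_AVG_HR_W):.1f} %")
    print(f"Peak power, ML-HR-PD: {V_SUPPLY_V * I_PEAK_ML_A * 1e3:.1f} mW")
    print(f"Peak power, HR-PD:    {V_SUPPLY_V * I_PEAK_HR_A * 1e3:.1f} mW")
```

Under these assumptions, the additional average power of the ML-HR-PD is about 11 $\mu$W, roughly a 17% increase over the conventional HR-PD.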

Fig. 11 shows the response of the DFF, AND, and XOR gates with regard to their inputs. In all cases, the blocks correctly perform their function even when working with 5-Gb/s signals. However, there is a delay between the inputs and outputs of the different blocks, which comes from their real implementation with several stages. These delays are 34 ps for the DFF, 21 ps for the XOR gates, and 10 ps for the AND gate. Although not shown here, the delay of an inverter is on the order of 5 ps in 28-nm FDSOI. Therefore, in the conventional HR-PD, different delays are generated between the inputs of the AND gates (see Fig. 2), which we avoid thanks to a more symmetrical topology.

Corner simulations have also been performed to validate the reliability of the solution against the most extreme process variations that can be met in practice, and the proposed ML-HR-PD works correctly in all cases.

#### VII. CONCLUSION
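The size of this path mismatch can be estimated from the quoted block delays. The sketch assumes that, in the conventional HR-PD, one AND input is driven directly by an XOR gate while the other passes through an XOR gate and the inverter, as suggested by (1) and the gate count of Section II; the exact wiring of Fig. 2 is not reproduced here, and the names are hypothetical.

```python
DELAY_PS = {"DFF": 34.0, "XOR": 21.0, "AND": 10.0, "INV": 5.0}   # delays quoted in the text
UI_PS = 200.0                                                     # bit period at 5 Gb/s


def path_delay_ps(*blocks: str) -> float:
    """Total delay of a chain of blocks, in ps."""
    return sum(DELAY_PS[b] for b in blocks)


if __name__ == "__main__":
    direct = path_delay_ps("XOR")            # AND input fed by E0 xor E1
    inverted = path_delay_ps("XOR", "INV")   # AND input fed by NOT(E1 xor D0), assumed wiring
    skew = inverted - direct
    print(f"Input skew at the AND gate: {skew:.0f} ps ({100.0 * skew / UI_PS:.1f} % of UI)")
```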

In this brief, an ML-HR-BBPD is proposed. It offers a nice tradeoff between pure linear and pure BB HR-PD, providing multiple levels of quantization to measure the phase difference and tune the VCO in a PLL implementation.

We conclude that the ML-HR-PD retains all the advantages of the HR-BBPD at the cost of slightly higher complexity, while it reduces the jitter of the generated clock by up to 30% thanks to the finer control of the VCO. This jitter reduction allows reducing the BER by up to 5 times when the 5-Gb/s input signal is affected by jitter.

#### REFERENCES

- [1] B. Razavi, "Challenges in the design of high-speed clock and data recovery circuits," *IEEE Commun. Mag.*, vol. 40, no. 8, pp. 94–101, Aug. 2002.
- [2] B. Casper, "Clocking wireline systems: An overview of wireline design techniques," *IEEE Solid State Circuits Mag.*, vol. 7, no. 4, pp. 32–41, Sep. 2015.
- [3] C. H. Son and S. Byun, "On frequency detection capability of full-rate linear and binary phase detectors," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 64, no. 7, pp. 757–761, Jul. 2017.
- [4] J. Lee, K. S. Kundert, and B. Razavi, "Analysis and modeling of bang-bang clock and data recovery circuits," *IEEE J. Solid-State Circuits*, vol. 39, no. 9, pp. 1571–1580, Sep. 2004.
- [5] J. D. H. Alexander, "Clock recovery from random binary signals," *Electron. Lett.*, vol. 11, no. 22, pp. 541–542, 1975.
- [6] M. Verbeke, P. Rombouts, X. Yin, and G. Torfs, "Inverse Alexander phase detector," *Electron. Lett.*, vol. 52, no. 23, pp. 1908–1910, 2016.
- [7] D. Rennie and M. Sachdev, "A novel tri-state binary phase detector," in *Proc. IEEE Int. Symp. Circuits Syst.*, New Orleans, LA, USA, May 2007, pp. 185–188.

- [8] M. Ramezani, C. Andre, and T. Salama, "Analysis of a half-rate bang-bang phase-locked-loop," *IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process.*, vol. 49, no. 7, pp. 505–509, Jul. 2002.
- [9] G. Shu et al., "A reference-less clock and data recovery circuit using phase-rotating phase-locked loop," *IEEE J. Solid-State Circuits*, vol. 49, no. 4, pp. 1036–1047, Apr. 2014.
- [10] C. Sanchez-Azqueta, C. Gimeno, C. Aldea, and S. Celma, "New multilevel bang-bang phase detector," *IEEE Trans. Instrum. Meas.*, vol. 62, no. 12, pp. 3384–3386, Dec. 2013.
- [11] G. de Streel *et al.*, "SleepTalker: A ULV 802.15.4a IR-UWB transmitter SoC in 28-nm FDSOI achieving 14 pJ/b at 27 Mb/s with channel selection based on adaptive FBB and digitally programmable pulse shaping," *IEEE J. Solid-State Circuits*, vol. 52, no. 4, pp. 1163–1177, Apr. 2017.
- [12] M. K. Raja, D. L. Yan, and A. B. Ajjikuttira, "A 1.4-psec jitter 2.5-Gb/s CDR with wide acquisition range in 0.18-μm CMOS," in *Proc. ESSCIRC*, Sep. 2007, pp. 524–527.
- [13] D. H. Baek, B. Kim, H.-J. Park, and J.-Y. Sim, "A 5.67 mW 9 Gb/s DLL-based reference-less CDR with pattern-dependent clock-embedded signaling for intra-panel interface," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 48–49.