Engineer the Channel and Adapt to it: Enabling Wireless Intra-Chip Communication

Xavier Timoneda, Sergi Abadal, Antonio Franques, Dionysios Manessis, Jin Zhou, Josep Torrellas, Eduard Alarcón, Albert Cabellos-Aparicio

Abstract—Ubiquitous multicore processors nowadays rely on an integrated packet-switched network for cores to exchange and share data. The performance of these intra-chip networks is a key determinant of the processor speed and, at high core counts, becomes an important bottleneck due to scalability issues. To address this, several works propose the use of mm-wave wireless interconnects for intra-chip communication and demonstrate that, thanks to their low-latency broadcast and system-level flexibility, this new paradigm could break the scalability barriers of current multicore architectures. However, these same works assume 10+ Gb/s speeds and efficiencies close to 1 pJ/bit without a proper understanding of the wireless intra-chip channel. This paper first demonstrates that such assumptions do not hold in the context of commercial chips by evaluating losses and dispersion in them. Then, we leverage the system’s monolithic nature to engineer the channel, that is, to optimize its frequency response by carefully choosing the chip package dimensions. Finally, we exploit the static nature of the channel to adapt to it, pushing efficiency-speed limits with simple tweaks at the physical layer. Our methods reduce the path loss and delay spread of a simulated commercial chip by 47 dB and 7.3×, respectively, enabling intra-chip wireless communications over 10 Gb/s and only 3.1 dB away from the dispersion-free case.

Index Terms—Dispersive channels, Millimeter wave propagation, Multipath Interference, Multiprocessor interconnection, Transceivers.

I. INTRODUCTION

Multicore processors are present in virtually every computing domain nowadays. They integrate a number of processor cores within the same chip and, in the past few years, manufacturers have been consistently increasing the core count seeking higher execution speeds. However, in order to translate this potential into effective performance, the on-chip communication problem must be solved: cores need an integrated interconnect to exchange or share data and, for densely populated chips, traditional interconnects are burdensome and slow down the processor. Communication, not computation, thus becomes the main performance bottleneck in multicore systems [1].

In the past, most chips did not contain more than a handful of cores and on-chip communication was easily performed through a bus. Since buses do not scale well with the number of cores, a completely different approach was soon required. The adopted solution, called Network-on-Chip (NoC), consists of a packet-switched network of routers that are co-integrated with the cores as represented in Figure 1. Since then, NoCs have been widely applied not only in research works [2]–[5], but also in commercial chips such as Tilera’s TILE-GX [6] or Intel’s Xeon Phi [7]. Nevertheless, with the arrival of extreme scaling and massive multicore architectures, standard NoCs start to show performance and efficiency issues [8]. New paradigms are thus required in the manycore era.

The scalability problems of NoCs are mainly the network diameter and overprovisioning. As further elaborated in Sec. II, these cause the communication latency and power to increase, especially for chip-wide transactions. Therefore, any new candidate to improve existing NoCs should address them and, among a few alternatives [9], Wireless Network-on-Chip (WNoC) shows great promise in this regard. In short, WNoC basically consists in overlaying a set of wireless intra-chip links over a backbone wired NoC. This reduces the latency of chip-wide transfers, including broadcasts, by virtue of the omnidirectional speed-of-light propagation of radio waves, and also combats overprovisioning thanks to its global reconfigurability [10]. As shown in the literature, these unique features become key enablers of new multicore architectures capable of pushing current scalability limits [11]–[13].

The WNoC paradigm builds on the foundations of widespread millimeter-wave (mm-wave) technology. A wide variety of on-chip antennas is already available [14]–[16] and wireless intra-chip communication with such antennas has been experimentally confirmed in multiple works [17]–[19]. Additionally, 60/90 GHz integrated transceivers specifically designed for WNoC have been tested [20]–[23]. On top of this, a great variety of works have evaluated new topologies and routing protocols [24]–[29] in an attempt to exploit the potential of WNoC at the network level.

The main caveat of the majority of WNoC research is that it relies on incorrect channel models. Many works [18], [30]–[37] either neglect the influence of the chip package, which introduces losses and dispersion, or neglect dispersion altogether. This does not invalidate the potential of the WNoC paradigm, but leads to erroneous assumptions on the achievable speed and power. For instance, many WNoC architectures assume bandwidths well over 10 GHz [12], [27], [28], [38], [39], which may not be achievable due to multipath effects. Other works obtain power consumption estimates by assuming path losses below 30 dB [40]–[43]. In the present study, we show that these assumptions are false for standard chip packages.

This paper aims to fill this gap and restate the potential of WNoC by proposing, as the main contribution, a novel co-design methodology that (i) properly characterizes the wireless intra-chip channel, and (ii) identifies and exploits its unique traits. It can be summarized in three pillars:

- **Channel characterization:** we study the propagation within a realistic computing package, which has often been overlooked. Frequency and time domain analyses are performed to extract attenuation and dispersion scaling trends. With this, we show that the assumptions made in most WNoC works may not hold true, and that path loss and delay spread often follow contradicting trends.

- **Channel engineering:** the intra-chip channel is unique in that it can be engineered. Therefore, we propose an optimization methodology that explores the package design space to jointly minimize attenuation (path loss) and dispersion (delay spread). We illustrate the methodology by applying it to a particular chip package design and reduce the path loss and delay spread by 30 dB and 3.52×, respectively, when optimized jointly, or by 47 dB and 7.32× in separate extreme cases.

- **Static transceiver optimization:** the intra-chip channel is also unique in that it is quasi-deterministic. Based on this, we propose to combat dispersion by predicting the multipath effects and adapting the transceiver backend to them. We easily accommodate 10 Gb/s and reach beyond the coherence bandwidth limit, figures that would be unattainable with conventional coding.

Although the static and monolithic nature of the WNoC scenario was already discussed in [24], [44], this is the first work that, to the best of the authors’ knowledge, systematically exploits the unique traits of the wireless intra-chip channel. The proposed methodology could potentially lead to the conditions to operate at 10–20 Gb/s with 1–2 pJ/bit, figures that are widely assumed in the literature but that would be otherwise unattainable. It is worth noting that very few other wireless communication scenarios, if any, allow the channel to be engineered to enhance propagation.

The remainder of this paper is organized as follows. Sec. II provides some background. Sec. III details the proposed methodology, which is then evaluated in Sec. IV. Finally, Sec. V discusses the results and Sec. VI concludes the paper.

II. BACKGROUND

**Network-on-Chip:** NoCs generally implement a 2-D mesh topology wherein every router is connected to a core and to its four neighbors (Fig. 2). The choice is driven by the regularity of the topology and the short path lengths, which simplify the routers and the links. Topologies requiring long links are in fact discouraged as their energy and delay scale exponentially with length and technology [45]. Short links, however, come at the cost of a network diameter that scales as \( 2(k - 1) \) in a \( k \times k \) mesh. Thus, 64-core chips, which are commercially available [6], [7], have a network diameter of 14 hops with a chip-wide latency of several tens of nanoseconds without contention. This delay would be incurred by transmissions among far-apart cores or, even worse, broadcasts that would also increase contention as they flood the mesh. Alternatively, carefully designed WNoCs can reduce this delay to a few nanoseconds regardless of the location of data and number of destinations. This difference in performance is crucial because communications are often on the critical path of the program and any added delay can slow down execution [11].
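As a quick sanity check of the diameter figure quoted above, the \( 2(k - 1) \) scaling can be evaluated for a few core counts. This is a minimal Python sketch; the function name `mesh_diameter` is illustrative and not part of the original work, and it assumes a square mesh.

```python
import math


def mesh_diameter(num_cores: int) -> int:
    """Hop count between opposite corners of a k x k 2-D mesh, i.e. 2(k - 1)."""
    k = math.isqrt(num_cores)
    if k * k != num_cores:
        raise ValueError("num_cores must be a perfect square for a k x k mesh")
    return 2 * (k - 1)


# The diameter grows with the square root of the core count.
for cores in (16, 64, 256, 1024):
    print(f"{cores} cores -> {mesh_diameter(cores)} hops")
```

For 64 cores this returns the 14 hops mentioned in the text, whereas a single-hop wireless broadcast is independent of the core count.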

**Wireless Network-on-Chip:** WNoC broadly refers to the implementation of wireless intra-chip links on top of a wired NoC. A packet arriving at a wireless interface is serialized, modulated and radiated by the antenna with a given pattern as we show in Figure 2. Radio waves propagate through the package at nearly the speed of light until reaching the intended destinations, also located within the same package, where they are demodulated and deserialized. Since intermediate router hops are avoided, WNoC reduces the latency of long-range and broadcast communications by an order of magnitude. On the downside, wireless bandwidth is limited and needs to be shared among the cores.

The physical layer of WNoC adapts to chip resource constraints. The use of mm-wave bands allows antennas to be commensurate with cores, whereas simple modulations such as On-Off Keying (OOK) are adopted to avoid bulky or power-hungry components at the transceiver. With such low-order modulations, high symbol rates are needed to reach the 10+ Gb/s speeds expected for WNoC. This, together with the stringent Bit Error Rate (BER) requirements of the scenario (10^{-15} to be comparable to that of a wire), makes signals particularly vulnerable to Inter-Symbol Interference (ISI). Fortunately, multipath effects can be mitigated through package–transceiver co-design, as we propose in this work.

At the MAC and network layers, WNoCs are constrained by the resource limitations and latency requirements of the scenario. Directive antennas are prohibitive, with few exceptions, rendering spatial multiplexing impractical. MAC protocols therefore generally rely on some variant of low-latency, collision-avoiding token passing to share one or a few frequency channels, leaving ISI as the main source of interference [25], [29]. Moreover, by design, wireless intra-chip networks are generally one-hop in an attempt to satisfy the strong latency demands of multiprocessors [12], [26], [38], [39]. As a result, to support the WNoC functionalities, a routing algorithm only needs to add the logic to decide when a packet should enter the wireless plane, i.e., for broadcast or long-range transmissions [24]. The simplicity at the network layer makes the wireless option more scalable than other emerging technologies, as briefly discussed in Section V.

**Chip Structure and Antenna Placement:** The typical cross-section of a standard chip consists of a metal stack with 5–10 layers, separated by an insulator and placed over a lossy silicon substrate [14]. Chips are then generally covered by a package that provides mechanical support and facilitates interfacing with the rest of the components. Flip-chip packages, wherein the chip is flipped over and connected to the PCB through solder bumps, are currently widespread and preferred over wire bonding. As shown in Figure 2, the chip ends up surrounded by (i) a metallic heat sink contacted by a heat spreader and (ii) the package carrier, with several metal layers on top of the PCB.

The flip-chip package does not leave much space for the antennas. Due to the presence of solder bumps, antennas can no longer be implemented in the first metal layer [46]. Instead, designers have to use the metal layers closer to the silicon or, as proposed recently, drill Through-Silicon Vias (TSVs) to implement vertical monopoles [47], [48]. Due to the very stringent area constraints of the scenario, directional antennas and MIMO arrays are generally prohibitive, with few exceptions [49].

**Chip-scale Channel Characterization:** At the chip scale, most channel characterization works have been based on full-wave simulation due to manufacturing costs and the complexity of probing in highly integrated packages [34], [35], [50]. In open packages, however, experimental works have been more common and have shown a reasonable agreement between measurement and simulation [34], [46], [51]. Several of those works described two propagation aspects worth considering.

First, the low-resistivity silicon used to facilitate transistor operation introduces significant losses and, therefore, should be avoided [14]. Second, materials used as heat spreader like Aluminum Nitride (AlN) introduce low electrical losses and, thus, would enhance propagation [31]. This opens interesting perspectives to the manufacturer, which can now take chip design decisions based on the potential for wireless intra-chip communication.

Because the chip is enclosed in a metallic package, electromagnetic propagation is confined within the limits of the package. Such field confinement has positive implications on security as eavesdropping or jamming are physically avoided, but also leads to strong multipath effects. This has been formulated by Matolak et al. through micro-reverberation theory [44], yet without detailing the package structure. In fact, very few studies include the chip package in their simulations or measurements, and those that do are limited to low frequencies or lack proper justifications on the antenna type and placement [34], [46], [51]. Others simply assume free space over the insulator layer [18], [32]–[35].

To find analogous results, we need to refer to works at the data center cabinet scale [52], or at the motherboard scale in desktops or laptops [19], [53], [54], which have structural resemblances. However, the results are not directly applicable to the chip scale due to substantial differences in dimensions, materials, and antenna placement restrictions.

Recall that, without a proper understanding of the wireless channel within the package, the impact of the wireless chip-scale paradigm cannot be really assessed. In the next section, we propose a methodology to bridge this gap.

### III. System Design

Our methodology provides a way to systematically co-design the chip package and the transceiver exploiting the static and monolithic nature of the system. This way, the methodology (i) validates the WNoC concept, (ii) increases the achievable data rate, and (iii) reduces the power consumed by the transceiver circuitry. Here, we first overview our proposal and then detail its design.

#### A. System Overview

The wireless intra-chip channel is largely unknown and prevents architects from assessing the true potential of WNoC. The proposed methodology, summarized in Figure 3, solves the problem in three steps.

First, a comprehensive characterization of the wireless channel within a chip package is performed. Through modeling and full-wave solving, we obtain the response of the wireless channel as a function of the position of the transmitting and receiving antennas within a 4×4 grid, as well as two parameters that chip makers can modify at design time: the frequency band and the dimensions of the package. As further elaborated in Section III-B, the results are processed to evaluate path loss and dispersion over the transmission distance.

The next step in the methodology is referred to as channel engineering and is uniquely suited to this monolithic system. Its main goal is to find the combination of package dimensions and frequency band that jointly minimizes path loss and dispersion. To this end, we define a figure of merit that takes both aspects into account with adjustable weights, allowing manufacturers to model the importance of power and performance in the system. This figure of merit drives an optimizer that, thanks to heuristics derived from the previous characterization process, navigates through the package design tradeoffs efficiently. The exploration is possible thanks to the use of full-wave electromagnetic simulations, which avoid the need for building multiple expensive test vehicles. More details on the methodology are given in Section III-C.

Once we have found the best package and frequency band for our purposes, we optimize the transceiver by leveraging the static nature of the channel. As shown in Section III-D, simple but effective modifications are carried out at both the transmitters and the receivers within the chip.

#### B. Channel Characterization

To characterize the channel effects, and unless noted otherwise, we consider a homogeneous distribution of 4×4 antennas within a 20×20 mm² chip and a central frequency of 60 GHz. The minimum distance among antennas at 60 GHz is 3.57λ, where λ is the wavelength within silicon. This distance and the high loss of silicon guarantee that there is no near-field coupling among neighboring antennas.

Frequency Domain Analysis. The full-wave solver uses the Finite Elements Method (FEM) to obtain the field distribution, the antenna gain, and the coupling between antennas in the frequency domain. Then, the channel frequency response $H_{ij}(f)$ is evaluated for each antenna pair as

$$
G_i G_j |H_{ij}(f)|^2 = \frac{|S_{ji}(f)|^2}{(1 - |S_{ii}(f)|^2) \cdot (1 - |S_{jj}(f)|^2)}, \tag{1}
$$

where $G_i$ and $G_j$ are the transmitter and receiver antenna gains, $S_{ji}$ is the coupling between transmitter $i$ and receiver $j$, whereas $S_{ii}$ and $S_{jj}$ are the reflection coefficients at both ends [57]. Once the whole matrix of frequency responses $\mathbf{H}$ is obtained, a path loss analysis can be performed by fitting the attenuation $L$ over distance $d$ to

$$
L = 10\, n \log_{10}(d/d_0) + L_0, \tag{2}
$$

where $L_0$ is the path loss at the reference distance $d_0$ and $n$ is the path loss exponent [18]. The path loss exponent is around 2 in free space, below 2 in guided or enclosed structures, and above 2 in lossy environments. Since losses at the channel are crucial to determine the power consumption at the transceiver (see Section V), we will report improvements in terms of worst-case $L_{\text{max}}$, average $L_{\text{avg}}$, and path loss exponent $n$.
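To make the frequency-domain post-processing concrete, the following Python sketch evaluates Eq. (1) for one antenna pair and fits the log-distance model of Eq. (2), with the exponent $n$ as the slope, by ordinary least squares. The function names, the reference distance of 1 mm, and the sample distances and losses are hypothetical placeholders, not simulation results from this work.

```python
import math


def channel_gain_db(s_ji, s_ii, s_jj):
    """Eq. (1): G_i G_j |H_ij|^2 in dB from (possibly complex) S-parameters.

    The attenuation used in the fit below would be the negative of this value.
    """
    num = abs(s_ji) ** 2
    den = (1.0 - abs(s_ii) ** 2) * (1.0 - abs(s_jj) ** 2)
    return 10.0 * math.log10(num / den)


def fit_path_loss(distances, losses_db, d0=1.0):
    """Least-squares fit of L = 10 n log10(d/d0) + L0; returns (n, L0)."""
    xs = [10.0 * math.log10(d / d0) for d in distances]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(losses_db) / len(losses_db)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses_db))
    n = sxy / sxx
    return n, mean_y - n * mean_x


# Hypothetical data: distances in mm and losses generated from a known model
# (n = 1.5, L0 = 20 dB), only to show that the fit recovers the exponent.
distances_mm = [5.0, 7.1, 10.0, 14.1, 15.0, 21.2]
losses_db = [10.0 * 1.5 * math.log10(d) + 20.0 for d in distances_mm]

n, l0 = fit_path_loss(distances_mm, losses_db)
l_avg = sum(losses_db) / len(losses_db)
l_max = max(losses_db)
print(f"n = {n:.2f}, L0 = {l0:.1f} dB, L_avg = {l_avg:.1f} dB, L_max = {l_max:.1f} dB")
```

The three reported quantities ($L_{\text{max}}$, $L_{\text{avg}}$, and $n$) come out of the same set of per-pair attenuation values, so a single pass over the matrix of frequency responses is enough to obtain them.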

Time Domain Analysis. In the time domain, we define an input excitation $x_i(t)$ at the input of the transmitting antenna $i$. Then, CST employs the Finite-Difference Time-Domain (FDTD) method to calculate the output signal $y_j(t)$ at the receiving antenna $j$. Hence, the impulse response $h_{ij}(t)$ between transmitter $i$ and receiver $j$ can be derived with the classical formulation

$$
y_j(t) = x_i(t) * h_{ij}(t), \tag{3}
$$

where $*$ denotes the convolution operator. Once calculated, it is straightforward to evaluate the Power Delay Profile (PDP) in the channel between transmitter $i$ and receiver $j$ as

$$
P_{ij}(\tau) = |h_{ij}(\tau)|^2, \tag{4}
$$

therefore obtaining a matrix of PDP functions $\mathbf{P}$ for all transmitters and receivers within the chip. To characterize the multipath richness of the channel, we obtain the delay spread $\tau_{\text{rms}}$ using the PDP of each channel as

$$
\tau_{\text{rms}}^{(i,j)} = \sqrt{\frac{\int (\tau - \bar{\tau}_{ij})^2 P_{ij}(\tau)\, d\tau}{\int P_{ij}(\tau)\, d\tau}}, \tag{5}
$$

where $\bar{\tau}_{ij} = \frac{\int \tau P_{ij}(\tau)\, d\tau}{\int P_{ij}(\tau)\, d\tau}$ is the mean delay of the channel.

In this work, we will assume that all wireless channels are broadcast and, therefore, they should be operated at the lowest speed ensuring correct decoding at all nodes. As a result, we will take the worst delay spread across all transmitter-receiver pairs (i.e., across all distances) as the limiting case and use it to evaluate the coherence bandwidth $B_c$, as follows

$$
\tau_{rms} = \max_{i,j \neq i} \tau_{rms}^{(i,j)} \Rightarrow B_c \propto \frac{1}{\tau_{rms}}. \tag{6}
$$

For simplicity, we will take $B_c = \frac{1}{\tau_{rms}}$.
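The time-domain metrics can be sketched numerically as follows. The snippet assumes a PDP sampled on a uniform delay grid, so that the integrals in the delay spread definition reduce to sums, and it uses made-up three-sample profiles rather than simulated ones. All function and variable names are illustrative.

```python
import math


def rms_delay_spread(delays, pdp):
    """Discrete RMS delay spread: power-weighted standard deviation of the delay."""
    total = sum(pdp)
    mean_delay = sum(t * p for t, p in zip(delays, pdp)) / total
    second_central = sum((t - mean_delay) ** 2 * p for t, p in zip(delays, pdp)) / total
    return math.sqrt(second_central)


def worst_case_spread(pdps, delays):
    """Largest delay spread over all transmitter-receiver pairs, as in Eq. (6)."""
    return max(rms_delay_spread(delays, pdp) for pdp in pdps.values())


# Hypothetical PDPs (linear power) keyed by (transmitter, receiver) index.
delays = [0.0, 0.1e-9, 0.2e-9]  # seconds
pdps = {
    (0, 1): [1.0, 0.0, 1.0],  # two equal-power paths 0.2 ns apart
    (0, 2): [1.0, 0.5, 0.0],  # dominant first path, weaker echo
}

tau_rms = worst_case_spread(pdps, delays)
b_c = 1.0 / tau_rms  # simplification adopted in the text
print(f"tau_rms = {tau_rms * 1e12:.1f} ps, B_c = {b_c / 1e9:.2f} GHz")
```

Because every node must decode a broadcast, the pair with the richest multipath, here the hypothetical pair (0, 1), sets the coherence bandwidth for the whole chip.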

#### C. Channel Engineering

Our methodology takes path loss and delay spread as two metrics to be optimized. Since both aspects are dependent on multiple inputs, the channel engineering can be formally treated as a Multi-Objective Optimization (MOO) problem. These problems can be solved using algorithms amenable to MOO, such as evolutionary algorithms [58].

Another way to tackle the channel engineering is by reducing it to a single-objective problem using weights. In particular, our methodology defines a single custom figure of merit $\phi_w$ that we will attempt to maximize. Since the aim is to mitigate the path loss and the delay spread, the figure of merit takes the form

$$
\phi_w = \frac{1}{PL^{w} \, DS^{(1-w)}} \tag{7}
$$

where $PL$ is the path loss metric, $DS$ is the delay spread metric, and $w \in [0, 1]$ models the importance of power or speed in different designs. In other words, $w$ is fixed by the architect: small values will be used in high performance devices where speed needs to be optimized over power, whereas large values imply minimization of the path loss oriented to low-power embedded systems. In this paper, our metrics are $PL = L_{avg}$ and $DS = \tau_{rms}$, with $L_{avg}$ defined as the average path loss across all distances and $\tau_{rms}$ as defined in Equation (6). Moreover, we normalize both metrics so that they have the same dynamic range between 0 and 1.
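The effect of the weight $w$ can be illustrated with a short Python sketch. It reads the figure of merit as $1/(PL^{w} DS^{1-w})$ and, as an assumption, normalizes each metric by its largest value across the candidates so that both fall in $(0, 1]$; the exact normalization used in this work is not specified here. The three candidate designs and their values are hypothetical.

```python
def normalize(values):
    """Map positive metric values into (0, 1] by dividing by the largest one."""
    top = max(values)
    return [v / top for v in values]


def figure_of_merit(pl, ds, w):
    """phi_w = 1 / (PL^w * DS^(1 - w)) for normalized, positive PL and DS."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return 1.0 / (pl ** w * ds ** (1.0 - w))


def best_design(names, path_losses_db, delay_spreads_ns, w):
    """Return the candidate with the highest figure of merit and all scores."""
    pl_norm = normalize(path_losses_db)
    ds_norm = normalize(delay_spreads_ns)
    scores = {
        name: figure_of_merit(pl, ds, w)
        for name, pl, ds in zip(names, pl_norm, ds_norm)
    }
    return max(scores, key=scores.get), scores


# Hypothetical candidates: B has the lowest path loss, C the lowest delay spread.
names = ["A", "B", "C"]
path_losses_db = [70.0, 40.0, 60.0]
delay_spreads_ns = [0.5, 0.5, 0.1]

for w in (0.0, 0.5, 1.0):
    winner, _ = best_design(names, path_losses_db, delay_spreads_ns, w)
    print(f"w = {w}: best candidate is {winner}")
```

With $w = 0$ only the delay spread matters and the low-dispersion candidate wins, whereas $w = 1$ selects the low-loss candidate, which mirrors the performance versus power trade-off described above.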

The package engineering process as defined in this work considers three variables that can be modified at design time: the silicon thickness $T_s$, the heat spreader thickness $T_h$, and the carrier frequency $f_c$. Then, the objective is to maximize the figure of merit

$$
\max_{T_s,T_h,f_c} \phi_w \tag{8}
$$

that is, to find the $T_s$, $T_h$, and $f_c$ values that maximize the figure of merit for a given $w$ and within the bounds given by the manufacturer or the architect [18]. We conservatively assume $T_s \in [0.1, 0.7]$ mm and $T_h \in [0, 0.8]$ mm, which are ranges easily achievable with current silicon thinning and packaging techniques for 3D ICs [59].

To solve the optimization problem, it is first worth noting that the full-wave simulations required to obtain $\phi_w$ for each $\{T_s, T_h, f_c\}$ combination are very computationally intensive, especially as $f_c$ increases, which renders exhaustive searching impractical. Also, path loss and dispersion are related to $\{T_s, T_h, f_c\}$ in non-monotonic ways and often show opposed trends. This creates local peaks in the $\phi_w$ function, thus ruling out methods such as gradient-based hill climbing, which tends to get stuck in local maxima.

Among the pool of optimization techniques, one alternative amenable to this problem would be Simulated Annealing (SA), which uses a probabilistic method to avoid local peaks and progressively approach a global optimum. Although SA can be modified to solve MOOs [53], we treat our problem as a single-objective optimization and use conventional SA variants. Since SA has been used in other electromagnetic problems [60], [61] and is widely known, we will not detail its implementation for the sake of brevity. We just note that the results of the channel characterization described in this work can help derive the appropriate heuristics (e.g., candidate generation, cooling schedule) for SA to converge fast to the global optimum.
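For readers unfamiliar with SA, the loop can be sketched in a few lines of Python. This is not the optimizer used in this work: `toy_phi` is a made-up surrogate with several local peaks standing in for the expensive full-wave evaluation of $\phi_w$, and the Gaussian candidate generation, geometric cooling schedule, and all numeric parameters are assumptions. The $f_c$ bounds follow the 60–120 GHz span explored in the evaluation.

```python
import math
import random

# Design-space bounds: T_s and T_h in mm, f_c in GHz.
BOUNDS = {"T_s": (0.1, 0.7), "T_h": (0.0, 0.8), "f_c": (60.0, 120.0)}


def toy_phi(p):
    """Placeholder objective with local peaks; NOT a physical model."""
    return (1.0 / (0.2 + p["T_s"])
            + 0.5 * math.sin(12.0 * p["T_h"])
            + 0.5 * math.cos((p["f_c"] - 110.0) / 8.0))


def neighbor(p, rng, scale):
    """Gaussian perturbation of every variable, clipped to its bounds."""
    q = {}
    for key, (lo, hi) in BOUNDS.items():
        step = rng.gauss(0.0, scale * (hi - lo))
        q[key] = min(hi, max(lo, p[key] + step))
    return q


def anneal(objective, start, rng, steps=2000, t0=1.0, cooling=0.995):
    """Maximize `objective` with simulated annealing; returns (best, score)."""
    current, f_cur = dict(start), objective(start)
    best, f_best = dict(current), f_cur
    temp = t0
    for _ in range(steps):
        cand = neighbor(current, rng, scale=0.1)
        f_cand = objective(cand)
        # Always accept improvements; accept worse points with a probability
        # that shrinks as the temperature decreases.
        if f_cand >= f_cur or rng.random() < math.exp((f_cand - f_cur) / temp):
            current, f_cur = cand, f_cand
            if f_cur > f_best:
                best, f_best = dict(current), f_cur
        temp *= cooling
    return best, f_best


standard_chip = {"T_s": 0.7, "T_h": 0.2, "f_c": 60.0}
best, score = anneal(toy_phi, standard_chip, random.Random(0))
print(best, score)
```

In a real flow each call to the objective would trigger a full-wave simulation, so the number of steps and the neighborhood size would have to be chosen far more conservatively than in this toy example.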

#### D. Static Transceiver Optimization

Once the channel is engineered to minimize path loss and delay spread, we leverage the static nature of the channel to perform simple yet effective optimizations in the RF backend. The idea is to push the symbol rates while resorting to the known, deterministic channel response to keep complexity at a minimum.

Figure 4 shows the block diagram of a typical wireless intra-chip link. As pointed out in Section II, OOK modulation is generally considered. Assuming a bit-energy of $E_b = P_{rx}/r_b$, where $P_{rx}$ is the received power and $r_b$ is the symbol rate, the BER of OOK is lower bounded by

$$
BER_{OOK} \geq \frac{1}{2} erfc \left( \sqrt{\frac{E_b}{4N_0}} \right) \tag{9}
$$

where $erfc$ is the complementary error function and $E_b/N_0$ is the signal-to-noise ratio. This bound assumes coherent detection with optimal threshold calculation and no ISI. In our case, however, ISI manifests when pushing the data rate beyond the coherence bandwidth. To mitigate its effects, we propose two techniques: threshold adaptation and Return-to-Zero (RZ) modulation.

**Threshold adaptation**: The main issue in conventional wireless environments is that multipath effects are space- and time-dependent. Therefore, their impact on the Euclidean distance between the OOK symbols and on the optimal decision threshold cannot be predicted. In the worst case, ISI is modeled as added noise, reducing the noise margin and leading to an approximate BER of

$$
BER_{OOK}^{isi} \approx \frac{1}{2} erfc \left( \sqrt{\frac{E_b}{4(N_0 + I)}} \right) > BER_{OOK} \tag{10}
$$

where $I$ is the interference energy.

In WNoC, the channel is time-invariant and we can calculate the exact position of each symbol at all times. This means that we can find the Euclidean distance between symbols and the optimal decision threshold for any combination of previous symbols even in the presence of ISI. This information can be used to design a receiver composed of \( K \) parallel deciders, each with its own threshold, and a register that selects the appropriate leg. Assuming that with \( K \) deciders we address all ISI effects, we can approximate the BER as

\[
BER_{\text{OOK}} \approx \frac{1}{K} \sum_{k=1}^{K} \frac{1}{2} \text{erfc} \left( \sqrt{\frac{\alpha_k E_b}{4N_0}} \right) \quad (11)
\]

where \( \alpha_k \) models the effect of a given past symbol combination to the Euclidean distance between current symbols. The number of required deciders scales as \( K \sim \tau_{\text{rms}}/T_b \) where \( T_b = 1/r_b \) is the symbol period assuming a binary modulation. In any case, the associated overheads are small compared to the cost of the RF front-end.
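The three BER expressions can be compared numerically with a short Python sketch. The operating point, the interference level, and the \( \alpha_k \) values are hypothetical, and the noise-like ISI model is assumed to use the same factor of 4 as the ISI-free bound.

```python
import math


def db_to_linear(value_db):
    """Convert a ratio from dB to linear units."""
    return 10.0 ** (value_db / 10.0)


def ber_ook(eb_n0):
    """ISI-free OOK expression: 0.5 * erfc(sqrt(Eb / (4 N0)))."""
    return 0.5 * math.erfc(math.sqrt(eb_n0 / 4.0))


def ber_isi_as_noise(eb, n0, interference):
    """Worst case where ISI of energy I is treated as additional noise."""
    return 0.5 * math.erfc(math.sqrt(eb / (4.0 * (n0 + interference))))


def ber_adaptive(eb_n0, alphas):
    """Average over K deciders, each with distance scaling alpha_k."""
    return sum(ber_ook(a * eb_n0) for a in alphas) / len(alphas)


eb_n0 = db_to_linear(20.0)           # hypothetical operating point
alphas = [1.0, 0.9, 0.8, 0.7]        # hypothetical per-decider scalings

print("no ISI         :", ber_ook(eb_n0))
print("ISI as noise   :", ber_isi_as_noise(eb_n0, 1.0, 1.0))
print("K = 4 deciders :", ber_adaptive(eb_n0, alphas))
```

Under these assumptions the adaptive receiver sits between the ISI-free case and the noise-like ISI model, since each decider only loses the fraction \( 1 - \alpha_k \) of the symbol distance instead of seeing the whole interference as noise.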

**Return to zero:** A classical way to mitigate ISI effects is by using RZ techniques, which reduce the length of the symbol through duty cycling. On the one hand, this shortens the length of the current symbol as seen by the receiver, which implies lower spillover into the next symbols. On the other hand, the lower ISI comes at the cost of a drop in the received energy, which may offset the gains of reduced ISI if RZ is not designed properly. However, since the channel is time-invariant, we can infer the duty cycle that maximizes the signal-to-interference ratio and, thus, minimizes the BER for any symbol combination. In Equation (11), this would be equivalent to increasing \( \alpha_k \) for all \( k \).
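A minimal Python sketch of the duty-cycle selection is given below. It is a simplified model under stated assumptions, not the procedure used in this work: the channel is represented by a sampled PDP, received power is summed incoherently, energy arriving after the bit period counts as ISI, and a fixed noise term is added to the denominator so that the energy penalty of short pulses is captured. The exponential PDP and all numeric values are hypothetical.

```python
import math


def sinr_for_duty(pdp, samples_per_bit, duty, noise):
    """Signal-to-(ISI + noise) ratio of a single RZ pulse over a known PDP.

    The pulse occupies the first round(duty * samples_per_bit) samples of the
    bit. Power arriving within the bit period counts as signal; later
    arrivals spill into following symbols and count as ISI.
    """
    pulse_len = max(1, round(duty * samples_per_bit))
    signal = 0.0
    isi = 0.0
    for k in range(pulse_len):            # each transmitted pulse sample
        for m, power in enumerate(pdp):   # each multipath delay tap
            if k + m < samples_per_bit:
                signal += power
            else:
                isi += power
    return signal / (isi + noise)


def best_duty_cycle(pdp, samples_per_bit, noise, candidates):
    """Pick the candidate duty cycle with the highest ratio."""
    return max(candidates,
               key=lambda d: sinr_for_duty(pdp, samples_per_bit, d, noise))


samples_per_bit = 20
pdp = [math.exp(-m / 8.0) for m in range(60)]  # hypothetical exponential PDP
noise = 2.0                                    # hypothetical noise energy
candidates = [i / 10 for i in range(1, 11)]

best = best_duty_cycle(pdp, samples_per_bit, noise, candidates)
print("selected duty cycle:", best)
```

In this toy setting the selected duty cycle falls strictly between the extremes: very short pulses waste energy against the noise term, while a full-length pulse spills most of its late energy into the next symbol. Because the channel is static, such a search would only need to be run once at design time.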

### IV. Evaluation

The three pillars of the proposed methodology are evaluated separately. Section IV-A discusses channel scaling trends, Section IV-B shows the gains of the channel engineering process, and Section IV-C illustrates the transceiver improvements.

#### A. Channel Characterization

Here, we quantify the impact of the silicon thickness \( T_s \), the heat spreader thickness \( T_h \) and the central frequency \( f_c \) on the path loss and delay spread. Unless noted, we assume a homogeneous distribution of 4×4 antennas and take \( f_c = 60 \) GHz and the dimensions of a standard chip (\( T_s = 0.7 \) mm and \( T_h = 0.2 \) mm) as default values. We obtain the path loss and delay spread for all antenna pairs and perform a linear regression to obtain the dependence with distance.

Figure 5 shows the scaling trends with respect to the silicon thickness. This layer is highly lossy, as mentioned in Sec. II, and we observe that the benefits of thinning it down are significant. A 100-\( \mu \)m chip has a maximum path loss of \( L_{\text{max}} = 36.29 \) dB and a maximum delay spread of \( \tau_{\text{rms}} = 0.19 \) ns. Compared to a standard chip, the thinned alternative is \( 2.1 \times \) better in terms of path loss (39 dB difference) and \( 2.73 \times \) better in terms of worst-case delay spread (0.33 ns difference). Additionally, the path loss exponent is reduced from \( n = 4.32 \) to \( n = 1.32 \), confirming the transition from a lossy environment (\( n > 2 \)) to a guided medium (\( n < 2 \)). The performance also scales better in terms of delay spread, reducing the slope from 25.05 to 5.83 ps/mm.

Figure 6 repeats the analysis by varying the heat spreader thickness \( T_h \). Given its low electrical losses, this layer can aid propagation and its inclusion is therefore highly recommended. The delay spread improves up to \( 3 \times \) (from 0.6 to 0.2 ns) due to the presence of a stronger reflection cluster coming from the heat spreader. As for the path loss, the case presented here shows a limited impact (~10 dB improvement on average) because most of the energy is dissipated in the 0.7-mm silicon layer before reaching the heat spreader. Although not shown due to space constraints, the effect of AlN on path loss is much more evident for thinned-down silicon, as the exponent drops from \( n = 4.01 \) (no AlN) to close to 1.1 (0.8 mm). In that case, the delay spread also oscillates between 0.2 and 0.6 ns, sometimes contradicting the path loss tendency.

Finally, Figure 7 presents the results of the frequency scaling analysis, which we limit to the 60–120 GHz span due to computational constraints. Additionally, we fix the silicon and heat spreader thicknesses to small and large values, respectively, following the design recommendations justified above. We chose this particular combination (\( T_s = 0.3 \) mm and \( T_h = 0.8 \) mm) because it is close to an optimal point with respect to dispersion. We find that \( f_c = 110 \) GHz leads to a minimum in terms of delay spread, although the improvement is limited with respect to the other frequencies. The impact on path loss, on the other hand, is substantial yet counter-intuitive at times, as the average path loss first oscillates around 40–50 dB when shifting the frequency between 60 GHz and 90 GHz, and then increases substantially towards 90 dB at 120 GHz.

#### B. Engineering the Channel

Here, we show the potential of channel engineering through a partial exploration of the \( \{T_s, T_h, f_c\} \) design space. Our aim is not to fully implement the optimizer, but rather to validate the potential of the approach by confirming both the complex interactions between inputs and the presence of local optima, as well as by giving good approximations of the path loss and delay spread improvements that we can expect.

We first plot the figure of merit \(\phi_w\) as a function of each exploration parameter while leaving the others fixed. The results, summarized in Figure 8, confirm the main lessons learned in Section IV-A: thin silicon is generally preferable (left plot), it is hard to obtain clear tendencies with respect to the heat spreader (middle plot), and performance may plateau close to local optima (right plot). The choice of \(w\) also plays an important role in the optimization and Figure 8 also confirms it. Since path loss and delay spread often show opposed trends, the shape of \(\phi_w\) changes in unexpected ways and causes wild variations in the optimal design points. Take, for instance, the frequency scaling trend. The optimal point is clearly at 110 GHz for \( w = 0 \), but that peak dilutes progressively and disappears around \( w = 0.6 \). At that point, the optimal frequency becomes 60 GHz or 80 GHz due to the better path loss behavior.

In order to estimate the maximum gains that we can achieve through channel engineering, we further explored the design space in the quest for points close to a hypothetical global optimum. We chose three representative values of \( w \) and compared the results with those of a standard chip (\( T_s = 0.7 \text{ mm}, T_h = 0.2 \text{ mm}, f_c = 60 \text{ GHz} \)). Figure 8 and Table I illustrate the outcome of this process. There, \( L_{max} \) and \( L_{avg} \) refer to the maximum and average path loss across all measured transmitter-receiver pairs within the 4×4 homogeneous grid of antennas.

We first set \( w = 0 \) to simulate the extreme of high performance, thereby pushing the limits on the delay spread. The peak has been found around \( \{ T_s = 0.3 \text{ mm}, T_h = 0.8 \text{ mm}, f_c = 110 \text{ GHz} \} \) and yields a worst-case delay spread of \( \tau_{rms} = 71.32 \text{ ps} \) for a coherence bandwidth of \( B_c = 14.02 \text{ GHz} \). This is roughly one order of magnitude better than the standard chip case (0.52 ns for 1.92 GHz) and confirms that the speeds assumed in the WNoC literature are feasible. In terms of path loss, this design point is also 10–15 dB better than the standard.
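The delay spread and coherence bandwidth pairs quoted above are numerically consistent with the simple relation \( B_c \approx 1/\tau_{rms} \). The following minimal sketch checks that arithmetic. The relation is inferred from the reported numbers rather than quoted from a definition in this section, and the function name is hypothetical.

```python
def coherence_bandwidth_ghz(tau_rms_ps: float) -> float:
    """Assumed relation B_c ~ 1 / tau_rms.

    1/ps is THz, so multiply by 1e3 to express the result in GHz.
    """
    return 1e3 / tau_rms_ps


if __name__ == "__main__":
    engineered = coherence_bandwidth_ghz(71.32)  # design point tuned for w = 0
    standard = coherence_bandwidth_ghz(520.0)    # standard chip, 0.52 ns
    print(f"engineered chip: {engineered:.2f} GHz")
    print(f"standard chip:   {standard:.2f} GHz")
    print(f"delay spread reduction: {520.0 / 71.32:.1f}x")
```

Under this assumed relation, the ratio between the two delay spreads is also the ratio between the two coherence bandwidths.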

A second representative case would be \( w = 1 \), which pushes the limits on the path loss. The peak has been found by thinning the silicon down to our lower limit and using a thick spreader: \( \{ T_s = 0.1 \text{ mm}, T_h = 0.8 \text{ mm}, f_c = 60 \text{ GHz} \} \). This case achieves an outstanding path loss reduction of 47.07 dB for \( L_{max} \) and 32.69 dB for \( L_{avg} \) (\( n = 1.32 \)). Further, this confirms that the path loss figures assumed in the literature, around 25–35 dB, are indeed achievable even in the presence of a chip package. However, the delay spread is maintained at the levels of the standard chip in this case.
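As a quick worked example of what these reductions mean, the sketch below recovers the implied path loss of the standard chip by adding each reduction to the corresponding engineered value in Table I, and converts a dB figure into a linear power ratio. It assumes the reductions are measured against the standard chip; the helper names are hypothetical.

```python
# Engineered values for w = 1 (Table I) and the reductions quoted in the text, in dB.
ENGINEERED_DB = {"L_max": 28.55, "L_avg": 21.88}
REDUCTION_DB = {"L_max": 47.07, "L_avg": 32.69}


def db_to_linear(db: float) -> float:
    """Convert a power ratio in dB to a linear factor."""
    return 10.0 ** (db / 10.0)


def standard_chip_loss_db(key: str) -> float:
    """Implied standard-chip path loss: engineered value plus reduction (assumption)."""
    return ENGINEERED_DB[key] + REDUCTION_DB[key]


if __name__ == "__main__":
    for key in ("L_max", "L_avg"):
        print(f"{key}: standard chip ~ {standard_chip_loss_db(key):.2f} dB, "
              f"reduction factor ~ {db_to_linear(REDUCTION_DB[key]):.0f}x in power")
```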

Finally, let \( w = 0.5 \) to model a channel engineering process seeking a balance between power and performance. In this case, a local peak has been found around the point \( \{ T_s = 0.1 \text{ mm}, T_h = 0.38 \text{ mm}, f_c = 70 \text{ GHz} \} \). With respect to the standard chip, this design improves the coherence bandwidth \( B_c \) by 3.52× and the average path loss \( L_{avg} \) by over 1.5×. Although this may not be a global optimum, it illustrates the potential of the methodology.
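To illustrate how the weight \( w \) steers the choice of design point, the sketch below ranks the three points of Table I with a hypothetical cost: a weighted sum of min-max normalized average path loss and delay spread, where lower is better. This is not the definition of \( \phi_w \) used in the paper, which is given in an earlier section and may differ; it only mimics the trade-off between the two metrics.

```python
# (delay spread in ns, average path loss in dB), taken from Table I.
DESIGN_POINTS = {
    "Ts=0.3mm, Th=0.8mm, fc=110GHz": (0.07, 42.76),
    "Ts=0.1mm, Th=0.38mm, fc=70GHz": (0.15, 36.48),
    "Ts=0.1mm, Th=0.8mm, fc=60GHz": (0.59, 21.88),
}


def normalize(values):
    """Min-max normalize a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def best_point(points, w):
    """Return the design point minimizing the hypothetical weighted cost.

    w = 0 only penalizes delay spread, w = 1 only penalizes path loss.
    """
    names = list(points)
    taus = normalize([points[name][0] for name in names])
    losses = normalize([points[name][1] for name in names])
    cost = {name: w * loss + (1.0 - w) * tau
            for name, tau, loss in zip(names, taus, losses)}
    return min(cost, key=cost.get)


if __name__ == "__main__":
    for w in (0.0, 0.5, 1.0):
        print(f"w = {w}: {best_point(DESIGN_POINTS, w)}")
```

With these three candidates, the selected point moves from the low-dispersion design to the low-loss design as \( w \) goes from 0 to 1, with the intermediate design selected at \( w = 0.5 \).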

### C. Static Transceiver Optimization

Since we are interested in pushing the limits of performance, this section evaluates the transceiver improvements in the package engineered for high performance. Thus, we take the worst-case transient response of the \( \{ T_s = 0.3 \text{ mm}, T_h = 0.8 \text{ mm}, f_c = 110 \text{ GHz} \} \) design point with a delay spread of \( \tau_{rms} = 71.32 \text{ ps} \). In all the studied cases, OOK-modulated waveforms are convolved with the transient response of the channel and fed to the receiver, which determines the hypothetical position of the next ‘0’ or ‘1’ symbol. The BER is calculated assuming independent and equiprobable symbols.
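The evaluation procedure can be sketched in a few lines: build an OOK baseband waveform, convolve it with a channel response, and inspect the sampled levels. The channel below is a hypothetical one-sided exponential decay whose time constant is set to 71.32 ps purely for illustration; it is not the simulated transient response used in the paper, and all function names are assumptions.

```python
import math


def exp_channel(tau_ps, dt_ps=1.0, span=8):
    """Hypothetical exponentially decaying response, normalized to unit DC gain."""
    n = int(span * tau_ps / dt_ps)
    h = [math.exp(-i * dt_ps / tau_ps) for i in range(n)]
    total = sum(h)
    return [v / total for v in h]


def ook_waveform(bits, samples_per_symbol):
    """NRZ on-off keying: hold each bit value for one symbol period."""
    return [float(b) for b in bits for _ in range(samples_per_symbol)]


def convolve(x, h):
    """Direct-form discrete convolution in pure Python."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        if xv == 0.0:
            continue
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y


def eye_opening(bits, rate_gbps, tau_ps, dt_ps=1.0):
    """Lowest sampled '1' minus highest sampled '0' (negative means a closed eye)."""
    sps = round(1000.0 / rate_gbps / dt_ps)  # symbol period in samples
    y = convolve(ook_waveform(bits, sps), exp_channel(tau_ps, dt_ps))
    samples = [y[(n + 1) * sps - 1] for n in range(len(bits))]  # end of each symbol
    ones = [s for s, b in zip(samples, bits) if b == 1]
    zeros = [s for s, b in zip(samples, bits) if b == 0]
    return min(ones) - max(zeros)


if __name__ == "__main__":
    pattern = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1]
    for rate in (10, 40):
        print(f"{rate} Gb/s: eye opening = {eye_opening(pattern, rate, 71.32):+.3f}")
```

In this toy setting the eye is open at 10 Gb/s and closed at 40 Gb/s, which is the kind of ISI-induced degradation that the transceiver optimizations below target.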

#### Threshold adaptation

We simulate our proposed receiver with different numbers of decision thresholds \( K \). We first obtain the threshold values by looking at the previous \( \log_2(K) \) symbols and then use the conventional \( \text{erfc} \) formulation to derive the error probability. Figure 10(a) plots the resulting BER for a fixed \( r_b \) of 10 Gb/s, assumed in numerous WNoC works, and as a function of \( E_b / N_0 \). Although we are below the coherence bandwidth, ISI effects preclude the use of \textit{a priori} thresholds based on steady state measurements alone. The performance for \( K = 4 \) is far from ideal, but starts to improve significantly. At \( K = 8 \), the receiver performs close to a coherent receiver in an ISI-free environment. In fact, it only needs to be 24.1 dB above the noise floor to achieve the stringent BER required for WNoC (\( 10^{-15} \)). This is only 3.1 dB over the ideal case.
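A minimal sketch of the threshold adaptation idea follows, under strong simplifying assumptions: a noise-free discrete-time channel with three hypothetical taps, one sample per symbol, and thresholds placed midway between the mean '0' and '1' levels observed for each history of \( \log_2(K) \) previously decided symbols. The tap values, function names, and threshold rule are illustrative and are not taken from the paper.

```python
from itertools import product
from statistics import mean

TAPS = [1.0, 0.8, 0.4]  # hypothetical ISI channel: current symbol plus two echoes


def channel_output(bits, taps):
    """One sample per symbol: weighted sum of the current and previous bits."""
    return [sum(t * bits[n - k] for k, t in enumerate(taps) if n - k >= 0)
            for n in range(len(bits))]


def thresholds(taps, m):
    """One threshold per history of m previous bits, so K = 2**m thresholds."""
    levels = {}
    for pattern in product((0, 1), repeat=len(taps)):
        level = sum(t * b for t, b in zip(taps, pattern))
        # pattern[0] is the current bit, pattern[1:1 + m] the m most recent ones.
        levels.setdefault((pattern[1:1 + m], pattern[0]), []).append(level)
    histories = {key[0] for key in levels}
    return {h: (mean(levels[(h, 0)]) + mean(levels[(h, 1)])) / 2 for h in histories}


def decode(samples, thr, m):
    """Pick the threshold indexed by the m previously decided bits."""
    history = (0,) * m
    decided = []
    for r in samples:
        bit = 1 if r > thr[history] else 0
        decided.append(bit)
        history = ((bit,) + history)[:m]
    return decided


def count_errors(bits, taps, m):
    decided = decode(channel_output(bits, taps), thresholds(taps, m), m)
    return sum(d != b for d, b in zip(decided, bits))


if __name__ == "__main__":
    sent = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1]
    for m in (0, 1, 2):
        print(f"K = {2 ** m}: {count_errors(sent, TAPS, m)} errors")
```

In this toy channel a single fixed threshold misdecodes several symbols even without noise, whereas conditioning the threshold on the previous symbol already removes those errors. Because the channel is static, such thresholds can be computed once.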

To further evaluate the potential of the proposed scheme, we fix the received power and push the data rate way beyond the coherence bandwidth. The results, shown in Figure 10(b), reveal that the default receiver stops working upon reaching the ISI wall at around 5 Gb/s. With as few as \( K = 2 \) thresholds, our proposed scheme improves the achievable data rate by between 20% and 40%. Again, increasing the number of decision thresholds further mitigates ISI (the bitrate increases from 7.32 up to 10.56 Gb/s at \( BER = 10^{-9} \)), to the point of becoming indispensable as we keep pushing the data rate. These results illustrate the tradeoff between performance and receiver complexity, although the overhead of our proposed scheme is arguably small.

**Return-to-zero**: One of the conclusions that can be extracted from Figures 10(a) and 10(b) is that we can minimize ISI, but we cannot get rid of it completely. The adaptive threshold moves along with the average received energy, but cannot eliminate the case where the ‘0’ and ‘1’ symbols move closer. This is precisely the case targeted by RZ. To evaluate it, we assume a receiver with $K = 8$ and fix the $E_b/N_0$ for all transmission speeds. The results, plotted in Figure 10(c), demonstrate that there is indeed an optimal duty cycle value for all transmission speeds, and reveal that RZ brings our scheme 1.2 dB closer to the ideal receiver for $BER = 10^{-15}$.

V. DISCUSSION

Impact on transmission speed. The channel engineering process, by means of substantial delay spread cuts, increases the ISI-free speed by an order of magnitude with respect to a standard chip. Further, the transceiver optimizations have demonstrated that (i) achieving a $BER$ of $10^{-15}$ at 10 Gb/s is affordable, and that (ii) it would otherwise be impossible. This proves that our methodology enables the speeds generally assumed in the WNoC literature.

Impact on power consumption. By reducing the path loss by up to 47 dB, we achieve attenuation levels close to those assumed in recent transceiver proposals (26.5 dB in [40, 41] and 26 dB calculated with data in [42, 43]). Meeting such assumptions would lead to bit energies of 1.95 pJ/bit for [40, 41] or 0.54 pJ/bit for [42, 43], along the lines of what is assumed in the WNoC literature. On top of that, our transceiver only needs an extra 3.1 dB of SNR to compensate for the ISI effects at 10 Gb/s and $BER = 10^{-15}$.

To make an explicit connection between channel losses and efficiency, we note that power amplifiers are the most power-hungry components of current transceivers, e.g., 70.8% in [42, 43]. Compensating for extra losses, noise figures, or circuit limitations would make these figures increase even further. In fact, each amplifier has a limit $P_{sat}$ on the output power it can provide. Going beyond that limit would require a re-design of the amplifier and, according to long-time experimentally validated scaling tendencies, the extra effort is generally paid with a reduction of the amplifier efficiency of 2.5% for each extra dBm of $P_{sat}$ [62]. Along the same lines, it is worth noting that increasing the frequency may impact not only the channel, but also the efficiency of the transceiver. Although there is no technological impediment preventing the use of the 60–120 GHz band (current technologies offer $f_T$ and $f_{max}$ values around 300 GHz and above), pushing the frequency may initially lead to a loss of efficiency. This difference, however, levels out as technology matures and its use is extended.
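The figures quoted in this discussion can be combined in a short back-of-the-envelope sketch. It converts the bit energies into transceiver power at an assumed 10 Gb/s and applies the 2.5% per dBm scaling trend to a hypothetical case in which the extra 3.1 dB of SNR had to be obtained entirely by raising $P_{sat}$. Reading the trend as percentage points, and pairing these bit energies with a 10 Gb/s rate, are both assumptions made only for illustration.

```python
def power_mw(energy_pj_per_bit: float, rate_gbps: float) -> float:
    """Power in mW: pJ/bit times Gb/s equals 1e-12 J * 1e9 1/s = 1e-3 W."""
    return energy_pj_per_bit * rate_gbps


def efficiency_penalty_points(extra_psat_db: float, points_per_db: float = 2.5) -> float:
    """Amplifier efficiency loss, assuming the trend is linear in percentage points."""
    return points_per_db * extra_psat_db


if __name__ == "__main__":
    for label, energy in (("[40, 41]", 1.95), ("[42, 43]", 0.54)):
        print(f"{label}: {power_mw(energy, 10.0):.1f} mW at 10 Gb/s")
    print(f"3.1 dB of extra P_sat: about {efficiency_penalty_points(3.1):.2f} "
          "points of amplifier efficiency")
```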

Generality of results: we note that the specific results contained in this paper are, by definition, valid for a particular chip arrangement and cannot be generalized to any chip package. The key takeaways of the present work are, however, that the wireless intra-chip channel can be optimized and that the proposed methodology is applicable to any chip package. Such an optimization process is unique to this wireless communications scenario.

Research directions: although this work has mitigated the intra-chip channel impairments significantly, we do not claim to have reached a lower bound. Besides the application of simulated annealing techniques to find global optima, we could improve propagation further by (i) directing certain rays via reflectors or leveraging the multiple antennas already in place to perform beamforming, (ii) thinning silicon down to the manufacturing limits [59], or (iii) exploring frequencies up to the terahertz band [63]. Additionally, factors such as the chip’s lateral dimensions, the antenna placement, or the resistivity of the silicon substrate [18] could be brought into the optimization process as long as the computational cost is affordable. At the transceiver side, low-weight coding would help minimize the impact of ISI at very high speeds [64]. Further, compact and efficient Forward Error Correction (FEC) techniques could make it possible to reach the required BER without placing a large burden on the amplifiers [65, 66].

Alternative technologies: Transmission of optical signals through integrated nanophotonic waveguides [67–69] or of RF signals through transmission lines (TLs) [70, 71] can provide low latency and broadcast. Compared to wireless intra-chip communication, both nanophotonics and TLs are more energy efficient and provide higher bandwidth, because energy is guided rather than radiated. In this respect, the present work aims to reduce the performance and efficiency gap with respect to its alternatives. Beyond that, the main downside of nanophotonics and TLs is the need for a physical infrastructure to interconnect the nodes, which complicates the network design. Further, nanophotonics are less scalable due to laser power needs: light is modulated by the transmitter and then guided to all the receivers, and each receiver extracts a fraction of the light, causing losses and requiring high laser power for large destination sets. On the other hand, TLs are less scalable due to the need for amplifiers along the transmission line and for centralized arbitration, issues that are exacerbated if the fan-out is large.

### TABLE I

<table>
<thead>
<tr>
<th>$w$</th>
<th>$\tau_{max}$ (ns)</th>
<th>$B_c$ (GHz)</th>
<th>$L_{max}$ (dB)</th>
<th>$L_{avg}$ (dB)</th>
<th>$n$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0.07</td>
<td>14.02</td>
<td>58.62</td>
<td>42.76</td>
<td>3.28</td>
</tr>
<tr>
<td>0.5</td>
<td>0.15</td>
<td>6.76</td>
<td>45.49</td>
<td>36.48</td>
<td>1.74</td>
</tr>
<tr>
<td>1</td>
<td>0.59</td>
<td>1.69</td>
<td>28.55</td>
<td>21.88</td>
<td>1.32</td>
</tr>
</tbody>
</table>

Fig. 10. Impact of transceiver optimizations on the Bit Error Rate (BER) assuming OOK modulation. NRZ stands for Non-Return-to-Zero.

VI. CONCLUSION

Wireless intra-chip communication has been proposed as a potential solution to the scalability problems of current multicore processors. However, we have demonstrated that most works in this field are overly optimistic with regard to the channel, assuming figures one or two orders of magnitude better than what we found for a standard chip package. To address this fundamental issue and restate the potential of WNoC, we proposed a methodology that exploits two unique traits of this new wireless scenario: its monolithic and static nature. The first allows us to engineer the channel, that is, to modify the chip package to enhance propagation in manufacturer-friendly ways. This process is applicable to any chip package and, here, we have illustrated its potential by showing improvements of 47 dB of path loss or more than 10 GHz in coherence bandwidth for a particular system. The second allows us to optimize the transceiver to mitigate multipath effects beyond the Nyquist limit. We demonstrated that we can decode OOK signals at 10 Gb/s with a BER of $10^{-15}$ with a signal-to-noise ratio only 3.1 dB greater than in a dispersion-free environment.

REFERENCES

J. Oh, A. Zajic, and M. Prvulovic, “Traffic steering between a low-speed on-chip wireless networks and broadcast-enabled manycore processor architectures.”


Xavier Timoneda is a research assistant at the Universitat Politècnica de Catalunya, where he obtained his degree in Telecommunications Systems Engineering in 2018, after performing his final Degree Thesis at the University of Illinois at Urbana-Champaign (UIUC). He has authored 4 scientific publications during his first year as a research assistant, and has recently coauthored a book chapter. His research interests include artificial intelligence and chip-scale communications, currently focusing his research in the Development of Artificial Neural Networks for the assembly of Software-driven Functional Metasurfaces.

Sergi Abadal (M’16) received the B.Sc. and M.Sc. degrees in telecommunication engineering from the Universitat Politècnica de Catalunya (UPC), Barcelona, Spain, in 2010 and 2011, and the Ph.D. in computer architecture from the same institution in 2016. During his Ph.D., he was awarded by INTEL within its Doctoral Student Honor Program. He currently works as a postdoctoral researcher at the NanoNetworking Center in Catalunya (N3Cat) and, since October 2019, as Principal Investigator of the EU H2020 project WIPFLASH. From 2009 to 2010, he was a Visiting Researcher with the Broadband Wireless Networking Laboratory, Georgia Institute of Technology, Atlanta, USA. He has also been visiting researcher at the School of Computer Science, University of Illinois, Urbana-Champaign, in 2015 and 2018. He has co-authored more than 60 research papers and 7 book chapters. Since 2018, he is Associate Editor of the Nano Communication Networks (Elsevier) Journal, where he was appointed Editor of the Year in 2019. His current research interests are ultra-high-speed on-chip wireless networks and broadcast-enabled manycore processor architectures.

Dionysios Manessis possesses M.Sc. and Ph.D. degrees in Materials Science & Engineering from Stevens Institute of Technology, NJ, USA and project leadership certificate degrees from Cornell University, NY, USA. He has worked as Technologist for Universal Instruments Corporation in NY, USA and since 2001 has been Senior Technology Scientist in Fraunhofer IZM in Berlin. His main research interests lie in fine-pitch flip chip and wafer-level CSP bumping, solder balling, materials selection for advanced packaging technologies, embedding processes for heterogeneous integration of components in PCBs and optical PCBs, and large-scale prototype manufacturing. In the above technical fields, he has published extensively in international conferences and peer-reviewed journals.

Antonio Franques is a PhD student in Computer Science at the University of Illinois at Urbana-Champaign (UIUC), and a member of the i-acoma group. His research focuses on the application of high-frequency wireless on-chip communication in manycore architectures. Specifically, his goal is to design new shared-memory architectures that reduce the large cost of core-to-core communication in parallel computing. While working towards his PhD, he also interned twice at AMD Research, working on prototype communication hardware for exascale computing. Prior to joining UIUC, he obtained a Bachelors Degree in Telecommunications Engineering from the Polytechnic University of Valencia (UPV), Spain, and also performed two years of research in the area of Computational Mathematics under the supervision of Professors Juan Ramon Torregrosa and Alicia Cordero, as a member of the DAMRES group.

Jin Zhou is an Assistant Professor at the Electrical and Computer Engineering department of the University of Illinois at Urbana-Champaign (UIUC). He received a B.S. degree in electronics science and technology from Wuhan University, Wuhan, China, in 2008, a M.S. degree in microelectronics from Fudan University, Shanghai, China, in 2011, and a Ph.D. degree in electrical engineering from Columbia University, New York, NY, USA, in 2017. From 2011 to 2012, he also worked as an RF integrated circuits design engineer with MediaTek Singapore. Dr. Zhou is a recipient of the 2015-2016 Qualcomm Innovation Fellowship and the 2015-2016 IEEE Solid-State Circuits Society Predoctoral Achievement Award. He received the Eli Jury Award from the Department of Electrical Engineering at Columbia University in 2016 for his outstanding achievement in the areas of systems, communications, signal processing, or circuits.

Josep Torrellas is the Saburo Muroga Professor of Computer Science at the University of Illinois at Urbana-Champaign (UIUC). He is the Director of the Center for Programmable Extreme Scale Computing, and past Director of the Illinois-Intel Parallelism Center (I2PC). He is a Fellow of IEEE (2004), ACM (2010), and AAAS (2016). He received the IEEE Computer Society 2015 Technical Achievement Award, for “Pioneering contributions to shared-memory multiprocessor architectures and thread-level speculation”, and the 2017 UIUC Campus Award for Excellence in Graduate Student Mentoring. He is a member of the Computing Research Association (CRA) Board of Directors. He has served as the Chair of the IEEE Technical Committee on Computer Architecture (TCCA) (2005-2010) and as a Council Member of CRA’s Computing Community Consortium (CCC) (2011-2014). He was a Willett Faculty Scholar at UIUC (2002-2009). As of 2016, he has graduated 36 Ph.D. students, who are now leaders in academia and industry. He received a Ph.D. from Stanford University.

Eduard Alarcón is an associate professor at the Universitat Politècnica de Catalunya, where he obtained his PhD in electrical engineering in 2000. He has coauthored more than 400 scientific publications, 8 book chapters and 12 patents. He was elected IEEE CAS society distinguished lecturer, member of the IEEE CAS Board of Governors (2010-2013), Associate Editor for IEEE TCAS-I, TCAS-II, JOLPE, and Editor-in-Chief of JETCAS. His research interests include nanocommunications and wireless energy transfer.

Albert Cabellos is an assistant professor at Universitat Politècnica de Catalunya, where he obtained his PhD in computer science engineering in 2008. He is co-founder and scientific director of the NaNoNetworking Center in Catalunya. He has been a visiting researcher at Cisco Systems and Agilent Technologies and a visiting professor at the KTH, Sweden, and the MIT, USA. He has co-authored more than 80 research papers. His research interests include nanocommunications and software-defined networking.