## Abstract

This paper focuses on the empirical derivation of regret bounds for mobile systems that can optimize their locations in real-time within a spatiotemporally varying renewable energy resource. The case studies in this paper focus specifically on an airborne wind energy system, where the replacement of towers with tethers and a lifting body allows the system to adjust its altitude continuously, with the goal of operating at the altitude that maximizes net power production. While prior publications have proposed control strategies for this problem, often with favorable results based on simulations that use real wind data, they lack any theoretical or statistical performance guarantees. In this work, we make use of a very large synthetic dataset, generated using statistical parameters identified from real wind data, to derive probabilistic bounds on the difference between optimal and actual performance, termed regret. The results are presented for a variety of control strategies, including maximum probability of improvement, upper confidence bound, greedy, and constant altitude approaches. In addition, we use dimensional analysis to generalize the aforementioned results to other spatiotemporally varying environments, making the results applicable to a wider variety of renewably powered mobile systems. Finally, to deal with more general environmental mean models, we introduce a novel approach to modify calculable regret bounds to accommodate any mean model through what we term an “effective spatial domain.”

## 1 Introduction

Spatiotemporally varying environments are common in mobile robotics (see Refs. [1] and [2]), unmanned aerial vehicles (see Ref. [3]), and the emerging field of renewably powered robotic systems (see Refs. [4] and [5]). In these instances, the objective is to utilize one or more mobile robots to perform a mission in an unknown, spatiotemporally varying environment. Within the renewable energy community, mobile systems can also be used as energy-harvesting devices in spatiotemporally varying environments. In this work, we focus on one such system that possesses mobility to explore and exploit its spatiotemporally varying environment: an *airborne wind energy system.* Airborne wind energy (AWE) systems, such as the Altaeros Buoyant Airborne Turbine [6], replace conventional towers with tethers and a *lifting body* (wing or aerostat) that holds the airborne generator aloft (Fig. 1). These systems are effective due to their ability to reach high altitudes, where wind speeds can be much higher than wind speeds typically seen by ground mounted turbines. In addition, AWE systems can adjust their operating altitudes and flight configurations. This capability can be broken down generally into two operational modes: crosswind flight, whereby AWE systems fly in optimized high-speed motions perpendicular to the wind, and altitude optimization. Both lead to challenges involving control in a partially observable, spatiotemporally varying environment, but for this paper we focus on altitude optimization. Altitude optimization involves an important tradeoff between *exploration* and *exploitation*, which is common to mobile systems operating in spatiotemporally varying environments. 
Specifically, because the wind speed is unknown at all altitudes besides the current altitude of the AWE system, the controller must balance exploring other altitudes to search for better wind conditions (for more power generation later) and exploiting the current estimate of the maximum wind speed (for more power generation now).

A large body of research has examined different control strategies for balancing exploration and exploitation in the context of AWE systems (see Refs. [7–10]). Each of these results has demonstrated, based on simulations driven by real data, that appropriately designed altitude optimization algorithms can significantly outperform omniscient, constant-altitude flight.

An examination of the aforementioned results reveals an expansive set of data-driven simulation results, but a lack of theoretical or statistical performance guarantees. It is intuitive to focus on the difference between optimal and realized performance, termed *regret*, in formulating such guarantees. In fact, a number of papers investigate the formulation of regret bounds in classical bandit problems [11–13]. These problems are often explained via a slot machine (i.e., bandit) analogy. Here, a gambler must, at every instance, select one of many arms to pull on, where each arm yields a different reward that is subject to some stochastic distribution. In the context of our problem, the gambler is the mobile (AWE) system, the arms are spatial locations (altitudes), and the rewards are performance outputs (power output or wind speed). Additionally, a few papers investigate the use of Gaussian process modeling to determine regret bounds in bandit problems [14,15]. Most of these papers specify bounds in terms of “big O” notation as functions of time. For practical purposes, this does not allow for calculable bounds that can be used to compare different control strategies. The papers that do provide calculable bounds assume *stationary rewards*, which do not account for temporally varying environments. This paper will focus on using statistical analyses to provide meaningful bounds on regret that are designed for temporally varying environments.

This paper first provides an empirical quantification of regret using real-time altitude setpoint optimization of an AWE system as a case study. We use wind speed versus altitude data to identify statistical *hyperparameters* of a Gaussian process (GP) model which are used to generate a synthetic wind dataset. For several altitude control strategies, including maximum probability of improvement (MPI), upper confidence bound (UCB), greedy, and baseline constant altitude strategies, we compute statistical regret bounds using a Monte Carlo simulation setup. In particular, we generate empirical cumulative distribution functions for regret under each control strategy. This represents the first rigorous effort at “scoring” each strategy. Because prior experimentally validated efforts have already addressed lower level tracking of altitude setpoints (see Ref. [16]), the content of this paper focuses exclusively on the higher level real-time altitude setpoint optimization.

These contributions represent an extended version of our recent conference publication [17]. We focus in this paper on extending our earlier work to generalize statistical regret bounds to environments characterized by different statistical parameters. We first use dimensional analysis to show that under a constant mean model, regret bounds can be expressed in terms of a small number of dimensionless parameters. To lift the constant mean assumption, we introduce the novel concept of an “effective spatial domain” size, which we show to be useful in generalizing regret bounds to spatiotemporally varying environments with different, nonconstant mean characteristics.

### 1.1 Notation.

In this paper, $Pr[A]$ represents the probability of event *A* occurring, *E*[*B*] represents the expected value of random variable *B*, and $N(C,D)$ represents a normal distribution with mean *C* and variance *D*.

## 2 Mathematical Preliminaries

In this paper, the variable *z*(*T*) will be used to represent the spatial decision variable (the chosen altitude in the AWE application), $v(z(T))$ will be used to represent the performance (wind speed as a surrogate for power generation) at the chosen *z*, for a given time-step, *T*. In a more general setting, *z*(*T*) will be a vector for multidimensional decision variables.

We begin by defining *regret*. Put simply, regret is the difference in performance between the optimal strategy given perfect knowledge and the strategy chosen by a controller based on imperfect knowledge. Mathematically,

$$r(T) = v(z^*(T)) - v(z(T))$$

where $z(T)$ is the location chosen by the control strategy, and $z^*(T)$ is the maximizer of the performance function. In most instances, cumulative regret and average regret are more informative statistics than instantaneous regret. These quantities are given by

$$R(T) = \sum_{k=1}^{T} r(k), \qquad R_{avg}(T) = \frac{R(T)}{T}$$

where $R(T)$ represents cumulative regret and $R_{avg}(T)$ represents average regret. These metrics are especially useful because it is often beneficial to sacrifice some instantaneous performance for the improved future performance that results from better exploration of the environment. This tradeoff is often referred to as exploration versus exploitation [18].
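As a concrete illustration, these three regret metrics can be computed from simulation logs as in the following sketch (the function and array names are illustrative, not taken from the original implementation):

```python
import numpy as np

def regret_metrics(v_opt, v_achieved):
    """Compute instantaneous, cumulative, and average regret.

    v_opt:      performance of the omniscient strategy, v(z*(T)), per time-step
    v_achieved: performance realized by the controller, v(z(T)), per time-step
    """
    r = np.asarray(v_opt) - np.asarray(v_achieved)  # instantaneous regret r(T)
    R = np.cumsum(r)                                # cumulative regret R(T)
    T = np.arange(1, len(r) + 1)
    R_avg = R / T                                   # average regret R_avg(T)
    return r, R, R_avg

# Example: optimal wind speed vs. realized wind speed (m/s)
r, R, R_avg = regret_metrics([10.0, 9.0, 8.0], [10.0, 7.0, 8.0])
```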

Having defined regret, we turn to the quantification of regret bounds. Because the spatiotemporally varying environment is stochastic, there always exists a chance that any control strategy will yield zero reward at any time. Consequently, it is impossible to derive an upper bound on regret (other than a trivial zero-reward bound) with 100% confidence. To circumvent this issue, we examine regret bounds in a *probabilistic* sense as encoded in the following axiom:

*For every* $\epsilon \in (0,1)$, $\tau > 0$, $\exists\, \delta(\epsilon) > 0$ *such that*

$$Pr[r(T) \le \delta(\epsilon)] \ge 1 - \epsilon \quad \forall\, T > \tau$$

*This relationship can also be expressed in terms of average regret:*

$$Pr[R_{avg}(T) \le \bar{\delta}(\bar{\epsilon})] \ge 1 - \bar{\epsilon} \quad \forall\, T > \tau$$

In words, this axiom states that there exists some upper bound *δ* on regret that holds with probability $1\u2212\epsilon $ (after some number of initial time steps *τ*).

A key contribution of this paper lies in the estimation of $\delta(\epsilon)$ (and $\bar{\delta}(\bar{\epsilon})$ in the case of average regret) for different control strategies; the regret bound acts as a scoring mechanism for each strategy. It is worth noting that the relationship between *δ* and $1-\epsilon$ is equivalent to the cumulative distribution function for regret. Two sample relationships between *δ* and *ε* are illustrated in Fig. 2. By definition, *ε* is bounded between 0 and 1. Given that low regret is desirable, an $\epsilon-\delta$ curve closer to the *ε*-axis describes a superior control strategy.
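Because the relationship between *δ* and $1-\epsilon$ is the CDF of regret, an empirical bound $\delta(\epsilon)$ can be read directly off Monte Carlo regret samples as a quantile. A minimal sketch (the sample values are illustrative):

```python
import numpy as np

def empirical_delta(regret_samples, eps):
    """Smallest delta such that Pr[regret <= delta] >= 1 - eps,
    estimated from Monte Carlo samples of regret."""
    return float(np.quantile(regret_samples, 1.0 - eps))

# Ten illustrative regret samples from hypothetical Monte Carlo runs
samples = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0])
delta_10 = empirical_delta(samples, eps=0.10)  # bound holding with 90% probability
```

Sweeping `eps` over (0, 1) traces out exactly the ε–δ curves of Fig. 2.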

### 2.1 Airborne Wind Energy Case Study.

For our AWE case study, regret represents the difference between the optimal value of the wind speed given omniscient knowledge and the achieved value of the wind speed under a given control algorithm. A regret bound in this case represents a measure of how close the performance of a given algorithm would be to the optimal omniscient performance in a stochastic environment. The goal of this paper is to define methods for calculating the probabilities with which a variety of regret bounds will hold, resulting in figures similar to Fig. 2.

## 3 Control Architectures

Each of the control strategies considered in this work relies on a statistical *prediction* of the performance at each candidate spatial location, *based on the data at spatial locations that have been visited up to that time*. In this work, we use GP modeling with a squared exponential covariance kernel and a power law wind shear model as the mean function to compute the prediction mean and variance. Readers are referred to Ref. [19] for full details of GP modeling and Ref. [17] for full details of the AWE application. For the AWE application, the hyperparameters used in the GP model were identified based on one year of training data from a 900-MHz Doppler wind profiler deployed in Lewes, Delaware [20], using the methodology from Ref. [19].
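As a hedged sketch of this prediction step, the following implements generic GP regression with a squared exponential kernel over (altitude, time) and a power law mean; the hyperparameter values are placeholders rather than the values identified from the Lewes data:

```python
import numpy as np

def sq_exp_kernel(a, b, sigma=1.0, l_z=100.0, l_t=1000.0):
    """Squared exponential covariance over (altitude, time) points.
    a, b: arrays of shape (n, 2) with columns [z, t]."""
    dz = a[:, None, 0] - b[None, :, 0]
    dt = a[:, None, 1] - b[None, :, 1]
    return sigma**2 * np.exp(-0.5 * (dz / l_z) ** 2 - 0.5 * (dt / l_t) ** 2)

def gp_predict(X_obs, y_obs, X_query, mean_fn, noise=1e-6):
    """Posterior mean and variance at X_query given observations (X_obs, y_obs)."""
    K = sq_exp_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = sq_exp_kernel(X_query, X_obs)
    Kss = sq_exp_kernel(X_query, X_query)
    alpha = np.linalg.solve(K, y_obs - mean_fn(X_obs))
    mu = mean_fn(X_query) + Ks @ alpha
    var = np.diag(Kss - Ks @ np.linalg.solve(K, Ks.T))
    return mu, var

# Power law wind shear mean model (illustrative parameters, not identified values)
mean_fn = lambda X: 8.0 * (X[:, 0] / 100.0) ** 0.15
```

At previously visited locations the posterior mean reproduces the observations and the posterior variance collapses toward the noise level, which is what drives the exploration incentive at unvisited altitudes.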

### 3.1 Candidate Control Strategies.

We consider three control strategies for the system's spatial location: MPI, UCB, and a “greedy” algorithm, with full details in Ref. [17]. These three control strategies are often used in the literature for similar applications (see, for example, Refs. [11,21,22]). We also introduce two benchmark strategies that maintain a constant spatial location (altitude, in the case of the AWE application), defined as “fixed location—omniscient” (FLO) and “fixed location—average” (FLA). FLA uses the constant location that results in average performance, and should thus be beaten by any reasonable control strategy. FLO uses the constant location that maximizes performance, and is potentially beatable due to the temporal variation of the reward function. Both FLA and FLO require omniscient knowledge.
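The acquisition functions can be sketched in their standard forms from the bandit literature (the exploration weight β, improvement margin ξ, and sample values below are illustrative, not the tuning used in Ref. [17]):

```python
import numpy as np
from math import erf, sqrt

def ucb(mu, var, beta=2.0):
    """Upper confidence bound: posterior mean plus weighted standard deviation."""
    return mu + beta * np.sqrt(var)

def mpi(mu, var, best_so_far, xi=0.01):
    """Maximum probability of improvement over the incumbent best observation."""
    std = np.sqrt(np.maximum(var, 1e-12))
    zscore = (mu - best_so_far - xi) / std
    return np.array([0.5 * (1 + erf(z / sqrt(2))) for z in zscore])

def greedy(mu, var):
    """Greedy: exploit the posterior mean only (no exploration incentive)."""
    return mu

# Each strategy picks the altitude index maximizing its acquisition value:
mu = np.array([8.0, 9.0, 8.5])   # posterior mean wind speed per altitude (m/s)
var = np.array([0.1, 0.1, 4.0])  # posterior variance per altitude
choice_ucb = int(np.argmax(ucb(mu, var)))        # favors high mean + uncertainty
choice_greedy = int(np.argmax(greedy(mu, var)))  # index of the highest mean only
```

Note how the high-variance third altitude attracts UCB while greedy ignores it; this is precisely the exploration incentive whose absence hurts the greedy strategy later in the paper.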

Figure 3 shows sample altitude trajectories followed by the AWE system under an omniscient optimal control strategy (where full spatiotemporal knowledge of the wind is assumed, thereby representing an upper bound on attainable performance) in the top plot, along with the UCB, MPI, and greedy strategies in the bottom three plots, respectively. The contours represent the *on-board estimates* of wind speed versus altitude, which in the case of the bottom three plots are based on a partially observable wind field. It can be seen that the greedy approach, which does not incentivize exploration, maintains the worst estimate of the wind profile.

## 4 Generating Synthetic Data

Although data are available for certain applications, including wind speed versus altitude data for the AWE application, the available data are not sufficient for running large-scale Monte Carlo simulations of competing algorithms, nor do they lend insight into how well the regret bound estimates will extrapolate to other applications for which data are not available. To generate a vast amount of data with customizable spatial and temporal qualities, we leverage existing data to extract spatiotemporal statistical parameters, then generate a very large synthetic dataset that adheres to these identified parameters.

Using the identified covariance kernel parameters, synthetic data are generated by applying a coloring filter to Gaussian white noise, ensuring consistent covariance matrices between the generated data and the initial training data. This is done by using the Cholesky decomposition of the desired covariance matrix as a coloring filter, according to the following process. It is important to note that white noise is used as a starting point in the process to capture the variance in the environmental variable (wind speed in the AWE example), *not* to reflect noise in the measurements available to the control system. The coloring filters are used to capture spatial and temporal correlation.

First, a vector of Gaussian white noise, $w \sim N(0, I_n)$, is generated, where *n* is the length of coordinate vector *x*. Next, the coloring filter $L(x)$ is calculated from the Cholesky decomposition of the covariance matrix, $K(x,x) = L(x)L(x)^T$. Here, *x* is a vector (or matrix) spanning the (possibly multidimensional) domain of the desired synthetic data, and $K(x,x)$ is the associated square covariance matrix between all points in *x*. The white noise is then multiplied by *L* to correlate the noise according to the desired covariance matrix, $v_N = L(x)w$. Finally, the correlated noise $v_N$ is overlaid on the mean model $m(x)$ to obtain a set of synthetic data that can be used for simulation, $v_{syn} = m(x) + v_N$.
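The coloring-filter process above can be sketched as follows (a minimal one-dimensional implementation under the stated assumptions; the kernel and mean parameters are placeholders):

```python
import numpy as np

def synthesize(x, mean_fn, kernel_fn, rng=np.random.default_rng(0), jitter=1e-6):
    """Generate synthetic data with covariance K(x, x) and mean m(x)
    by coloring Gaussian white noise with the Cholesky factor of K."""
    K = kernel_fn(x, x) + jitter * np.eye(len(x))  # desired covariance matrix
    L = np.linalg.cholesky(K)                      # coloring filter, K = L L^T
    w = rng.standard_normal(len(x))                # white noise, w ~ N(0, I_n)
    v_N = L @ w                                    # correlated noise
    return mean_fn(x) + v_N                        # v_syn = m(x) + v_N

# Illustrative one-dimensional example: squared exponential kernel over altitude
kernel = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / 50.0) ** 2)
mean = lambda z: 8.0 * (z / 100.0) ** 0.15         # power law wind shear mean
z = np.linspace(100.0, 500.0, 30)
v_syn = synthesize(z, mean, kernel)
```

The small diagonal jitter is a standard numerical safeguard so that the Cholesky factorization of the (nearly singular) squared exponential Gram matrix succeeds.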

## 5 Dimensional Analysis for Extrapolation to Other Environments

One goal of determining regret bounds is to extrapolate those bounds to environments with *different* sets of hyperparameters, under a given controller. Here, we consider this extrapolation in the case of constant mean environments, whereas Sec. 7 introduces an additional variable that allows for the generalization to variable mean environments. To start this process, we begin with the following Proposition, which follows from the Buckingham Pi Theorem [23]:

*By the Buckingham Pi Theorem, the interdependence of the variables in Table 1 (which describe the system under consideration in this work, under a given controller) can be modeled as in Eqs. (10) and (11)*

$$\Pi_1 = f(\Pi_2, \Pi_3, \Pi_4)$$

$$\frac{E(R_{avg}(T))\, l_t}{l_z} = f\!\left(\frac{\sigma l_t}{l_z}, \frac{\delta t}{l_t}, \frac{z_{max} - z_{min}}{l_z}\right)$$

*where* $\Pi_1, \Pi_2, \Pi_3, \Pi_4$ *are given in Table 2.*

| Variable | Dimensions |
|---|---|
| $E(R_{avg}(T))$ | m/s |
| $\sigma$ | m/s |
| $\delta t$ | s |
| $z_{max} - z_{min}$ | m |
| $l_t$ | s |
| $l_z$ | m |


Note that $E(\xb7)$ denotes an expectation, which is required for dimensional analysis, since $Ravg(T)$ is a random variable, whereas the other five variables are deterministic.

| Pi group | Variables |
|---|---|
| $\Pi_1$ | $E(R_{avg}(T))\, l_t / l_z$ |
| $\Pi_2$ | $\sigma l_t / l_z$ |
| $\Pi_3$ | $\delta t / l_t$ |
| $\Pi_4$ | $(z_{max} - z_{min}) / l_z$ |


*Proof*. Under a given controller, the system under consideration in this work involves *n* = 6 variables, namely, expected average regret ($E(R_{avg}(T))$), the size of the spatial domain ($z_{max} - z_{min}$), the control time-step ($\delta t$), and the environmental hyperparameters ($\sigma$, $l_t$, and $l_z$). Across these variables, we have *k* = 2 fundamental dimensions, namely, length and time (see Table 1). By the Buckingham Pi Theorem, the number of dimensionless Pi groups, $\Pi_i$, in the governing equation is $p = n - k = 4$. Therefore, there exists a function $g(\Pi_1, \Pi_2, \Pi_3, \Pi_4)$ for which

$$g(\Pi_1, \Pi_2, \Pi_3, \Pi_4) = 0$$

By rearranging, we get Eq. (10). By taking $l_t$ and $l_z$ as fundamental quantities and $E(R_{avg}(T))$, $z_{max} - z_{min}$, $\delta t$, and $\sigma$ as derived quantities, we can calculate $\Pi_1, \Pi_2, \Pi_3, \Pi_4$ by nondimensionalizing with $l_t$ and $l_z$ (see Table 2). This yields Eq. (11). ▪
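The nondimensionalization in the proof amounts to a direct computation, sketched below (the numerical values are illustrative; the point is that two dimensionally similar settings produce identical Pi groups):

```python
def pi_groups(E_R_avg, sigma, dt, z_span, l_t, l_z):
    """Dimensionless groups from Table 2, using l_t (time) and l_z (length)
    as the fundamental quantities."""
    pi1 = E_R_avg * l_t / l_z  # expected average regret (m/s) scaled by l_t/l_z
    pi2 = sigma * l_t / l_z    # environmental std dev (m/s) scaled by l_t/l_z
    pi3 = dt / l_t             # control time-step / temporal length scale
    pi4 = z_span / l_z         # spatial domain size / spatial length scale
    return pi1, pi2, pi3, pi4

# Two dimensionally similar environments (all lengths and times halved)
# yield identical Pi groups, and hence, by the proposition, the same regret:
a = pi_groups(0.5, 2.0, 600.0, 400.0, 3600.0, 200.0)
b = pi_groups(0.5, 2.0, 300.0, 200.0, 1800.0, 100.0)
```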

A notable observation is that expected average regret varies linearly with *σ*, the standard deviation of the reward in the environment, as is corroborated by Fig. 4. In this figure, each of the lines corresponds to a single set of system and environment parameters ($\delta t$, $z_{max} - z_{min}$, $l_t$, and $l_z$) with variable *σ*. Each of the lines tested has a coefficient of determination of at least 0.995, strongly suggesting that regret does in fact vary linearly with *σ*. The consequence of this observation is that $\Pi_2$ can be factored out of the unknown function, and $l_t$ and $l_z$ can be canceled, resulting in Eq. (13)

$$\frac{E(R_{avg}(T))}{\sigma} = h\!\left(\frac{\delta t}{l_t}, \frac{z_{max} - z_{min}}{l_z}\right)$$

To show that the theoretical results from the dimensional analysis hold up in simulation, Figs. 5 and 6 show the individual effects of the spatial and temporal parameters on regret. Clustering of points in these plots suggests that regret depends only on the values of the nondimensional spatial and temporal parameters, regardless of the individual values of domain size, time-step, and either length scale.

Figure 5 shows how regret varies when varying both domain size and spatial length scale. Specifically, either increasing the size of the spatial domain or decreasing the spatial length scale increases average regret, as expected.

Figure 6 is similar to Fig. 5 but examines the effects of the temporal parameter instead of the spatial parameter. Here, decreasing mobility (by increasing the control time-step) and shortening the temporal length scale both cause regret to grow, as expected. Note that regret appears to approach an asymptote in this plot. This is because high values of the temporal parameter mean that successive data points are essentially uncorrelated, due to the severity of the temporal variation in the environment. Exploration is no longer useful at this point, and further increases in $\delta t$ or decreases in $l_t$ do not further increase regret.

## 6 Effective Spatial Domain

A number of applications, including the AWE system, are characterized by mean functions that vary spatially. We present the concept of an effective spatial domain, a useful tool for generalizing results to nonconstant mean environments. Figure 7 shows three separate mean functions, one of which is constant. A high-performing algorithm will prioritize visiting locations where the mean function is high. Thus, in environments with larger spatial variability in the mean function (e.g., the dotted curve in Fig. 7), an algorithm can focus on a smaller subset of the full spatial domain while still achieving strong performance. Consequently, for a given spatial domain size, larger spatial variation in the mean results in a smaller *effective spatial domain* that must be explored.

To quantify the concept of an effective spatial domain and begin to understand how it impacts average regret, recall the spatial parameter $(z_{max} - z_{min})/l_z$ presented in Eq. (11). This parameter can be thought of as a nondimensional measure of the size of the domain in which the control system may operate. Because certain environments make only a smaller fraction of the domain worth visiting, this parameter should be scaled to account for different mean function models, as described through the following Conjecture:

*Conjecture 1. The dimensional analysis-based dynamic similarity results of* Sec. 5 *can be extended to environments with a spatially varying mean function m(z): Eq. (13) can be replaced by*

$$\frac{E(R_{avg}(T))}{\sigma} = h\!\left(\frac{\delta t}{l_t}, k_{eff}\,\frac{z_{max} - z_{min}}{l_z}\right)$$

*using a modified nondimensional spatial parameter given by* $k_{eff}(z_{max} - z_{min})/l_z$*, where* $k_{eff}$ *is the effective domain multiplier and can be calculated as*

$$k_{eff} = \frac{2}{z_{max} - z_{min}} \int_{z_{min}}^{z_{max}} Pr[v(z, t_a) > \mu_{max}]\, dz$$

*Here, $t_a$ is some arbitrary time,* $[z_{min}, z_{max}]$ *is the spatial domain, and* $\mu_{max}$ *is the maximum of the mean function (as opposed to the maximum of the actual data). For a power law mean (and any other strictly increasing mean), this is simply the value of the mean function at* $z_{max}$.

Equation (15) describes a method for calculating the tail of the probability distribution of the environmental variable that lies above the maximum value of the mean function of that variable, scaled such that the maximum value of $k_{eff}$ is unity. It can be thought of as a probabilistic measure of the fraction of the spatial domain likely to be visited by an efficient control algorithm.

A visualization of this calculation for the solid mean profile from Fig. 7 can be seen in Fig. 8. This mean function is a sample of the traditionally accepted power law mean wind shear profile. Shown in blue and green in Fig. 8 are sample probability density functions of the environmental resource at different altitudes. The highlighted area under each function corresponds to the probability that the resource at that altitude will be higher than the maximum value of the mean function; this is the integrand of Eq. (15). Integrating this quantity and dividing by the domain size averages this probability across all altitudes. Noting that the maximum value of this averaged probability is 1/2 for a symmetric probability distribution, we multiply by 2 so that values of $k_{eff}$ fall in the range $[0,1]$.
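Assuming, for illustration, a Gaussian distribution of the resource at each altitude with standard deviation *σ*, the calculation behind $k_{eff}$ can be sketched numerically (the mean-model parameters are placeholders):

```python
import numpy as np
from math import erf, sqrt

def k_eff(mean_fn, sigma, z_min, z_max, n=2000):
    """Effective domain multiplier: twice the domain-averaged probability
    that the resource exceeds the maximum of the mean function (Eq. (15))."""
    z = np.linspace(z_min, z_max, n)
    mu = mean_fn(z)
    mu_max = mu.max()
    # Pr[v(z, t_a) > mu_max] for v(z, t_a) ~ N(mu(z), sigma^2)
    tail = np.array([0.5 * (1 - erf((mu_max - m) / (sigma * sqrt(2)))) for m in mu])
    # Domain-average on a uniform grid approximates (1/(z_max - z_min)) * integral
    return 2.0 * tail.mean()

# A constant mean gives k_eff = 1 (the tail probability is 1/2 everywhere);
# a power law concentrates favorable conditions near z_max, shrinking k_eff.
k_const = k_eff(lambda z: 8.0 + 0.0 * z, sigma=2.0, z_min=100.0, z_max=500.0)
k_power = k_eff(lambda z: 8.0 * (z / 100.0) ** 0.15, sigma=2.0, z_min=100.0, z_max=500.0)
```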

Figure 9 shows contours of regret versus mean function parameters on the left and plots of $k_{eff}$ over the same domain on the right. The parameters varied are the leading coefficient $v_0$ and the exponent *b* from a power law wind shear model, $v(z) = v_0 (z/z_{ref})^b$. By inspection, lines of constant regret in the left plot closely track the lines of constant effective spatial domain in the right plot, lending credibility to Conjecture 1. The same data are plotted as a scatter plot in Fig. 10. This plot shows a strong correlation between regret and the effective domain multiplier for power law mean models, with a Spearman rank correlation coefficient of 0.984, strongly supporting the validity of this approach for power law mean profiles.

To further analyze the relationship between the effective domain and regret, Fig. 11 shows lines of constant effective domain size with variable power law coefficients and exponents. The horizontal axis plots the power law leading coefficient $v_0$, but the exponent *b* varies as well to keep the effective domain constant along each line. The fact that each of these lines is nearly horizontal strongly supports the hypothesis that effective domain size can replace domain size to account for nonconstant mean functions. That is, for nonconstant mean environments, Eq. (13) can be replaced by Eq. (14).

Figure 12 shows Monte Carlo simulation results of regret versus effective domain for three separate mean models. All three follow near-identical trends. The fact that three completely independent mean models have results that match when accounting for effective domain size shows the generalizability of the effective domain parameter in accounting for varying mean models in regret bound calculations.

## 7 Empirical Comparison of Control Strategies—Airborne Wind Energy System

We now return to our AWE system case study, namely, the Altaeros BAT. Using each of the acquisition functions defined above, 100,000 simulations were run, each with a different set of synthetic wind data generated using the standard power law wind shear model and the covariance and mean function hyperparameters given in Sec. 3.

For every set of synthetic data, each control strategy's performance was simulated independently. Instantaneous and average regret were logged for each run of each control strategy. These synthetic data were used to generate the regret bound comparisons of Figs. 13 and 14, which describe the relationship between the upper bound on regret, *δ*, and its associated probability, $1\u2212\epsilon $, as defined in Eqs. (4) and (5).

Several conclusions can be drawn from these results. First, each algorithm outperforms the average stationary strategy (FLA) by a significant margin, so each has some merit. However, the greedy strategy fails to outperform the omniscient stationary strategy (FLO), reinforcing the idea that the explicit incentivization of exploration in the acquisition function is important to long-term exploitation. Both UCB and MPI outperform the omniscient stationary strategy (FLO) by a significant margin, showing that each is an effective algorithm in a wide range of cases.

Both plots have an asymptote along the *δ* axis, since there is no way to provide a hard finite upper bound on regret. Furthermore, each graph has a different trend as it approaches *ε* = 1. Because it is possible to obtain zero instantaneous regret (in a discretized spatial environment), there is a range of *ε* over which *δ* is zero for each algorithm. In contrast, in the average regret plots, $\bar{\delta}$ increases extremely quickly in going from $\bar{\epsilon} = 1$ to $\bar{\epsilon} = 0.99$, because it is nearly impossible to obtain zero cumulative (and therefore average) regret.

## 8 Conclusions

Novel contributions from this paper include dimensional analysis for the expected form of regret bounds in a nonstationary environment as well as the introduction of an “effective spatial domain” multiplier to account for nonzero environmental means. The dimensional analysis showed that for any environment that can be classified by a variance, a spatial parameter, and a temporal parameter, the regret incurred through operation in such an environment can also be classified as a function of three similar, nondimensionalized parameters. This analysis also considered the mean model of the environment, characterizing an “effective spatial domain” multiplier that allows for a variety of mean models to be characterized under the single analysis in this paper. Real-time optimization algorithms and regret bound results were concretely assessed based on an AWE system, focusing on several candidate control algorithms (UCB, MPI, and greedy).

## Funding Data

National Science Foundation Award Number 1711579, entitled “Collaborative Research: Multi-Scale, Multi-Rate Spatiotemporal Optimal Control with Application to Airborne Wind Energy Systems” (Funder ID: 10.13039/100000001).