Abstract
Low-fidelity engineering-level dynamic models are commonly employed when designing uncrewed aircraft flight controllers due to their rapid development and cost-effectiveness. However, during adverse conditions or complex path-following missions, uncertainties in low-fidelity models often result in suboptimal controller performance. Aircraft system identification techniques offer alternative methods for finding higher-fidelity dynamic models but can be restrictive in flight test requirements and procedures. This challenge is exacerbated when there is no pilot onboard. This work introduces data-driven machine learning (ML) to enhance the fidelity of aircraft dynamic models, overcoming the limitations of conventional system identification. A large dataset from twelve previous flights is utilized within an ML framework to create a long short-term memory (LSTM) model for the aircraft's lateral-directional dynamics. A deep reinforcement learning (RL)-based flight controller is developed using a randomized dynamic domain created from the LSTM and physics-based models to quantify the impact of the LSTM dynamic model improvements on controller performance. The RL controller is compared to other modern controller techniques in four actual flight tests in the presence of exogenous disturbances and noise, assessing its tracking capabilities and its ability to reject disturbances. The RL controller with a randomized dynamic domain outperforms an RL controller trained using only the engineering-level dynamic model, a linear quadratic regulator controller, and an L1 adaptive controller. Notably, it demonstrates up to 72% improvement in lateral tracking when the aircraft follows challenging paths and during intentional adverse onboard conditions.
1 Introduction
Autonomous aircraft flight control has enabled a wide range of capabilities, from reducing the workload on human pilots to allowing fully autonomous uncrewed flight missions. As the complexity of missions increases, the dependency on the capabilities of flight control systems grows. The symbiotic relationship between the flight controller and the aircraft dynamic model motivates the development of higher-fidelity aircraft models that support advanced flight controllers.
High-fidelity aircraft modeling techniques, such as wind tunnel tests and computational fluid dynamics analysis, are commonly employed for high-sensitivity applications, such as transport aircraft, fighters, and business jets. However, for small uncrewed aircraft systems (UAS), the utilization of these high-fidelity modeling methods can be cost-prohibitive.
For UASs, an alternative solution to reduce modeling costs is to rely on conceptual-level low-fidelity dynamic models. These include dynamic models developed using relatively simple theoretical methods [1,2]. Such models have been successfully implemented for designing modern flight controllers, even sophisticated deep reinforcement learning (RL) controllers. However, inherent uncertainties in the low-fidelity models adversely impact flight controller performance. These uncertainties can result in predicting the aircraft motion with incorrect magnitudes, incorrect time delays, or even incorrect motion trends [3]. The growing complexity of UAS missions demands better dynamic models and flight controllers.
A commonly used method to enhance aircraft dynamic models' fidelity is system identification using flight test data. System identification has been studied for many years, and a wide range of analytical methods exist, as presented in the references cited here [4–6]. This technique allows enhancing aircraft simulation models based on data observed in the actual flight environment. The method can be used even for improving the simulation of complex flight conditions such as rotorcraft operation in ship air wake gusts [7].
Commonly, for modeling the base flight dynamics of a fixed-wing aircraft, system identification techniques use relatively short segments of flight consisting of specially designed input maneuvers. System identification methodology, as presented in standard textbooks, calls for performing the designed flight maneuvers in low wind conditions (e.g., early morning) to reduce the impact of wind disturbances on modeling the aircraft flight dynamics [4]. In system identification experiments and flight maneuvers, inputs are tailored to specific frequencies, amplitudes, and shapes (e.g., singlet, doublet, multistep) to excite aircraft states or dynamic modes effectively. Such careful design ensures mode excitation and prevents coupling between dynamic modes. However, these requirements make the conventional system identification approach restrictive because commonly collected flight test data cannot be utilized for system identification purposes, necessitating gathering specific flight data.
The rapid advancements in data-driven machine learning (ML) techniques have provided a new opportunity to enhance the fidelity of aircraft dynamic models using commonly collected flight test data, overcoming the limitations of the conventional system identification approach. In this work, we use data already available from multiple previous flights to improve the fidelity of the UAS model. We use data-driven ML techniques, incorporating a long short-term memory (LSTM) recurrent neural network (RNN) architecture. LSTM RNNs have memory elements that are not present in standard neural networks. These memory elements allow LSTM models to make predictions based on information in the previous sequence of input data, instead of only making predictions based on information from a single previous time-step.
Aircraft system identification using neural networks has been done in previous works such as [4,8,9], where the end goal was to obtain aircraft stability and control derivatives. However, in our previous experience, getting consistent stability and control derivatives for a small UAS was challenging [10]. In this work, our objective is not to obtain stability and control derivatives. Instead, we aim to utilize neural networks for modeling the input–output relationships, mimicking the dynamic behavior of the aircraft, as in Refs. [11,12].
We develop a model that takes aircraft states and controls at previous time-step(s) as inputs and uses them to predict the next time-step states. In Ref. [11], a small feed-forward (memoryless) neural network is used along with a moving data window concept to perform online modeling of aircraft translational acceleration. In Ref. [12], RNNs are used to model (i.e., predict) aircraft rotational rates and translational velocities at different Mach numbers and altitudes. However, in that work, training is performed on simulation data due to the difficulty of obtaining a large set of flight test data that covers the entire flight envelope. We train an LSTM RNN using actual flight test data to capture the actual flight dynamics. Unlike an online modeling approach, we focus on offline modeling using a large set of data to generate a model that works over a larger portion of the flight envelope.
Using LSTM environment models for training RL agents has been explored in nonaerospace model-based RL applications such as developing recommender systems [13] and a robotic assistant for elderly people [14]. Ref. [15] developed a (non-LSTM) neural network world model and used it for training a simplified discrete “controller” for aircraft air-to-air engagement scenarios. However, aside from being a non-LSTM model, the scope of the dynamic model is limited to position and velocity modeling and does not account for aircraft forces and moments. Also, the developed controller outputs are limited to general commands of turn left, turn right, speed up, slow down, etc., not actual servo deflection commands. To the best of our knowledge, using an LSTM RNN as the model for training RL-based aircraft inner-loop controllers has not been done in previous research.
Deep RL has been the subject of much recent research in robotics and uncrewed systems [16–19]. This work expands on the recent application of RL techniques to fixed-wing aircraft flight control. In flight control, RL techniques have been mainly focused on rotary wing applications and simulation-based validation. Examples of application of RL to rotary wing flight control include [20,21], which were performed in simulations, and [22,23] which include actual flight tests. For fixed-wing aircraft, different RL algorithms have been applied recently to flight control, but the works were only validated in simulations. Notable examples are Refs. [24,25], which use an actor-critic design framework, Ref. [26], which uses the soft actor-critic algorithm, Ref. [27], which uses the twin-delayed deep deterministic policy gradient (TD3) algorithm, and Ref. [28], which uses the normalized advantage function Q-learning algorithm. Recent applications of RL algorithms to fixed-wing UASs include [29] for active stall protection and [30] for perched landing.
A few recent works include the application of deep RL to fixed-wing UAS inner-loop flight control while including actual flight test validations [31–35]. The deep deterministic policy gradient algorithm is used in Refs. [31,32], while the proximal policy optimization (PPO) algorithm is used in Refs. [33,34]. References [31–34] develop flight controllers for longitudinal motion pitch angle and airspeed tracking, and controller training is done using low-fidelity physics-based aircraft dynamic models. Very recently, Ref. [35] used the soft actor-critic algorithm for attitude (roll and pitch) control of a fixed-wing UAS, where the controller was trained on an aircraft model developed using wind tunnel testing and computational fluid dynamics.
The gap in the controller's performance between simulation and the actual environment is often referred to as the “reality gap” [36] and is an active research area in robotics and AI/ML. In our previous studies [33,34], we utilized an approach called domain randomization [37–39] to improve the generalization performance of the control policy and showed that the RL-based controller developed in simulation was robust and verifiable in the actual environment. In domain randomization, some environment parameters are randomized in simulation during controller development, exposing the controller to possible variations. For instance, Refs. [33,34] used estimated uncertainties in model parameters, control delays, sensor noises, and wind disturbances to enrich the training environment. The concept is closely related to robust control design. However, in robust control design, only the worst-case variation is considered. Accurate a priori knowledge of the worst-case variation is difficult to obtain, so often an overestimated upper bound is used in favor of safety and at the expense of performance. In contrast, domain randomization does not focus only on the worst case but considers the whole range of parameter values and attempts to introduce enough variability in the simulator that it overlaps significantly with real-world variations. Although quantifying the true reality gap is intractable, its upper bound can be estimated from the simulated data and is reduced with an increasing number of samples from the distributions [40].
Deep RL-based controller development, in general, yields a model that is large, complex, and opaque. Therefore, it is difficult to establish safety guarantees for such controllers using existing analytical methods. However, probabilistic guarantees can be established through methods such as formal verification [41–43], which checks the correctness of the model outputs using a simplified model. Reference [31] proposed a monitoring algorithm based on formal verification that automatically switches from a primary controller (deep RL) to a secondary controller (LQR) if a predicted state, given the primary controller output, falls outside the safe zone for the secondary controller. However, the proposed formal verification method relies heavily on the accuracy of a linear-time-invariant (LTI) dynamic model and requires a separate secondary controller that runs in parallel. In addition, the controller switching can result in unintended outcomes such as oscillations. An alternative method is safe reinforcement learning (safe RL) (e.g., Ref. [44]), which attempts to acquire a safe policy out of the box by factoring safety objectives into the training process.
In this work, we introduce data-driven ML methods to improve the fidelity of an aircraft dynamic model. We use a bank of collected flight data in an ML framework, overcoming conventional system identification restrictions. The work also contributes to designing a verifiable deep RL-based lateral-directional flight controller. We follow the approach of safe RL through rigorous training and testing of the controller to ensure safe control. To that end, the controller synthesis step utilizes both the developed LSTM-based and physics-based dynamic models to form a randomized and physics-informed RL training environment to improve robustness toward modeling uncertainties. A novel RL reward function is used that includes safety components, such as constraints on control rates, which have been shown to be effective in improving the stability and robustness of the controller. In addition, an LSTM layer is added to the controller architecture to enhance the controller's adaptive performance. Several challenging actual flight tests are conducted to assess the controller performance and the improvements in the aircraft dynamic model. The flight tests are conducted on different days with different wind conditions and in the presence of sensor noise, demonstrating the ability of the controller to control the aircraft in the presence of exogenous disturbances.
2 Testbed Aircraft: The SkyHunter
The testbed aircraft used in this work is the SkyHunter UAS presented in Fig. 1. The SkyHunter is a fixed-wing UAS featuring a twin tailboom design and uses a single pusher electric motor. The UAS has a 1.8-meter wingspan, a length of 1.4 meters, and a weight of 4 kilograms. The SkyHunter has elevator, aileron, and rudder control surfaces.
3 Physics-Based Dynamic Model
The lateral-directional dynamics are represented by a linear time-invariant (LTI) state-space model, $\dot{X} = AX + BU$, where A and B are matrices consisting of the linear model coefficients, X is the state vector containing the perturbed lateral-directional states (β, φ, P, and R), U is the control vector containing the perturbed controls (δa and δr), and $\dot{X}$ is the time rate of change of the states.
A nonlinear six-degree-of-freedom (6DOF) model is also used, based on the rigid-body equations of motion $m(\dot{V} + \omega \times V) = F$ and $I\dot{\omega} + \omega \times (I\omega) = M$, where m is the mass of the aircraft, I is the moment of inertia matrix, F is the vector of aerodynamic, propulsive, and gravitational forces, M is the vector of aerodynamic and propulsive moments, and V and ω are the translational and rotational velocity vectors, respectively. All quantities are in the aircraft body coordinate system. The models were developed using geometric and mass measurements from the airframe. Aerodynamic forces and moments are modeled based on stability and control derivatives estimated using Advanced Aircraft Analysis (AAA) software [2]. AAA is an aircraft design software widely used by the aerospace industry and academia for decades [45–48]. AAA is based on the analytical methods presented in Refs. [49,50] and the U.S. Air Force Stability and Control DATCOM [51]. The quality of the AAA physics-based model of the SkyHunter was further improved using actual flight data and tuning methods presented in Ref. [52]. The improved SkyHunter physics-based model was used in developing guidance and control algorithms, which were tested in actual flight tests, where they outperformed widely used open-source autopilot software (e.g., Pixhawk) [34,53–55]. Further details on the SkyHunter model are published in Ref. [3], along with an analysis of the validity of the physics-based 6DOF model under different flight conditions, including loss of control and stall testing scenarios. In this work, we use machine learning techniques to improve the fidelity of the aircraft dynamic model.
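For reference, a minimal sketch of propagating the LTI lateral-directional model one step in simulation is shown below; the forward-Euler integration scheme and the 0.05 s step size are illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np

def lti_step(x, u, A, B, dt=0.05):
    """Propagate the perturbation states of the LTI model x_dot = A x + B u
    by one time-step using forward-Euler integration (an assumed scheme)."""
    return x + dt * (A @ x + B @ u)

# x = [beta, phi, P, R] perturbations, u = [delta_a, delta_r] perturbations,
# with A (4x4) and B (4x2) taken from the physics-based model.
```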
4 Data-Driven Modeling of Aircraft Dynamics Using Machine Learning
Machine learning provides a way to use data for modeling different processes. We use a bank of data from 12 flight tests to improve the lateral-directional dynamics model of the SkyHunter. These 12 flight tests amount to a total of 218,085 data points (182 min of flight time) which are used for training, validating and testing the developed ML models. The explored model architectures, the used flight test data, the training setup, and the modeling results are presented in this section.
4.1 Model Architecture.
In this work, our goal is to develop a model for the lateral-directional motion. The model takes previous state and control values as inputs and outputs predictions of future states. For the lateral-directional motion, the aircraft states of interest are the sideslip angle (β), the roll angle (φ), the roll rate (P), and the yaw rate (R). The control inputs of interest are the aileron and rudder deflections (δa and δr, respectively). Thus, the developed model has the inputs and outputs shown in Fig. 2(a).
The simplest model considered is a linear model, $\hat{y}_{t+1} = W x_t$. Here, the model inputs are the states and controls at the current time-step, $x_t = [\beta_t, \phi_t, P_t, R_t, \delta_{a,t}, \delta_{r,t}]$, and the model outputs are the states at the next time-step, $\hat{y}_{t+1} = [\hat{\beta}_{t+1}, \hat{\phi}_{t+1}, \hat{P}_{t+1}, \hat{R}_{t+1}]$. W represents the weights matrix. We do not include a bias term in the linear model.
A multilayer perceptron (MLP) model with feed-forward hidden layers is also developed. In the MLP model, $W_i$ and $b_i$ represent the weights and biases in each layer, and $h_i$ represents the hidden layer outputs. The subscripts indicate layer numbers.
A popular variant of RNN models, known as the long short-term memory (LSTM) RNN, is used in this work. The LSTM structure is designed to have “gates,” which manage which information is kept or forgotten from the observed sequential input data. Detailed information about the mathematics of the LSTM model is available in Ref. [56]. We develop two LSTM models. The first LSTM model directly predicts the next time-step outputs (like the linear and MLP models). The second LSTM model (referred to as ResLSTM) uses a framework that predicts the residuals; i.e., instead of predicting the next time-step outputs, the model predicts how much the next step changes from the current step. The LSTM and ResLSTM networks used in this work have one hidden layer with 32 units.
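To make the architecture concrete, the sketch below shows one way the LSTM and ResLSTM models could be built in TensorFlow/Keras. The input window length and tensor layout are assumptions for illustration; only the single 32-unit LSTM layer and the residual (state-increment) prediction reflect the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN = 1   # length of the input history window (illustrative assumption)
N_IN = 6      # inputs: beta, phi, P, R, delta_a, delta_r at previous time-step(s)
N_OUT = 4     # outputs: beta, phi, P, R at the next time-step

def build_lstm_model(residual=False):
    """One hidden LSTM layer with 32 units. With residual=True (ResLSTM),
    the network predicts the change in the states and adds it to the
    most recent state contained in the input window."""
    x_in = layers.Input(shape=(SEQ_LEN, N_IN))
    h = layers.LSTM(32)(x_in)
    y = layers.Dense(N_OUT)(h)
    if residual:
        last_state = layers.Lambda(lambda x: x[:, -1, :N_OUT])(x_in)
        y = layers.Add()([last_state, y])
    return tf.keras.Model(x_in, y)

lstm_model = build_lstm_model(residual=False)
res_lstm_model = build_lstm_model(residual=True)
```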
4.2 Flight Data Used For Modeling.
A collection of data from 12 flight tests is used in this work. The flight data covers flight from the takeoff ascent to cruise flight to landing descent. This does not follow the standard aircraft system identification procedure where the flight data needs to be collected from specifically designed maneuvers. Instead, we use a bank of flight data already collected from normal aircraft operation from different phases of flight. We propose using flight data without specifically performing maneuvers for separately exciting the aileron and rudder controls and without separately exciting each of the lateral-directional modes (Dutch-roll, roll mode, and spiral mode).
For training, we use data from nine flight tests. These data are directly used for training the ML models' weights and biases. Data from two flight tests are used as the validation set. The validation set is used to evaluate when the model training should be stopped to avoid overfitting. Training is stopped when the loss (the mean squared error) evaluated on the validation set does not improve for two consecutive training epochs. One flight test is used as the test dataset. This test dataset is used to evaluate model performance on data that was not used to train the model. Table 1 and Fig. 4 present the distribution of data in the training, validation, and test datasets. The number of data points in the training, validation, and test datasets are 162,754, 38,117, and 17,214 points, respectively, which is equivalent to 135, 31, and 14 min of flight, respectively, given the 20 Hz sampling rate.
Statistic | Dataset | Sideslip angle β (deg) | Roll angle φ (deg) | Roll rate P (deg/s) | Yaw rate R (deg/s) | Aileron δa (deg) | Rudder δr (deg)
---|---|---|---|---|---|---|---
Mean | Train | −0.12 | −8.32 | −0.01 | −3.87 | −0.13 | 0.51
Mean | Validation | −0.21 | −9.32 | 0.01 | −4.62 | −0.29 | 0.64
Mean | Test | 0.02 | −7.87 | −0.04 | −4.17 | −0.04 | 0.50
Std. | Train | 1.01 | 16.80 | 13.48 | 8.67 | 0.78 | 1.13
Std. | Validation | 0.67 | 14.47 | 9.88 | 6.97 | 0.57 | 0.93
Std. | Test | 0.79 | 15.11 | 11.07 | 7.27 | 0.50 | 0.79
Min. | Train | −10.95 | −73.23 | −199.40 | −60.14 | −17.59 | −6.97
Min. | Validation | −5.53 | −55.91 | −66.87 | −41.49 | −4.50 | −5.40
Min. | Test | −5.85 | −54.28 | −127.17 | −34.96 | −5.54 | −3.00
Max. | Train | 12.00 | 139.19 | 190.91 | 53.40 | 14.27 | 13.72
Max. | Validation | 4.27 | 58.15 | 73.48 | 25.84 | 3.82 | 7.80
Max. | Test | 6.75 | 67.41 | 104.92 | 39.75 | 6.08 | 2.63
Data from each flight test were inspected before training to check its quality. The trim aileron and rudder values were identified in each flight and subtracted from the recorded aileron and rudder deflections. In several flights, a bias was identified and removed from the sideslip angle estimations recorded in the flight data. The bias in sideslip angles was related to errors in the aileron and rudder trim settings. Correcting these trim errors and rerunning the sideslip angle estimation Kalman filter offline helped correct the bias in the sideslip angles. All angular values and angular rates were converted to radians and radians per second before training. Table 1 and Fig. 4 present the flight data after the trim and bias corrections.
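A minimal sketch of these preprocessing steps is given below, assuming the flight logs are loaded as pandas DataFrames with placeholder column names; only the trim subtraction, the degree-to-radian conversion, and the one-step-ahead input/target pairing follow the procedure described above.

```python
import numpy as np

def preprocess_flight(flight, trim_da_deg, trim_dr_deg):
    """Subtract control trims, convert to radians, and build one-step-ahead
    (input, target) pairs. 'flight' is assumed to be a pandas DataFrame with
    placeholder column names, sampled at 20 Hz."""
    states = np.deg2rad(flight[["beta", "phi", "P", "R"]].to_numpy())
    controls = np.deg2rad(flight[["delta_a", "delta_r"]].to_numpy()
                          - np.array([trim_da_deg, trim_dr_deg]))
    X = np.hstack([states[:-1], controls[:-1]])   # states and controls at step t
    Y = states[1:]                                 # states at step t+1
    return X, Y
```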
4.3 Model Training Setup.
The loss function used for training is the mean squared error, $\mathrm{MSE} = \frac{1}{n_o N}\sum_{k=1}^{n_o}\sum_{i=1}^{N}\left(Y_{k,i} - \hat{Y}_{k,i}\right)^{2}$, where Y and $\hat{Y}$ are the flight data measurements and model predictions, respectively, $n_o$ is the number of model outputs (for the case of four model outputs, β, φ, P, and R, $n_o$ = 4), N is the number of data points, and k and i are used to sum over the model outputs and the data points, respectively. Training was done in TensorFlow [57] using procedures similar to Ref. [58]. The batch size used during training is 32. Training stops if the validation loss does not improve in two successive epochs. The Adam algorithm [59] is used to perform the training.
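In Keras terms, the training configuration described above might look like the following sketch; the array names and the epoch cap are placeholders, while the MSE loss, the batch size of 32, the Adam optimizer, and the two-epoch early-stopping patience follow the text.

```python
import tensorflow as tf

# X_train/Y_train and X_val/Y_val are input/target arrays built from the flight
# data (placeholder names); lstm_model is a model such as the one sketched in Sec. 4.1.
lstm_model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
lstm_model.fit(
    X_train, Y_train,
    batch_size=32,
    epochs=500,   # upper bound only; early stopping ends training
    validation_data=(X_val, Y_val),
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)],
)
```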
4.4 Modeling Results.
The performance of the developed models is evaluated using the mean absolute error (MAE) metric, which is a standard metric in ML work [60]. In this section, we quantitatively evaluate the performance of each of the developed models. The MAE in predicting a single time-step in the future is presented in Fig. 5 for each of the model outputs (β, φ, P, and R). The figure presents the performance over the training, validation, and test datasets. The LSTM and ResLSTM models have better performance metrics on all three datasets for the sideslip angle, β. The linear and ResLSTM models have better performance on all three datasets for the roll angle, φ. All four models have similar performance for the roll rate, P. For the yaw rate, R, the LSTM and ResLSTM models have better performance on the training and testing data and are slightly better on the validation data.
We aim to use the developed models for training a lateral-directional aircraft controller using reinforcement learning. For this, the model needs to perform well when simulating several time-steps into the future, not just one time-step. Therefore, we evaluate the performance of the models on 6 s of simulation, which is the duration of the RL training episodes. In these simulations, the simulation outputs at one time-step are used as inputs at the following time-step in a looping manner. The aileron and rudder controls are the only two variables obtained from the flight data at each time-step since these are autopilot or human pilot commands and should not be predicted by an aircraft model. When evaluating the performance of the models, it became clear that feeding the models a zero sideslip angle yielded better results in the 6 s simulations. This may be due to errors in the sideslip angle estimations available in the flight data and used for ML model training. Therefore, for the remaining results, we feed zero as the sideslip angle to the models and evaluate model performance for the three outputs: φ, P, and R.
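A sketch of this closed-loop evaluation is shown below. The model input layout matches the earlier single-step sketch and is an assumption; feeding the recorded aileron and rudder commands, forcing the sideslip angle input to zero, and the 6 s (120-step at 20 Hz) horizon follow the procedure described above.

```python
import numpy as np

def closed_loop_rollout(model, x0, controls, horizon=120):
    """Simulate 6 s at 20 Hz: predicted states are fed back as the next inputs,
    only the recorded aileron/rudder commands come from the flight log, and the
    sideslip angle input is forced to zero."""
    state = np.asarray(x0, dtype=np.float32)      # [beta, phi, P, R]
    trajectory = []
    for k in range(horizon):
        state[0] = 0.0                            # feed zero sideslip angle
        x_in = np.concatenate([state, controls[k]])[None, None, :]
        state = model.predict(x_in, verbose=0)[0]
        trajectory.append(state.copy())
    return np.asarray(trajectory)

# MAE against the recorded states of the same flight segment, e.g.:
# mae = np.mean(np.abs(closed_loop_rollout(lstm_model, x0, u_log) - y_log), axis=0)
```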
For small UAS, there are two approaches to obtain the aircraft airflow angles (angle of attack, α, and sideslip angle, β): (a) through estimation, or (b) through measurement. Measurement using a 5-hole pitot tube or other practical approaches suffers from static pressure inaccuracy due to the location of the static ports. This issue is resolved in large aircraft by distributing the locations of the static ports. Since UAS are small, air-stream interaction with the body can cause large errors in the measurement of airflow angles. Additionally, the cost of a 5-hole pitot tube or an angle-of-attack/sideslip vane system can be more than one order of magnitude higher than the UAS cost. The second approach, obtaining the airflow angles through estimation, is less expensive, but finding the correct covariance matrix, if an extended Kalman filter is used, or dealing with bias errors makes it challenging to obtain good estimations. Additionally, it is difficult to know the ground-truth airflow angles to evaluate the estimation accuracy. A good example demonstrating the challenge of estimating the sideslip angle is evident in this study: employing the extended Kalman filter-estimated sideslip angles to model the lateral-directional flight dynamics resulted in inferior outcomes compared to scenarios where the sideslip angle was not utilized.
We compare the prediction performance of the four learned models (Linear, MLP, LSTM, and ResLSTM), the two physics models (LTI and 6DOF), and two baseline models (baseline and “zero”). The baseline model simply predicts that the next time-step states are equal to the current time-step states. The “zero” model predicts zero for all the states at all time steps. This model is of interest in straight line flight sections where the aircraft states are around zero.
Classification algorithms are used to classify the flight data into turning and straight line flight similar to the approach in Ref. [3]. Using the classification algorithms, 44 turning flight segments were obtained from the first validation flight, 34 segments were obtained from the second validation flight, and 43 segments were obtained from the test flight. Figure 6 presents the average MAE for these 6 s segments of turning flight. To obtain a perspective of how large the MAEs are, we normalize the MAEs by the standard deviations of roll angle, roll rate, and yaw rate calculated on the training data, σTrain (presented in Table 1). These normalized percentage values are shown on the right axes of the plots in Fig. 6. The learned models have improved MAE compared to the physics-based models in most comparisons. The LSTM model had improved MAE in all comparisons, except for the roll angle prediction in the test flight (but it had improved MAE in the two validation flights). The LSTM model had improvements of up to 45.8% and 23.4% over the physics-based models on the validation and test data, respectively. Table 2 shows the percentage improvement obtained by the LSTM model compared to the physics-based models.
Variable | Val. 1 (%) | Val. 2 (%) | Test (%)
---|---|---|---
P | 13.0 | 4.5 | 20.0
R | 35.8 | 45.8 | 23.4
φ | 21.6 | 12.0 | −12.1
A similar analysis was performed for straight-line flight segments. Using the classification algorithms, 21 straight flight segments were obtained from the first validation flight, 20 segments were obtained from the second validation flight, and 21 segments were obtained from the test flight. Overall, in the analyzed straight-line flights, the rotation rate prediction performance was comparable across the different learned and physics-based models. For the roll angle, φ, the learned models had larger errors than the physics-based models in the validation flights. However, for the test flight, the LSTM, linear, and MLP models had lower roll angle MAEs compared to the physics-based models.
Given the improved performance of the LSTM model compared to the physics-based models, the LSTM model is selected for training the lateral-directional controller using reinforcement learning. The LSTM model had improved performance over the ResLSTM model in roll angle predictions. The LSTM model also had improved performance in yaw rate predictions over the linear and MLP models, as seen in Fig. 6.
A sample of the prediction performance of the LSTM and physics-based models on a 30 s flight portion from the test flight is presented in Fig. 7. The improved performance of the LSTM model over the physics-based models can be seen in the roll rate (P) and yaw rate (R) modeling. In this test flight, the LSTM model had some error in modeling the roll angle (φ), but it followed the correct trends seen in the flight data.
Another sample of the prediction performance of the LSTM and physics-based models on a 30 s flight portion from the first validation flight is presented in Fig. 8. The improved performance of the LSTM model compared to the physics-based models can be seen in all three model outputs: the roll rate (P), the yaw rate (R), and the roll angle (φ). Using the LSTM model for training and then testing a flight controller, in the rest of this work, provides a way to evaluate the practical use of the LSTM model.
5 Controller Development Using Reinforcement Learning
Reinforcement learning enables a control policy (π) to learn a sequential mapping from the state (s) to the optimal control (a) by directly interacting with an environment (Env) and using the feedback received as a form of reward (r) for its control decisions. The environment is formalized as a finite-horizon discounted Markov decision process (MDP). An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma, T)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P$ is the state transition probability distribution, $r$ is the reward function, $\rho_0$ is the initial state distribution, $\gamma$ is the discount factor, and T is the horizon of each episode of interactions. The algorithm used in this work optimizes a stochastic policy $\pi_\theta(a|s)$ with parameters θ. Let $\eta(\pi_\theta)$ denote its expected total discounted reward, $\eta(\pi_\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\right]$, where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ denotes the whole trajectory, $s_0 \sim \rho_0$, $a_t \sim \pi_\theta(a_t|s_t)$, and $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$.
For the safety of the aircraft, controller training is performed in simulation environments. Two different control policies, π1 and π2, are developed for comparison, trained using two different simulation environments. The training of π1 is performed using a deterministic environment (Env1) where an LTI-based dynamic model is used for the state transition. In contrast, the training of π2 uses a stochastic environment (Env2) utilizing a domain-randomization approach to improve the generalization performance of the control policy [36–39]. The stochastic environment (Env2) makes use of both the LTI- and LSTM-based dynamic models, where each model is randomly selected with uniform probability at the beginning of each training episode. The details of the LTI- and LSTM-based dynamic models can be found in Secs. 3 and 4, respectively.
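The essential mechanism of Env2 can be sketched as follows; the dynamic-model interface (a predict_next method), the reward callback, and the initial-state handling are placeholders, while the uniform per-episode selection between the LTI- and LSTM-based models follows the description above.

```python
import random
import numpy as np

class RandomizedDynamicsEnv:
    """Env2 sketch: at every episode reset the state-transition model is drawn
    uniformly from {LTI, LSTM}. The model objects are assumed to expose a
    predict_next(state, action) method (a placeholder interface)."""

    def __init__(self, lti_model, lstm_model, episode_len=128):
        self.models = [lti_model, lstm_model]
        self.episode_len = episode_len

    def reset(self, initial_state):
        self.dynamics = random.choice(self.models)   # uniform model selection
        self.t = 0
        self.state = np.asarray(initial_state, dtype=np.float32)
        return self.state

    def step(self, action, reward_fn):
        self.state = self.dynamics.predict_next(self.state, action)
        self.t += 1
        done = self.t >= self.episode_len
        return self.state, reward_fn(self.state, action), done
```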
5.1 Neural Network Architecture.
The policy and the critic are represented by two LSTM-based neural networks (NNs) with weights θ and ν, respectively. Both NNs are based on the same architecture and are composed of one input layer, one LSTM layer, two feed-forward (FF) hidden layers, and one output layer, as shown in Fig. 9. The LSTM layer comprises one hidden layer of 128 LSTM memory cells. Each hidden layer is a fully connected layer of 128 hidden units (neurons) with tanh activation. The training hyper-parameters used in this work are presented in Table 3. The LSTM layer for the input is shown unrolled at time-step t in Fig. 9. The layer uses an input sequence in which the states from the current time-step t back through the previous 31 time-steps (the state sequence length of 32 in Table 3) are stacked together.
Hyper-parameter | Value
---|---
LSTM hidden-layers | 1
FF hidden-layers | 3
Neurons | 128
Activation | tanh
State sequence length | 32
Discount-factor, γ | 0.99
Clip param, ϵ | 0.2
Batch-size | 5120
Learning rate | 1 × 10⁻³
# of epochs | 10
Episode-length, T | 128
# of episodes per batch | 40
Step time-period, dt | 0.05
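A minimal Keras sketch of this shared actor/critic architecture is given below; the observation dimension is a placeholder, and the layer sizes follow the description in this subsection.

```python
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN = 32   # state sequence length (Table 3)
OBS_DIM = 8    # dimension of the observation vector (placeholder)

def build_net(n_out):
    """Shared sketch for the policy (n_out = 2 control setpoints) and the
    critic (n_out = 1 value estimate): an LSTM layer of 128 memory cells
    followed by fully connected tanh layers of 128 units, per Sec. 5.1."""
    x_in = layers.Input(shape=(SEQ_LEN, OBS_DIM))
    h = layers.LSTM(128)(x_in)
    h = layers.Dense(128, activation="tanh")(h)
    h = layers.Dense(128, activation="tanh")(h)
    return tf.keras.Model(x_in, layers.Dense(n_out)(h))

policy_net = build_net(n_out=2)   # aileron and rudder setpoints
critic_net = build_net(n_out=1)   # state-value estimate
```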
5.2 Network Input-Output.
The policy network outputs the normalized aileron and rudder deflections from their respective trims. These control setpoints are appropriately scaled and shifted to match the aircraft's control constraints.
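As a simple illustration of this scaling and shifting, the sketch below maps a normalized policy output back to an absolute surface command; the trim value and deflection limit are placeholders.

```python
def to_servo_command(a_norm, trim_deg, max_defl_deg):
    """Scale and shift a normalized deflection in [-1, 1] (relative to trim)
    to an absolute surface command within the aircraft's control constraints."""
    cmd = trim_deg + a_norm * max_defl_deg
    return max(trim_deg - max_defl_deg, min(trim_deg + max_defl_deg, cmd))
```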
5.3 Reward Function.
The reward function is composed of five groups of weighted terms (a code sketch of this structure follows the list):
Group 1 is used to improve the tracking performance and consists of a weighted L2 cost/penalty for the roll tracking error.
Group 2 is weighted at about half the magnitude of Group 1 and consists of an L2 cost/penalty for the nonzero perturbed control values output by the policy, to keep them close to their respective trim values.
Group 3 regulates the roll rate (P) and yaw rate (R) with L2 penalty weights about one order of magnitude smaller than Group 1.
Group 4 regulates the control rates with L2 penalty weights about two orders of magnitude smaller than Group 1.
Group 5 limits the control rates within their maximum values. The initial L2 penalty weights are set at about one order of magnitude smaller than Group 1 and then increased by one order of magnitude at the final stage of the training to further reduce the control rates.
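The code sketch below illustrates this five-group structure; the absolute weight values and the control-rate limit are placeholders, and only the relative orders of magnitude between the groups follow the description above.

```python
def reward(phi_err, d_a, d_r, P, R, d_a_rate, d_r_rate,
           rate_max=1.0, w1=1.0, high_rate_weight=False):
    """Illustrative five-group L2 reward; weights are placeholder values."""
    g1 = -w1 * phi_err ** 2                              # Group 1: roll tracking
    g2 = -0.5 * w1 * (d_a ** 2 + d_r ** 2)               # Group 2: stay near trim
    g3 = -0.1 * w1 * (P ** 2 + R ** 2)                   # Group 3: body rates
    g4 = -0.01 * w1 * (d_a_rate ** 2 + d_r_rate ** 2)    # Group 4: control rates
    w5 = w1 if high_rate_weight else 0.1 * w1            # Group 5 weight raised late in training
    g5 = -w5 * (max(0.0, abs(d_a_rate) - rate_max) ** 2
                + max(0.0, abs(d_r_rate) - rate_max) ** 2)
    return g1 + g2 + g3 + g4 + g5
```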
Algorithm 1: Policy training
Input:
    LTI- and LSTM-based dynamic models (DMs) developed in Secs. 3 and 4, respectively.
    Maximum # of episodes, kmax.
Output:
    Optimized policy network weights.
Initialization:
    Initialize critic and policy network parameters ν and θ, respectively.
for k = 0, 1, 2, …, kmax do
    1. For the state transition, if π1 then pick Env1, else if π2 then pick Env2.
    2. Collect a set of trajectories of length T = 128 into a dataset D on policy πθ, where Di contains data collected along the ith trajectory.
    for i = 0, 1, 2, …, N do
        3. Compute the total discounted reward Gt in (10).
        4. Estimate advantages using (11).
    end
    5. Update the policy by maximizing (12).
    6. Update the critic by minimizing (14).
    7. Break if converged.
end
The complete policy training pseudo-code is summarized in Algorithm 1. The training starts with randomly initializing the policy and critic network weights. The state vector (Eq. (15)) is randomly initialized according to Table 4 before the start of each training episode to create the initial observation vector. The ranges for the observation vector are matched with the initial conditions seen in conducted flight tests. The policy is then rolled out in the environment in a Monte Carlo (MC) fashion to collect samples of trajectories. Each collected trajectory contains a sequence of (state, action, reward) tuples of a complete roll-out episode of length T = 128 timesteps and is stored in a memory buffer of size 5120 (40 episodes of 128 timesteps each). PPO uses a full buffer for the network update and refreshes the memory buffer with new trajectories after each update (on-policy update). Before each network update, the buffer is divided into sequences for the LSTM layer input, each sequence with a length of 32. The update step uses the Adam optimizer [59], a state-of-the-art stochastic gradient descent algorithm. After each update step, the policy parameters θ are moved in the direction of higher expected reward suggested by the gradient of the policy objective. The training is stopped once the desired performance is reached.
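Steps 3 and 4 of Algorithm 1 can be sketched as follows; the simple return-minus-value advantage is an assumption, since the exact estimators referenced as Eqs. (10) and (11) are not reproduced here.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Total discounted reward G_t for every step of one episode (step 3)."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def advantage_estimates(returns, values):
    """Simple advantage A_t = G_t - V(s_t) (step 4); the exact estimator used
    in the paper (Eq. (11)) may differ, e.g., generalized advantage estimation."""
    return returns - np.asarray(values)
```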
Figure 10 shows the sum of the reward function values over each complete episode (the total episodic reward) after each update step during the policy training process. The training stops (converges) after a total of 311 update steps (about 1.6 × 10⁶ timesteps) for both π1 and π2. Higher mean and variance in the reward values for π2 are observed because of the use of dynamic randomization. The algorithm applies the increased penalty weights of Group 5 after 310 update steps. The total training time of the controller was about 33 min on a laptop computer with a 12-core i7-9750H CPU and an RTX 2070 Max-Q GPU.
6 Flight Test Results
The performance of the developed controllers is validated in actual flight test experiments. Four flight tests are performed in which the controller's performance is evaluated in different scenarios and compared to controllers developed using modern and adaptive control techniques. The flights were performed on different days with different wind conditions demonstrating the ability of the controller to handle disturbances. The flights were also subject to sensor noise inherently present in the flight sensors.
In the first flight test (Flight 1 in Table 5), the aircraft was commanded to fly multiple laps around a rectangular path. This flight was conducted in 7.8 ft/s wind conditions coming from the West with gusts up to 10.7 ft/s. Thus, the wind speed was up to 21% of the commanded cruise speed of 50 ft/s. The laps were performed with (a) the neural network controller trained using the LSTM model and dynamic randomization (NN π2), (b) a linear quadratic regulator (LQR) controller developed using modern control techniques, and (c) an L1 adaptive controller. The LQR and L1 controllers were developed using the LTI-based dynamic model described in Sec. 3 and were manually tuned for the target aircraft in flight tests. The 2D trajectory tracking performance of the aircraft using the different controllers is presented in Fig. 11. Flight using the neural network controller had better trajectory tracking than the other two controllers. Table 5 compares the root-mean-square error (RMSE) in roll angle tracking; the neural network controller had a smaller RMSE than the LQR and L1 controllers. The neural network controller yielded 30% and 60% improvements in the maximum tracking error at the east leg, compared to the LQR and L1 controllers, respectively.
Flight | Controller | Rudder effectiveness (%) | Roll angle RMSE (deg) | Max. error (ft)
---|---|---|---|---
1 | LQR | 100 | 3.03 | 71 East
1 | NN π2 | 100 | 2.53 | 50 East
1 | L1 | 100 | 5.50 | 128 East
1 | LQR | 50 | 2.91 | 69 East
1 | NN π2 | 50 | 2.46 | 45 East
1 | L1 | 50 | 6.47 | 147 East
1 | LQR | 0 | 2.90 | 69 East
1 | NN π2 | 0 | 2.60 | 45 East
1 | L1 | 0 | 6.20 | 158 East
2 | LQR | 100 | 3.40 | 113 East
2 | NN π2 | 100 | 3.45 | 97 East
2 | NN π1 (LTI only) | 100 | Failed | Failed
3a | NN π2 | 100 | 3.72 | 254 South
3a | LQR | 100 | 6.70 | 470 South
4 | LQR | 100 | 3.80 | 45 West
4 | NN π2 | 100 | 4.30 | 66 West
Flight 3 has a triangular flight path, while the other flights have rectangular flight paths. This contributes to the different 2D tracking errors presented for Flight 3.
The controller's performance was also evaluated under the adverse condition of degraded rudder control surface effectiveness. To artificially emulate a rudder effectiveness failure, the aircraft rudder commands generated by the controllers were multiplied by an effectiveness factor before being sent to the rudder servos. The controllers were tested under 50% rudder effectiveness and 0% rudder effectiveness (i.e., 0% effectiveness corresponds to a nonfunctional rudder). The flight trajectory tracking performance under these failure cases is presented in Fig. 11. Flight using the neural network controller has improved tracking over the other two controllers. Table 5 presents the maximum East error at the east flight leg for all three controllers under the different rudder degradation settings. Flight using the neural network controller yielded 36% and 72% improvements in the maximum tracking errors at the east leg compared to the LQR and L1 controllers, respectively. Rudder degradation did not have an adverse effect on the trajectory and roll angle tracking performance of the neural network controller, as seen in Table 5. The neural network controller again had the lowest roll angle tracking error in the rudder degradation flights.
In the second flight test (Flight 2 in Table 5), the aircraft was again commanded to fly around a rectangular path. This second flight was conducted in 7.3 → 11 ft/s wind conditions coming from the North. In this flight, a comparison was made between (a) the neural network controller trained using the LSTM model and dynamic randomization (NN π2), (b) the neural network controller trained using the LTI model only (NN π1), and (c) the LQR controller. Figure 12 shows the last 15 s of flight using the controller trained with the LSTM model and dynamic randomization (NN π2), before control was switched to the controller trained using the LTI model only (NN π1). The controller trained using only the LTI model could not safely control the aircraft: it caused unstable behavior with a roll of almost 360 deg, and the human pilot took back control of the aircraft. Comparisons between the roll angle and trajectory tracking performance of the NN π2 and LQR controllers are presented in Table 5 and Fig. 13, where the neural network controller is seen to have better or similar performance compared to the LQR controller.
The performance of the developed controllers was evaluated in a third, more challenging flight scenario (Flight 3 in Table 5). The aircraft was commanded to fly around a triangular path, as shown in Fig. 14, where the triangle has angles of about 90-25-65 degrees. This flight scenario demands large heading changes and drastic maneuvers from the aircraft. A comparison was made between (a) the LQR controller and (b) the neural network controller trained using dynamic randomization and the LSTM model (NN π2). Flight using the neural network controller has significantly better trajectory tracking, as presented in Fig. 14. For example, at the South-West angle of the triangle, flight using the neural network controller has significantly improved tracking performance. As presented in Table 5, the maximum error from the South leg is 254 ft for the neural network controller, which is 54% of the error for the LQR controller (470 ft). This third flight was conducted in 8.8 ft/s wind conditions coming from the North-East with gusts up to 13.2 ft/s. Thus, the wind speed was up to 26% of the aircraft's 50 ft/s commanded cruise speed. The neural network controller, which does not use the sideslip angle (β) as an input, also showed coordinated-turn performance during sharp turns comparable to the well-tuned LQR controller, which does use the sideslip angle as an input.
The fourth flight test (Flight 4 in Table 5) aimed to assess the capability of the flight controllers in executing complex collision avoidance maneuvers in the presence of wind. During this test, the aircraft navigated through the southern segment of its flight path, successfully avoiding a virtual obstacle. The comparison was made between the flight performance of (a) the LQR-based controller and (b) the neural network controller NN π2. This flight was conducted in 8.8 → 11 ft/s East wind conditions, corresponding to wind speeds up to 24% of the aircraft's 45 ft/s commanded cruise speed. Obstacle avoidance was done based on the approach in Ref. [64]. Under the command of the neural network controller, the aircraft successfully avoided the obstacle and followed the desired flight path, as presented in Fig. 15. The neural network controller had similar roll angle and trajectory tracking performance to the LQR-based controller, as seen in Table 5 and Fig. 15.
All four flight tests performed using the neural network controller π2 were subject to different wind disturbance conditions, and the controller was subject to sensor noise. The flights were conducted on different days, with different wind and gust magnitudes and different wind directions. Table 6 summarizes the wind conditions of the four flights. The wind conditions reached up to 26% of the aircraft cruise speed. The neural network controller π2 successfully controlled the aircraft in these conditions and showed better or similar performance compared to the LQR-based and L1 adaptive controllers.
Flight | Cruise speed, VC (ft/s) | Wind direction | Wind → gust (ft/s) | % of VC |
---|---|---|---|---|
1 | 50 | W. | 7.8 → 10.7 | 21% |
2 | 50 | N. | 7.3 → 11.0 | 22% |
3 | 50 | NE. | 8.8 → 13.2 | 26% |
4 | 45 | E. | 8.8 → 11.0 | 24% |
As presented, flight using the neural network controller trained using the LSTM model and dynamic randomization (NN π2) showed significantly improved trajectory tracking performance compared to the LQR and L1 controllers (as seen in Flight 1, even for rudder degradation cases, and in Flight 3 in the challenging scenario requiring a large change in heading). The neural network controller trained using the LTI model only (NN π1) was unsuccessful and it caused the aircraft to go into a 360-deg roll. This shows that using the LSTM model and dynamic randomization can practically yield a successful flight controller with improved performance over controllers developed using modern and adaptive control techniques.
7 Conclusions
In this work, data-driven machine learning techniques are used to improve the dynamic model of an uncrewed aircraft using a bank of available flight test data. A recurrent neural network with an LSTM architecture is shown to provide improved modeling accuracy over dynamic models developed using physics-based methods. Unlike restrictive classical system identification methods, the ML LSTM method provides the methodology and framework to use any portion of flight test data to improve the fidelity of an aircraft dynamic model. Lateral-directional RL-based controllers are developed using the PPO deep RL algorithm. The RL-based controller performance, stability, and robustness are improved using a training environment that utilizes both the LSTM and physics-based dynamic models in a random fashion and a reward function that regulates the rates of the control surfaces along with other states. The developed controller is tested in different flight test scenarios and is compared to controllers designed using modern (LQR) and adaptive (L1) control techniques. Assessing the controller's performance in benign flight test conditions or simple path-following missions is insufficient; therefore, several complex paths and intentional adverse onboard conditions in the presence of exogenous disturbances are used to quantify the improvement in aircraft dynamic model fidelity. The flight performance using the RL-based controller is observed to be significantly better than with the LQR and L1 controllers, even during rudder degradation flight tests and in challenging flight scenarios requiring large changes in heading.
Acknowledgment
Much appreciation is given to collaborators from the KU Flight Research Lab, especially, Justin Clough, Megan Carlson, and Alex Zugazagoitia for their assistance in flight test support and execution.
Funding Data
National Aeronautics and Space Administration (NASA) and Armstrong Flight Research Center (Project No. 18CDA067 L; Funder ID: 10.13039/100007346).
Federal Aviation Administration (FAA) (No. 908-1003025; Funder ID: 10.13039/100006282).
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.