Abstract
This work presents mathematical and practical frameworks for designing deep deterministic policy gradient (DDPG) flight controllers for fixed-wing aircraft. The aim is to design reinforcement learning (RL) flight controllers and accelerate training by substituting the six-degrees-of-freedom aircraft models with linear time-invariant (LTI) dynamic models. In the initial validation flight tests, the DDPG RL flight controller exhibited poor performance. Post-flight investigation revealed that the unsatisfactory performance could be attributed to the high reliance of the LTI model on accurate control trim values and the substantial errors in the trim values predicted by the engineering-level dynamic analysis software. A complementary real-time learning Gaussian process (GP) regression algorithm was designed to mitigate this critical shortcoming of the LTI-based RL flight controller. The GP estimates and updates the trim control surfaces using observed flight data, applying real-time corrections to the trim control surfaces to enhance the performance of the flight controller. Flight test validation was repeated, and the results show that the RL controller, bolstered by the GP trim-finding algorithm, can successfully control the aircraft with excellent tracking performance.
1 Introduction
Although the availability of high-performance computing systems has enabled artificial intelligence (AI) and machine learning applications to emerge and spread over a wide range of disciplines, their integration into aerospace systems has been relatively slow. Reinforcement learning (RL) is the subject of recent research in fields such as autonomous vehicles [1,2] and flight control. RL provides a means to develop sophisticated nonlinear agents that learn through experience to perform desired tasks.
Existing research on the application of RL techniques to flight control has primarily focused on simulation-based validations and rotary-wing vehicles. Notable examples include the application of RL techniques to quadcopters [3–6], with validation through simulation-based tests [3,4] and actual flight tests [5,6].
For fixed-wing aircraft, research has explored the twin-delayed deep deterministic policy gradient (TD3) algorithm [7], RL adaptive critic design (ACD) for altitude tracking in a Cessna Citation aircraft [8], and an ACD framework for a business jet aircraft [9], all validated through simulations. The recently developed soft actor-critic deep reinforcement learning (DRL) algorithm has been applied to control a Cessna Citation 500 jet aircraft, with a separate proportional-integral-derivative controller for airspeed tracking, and was validated in simulations [10]. Another study utilized the normalized advantage function variant of the Q-learning algorithm to control the acrobatic flight of a fixed-wing aircraft, but only simulation-based testing was conducted [11]. Additional works include the use of the proximal policy optimization (PPO) algorithm for perched landing of a fixed-wing unmanned aerial system (UAS) with variable wing sweep control, limited to “on/off” throttle control [12], and RL as a method for active stall protection by deflecting flaps on the elevator of a fixed-wing glider [13].
Reference [14] presents the application of the PPO DRL algorithm to control airspeed and pitch angle, with actual flight tests conducted on a fixed-wing UAS to evaluate the performance of the RL multi-input-multi-output longitudinal flight controller. References [15,16] employed the deep deterministic policy gradient (DDPG) algorithm and a linear time-invariant (LTI) model of the aircraft to control pitch angle and airspeed; however, the developed controller initially exhibited poor performance due to errors in the trim elevator and throttle settings predicted by the engineering analysis software. To address this issue, the trim throttle and elevator values were manually adjusted through the ground control station, which proved inefficient and time-consuming.
In this work, we address the shortcomings observed in the DDPG flight controller presented in references [15,16]. We develop a longitudinal neural network controller using a similar approach and incorporate real-time Gaussian process (GP) regression to automatically determine trim throttle and elevator values based on observed flight data. Additionally, we account for actuation delays caused by servo dynamics during controller training and include the pitch rate in the controller's reward function, which is critical to avoiding rapid maneuvers and unintentional stall; these factors were not considered in Refs. [15,16] but are crucial for effective controller design. The DDPG-trained neural network controller, augmented with the GP algorithm, successfully controlled the aircraft in real-world flight tests.
2 The SkyHunter Aircraft
The SkyHunter UAS is the testbed aircraft used in this work (see Fig. 1). It is a commercially available fixed-wing UAS mostly made of foam. The aircraft has a twin tailboom design and uses a single pusher electric motor. It comes with elevator and aileron control surfaces and has been modified in-house to include rudders. The aircraft has a 1.8 m wingspan, a length of 1.4 m, and weighs 4 kg.
We developed two dynamic models for the SkyHunter UAS. The first is a six-degrees-of-freedom (6 DoF) simulation environment that models the fixed-wing aircraft dynamics using nonlinear and coupled equations of motion. The second is a decoupled LTI dynamic model of the aircraft's longitudinal dynamics. Both models were developed using stability and control derivatives estimated with the advanced aircraft analysis (AAA) software [17]. AAA is an aircraft design software package that uses physics-based and semi-empirical methods to obtain the stability and control derivatives. These methods are intended for preliminary aircraft design and thus yield low-fidelity aircraft models; however, the cost of obtaining such models is low, and we find them practical for research purposes. The aircraft moments of inertia were obtained from swing tests, and the motor thrust characteristics were obtained from thrust stand testing. The developed dynamic models were improved using flight data. Details on the fixed-wing aircraft equations of motion and methods for developing 6 DoF and LTI simulations can be found in Refs. [18,19].
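For reference, the sketch below shows the generic form in which such a decoupled longitudinal LTI model can be stepped in simulation. The state and control ordering and the placeholder A and B matrices are illustrative assumptions; the actual matrices come from the AAA-estimated derivatives refined with flight data and are not reproduced here.

```python
import numpy as np

# Generic longitudinal LTI model in perturbation form: x = [du, dalpha, dq, dtheta]
# (deviations from trim) and u = [d_elevator, d_throttle]. The zero matrices below
# are placeholders; the actual A and B come from the AAA-estimated stability and
# control derivatives and are not reproduced here.
A = np.zeros((4, 4))  # hypothetical system matrix
B = np.zeros((4, 2))  # hypothetical control matrix

def step_lti(x, u, dt=0.02):
    """Advance the perturbation states one time-step with forward Euler integration."""
    x_dot = A @ x + B @ u
    return x + dt * x_dot
```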
3 Development of Aircraft Controller Using Reinforcement Learning
In reinforcement learning, the goal is for an agent to learn how to perform a task through interactions with the environment. As the agent performs the task, it receives a feedback signal consisting of rewards or penalties. The RL algorithm is designed so that the agent learns to perform the task by maximizing the total reward it receives.
The recently developed deep deterministic policy gradient (DDPG) algorithm, which is used in this work, is a model-free off-policy actor-critic deep reinforcement learning algorithm [20]. The algorithm is applicable to problems with high-dimensional continuous action spaces. The following subsections present further details on the RL framework, the DDPG algorithm, and the training setup used in this work.
This research builds on Refs. [15,16] and thus used the DDPG algorithm. The PPO algorithm was used in Ref. [14] and yielded excellent results. Both the DDPG and PPO algorithms were found suitable for designing fixed-wing aircraft flight controllers. Recent research has also investigated the use of DDPG, PPO, and trust region policy optimization algorithms for developing RL-based controllers for a quadrotor aircraft and presented results from the three approaches [5].
3.1 The Reinforcement Learning Framework.
The RL framework consists of two parts: an environment and an agent. The environment refers to the dynamics of the system; in this work, it is the LTI model presented in Sec. 2 along with the servo dynamics. The environment takes in the current time-step states and control actions (st and at, respectively) and outputs the next time-step states (st+1). The environment also outputs a reward signal, rt, which indicates how good it was to take the given action, at, and move to the new state, st+1. The reward function is designed by the user and should reflect which behaviors are considered good or bad. The agent, or actor, is the entity that takes actions. The agent is a function (or mapping) that takes in the current state, st, and outputs a probability distribution over the action space, p(a). If the actor is deterministic, as in this work, then the actor directly outputs the action, at, given the current state, st.
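As an illustration of this interface, a minimal sketch of an environment class is given below. The LTIEnv name, the reward_fn argument, and the omission of the servo dynamics are assumptions made for brevity; only the st, at to st+1, rt structure follows the description above.

```python
class LTIEnv:
    """Hypothetical environment wrapping the LTI longitudinal model (servo dynamics omitted)."""

    def __init__(self, model, reward_fn):
        self.model = model          # e.g., step_lti from the earlier sketch
        self.reward_fn = reward_fn  # user-designed reward r_t = r(s_{t+1}, a_t)
        self.state = None

    def reset(self, initial_state):
        self.state = initial_state
        return self.state

    def step(self, action):
        # Map (s_t, a_t) to (s_{t+1}, r_t), as described above.
        next_state = self.model(self.state, action)
        reward = self.reward_fn(next_state, action)
        self.state = next_state
        return next_state, reward
```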
3.2 DDPG Algorithm Overview.
The DDPG algorithm uses an actor-critic framework with neural networks as the actor and critic elements. In the actor-critic framework, the actor makes the decisions, and the critic estimates the value of taking given actions in given states. The actor network is a deterministic policy that maps the observations, s, to actions, a. The actor network is denoted by the symbol μ, and its parameters (weights and biases) are denoted by θμ. The goal of the RL algorithm is to train the actor network by optimizing its parameters, θμ, so that the return, Eq. (4), is maximized. The critic network is used to estimate the action-value function and is denoted by the symbol Q. The critic network parameters are denoted by θQ.
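For concreteness, a condensed PyTorch sketch of the DDPG parameter updates is given below. The target networks, replay-buffer mini-batch, and the gamma and tau values are standard elements of the algorithm [20] rather than values reported in this work, and terminal-state handling is omitted for brevity.

```python
import torch

gamma, tau = 0.99, 1e-3  # illustrative discount factor and soft-update rate

def ddpg_update(actor, critic, actor_targ, critic_targ, actor_opt, critic_opt, batch):
    s, a, r, s_next = batch  # mini-batch sampled from the replay buffer

    # Critic: regress Q(s, a | theta_Q) toward the one-step TD target.
    with torch.no_grad():
        y = r + gamma * critic_targ(s_next, actor_targ(s_next))
    critic_loss = torch.nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: adjust theta_mu to maximize Q(s, mu(s | theta_mu)), i.e., the estimated return.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Slowly track the learned networks with the target networks.
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1 - tau).add_(tau * p.data)
```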
3.3 Longitudinal Controller Training Setup.
The actor and critic networks used in this work have the architectures presented in Figs. 2(a) and 2(b), respectively. These networks have architectures similar to those used previously in Ref. [16]. In this work, the actor takes the normalized velocity tracking error ((V − Vcmd)/Vcmd), the pitch tracking error (θ − θcmd), and the pitch rate (q) as inputs. The actor outputs the perturbations in the throttle and elevator commands (δt and δe, respectively). The actual control commands used in flight tests are thus the perturbations plus the trim values; for example, the total elevator command is δe,total = δe + δe1, where δe1 is the trim elevator angle. The actor network has three hidden layers, each consisting of 50 neurons with ReLU activation functions, and all layers are fully connected.
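A minimal PyTorch sketch of this actor architecture is shown below. The bounded tanh output layer is an assumption, since only the hidden layers are specified above; in practice the outputs would be scaled to the actuator perturbation limits.

```python
import torch.nn as nn

class Actor(nn.Module):
    """Actor network mu(s): three observations -> two control perturbations.

    Inputs:  [(V - Vcmd)/Vcmd, theta - theta_cmd, q]
    Outputs: [delta_throttle, delta_elevator] (perturbations about trim)
    """

    def __init__(self, n_obs=3, n_act=2, width=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, n_act), nn.Tanh(),  # bounded output is an assumption
        )

    def forward(self, obs):
        return self.net(obs)
```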
The critic network has two input stages, similar to the original DDPG article [20]. In the first stage, the states enter the network and pass through a hidden layer consisting of 50 neurons with ReLU activation functions, followed by a linear hidden layer (a layer consisting only of multiplication by weights and addition of biases, with no nonlinear activation function). In the second input stage, the controls enter the network and pass through a linear hidden layer. The first and second stages are then joined through element-wise addition. The network then contains two more hidden layers with ReLU activation functions, and the output layer consists of a single linear neuron that outputs an estimate of the action-value function, Q(s, a). Again, each of the hidden layers consists of 50 neurons.
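Continuing the sketch above, the two-input-stage critic could be written as follows; the element-wise addition of the state and action paths mirrors the description in the text.

```python
import torch.nn as nn

class Critic(nn.Module):
    """Critic network estimating the action-value function Q(s, a)."""

    def __init__(self, n_obs=3, n_act=2, width=50):
        super().__init__()
        # First stage: states through a ReLU hidden layer, then a linear hidden layer.
        self.obs_path = nn.Sequential(nn.Linear(n_obs, width), nn.ReLU(),
                                      nn.Linear(width, width))
        # Second stage: controls through a linear hidden layer.
        self.act_path = nn.Linear(n_act, width)
        # After element-wise addition: two ReLU hidden layers and a single linear output neuron.
        self.joint = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1),  # estimate of Q(s, a)
        )

    def forward(self, obs, act):
        return self.joint(self.obs_path(obs) + self.act_path(act))
```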
The simulation environment used for training is the LTI model presented in Sec. 2 along with the servo dynamics. Using the LTI model makes training faster and simpler than using the 6 DoF model, which requires expensive wind tunnel tests or time-consuming and labor-intensive computational fluid dynamics (CFD) analysis to find the total aerodynamic and propulsive forces and moments. In many controller designs, and in the absence of high-fidelity wind tunnel or CFD models, a perturbed version of the 6 DoF model with linearized aerodynamic and propulsive models is used [22–24]. LTI model-based training decouples the longitudinal and lateral-directional motion, enabling the controller design to focus solely on longitudinal motion without requiring simultaneous training or simulation of a lateral-directional controller. A comparison was performed to assess the advantage of using an LTI model for RL training: the time required for a 1000-episode training with the LTI model was compared to that with the perturbed 6 DoF model with linear aerodynamic and propulsive forces and moments (a simpler version of the 6 DoF model). The LTI-based RL training was 45% faster than the perturbed 6 DoF-based training. The comparison was made on a computer with a 6-core (12 logical processors) Intel i7-8700 CPU. Another timing comparison was performed for the PPO RL algorithm, using an approach similar to Ref. [14]; there, the LTI-based training was found to be 2.6 times faster than the perturbed 6 DoF-based training.
At the end of each episode, the developed actor was tested on 81 simulations starting from different initial conditions. The sum of discounted rewards in the training episodes, Rtrain, and the average sum of discounted rewards over the 81 test simulations, Rtest, were calculated and used to evaluate the training progress. Figure 3 presents the training progress, along with the critic network predictions of the action-value function.
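A short sketch of how these training-progress metrics could be computed is given below; the discount factor value is illustrative, as it is not stated here.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):  # gamma is an illustrative value
    """Sum of discounted rewards, sum_t gamma^t * r_t, for one episode."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

def mean_test_return(per_episode_rewards, gamma=0.99):
    """Average discounted return over the 81 fixed-initial-condition test simulations."""
    return float(np.mean([discounted_return(r, gamma) for r in per_episode_rewards]))
```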
The highest sum of discounted rewards in the training episodes was obtained at episode 792 and was Rmax,train = −116. The highest average sum of discounted rewards over the 81 test episodes was obtained at episode 828. The controllers obtained from episodes 792 and 828 were tested in simulation and found to have similar performance; the controller from episode 828 was selected for flight testing.
4 Gaussian Process-Based Control Trim Prediction
A GP is used as a nonlinear regression method to automatically calculate the aircraft's control trims over time. It is a non-parametric method with higher data efficiency than parametric nonlinear regression counterparts such as neural networks. It also has a relatively small number of trainable hyperparameters (three for each input), which can be trained offline using actual flight test data.
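As a sketch of this idea, the snippet below fits a GP to observed flight data and predicts the elevator trim. The use of scikit-learn, the single flight-condition input, and the update_trim helper are illustrative assumptions; the kernel simply exposes the three hyperparameters per input (signal variance, length scale, and noise variance) mentioned above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

# Three hyperparameters per input: signal variance, length scale, and noise variance.
# In the real-time setting these would be trained offline on flight test data and
# then held fixed while the GP is refit on newly observed data.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp_elevator_trim = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

def update_trim(gp, x_obs, delta_e_obs, x_query):
    """Refit the GP on observed (flight condition, elevator deflection) pairs and
    predict the trim elevator (mean and standard deviation) at the query condition."""
    gp.fit(np.atleast_2d(x_obs).T, np.asarray(delta_e_obs))
    mean, std = gp.predict(np.atleast_2d(x_query).T, return_std=True)
    return mean, std
```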
5 Flight Results
Before conducting flight tests, the performance of the developed RL longitudinal flight controller for the SkyHunter UAS was evaluated in software- and hardware-in-the-loop simulations (SiTL and HiTL). Three flight tests were then conducted, in which the aircraft was commanded to fly a rectangular flight pattern at constant airspeed and altitude. Figure 4 presents the data collected in these three flights.
In the first two flights, the aircraft could not maintain the desired altitude. Although the aircraft initially maintained the desired altitude in the first flight (Fig. 4(a)), each time it had to turn, the DDPG flight controller failed to follow the commanded pitch angle, causing altitude loss and progressively higher commanded pitch angles to compensate. The poor pitch tracking by the RL flight controller caused the pitch angle commands to saturate and left the aircraft unable to maintain altitude. Starting at the 870 s mark (point A in Fig. 4(a)), the aircraft was performing a counterclockwise turn at the southeast corner of the flight path. The wind recorded for this flight was 10.3–13.2 ft/s from the southeast, as noted in the caption of Fig. 4(a). Thus, the aircraft encountered tailwinds in the section of flight following point A, an adverse flight condition since the aircraft is prone to losing lift. As a result of the turn and the adverse wind, the aircraft lost over 135 ft of altitude in 15 s. In the second flight (Fig. 4(b)), the aircraft continuously lost altitude, losing 135 ft within about 28 s, and the flight had to be terminated.
A common observation is present in the first two flights: the elevator trim angle was set to δe1 = −0.3 deg, the value calculated by the physics-based model (i.e., AAA), whereas the actual elevator trim during these two flights was around −4 to −6 deg. This adversely affected the controller's tracking performance: although the controller correctly commanded negative elevator deflections (δe), the total elevator deflections (δe,total = δe + δe1) were insufficient to reduce the tracking errors because of the significant error in the elevator trim prediction. The aircraft needed more negative total elevator angles to generate greater nose-up pitching moments and reduce the navigation errors. The GP elevator trim predictions calculated offline for the first two flights, shown as solid lines in Figs. 4(a) and 4(b), had more negative values, varying around −4 deg in the first flight and −6 deg in the second flight, while the physics-based model trim was −0.3 deg. These more negative trim angles would have improved the altitude tracking performance.
When the GP trim-finding algorithm was operating in the third flight (Fig. 4(c)), it adjusted the elevator trim angle to more negative values than those used in the first two flights. The GP algorithm evaluated the elevator trim angle to be −3.5 deg at the beginning of the neural network flight and gradually increased the value to −3.3 deg by the end of the flight. Compared to the first two flights, the aircraft had improved altitude tracking and successfully maintained the desired altitude for the entire flight. The altitude tracking root-mean-square error (RMSE) is 8.3 ft in this flight, as presented in Table 1, whereas the aircraft was losing altitude in the first two flights. This is excellent altitude tracking given the high wind condition of 10–17.6 ft/s [14,15]. To evaluate the consistency and performance of the AI/GP flight controller, the rectangular flight pattern was repeated three times.
Table 1  Altitude and airspeed tracking performance of the three flights

| Flight | Altitude tracking RMSE (ft) | Airspeed tracking RMSE (ft/s), first 28 s | Airspeed tracking RMSE (ft/s), overall |
|---|---|---|---|
| 1 | Fell 135 ft in 15 s | 4.5 | 5.2 |
| 2 | Fell 135 ft in 28 s | 4.0 | 4.0 |
| 3 | 8.3 | 1.5 | 2.1 |
Concerning airspeed tracking, the flight data show that the neural network controller maintained a roughly constant airspeed in all three flights; the airspeed did not diverge the way the altitude did. However, the aircraft consistently flew slower than the desired airspeed in the first two flights. In the third flight, where the GP trim-finding algorithm was used, the aircraft had improved airspeed tracking, as observed in Fig. 4. The airspeed tracking RMSE in the third flight is roughly half (or less) of that in the first two flights. Table 1 presents the overall airspeed tracking RMSE of each flight. Since the second flight is only 28 s long, the airspeed tracking RMSE over the first 28 s of each flight is also presented for a fair comparison. Again, the 1.5–2.1 ft/s airspeed tracking RMSE represents excellent tracking performance given the high wind [14,15].
Using the learning GP regression algorithm to predict the trim control settings from observed flight data is critical because errors in the trim settings cannot be identified through simulations of the controller: the simulations use the physics-based model trim values, and there is no way to tell from simulation alone that the actual aircraft requires a different trim setting.
6 Conclusions
In this work, a longitudinal neural network controller is developed for an unmanned aerial system using an LTI dynamic model and the DDPG reinforcement learning algorithm. Post-flight test analysis revealed the sensitivity of RL training with the LTI model to errors in the trim control surface values predicted by the engineering-level analysis software. To overcome this, a learning GP regression algorithm is employed to determine the trim elevator and throttle settings in real time using actual flight data. The RL-based flight controller, complemented with the GP algorithm, provides a simple yet very effective controller framework for autonomous aircraft. RL controllers can be developed using low-fidelity models without significant reduction in tracking performance, and they have the capability to generalize actions around the control policy. The availability and affordability of LTI models will facilitate applications of RL and neural networks in flight control. The methods presented here can be applied to flight control of various aircraft configurations and to other autonomous control applications.
Acknowledgment
This work was supported by the National Aeronautics and Space Administration (NASA) Project # 18CDA067L, the Federal Aviation Administration (FAA) Funding # 15-C-UAS-KU-05, and the State of Kansas. Much appreciation is given to collaborators from the KU Flight Research Lab, especially Aaron McKinnis, Robert Bowes, and Alex Zugazagoitia, for their assistance in flight test support and execution.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.