2026-05-02: Evolution of Language-Guided Control in UAV Systems
From Search and Rescue (SAR) to intelligence, surveillance and reconnaissance (ISR), small unmanned aerial systems (sUAS) have become increasingly prominent in scientific and military applications. One of the most important aspects of operating these platforms is how they are directed. This post traces the evolution of control for these systems from 2017 to 2026, from the approach used in “'Fly Like This': Natural Language Interface for UAV Mission Planning” [1] to the Large Language Models (LLMs) examined in “A Universal Large Language Model-Drone Command and Control Interface” [2]. Both papers explore how natural language interfaces (NLIs), and human-system interfaces more broadly, can be used to direct the actions of these platforms. Taken together, they reflect the technological capabilities available at the time of their release and illustrate nearly a decade of progress. Looking across that span, it is natural to ask: what's next?
Fly Like This (2017)
In this paper, published in March 2017, the authors perform a Human-System Integration (HSI) evaluation of the effectiveness, efficiency, and naturalness of several input interfaces, specifically speech, hand gesture, and a traditional mouse, for defining the flight path of an Unmanned Aerial Vehicle (UAV). With each interface, the user's task was to specify one of 12 flight path segments. As segments are generated through the chosen interface, they are linked together to form the UAV’s planned flight path.
Figure 1. Gesture library of 12 trajectory segments developed by Chandarana et al. [3]
Interfaces
Gestures
Using a Leap Motion Controller (now known as Ultraleap) and its accompanying SDK v2.2.6, 12 right-hand gestures were converted into the 12 trajectory segments that make up the UAV flight path. The controller pairs two infrared cameras with three infrared LEDs to achieve sub-millimeter accuracy within roughly eight cubic feet of interactive space. Each hand motion was chosen to reflect the overall motion the user expects the UAV to take. For example, if the user intends for the UAV to fly an orbit, they trace a circular pattern with their hand. Each segment was confirmed with a swipe to the right for yes or to the left for no.
Figure 2: Leap Motion Controller (available in 2017)
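To make the gesture-to-segment step more concrete, here is a minimal, purely illustrative sketch in Python. It is not the classifier used in [1] or [3]; it simply shows how per-frame palm positions, as a hand-tracking SDK might report them, could be reduced to a coarse trajectory label.

```python
# Illustrative sketch only, not the classifier from [1]/[3]: reduce a recorded
# palm trajectory (one (x, y, z) sample per frame, in millimeters) to a coarse
# trajectory segment label.
import math

def classify_gesture(palm_positions):
    """palm_positions: list of (x, y, z) samples, one per frame."""
    xs, ys, _zs = zip(*palm_positions)
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    net = math.hypot(dx, dy)                      # net displacement of the hand
    path = sum(math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i])
               for i in range(len(xs) - 1))       # total distance actually traced
    if path > 3 * max(net, 1e-6):
        return "circle"       # hand travelled far but ended near where it started
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "up" if dy > 0 else "down"

# Example: a horizontal sweep to the right maps to the "right" segment.
sweep = [(i * 10.0, 200.0, 0.0) for i in range(20)]
print(classify_gesture(sweep))  # -> right
```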
Speech
The speech interface used a commercial off-the-shelf (COTS) microphone, with the audio processed by Carnegie Mellon University’s (CMU) Sphinx software and its built-in US-English acoustic and language models. Sphinx is speech recognition software that converts digitized acoustic signals into text. Custom software then mapped words such as “right”, “left”, “circle”, and “spiral” to UAV flight path segments, and each segment was confirmed with a verbal “yes” or “no”.
Figure 3: Audio-Technica PRO 8HEmW
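As a rough illustration of the word-to-segment mapping just described (not the authors' implementation), the sketch below listens for recognized utterances and translates known keywords into segment labels. It assumes the pocketsphinx Python bindings with their default US-English models; the spoken "yes"/"no" confirmation loop is omitted for brevity.

```python
# Illustrative sketch only; assumes the pocketsphinx Python bindings and their
# default US-English acoustic/language models, not the setup from [1].
from pocketsphinx import LiveSpeech

# Keyword-to-segment mapping in the spirit of the paper's vocabulary.
SEGMENTS = {
    "right": "turn_right",
    "left": "turn_left",
    "circle": "orbit",
    "spiral": "spiral",
}

flight_path = []
for phrase in LiveSpeech():              # yields one recognized utterance at a time
    for word in str(phrase).lower().split():
        if word in SEGMENTS:
            flight_path.append(SEGMENTS[word])
            print("Segments so far:", flight_path)
            # The study confirmed each segment with a verbal "yes"/"no";
            # that confirmation step is omitted here.
```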
Mouse
As a baseline, a traditional mouse interface was paired with a drop-down menu containing the 12 trajectory segments. The user selected the type of segment they wanted to generate, then confirmed the choice through a yes/no pop-up window to add that segment to the flight path.
Accuracy and Speed
How quickly can each of the three interfaces translate human intent into actionable parameters for a flight path? The paper shows that in both accuracy and speed, the mouse won by a relatively large margin.
Figure 4: Comparison of mouse, speech, and gesture input for UAV flight path entry from “Fly Like This: Natural Language Interfaces for UAV Mission Planning” [1]
In a real-world operational environment, the operator faces a significantly higher cognitive load than that represented in a controlled study. Beyond the mechanical act of inputting coordinates or gestures, the operator must first perform situational assessment and path optimization. This "pre-input" phase represents a substantial cognitive overhead that is often decoupled from the efficiency of the interface itself.
The Pre-LLM Conclusion
The study described above is representative of the state of the art in 2017: by then, the underlying technology for recognizing speech and tracking hand gestures had become accurate enough to generate UAV flight paths with relatively few errors. Even today, implementing such systems remains technically impressive. Fast forward to late 2022, however, and publicly available LLMs arrived on the scene. That development substantially advanced human-system integration, particularly in the capabilities and applications of natural language processing.
A Universal Large Language Model-Drone Command and Control Interface (2026)
ChatGPT launched publicly in November 2022, marking the moment Large Language Models became mainstream and publicly available. These LLMs approximate human intent by pattern-matching text input and generating aligned responses. They not only process human language but also handle a wide range of structured data formats, including JSON.
Introduced in late 2024, the Model Context Protocol (MCP) was subsequently adopted by numerous LLM platforms, enabling these models to access a range of tools that extend their functional capabilities. In “A Universal Large Language Model-Drone Command and Control Interface” [2], the authors describe a methodology that lets LLMs control drone behavior through an MCP server. This represents a significant step forward in translating human intent into UAV action.
The Architecture & Interface
Figure 5: System architecture of the MCP-based drone control interface from “A Universal Large Language Model-Drone Command and Control Interface” [2]
The authors demonstrated the concept with an architecture built from three primary components: an LLM, an MCP server, and a drone (virtual or physical). The LLM is an interchangeable component; models from providers such as OpenAI, Anthropic, and Google (Gemini) can all fill the role. Nor is the LLM restricted to a single MCP server, making for a modular system in which the architect can pick and choose the MCP servers best suited to the system being constructed. The authors connected the Google Maps MCP server to the LLM to provide navigational information, while the drone-controlling MCP server was a custom server called ‘droneserver’, provided freely along with its source code on GitHub. Together, these two servers gave the LLM the information and control required to produce Micro Air Vehicle Link (MAVLink) messages. MAVLink is a common standard in the drone community that carries command and control to the drone, as well as telemetry from it.

The third component, the drone itself, can be either virtual or physical; in both cases the authors employed the control and simulation software ArduPilot. Initially, the MCP servers were connected to a virtual drone controlled by ArduPilot in a simulation environment such as Gazebo. For the physical system, a drone equipped with an internet-connected Raspberry Pi Zero W ran the MCP server locally and interfaced with the flight controller.
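To give a feel for how a drone-controlling MCP server might expose flight commands as tools, here is a hypothetical sketch, not the authors' published droneserver code. It combines the MCP Python SDK's FastMCP helper with pymavlink to let an LLM request a takeoff over a MAVLink link; the connection string, tool name, and local ArduPilot SITL setup are assumptions.

```python
# Hypothetical sketch of an MCP tool wrapping a MAVLink command; not the
# authors' droneserver. Assumes the MCP Python SDK (FastMCP) and pymavlink,
# with an ArduPilot SITL instance listening on udp:127.0.0.1:14550.
from mcp.server.fastmcp import FastMCP
from pymavlink import mavutil

mcp = FastMCP("droneserver-sketch")
master = mavutil.mavlink_connection("udp:127.0.0.1:14550")
master.wait_heartbeat()  # block until the autopilot announces itself

@mcp.tool()
def takeoff(altitude_m: float) -> str:
    """Arm the vehicle in GUIDED mode and command a takeoff to altitude_m meters."""
    master.set_mode("GUIDED")
    master.arducopter_arm()
    master.motors_armed_wait()
    master.mav.command_long_send(
        master.target_system, master.target_component,
        mavutil.mavlink.MAV_CMD_NAV_TAKEOFF,
        0,                      # confirmation
        0, 0, 0, 0, 0, 0,       # params 1-6 unused for this command
        altitude_m,             # param 7: target altitude (meters)
    )
    return f"Takeoff to {altitude_m} m commanded."

if __name__ == "__main__":
    mcp.run()  # serve the tool to the LLM client (stdio transport by default)
```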
The Evolution of Objective-Based Control
A decade of progress has replaced rigid, segment-by-segment controls with flexible, mission-level direction. Users can zoom out to issue broad, high-level objectives or dive into the details of specific flight paths, choosing the level of abstraction the mission requires. For example, in an ISR application, a user can tell the LLM to search a given area, and it can invoke the appropriate algorithm to maximize a specific goal such as fuel efficiency or coverage. Alternatively, the user can specify tight flight paths for the UAV to follow, or mix the two approaches.
With the addition of the MCP server, the LLM gains access to information the user isn't required to specify or even consider. If the objective is fuel efficiency, the LLM could autonomously retrieve the latest weather data and plan a course across different altitudes, capitalizing on tailwinds to extend the platform's range and endurance, as sketched below.
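As a toy illustration of that idea (not taken from [2]), the sketch below selects the cruise altitude whose forecast wind gives the largest tailwind component along the planned course. In the actual system, the LLM would obtain the forecast through a weather-capable tool rather than the hard-coded table used here.

```python
# Illustrative sketch (not from [2]): pick the cruise altitude whose forecast
# wind gives the strongest tailwind component along the planned course.
import math

def tailwind_component(wind_speed_ms, wind_from_deg, course_deg):
    """Component of the wind (m/s) pushing the aircraft along its course."""
    wind_to = math.radians(wind_from_deg + 180.0)   # wind blows toward this heading
    course = math.radians(course_deg)
    return wind_speed_ms * math.cos(wind_to - course)

def best_altitude(forecast, course_deg):
    """forecast: {altitude_m: (wind_speed_ms, wind_from_deg)}"""
    return max(
        forecast,
        key=lambda alt: tailwind_component(*forecast[alt], course_deg),
    )

# Made-up example data the LLM might fetch via a weather tool.
forecast = {50: (3.0, 90.0), 100: (6.0, 270.0), 150: (8.0, 250.0)}
print(best_altitude(forecast, course_deg=90.0))  # course heading due east -> 150
```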
Ultimately, the introduction of the LLM has shifted control from manual navigation to strategic oversight. By abstracting away the technical burdens of UAV flight, these systems empower the operator to focus on the mission. Looking ahead, a single operator is no longer restricted to one or a few drones, but can direct an entire swarm that carries out the user's intent.
- John Deasy
References
[1] Saephan, M., Meszaros, E., Trujillo, A., & Allen, B. (2017). Fly like this: Natural language interfaces for UAV mission planning. In Proceedings of the Tenth International Conference on Advances in Computer-Human Interactions (ACHI 2017) (pp. 40–46). IARIA.
[2] Ramos-Silva, J. N., & Burke, P. J. (2026). A universal large language model -- drone command and control interface. arXiv.
[3] Chandarana, M., Meszaros, E. L., Trujillo, A. C., & Allen, B. D. (2016). Natural language and gesture-based interfaces for UAV mission planning. AIAA AVIATION Forum, Washington, D.C.