This article shows how to generate good training data for RF signal classification tasks, such as automatic modulation classification (AMC) or radio signal identification.
The Need for RF Training Data
Machine learning algorithms often use large amounts of example data to train neural networks for a specific recognition task. For the domain of RF signal classification, this data consists of many labelled examples of RF signals for each of the output classes.
During training, the neural network extracts signal features from the training data that are useful to distinguish the classes. It is important to understand that this training process is purely data-driven, i.e. it is based only on the data and no other prior knowledge. A neural network has no idea about signals, different signal modulations or modes. To a neural network, RF data is just a sequence of numbers, and it merely tries to find similarities that the training examples of each class have in common.
Unfortunately there is no guarantee that the similarities or features it finds are useful to discriminate between the actual classes. Problems often occur when the training data has a so-called “bias” that can easily mislead a neural network. Biased training data exhibits accidental properties that make the training data easy to classify – but these properties do not correspond to actual signal features and thus do not work in practice. An example of such a bias: most examples of class A have low SNR, while most examples of class B have high SNR. Then the probability is high that the neural network simply learns to distinguish the classes by the amount of noise present. In this case the neural network will fail in practical operation.
A neural network can learn to make predictions only based on the available training data. Therefore it is vital to use good training data. Good data embodies the actual characteristics of a signal class, but at the same time contains a lot of variation in signal properties that are not characteristic of the class. If these two requirements are fulfilled, a neural network can learn to focus on characteristic features and, at the same time, to ignore irrelevant properties such as noise.
Typical RF Setup
The typical situation for RF machine learning is shown in the figure below. We assume that the RF ML system processes signals on the receiver side.
The signals at the output of a receiver have gone through different impairments on their way from the transmitter input to the receiver output. The chain starts with an undistorted waveform in the transmitter, called the “clean signal” in the following. This clean signal undergoes various distortions or impairments that make it look very different at the receiver output. These impairments include noise and fading from the channel, deviations between the local oscillators in receiver and transmitter, deviations of the ADC/DAC sample clocks, interference from other signals and the influence of band-limiting filters in both transmitter and receiver.
In a good training dataset all of these effects are modeled properly, in order to make the neural network learn that e.g. frequency and phase offsets or noise and fading are not characteristic features of a specific class but need to be ignored during prediction.
The remainder of this article focuses on synthetic training data that is generated using algorithms and software only. This approach gives the engineer maximum control over the properties of the training signals. Some research works also use real RF data obtained from measurements with real hardware setups in the lab. In this case, great care must be taken not to introduce a bias into the training data, e.g. due to constant “lab channel conditions”, constant frequency offsets, etc.
Introducing Diversity to RF Training Data
In the following, the important signal impairments for RF machine learning training data are presented step by step. The example signals come from a simple signal class, namely Morse code modulation.
Content Diversity
A clean Morse code signal, like most other signals, varies over time. Its waveform changes depending on the transmitted content, e.g. the character currently being sent. The figure below shows four short segments of Morse code waveforms. They all look different, but belong to the same class (“Morse code”).
It is important to include a sufficiently large amount of diverse transmitted content in the training data. This makes the neural network learn that “Morse code” consists of many differently shaped waveforms that all share the typical on-off-keying modulation at typical speeds. I call these variations, which are based on the actual signal content, “content diversity”; they form a good basis for building a training dataset.
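As a rough illustration of content diversity, here is a minimal Python sketch that generates baseband on-off-keying envelopes for random Morse content. The function and variable names, the sample rate, the keying speed and the (shortened) code table are my own assumptions for illustration, not taken from any particular dataset.

```python
import numpy as np

# Shortened Morse code table for illustration only
MORSE = {"A": ".-", "B": "-...", "C": "-.-.", "E": ".", "T": "-", "S": "...", "O": "---"}

def morse_waveform(text, fs=8000, wpm=20):
    """Generate a baseband on-off-keying envelope (0/1) for the given text."""
    dit = 1.2 / wpm                                 # dit duration in seconds (PARIS convention)
    unit = int(round(dit * fs))                     # samples per dit
    keying = []
    for char in text.upper():
        if char == " ":
            keying += [0] * 7 * unit                # word gap: 7 units off
            continue
        for symbol in MORSE.get(char, ""):
            on = 1 if symbol == "." else 3          # dit = 1 unit, dah = 3 units
            keying += [1] * on * unit + [0] * unit  # element + intra-character gap
        keying += [0] * 2 * unit                    # pad to 3 units between characters
    return np.array(keying, dtype=float)

# Content diversity: many random texts lead to many differently shaped waveforms
rng = np.random.default_rng(0)
texts = ["".join(rng.choice(list(MORSE), size=8)) for _ in range(100)]
examples = [morse_waveform(t) for t in texts]
```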
Frequency Offset Diversity
Frequency offset models the deviation of the local oscillator frequency between transmitter and receiver. In the literature, this is often termed “carrier frequency offset” (CFO). Introducing a frequency offset simply means shifting the signal’s spectrum up or down in frequency. A training dataset should include different (random) frequency offsets from the range that is expected in the final application. A neural network trained with many different frequency offsets can learn that the carrier frequency is not a good feature to distinguish between the classes. This is exactly what we want.
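A minimal sketch of this step, assuming a complex baseband signal and a sample rate and offset range chosen for illustration: the spectrum is shifted by multiplying the signal with a complex exponential.

```python
import numpy as np

rng = np.random.default_rng()
fs = 8000                                           # assumed sample rate
x = np.exp(2j * np.pi * 0.02 * np.arange(4000))     # toy complex baseband signal

def add_frequency_offset(x, fs, f_offset):
    """Shift the complex baseband signal x by f_offset Hz (positive = upwards)."""
    n = np.arange(len(x))
    return x * np.exp(2j * np.pi * f_offset * n / fs)

# Frequency offset diversity: draw a random CFO from the expected range, e.g. +/-200 Hz
x_cfo = add_frequency_offset(x, fs, rng.uniform(-200, 200))
```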
Phase Offset Diversity
A phase offset may be introduced by a phase difference between the local oscillators and/or the ADC/DAC sampling clocks, as well as by phase shifts during signal propagation. Since the phase shift is practically unpredictable in a real application, random phase shifts covering the full range from 0° to 360° should be included in the dataset.
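In a complex baseband representation this is simply a rotation by a random angle, as in the following sketch (the toy signal and names are again my own assumptions):

```python
import numpy as np

rng = np.random.default_rng()
x = np.exp(2j * np.pi * 0.02 * np.arange(4000))     # toy complex baseband signal

# Phase offset diversity: rotate the signal by a random phase from 0 to 360 degrees
phi = rng.uniform(0.0, 2.0 * np.pi)
x_phase = x * np.exp(1j * phi)
```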
Sample Frequency Offset Diversity
A frequency deviation between the DAC in the transmitter and the ADC in the receiver typically occurs in practice. Although these deviations are often small, e.g. below 1%, they are present in actual receiver data. Note that sample frequency offset is different from carrier frequency offset: a sample rate offset squeezes or expands the signal in time and therefore also in frequency or bandwidth.
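One possible way to model this, sketched below under my own assumptions about signal and offset range, is to resample the signal by a random factor close to 1 using a rational approximation of the ratio.

```python
import numpy as np
from fractions import Fraction
from scipy.signal import resample_poly

rng = np.random.default_rng()
x = np.exp(2j * np.pi * 0.02 * np.arange(4000))     # toy complex baseband signal

# Sample frequency offset diversity: resample by a random factor close to 1 (here +/-1%)
ratio = 1.0 + rng.uniform(-0.01, 0.01)
frac = Fraction(ratio).limit_denominator(1000)      # approximate the ratio as p/q
x_sro = resample_poly(x, frac.numerator, frac.denominator)
```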
Noise Diversity
There are different sources of noise, such as receiver noise or atmospheric disturbances. Noise can be modelled e.g. by Gaussian (white) noise or by more advanced noise models. The training data should contain different SNR values in order to achieve good noise diversity.
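For the simple white Gaussian case, a sketch could look as follows; the SNR range and toy signal are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()
x = np.exp(2j * np.pi * 0.02 * np.arange(4000))     # toy complex baseband signal

def add_awgn(x, snr_db, rng):
    """Add complex white Gaussian noise to reach the given SNR in dB."""
    sig_power = np.mean(np.abs(x) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (rng.standard_normal(len(x))
                                        + 1j * rng.standard_normal(len(x)))
    return x + noise

# Noise diversity: draw a random SNR per training example, e.g. from -5 to 25 dB
x_noisy = add_awgn(x, snr_db=rng.uniform(-5, 25), rng=rng)
```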
Fading Diversity
Fading is the variation of signal amplitude over time and/or frequency. It occurs due to multi-path propagation or varying channel attenuation. Various channel models are available for different types of wave propagation. Simple models for mobile communications are Rayleigh and Rician fading. For the propagation of shortwave signals the Watterson model is typically applied, which also includes Doppler effects. Channel models often have several parameters with which the exact channel behaviour can be set (e.g. Recommendation ITU-R F.1487 defines 10 different types of Watterson fading channels).
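As a very simple example (not the Watterson model), flat Rayleigh fading can be sketched by multiplying the signal with a slowly varying complex Gaussian gain; the Doppler range and helper names below are my own illustrative choices.

```python
import numpy as np
from scipy.signal import resample

rng = np.random.default_rng()
fs = 8000
x = np.exp(2j * np.pi * 0.02 * np.arange(4000))     # toy complex baseband signal

def rayleigh_flat_fading(x, fs, doppler_hz, rng):
    """Multiply x with a slowly varying complex Gaussian gain (flat Rayleigh fading)."""
    # Draw the gain process at roughly the Doppler rate, then interpolate to fs
    n_gain = max(int(np.ceil(len(x) * 2 * doppler_hz / fs)), 4)
    g = rng.standard_normal(n_gain) + 1j * rng.standard_normal(n_gain)
    g = resample(g, len(x))                         # band-limited interpolation
    g /= np.sqrt(np.mean(np.abs(g) ** 2))           # normalize average power gain to 1
    return x * g

# Fading diversity: random Doppler spread per example, e.g. 0.5 to 5 Hz
x_faded = rayleigh_flat_fading(x, fs, doppler_hz=rng.uniform(0.5, 5.0), rng=rng)
```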
Interference Diversity
Sometimes a received signal includes interference from another signal in the background. There are many different scenarios in which interference is present. Some examples are: accidental use of a channel by two users, intentional interference (jamming) or signal artifacts originating e.g. from intermodulation. Interference can occur in a large variety of forms, with different interfering signal types, amplitudes, frequency offsets or numbers of interferers.
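A minimal sketch, using a simple tone as the interferer and a randomly drawn signal-to-interference ratio and frequency (both illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng()
fs = 8000
n = np.arange(4000)
x = np.exp(2j * np.pi * 0.02 * n)                   # toy complex baseband signal

# Interference diversity: add a second signal (here simply a tone) with random
# frequency offset and random power relative to the wanted signal
sir_db = rng.uniform(5, 30)                         # signal-to-interference ratio in dB
f_int = rng.uniform(-0.4, 0.4) * fs                 # random interferer frequency in Hz
amp = np.sqrt(np.mean(np.abs(x) ** 2) / 10 ** (sir_db / 10))
interferer = amp * np.exp(2j * np.pi * f_int * n / fs)
x_interf = x + interferer
```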
Band-limiting Filters
A typical radio system uses different filters in the transmitter and receiver that limit the bandwidth of a signal. Examples are waveform shaping filters, IF filters or high-power analog filters that limit out-of-band emissions. Note that on the receiver side a signal is often already contaminated by noise. Therefore filtering does not only change the signal waveform itself, but also the shape of the noise, possibly leading to colored noise. Different filters with different widths introduce some amount of channel bandwidth diversity.
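A simple way to sketch this is a low-pass filter with a randomly drawn bandwidth applied to the (possibly already noisy) baseband signal; the filter length and bandwidth range below are assumptions for illustration.

```python
import numpy as np
from scipy.signal import firwin, lfilter

rng = np.random.default_rng()
fs = 8000
x = np.exp(2j * np.pi * 0.02 * np.arange(4000))     # toy complex baseband signal

# Bandwidth diversity: band-limit the signal with a filter of random width,
# which also shapes (colors) any noise already present
bw = rng.uniform(500, 3000)                         # random filter bandwidth in Hz
taps = firwin(129, bw, fs=fs)                       # FIR low-pass filter design
x_filtered = lfilter(taps, 1.0, x)
```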
Summary
It is important to include a large amount of diversity in the training data for an RF ML system. Diverse data includes many different signal impairments, in order to make the neural network ignore the impairments that occur in its practical operation and learn only the actual signal patterns of the classes.
The most important types of impairments for RF systems have been introduced above. A good training dataset includes random combinations of these impairments, plus content diversity, as shown in the figure below.
The exact distributions of the impairments (e.g. the range of SNR values, the expected type of fading, whether interference is expected or not) can be tailored to the application. They also depend on the ML task to be carried out (signal classification, spectral segmentation, etc.). The Panoradio HF dataset also contains some of the presented impairments for shortwave signal classification.
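Putting the pieces together, a data generation pipeline could draw random parameters for each example and chain the impairments. The sketch below reuses the hypothetical helper functions from the earlier sketches (morse_waveform, add_frequency_offset, rayleigh_flat_fading, add_awgn) and illustrative parameter ranges; it is not the pipeline used for any published dataset.

```python
import numpy as np

rng = np.random.default_rng()
fs = 8000

def make_training_example(text):
    """Combine the sketches above into one randomly impaired training example."""
    x = morse_waveform(text, fs=fs).astype(complex)           # content diversity
    x = add_frequency_offset(x, fs, rng.uniform(-200, 200))   # frequency offset diversity
    x = x * np.exp(1j * rng.uniform(0, 2 * np.pi))            # phase offset diversity
    x = rayleigh_flat_fading(x, fs, rng.uniform(0.5, 5.0), rng)  # fading diversity
    x = add_awgn(x, rng.uniform(-5, 25), rng)                 # noise diversity
    return x
```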
Generating data synthetically is a good way to control the diversity of a dataset. Care must be taken when real-world data is used, because such data often has a bias, and additional steps are needed to obtain good training data.
If you are interested in how the different impairments influence the performance of a neural network under real-world operation, please take a look at my paper on RF Signal Classification with Synthetic Training Data and its Real-World Performance (2022).
Dear Stefan,
Just watched your videos from SDRA’23 and found this website. The explanations of the classification methods are quite well presented. Just wondering if the “all conv net” or “deep CNN” configuration could be run on a single board computer like a Raspberry Pi 4.
Is it correct that you used Python libraries for programming, and would any code be available for trying to replicate your findings on real-world data? I am running a KiwiSDR server with reasonable performance in Jakarta.
best regards,
John
Hello John,
In principle the nets can also be executed on a Raspberry Pi, although they have not been specifically designed for low computational load. In the end, the crucial question is how fast they are required to run.
The algorithms are indeed written in Python; for more information please contact me by email.
best regards