Introduction
This post presents an overview of open training datasets for radio frequency (RF) signal classification with AI and machine learning. The task of radio signal classification can include automatic modulation classification, signal identification and specific emitter recognition, as outlined in this introductory article. Good training datasets are the backbone of modern machine learning algorithms, such as deep neural networks, and are therefore very important.
Several training datasets for RF signals have been published over the years. The available datasets often have very different properties that influence their applicability to certain classification tasks. Here are some characteristics that can differ between datasets, even if they are designed for similar tasks:
- Data classes or labels: e.g. different sets of modulation types
- Source of RF data: software simulated, lab or real-world environment
- Augmentation: different noise, fading, frequency offsets, etc.
- Data format: IQ/spectral representation, sampling frequency, observation time
- Number of data samples, i.e. the size of the dataset
- Quality of data and labels: number of incorrect labels, unwanted data bias, empty data samples, etc.
In particular, the quality of training data and labels has a major impact on the performance of machine learning systems. Quality problems often become apparent only in real-world operation, because they tend to be less noticeable when training and test data are taken from the same underlying dataset (as is the case with the typical training/validation split).
Overview of Datasets
The following table provides a compact overview of some open datasets for RF signals and is constantly updated. If you think a dataset should be included, please send me a message.
Dataset | Quality* | Task | Classes | Signal Types | Source | Data Format | Dataset Size |
---|---|---|---|---|---|---|---|
RadioML 2018 | Flawed | Modulation Classification | 24 | variants of ASK, PSK, APSK, QAM, AM, FM | Simulated | 1024 x IQ | 2M (21 GB) |
Panoradio HF | OK | Signal Identification | 18 | HF signals | Simulated | 2048 x IQ | 173k (5 GB) |
CSP Blog | OK | Modulation Classification | 8 | variants of PSK, MSK, QAM | Simulated | 32k x IQ | 112k (25 GB) |
Torchsig | OK | Modulation Classification | 53 | variants of ASK, PAM, PSK, QAM, FSK, OFDM | Simulated | 4096 x IQ | 5.3M (340 GB) |
HisarMod | Flawed | Modulation Classification | 26 | variants of FSK, PSK, QAM, PAM and analog | Simulated | 1024 x IQ | 780k (5 GB) |
(*) Quality: As a simple quality check, I inspected random data samples for the plausibility of waveforms and ground truth labels in order to detect severe problems.
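Part of such a plausibility check can be automated. The following is a minimal sketch, assuming the dataset has already been loaded into a NumPy array of complex IQ vectors with shape (number of samples, vector length). The function name, the thresholds and the array layout are assumptions for illustration and are not taken from any of the datasets listed above. It only flags gross defects such as empty, constant or non-finite samples; it cannot detect wrong labels, which still require looking at the waveforms.

```python
import numpy as np


def flag_suspicious_samples(iq, power_floor=1e-12):
    """Flag IQ vectors that are empty, constant, or contain NaN/Inf values.

    iq: complex array of shape (num_samples, vector_length).
    power_floor: hypothetical threshold below which a vector counts as empty.
    Returns a boolean array that is True where a sample looks suspicious.
    """
    iq = np.asarray(iq)
    # A vector with any NaN or Inf entry is unusable for training.
    non_finite = ~np.isfinite(iq).all(axis=1)
    # Mean power per vector; (near) zero power means there is no signal at all.
    power = np.mean(np.abs(iq) ** 2, axis=1)
    empty = power < power_floor
    # Zero variance means every IQ value is identical, i.e. no modulated data.
    constant = np.var(iq, axis=1) < power_floor
    return non_finite | empty | constant


def pick_random_indices(num_samples, count, seed=0):
    """Choose a reproducible random subset of sample indices for manual review."""
    rng = np.random.default_rng(seed)
    return rng.choice(num_samples, size=min(count, num_samples), replace=False)
```

The flagged indices, together with a random subset from `pick_random_indices`, give a short list of samples to plot and compare against their labels by hand.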
RadioML
RadioML was one of the first open datasets for modulation recognition. The most recent version 2018.01 features 24 “textbook” modulation types (different variants of ASK, PSK, APSK, QAM, AM, FM). Each signal consists of 1024 IQ samples. Although RadioML is very widely used in the scientific community, several severe flaws in the data have become public over time (e.g. here). The datasets are now marked as “erratic” on their official website.

Panoradio HF
The Panoradio HF dataset contains different signal classes or modes, rather than “textbook” modulation types. There are 18 signal classes that are used in the shortwave band, including different analog and digital modes (Morse code, SSB, AM, radiofax, Navtex and various digital modes from amateur radio). The data samples are vectors of 2048 IQ samples with a sampling frequency of 6 kHz. The signals contain varying data content, such as text, images, speech and music. Augmentations are random frequency and phase offsets, varying SNR and different ionospheric fading channels according to the CCIR-520 standard.
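To make the augmentations more concrete, here is a minimal sketch of a random frequency and phase offset followed by additive white Gaussian noise at a target SNR. It is not the generation code of the Panoradio HF dataset: the function name, the offset range and the test tone are assumptions for illustration, and the ionospheric fading channel is deliberately left out. Only the vector length (2048) and the sampling frequency (6 kHz) are taken from the description above.

```python
import numpy as np


def augment_iq(iq, fs, snr_db, max_freq_offset_hz, rng):
    """Apply a random frequency/phase offset and add white noise at a target SNR.

    iq: 1-D complex array holding one signal vector.
    fs: sampling frequency in Hz.
    snr_db: desired signal-to-noise ratio in dB (relative to the vector's mean power).
    max_freq_offset_hz: hypothetical bound for the uniformly drawn frequency offset.
    rng: numpy.random.Generator used for all random draws.
    """
    iq = np.asarray(iq, dtype=np.complex128)
    n = np.arange(iq.size)
    # Draw one frequency offset and one start phase per vector.
    f_off = rng.uniform(-max_freq_offset_hz, max_freq_offset_hz)
    phase = rng.uniform(0.0, 2.0 * np.pi)
    shifted = iq * np.exp(1j * (2.0 * np.pi * f_off * n / fs + phase))
    # Scale complex Gaussian noise so that signal power / noise power = SNR.
    sig_power = np.mean(np.abs(shifted) ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    noise = np.sqrt(noise_power / 2.0) * (
        rng.standard_normal(iq.size) + 1j * rng.standard_normal(iq.size)
    )
    return shifted + noise


# Example with the data format described above: 2048 IQ samples at 6 kHz.
fs = 6000.0
num_samples = 2048
observation_time = num_samples / fs  # roughly a third of a second per vector

# Placeholder test tone at 500 Hz standing in for a real signal.
tone = np.exp(2j * np.pi * 500.0 * np.arange(num_samples) / fs)
noisy = augment_iq(tone, fs, snr_db=10.0, max_freq_offset_hz=100.0,
                   rng=np.random.default_rng(0))
```

Drawing a new offset, phase and SNR for every vector keeps a classifier from relying on a fixed carrier position or noise level, which is the purpose of these augmentations.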

CSP Blog
Several datasets that cover different modulation types for modulation recognition are available from the cyclostationary blog. The most prominent is CSPB.ML.2018R2, which contains 8 different modulation types, mostly PSK and QAM (BPSK, QPSK, 8-PSK, DQPSK, MSK, 16-QAM, 64-QAM, 256-QAM). The data samples are comparatively long at 32,768 IQ samples. Augmentations include randomly varying symbol rate, SNR, frequency offset and pulse shaping roll-off.

Torchsig
Torchsig is an interesting dataset that features a comparatively large number of modulation types (53 variants of ASK, PAM, PSK, QAM, FSK, OFDM). The generation software is open source and still maintained, so custom or adapted datasets can be created. In the basic configuration, signals consist of 4096 IQ samples. In addition, Torchsig also features wideband scenarios that include multiple signals in one data sample, a task that is more complex than pure modulation recognition.

HisarMod
HisarMod includes 26 modulation classes (variants of FSK, PSK, QAM, PAM and analog modulations). Each data sample consists of 1024 IQ samples. Augmentations are randomly varying SNR and wireless fading channels. Some modulation parameters are fixed, such as the oversampling rate (2) and the roll-off factor of the RC shaping filter (0.35). Unfortunately, inspecting the dataset reveals multiple severe issues, including questionable label assignment, strange waveforms and the absence of modulated data, which raises questions about its usefulness.
