February 9, 2025

Overview of Open Datasets for RF Signal Classification

Introduction

This post presents an overview of open training datasets for radio frequency (RF) signal classification with AI and machine learning. The task of radio signal classification can include automatic modulation classification, signal identification and specific emitter recognition, as outlined in this introductory article. Good training datasets are the backbone of modern machine learning algorithms, such as deep neural networks, and are therefore very important.

Several training datasets for RF signals have been published over the years. The available datasets often have very different properties that influence their applicability to certain classification tasks. Here are some characteristics that can differ between datasets, even if they are designed for similar tasks:

  • Data classes or labels: e.g. different sets of modulation types
  • Source of RF data: software simulated, lab or real-world environment
  • Augmentation: different noise, fading, frequency offsets, etc.
  • Data format: IQ/spectral representation, sampling frequency, observation time
  • Number of data samples, i.e. the size of the dataset
  • Quality of data and labels: number of incorrect labels, unwanted data bias, empty data samples, etc.
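To make the data format item more concrete, here is a minimal sketch of how a vector of IQ samples can be turned into a spectral representation, and how sample count and sampling frequency determine the observation time. It is not taken from the tooling of any dataset; the function names and the `nfft` and `hop` parameters are illustrative assumptions.

```python
import numpy as np

def iq_to_spectrogram(iq, nfft=64, hop=32):
    """Return a (frames, nfft) power spectrogram in dB of a complex IQ vector.

    nfft and hop are illustrative defaults, not values used by any dataset.
    """
    iq = np.asarray(iq, dtype=np.complex128)
    window = np.hanning(nfft)
    n_frames = 1 + (len(iq) - nfft) // hop
    # Cut the signal into overlapping, windowed segments
    frames = np.stack([iq[i * hop : i * hop + nfft] * window for i in range(n_frames)])
    # fftshift moves negative frequencies to the left, so DC sits in the centre
    spectrum = np.fft.fftshift(np.fft.fft(frames, axis=1), axes=1)
    # Small constant avoids log(0) for empty segments
    return 10 * np.log10(np.abs(spectrum) ** 2 + 1e-12)

def observation_time(n_samples, sample_rate_hz):
    """Duration in seconds covered by one data sample."""
    return n_samples / sample_rate_hz
```

For example, 2048 IQ samples at a sampling frequency of 6 kHz (the Panoradio HF format described below) cover 2048 / 6000 ≈ 0.34 seconds of signal.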

In particular, the quality of training data and labels has a major impact on the performance of machine learning systems. This issue often appears only in real-world operation, as it tends to be less noticeable when training and test data are taken from the same underlying dataset (as is the case with the typical training/validation split).

Overview of Datasets

The following table provides a compact overview of some open datasets for RF signals and is constantly updated. If you think a dataset should be included, please send me a message.

Dataset      | Quality* | Task                      | Classes | Signal Types                              | Source    | Data Format | Dataset Size
-------------|----------|---------------------------|---------|-------------------------------------------|-----------|-------------|--------------
RadioML 2018 | Flawed   | Modulation Classification | 24      | variants of ASK, PSK, APSK, QAM, AM, FM   | Simulated | 1024 x IQ   | 2M (21 GB)
Panoradio HF | OK       | Signal Identification     | 18      | HF signals                                | Simulated | 2048 x IQ   | 173k (5 GB)
CSP Blog     | OK       | Modulation Classification | 8       | variants of PSK, MSK, QAM                 | Simulated | 32k x IQ    | 112k (25 GB)
Torchsig     | OK       | Modulation Classification | 53      | variants of ASK, PAM, PSK, QAM, FSK, OFDM | Simulated | 4096 x IQ   | 5.3M (340 GB)
HisarMod     | Flawed   | Modulation Classification | 26      | variants of FSK, PSK, QAM, PAM and analog | Simulated | 1024 x IQ   | 780k (5 GB)

(*) Quality: As a simple quality check, I have inspected random data samples for the plausibility of their waveforms and ground truth labels in order to detect severe problems.
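Part of such a plausibility check can be automated before looking at individual waveforms. The following is a minimal sketch, not the procedure used for the table above: it flags data samples that are empty or carry a strong DC offset. The function name and both thresholds (`empty_power`, `dc_ratio`) are illustrative assumptions and would need tuning for each dataset.

```python
import numpy as np

def frame_diagnostics(frames, empty_power=1e-8, dc_ratio=0.1):
    """Flag suspicious IQ data samples.

    frames: complex array of shape (num_samples, samples_per_frame).
    empty_power and dc_ratio are hypothetical thresholds, chosen for illustration.
    """
    frames = np.asarray(frames, dtype=np.complex128)
    # Mean power of each data sample
    power = np.mean(np.abs(frames) ** 2, axis=1)
    # Magnitude of the complex mean, i.e. the DC component
    dc = np.abs(np.mean(frames, axis=1))
    # Fraction of the total power that sits at DC
    dc_power_fraction = dc ** 2 / np.maximum(power, 1e-30)
    is_empty = power < empty_power
    return {
        "power": power,
        "is_empty": is_empty,
        "has_dc_offset": (dc_power_fraction > dc_ratio) & ~is_empty,
    }
```

Checks like these only catch gross problems. Incorrect labels or implausible waveforms still require looking at the signals, for instance in time domain and as a spectrogram.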

RadioML

RadioML was one of the first open datasets for modulation recognition. The most recent version 2018.01 features 24 “textbook” modulation types (different variants of ASK, PSK, APSK, QAM, AM, FM). Each signal consists of 1024 IQ samples. Although RadioML is very widely used in the scientific community, several severe flaws in the data have become public over time (e.g. here). The datasets are now marked as “erratic” on their official website.

Example signals of the RadioML dataset in time domain and spectrogram
RadioML 2018.01 examples (high SNR). Note the DC offset in the two examples on the left.

Panoradio HF

The Panoradio HF dataset contains different signal classes or modes, rather than “textbook” modulation types. There are 18 signal classes that are used in the shortwave band, including different analog and digital modes (Morse code, SSB, AM, radiofax, Navtex and various digital modes from amateur radio). The data samples are vectors of 2048 IQ samples with a sampling frequency of 6 kHz. The signals contain varying data content, such as text, images, speech and music. Augmentations are random frequency and phase offsets, varying SNR and different ionospheric fading channels according to standard CCIR-520.

Example signals of the Panoradio HF dataset in time domain and spectrogram
Panoradio HF examples of shortwave signals at a sampling frequency of 6 kHz (high SNR). The fading channels are clearly visible.
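Augmentations such as frequency offset, phase offset and varying SNR can be sketched in a few lines. This is an illustration under simple assumptions, not the Panoradio HF generation code: the function name and parameters are hypothetical, the noise is plain complex white Gaussian noise, and the ionospheric fading channels are not modelled at all.

```python
import numpy as np

def apply_offsets_and_noise(iq, sample_rate_hz, freq_offset_hz, phase_offset_rad, snr_db, rng):
    """Apply a frequency offset, a phase offset and white Gaussian noise to IQ samples.

    Hypothetical helper for illustration; fading is intentionally left out.
    """
    iq = np.asarray(iq, dtype=np.complex128)
    n = np.arange(len(iq))
    # A frequency offset is a linearly growing phase rotation; the phase offset is constant
    rotation = np.exp(1j * (2 * np.pi * freq_offset_hz * n / sample_rate_hz + phase_offset_rad))
    rotated = iq * rotation
    # Scale the noise so that signal power / noise power matches the requested SNR
    signal_power = np.mean(np.abs(rotated) ** 2)
    noise_power = signal_power / 10 ** (snr_db / 10)
    # Complex noise: half of the noise power in each of I and Q
    noise = np.sqrt(noise_power / 2) * (
        rng.standard_normal(len(iq)) + 1j * rng.standard_normal(len(iq))
    )
    return rotated + noise
```

A call such as `apply_offsets_and_noise(iq, 6000.0, 50.0, 0.3, 10.0, np.random.default_rng())` would shift a 6 kHz recording by 50 Hz, rotate it by 0.3 rad and add noise at 10 dB SNR. In a training pipeline, the offset and SNR values would typically be drawn at random for each data sample.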

CSP Blog

Several datasets that cover different modulation types for modulation recognition are available from the Cyclostationary Signal Processing (CSP) Blog. The most prominent one is CSPB.ML.2018R2, which contains 8 different modulation types, mostly PSK and QAM (BPSK, QPSK, 8-PSK, DQPSK, MSK, 16-QAM, 64-QAM, 256-QAM). The data samples are comparatively long, with 32,768 IQ samples each. Augmentations include randomly varying symbol rate, SNR, frequency offset and pulse shaping roll-off.

Example signals of the CSP dataset in time domain and spectrogram
CSP Blog dataset CSPB.ML.2018R2 examples (high SNR)

Torchsig

Torchsig is an interesting dataset that features a comparatively large number of modulation types (53 variants of ASK, PAM, PSK, QAM, FSK and OFDM). The generation software is open source and still maintained, so custom or adapted datasets can be created. In the basic configuration, signals consist of 4096 IQ samples. In addition, Torchsig features wideband scenarios that include multiple signals in one data sample, a task that is more complex than pure modulation recognition.

Example signals of the Torchsig dataset in time domain and spectrogram
Torchsig “Sig53 Impaired Validation” data examples (high SNR)

HisarMod

HisarMod includes 26 modulation classes (variants of FSK, PSK, QAM, PAM and analog modulations). Each data sample consists of 1024 IQ samples. Augmentations are randomly varying SNR and wireless fading channels. Some modulation parameters are fixed, such as the oversampling rate (2) and the roll-off factor of the RC shaping filter (0.35). Unfortunately, inspecting the dataset reveals multiple severe issues, including questionable label assignment, strange waveforms and the absence of modulated data, which raises questions about its usefulness.

Example signals of the HisarMod dataset in time domain and spectrogram
HisarMod data examples showing questionable waveforms and signal content
