Building an EEG Event Classification System

From Raw Brain Signals to Deep Learning Predictions: A Complete Tutorial

📊 TUH EEG Dataset 🧠 Deep Learning 🐍 Python / PyTorch

Hands-on Tutorial: This article summarizes our open-source EEG classification project. All code, notebooks, and trained models are available in the GitHub repository. Clone it, run the notebooks, and experiment with your own data!

01 Introduction to EEG Classification

Electroencephalography (EEG) is a non-invasive method of recording electrical activity in the brain. Clinicians use EEG to diagnose conditions like epilepsy, sleep disorders, and brain injuries by identifying specific patterns—called events—in the signal.

Manually reviewing hours of EEG recordings is time-consuming and requires expert knowledge. This is where machine learning comes in: we can train models to automatically detect and classify these events, assisting clinicians in their diagnostic workflow.

In this guide, we'll build a complete pipeline for EEG event classification using the Temple University Hospital (TUH) EEG Corpus—one of the largest publicly available EEG datasets. By the end, you'll have a working deep learning model that can classify EEG segments into different event types.

The Complete Pipeline

EDF Files → Montage Transform → Windowing → Normalization → Model Training → Predictions

TUH Label Mapping

The TUH EEG Corpus uses a standardized labeling scheme across its datasets. Here's the complete mapping:

TUSZ / TUEV Labels (Events & Seizures)

Code  Label  Description
0     NULL   Unlabeled / padding
1     SPSW   Spike and slow wave
2     GPED   Generalized periodic epileptiform discharge
3     PLED   Periodic lateralized epileptiform discharge
4     EYBL   Eye blink
5     ARTF   Artifact
6     BCKG   Background (normal activity)
7     SEIZ   Seizure (generic)
8     FNSZ   Focal non-specific seizure
9     GNSZ   Generalized non-specific seizure
10    SPSZ   Simple partial seizure
11    CPSZ   Complex partial seizure
12    ABSZ   Absence seizure
13    TNSZ   Tonic seizure
14    CNSZ   Clonic seizure
15    TCSZ   Tonic-clonic seizure
16    ATSZ   Atonic seizure
17    MYSZ   Myoclonic seizure
18    NESZ   Non-epileptic seizure
19    INTR   Interictal
20    SLOW   Slowing
21    EYEM   Eye movement
22    CHEW   Chewing artifact
23    SHIV   Shivering artifact
24    MUSC   Muscle artifact
25    ELPP   Electrode pop
26    ELST   Electrostatic artifact
27    CALB   Calibration
28    HPHS   Hyperventilation
29    TRIP   Photic stimulation

TUAR Labels (Artifact Combinations)

Code  Label       Description
30    ELEC        Electrode artifact
31    EYEM_MUSC   Eye movement + Muscle
32    MUSC_ELEC   Muscle + Electrode
33    EYEM_ELEC   Eye movement + Electrode
34    EYEM_CHEW   Eye movement + Chewing
35    CHEW_MUSC   Chewing + Muscle
36    CHEW_ELEC   Chewing + Electrode
37    EYEM_SHIV   Eye movement + Shivering
38    SHIV_ELEC   Shivering + Electrode

02 Loading EDF Files & Understanding EEG Data

EEG data is typically stored in EDF (European Data Format) files—a standard format for biosignal recordings. Each EDF file contains multiple channels of continuous voltage measurements sampled at a fixed frequency (commonly 250 Hz).

Loading with MNE-Python

The MNE-Python library provides robust tools for EEG analysis. Here's how to load an EDF file:

import mne

file_path = 'recording.edf'
raw = mne.io.read_raw_edf(file_path, preload=False)

# Access metadata
print(f"Channels: {raw.ch_names}")
print(f"Sampling frequency: {raw.info['sfreq']} Hz")
print(f"Duration: {raw.n_times / raw.info['sfreq']:.1f} seconds")

The 10-20 Electrode System

EEG electrodes are placed according to the 10-20 international system—a standardized arrangement ensuring consistent positioning across recordings. Each electrode has a name like FP1, C3, or O2:

The International 10-20 system for electrode placement, showing 21 electrode positions on the scalp
  • Letters indicate brain region (F=frontal, C=central, T=temporal, P=parietal, O=occipital)
  • Odd numbers are on the left hemisphere
  • Even numbers are on the right hemisphere
  • Z indicates the midline (zero)

Montage Transformation

Raw EDF files often store data in referential montage—each electrode measured against a common reference. However, clinical analysis typically uses bipolar montage, where adjacent electrodes are subtracted to highlight local differences.

Why bipolar? Bipolar montage reduces common noise and better localizes abnormalities. A spike appearing between two electrodes will show opposite deflections in adjacent channels.

# Transform to bipolar montage
# Example: FP1-F7 = FP1 electrode - F7 electrode
bipolar_data[0, :] = referential_data[fp1_idx, :] - referential_data[f7_idx, :]

The difference between referential and bipolar montage is clearly visible when visualizing the same recording:

data = raw.get_data()
Raw EEG data in referential montage—each electrode measured against a common reference
Same recording after bipolar montage transformation—adjacent electrodes subtracted to highlight local activity

The TUH dataset uses a 22-channel TCP (Temporal-Central-Parietal) montage organized into chains:

Chain Channels
Left Temporal FP1-F7, F7-T3, T3-T5, T5-O1
Right Temporal FP2-F8, F8-T4, T4-T6, T6-O2
Left Parasagittal FP1-F3, F3-C3, C3-P3, P3-O1
Right Parasagittal FP2-F4, F4-C4, C4-P4, P4-O2
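
To make the chains concrete, here is a minimal sketch of building them with NumPy, in the same spirit as the snippet above. It assumes the referential recording is an (n_electrodes, n_samples) array and that channel names have been reduced to bare electrode labels (TUH files name them like "EEG FP1-REF"); the remaining channels of the 22-channel montage are omitted for brevity.

import numpy as np

# (anode, cathode) pairs for the four chains in the table above (assumed names)
TCP_PAIRS = [
    ("FP1", "F7"), ("F7", "T3"), ("T3", "T5"), ("T5", "O1"),  # left temporal
    ("FP2", "F8"), ("F8", "T4"), ("T4", "T6"), ("T6", "O2"),  # right temporal
    ("FP1", "F3"), ("F3", "C3"), ("C3", "P3"), ("P3", "O1"),  # left parasagittal
    ("FP2", "F4"), ("F4", "C4"), ("C4", "P4"), ("P4", "O2"),  # right parasagittal
]

def to_bipolar(referential_data, ch_names):
    """Subtract electrode pairs to build a bipolar montage.

    referential_data: (n_electrodes, n_samples) array in referential montage
    ch_names: electrode names in the same order as the rows
    """
    index = {name.upper(): i for i, name in enumerate(ch_names)}
    bipolar = np.empty((len(TCP_PAIRS), referential_data.shape[1]), dtype=referential_data.dtype)
    for row, (anode, cathode) in enumerate(TCP_PAIRS):
        bipolar[row] = referential_data[index[anode]] - referential_data[index[cathode]]
    return bipolar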

Loading Event Labels

The TUH dataset provides labels in .rec files (CSV format) alongside each EDF:

18,10.5,11.5,4    # channel, start_time, end_time, event_code
18,2.6,3.6,4
18,358.3,359.3,4
18,359.3,360.3,5

We convert these time-based annotations into a labels array matching the data dimensions—shape (n_channels, n_samples)—where each sample has an associated event code.
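
As an illustration, here is a minimal sketch of that conversion, assuming the .rec rows have already been parsed into (channel, start_sec, end_sec, code) tuples; the function name and rounding are ours, not the repository's.

import numpy as np

def rec_rows_to_labels(rows, n_channels, n_samples, sfreq):
    """Expand (channel, start_sec, end_sec, code) rows into a dense labels array.

    Returns an int array of shape (n_channels, n_samples); 0 (NULL) marks
    samples that carry no annotation.
    """
    labels = np.zeros((n_channels, n_samples), dtype=np.int64)
    for channel, start_sec, end_sec, code in rows:
        start = int(round(start_sec * sfreq))
        end = min(int(round(end_sec * sfreq)), n_samples)
        labels[int(channel), start:end] = int(code)
    return labels

# Rows from the example .rec snippet above, at an assumed 250 Hz sampling rate
rows = [(18, 10.5, 11.5, 4), (18, 2.6, 3.6, 4), (18, 358.3, 359.3, 4), (18, 359.3, 360.3, 5)]
labels = rec_rows_to_labels(rows, n_channels=22, n_samples=361 * 250, sfreq=250)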

Montaged EEG with event annotations—purple regions indicate eye movement (EYEM) artifacts detected across multiple channels

03 Exploring Dataset Statistics

Before training a model, understanding your data distribution is crucial. The TUH EEG Corpus provides two complementary datasets with different annotation schemes. Note that we use only the 22-channel TCP montage described in the previous section across both datasets.

Background vs Unlabeled: These terms are often confused but have distinct meanings:

  • Background (BCKG): An explicit label assigned by expert annotators indicating normal brain activity—the annotator reviewed the signal and confirmed nothing clinically significant is present. Serves as reliable "negative" examples for training.
  • Unlabeled: Segments that haven't been reviewed at all—they could contain normal activity, artifacts, or clinical events, but we simply don't know. Should typically be excluded to avoid contaminating training with mislabeled samples.

TUH EEG Events Distribution

The Events dataset contains 359 EDF files with annotations for six clinical event types. Background activity dominates, while clinically significant patterns like spike-and-wave are rare:

TUH EEG Events dataset distribution (BCKG, GPED, PLED, ARTF, EYBL, SPSW)—background (BCKG) accounts for 46% of labeled data, with periodic discharges (GPED, PLED) making up ~35%

TUH EEG Artifacts Distribution

The Artifacts dataset contains 290 EDF files with fine-grained annotation of non-brain signals. Understanding artifact patterns helps build robust preprocessing pipelines:

TUH EEG Artifacts dataset distribution—muscle artifacts (MUSC) are most common at 33.5%, followed by eye movements (EYEM) at 25.9%

Class Imbalance Alert: Both datasets exhibit severe class imbalance. Rare but clinically important events like SPSW (1.3%) require careful handling during training—techniques like class weighting, oversampling, or focal loss can help.
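
As a concrete example of class weighting, the sketch below builds inverse-frequency weights for PyTorch's cross-entropy loss; the counts are the window counts from Section 04, and the normalization is just one common choice.

import numpy as np
import torch

# Per-class window counts (BCKG, ARTF, GPED, PLED, EYEM, SPSW) from Section 04
counts = np.array([51636, 10118, 8805, 3841, 655, 397], dtype=np.float64)

# Inverse-frequency weights, scaled so the mean weight is 1.0
weights = counts.sum() / (len(counts) * counts)
criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))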

04 Preparing Data for Training

From this point on, we'll focus exclusively on the TUH EEG Events dataset. Our strategy is to split continuous EEG recordings into fixed-size windows for classification.

Windowing Strategy

We slice recordings into fixed-length windows (e.g., 1 second = 250 samples at 250 Hz). Key considerations:

  • Filter unwanted labels: Skip windows containing only background/unlabeled segments
  • Respect discontinuities: Only generate windows within contiguous valid segments
  • Configurable overlap: Overlapping windows increase training data but risk data leakage

import numpy as np

def window_generator(dataset, filter_labels, window_size_sec, overlap):
    """Generate windows from EEG data, avoiding discontinuities."""
    for data, labels, metadata in dataset:
        window_size = int(window_size_sec * metadata['sfreq'])
        step_size = int(window_size * (1 - overlap))

        # Create mask for valid samples (not in filter_labels)
        mask = np.any(~np.isin(labels, filter_labels), axis=0)

        # Find contiguous segments using diff
        padded = np.concatenate([[False], mask, [False]])
        changes = np.diff(padded.astype(int))
        starts = np.where(changes == 1)[0]
        ends = np.where(changes == -1)[0]

        for seg_start, seg_end in zip(starts, ends):
            for win_start in range(seg_start, seg_end - window_size + 1, step_size):
                yield data[:, win_start:win_start + window_size], ...  # window labels/metadata elided here

Normalization

EEG amplitudes vary significantly across recordings, channels, and subjects. We apply z-score normalization per channel to standardize:

def normalize_window(data: np.ndarray) -> np.ndarray:
    """Z-score normalization per channel."""
    eps = 1e-8
    mean = data.mean(axis=1, keepdims=True)
    std = data.std(axis=1, keepdims=True)
    return ((data - mean) / (std + eps)).astype(np.float32)

Label Aggregation

Each window contains a label for every sample. We assign a single label per window only when all samples agree; windows with mixed labels are mapped to 0 and later discarded:

def majority_label_vote(labels: np.ndarray) -> int:
    """Return label if all samples identical, else 0 (mixed)."""
    if labels.min() == labels.max():
        return int(labels.flat[0])
    return 0  # Discard mixed-label windows
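
Putting the three helpers together, a hypothetical driver could look like the sketch below; it assumes window_generator yields (window, window_labels) pairs, which the elided yield above does not spell out, so treat it as illustrative.

def prepare_windows(dataset, filter_labels=(0,), window_size_sec=1.0, overlap=0.0):
    """Yield (normalized window, window label) pairs, dropping mixed-label windows."""
    for window, window_labels in window_generator(dataset, filter_labels, window_size_sec, overlap):
        label = majority_label_vote(window_labels)
        if label == 0:  # mixed labels (or NULL) -> discard
            continue
        yield normalize_window(window), label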

Prepared Dataset Summary

After windowing 359 EDF files with 1-second windows and no overlap:

Class Description Windows Percentage
BCKG Background 51,636 68.4%
ARTF Artifact 10,118 13.4%
GPED Generalized Periodic Discharges 8,805 11.7%
PLED Periodic Lateralized Discharges 3,841 5.1%
EYEM Eye Movement 655 0.9%
SPSW Spike and Wave 397 0.5%
Total 75,453 100%

Preprocessing Tips

Before feeding EEG data into a model, consider these common preprocessing steps:

  • Band-pass filter (0.5–50 Hz): Removes DC drift and high-frequency noise while preserving clinically relevant frequencies
  • Notch filter (50/60 Hz): Eliminates power line interference (50 Hz in Europe, 60 Hz in North America)
  • Amplitude clipping: Cap extreme values (e.g., ±500 ”V) to reduce impact of movement artifacts and electrode pops

In this tutorial, we skip these steps to focus on the ML pipeline, but they can significantly improve model robustness in production.
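
If you do want to apply them, MNE covers the filters and NumPy the clipping; below is a minimal sketch with the cut-offs listed above (note that MNE stores EEG in volts, so ±500 ”V is ±500e-6).

import numpy as np

raw.load_data()                         # filtering requires the signal in memory
raw.filter(l_freq=0.5, h_freq=50.0)     # band-pass 0.5–50 Hz
raw.notch_filter(freqs=50.0)            # power-line interference (use 60.0 in North America)

data = raw.get_data()
data = np.clip(data, -500e-6, 500e-6)   # amplitude clipping at ±500 ”V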

05 Training the Classification Model

With prepared data, we can now train a deep learning classifier. We use ResNet1D—a 1D adaptation of the popular ResNet architecture optimized for time series.

Class Remapping

To simplify our problem and address data scarcity in minority classes, we merge related events:

Original Mapped To Rationale
SPSW (1) Removed Too rare (397 samples)
GPED (2), PLED (3) Class 2: Periodic Similar clinical patterns
EYEM (4), ARTF (5) Class 1: Artifacts Both are non-brain signals
BCKG (6) Class 0: Background Normal activity baseline

This gives us a 3-class problem: Background, Artifacts, and Periodic Discharges.
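
In code, the remapping reduces to a small lookup table; the helper below is a sketch (names are ours), where None marks windows to drop.

# 3-class remapping described above; SPSW (1) windows are dropped entirely.
REMAP = {
    6: 0,        # BCKG -> 0: Background
    4: 1, 5: 1,  # EYEM, ARTF -> 1: Artifacts
    2: 2, 3: 2,  # GPED, PLED -> 2: Periodic discharges
}

def remap_label(label):
    """Return the 3-class label, or None if the window should be discarded."""
    return REMAP.get(int(label))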

Handling Class Imbalance

Even after remapping, background dominates. We apply random downsampling to balance:

# Before balancing:
# Class 0 (Background): 51,636
# Class 1 (Artifacts):  10,773
# Class 2 (Periodic):   12,646

dataset_balanced = downsample_class(dataset, class_to_downsample=0, target_count=13000)

# After balancing:
# Class 0: 13,000
# Class 1: 10,773
# Class 2: 12,646
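
downsample_class is a helper from the project; a minimal way to implement it, assuming the dataset behaves like a sequence of (window, label) pairs, could look like this:

import numpy as np

def downsample_class(dataset, class_to_downsample, target_count, seed=42):
    """Randomly keep target_count samples of one class and all samples of the others."""
    rng = np.random.default_rng(seed)
    majority = [i for i, (_, label) in enumerate(dataset) if label == class_to_downsample]
    others = [i for i, (_, label) in enumerate(dataset) if label != class_to_downsample]
    kept = rng.choice(majority, size=target_count, replace=False).tolist()
    return [dataset[i] for i in sorted(others + kept)]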

Model Architecture

ResNet1D applies the residual learning framework to 1D convolutions:

ResNet1D architecture for EEG classification: Input (1, 250) → Conv1D (k=7, s=2, 32 filters) → BN → ReLU → MaxPool (k=3, s=2) → four residual layer groups (2×BasicBlock1D each, with 32, 64, 128, and 256 filters) → Global Average Pooling → Dropout → FC 256→3. Each BasicBlock1D contains two Conv1D+BN layers with a skip connection (identity or 1×1 conv). Spatial dimension: 250 → 125 → 63 → 32 → 16 → 8 → 4 → 1 (GAP); channel dimension: 1 → 32 → 32 → 64 → 128 → 256; total parameters: ~964K (compact for efficient EEG inference).
ResNet1D architecture—input flows through initial convolution, four residual layer groups with increasing filters, then global average pooling to classification
Parameter Value Description
Input (1, 250) Single channel, 1-second window
Base Filters 32 Scales up in deeper layers
Blocks [2, 2, 2, 2] ResNet-18 style architecture
Dropout 0.2 Regularization
Parameters ~964K Compact for EEG task
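
For reference, here is a minimal PyTorch sketch of the BasicBlock1D shown in the diagram (two Conv1D+BN layers with a skip connection); the repository's implementation may differ in details such as kernel sizes.

import torch
import torch.nn as nn

class BasicBlock1D(nn.Module):
    """Residual block: two Conv1d+BatchNorm layers plus an identity or 1x1-conv skip path."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm1d(out_ch)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm1d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Downsample the skip path when the output shape changes
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm1d(out_ch),
            )
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.skip(x))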

Training Configuration

We use HuggingFace Trainer for a production-ready training loop:

from transformers import TrainingArguments

training_args = TrainingArguments(
    num_train_epochs=80,
    per_device_train_batch_size=128,
    learning_rate=0.0001,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",

    # Evaluation strategy
    eval_strategy="steps",
    eval_steps=200,
    metric_for_best_model="f1_macro",
    load_best_model_at_end=True,

    # Logging
    logging_steps=50,
    report_to=["tensorboard"],
)
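
Wiring this into a run then amounts to handing the arguments, model, datasets, and metrics callback to the Trainer; a sketch, with model, train_dataset, and eval_dataset assumed to be defined as in the project notebooks, and compute_metrics as sketched in the Metrics section below:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()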

Metrics

For imbalanced classification, accuracy alone is misleading—a model predicting only the majority class would achieve 68% accuracy but be clinically useless. We track multiple metrics:

  • Precision: "Of predictions for class X, how many were correct?" TP / (TP + FP). Penalizes false alarms.
  • Recall (Sensitivity): "Of actual class X samples, how many did we find?" TP / (TP + FN). Penalizes missed detections.
  • F1 Score: Harmonic mean of precision and recall, 2·P·R / (P + R). A balanced single metric.

For our EEG classifier, we use F1 Macro as the primary model selection metric because it treats all classes equally—critical when rare events like SPSW (0.5% of data) are clinically important.

Metric What It Measures When to Use
Accuracy % of correct predictions overall Only for balanced datasets
Precision How many positive predictions were correct When false positives are costly
Recall How many actual positives were found When missing cases is costly
F1 Macro Average F1 across classes Balance between precision and recall
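
Below is a minimal sketch of a compute_metrics callback that produces the f1_macro value used for model selection, written with scikit-learn; the repository may report additional per-class metrics.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    """Metrics callback in the format expected by the HuggingFace Trainer."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision_macro": precision_score(labels, preds, average="macro", zero_division=0),
        "recall_macro": recall_score(labels, preds, average="macro", zero_division=0),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }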

Training Results

After training on 29,135 samples with 7,284 validation samples:

  • Model automatically saved with best F1 macro score
  • Training progress visualizable via TensorBoard
  • Checkpoints preserved for model recovery
Evaluation accuracy reaches 85.9% after 18,200 steps
Weighted F1 score stabilizes at 85.8%

Training completed in approximately 6 minutes on GPU, demonstrating the efficiency of the ResNet1D architecture for EEG classification tasks.

More Metrics: Additional training metrics including loss curves, per-class precision/recall, and confusion matrices are available in the GitHub repository. You can visualize them locally by running tensorboard --logdir=runs/ after cloning the project.

06 Conclusion & Next Steps

We've built a complete demo EEG event classification system from scratch:

  1. Data Loading: Parsed EDF files and transformed to bipolar montage
  2. Exploration: Analyzed dataset statistics and identified class imbalance
  3. Preparation: Windowed, normalized, and aggregated labels
  4. Training: Built and trained a ResNet1D classifier

The code and notebooks from this tutorial are available for educational purposes.