A deep learning system for automated music editing. The agent learns to make intelligent editing decisions by training on paired raw/edited audio examples.
This project explores multiple approaches to learning audio editing from examples:
| Approach | Status | Description |
|---|---|---|
| Pointer Network | Current Focus | Learns to reorder/select frames from raw audio |
| Supervised Mel Reconstruction | Implemented | Direct mel spectrogram reconstruction |
| RL with Factored Actions | Implemented | Sequential decision making with PPO |
| Mel-to-Mel Editor | Experimental | Transformer-based mel transformation |
Instead of generating new audio, the pointer network learns to copy and reorder frames from the original recording. This preserves audio quality while learning editing patterns.
```
Raw Mel (T_raw frames)
          │
          ▼
┌───────────────────────┐
│  Multi-Scale Encoder  │  ← Frame/Beat/Bar level encoding
│  + Music-Aware PE     │  ← Beat, bar, phrase structure
│  + Edit Style VAE     │  ← Latent space for edit diversity
└───────────────────────┘
          │
          ▼
┌───────────────────────┐
│  Hierarchical Decoder │  ← Bar → Beat → Frame prediction
│  + Sparse Attention   │  ← O(n) not O(n²)
│  + KV Caching         │  ← Fast inference
└───────────────────────┘
          │
          ▼
Pointer Sequence (T_edit indices into raw)
```
- Preserves quality: Copies frames, never generates (see the sketch below this list)
- Learns patterns: Cuts, loops, reordering from examples
- Hierarchical: Coarse-to-fine (bar → beat → frame) for musical coherence
- Variable length: STOP token handles different output lengths
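
Because an edit is just a sequence of frame indices, applying it is a single gather over the raw mel spectrogram. A minimal sketch in PyTorch (tensor names and shapes are illustrative, not the project's API):

```python
import torch

# Raw recording as a mel spectrogram: (n_mels, T_raw)
raw_mel = torch.randn(128, 4000)

# Pointer sequence predicted by the network, truncated at the STOP token.
# Repeating indices 40..43 three times expresses a loop; skipped indices are cuts.
pointers = torch.tensor([0, 1, 2, 3] + [40, 41, 42, 43] * 3 + [100, 101])

edited_mel = raw_mel[:, pointers]  # pure copy: no frames are synthesized
print(edited_mel.shape)            # torch.Size([128, 18])
```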
To train the pointer network:

```bash
# 1. Generate pointer sequences from paired audio
python -m pointer_network.generate_pointer_sequences

# 2. Train the pointer network
python -m pointer_network.trainers.pointer_trainer \
    --cache-dir cache \
    --pointer-dir training_data/pointer_sequences \
    --save-dir models/pointer_network \
    --epochs 100
```

To set up the environment:

```bash
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Linux/Mac

pip install torch torchaudio librosa soundfile numpy gymnasium tensorboard tqdm
pip install natten   # Neighborhood attention (optional, for RL approach)
pip install demucs   # Stem separation (optional)
```

The vocoder modules (for mel-to-audio conversion) should be cloned separately:
```bash
cd vocoder
git clone https://github.com/NVIDIA/BigVGAN.git
git clone https://github.com/jik876/hifi-gan.git
```

Training data is organized as:

```
training_data/
├── input/            # Raw audio files (*_raw.wav or *_raw.mp3)
├── desired_output/   # Human-edited versions (*_edit.wav)
└── reference/        # Additional finished tracks (optional)
```

Files are matched by name prefix (e.g., song1_raw.wav ↔ song1_edit.wav).
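
A pairing routine can be as simple as a glob plus a prefix check. A sketch of the matching rule, assuming the layout above (the repository's actual dataset code may differ):

```python
from pathlib import Path

def find_pairs(data_dir: str = "training_data") -> list[tuple[Path, Path]]:
    """Pair each input/*_raw.* file with desired_output/<prefix>_edit.wav."""
    root = Path(data_dir)
    pairs = []
    for raw in sorted(root.glob("input/*_raw.*")):
        prefix = raw.stem.removesuffix("_raw")           # e.g. "song1"
        edit = root / "desired_output" / f"{prefix}_edit.wav"
        if edit.exists():
            pairs.append((raw, edit))
    return pairs

print(find_pairs())  # e.g. [(.../song1_raw.wav, .../song1_edit.wav), ...]
```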
Repository layout:

```
editorbot/
├── pointer_network/                   # Pointer-based editing (CURRENT FOCUS)
│   ├── models/
│   │   └── pointer_network.py         # Main model with hierarchical pointers
│   ├── data/
│   │   └── dataset.py                 # PointerDataset, collate_fn
│   ├── trainers/
│   │   └── pointer_trainer.py         # Training loop
│   ├── generate_pointer_sequences.py  # Create training data
│   └── config.py                      # Configuration
│
├── super_editor/                      # Multi-component editor
│   ├── models/                        # Encoder, decoder, edit classifier
│   ├── trainers/                      # Training utilities
│   └── config.py
│
├── mel_to_mel_editor/                 # Direct mel transformation
│   └── models/
│
├── audio_slicer/                      # Audio segmentation utilities
│
├── rl_editor/                         # RL-based editing
│   ├── train.py                       # PPO training loop
│   ├── agent.py                       # Policy/Value networks
│   ├── environment.py                 # Gymnasium environment
│   ├── actions.py                     # Factored action space (20×5×5)
│   ├── config.py                      # Hyperparameters
│   ├── features.py                    # Audio feature extraction
│   ├── supervised_trainer.py          # Supervised reconstruction
│   └── infer.py                       # Inference
│
├── vocoder/                           # Mel-to-audio conversion
│   ├── BigVGAN/                       # (clone externally)
│   └── hifi-gan/                      # (clone externally)
│
├── scripts/                           # Utilities
│   ├── generate_synthetic_pairs.py
│   ├── precache_labels.py
│   ├── precache_stems.py
│   ├── regenerate_cache.py
│   └── train_super_editor.py
│
├── training_data/                     # Audio pairs for training
├── models/                            # Saved checkpoints
├── logs/                              # TensorBoard logs
├── test_audio/                        # Test audio files
├── lr_finder/                         # Learning rate finder outputs
└── CLAUDE.md                          # Development guidelines
```
The supervised approach performs direct mel reconstruction using multi-scale perceptual losses:
```bash
python -m rl_editor.supervised_trainer \
    --data-dir training_data \
    --save-dir models/supervised \
    --epochs 100
```
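
The actual loss terms live in rl_editor/supervised_trainer.py; one common way to make a mel reconstruction loss "multi-scale" is to compare the spectrograms at several time resolutions, roughly like this (a sketch, not the repository's implementation):

```python
import torch
import torch.nn.functional as F

def multi_scale_mel_loss(pred: torch.Tensor, target: torch.Tensor,
                         scales=(1, 2, 4)) -> torch.Tensor:
    """L1 distance between mel spectrograms (batch, n_mels, T),
    averaged over several time resolutions."""
    total = pred.new_zeros(())
    for s in scales:
        p, t = pred, target
        if s > 1:
            # Downsample along time to compare coarser structure
            p = F.avg_pool1d(p, kernel_size=s, stride=s)
            t = F.avg_pool1d(t, kernel_size=s, stride=s)
        total = total + F.l1_loss(p, t)
    return total / len(scales)
```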
The RL approach does sequential decision making with PPO + Behavioral Cloning:

```bash
# Generate BC dataset
python -m scripts.infer_rich_bc_labels --data_dir training_data --out bc_rich.npz
# Train with BC + PPO
python -m rl_editor.train \
    --save-dir models/rl_model \
    --bc-mixed bc_rich.npz \
    --bc-weight 0.3 \
    --subprocess
```

The factored action space uses 3 heads instead of 500 discrete actions (20 × 5 × 5); a sketch follows the list:
- Type (20): KEEP, CUT, LOOP, FADE, GAIN, PITCH, SPEED...
- Size (5): BEAT, BAR, TWO_BARS, PHRASE, TWO_PHRASES
- Amount (5): NEG_LARGE, NEG_SMALL, NEUTRAL, POS_SMALL, POS_LARGE
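
Concretely, factoring means the policy emits three small sets of logits instead of one 500-way softmax. A minimal sketch of such a head (layer names and sizes are illustrative; see rl_editor/actions.py for the real definition):

```python
import torch
import torch.nn as nn

class FactoredActionHead(nn.Module):
    """Three independent classification heads over a shared state embedding."""

    def __init__(self, d_model: int = 256,
                 n_types: int = 20, n_sizes: int = 5, n_amounts: int = 5):
        super().__init__()
        self.type_head = nn.Linear(d_model, n_types)
        self.size_head = nn.Linear(d_model, n_sizes)
        self.amount_head = nn.Linear(d_model, n_amounts)

    def forward(self, state: torch.Tensor):
        # state: (batch, d_model) -> three independent logit vectors,
        # 20 + 5 + 5 = 30 outputs instead of a single 500-way softmax
        return (self.type_head(state),
                self.size_head(state),
                self.amount_head(state))
```

A joint action is then sampled by sampling each head independently, and its log-probability is the sum of the three per-head log-probabilities.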
| Parameter | Default | Description |
|---|---|---|
| n_mels | 128 | Mel spectrogram bins |
| d_model | 256 | Transformer hidden dimension |
| n_heads | 8 | Attention heads |
| n_encoder_layers | 4 | Encoder layers |
| n_decoder_layers | 4 | Decoder layers |
| frames_per_beat | 43 | ~0.5s per beat at 22050Hz, hop=256 (120 BPM) |
| frames_per_bar | 172 | 4 beats per bar |
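
The two frame counts follow directly from the audio parameters below, assuming 4/4 time at 120 BPM (a quick check):

```python
sr, hop = 22050, 256
frame_s = hop / sr                         # ~0.0116 s per mel frame
beat_s = 60 / 120                          # 0.5 s per beat at 120 BPM
frames_per_beat = round(beat_s / frame_s)
print(frames_per_beat)                     # 43
print(4 * frames_per_beat)                 # 172 (frames_per_bar)
```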
| Parameter | Value | Description |
|---|---|---|
| Sample rate | 22050 Hz | Standard for music |
| n_mels | 128 | Mel frequency bins |
| n_fft | 2048 | FFT window size |
| hop_length | 256 | ~11.6ms per frame |
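
For reference, extracting a mel spectrogram with exactly these parameters via librosa looks like this (the file path is illustrative; the repo's own feature pipeline may normalize differently):

```python
import librosa
import numpy as np

y, sr = librosa.load("training_data/input/song1_raw.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=256, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (128, T): one frame every 256/22050 ≈ 11.6 ms
```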
```bash
tensorboard --logdir logs
```

Logged metrics:
- pointer_loss: Frame-level pointer loss
- bar_pointer_loss: Bar-level prediction loss
- beat_pointer_loss: Beat-level prediction loss
- val_accuracy: Validation pointer accuracy
- Python 3.10+
- PyTorch 2.0+
- CUDA recommended
- 8GB+ VRAM for training
- ~16GB RAM for RL with parallel environments
See CLAUDE.md for detailed development guidelines.
```bash
# Run tests
pytest rl_editor/tests/
# Check that the pointer network imports
python -c "from pointer_network import PointerNetwork; print('OK')"
```