A Python tool that collapses identical sequences in FASTA files into unique sequences and restores them to their original form.
- Collapses identical DNA/RNA sequences into unique representatives
- Supports field-based grouping for sequence names
- Deterministic output (sequences with the same abundance are alphabetically sorted)
- Tracks abundance and original sequence names
- Reversible process through uncollapsing
Clone this repository:
git clone https://github.com/MullinsLab/sequence_collapsing.git
cd sequence_collapsing
No additional dependencies required - uses Python standard library only.
Collapse all identical sequences in a FASTA file:
python collapse_sequences.py input.fasta
Output files:
- input_collapsed.fasta - Collapsed unique sequences
- input_collapsed_name.txt - Mapping file (tab-delimited)
Collapsed sequence naming format: {id}_{uniqueNumber}_{abundance}
Example: sample1_1_5 means the first unique sequence with 5 copies
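The naming format above can be taken apart again when a downstream script needs the abundance. The following is a minimal sketch, not part of the tool itself: `parse_collapsed_name` is a hypothetical helper, and it assumes the last two underscore-separated fields are always the unique number and the abundance.

```python
def parse_collapsed_name(name):
    """Split '{id}_{uniqueNumber}_{abundance}' into its three parts.

    The id may itself contain underscores, so the split is done from
    the right and limited to the last two separators.
    """
    seq_id, unique_number, abundance = name.rsplit("_", 2)
    return seq_id, int(unique_number), int(abundance)


print(parse_collapsed_name("sample1_1_5"))  # ('sample1', 1, 5)
```

Splitting from the right keeps ids such as sample_timepoint_region intact.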
Collapse sequences grouped by specific fields in the sequence name:
python collapse_sequences.py input.fasta -i 1,2,3
This groups sequences by fields 1, 2, and 3 of underscore-separated sequence names.
Example:
If your sequence names follow the pattern: projectID_sampleID_timepoint_region_other
- -i 1 - Collapse by project only
- -i 1,2 - Collapse by project and sample
- -i 1,2,3 - Collapse by project, sample, and timepoint
Output files:
- input_collapsed_by_field123.fasta
- input_collapsed_by_field123_name.txt
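To illustrate how the field indexes select a grouping key, here is a small sketch. It is an assumption about the behavior, not the script's actual code: `group_key` is a hypothetical helper, fields are counted from 1 as in the -i examples, and joining the selected fields with underscores is a choice made for illustration.

```python
def group_key(name, field_indexes):
    """Build a grouping key from 1-based, underscore-separated fields."""
    fields = name.split("_")
    # Convert each 1-based index to Python's 0-based indexing.
    return "_".join(fields[i - 1] for i in field_indexes)


name = "projectID_sampleID_timepoint_region_other"
print(group_key(name, [1]))        # projectID
print(group_key(name, [1, 2]))     # projectID_sampleID
print(group_key(name, [1, 2, 3]))  # projectID_sampleID_timepoint
```

Sequences that share the same key would then be collapsed together, independently of sequences in other groups.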
Restore collapsed sequences back to their original form:
python uncollapse_sequences.py collapsed.fasta collapsed_name.txt
Output:
- collapsed_uncollapsed.fasta - Restored original sequences
Standard FASTA format with sequence names starting with >:
>sequence1_sample_timepoint_region
ATCGATCGATCG
>sequence2_sample_timepoint_region
ATCGATCGATCG
>sequence3_sample_timepoint_region
GCTAGCTAGCTA
Unique sequences with abundance information:
>sample_timepoint_region_1_2
ATCGATCGATCG
>sample_timepoint_region_2_1
GCTAGCTAGCTA
Tab-delimited file mapping collapsed names to original names:
sample_timepoint_region_1_2 sequence1_sample_timepoint_region,sequence2_sample_timepoint_region
sample_timepoint_region_2_1 sequence3_sample_timepoint_region
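The mapping file carries enough information to reverse the collapse. The sketch below shows one way this could work and is not the code of uncollapse_sequences.py: `read_fasta` and `uncollapse` are hypothetical helpers, and it assumes a tab separates the collapsed name from a comma-separated list of original names.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def uncollapse(fasta_lines, mapping_lines):
    """Expand each collapsed record into one record per original name."""
    originals = {}
    for line in mapping_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed layout: collapsed name, a tab, then comma-separated names.
        collapsed_name, names = line.split("\t")
        originals[collapsed_name] = names.split(",")
    return [
        (original, seq)
        for collapsed_name, seq in read_fasta(fasta_lines)
        for original in originals[collapsed_name]
    ]


collapsed = [">sample_timepoint_region_1_2", "ATCGATCGATCG",
             ">sample_timepoint_region_2_1", "GCTAGCTAGCTA"]
mapping = [
    "sample_timepoint_region_1_2\tsequence1_sample_timepoint_region,sequence2_sample_timepoint_region",
    "sample_timepoint_region_2_1\tsequence3_sample_timepoint_region",
]
for original_name, seq in uncollapse(collapsed, mapping):
    print(f">{original_name}\n{seq}")
```

With the example files shown above, this prints the three original records.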
The script ensures reproducible results by sorting sequences with the same abundance alphabetically. Running the same input multiple times will always produce identical output.
Within each group, sequences are ordered by abundance (most common first), with ties broken alphabetically.
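The ordering rule described above can be sketched as a sort on two keys. This is an illustration under stated assumptions, not the script's implementation: `collapse` is a hypothetical helper, and it assumes the alphabetical tie-break is applied to the sequence string itself.

```python
from collections import defaultdict


def collapse(records):
    """Group identical sequences and order the groups deterministically.

    records is an iterable of (name, sequence) pairs. Returns a list of
    (sequence, names) pairs, most abundant first.
    """
    names_by_seq = defaultdict(list)
    for name, seq in records:
        names_by_seq[seq].append(name)
    # Sort by descending abundance, then alphabetically by sequence
    # (assumed tie-break) so the order does not depend on input order.
    return sorted(names_by_seq.items(), key=lambda item: (-len(item[1]), item[0]))


records = [("a", "GG"), ("b", "AA"), ("c", "CC"), ("d", "CC")]
for seq, names in collapse(records):
    print(seq, len(names), ",".join(names))
```

Because the sort key depends only on abundance and sequence content, reordering the input records leaves the order of the unique sequences unchanged.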
The field index option allows flexible grouping strategies based on your sequence naming convention, making it suitable for analyzing sequences across different samples, timepoints, or experimental conditions.
- Python 3.x
- Input FASTA files must have a .fasta extension
For questions, bug reports, or suggestions, please email cohnlabsupport@fredhutch.org with a brief description of the issue and your contact information.