Bioinformatics file formats
Sequence formats
FASTA
Stores nucleotide or amino acid sequences. Each record has a header line starting with > followed
by one or more lines of sequence.
- Encoding: Plain text
- Schema: No formal schema; convention-based header format
- Validation: No standard validator; parsers are permissive
- Reference: NCBI FASTA description
>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYRFASTQ
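The record layout described above (a `>` header followed by one or more sequence lines) can be read with a few lines of Python. This is a minimal sketch, not a reference parser: the function name `read_fasta` is hypothetical, and the sketch assumes that blank lines can be skipped and that wrapped sequence lines are simply concatenated.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence may be wrapped over several lines; join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Any iterable of lines works as input, including a file object opened in text mode.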
FASTQ
Stores sequences with per-base quality scores. It is the standard output of high-throughput sequencers.
- Encoding: Plain text
- Schema: Four-line record structure (header, sequence, separator, quality)
- Validation: No formal schema; tools like FastQC assess read quality
- Reference: NCBI FASTQ description
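The four-line record structure listed above can be sketched as a small reader. The function name `read_fastq` is hypothetical. The sketch assumes exactly four lines per record and Phred+33 quality encoding (the Sanger convention used by current Illumina output); older files with a different offset would need another `offset` value.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(
    lines: Iterable[str], offset: int = 33
) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (header, sequence, phred_scores) from four-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        sep = next(it, None)
        qual = next(it, None)
        if seq is None or sep is None or qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        # Each quality character encodes one Phred score as chr(score + offset).
        yield header[1:], seq, [ord(ch) - offset for ch in qual]
```

The length check matters because the quality string must have one character per base.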
@SEQ_ID instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
GenBank
NCBI's annotated sequence format. Contains sequence data, features, references, and metadata.
- Encoding: Plain text
- Schema: Defined by INSDC Feature Table specification
- Validation: Validated on submission to NCBI
- Reference: NCBI GenBank format
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds.
ACCESSION U49845
VERSION U49845.1 GI:1293613
KEYWORDS .
SOURCE Saccharomyces cerevisiae (baker's yeast)
ORGANISM Saccharomyces cerevisiae
Eukaryota; Fungi; Ascomycota; Saccharomycetes.
FEATURES Location/Qualifiers
source 1..5028
/organism="Saccharomyces cerevisiae"
/db_xref="taxon:4932"
CDS complement(join(687..3158,3300..4037))
/gene="TCP1-beta"
/codon_start=1
/product="T-complex protein 1 subunit beta"
/protein_id="AAA98665.1"
ORIGIN
1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac
//
Alignment formats
SAM (Sequence Alignment/Map)
Tab-delimited format for storing read alignments against a reference genome.
- Encoding: Plain text
- Schema: Formally specified by the SAM/BAM specification (hts-specs)
- Validation: samtools quickcheck, picard ValidateSamFile
- Reference: hts-specs SAM/BAM
@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:248956422
@RG ID:sample1 SM:sample1 PL:ILLUMINA
r001 99 chr1 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 chr1 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
Fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL, followed by optional tags.
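To illustrate how the eleven mandatory fields fit together, the sketch below splits one alignment line and derives the reference span from the CIGAR string. The names `parse_sam_line` and `reference_span` are hypothetical, and the code assumes real tab-separated input (whitespace in the sample above appears collapsed).

```python
import re
from typing import Any, Dict

SAM_COLUMNS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one tab-delimited alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    record: Dict[str, Any] = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = fields[11:]  # optional TAG:TYPE:VALUE entries, left raw
    return record


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by an alignment."""
    if cigar == "*":
        return 0
    # Only M, D, N, = and X consume reference positions.
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")
```

For read r001 above, the CIGAR `8M2I4M1D3M` spans 16 reference bases, so an alignment starting at POS 7 ends at position 22. FLAG is a bit field; for example, `record["FLAG"] & 0x10` tests whether the read maps to the reverse strand.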
BAM
Binary, BGZF-compressed equivalent of SAM. Supports random access via index files (.bai).
- Encoding: Binary (BGZF-compressed)
- Schema: Same as SAM specification
- Validation: samtools quickcheck, picard ValidateSamFile
- Reference: hts-specs SAM/BAM
CRAM
Highly compressed alignment format that stores differences relative to a reference genome.
- Encoding: Binary (reference-based compression)
- Schema: Defined by the CRAM specification (hts-specs)
- Validation: samtools quickcheck
- Reference: hts-specs CRAM
Variant formats
VCF (Variant Call Format)
Stores genomic variants (SNPs, indels, structural variants) with genotype information across samples.
- Encoding: Plain text (often BGZF-compressed as .vcf.gz)
- Schema: Formally specified by the VCF specification (hts-specs); header defines INFO, FORMAT, and FILTER fields
- Validation: bcftools +check, gatk ValidateVariants
- Reference: hts-specs VCF
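Because the header declares INFO and FORMAT keys, a data line can be unpacked into dictionaries. The sketch below is illustrative only: `parse_vcf_line` is a hypothetical name, values are kept as strings rather than converted to the types declared in the header, and input is assumed to be tab-separated.

```python
from typing import Any, Dict, List


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Parse one VCF data line; sample_names come from the #CHROM header line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    info: Dict[str, Any] = {}
    if fields[7] != ".":
        for entry in fields[7].split(";"):
            key, sep, value = entry.partition("=")
            # Flag-type INFO keys have no '=value' part.
            info[key] = value if sep else True
    # FORMAT lists the per-sample keys; each sample column follows that order.
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    samples = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "CHROM": fields[0],
        "POS": int(fields[1]),
        "ID": fields[2],
        "REF": fields[3],
        "ALT": fields[4].split(","),
        "QUAL": fields[5],
        "FILTER": fields[6].split(";"),
        "INFO": info,
        "samples": samples,
    }
```

The FORMAT column is what ties each colon-separated sample value to its key, so a line that omits it cannot be interpreted per sample.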
##fileformat=VCFv4.3
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr1 10177 rs367896724 A AC 100 PASS DP=1000 GT:GQ 0/1:50
chr1 10352 rs555500075 T TA 100 PASS DP=950 GT:GQ 1/1:40
BCF
Binary equivalent of VCF. Supports random access via index files (.csi).
- Encoding: Binary (BGZF-compressed)
- Schema: Same as VCF specification
- Validation: bcftools +check
- Reference: hts-specs VCF/BCF
MAF (Mutation Annotation Format)
Tab-delimited format used by TCGA/GDC to describe somatic mutations.
- Encoding: Plain text
- Schema: Defined by the GDC MAF specification
- Validation: vcf2maf validates during conversion; GDC validates on submission
- Reference: GDC MAF format
Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2 Variant_Classification Variant_Type Tumor_Sample_Barcode
TP53 chr17 7577548 7577548 C T Missense_Mutation SNP TCGA-AB-2901-03
BRAF chr7 140753336 140753336 A T Missense_Mutation SNP TCGA-AB-2901-03
Annotation formats
GFF3 (General Feature Format)
Tab-delimited format for describing genomic features (genes, exons, UTRs, etc.).
- Encoding: Plain text
- Schema: Formally specified by the Sequence Ontology GFF3 specification
- Validation: genometools gff3validator, GFF3toolkit
- Reference: GFF3 specification
##gff-version 3
chr1 HAVANA gene 11869 14409 . + . ID=ENSG00000223972;Name=DDX11L1;biotype=transcribed_unprocessed_pseudogene
chr1 HAVANA transcript 11869 14409 . + . ID=ENST00000456328;Parent=ENSG00000223972;biotype=lncRNA
chr1 HAVANA exon 11869 12227 . + . ID=exon001;Parent=ENST00000456328
Fields: seqid, source, type, start, end, score, strand, phase, attributes.
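The ninth column packs `key=value` pairs separated by semicolons, which is where most parsing effort goes. The sketch below (hypothetical name `parse_gff3_line`) assumes tab-separated input and follows the GFF3 conventions that values are percent-encoded and that a comma separates multiple values for one key.

```python
from typing import Any, Dict, List
from urllib.parse import unquote


def parse_gff3_line(line: str) -> Dict[str, Any]:
    """Parse one GFF3 feature line into a dict with decoded attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    attributes: Dict[str, List[str]] = {}
    for pair in cols[8].split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        # Split on commas first, then decode, so encoded commas survive.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": cols[0],
        "source": cols[1],
        "type": cols[2],
        "start": int(cols[3]),  # 1-based, inclusive
        "end": int(cols[4]),
        "score": cols[5],
        "strand": cols[6],
        "phase": cols[7],
        "attributes": attributes,
    }
```

Following `Parent` values back to matching `ID` values reconstructs the gene, transcript, exon hierarchy shown in the sample.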
GTF (Gene Transfer Format)
GFF2-derived format widely used for gene annotations (e.g. GENCODE, Ensembl).
- Encoding: Plain text
- Schema: Subset of GFF2 with required gene_id and transcript_id attributes
- Validation: No dedicated validator; parsers enforce attribute requirements
- Reference: Ensembl GTF format
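GTF attributes use a `key "value";` syntax rather than the `key=value` pairs of GFF3. A regular expression is one way to pull them apart; the sketch below is an assumption-laden illustration (hypothetical name `parse_gtf_attributes`) that accepts both quoted and bare values and keeps a list per key, since some annotation sets repeat a key within one line.

```python
import re
from typing import Dict, List

# key, then either a double-quoted value or a bare token, then ';'
GTF_ATTRIBUTE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;')


def parse_gtf_attributes(column: str) -> Dict[str, List[str]]:
    """Parse the ninth GTF column into a mapping of key to list of values."""
    attributes: Dict[str, List[str]] = {}
    for key, quoted, bare in GTF_ATTRIBUTE.findall(column):
        attributes.setdefault(key, []).append(quoted or bare)
    return attributes
```

A caller that wants to enforce the required attributes can check that `gene_id` is present in the returned mapping.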
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_type "transcribed_unprocessed_pseudogene";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; transcript_type "lncRNA";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number 1;
BED (Browser Extensible Data)
Tab-delimited format for genomic intervals. Supports 3 to 12 columns.
- Encoding: Plain text
- Schema: Defined by the UCSC BED specification; columns are positional
- Validation: bedtools performs implicit validation; UCSC bedToBigBed validates strictly
- Reference: UCSC BED format
chr1 11868 14409 DDX11L1 0 + 11868 14409 0,128,0 3 354,109,1189 0,744,1352
chr1 14403 29570 WASH7P 0 - 14403 29570 0,0,128 11 ...
Fields (BED12): chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts.
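The last three BED12 columns describe blocks (typically exons) relative to chromStart. A short sketch of turning them into absolute intervals follows; the name `bed12_blocks` is hypothetical and the input is assumed to be tab-separated. BED coordinates are 0-based and half-open, so each interval is `[start, end)`.

```python
from typing import List, Tuple


def bed12_blocks(line: str) -> List[Tuple[str, int, int]]:
    """Return absolute (chrom, start, end) intervals for each BED12 block."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("expected 12 tab-separated columns")
    chrom = cols[0]
    chrom_start = int(cols[1])
    block_count = int(cols[9])
    # The comma-separated lists often carry a trailing comma; drop empty items.
    sizes = [int(s) for s in cols[10].split(",") if s]
    starts = [int(s) for s in cols[11].split(",") if s]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    # blockStarts are offsets from chromStart, not absolute positions.
    return [
        (chrom, chrom_start + offset, chrom_start + offset + size)
        for offset, size in zip(starts, sizes)
    ]
```

For the DDX11L1 line above, this gives three blocks: 11868 to 12222, 12612 to 12721, and 13220 to 14409. The last block ends exactly at chromEnd, as it should.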
WIG / bigWig
Continuous signal tracks (e.g. coverage, conservation scores) over genomic coordinates.
- Encoding: WIG is plain text; bigWig is binary (indexed)
- Schema: Defined by the UCSC WIG specification
- Validation: UCSC wigToBigWig validates during conversion
- Reference: UCSC WIG format, UCSC bigWig format
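WIG comes in two layouts, variableStep (each data line carries its own position) and fixedStep (positions are implied by `start` and `step`). The sketch below expands either layout into explicit intervals. The name `iter_wig` is hypothetical, `track` and `browser` lines are simply skipped, and the output is converted from WIG's 1-based positions to 0-based half-open intervals as an illustrative choice.

```python
from typing import Iterable, Iterator, Optional, Tuple


def iter_wig(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, float]]:
    """Yield (chrom, start0, end0, value) from variableStep/fixedStep WIG lines."""
    mode: Optional[str] = None
    chrom = ""
    span = 1
    step = 0
    position = 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith(("variableStep", "fixedStep")):
            # Declaration line: keyword followed by key=value parameters.
            parts = line.split()
            mode = parts[0]
            params = dict(p.split("=", 1) for p in parts[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params["step"])
            continue
        if mode == "variableStep":
            start_text, value_text = line.split()
            start, value = int(start_text), float(value_text)
        elif mode == "fixedStep":
            start, value = position, float(line)
            position += step
        else:
            raise ValueError("data line before any step declaration")
        yield chrom, start - 1, start - 1 + span, value
```

Both sample blocks in this section describe the same three 10-base windows, so they expand to identical intervals.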
variableStep chrom=chr1 span=10
10001 1.5
10011 2.3
10021 3.1
fixedStep chrom=chr1 start=10001 step=10 span=10
1.5
2.3
3.1
Structure formats
PDB (Protein Data Bank)
Fixed-width column format for 3D macromolecular structure coordinates.
- Encoding: Plain text
- Schema: Defined by the wwPDB PDB Format specification
- Validation: Validated on wwPDB deposition; pdb-tools for local checks
- Reference: wwPDB PDB format v3.3
HEADER OXYGEN TRANSPORT 07-JAN-84 1HHO
ATOM 1 N VAL A 1 27.340 24.430 2.614 1.00 9.67 N
ATOM 2 CA VAL A 1 26.266 25.413 2.842 1.00 10.38 C
ATOM 3 C VAL A 1 26.913 26.639 3.531 1.00 9.62 C
ATOM 4 O VAL A 1 27.886 26.463 4.263 1.00 9.62 O
ATOM 5 CB VAL A 1 25.112 24.880 3.649 1.00 13.27 C
HETATM 2001 O HOH A 301 30.135 11.748 -1.074 1.00 17.35 O
END
Fields (ATOM records): record type, serial, atom name, residue name, chain ID, residue sequence number, x, y, z, occupancy, temperature factor, element.
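Because the format is fixed-width, ATOM and HETATM records should be read by column position rather than by splitting on whitespace (adjacent fields can touch when values are wide). The sketch below uses the column ranges from the wwPDB v3.3 coordinate section; the name `parse_atom_line` is hypothetical, and the code assumes a properly column-aligned line, which the whitespace-collapsed sample above is not.

```python
from typing import Any, Dict


def parse_atom_line(line: str) -> Dict[str, Any]:
    """Parse one fixed-width ATOM/HETATM record by column position."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "record": record,                     # columns 1-6
        "serial": int(line[6:11]),            # columns 7-11
        "name": line[12:16].strip(),          # columns 13-16
        "res_name": line[17:20].strip(),      # columns 18-20
        "chain_id": line[21],                 # column 22
        "res_seq": int(line[22:26]),          # columns 23-26
        "x": float(line[30:38]),              # columns 31-38
        "y": float(line[38:46]),              # columns 39-46
        "z": float(line[46:54]),              # columns 47-54
        "occupancy": float(line[54:60]),      # columns 55-60
        "temp_factor": float(line[60:66]),    # columns 61-66
        "element": line[76:78].strip(),       # columns 77-78
    }
```

Columns not listed here (alternate location, insertion code, charge) are ignored in this sketch.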
mmCIF / PDBx
Dictionary-driven replacement for PDB format; now the primary deposition format for the PDB archive.
- Encoding: Plain text (CIF syntax)
- Schema: Formally defined by the PDBx/mmCIF dictionary; machine-readable schema
- Validation: Validated on deposition; Maxit, pdbx-validation tools
- Reference: PDBx/mmCIF documentation
data_1HHO
_entry.id 1HHO
_cell.length_a 63.150
_cell.length_b 83.590
_cell.length_c 53.800
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N N VAL A 1 27.340 24.430 2.614
ATOM 2 C CA VAL A 1 26.266 25.413 2.842
Phylogenetic formats
Newick
Parenthetical notation for representing tree topologies with optional branch lengths and labels.
- Encoding: Plain text
- Schema: No formal schema; grammar defined by convention
- Validation: No standard validator; parsers check syntax
- Reference: Newick format description
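The parenthetical grammar lends itself to a small recursive-descent parser. The sketch below is a simplified illustration with hypothetical names (`parse_newick`, `leaf_names`); it assumes unquoted labels and no bracketed comments, both of which real Newick files may contain.

```python
from typing import Any, Dict, List


def parse_newick(text: str) -> Dict[str, Any]:
    """Parse a Newick string into nested dicts: name, length, children."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node() -> Dict[str, Any]:
        nonlocal pos
        children: List[Dict[str, Any]] = []
        if text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # consume ')'
        # Optional label, read up to the next structural character.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected character at position %d" % pos)
    return root


def leaf_names(node: Dict[str, Any]) -> List[str]:
    """Collect leaf labels in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]
```

Internal nodes usually have an empty name, and the root typically has no branch length, so both fields are optional in the returned dicts.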
((Human:0.1,Chimpanzee:0.2):0.05,(Gorilla:0.3,Orangutan:0.4):0.1);
Nexus
Multi-block format that can store sequences, trees, distances, and analysis parameters.
- Encoding: Plain text
- Schema: Block-based structure; loosely defined by Maddison et al. (1997)
- Validation: No standard validator
- Reference: Nexus format (Maddison et al., 1997)
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=4;
TAXLABELS Human Chimpanzee Gorilla Orangutan;
END;
BEGIN TREES;
TREE best = ((Human:0.1,Chimpanzee:0.2):0.05,(Gorilla:0.3,Orangutan:0.4):0.1);
END;
Tabular / matrix formats
AnnData (H5AD)
HDF5-based format used by the scanpy/AnnData ecosystem for single-cell data. Stores an
observation-by-variable matrix with row/column annotations and unstructured metadata.
- Encoding: Binary (HDF5)
- Schema: Defined by the AnnData spec; structured groups within HDF5
- Validation: anndata library validates on read
- Reference: AnnData documentation
MTX (Matrix Market)
Sparse matrix exchange format. Used by Cell Ranger for count matrices.
- Encoding: Plain text
- Schema: Defined by the Matrix Market specification
- Validation: Header line declares matrix properties
- Reference: NIST Matrix Market
%%MatrixMarket matrix coordinate integer general
%
32738 737280 7348543
1 1 3
1 5 1
2 3 7
Fields: rows, columns, non-zero entries (header), then row, column, value per line.
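The layout above (banner, optional `%` comments, a size line, then one entry per line) can be read into a sparse dictionary as follows. This is a minimal sketch for the coordinate layout only; the name `read_mtx` is hypothetical, symmetry qualifiers other than `general` are not expanded, and `pattern` matrices (which have no value column) are not handled. Matrix Market indices are 1-based, so the sketch shifts them to 0-based keys.

```python
from typing import Dict, Iterable, Optional, Tuple


def read_mtx(
    lines: Iterable[str],
) -> Tuple[Tuple[int, int], Dict[Tuple[int, int], float]]:
    """Read a coordinate-format Matrix Market stream into (shape, entries)."""
    it = iter(lines)
    banner = next(it, "").split()
    if [tok.lower() for tok in banner[:3]] != ["%%matrixmarket", "matrix", "coordinate"]:
        raise ValueError("not a coordinate-format Matrix Market file")
    size: Optional[Tuple[int, ...]] = None
    entries: Dict[Tuple[int, int], float] = {}
    for raw in it:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # comment or blank line
        parts = line.split()
        if size is None:
            # First non-comment line: rows, columns, number of stored entries.
            size = tuple(int(p) for p in parts)
            continue
        row, col, value = parts
        entries[(int(row) - 1, int(col) - 1)] = float(value)
    if size is None or len(size) != 3:
        raise ValueError("missing or malformed size line")
    n_rows, n_cols, n_entries = size
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return (n_rows, n_cols), entries
```

For matrices as large as the one in the sample header, a library reader that builds a compressed sparse matrix would be a better fit than a Python dictionary.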