Bioinformatics file formats

Sequence formats

FASTA

Stores nucleotide or amino acid sequences. Each record has a header line starting with > followed by one or more lines of sequence.

Encoding: Plain text
Schema: No formal schema; convention-based header format
Validation: No standard validator; parsers are permissive
Reference: NCBI FASTA description

example.fasta

>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
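
A minimal sketch of reading this layout in Python, assuming only the convention described above (a header line starting with > followed by sequence lines). The function name read_fasta is illustrative and not part of any library; production pipelines would more likely use an established parser.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines, as permissive parsers do
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


example = """>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
"""

records = list(read_fasta(example.splitlines()))
header, sequence = records[0]
# Header fields here follow the db|accession|entry-name convention.
accession = header.split("|")[1]
print(accession, len(sequence))
```

Because the header layout is only a convention, the split on | is specific to this example and would need adjusting for other sources.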

FASTQ

Stores sequences with per-base quality scores, encoded as one ASCII character per base. The standard output of high-throughput sequencers.

Encoding: Plain text
Schema: Four-line record structure (header, sequence, separator, quality)
Validation: No standard format validator; tools like FastQC assess read quality rather than format validity
Reference: NCBI FASTQ description

example.fastq

@SEQ_ID instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
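
The four-line record structure can be sketched in Python as follows. This is an illustration only: it assumes the Phred+33 quality encoding (an assumption, since the encoding is not stated here) and unwrapped records of exactly four lines. The function name read_fastq is hypothetical.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def read_fastq(text):
    """Return (read_id, sequence, quality_scores) tuples from FASTQ text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    records = []
    # Step through in fours: a quality line may itself begin with "@",
    # so records cannot be split on the header prefix alone.
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(c) - PHRED_OFFSET for c in qual]
        records.append((header[1:].split()[0], seq, scores))
    return records


example = """@SEQ_ID instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
"""

read_id, sequence, scores = read_fastq(example)[0]
print(read_id, len(sequence), min(scores), max(scores))
```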

GenBank

NCBI's annotated sequence format. Contains sequence data, features, references, and metadata.

Encoding: Plain text
Schema: Defined by INSDC Feature Table specification
Validation: Validated on submission to NCBI
Reference: NCBI GenBank format

example.gb

LOCUS       SCU49845     5028 bp    DNA     PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycetes.
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
     CDS             complement(join(687..3158,3300..4037))
                     /gene="TCP1-beta"
                     /codon_start=1
                     /product="T-complex protein 1 subunit beta"
                     /protein_id="AAA98665.1"
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac
//

Alignment formats

SAM (Sequence Alignment/Map)

Tab-delimited format for storing read alignments against a reference genome.

Encoding: Plain text
Schema: Formally specified by the SAM/BAM specification (hts-specs)
Validation: samtools quickcheck (header sanity check only), picard ValidateSamFile (full validation)
Reference: hts-specs SAM/BAM

example.sam

@HD	VN:1.6	SO:coordinate
@SQ	SN:chr1	LN:248956422
@RG	ID:sample1	SM:sample1	PL:ILLUMINA
r001	99	chr1	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
r002	0	chr1	9	30	3S6M1P1I4M	*	0	0	AAAAGATAAGGATA	*

Fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL, followed by optional tags.
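
As an illustration of the eleven mandatory fields, the sketch below splits one alignment line, decodes a few FLAG bits, and derives the read length from the CIGAR string. The helper names are hypothetical, and only a subset of FLAG bits is listed; real code would normally use a library built on htslib.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

# A subset of the FLAG bits defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_CONSUMING = set("MIS=X")  # operations that consume bases of SEQ


def parse_sam_line(line):
    """Split an alignment line into named mandatory fields plus tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line has at least 11 columns")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["tags"] = cols[11:]
    return record


def decode_flag(flag):
    """List the names of the FLAG bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def query_length(cigar):
    """Sum the lengths of CIGAR operations that consume the read."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar)
               if op in QUERY_CONSUMING)


line = "\t".join(["r001", "99", "chr1", "7", "30", "8M2I4M1D3M",
                  "=", "37", "39", "TTAGATAAAGGATACTG", "*"])
record = parse_sam_line(line)
flags = decode_flag(record["FLAG"])
print(flags, query_length(record["CIGAR"]), len(record["SEQ"]))
```

The CIGAR-derived length matching the length of SEQ is a quick consistency check on a record.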

BAM

Binary, BGZF-compressed equivalent of SAM. Supports random access via index files (.bai).

Encoding: Binary (BGZF-compressed)
Schema: Same as SAM specification
Validation: samtools quickcheck (header and EOF marker only), picard ValidateSamFile (full validation)
Reference: hts-specs SAM/BAM

CRAM

Highly compressed alignment format that stores differences relative to a reference genome.

Encoding: Binary (reference-based compression)
Schema: Defined by the CRAM specification (hts-specs)
Validation: samtools quickcheck (header and EOF marker only)
Reference: hts-specs CRAM

Variant formats

VCF (Variant Call Format)

Stores genomic variants (SNPs, indels, structural variants) with genotype information across samples.

Encoding: Plain text (often BGZF-compressed as .vcf.gz)
Schema: Formally specified by the VCF specification (hts-specs); header defines INFO, FORMAT, and FILTER fields
Validation: EBI vcf-validator, gatk ValidateVariants
Reference: hts-specs VCF

example.vcf

##fileformat=VCFv4.3
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Sample1
chr1	10177	rs367896724	A	AC	100	PASS	DP=1000	GT:GQ	0/1:50
chr1	10352	rs555500075	T	TA	100	PASS	DP=950	GT:GQ	1/1:40
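
The relationship between the FORMAT column and the per-sample columns can be sketched as below: the FORMAT keys are paired positionally with the colon-separated values of each sample. This is a simplified illustration with hypothetical function names; it ignores the type information declared in the header and keeps every value as a string.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info):
    """Turn 'DP=10;DB' into {'DP': '10', 'DB': True}."""
    out = {}
    if info == ".":
        return out
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True  # keys without '=' are flags
    return out


def parse_vcf_line(line, sample_names):
    """Parse one data line; sample_names come from the #CHROM header line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("a VCF data line has at least 8 columns")
    record = dict(zip(VCF_COLUMNS, cols[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")
    record["INFO"] = parse_info(record["INFO"])
    samples = {}
    if len(cols) > 8:
        keys = cols[8].split(":")
        values = cols[9:]
        if len(values) != len(sample_names):
            raise ValueError("sample column count does not match the header")
        for name, value in zip(sample_names, values):
            samples[name] = dict(zip(keys, value.split(":")))
    record["samples"] = samples
    return record


line = "\t".join(["chr1", "10177", "rs367896724", "A", "AC", "100",
                  "PASS", "DP=1000", "GT:GQ", "0/1:50"])
record = parse_vcf_line(line, ["Sample1"])
print(record["samples"]["Sample1"])
```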

BCF

Binary equivalent of VCF. Supports random access via index files (.csi).

Encoding: Binary (BGZF-compressed)
Schema: Same as VCF specification
Validation: No dedicated validator; convert to VCF with bcftools view and validate as VCF
Reference: hts-specs VCF/BCF

MAF (Mutation Annotation Format)

Tab-delimited format used by TCGA/GDC to describe somatic mutations.

Encoding: Plain text
Schema: Defined by the GDC MAF specification
Validation: vcf2maf validates during conversion; GDC validates on submission
Reference: GDC MAF format

example.maf

Hugo_Symbol	Chromosome	Start_Position	End_Position	Reference_Allele	Tumor_Seq_Allele2	Variant_Classification	Variant_Type	Tumor_Sample_Barcode
TP53	chr17	7577548	7577548	C	T	Missense_Mutation	SNP	TCGA-AB-2901-03
BRAF	chr7	140753336	140753336	A	T	Missense_Mutation	SNP	TCGA-AB-2901-03

Annotation formats

GFF3 (General Feature Format)

Tab-delimited format for describing genomic features (genes, exons, UTRs, etc.).

Encoding: Plain text
Schema: Formally specified by the Sequence Ontology GFF3 specification
Validation: genometools gff3validator, GFF3toolkit
Reference: GFF3 specification

example.gff3

##gff-version 3
chr1	HAVANA	gene	11869	14409	.	+	.	ID=ENSG00000223972;Name=DDX11L1;biotype=transcribed_unprocessed_pseudogene
chr1	HAVANA	transcript	11869	14409	.	+	.	ID=ENST00000456328;Parent=ENSG00000223972;biotype=lncRNA
chr1	HAVANA	exon	11869	12227	.	+	.	ID=exon001;Parent=ENST00000456328

Fields: seqid, source, type, start, end, score, strand, phase, attributes.
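
A short sketch of splitting the nine columns and the attributes column into key/value pairs. The function name is hypothetical; the sketch assumes that attribute values may be comma-separated lists and that reserved characters are percent-encoded, as the GFF3 specification describes.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict with decoded attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 columns")
    record = dict(zip(GFF3_COLUMNS, cols))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        # Values may hold several comma-separated entries (e.g. Parent).
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


line = "\t".join(["chr1", "HAVANA", "exon", "11869", "12227", ".", "+", ".",
                  "ID=exon001;Parent=ENST00000456328"])
feature = parse_gff3_line(line)
print(feature["type"], feature["attributes"]["Parent"])
```

The Parent attribute is what links an exon to its transcript and a transcript to its gene, so resolving it is usually the first step in rebuilding the feature hierarchy.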

GTF (Gene Transfer Format)

GFF2-derived format widely used for gene annotations (e.g. GENCODE, Ensembl).

Encoding: Plain text
Schema: Subset of GFF2 with required gene_id and transcript_id attributes
Validation: No dedicated validator; parsers enforce attribute requirements
Reference: Ensembl GTF format

example.gtf

chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_type "transcribed_unprocessed_pseudogene";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; transcript_type "lncRNA";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number 1;
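
The attribute column differs from GFF3: each attribute is a key, a space, and a value that is usually double-quoted, terminated by a semicolon. A minimal sketch of parsing it follows, with a hypothetical function name. It assumes values contain no embedded double quotes and collects repeated keys into lists.

```python
import re

# key, whitespace, then either a quoted value or a bare token, then ';'
GTF_ATTRIBUTE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def parse_gtf_attributes(field):
    """Return {key: [values]} for the ninth column of a GTF line."""
    attributes = {}
    for match in GTF_ATTRIBUTE.finditer(field):
        key = match.group(1)
        quoted, bare = match.group(2), match.group(3)
        value = quoted if quoted is not None else bare
        attributes.setdefault(key, []).append(value)
    return attributes


field = ('gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
         'exon_number 1;')
attributes = parse_gtf_attributes(field)

# gene_id and transcript_id are the attributes the format requires.
missing = [k for k in ("gene_id", "transcript_id") if k not in attributes]
print(attributes, missing)
```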

BED (Browser Extensible Data)

Tab-delimited format for genomic intervals. Supports 3 to 12 columns; coordinates are zero-based and half-open (chromStart is included, chromEnd is not).

Encoding: Plain text
Schema: Defined by the UCSC BED specification; columns are positional
Validation: bedtools performs implicit validation; UCSC bedToBigBed validates strictly
Reference: UCSC BED format

example.bed

chr1	11868	14409	DDX11L1	0	+	11868	14409	0,128,0	3	354,109,1189	0,744,1352
chr1	14403	29570	WASH7P	0	-	14403	29570	0,0,128	11	...

Fields (BED12): chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts.
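
The block columns are relative to chromStart, which is easy to get wrong. The sketch below expands a BED12 line into one interval per block and converts from BED's zero-based, half-open coordinates to the one-based, inclusive coordinates used by GFF3 and GTF. The function name is hypothetical.

```python
def bed12_blocks(line):
    """Return (chrom, start, end) per block, one-based and inclusive."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("expected a 12-column BED line")
    chrom = cols[0]
    chrom_start = int(cols[1])
    chrom_end = int(cols[2])
    block_count = int(cols[9])
    # Both lists may carry a trailing comma in files from some sources.
    sizes = [int(x) for x in cols[10].rstrip(",").split(",")]
    starts = [int(x) for x in cols[11].rstrip(",").split(",")]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    if chrom_start + starts[-1] + sizes[-1] != chrom_end:
        raise ValueError("last block must end at chromEnd")
    blocks = []
    for size, offset in zip(sizes, starts):
        start0 = chrom_start + offset  # zero-based start of this block
        blocks.append((chrom, start0 + 1, start0 + size))
    return blocks


line = "\t".join(["chr1", "11868", "14409", "DDX11L1", "0", "+",
                  "11868", "14409", "0,128,0", "3",
                  "354,109,1189", "0,744,1352"])
blocks = bed12_blocks(line)
for block in blocks:
    print(*block)
```

Only the start coordinate shifts by one in the conversion; the end coordinate keeps the same numeric value because a half-open end equals an inclusive one-based end.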

WIG / bigWig

Continuous signal tracks (e.g. coverage, conservation scores) over genomic coordinates.

Encoding: WIG is plain text; bigWig is binary (indexed)
Schema: Defined by the UCSC WIG specification
Validation: UCSC wigToBigWig validates during conversion
Reference: UCSC WIG format, UCSC bigWig format

example.wig

variableStep chrom=chr1 span=10
10001	1.5
10011	2.3
10021	3.1

fixedStep chrom=chr1 start=10001 step=10 span=10
1.5
2.3
3.1
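
The two declarations above describe the same three windows in different ways. A minimal sketch of expanding either form into explicit intervals follows; the function name is hypothetical, and positions are treated as one-based, as in the UCSC WIG description.

```python
def parse_wig(lines):
    """Yield (chrom, start, end, value) with one-based inclusive coordinates."""
    mode = None
    chrom = None
    span = 1
    step = None
    position = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if fields[0] in ("variableStep", "fixedStep"):
            mode = fields[0]
            params = dict(f.split("=", 1) for f in fields[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params["step"])
            continue
        if mode == "variableStep":
            start, value = int(fields[0]), float(fields[1])
        elif mode == "fixedStep":
            start, value = position, float(fields[0])
            position += step  # the next value applies one step further on
        else:
            raise ValueError("data line found before any declaration line")
        yield chrom, start, start + span - 1, value


example = """variableStep chrom=chr1 span=10
10001\t1.5
10011\t2.3
10021\t3.1

fixedStep chrom=chr1 start=10001 step=10 span=10
1.5
2.3
3.1
"""

intervals = list(parse_wig(example.splitlines()))
for interval in intervals:
    print(*interval)
```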

Structure formats

PDB (Protein Data Bank)

Fixed-width column format for 3D macromolecular structure coordinates.

Encoding: Plain text
Schema: Defined by the wwPDB PDB Format specification
Validation: Validated on wwPDB deposition; pdb-tools for local checks
Reference: wwPDB PDB format v3.3

example.pdb

HEADER    OXYGEN TRANSPORT                        07-JAN-84   1HHO
ATOM      1  N   VAL A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  VAL A   1      26.266  25.413   2.842  1.00 10.38           C
ATOM      3  C   VAL A   1      26.913  26.639   3.531  1.00  9.62           C
ATOM      4  O   VAL A   1      27.886  26.463   4.263  1.00  9.62           O
ATOM      5  CB  VAL A   1      25.112  24.880   3.649  1.00 13.27           C
HETATM 2001  O   HOH A 301      30.135  11.748  -1.074  1.00 17.35           O
END

Fields (ATOM records): record type, serial, atom name, residue name, chain ID, residue sequence number, x, y, z, occupancy, temperature factor, element.
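
Because the format is fixed-width, fields must be sliced by column rather than split on whitespace (adjacent fields can touch when values are large). The sketch below slices one ATOM record using the column ranges from the wwPDB format description. The function name is hypothetical, and only the fields listed above plus the alternate location indicator are extracted.

```python
def parse_atom_record(line):
    """Slice an ATOM or HETATM record by its fixed column positions."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    line = line.rstrip("\n").ljust(80)  # pad so every slice is in range
    return {
        "record": line[0:6].strip(),        # columns 1-6
        "serial": int(line[6:11]),          # columns 7-11
        "name": line[12:16].strip(),        # columns 13-16
        "alt_loc": line[16].strip(),        # column 17
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain_id": line[21],               # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
        "occupancy": float(line[54:60]),    # columns 55-60
        "temp_factor": float(line[60:66]),  # columns 61-66
        "element": line[76:78].strip(),     # columns 77-78
    }


line = "ATOM      1  N   VAL A   1      27.340  24.430   2.614  1.00  9.67           N"
atom = parse_atom_record(line)
print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```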

mmCIF / PDBx

Dictionary-driven replacement for PDB format; now the primary deposition format for the PDB archive.

Encoding: Plain text (CIF syntax)
Schema: Formally defined by the PDBx/mmCIF dictionary; machine-readable schema
Validation: Validated on deposition; Maxit, pdbx-validation tools
Reference: PDBx/mmCIF documentation

example.cif

data_1HHO
_entry.id   1HHO
_cell.length_a    63.150
_cell.length_b    83.590
_cell.length_c    53.800
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM  1  N  N   VAL  A  1  27.340  24.430  2.614
ATOM  2  C  CA  VAL  A  1  26.266  25.413  2.842
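
In a loop_ block, the item names act as column headers for the whitespace-separated rows that follow. A minimal sketch of reading the first loop in a file is shown below. The function name is hypothetical, and the sketch assumes one row per line with no quoted or multi-line values, which real mmCIF files do contain; a dedicated CIF parser is the safer choice in practice.

```python
def read_first_loop(text):
    """Return the rows of the first loop_ block as a list of dicts."""
    columns = []
    rows = []
    state = "before"  # before -> names -> rows
    for raw in text.splitlines():
        line = raw.strip()
        if state == "before":
            if line == "loop_":
                state = "names"
        elif state == "names" and line.startswith("_"):
            columns.append(line)
        elif not line or line.startswith(("_", "#", "loop_", "data_")):
            break  # the loop's data rows have ended
        else:
            state = "rows"
            values = line.split()
            if len(values) != len(columns):
                raise ValueError("row width does not match the item names")
            rows.append(dict(zip(columns, values)))
    return rows


example = """data_1HHO
_entry.id   1HHO
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM  1  N  N   VAL  A  1  27.340  24.430  2.614
ATOM  2  C  CA  VAL  A  1  26.266  25.413  2.842
"""

rows = read_first_loop(example)
for row in rows:
    print(row["_atom_site.label_atom_id"], row["_atom_site.Cartn_x"])
```

Unlike the fixed-width PDB format, values are looked up by item name, so column order and width carry no meaning.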

Phylogenetic formats

Newick

Parenthetical notation for representing tree topologies with optional branch lengths and labels.

Encoding: Plain text
Schema: No formal schema; grammar defined by convention
Validation: No standard validator; parsers check syntax
Reference: Newick format description

example.nwk

((Human:0.1,Chimpanzee:0.2):0.05,(Gorilla:0.3,Orangutan:0.4):0.1);
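
The nesting can be read with a small recursive-descent parser. The sketch below is illustrative, with hypothetical function names, and assumes the simple dialect shown here: unquoted labels, no bracketed comments, and no whitespace inside the tree.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length, children."""
    s = text.strip()
    if not s.endswith(";"):
        raise ValueError("a Newick tree ends with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while s[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        # Optional label: everything up to the next structural character.
        start = pos
        while s[pos] not in ",():;":
            pos += 1
        name = s[start:pos]
        # Optional branch length after ":".
        length = None
        if s[pos] == ":":
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if s[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Collect leaf labels in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


tree = parse_newick(
    "((Human:0.1,Chimpanzee:0.2):0.05,(Gorilla:0.3,Orangutan:0.4):0.1);")
print(leaf_names(tree))
```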

Nexus

Multi-block format that can store sequences, trees, distances, and analysis parameters.

Encoding: Plain text
Schema: Block-based structure; loosely defined by Maddison et al. (1997)
Validation: No standard validator
Reference: Nexus format (Maddison et al., 1997)

example.nex

#NEXUS
BEGIN TAXA;
  DIMENSIONS NTAX=4;
  TAXLABELS Human Chimpanzee Gorilla Orangutan;
END;
BEGIN TREES;
  TREE best = ((Human:0.1,Chimpanzee:0.2):0.05,(Gorilla:0.3,Orangutan:0.4):0.1);
END;

Tabular / matrix formats

AnnData (H5AD)

HDF5-based format used by the scanpy/AnnData ecosystem for single-cell data. Stores an observation-by-variable matrix with row/column annotations and unstructured metadata.

Encoding: Binary (HDF5)
Schema: Defined by the AnnData spec; structured groups within HDF5
Validation: anndata library validates on read
Reference: AnnData documentation

MTX (Matrix Market)

Sparse matrix exchange format. Used by Cell Ranger for count matrices.

Encoding: Plain text
Schema: Defined by the Matrix Market specification
Validation: No dedicated validator; the header line declares the matrix properties (object, format, field type, symmetry) that parsers rely on
Reference: NIST Matrix Market

example.mtx

%%MatrixMarket matrix coordinate integer general
%
32738 737280 7348543
1 1 3
1 5 1
2 3 7

Fields: the first non-comment line gives the number of rows, columns, and non-zero entries; each following line gives a row index, column index, and value (indices are 1-based).
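
A minimal sketch of reading the coordinate layout into a dictionary keyed by zero-based indices. The function name is hypothetical, and the sketch handles only the "coordinate ... general" case with integer or real values. The small 3 x 5 matrix is made up for illustration so that the declared entry count matches the number of lines.

```python
def read_matrix_market(text):
    """Return (shape, entries) where entries maps (row, col) to a value."""
    lines = text.splitlines()
    banner = lines[0].split()
    if len(banner) != 5 or banner[0] != "%%MatrixMarket":
        raise ValueError("missing %%MatrixMarket header line")
    _, obj, layout, field, symmetry = banner
    if (obj, layout, symmetry) != ("matrix", "coordinate", "general"):
        raise ValueError("this sketch only reads 'matrix coordinate general'")
    if field not in ("integer", "real"):
        raise ValueError("this sketch only reads integer or real values")
    cast = int if field == "integer" else float
    body = [ln for ln in lines[1:] if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())
    if len(body) - 1 != n_entries:
        raise ValueError("entry count does not match the size line")
    entries = {}
    for ln in body[1:]:
        i, j, value = ln.split()
        # The file uses 1-based indices; shift to 0-based for Python.
        entries[(int(i) - 1, int(j) - 1)] = cast(value)
    return (n_rows, n_cols), entries


# Hypothetical 3 x 5 matrix with three non-zero entries.
example = """%%MatrixMarket matrix coordinate integer general
%
3 5 3
1 1 3
1 5 1
2 3 7
"""

shape, entries = read_matrix_market(example)
print(shape, entries)
```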
