
Bioinformatics file formats

Sequence formats

FASTA

Stores nucleotide or amino acid sequences. Each record has a header line starting with > followed by one or more lines of sequence.

  • Encoding: Plain text
  • Schema: No formal schema; convention-based header format
  • Validation: No standard validator; parsers are permissive
  • Reference: NCBI FASTA description
example.fasta
>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
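Because the format is convention-based, reading it takes only a few lines. The following is a minimal sketch, not a reference parser: the function name `parse_fasta` is illustrative, and it assumes the record ID is the first whitespace-delimited word of the header.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID (first word after '>') to its joined sequence."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines, as permissive parsers do
        if line.startswith(">"):
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line  # sequence may span several lines
    return records


example = """>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
"""
records = parse_fasta(example.splitlines())
print({name: len(seq) for name, seq in records.items()})
```

Holding whole sequences in a dictionary is fine for small files; for genome-scale input a streaming generator would be the more sensible design.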

FASTQ

Stores sequences with per-base quality scores. Standard output of high-throughput sequencers.

  • Encoding: Plain text
  • Schema: Four-line record structure (header, sequence, separator, quality)
  • Validation: No standard validator; tools like FastQC assess read quality rather than format conformance
  • Reference: NCBI FASTQ description
example.fastq
@SEQ_ID instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
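The quality line encodes one score per base as a printable character. The sketch below decodes it, assuming the Phred+33 (Sanger) offset used by current Illumina output; older files may use a different offset. The function names are illustrative.

```python
from typing import List


def phred33_scores(quality: str) -> List[int]:
    """Decode a FASTQ quality string, assuming the Phred+33 offset."""
    return [ord(char) - 33 for char in quality]


def error_probability(score: int) -> float:
    """Convert a Phred score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-score / 10)


sequence = "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT"
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"

scores = phred33_scores(quality)
# A well-formed record has exactly one quality character per base.
assert len(scores) == len(sequence)
print(scores[:4], min(scores), max(scores))
```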

GenBank

NCBI's annotated sequence format. Contains sequence data, features, references, and metadata.

example.gb
LOCUS       SCU49845     5028 bp    DNA     PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycetes.
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
     CDS             <1..206
                     /gene="TCP1-beta"
                     /codon_start=3
                     /product="T-complex protein 1 subunit beta"
                     /protein_id="AAA98665.1"
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac
//

Alignment formats

SAM (Sequence Alignment/Map)

Tab-delimited format for storing read alignments against a reference genome.

example.sam
@HD	VN:1.6	SO:coordinate
@SQ	SN:chr1	LN:248956422
@RG	ID:sample1	SM:sample1	PL:ILLUMINA
r001	99	chr1	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
r002	0	chr1	9	30	3S6M1P1I4M	*	0	0	AAAAGATAAGGATA	*

Fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL, followed by optional tags.
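As a rough illustration of the field layout, the sketch below splits an alignment line into the eleven mandatory fields and derives the query and reference lengths implied by the CIGAR string. The helper names `parse_sam_line` and `cigar_lengths` are illustrative, not part of any library; for real work a library such as pysam is the usual choice.

```python
import re
from typing import Any, Dict, Tuple

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one alignment line into named fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    record: Dict[str, Any] = dict(zip(SAM_FIELDS, columns[:11]))
    for name in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]
    return record


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (query_length, reference_length) implied by a CIGAR string."""
    query = reference = 0
    for count, op in CIGAR_OP.findall(cigar):
        if op in "MIS=X":   # operations that consume read bases
            query += int(count)
        if op in "MDN=X":   # operations that consume reference bases
            reference += int(count)
    return query, reference


line = "\t".join(["r001", "99", "chr1", "7", "30", "8M2I4M1D3M",
                  "=", "37", "39", "TTAGATAAAGGATACTG", "*"])
record = parse_sam_line(line)
query_len, ref_len = cigar_lengths(record["CIGAR"])
is_reverse = bool(record["FLAG"] & 0x10)  # bit 0x10: read on reverse strand
print(record["QNAME"], record["POS"], query_len, ref_len, is_reverse)
```

The query length from the CIGAR string should equal the length of SEQ, which makes a cheap consistency check.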

BAM

Binary, BGZF-compressed equivalent of SAM. Supports random access via index files (.bai).

  • Encoding: Binary (BGZF-compressed)
  • Schema: Same as SAM specification
  • Validation: samtools quickcheck, picard ValidateSamFile
  • Reference: hts-specs SAM/BAM

CRAM

Highly compressed alignment format that stores differences relative to a reference genome.

  • Encoding: Binary (reference-based compression)
  • Schema: Defined by the CRAM specification (hts-specs)
  • Validation: samtools quickcheck
  • Reference: hts-specs CRAM

Variant formats

VCF (Variant Call Format)

Stores genomic variants (SNPs, indels, structural variants) with genotype information across samples.

  • Encoding: Plain text (often BGZF-compressed as .vcf.gz)
  • Schema: Formally specified by the VCF specification (hts-specs); header defines INFO, FORMAT, and FILTER fields
  • Validation: bcftools +check, gatk ValidateVariants
  • Reference: hts-specs VCF
example.vcf
##fileformat=VCFv4.3
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Sample1
chr1	10177	rs367896724	A	AC	100	PASS	DP=1000	GT:GQ	0/1:50
chr1	10352	rs555500075	T	TA	100	PASS	DP=950	GT:GQ	1/1:40
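Each data line pairs the keys in the FORMAT column with the colon-separated values in each sample column. The sketch below shows that pairing for a single record; `parse_vcf_record` is an illustrative helper that keeps all values as strings and ignores the type information declared in the header.

```python
from typing import Any, Dict, List


def parse_vcf_record(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Parse one VCF data line into fixed fields, INFO, and per-sample values."""
    cols = line.rstrip("\n").split("\t")
    info: Dict[str, Any] = {}
    for item in cols[7].split(";"):
        if item in ("", "."):
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # keys without '=' are flags
    format_keys = cols[8].split(":") if len(cols) > 8 else []
    samples = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, cols[9:])
    }
    return {
        "CHROM": cols[0], "POS": int(cols[1]), "ID": cols[2],
        "REF": cols[3], "ALT": cols[4].split(","),
        "QUAL": cols[5], "FILTER": cols[6],
        "INFO": info, "SAMPLES": samples,
    }


line = "\t".join(["chr1", "10177", "rs367896724", "A", "AC", "100",
                  "PASS", "DP=1000", "GT:GQ", "0/1:50"])
variant = parse_vcf_record(line, ["Sample1"])
print(variant["POS"], variant["INFO"], variant["SAMPLES"]["Sample1"])
```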

BCF

Binary equivalent of VCF. Supports random access via index files (.csi).

  • Encoding: Binary (BGZF-compressed)
  • Schema: Same as VCF specification
  • Validation: bcftools +check
  • Reference: hts-specs VCF/BCF

MAF (Mutation Annotation Format)

Tab-delimited format used by TCGA/GDC to describe somatic mutations.

example.maf
Hugo_Symbol	Chromosome	Start_Position	End_Position	Reference_Allele	Tumor_Seq_Allele2	Variant_Classification	Variant_Type	Tumor_Sample_Barcode
TP53	chr17	7577548	7577548	C	T	Missense_Mutation	SNP	TCGA-AB-2901-03
BRAF	chr7	140753336	140753336	A	T	Missense_Mutation	SNP	TCGA-AB-2901-03

Annotation formats

GFF3 (General Feature Format)

Tab-delimited format for describing genomic features (genes, exons, UTRs, etc.).

example.gff3
##gff-version 3
chr1	HAVANA	gene	11869	14409	.	+	.	ID=ENSG00000223972;Name=DDX11L1;biotype=transcribed_unprocessed_pseudogene
chr1	HAVANA	transcript	11869	14409	.	+	.	ID=ENST00000456328;Parent=ENSG00000223972;biotype=lncRNA
chr1	HAVANA	exon	11869	12227	.	+	.	ID=exon001;Parent=ENST00000456328

Fields: seqid, source, type, start, end, score, strand, phase, attributes.
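The attributes column holds semicolon-separated tag=value pairs, with commas separating multiple values and reserved characters percent-encoded. A minimal sketch of decoding it follows; the function name is illustrative.

```python
from typing import Dict, List
from urllib.parse import unquote


def parse_gff3_attributes(column9: str) -> Dict[str, List[str]]:
    """Decode the GFF3 attributes column into tag -> list of values."""
    attributes: Dict[str, List[str]] = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Split on commas before decoding so an encoded comma (%2C) survives.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


attrs = parse_gff3_attributes("ID=exon001;Parent=ENST00000456328")
print(attrs)
```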

GTF (Gene Transfer Format)

GFF2-derived format widely used for gene annotations (e.g. GENCODE, Ensembl).

  • Encoding: Plain text
  • Schema: Subset of GFF2 with required gene_id and transcript_id attributes
  • Validation: No dedicated validator; parsers enforce attribute requirements
  • Reference: Ensembl GTF format
example.gtf
chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_type "transcribed_unprocessed_pseudogene";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; transcript_type "lncRNA";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number 1;
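GTF attributes use a different syntax from GFF3: a key, a space, and a value that is usually double-quoted, each terminated by a semicolon. The regular expression below is a sketch for that layout. It keeps only the last value when a key repeats, which may not suit files that repeat keys such as `tag`.

```python
import re
from typing import Dict

# key, whitespace, then either a quoted value or a bare token, then ';'
GTF_ATTRIBUTE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^\s;]+))\s*;')


def parse_gtf_attributes(column9: str) -> Dict[str, str]:
    """Decode the GTF attributes column into a key -> value mapping."""
    attributes: Dict[str, str] = {}
    for key, quoted, bare in GTF_ATTRIBUTE.findall(column9):
        attributes[key] = quoted if quoted else bare
    return attributes


column9 = 'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number 1;'
attrs = parse_gtf_attributes(column9)
print(attrs)
```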

BED (Browser Extensible Data)

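Tab-delimited format for genomic intervals. Supports 3 to 12 columns. Coordinates are zero-based and half-open, so chromStart is one less than the equivalent 1-based GFF/GTF start.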

  • Encoding: Plain text
  • Schema: Defined by the UCSC BED specification; columns are positional
  • Validation: bedtools performs implicit validation; UCSC bedToBigBed validates strictly
  • Reference: UCSC BED format
example.bed
chr1	11868	14409	DDX11L1	0	+	11868	14409	0,128,0	3	359,109,1189	0,744,1352
chr1	14403	29570	WASH7P	0	-	14403	29570	0,0,128	11	...

Fields (BED12): chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts.
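BED coordinates are zero-based and half-open, and blockStarts are offsets relative to chromStart. The sketch below converts an interval to 1-based inclusive coordinates and expands BED12 blocks into absolute intervals. The helper names and the `chrX` record are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from typing import List, Tuple


def bed_to_one_based(chrom_start: int, chrom_end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open BED interval to 1-based inclusive."""
    return chrom_start + 1, chrom_end


def bed12_blocks(line: str) -> List[Tuple[int, int]]:
    """Return absolute 0-based half-open intervals for each BED12 block."""
    fields = line.rstrip("\n").split("\t")
    chrom_start = int(fields[1])
    block_count = int(fields[9])
    # UCSC files often carry a trailing comma in the two list columns.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    return [(chrom_start + offset, chrom_start + offset + size)
            for offset, size in zip(starts, sizes)]


# Hypothetical two-block record spanning chrX:100-200.
line = "\t".join(["chrX", "100", "200", "demo", "0", "+",
                  "100", "200", "0", "2", "10,20", "0,80"])
blocks = bed12_blocks(line)
print(blocks, bed_to_one_based(11868, 14409))
```

In a valid BED12 record the last block ends exactly at chromEnd, which is a useful sanity check when generating files.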

WIG / bigWig

Continuous signal tracks (e.g. coverage, conservation scores) over genomic coordinates.

example.wig
variableStep chrom=chr1 span=10
10001	1.5
10011	2.3
10021	3.1

fixedStep chrom=chr1 start=10001 step=10 span=10
1.5
2.3
3.1
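In fixedStep mode the positions are implicit: each value applies to `span` bases starting at `start`, and the start advances by `step` per line. The sketch below expands such a block into explicit 1-based intervals. It handles only fixedStep declarations, and the function name is illustrative.

```python
from typing import Iterable, Iterator, Tuple


def expand_fixed_step(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, float]]:
    """Yield (chrom, start, end, value) with 1-based inclusive coordinates."""
    chrom, position, step, span = "", 0, 1, 1
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # Declaration line: key=value pairs after the keyword.
            params = dict(token.split("=", 1) for token in line.split()[1:])
            chrom = params["chrom"]
            position = int(params["start"])
            step = int(params.get("step", 1))
            span = int(params.get("span", 1))
        else:
            yield chrom, position, position + span - 1, float(line)
            position += step


wig = ["fixedStep chrom=chr1 start=10001 step=10 span=10", "1.5", "2.3", "3.1"]
intervals = list(expand_fixed_step(wig))
for interval in intervals:
    print(interval)
```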

Structure formats

PDB (Protein Data Bank)

Fixed-width column format for 3D macromolecular structure coordinates.

example.pdb
HEADER    OXYGEN TRANSPORT                        07-JAN-84   1HHO
ATOM      1  N   VAL A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  VAL A   1      26.266  25.413   2.842  1.00 10.38           C
ATOM      3  C   VAL A   1      26.913  26.639   3.531  1.00  9.62           C
ATOM      4  O   VAL A   1      27.886  26.463   4.263  1.00  9.62           O
ATOM      5  CB  VAL A   1      25.112  24.880   3.649  1.00 13.27           C
HETATM 2001  O   HOH A 301      30.135  11.748  -1.074  1.00 17.35           O
END

Fields (ATOM records): record type, serial, atom name, residue name, chain ID, residue sequence number, x, y, z, occupancy, temperature factor, element.
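Because the format is fixed-width, fields are read by column position rather than by splitting on whitespace. The sketch below assembles one ATOM record piece by piece to make the column boundaries visible, then slices it back apart. `parse_atom_record` is an illustrative helper; libraries such as Biopython offer complete parsers.

```python
from typing import Any, Dict

# One ATOM record assembled column by column (1-based column numbers).
ATOM_LINE = (
    "ATOM  "      # 1-6   record type
    "    2"       # 7-11  atom serial number
    " "           # 12    unused
    " CA "        # 13-16 atom name
    " "           # 17    alternate location indicator
    "VAL"         # 18-20 residue name
    " "           # 21    unused
    "A"           # 22    chain ID
    "   1"        # 23-26 residue sequence number
    "    "        # 27-30 insertion code and padding
    "  26.266"    # 31-38 x
    "  25.413"    # 39-46 y
    "   2.842"    # 47-54 z
    "  1.00"      # 55-60 occupancy
    " 10.38"      # 61-66 temperature factor
    + " " * 10    # 67-76 unused here
    + " C"        # 77-78 element symbol
)


def parse_atom_record(line: str) -> Dict[str, Any]:
    """Slice an ATOM/HETATM line by fixed column positions."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_record(ATOM_LINE)
print(atom)
```

Splitting on whitespace instead would fail on files where adjacent fields touch, for example a four-digit residue number next to a chain ID.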

mmCIF / PDBx

Dictionary-driven replacement for PDB format; now the primary deposition format for the PDB archive.

example.cif
data_1HHO
_entry.id   1HHO
_cell.length_a    63.150
_cell.length_b    83.590
_cell.length_c    53.800
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM  1  N  N   VAL  A  1  27.340  24.430  2.614
ATOM  2  C  CA  VAL  A  1  26.266  25.413  2.842

Phylogenetic formats

Newick

Parenthetical notation for representing tree topologies with optional branch lengths and labels.

  • Encoding: Plain text
  • Schema: No formal schema; grammar defined by convention
  • Validation: No standard validator; parsers check syntax
  • Reference: Newick format description
example.nwk
((Human:0.1,Chimpanzee:0.2):0.05,(Gorilla:0.3,Orangutan:0.4):0.1);
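The nesting of parentheses maps directly onto a recursive data structure. The following recursive-descent sketch handles the common subset shown above (unquoted labels, optional branch lengths, no comments) and is not a full Newick implementation; the function names are illustrative.

```python
from typing import Any, Dict, List


def parse_newick(text: str) -> Dict[str, Any]:
    """Parse a simple Newick string into nested dicts."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node() -> Dict[str, Any]:
        nonlocal pos
        children: List[Dict[str, Any]] = []
        if text[pos] == "(":
            pos += 1
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return tree


def leaf_names(node: Dict[str, Any]) -> List[str]:
    """Collect leaf labels in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((Human:0.1,Chimpanzee:0.2):0.05,(Gorilla:0.3,Orangutan:0.4):0.1);")
print(leaf_names(tree))
```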

Nexus

Multi-block format that can store sequences, trees, distances, and analysis parameters.

example.nex
#NEXUS
BEGIN TAXA;
  DIMENSIONS NTAX=4;
  TAXLABELS Human Chimpanzee Gorilla Orangutan;
END;
BEGIN TREES;
  TREE best = ((Human:0.1,Chimpanzee:0.2):0.05,(Gorilla:0.3,Orangutan:0.4):0.1);
END;

Tabular / matrix formats

AnnData (H5AD)

HDF5-based format used by the scanpy/AnnData ecosystem for single-cell data. Stores an observation-by-variable matrix with row/column annotations and unstructured metadata.

  • Encoding: Binary (HDF5)
  • Schema: Defined by the AnnData spec; structured groups within HDF5
  • Validation: anndata library validates on read
  • Reference: AnnData documentation

MTX (Matrix Market)

Sparse matrix exchange format. Used by Cell Ranger for count matrices.

example.mtx
%%MatrixMarket matrix coordinate integer general
%
32738 737280 7348543
1 1 3
1 5 1
2 3 7

Fields: a size line giving the number of rows, columns, and non-zero entries, then one row, column, value triple per line. Row and column indices are 1-based.
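A coordinate-format file can be read into a sparse mapping in a few lines. The sketch below assumes integer values and the `general` symmetry type, converts the 1-based indices to 0-based, and uses a small hypothetical 3 x 5 matrix so that the declared entry count matches the data. The function name is illustrative.

```python
from typing import Dict, Iterable, Optional, Tuple

Shape = Tuple[int, int]
Entries = Dict[Tuple[int, int], int]


def read_mtx_coordinate(lines: Iterable[str]) -> Tuple[Shape, Entries]:
    """Read an integer coordinate Matrix Market file into {(row, col): value}."""
    rows = iter(lines)
    banner = next(rows)
    if not banner.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("not a coordinate-format Matrix Market file")
    shape: Optional[Shape] = None
    expected = 0
    entries: Entries = {}
    for raw in rows:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip comments and blank lines
        fields = line.split()
        if shape is None:
            # First non-comment line: rows, columns, non-zero entry count.
            shape = (int(fields[0]), int(fields[1]))
            expected = int(fields[2])
        else:
            # Convert 1-based file indices to 0-based keys.
            entries[(int(fields[0]) - 1, int(fields[1]) - 1)] = int(fields[2])
    if shape is None or len(entries) != expected:
        raise ValueError("size line missing or entry count mismatch")
    return shape, entries


mtx = [
    "%%MatrixMarket matrix coordinate integer general",
    "%",
    "3 5 3",
    "1 1 3",
    "1 5 1",
    "2 3 7",
]
shape, entries = read_mtx_coordinate(mtx)
print(shape, entries)
```

For real count matrices, `scipy.io.mmread` returns a SciPy sparse matrix directly and is likely the more practical choice.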
