Reference files

What is a reference file

A Genozip reference file is a file with a .ref.genozip file name extension. It is derived from a set of genomic sequences retrieved from one or more FASTA files. These sequences represent the genome(s) of the organism(s) from which the data to-be-compressed originated.

genozip uses the reference file to better compress files that contain genomic sequences, essentially by storing the location in the reference file containing the sequence at hand rather than storing the sequence itself, thereby obtaining a smaller representation of the data – the essence of compression.

A reference file is useful for compressing these file types:

- FASTQ files.

- SAM / BAM / CRAM files.

- VCF files - this is particularly effective for certain types of VCF files.

- Certain types of FASTA files (details).

- A reference file is also needed for converting 23andMe format to VCF format.

While Genozip can always compress without using a reference, if one is available, it is a good idea to use it – the compression will be better, and also faster. In particular, the effect is dramatic on FASTQ files and unmapped SAM / BAM / CRAM files – these, really, should always be compressed using a reference file.

Making a reference file

A reference file is made like this:

$ genozip --make-reference hs37d5.fa.gz

It can also be made from a URL:

$ genozip --make-reference ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

It can be made by combining multiple FASTAs, and it is the user's responsibility to make sure that contig names are unique:

$ cat organismA.fa.gz organismB.fa.gz | genozip --make-reference - --output myref.ref.genozip
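Since contig names must be unique across the combined FASTAs, it can help to check for collisions before making the reference. The following is a minimal sketch, not part of Genozip: dup_contigs is a hypothetical helper that assumes the contig name is the first whitespace-delimited word of each ">" header line, and it accepts plain or gzip-compressed FASTA files.

```shell
# Print contig names that occur more than once across the given FASTA files.
# Assumption: the contig name is the first word after ">" on each header line.
dup_contigs() {
    for f in "$@"; do
        case "$f" in
            *.gz) gzip -dc "$f" ;;   # decompress gzip-compressed FASTA to stdout
            *)    cat "$f" ;;        # plain FASTA
        esac
    done | awk '/^>/ { sub(/^>/, "", $1); print $1 }' | sort | uniq -d
}
```

If dup_contigs organismA.fa.gz organismB.fa.gz prints nothing, no contig name is repeated and the files can be combined as shown above.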

Genozip supports references of up to 1 Tbps (~ 1 trillion bases), but if you plan on using references larger than 10 Gbps or so we advise contacting [email protected] to discuss.

Reference file sizes

When making a reference file you may optionally specify a size parameter, for example:

$ genozip --make-reference=large hs37d5.fa.gz

The available sizes are tiny, small, medium and large (default: medium). The size impacts compression of FASTQ, FASTA and unmapped reads within SAM / BAM / CRAM. A larger size results in better and also faster compression, at the cost of more RAM (shared memory) consumption and a larger reference file size on disk.

A special size is minimal - a minimal reference file cannot be used to compress FASTQ, FASTA and unmapped SAM/BAM/CRAM files, but can be used to compress all other file types, and can be used to uncompress all files.

Using hs37d5 (a common human genome reference file) as an example, these are the sizes for this particular reference file:

size      file size   shared memory (RAM)
minimal   693 MB      0.73 GB
tiny      1.2 GB      1.23 GB
small     1.5 GB      1.73 GB
medium    1.9 GB      2.73 GB
large     2.2 GB      4.73 GB
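To illustrate the RAM trade-off, here is a small sketch that picks the largest size whose shared memory footprint fits a given budget. The thresholds are the hs37d5 figures listed above (taking 1 GB as 1000 MB) and apply only to that reference. pick_ref_size is a hypothetical helper, not a Genozip command, and minimal is left out because it cannot be used for FASTQ, FASTA and unmapped reads.

```shell
# Print the largest --make-reference size that fits a shared memory budget (in MB).
# Thresholds are the hs37d5 shared memory figures; other references will differ.
pick_ref_size() {
    budget_mb=$1
    if   [ "$budget_mb" -ge 4730 ]; then echo large
    elif [ "$budget_mb" -ge 2730 ]; then echo medium
    elif [ "$budget_mb" -ge 1730 ]; then echo small
    elif [ "$budget_mb" -ge 1230 ]; then echo tiny
    else echo none   # not enough RAM for any size usable with FASTQ
    fi
}
```

For example, pick_ref_size 3000 prints medium, which could then be passed as --make-reference=medium.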

Using a reference file

The most common way of using a reference file is in external reference mode, as this achieves the best compression, for example:

$ genozip --reference hs37d5.ref.genozip myfile-R1.fq.gz

In this mode, no data from the reference file is copied into the compressed file. Instead, the reference file is needed again to uncompress the data with genounzip or genocat:

$ genounzip --reference hs37d5.ref.genozip myfile-R1.fq.genozip

The --reference option may be omitted in genounzip and genocat in which case the reference file will be sought in the same location on the filesystem as was used when the file was generated with genozip.

In contrast, in the stored reference mode, the parts of the reference data that are actually referenced by the file being compressed are stored in the compressed file itself, adding about 0.23 bytes per reference base to its size. For example, for a human WGS FASTQ or BAM file, this would increase the compressed file size by about 700 MB.

$ genozip --REFERENCE hs37d5.ref.genozip myfile-R1.fq.gz

The advantage is that the reference file is not needed to uncompress:

$ genounzip myfile-R1.fq.genozip
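The size increase in stored reference mode follows from the per-base cost of about 0.23 bytes. A rough sketch of the arithmetic, assuming about 3.1 billion reference bases are referenced (an illustrative figure for a human genome, not taken from Genozip); stored_ref_overhead_bytes is a hypothetical helper:

```shell
# Estimated bytes added by --REFERENCE, at 0.23 bytes per referenced base.
# Uses integer arithmetic: bases * 23 / 100.
stored_ref_overhead_bytes() {
    echo $(( $1 * 23 / 100 ))
}

# Assumption: roughly 3.1 billion bases are referenced.
stored_ref_overhead_bytes 3100000000   # prints 713000000, i.e. about 700 MB
```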

Note that the --reference and --REFERENCE command line options can also take a FASTA file name as an argument, instead of a .ref.genozip file. This is just a shortcut for convenience: Genozip searches for the corresponding .ref.genozip in the same directory as the FASTA, and if not found, it makes it.

Using $GENOZIP_REFERENCE: this environment variable can be set to the path of a reference file, which genozip, genounzip or genocat will use as an external reference if the --reference option is omitted. For genounzip and genocat (but not genozip), it can also be set to a directory name, in which case the reference file that genozip used to generate the compressed file will be sought in that directory.
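As an illustration of the environment variable, the sketch below sets it once and then omits --reference. The path is hypothetical, and the genozip call is guarded so that the snippet does nothing where genozip or the example input file is absent.

```shell
# Hypothetical location of the reference file; adjust to your system.
export GENOZIP_REFERENCE="$HOME/refs/hs37d5.ref.genozip"

# --reference is omitted: the variable supplies the external reference.
if command -v genozip >/dev/null 2>&1 && [ -f myfile-R1.fq.gz ]; then
    genozip myfile-R1.fq.gz
fi
```

For genounzip and genocat, the variable could instead hold a directory, for example "$HOME/refs".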

In-memory caching of reference files

Since reference files are RAM-hungry and take a few seconds to load, genozip caches them in shared memory. This way, all concurrent genozip processes share the same reference in memory, thereby saving RAM. Also, it saves time, because loading the reference file from disk occurs only the first time a particular reference file is used.

The reference data remains in RAM until it is removed: either explicitly (see below) or automatically: any genozip process that runs also removes cached reference files that have not been used in 24 hours.

To see the references currently cached in memory:

$ genols --cache

shmid owner perm size loaded name
2 divon 666 2797630464 2026-04-01 8:46:39 hs37d5.ref.genozip
3 divon 666 2817628528 2026-04-09 23:39:47 GRCh38.ref.genozip

Note: shared memory permissions (seen in the perm column) are inherited from the reference file, and the owner is the user who ran the genozip or genounzip command that caused the reference file to be cached.

Note: genols --cache peeks into the reference caches, thereby resetting the countdown to their automatic removal after 24 hours of non-use.

To remove them from memory:

$ genozip --no-cache
genozip: Unloading reference cache "hs37d5.ref.genozip"
genozip: Unloading reference cache "GRCh38.ref.genozip"

To remove a single cached reference from memory:

$ genozip --no-cache --reference hs37d5.ref.genozip

The actual removal of these shared memory segments will occur after all processes currently running and using the reference data have completed.

It is possible to instruct Genozip to load a copy of the reference from disk, neither seeking it in the cache, nor storing it there:

$ genozip --reference hs37d5.ref.genozip --no-cache myfile-R1.fq.gz

Note on Docker containers and caching of reference files

To avoid loading the reference file from disk on each execution in a Docker container, and to avoid multiple copies of the reference consuming RAM when several containers run Genozip in parallel, it is advisable to share the shared memory between the Docker containers. Luckily, Docker allows doing just that using docker run --ipc.

One strategy is to have one Docker container hold the reference, with the other containers using it. To load reference data into the cache, compress a tiny dummy file, and do so at least once every 24 hours to avoid the reference being removed:

$ genozip --force --no-test --reference hs37d5.ref.genozip tiny.fq
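The strategy above might be sketched as follows. This is an illustration under stated assumptions, not a tested deployment: the image name, container name and host directory are hypothetical, the image is assumed to have genozip installed, no entrypoint that overrides the command, and a sleep that accepts "infinity". The --ipc=shareable and --ipc=container:NAME values are standard docker run options. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
IMAGE="my-genozip-image"   # hypothetical image with genozip installed
HOLDER="genozip-ref"       # hypothetical name of the container that owns the IPC namespace
REF_DIR="/data/refs"       # hypothetical host directory containing the reference file

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Start a long-lived container whose IPC namespace other containers may join.
start_holder() {
    run docker run -d --name "$HOLDER" --ipc=shareable \
        -v "$REF_DIR:/refs" "$IMAGE" sleep infinity
}

# Run a command in a new container that joins the holder's IPC namespace.
run_shared() {
    run docker run --rm --ipc="container:$HOLDER" \
        -v "$REF_DIR:/refs" "$IMAGE" "$@"
}
```

After start_holder, the reference could be loaded into the cache by running the dummy compression inside the holder (for example with docker exec), and worker jobs could then be launched as run_shared genozip --reference /refs/hs37d5.ref.genozip followed by the file to compress.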

Note on Mac and caching of reference files

Many Mac systems have limits on shared memory size set to values too small for typical genomic reference files. To increase these system limits:

$ sudo sysctl -w kern.sysv.shmmax=3200000000

$ sudo sysctl -w kern.sysv.shmall=781250 # this is (shmmax / 4096)

These values ↑ are sufficient for human genome references (with --make-reference=medium).
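The relationship between the two values can be computed rather than worked out by hand. A minimal sketch, assuming the 4096-byte divisor used in the commands above; shm_pages is a hypothetical helper and rounds up so that the page count always covers the byte count:

```shell
# Number of pages needed to cover a byte count, rounded up.
# The page size defaults to 4096, the divisor used for shmall above.
shm_pages() {
    bytes=$1
    page=${2:-4096}
    echo $(( (bytes + page - 1) / page ))
}

shm_pages 3200000000   # prints 781250
```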

Note on Windows and caching of reference files

Automatic removal of cached reference files after 24 hours of non-use does not work on Windows. There, the reference data remains in memory until it is removed with genozip --no-cache or until Windows is restarted.

Viewing and subsetting the reference data

In addition to their primary use for compressing files, reference files are also useful for analyzing the genome contained within. They can be used to easily view sub-sequences of contigs in certain regions (forward or reverse complemented) using --regions and --regions-file, to find IUPAC non-ACGTN pseudo-bases in the file with --show-ref-iupacs, and to see properties of the contigs with --show-ref-contigs. See more here: Reference file options.

Note on backwards compatibility of reference files

Genozip version 15, the current version, is able to use reference files made by older versions, as old as Genozip version 8. However, Genozip reference file technology has seen a series of major improvements over time, and it is highly recommended to use a reference file made by Genozip 15.0.81 or later, which provides much faster and better compression.

Files compressed with a reference file made by Genozip version 15 can be uncompressed with any other reference file made by Genozip version 15 from the same FASTA file, regardless of the size parameter used in --make-reference (as long as the reference file was not made by a newer version of Genozip than the version trying to use it).

To find out which Genozip version was used to make a particular reference file, use:

$ genocat --stats myref.ref.genozip

Genozip's backward compatibility of reference files notwithstanding, the best practice is to store the reference file used to compress together with the archive of compressed files, and to use the same reference file to uncompress.