fsort - External Sorting Utility

An efficient external sorting utility for large binary files using radix sort and merge sort algorithms.

Author

Written by Huy T. Vo while a PhD student at the University of Utah under the supervision of Prof. Cláudio T. Silva.

Overview

fsort implements external sorting algorithms designed to handle files larger than available memory. The project includes multiple implementation variants optimized for different use cases:

mradix.cpp: Main implementation with configurable block sizes and multi-way merge
radix-sort.c: Memory-mapped file version with 64-bit file offset support
mradix-sort.c: Memory-mapped version with 32-bit file offsets
miradix-sort.c: Chunked memory-mapped version for processing in configurable buffers

Project Structure

fsort/
├── src/              # Source files
│   ├── mradix.cpp    # Main fsort implementation
│   ├── radix-sort.c  # Memory-mapped variants
│   ├── gentest.c     # Test data generator
│   └── dump.cpp      # Verification utility
├── scripts/          # Utility scripts
│   └── benchmark.sh  # Performance benchmark
├── build/            # Build directory (created by CMake)
├── CMakeLists.txt    # CMake build configuration
└── README.md         # This file

Building

Using CMake (recommended)

Create a build directory and compile:

mkdir build
cd build
cmake ..
make

Build with debug symbols:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Debug ..
make

Build in release mode with optimizations:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

All executables will be in the build/ directory.

Using the legacy Makefile

The original Makefile is still available for direct builds:

make            # Build main executable
make gdb        # Build with debug symbols
make gprof      # Build with profiling support
make gentest    # Build test data generator
make dump       # Build verification utility

Usage

Basic sorting (from the build directory):

./fsort -fi input.bin -fo output.bin

Or if using legacy Makefile from the project root:

./fsort -fi input.bin -fo output.bin

Command-line Options

Option	Description	Default
`-fi <file>`	Input file	Required
`-fo <file>`	Output file	stdout
`-bsize <int>`	Block size for in-core sorting (records)	262144
`-osize <int>`	Output buffer size (records)	262144
`-len <int>`	Record length (bytes)	4
`-off <int>`	Key offset within record (bytes)	0
`-threads <int>`	Number of merge threads	128
`-tsize <int>`	Thread buffer size (records)	16384
`-tmpdir <path>`	Temporary file location	$TMPDIR or .
`-overwrite`	Sort file in-place	false
`-reverse`	Reverse sort order (descending)	false
`-swapbytes`	Swap byte order (endianness)	false

Examples

Sort a file in-place:

./fsort -fi data.bin -overwrite

Sort with custom block size:

./fsort -fi input.bin -fo output.bin -bsize 524288

Sort in descending order:

./fsort -fi input.bin -fo output.bin -reverse

Testing

Generate test data (1 million random unsigned integers):

./gentest test.bin 1000

Sort the data:

./fsort -fi test.bin -fo sorted.bin

Verify correctness:

./dump sorted.bin

Benchmarking

Run the benchmark script to compare fsort performance against the system sort command:

# From project root
./scripts/benchmark.sh [SIZE_IN_BLOCKS]

# Or from build directory (if using CMake)
cd build
cmake --build . --target benchmark

The script will:

Generate random test data (default: 10,000 blocks = 10M integers = 40MB)
Sort using fsort and measure time
Sort using system sort and measure time
Verify both outputs are correct
Display performance comparison

Example with different sizes:

./scripts/benchmark.sh 1000    # 1M integers (4MB)
./scripts/benchmark.sh 10000   # 10M integers (40MB)
./scripts/benchmark.sh 100000  # 100M integers (400MB)

Note: The system sort comparison involves converting binary data to text and back, which adds overhead. For very large files, fsort's external sort algorithm should show significant advantages as it operates directly on binary data and uses efficient memory-mapped I/O.

Algorithm

fsort uses a two-phase approach:

Radix Sort Phase: Divides the input into blocks that fit in memory and sorts each block using a 4-pass byte-level radix sort
Merge Phase: Uses a multi-way merge with a priority queue to combine sorted blocks, repeating merge phases until the entire file is sorted

The algorithm is particularly efficient for sorting large datasets of fixed-size records with numeric keys.

File Format

By default, fsort works with binary files containing unsigned 32-bit integers. Custom record formats are supported using the -len and -off options to specify record size and key location.

License

Please contact the authors regarding licensing and usage terms.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
scripts		scripts
src		src
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
CMakeLists.txt		CMakeLists.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

fsort - External Sorting Utility

Author

Overview

Project Structure

Building

Using CMake (recommended)

Using the legacy Makefile

Usage

Command-line Options

Examples

Testing

Benchmarking

Algorithm

File Format

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

fsort - External Sorting Utility

Author

Overview

Project Structure

Building

Using CMake (recommended)

Using the legacy Makefile

Usage

Command-line Options

Examples

Testing

Benchmarking

Algorithm

File Format

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages