An efficient external sorting utility for large binary files using radix sort and merge sort algorithms.
Written by Huy T. Vo while a PhD student at the University of Utah under the supervision of Prof. Cláudio T. Silva.
fsort implements external sorting algorithms designed to handle files larger than available memory. The project includes multiple implementation variants optimized for different use cases:
- mradix.cpp: Main implementation with configurable block sizes and multi-way merge
- radix-sort.c: Memory-mapped file version with 64-bit file offset support
- mradix-sort.c: Memory-mapped version with 32-bit file offsets
- miradix-sort.c: Chunked memory-mapped version for processing in configurable buffers
fsort/
├── src/ # Source files
│ ├── mradix.cpp # Main fsort implementation
│ ├── radix-sort.c # Memory-mapped variants
│ ├── gentest.c # Test data generator
│ └── dump.cpp # Verification utility
├── scripts/ # Utility scripts
│ └── benchmark.sh # Performance benchmark
├── build/ # Build directory (created by CMake)
├── CMakeLists.txt # CMake build configuration
└── README.md # This file
Create a build directory and compile:
mkdir build
cd build
cmake ..
makeBuild with debug symbols:
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Debug ..
makeBuild in release mode with optimizations:
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
makeAll executables will be in the build/ directory.
The original Makefile is still available for direct builds:
make # Build main executable
make gdb # Build with debug symbols
make gprof # Build with profiling support
make gentest # Build test data generator
make dump # Build verification utilityBasic sorting (from the build directory):
./fsort -fi input.bin -fo output.binOr if using legacy Makefile from the project root:
./fsort -fi input.bin -fo output.bin| Option | Description | Default |
|---|---|---|
-fi <file> |
Input file | Required |
-fo <file> |
Output file | stdout |
-bsize <int> |
Block size for in-core sorting (records) | 262144 |
-osize <int> |
Output buffer size (records) | 262144 |
-len <int> |
Record length (bytes) | 4 |
-off <int> |
Key offset within record (bytes) | 0 |
-threads <int> |
Number of merge threads | 128 |
-tsize <int> |
Thread buffer size (records) | 16384 |
-tmpdir <path> |
Temporary file location | $TMPDIR or . |
-overwrite |
Sort file in-place | false |
-reverse |
Reverse sort order (descending) | false |
-swapbytes |
Swap byte order (endianness) | false |
Sort a file in-place:
./fsort -fi data.bin -overwriteSort with custom block size:
./fsort -fi input.bin -fo output.bin -bsize 524288Sort in descending order:
./fsort -fi input.bin -fo output.bin -reverseGenerate test data (1 million random unsigned integers):
./gentest test.bin 1000Sort the data:
./fsort -fi test.bin -fo sorted.binVerify correctness:
./dump sorted.binRun the benchmark script to compare fsort performance against the system sort command:
# From project root
./scripts/benchmark.sh [SIZE_IN_BLOCKS]
# Or from build directory (if using CMake)
cd build
cmake --build . --target benchmarkThe script will:
- Generate random test data (default: 10,000 blocks = 10M integers = 40MB)
- Sort using fsort and measure time
- Sort using system sort and measure time
- Verify both outputs are correct
- Display performance comparison
Example with different sizes:
./scripts/benchmark.sh 1000 # 1M integers (4MB)
./scripts/benchmark.sh 10000 # 10M integers (40MB)
./scripts/benchmark.sh 100000 # 100M integers (400MB)Note: The system sort comparison involves converting binary data to text and back, which adds overhead. For very large files, fsort's external sort algorithm should show significant advantages as it operates directly on binary data and uses efficient memory-mapped I/O.
fsort uses a two-phase approach:
-
Radix Sort Phase: Divides the input into blocks that fit in memory and sorts each block using a 4-pass byte-level radix sort
-
Merge Phase: Uses a multi-way merge with a priority queue to combine sorted blocks, repeating merge phases until the entire file is sorted
The algorithm is particularly efficient for sorting large datasets of fixed-size records with numeric keys.
By default, fsort works with binary files containing unsigned 32-bit integers. Custom record formats are supported using the -len and -off options to specify record size and key location.
Please contact the authors regarding licensing and usage terms.