Skip to content

ctsilva/fsort

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

fsort - External Sorting Utility

An efficient external sorting utility for large binary files using radix sort and merge sort algorithms.

Author

Written by Huy T. Vo while a PhD student at the University of Utah under the supervision of Prof. Cláudio T. Silva.

Overview

fsort implements external sorting algorithms designed to handle files larger than available memory. The project includes multiple implementation variants optimized for different use cases:

  • mradix.cpp: Main implementation with configurable block sizes and multi-way merge
  • radix-sort.c: Memory-mapped file version with 64-bit file offset support
  • mradix-sort.c: Memory-mapped version with 32-bit file offsets
  • miradix-sort.c: Chunked memory-mapped version for processing in configurable buffers

Project Structure

fsort/
├── src/              # Source files
│   ├── mradix.cpp    # Main fsort implementation
│   ├── radix-sort.c  # Memory-mapped variants
│   ├── gentest.c     # Test data generator
│   └── dump.cpp      # Verification utility
├── scripts/          # Utility scripts
│   └── benchmark.sh  # Performance benchmark
├── build/            # Build directory (created by CMake)
├── CMakeLists.txt    # CMake build configuration
└── README.md         # This file

Building

Using CMake (recommended)

Create a build directory and compile:

mkdir build
cd build
cmake ..
make

Build with debug symbols:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Debug ..
make

Build in release mode with optimizations:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

All executables will be in the build/ directory.

Using the legacy Makefile

The original Makefile is still available for direct builds:

make            # Build main executable
make gdb        # Build with debug symbols
make gprof      # Build with profiling support
make gentest    # Build test data generator
make dump       # Build verification utility

Usage

Basic sorting (from the build directory):

./fsort -fi input.bin -fo output.bin

Or if using legacy Makefile from the project root:

./fsort -fi input.bin -fo output.bin

Command-line Options

Option Description Default
-fi <file> Input file Required
-fo <file> Output file stdout
-bsize <int> Block size for in-core sorting (records) 262144
-osize <int> Output buffer size (records) 262144
-len <int> Record length (bytes) 4
-off <int> Key offset within record (bytes) 0
-threads <int> Number of merge threads 128
-tsize <int> Thread buffer size (records) 16384
-tmpdir <path> Temporary file location $TMPDIR or .
-overwrite Sort file in-place false
-reverse Reverse sort order (descending) false
-swapbytes Swap byte order (endianness) false

Examples

Sort a file in-place:

./fsort -fi data.bin -overwrite

Sort with custom block size:

./fsort -fi input.bin -fo output.bin -bsize 524288

Sort in descending order:

./fsort -fi input.bin -fo output.bin -reverse

Testing

Generate test data (1 million random unsigned integers):

./gentest test.bin 1000

Sort the data:

./fsort -fi test.bin -fo sorted.bin

Verify correctness:

./dump sorted.bin

Benchmarking

Run the benchmark script to compare fsort performance against the system sort command:

# From project root
./scripts/benchmark.sh [SIZE_IN_BLOCKS]

# Or from build directory (if using CMake)
cd build
cmake --build . --target benchmark

The script will:

  1. Generate random test data (default: 10,000 blocks = 10M integers = 40MB)
  2. Sort using fsort and measure time
  3. Sort using system sort and measure time
  4. Verify both outputs are correct
  5. Display performance comparison

Example with different sizes:

./scripts/benchmark.sh 1000    # 1M integers (4MB)
./scripts/benchmark.sh 10000   # 10M integers (40MB)
./scripts/benchmark.sh 100000  # 100M integers (400MB)

Note: The system sort comparison involves converting binary data to text and back, which adds overhead. For very large files, fsort's external sort algorithm should show significant advantages as it operates directly on binary data and uses efficient memory-mapped I/O.

Algorithm

fsort uses a two-phase approach:

  1. Radix Sort Phase: Divides the input into blocks that fit in memory and sorts each block using a 4-pass byte-level radix sort

  2. Merge Phase: Uses a multi-way merge with a priority queue to combine sorted blocks, repeating merge phases until the entire file is sorted

The algorithm is particularly efficient for sorting large datasets of fixed-size records with numeric keys.

File Format

By default, fsort works with binary files containing unsigned 32-bit integers. Custom record formats are supported using the -len and -off options to specify record size and key location.

License

Please contact the authors regarding licensing and usage terms.

About

External sorting utility using radix sort and multi-way merge for handling files larger than memory

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors