A simple, memory-efficient Python tool to split large JSON files into smaller, manageable chunks.
It takes a large JSON file (even multiple gigabytes!) and splits the data inside (specifically, an array of objects) into multiple smaller files. You can choose how to split it:
- By a specific number of items per file (`count`).
- By an approximate file size for each file (`size`, e.g., `10MB`).
- By the value of a specific key within the data (`key`, e.g., grouping all items with the same `user_id` together).
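As a rough illustration of the `count` strategy, splitting an array into fixed-size chunks boils down to this (a minimal in-memory sketch, not the tool's streaming implementation):

```python
def chunk_by_count(items, per_file):
    """Yield successive lists of at most `per_file` items."""
    for start in range(0, len(items), per_file):
        yield items[start:start + per_file]

records = [{"id": i} for i in range(7)]
chunks = list(chunk_by_count(records, 3))
# Splitting 7 records with per_file=3 yields chunks of sizes 3, 3, and 1.
```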
- Handles Huge Files: Designed for files too large to fit into memory, using efficient streaming via `ijson`.
- Flexible Splitting: Choose the method that best suits your needs.
- Works with Nested Data: Can target data deep within the JSON structure using a simple dot-notation path (e.g., `data.records.item`).
- Memory-Efficient Key Splitting: Uses an LRU (Least Recently Used) cache to limit the number of simultaneously open files when splitting by key, preventing issues with resource limits.
- Easy to Use: Run via the command line or a helpful interactive mode.
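To build intuition for how a dot-notation path addresses nested data, here is a tiny stdlib sketch (the tool itself resolves paths with `ijson` while streaming; this in-memory version is illustrative only):

```python
def resolve_path(doc, path):
    """Walk a nested dict following a dot-separated path like 'data.records'."""
    node = doc
    for part in path.split("."):
        node = node[part]
    return node

doc = {"data": {"records": [{"id": 1}, {"id": 2}]}}
records = resolve_path(doc, "data.records")  # the array to be split
```

Note that `ijson` paths additionally end in `.item` to denote the elements of the array; this sketch stops at the array itself.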
- Python 3.7 or newer
1. Clone the repository:

        git clone https://github.com/hamzasahin/json-splitter.git  # Or your repo URL
        cd json-splitter

2. Set up a virtual environment (recommended), to keep dependencies isolated:

        python -m venv venv
        source venv/bin/activate  # On Windows use: venv\Scripts\activate

3. Install dependencies (ensure `cachetools` is included if it wasn't already):

        pip install -r requirements.txt
You have three ways to run the splitter, with the configuration file method being recommended for clarity and reproducibility:
You can define all splitting parameters in a YAML configuration file and load it with the `--config` option. This is useful for complex or frequently used configurations.
Example:

    python -m src.main --config my_split_config.yaml

(If you need to temporarily override a setting from the config file, you can still pass specific command-line arguments.)
Argument Precedence:
- Command-Line Arguments: Have the highest priority and will override any settings from the config file.
- Configuration File: Values specified in the loaded YAML file override the built-in defaults.
- Built-in Defaults: Have the lowest priority (e.g., `output_dir="."`, `base_name="chunk"`).
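This layering can be pictured as stacked dictionaries where the first match wins. A sketch using `collections.ChainMap` (not necessarily how the tool merges settings internally):

```python
from collections import ChainMap

defaults = {"output_dir": ".", "base_name": "chunk"}
config_file = {"output_dir": "out/"}   # values loaded from the YAML file
cli_args = {"base_name": "log"}        # values parsed from the command line

# Lookup checks cli_args first, then config_file, then defaults,
# so CLI > config file > built-in defaults.
settings = ChainMap(cli_args, config_file, defaults)
# settings["output_dir"] is "out/", settings["base_name"] is "log"
```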
See the `config.example.yaml` file in the repository for a template and examples of all available options.
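For a sense of the shape, a config file might look like the following; the field names here simply mirror the CLI options and are illustrative, so check `config.example.yaml` for the authoritative names:

```yaml
# my_split_config.yaml -- hypothetical example; see config.example.yaml
input_file: large_log.json
split_by: size
value: 100MB
path: events.item
output_dir: log_parts/
base_name: log
output_format: json
```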
For scripting or direct control without a config file, use the command line from the project root directory:
    python -m src.main <input_file> --split-by <strategy> --value <split_value> --path <json_path> [options]

Core Arguments:
| Argument | Description |
|---|---|
| `input_file` | Path to your large input JSON file. |
| `--split-by` | How to split: `count`, `size`, or `key`. |
| `--value` | The value for the split strategy (e.g., `10000`, `50MB`, `product_id`). |
| `--path` | Dot-notation path to the array to split (e.g., `item`, `data.records.item`). Use `item` or leave empty for the root array. |
Common Options:
| Option | Description |
|---|---|
| `--config <file>` | Path to a YAML configuration file (overrides defaults before other CLI args). |
| `--output-dir <dir>` | Directory to save output files (default: current directory `.`). |
| `--base-name <name>` | Base name for output files (default: `chunk`). |
| `--output-format` | `json` (default, pretty-printed) or `jsonl` (JSON Lines). (Note: `key` split forces `jsonl`.) |
| `--max-records <N>` | Secondary limit: max number of items per output file part. |
| `--max-size <size>` | Secondary limit: max approximate size per output file part (e.g., `100MB`). |
| `--filename-format` | Customize output file names (see Filename Formatting below). |
| `--report-interval <N>` | Report progress every N items (default: `10000`). Set to `0` to disable. |
| `-v, --verbose` | Show detailed debug messages. |
Key Splitting Options:
(Only relevant when using `--split-by key`)
| Option | Description |
|---|---|
| `--on-missing-key` | What to do if an item lacks the key: `group` (default, into a `__missing_key__` file), `skip`, or `error` (stop the script). |
| `--on-invalid-item` | What to do if an item at `--path` isn't an object: `warn` (default, prints a warning and skips), `skip`, or `error` (stop the script). |
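The `--on-missing-key` policies behave roughly like this (a simplified sketch; the real splitter streams items and appends them to per-key files):

```python
def route_key(item, key, on_missing="group"):
    """Return the output-file key for an item under the given missing-key policy."""
    if key in item:
        return item[key]
    if on_missing == "group":
        return "__missing_key__"  # collected into one shared file
    if on_missing == "skip":
        return None               # caller drops the item
    raise KeyError(f"item lacks required key {key!r}")  # "error" policy

assert route_key({"user_id": "u1"}, "user_id") == "u1"
assert route_key({}, "user_id") == "__missing_key__"
assert route_key({}, "user_id", on_missing="skip") is None
```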
If you're unsure about the options or running a one-off split, just run the script without any arguments from the project root directory. It will guide you step-by-step:
    python -m src.main

(These examples use the CLI method for conciseness, but the same options can be put into a config file.)
Example 1: Split by Size
Split large_log.json into files of roughly 100MB each, saving them in the log_parts/ directory. The data to split is the array found under the events key.
    python -m src.main large_log.json --output-dir log_parts/ --base-name log --split-by size --value 100MB --path events.item

Creates files like `log_parts/log_chunk_0000.json`, `log_parts/log_chunk_0001.json`, ...
(Note: in `ijson`'s path notation, `events.item` targets the elements of the array stored under the `events` key; adjust the path if your structure differs.)
Example 2: Split by Count (JSON Lines)
Split user_data.json into files containing 50,000 users each, outputting as JSON Lines into the user_outputs/ directory.
    python -m src.main user_data.json --output-dir user_outputs/ --base-name user --split-by count --value 50000 --path results.users.item --output-format jsonl

Creates files like `user_outputs/user_chunk_0000.jsonl`, `user_outputs/user_chunk_0001.jsonl`, ...
Example 3: Split by Key
Group items from orders.json based on their customer_id, saving files into customer_orders/. The items are in the root array. Output will be JSON Lines format.
    # Use 'item' or an empty string for the root array path
    python -m src.main orders.json --output-dir customer_orders/ --base-name order --split-by key --value customer_id --path item

Creates files like `customer_orders/order_key_cust101.jsonl`, `customer_orders/order_key_cust456.jsonl`, ... (note the `.jsonl` extension).
You can control the output filenames using `--filename-format` (either via the CLI or in the config file). Available placeholders:
- `{base_name}`: The `--base-name` you provided (default: `chunk`).
- `{type}`: How the split was done (`chunk` for count/size, `key` for key split).
- `{index}`: The chunk number (e.g., `0000`, `0001`) or the sanitized key value (e.g., `user123`).
- `{part}`: An optional part suffix (e.g., `_part_0001`) added if secondary limits (`max-records`/`max-size`) cause a split within a primary chunk/key.
- `{ext}`: The file extension (`json` or `jsonl`).
Default Format: `{base_name}_{type}_{index:04d}{part}.{ext}` (for count/size) or `{base_name}_key_{index}{part}.{ext}` (for key).
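The placeholders are standard Python `str.format` fields, so the default count/size pattern expands like this (illustrative only):

```python
fmt = "{base_name}_{type}_{index:04d}{part}.{ext}"

# A plain chunk with no secondary-limit suffix:
name = fmt.format(base_name="chunk", type="chunk", index=1, part="", ext="json")
# -> "chunk_chunk_0001.json"

# The same pattern with a part suffix from a secondary limit:
name2 = fmt.format(base_name="log", type="chunk", index=0, part="_part_0001", ext="jsonl")
# -> "log_chunk_0000_part_0001.jsonl"
```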
- Input Must Be Valid JSON: The script expects a syntactically correct JSON file. If you have issues, validate your input file first.
- Memory Use with Many Keys: Splitting by `key` on data with millions of unique keys uses an LRU cache to manage open file handles, limited by a fixed size (currently 1000; see `MAX_OPEN_FILES_KEY_SPLIT` in `splitters.py`). This prevents hitting OS limits, but means files for less frequent keys may be closed and reopened, costing some performance compared to keeping all files open. If you encounter memory issues with extreme key cardinality, this limit might need adjustment in the code.
- Key Split Output Format: Splitting by `key` always produces output files in JSON Lines (`.jsonl`) format, regardless of the `--output-format` setting. This is more efficient for appending items to many different files.
- Size Estimation: Splitting by `size` is an approximation. Actual file sizes may vary slightly due to JSON formatting overhead and how items are grouped.
- JSON Path: The `--path` argument uses `ijson`'s dot notation (e.g., `data.records.item`). If your target array is at the root of the JSON, use `item` or leave the path empty (`--path ""`).
- Pretty JSON Output: With the default `--output-format json`, the output JSON files are pretty-printed with indentation for better readability.
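The LRU file-handle idea described in the notes can be sketched with a plain `OrderedDict` (the project uses `cachetools` and the `MAX_OPEN_FILES_KEY_SPLIT` constant; this stdlib version just demonstrates the eviction behaviour, with strings standing in for file handles):

```python
from collections import OrderedDict

class LRUHandles:
    """Keep at most `capacity` entries; evict the least recently used."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.handles = OrderedDict()
        self.closed = []  # record which keys were evicted

    def get(self, key):
        if key in self.handles:
            self.handles.move_to_end(key)        # mark as recently used
        else:
            if len(self.handles) >= self.capacity:
                old_key, _ = self.handles.popitem(last=False)
                self.closed.append(old_key)      # a real splitter would close the file here
            self.handles[key] = f"handle:{key}"  # a real splitter would (re)open the file here
        return self.handles[key]

cache = LRUHandles(capacity=2)
cache.get("cust101")
cache.get("cust456")
cache.get("cust101")  # touch: cust456 is now least recently used
cache.get("cust999")  # evicts cust456, which would be reopened if seen again
```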
Found a bug or have an idea? Feel free to open an Issue or Pull Request!
This project is licensed under the MIT License.