“Find the hidden formula. Conquer the vineyard.”
Symbolic Regression is a regression technique used to generate interpretable mathematical expressions.
In this project, Genetic Programming (GP) is used to create a regressor to predict the number of lugs produced in 1991 using the vineyard dataset. The GP evolves candidate equations over generations to find the best fit.
The dataset is taken from Smoothing Methods in Statistics (pages 146–147). It contains harvest data from a vineyard on a small island in Lake Erie:
- Rows: 52 (northwest → southeast)
- Observations per row: Number of lugs harvested in 1989, 1990, and 1991
- Lug weight: ~13.6078 kg of grapes per lug
- Train/Test Split: 70% training, 30% testing
- Duplicate Rows: Dropped during preprocessing
- Title:
192 vineyard - Variables: 3
- Observations: 52
- Attribute Type: Continuous
- Missing Values: None
- Duplicate Rows: 1 (dropped during preprocessing)
The GP regressor is represented as an Expression Tree:
- Internal Nodes: Operators from the functional set
- Leaf Nodes: Operands from the terminal set or functional set
Expression trees allow flexible node manipulation and easy evaluation of candidate equations.
- Generation Method: Growth Method
- Initial Tree Depth: 0 → start with simple trees
Mean Absolute Error (MAE) is used to evaluate each individual.
- Always non-negative
- Robust to outliers
- Lower MAE indicates a stronger regressor
Tournament Selection is used:
- Randomly pick
kindividuals from the population - Select the one with the lowest fitness score
- Chosen parents reproduce for the next generation
- Keeps top 10% of individuals for the next generation
- Ensures strong candidates are preserved
- Swaps subtrees between parents at a random point
- Crossover Rate: 80%
- Stricter rule: if offspring fitness > parent, crossover retries
- Replaces a node at a random point with a new subtree
- Mutation Rate: 15%
- Stricter rule: if mutated fitness > parent, mutation retries
- Maximum tree depth
- Maximum number of generations
- Population size
- Population Size: 500 → encourages diversity
- Crossover Rate: 80%
- Mutation Rate: 15%
- Elitism Rate: 5%
- Maximum Tree Depth: 3 → manageable tree growth
- Initial Population Method: Growth Method
Functional Set:
{ +, -, *, / }
Terminal Set:
{ x1, x2, c }
x1→ lugs produced in 1989x2→ lugs produced in 1990c→ constant (value = 2, tuned for better regressors)
🧩 Main Quest: Symbolic Regression with Genetic Programming
🚀 Objective: Predict 1991 grape harvests with an interpretable formula
🏆 Reward: A robust mathematical regressor for the vineyard dataset
Honours-level project — built for evolution, not brute force.