A comprehensive machine learning project implementing and comparing multiple classification models for astronomical object classification using the Sloan Digital Sky Survey (SDSS17) dataset.
- Overview
- Dataset
- Data Processing
- Models Implemented
- Results Comparison
- Key Findings
- Installation
- Autorship Note
- License
This project builds and evaluates four distinct machine learning classification models to classify astronomical objects from the SDSS17 survey into three categories:
- Stars β
- Galaxies π
- Quasars β¨
The project demonstrates the end-to-end machine learning pipeline: data exploration, preprocessing, model training, evaluation, and comparative analysis.
Source: Kaggle - Stellar Classification Dataset - SDSS17
- Total Samples: 100,000 astronomical observations
- Total Features: 17 attributes
- Classes: 3 (Galaxy, Star, Quasar)
| Feature | Type | Description |
|---|---|---|
obj_ID |
ID | Object identifier in the image catalog |
alpha |
Angle | Right ascension (longitude projected on the sky) |
delta |
Angle | Declination (latitude projected on the sky) |
u |
Magnitude | Ultraviolet filter magnitude |
g |
Magnitude | Green spectrum filter magnitude |
r |
Magnitude | Red spectrum filter magnitude |
i |
Magnitude | Near-infrared filter magnitude |
z |
Magnitude | Infrared filter magnitude |
run_ID |
ID | Telescope scan identifier |
rerun_ID |
ID | Image reprocessing number |
cam_col |
ID | Camera column identifier |
field_ID |
ID | Sky field identifier |
spec_obj_ID |
ID | Object identifier in spectroscopy catalog |
class |
Target | Classification (Galaxy, Star, Quasar) |
redshift |
Physical | Cosmological redshift (distance/velocity indicator) |
plate |
ID | Aluminum plate identifier |
MJD |
Date | Modified Julian Date of observation |
fiber_ID |
ID | Optical fiber identifier |
- Identified and removed artificial values (-9999) in magnitude columns (u, g, z)
- Removed incomplete or erroneous telescope measurements
- Final dataset: ~96,000 samples (after cleaning)
- Applied DBSCAN clustering algorithm for outlier removal
- Parameters: eps=0.75, min_samples=16
- Outliers removed: ~15% of data
- Final clean dataset: ~81,000 samples
- Removed non-informative columns: identification numbers and technical telescope parameters
- Used Mutual Information (MI) analysis to rank feature importance
- Final features for modeling: 8 features (alpha, delta, u, g, r, i, z, redshift)
- Applied StandardScaler normalization (zero mean, unit variance)
- Train/Test Split: 80/20 with stratification to maintain class balance
- Encoded class labels to numerical values:
- 0 = Galaxy
- 1 = Star
- 2 = Quasar
Approach: Linear probabilistic classifier using Softmax activation for multiclass problems
Hyperparameters:
- Multi-class: Multinomial
- Solver: LBFGS
- Max iterations: 1000
Results:
- Accuracy: 96%
- Best-performing class: Star (97% F1-Score)
- Worst-performing class: Quasar (91% F1-Score)
Analysis: Shows that data has a linear separability component but struggles with overlapping regions between Galaxies and Quasars.
Approach: Kernel-based method using Radial Basis Function (RBF) for non-linear separation
Hyperparameters:
- Kernel: RBF (Radial Basis Function)
- C (regularization): 60
- Gamma: Scale (automatic)
- Random state: 42
Results:
- Accuracy: 97%
- Precision & Recall: Balanced across all classes
- Significant improvement over Logistic Regression
Analysis: RBF kernel effectively handles non-linear decision boundaries, capturing the overlapping patterns better than linear methods.
Approach: Artificial neural network for learning complex, non-linear relationships
Hyperparameters:
- Activation: ReLU
- Optimizer: Adam
- Learning rate: 0.001
- Hidden layers: 1 Γ 100 neurons
- Max iterations: 200
Results:
- Accuracy: 97%
- Training accuracy: 97.40%
- Test accuracy: 97.37%
- No significant overfitting detected
Analysis: Neural network converges rapidly (within 20 iterations) while maintaining excellent generalization on unseen data.
Approach: Recursive partitioning using information gain-based tree construction
Hyperparameters:
- Criterion: Gini index
- Max depth: 8
- Random state: 42
Results:
- Accuracy: 98% β (Best performer)
- Galaxy class: 98% F1-Score
- Quasar class: 94% F1-Score
- Star class: 100% F1-Score
Analysis: Achieves best overall performance through interpretable decision rules. Feature importance ranking reveals redshift as the primary discriminator.
| Model | Accuracy | Galaxy F1 | Quasar F1 | Star F1 | Computational Cost |
|---|---|---|---|---|---|
| Logistic Regression | 96% | 96% | 91% | 97% | Very Low |
| SVM (RBF) | 97% | 97% | 94% | 98% | Medium |
| MLP | 97% | 98% | 95% | 98% | High |
| Decision Tree | 98% | 98% | 94% | 100% | Low |
Key Observations:
- Star classification: Nearly perfect (99-100% recall across all models)
- Galaxy classification: Consistently accurate (95-99% recall)
- Quasar classification: Most challenging across all models (89-94% recall)
Quasar Misclassification Pattern:
- Quasars misclassified as Galaxies: ~11-12% (primary error source)
- Quasars misclassified as Stars: <1% (rare)
Redshift is the dominant discriminator:
- Mutual Information (MI) score: ~0.80 (highest among all features)
- Provides clear separation between classes due to cosmological distance-velocity relationship
- Nearly perfect separation for Stars (very low redshift values)
- Substantial overlap between Galaxies and Quasars at certain redshift ranges
Secondary important features (in order):
- z (infrared magnitude)
- i (near-infrared magnitude)
- r (red magnitude)
- u (ultraviolet magnitude)
- g (green magnitude)
Least important features:
- alpha, delta (sky coordinates - no physical relevance to classification)
- Physical similarity: Quasars are active galactic nuclei (AGN) - essentially galaxies with supermassive black holes
- Photometric overlap: Color patterns (u, g, r, i, z magnitudes) overlap significantly with normal galaxies
- Redshift ambiguity: Some galaxies have redshift values in the same range as nearby quasars
- Data outliers: Galaxy outliers at high redshift values create classification ambiguity
- Python 3.8 or higher
- pip or conda package manager
pip install -r requirements.txtOr install packages individually:
pip install pandas numpy scikit-learn matplotlib seaborn jupyterpandas>=1.3.0
numpy>=1.21.0
scikit-learn>=0.24.0
matplotlib>=3.4.0
seaborn>=0.11.0
jupyter>=1.0.0
This project was originally developed as a group assignment with LetΓcia Borges. This specific repository is maintained by Maria Clara Rodrigues Ribeiro and focuses exclusively on the Classification Models and the comparative analysis of their results.
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset: Kaggle - Stellar Classification Dataset - SDSS17
- SDSS Survey: Sloan Digital Sky Survey
- Scikit-learn Documentation: sklearn.org
Developed as a final assignment for the Artificial Inteligence course at UNIFEI.