An exploratory data analysis of global shark attack incidents using the Global Shark Attack File (GSAF) dataset.
This project was developed collaboratively as part of the IronHack Data Analytics Bootcamp, simulating a data analysis brief for the SafeWaters Travel Advisory Bureau — a tourism organisation focused on travel safety recommendations.
- Anwen Roberts
- Gabriela Cascione
- Salem Ibrahim
| Detail | Info |
|---|---|
| Source | Global Shark Attack File – Incident Log |
| File | GSAF5.xls |
| Records | 7,082 incidents |
| Columns | 23 |
| Time span | Historical records through March 2026 |
| Column | Description |
|---|---|
| Date | Date of the incident |
| Year | Year of the incident |
| Type | Provoked, Unprovoked, Watercraft, Sea Disaster, Invalid, etc. |
| Country | Country where the incident occurred |
| State | State or region |
| Location | Specific location description |
| Activity | What the victim was doing (surfing, swimming, fishing, etc.) |
| Name | Name of the victim |
| Sex | Sex of the victim |
| Age | Age of the victim |
| Injury | Description of injuries sustained |
| Fatal Y/N | Whether the incident was fatal |
| Time | Time of day of the incident |
| Species | Shark species involved |
| Source | Source of the report |
- Shark attacks are more likely to occur during summer months than winter months at each destination.
- Shark attacks are more likely to occur during afternoon hours than any other time of day.
- Swimming is the activity with the highest number of shark attacks, because it is one of the most common water activities.
- The USA, Australia and South Africa account for the highest number of shark attacks globally.
- The number of shark attacks has increased globally over recent decades.
- Dropped: `Name`, `Source`, `pdf`, `href`, `href formula`, `Case Number`, `Case Number.1`, `original order` (PII or not relevant for analysis); `Unnamed: 21`, `Unnamed: 22` (legacy empty columns from the original Excel file)
- Renamed: all column names lowercased and renamed for consistency (`fatal y/n` → `fatal_yes_no`, `type` → `attack_type`, `sex` → `gender`)
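The drop and rename steps can be sketched in Pandas as follows. This is a minimal illustration on a made-up one-row frame, not the notebook's actual code; only the column names come from the lists above.

```python
import pandas as pd

# Toy frame using a few of the raw column names (the values are invented).
raw = pd.DataFrame({
    "Date": ["15-Mar-2024"],
    "Type": ["Unprovoked"],
    "Sex": ["M"],
    "Fatal Y/N": ["N"],
    "Name": ["example"],
    "Unnamed: 21": [None],
})

# Columns listed as dropped; errors="ignore" skips any that are absent here.
to_drop = [
    "Name", "Source", "pdf", "href", "href formula", "Case Number",
    "Case Number.1", "original order", "Unnamed: 21", "Unnamed: 22",
]
df = raw.drop(columns=to_drop, errors="ignore")

# Lowercase and strip every column name, then apply the three renames.
df.columns = df.columns.str.strip().str.lower()
df = df.rename(columns={
    "fatal y/n": "fatal_yes_no",
    "type": "attack_type",
    "sex": "gender",
})

print(list(df.columns))
```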
- `attack_type`: stripped, lowercased and standardised to `Unprovoked`, `Provoked`, `Watercraft` or `NaN`
- `fatal_yes_no`: inconsistent values mapped to `Yes`, `No` or `NaN`
- `gender`: invalid entries (`?`, `lli`, `m x 2`, etc.) replaced with `NaN`; valid values uppercased
- `year`: rows with a year before 1900 or a null year removed; column converted to integer
- `country`: lowercased, stripped and spelling inconsistencies standardised; rows where country, state and location were all null dropped (no way to identify location)
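One way to express this kind of value standardisation is a small function per column, applied with `Series.map`. The helpers below are a hypothetical sketch: their names and the exact sets of accepted raw values are assumptions, not the mappings used in the notebook. They return `None` for anything unrecognised, which Pandas treats as missing.

```python
def clean_attack_type(value):
    """Return 'Unprovoked', 'Provoked', 'Watercraft' or None (missing)."""
    if not isinstance(value, str):
        return None
    allowed = {"unprovoked": "Unprovoked", "provoked": "Provoked",
               "watercraft": "Watercraft"}
    return allowed.get(value.strip().lower())


def clean_fatal(value):
    """Return 'Yes', 'No' or None. Assumes raw flags are variants of Y/N."""
    if not isinstance(value, str):
        return None
    return {"Y": "Yes", "N": "No"}.get(value.strip().upper())


def clean_gender(value):
    """Uppercase valid entries; anything other than M or F becomes None."""
    if not isinstance(value, str):
        return None
    key = value.strip().upper()
    return key if key in {"M", "F"} else None


print(clean_attack_type(" UNPROVOKED "), clean_fatal(" n"), clean_gender("lli"))
```

With a DataFrame, each helper would be applied column by column, for example `df["gender"] = df["gender"].map(clean_gender)`.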
- `hemisphere`: derived from `country`, labelling each incident as `Northern` or `Southern`
- `month`: extracted from the messy `date` column using regex and datetime parsing
- `time_of_day`: derived from `time`, categorising incidents into `Morning`, `Afternoon`, `Evening` or `Night`
- `season`: derived from `month` and `hemisphere`, accounting for seasonal differences between hemispheres
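The derived columns above could be built from helpers along these lines. Everything here is an illustrative assumption rather than the notebook's code: the function names, the `DD-Mon-YYYY` date pattern, the `14h30` time format, the hour boundaries for each time of day, and the use of meteorological seasons (December to February as Northern winter).

```python
import re
from datetime import datetime

# Assumed date shape: day, English month abbreviation, four-digit year.
DATE_RE = re.compile(r"(\d{1,2})[-\s]([A-Za-z]{3})[-\s](\d{4})")
TIME_RE = re.compile(r"(\d{1,2})h(\d{2})")

NORTHERN_SEASONS = {}
for months, name in [((12, 1, 2), "Winter"), ((3, 4, 5), "Spring"),
                     ((6, 7, 8), "Summer"), ((9, 10, 11), "Autumn")]:
    for m in months:
        NORTHERN_SEASONS[m] = name
OPPOSITE = {"Winter": "Summer", "Summer": "Winter",
            "Spring": "Autumn", "Autumn": "Spring"}


def extract_month(raw_date):
    """Return the month number (1-12) found in a messy date string, else None."""
    if not isinstance(raw_date, str):
        return None
    match = DATE_RE.search(raw_date)
    if match is None:
        return None
    try:
        return datetime.strptime("-".join(match.groups()), "%d-%b-%Y").month
    except ValueError:  # e.g. an impossible date or unknown abbreviation
        return None


def season_for(month, hemisphere):
    """Map a month to a season, flipping the label for the Southern hemisphere."""
    if month not in NORTHERN_SEASONS or hemisphere not in ("Northern", "Southern"):
        return None
    season = NORTHERN_SEASONS[month]
    return season if hemisphere == "Northern" else OPPOSITE[season]


def time_of_day(raw_time):
    """Bucket a time such as '14h30' (or a plain label) into four categories."""
    if not isinstance(raw_time, str):
        return None
    text = raw_time.strip()
    if text.capitalize() in {"Morning", "Afternoon", "Evening", "Night"}:
        return text.capitalize()
    match = TIME_RE.match(text)
    if match is None or int(match.group(1)) > 23:
        return None
    hour = int(match.group(1))
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 22:
        return "Evening"
    return "Night"


month = extract_month("Reported 08-Jan-2018")
print(month, season_for(month, "Southern"), time_of_day("14h30"))
```

Keeping the hemisphere flip in one lookup table means the season logic is written once for the Northern hemisphere and mirrored for the Southern.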
The activity column contained 600+ unique values including detailed narrative descriptions of historical incidents. Since the analysis focuses on the top activities only, no further standardisation was performed. The column was converted to string type and stripped to ensure compatibility for analysis.
The analysis was scoped to 1976–2026 (last 50 years) for more reliable and relevant data.
- Season distribution analysed using `value_counts()` and a bar chart
- Crosstab tables built to compare season counts and percentages by hemisphere, accounting for seasonal differences between Northern and Southern hemispheres
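A count table and a row-percentage table of this kind can be produced with `pd.crosstab`. The sketch below uses a tiny invented frame, so the numbers are not project results; it only shows the shape of the comparison.

```python
import pandas as pd

# Invented sample: one row per incident.
df = pd.DataFrame({
    "hemisphere": ["Northern", "Northern", "Northern", "Southern", "Southern"],
    "season": ["Summer", "Summer", "Winter", "Summer", "Autumn"],
})

# Raw counts of incidents per hemisphere and season.
counts = pd.crosstab(df["hemisphere"], df["season"])

# normalize="index" turns each row into shares that sum to 1;
# multiplying by 100 gives percentages within each hemisphere.
shares = pd.crosstab(df["hemisphere"], df["season"], normalize="index").mul(100).round(1)

print(counts)
print(shares)
```

Normalising by row keeps the two hemispheres comparable even when one has far more recorded incidents than the other.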
- Time of day distribution analysed using `value_counts()` and a bar chart
- Top 5 activities identified by count; all others grouped into `Other`
- Bar chart built from aggregated counts
- Top 10 activities listed for broader reference
- Most fatal activities in the top 3 countries identified separately
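The "top N plus `Other`" grouping can be sketched as below. The activity counts are invented for illustration and the variable names are hypothetical; the same pattern would apply to the country column with a different `TOP_N`.

```python
import pandas as pd

# Invented counts per activity, expanded to one row per incident.
toy_counts = {"surfing": 5, "swimming": 4, "fishing": 3, "wading": 2,
              "diving": 2, "kayaking": 1, "snorkeling": 1}
activity = pd.Series(
    [name for name, n in toy_counts.items() for _ in range(n)],
    name="activity",
)

TOP_N = 5
top = activity.value_counts().nlargest(TOP_N).index

# Keep the top categories as they are and relabel everything else as "Other".
grouped = activity.where(activity.isin(top), "Other")
summary = grouped.value_counts()

print(summary)
```

Calling `summary.plot.bar()` on the result would then draw the aggregated bar chart with Matplotlib.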
- Top 10 countries identified by count; all others grouped into `Other`
- Bar chart built from aggregated counts
- Top 3 countries isolated for time series analysis (attacks per year, 1976–2026)
- Global time series built to observe overall trend over the last 50 years
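The two time series can be sketched as follows, again on invented rows. Reindexing over the full year range is an assumption about how gaps might be handled: it makes years with no recorded incidents show up as zero instead of being skipped.

```python
import pandas as pd

# Invented sample: one row per incident.
df = pd.DataFrame({
    "year": [1975, 1976, 1976, 1978, 2025, 2026],
    "country": ["usa", "usa", "australia", "usa", "south africa", "usa"],
})

# Scope to the analysis window (both bounds inclusive).
recent = df[df["year"].between(1976, 2026)]

# Global series: incidents per year, with missing years filled as 0.
per_year = (
    recent["year"].value_counts()
    .sort_index()
    .reindex(range(1976, 2027), fill_value=0)
)

# Per-country series for the three most frequent countries.
top3 = recent["country"].value_counts().nlargest(3).index
by_country = (
    recent[recent["country"].isin(top3)]
    .groupby(["year", "country"])
    .size()
    .unstack(fill_value=0)
)

print(per_year.head())
print(by_country)
```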
- Python (Pandas, Matplotlib, Seaborn)
- Jupyter Notebook
shark-attack-analysis/
│
├── data/
│ ├── raw_data_GSAF5.xls # Raw dataset (not modified)
│ └── shark_data_clean.pkl # Cleaned and wrangled dataset
│
├── Plots/ # Exported visualisations
│ ├── activity_bar.png
│ ├── country_bar.png
│ ├── fatal_activities_bar.png
│ ├── season_bar.png
│ ├── time_series_attack.png
│ ├── tod_bar.png
│ └── top_3_timeseries.png
│
├── shark_data_cleaning.ipynb # Data wrangling and cleaning
├── shark_data_analysis.ipynb # Exploratory data analysis
├── safewaters_presentation.pdf # Final presentation
├── shark_travel_advisory_logo.svg # Project logo
│
└── README.md
1. Clone the repository and open the analysis notebook:

       git clone https://github.com/craftedbygaby/shark-attack-analysis.git
       cd shark-attack-analysis
       jupyter notebook shark_data_analysis.ipynb

2. Install dependencies if needed:

       pip install pandas numpy matplotlib seaborn xlrd
- This dataset is maintained by the Global Shark Attack File, a not-for-profit organisation.
- The data spans centuries of records, so older entries may be less complete or reliable than recent ones.
- All analysis is for educational and portfolio purposes only.