Purpose: We have all taken exams throughout our academic careers. Whether the exam is a standardized test or a test for a specific school subject, some students struggle with particular portions of it. With this dataset, we want to see how students’ grades relate to their other attributes, including but not limited to test preparation and parental level of education.
Core Questions:
- What is the distribution of each score?
- What is the correlation between grades and test preparation?
- What is the correlation between grades and groups?
- What is the correlation between grades and lunch?
- What is the correlation between grades and gender?
- What is the correlation between grades and parents' level of education?
Hypothesis: We believe students who prepare for the test will score higher overall than students who do not prepare.
Source: https://www.kaggle.com/datasets/whenamancodes/students-performance-in-exams
Our dataset came from Kaggle and is called Students Performance in Exams by Aman Chauhan. The dataset has eight columns, or in this case eight different attributes. Three attributes hold numerical values, while the other five hold string values. The attributes are the student’s gender, race/ethnicity, parental level of education, lunch type, test preparation course, math score, reading score, and writing score. The sample size of the dataset is 1000, which means we have 1000 different students with varying attributes. Furthermore, we wanted to see the average of the three scores for each student. To do this, we appended a new column to the original dataset and named it “avgScore.” This column adds up all three of the student’s scores and divides by three to find their average (See Figure 1).
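The avgScore column described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column names "math score", "reading score", and "writing score" are assumed to match the CSV headers, and the three rows are made-up values rather than rows from the dataset.

```python
import pandas as pd

# Illustrative rows only; these values are made up, not taken from the dataset.
df = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

# Assumed column names for the three numerical attributes.
score_cols = ["math score", "reading score", "writing score"]

# Row-wise mean of the three scores, i.e. (math + reading + writing) / 3.
df["avgScore"] = df[score_cols].mean(axis=1)

print(df)
```

Using mean(axis=1) instead of adding the columns by hand keeps the calculation correct if another score column is later added to score_cols.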
In later parts of the report, we will show the division of work, that is, who answered which of our core questions. In this project, we used the material we learned in class from the lecture Jupyter Notebooks. We used Seaborn and Matplotlib mainly for visualization, and NumPy and Pandas to manipulate our data. We also checked and cleaned our data. For example, to check the cleanliness of our dataset we used the Pandas methods isnull() and isna() and summed the null and NA values in each column to see whether any were present (See Figure 2).
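A minimal sketch of the missing-value check described above. The small DataFrame is made up for illustration and deliberately contains one missing value so the check has something to report; the real dataset may contain none.

```python
import numpy as np
import pandas as pd

# Made-up rows with one deliberately missing math score.
df = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [72, np.nan, 90],
    "reading score": [72, 90, 95],
})

# isna() and isnull() are aliases in pandas, so either one gives the same result.
# Summing the boolean mask counts the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# The data is clean (in this sense) only if every column has zero missing values.
is_clean = missing_per_column.sum() == 0
print("clean:", is_clean)
```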
For the results, we discuss our findings in our personal sections, because they show the division of work, which visuals each of us created, and which questions we answered with those visualizations. Additionally, our hypothesis was that students who completed a test preparation course or prepared for the exam in any way would perform better, and our visuals support that hypothesis.
In conclusion, our analysis and visualizations point to several factors and attributes that may affect a student’s exam score. We found that students whose parents had higher levels of education usually scored higher on their exams than those whose parents had lower levels of education. Next, we found that students who had a standard lunch, essentially those who paid more for lunch, usually scored higher than those who had free or reduced lunch. Relating back to our hypothesis, we can also conclude that students who prepared for the exam performed better than those who did not prepare. Finally, from our visualizations, mainly the pair plots, we can see that all the test scores (math, reading, and writing) are positively correlated with each other. For example, as students’ math scores increased, so did their reading scores.
Some points of view may oppose our findings for this dataset. Some people may expect students whose parents went to higher education to perform worse in school, because they may not share the aspiration to reach higher education that motivates students whose parents did not go to college. On the other hand, parents who went to higher education may be more capable of “lending a helping hand” when their children study for the exam. For instance, a parent may have a strong grasp of certain subjects, like math, and be able to explain them and lead their children to success. Overall, there may be other factors that play a role in determining students’ performance on these exams.
In this section, we show the division of work and what we observed and concluded based on the visualizations we created.
Observation: These graphs show the distribution of students in the data organized by gender. It is evident that there are slightly more male than female students.
Observation: These graphs demonstrate the diversity in the data by showing how many students of each race/ethnicity group took the tests. There is a slight inequality in the total number of students in each group, which is to be expected considering each student was randomly selected.
Distribution Between Scores:
Observation: These bar graphs show the distribution of each of the three test scores, and then of the average score across all tests. The shape of these graphs approximately follows a normal distribution, with slight variation. Another observation is that more students scored perfect or very high on the tests than scored extremely poorly.
Distribution Between Genders:
Observation: This graph compares the distribution of average scores between male and female students. Purple, which is the most common color in the visualization, shows the overlap of male and female scores. The graph also approximately follows a normal distribution with slight variation; namely, more students scored extremely high than extremely low.
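One way such an overlapping histogram could be produced is sketched below. It is not the code behind the figure: the data is randomly generated for illustration, and the red and blue colors, the bin width of 10, and the alpha value are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data, NOT the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gender": ["female"] * 50 + ["male"] * 50,
    "avgScore": np.clip(rng.normal(68, 14, 100), 0, 100),
})

bins = np.arange(0, 101, 10)  # shared bin edges: 0-10, 10-20, ..., 90-100

fig, ax = plt.subplots()
total_counted = 0
for gender, color in [("female", "red"), ("male", "blue")]:
    scores = df.loc[df["gender"] == gender, "avgScore"]
    # alpha < 1 makes the bars semi-transparent so overlapping regions blend.
    counts, _, _ = ax.hist(scores, bins=bins, alpha=0.5, color=color, label=gender)
    total_counted += int(counts.sum())

ax.set_xlabel("avgScore")
ax.set_ylabel("count")
ax.legend()
```

With semi-transparent red and blue bars, the regions where both groups have counts blend into purple, which matches the description above. In a Jupyter notebook, the matplotlib.use("Agg") line can be dropped so the figure renders inline.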
Observation: This histogram has the average test score on the x-axis and the count on the y-axis. It displays the distribution of average test scores split into categories determined by the education of the student’s parents.
Observation: The graph is a scatter plot with the math scores of both males and females on the x-axis and the reading scores on the y-axis. It also draws a line of best fit (regression line) for each gender, with red representing the female scores and dark blue the male scores.
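The per-gender regression lines can be sketched numerically as below. This uses NumPy's polyfit as a stand-in for whatever the plotting library computed internally, and the eight rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from the dataset.
df = pd.DataFrame({
    "gender": ["female"] * 4 + ["male"] * 4,
    "math score": [50, 60, 70, 80, 55, 65, 75, 85],
    "reading score": [62, 70, 79, 88, 52, 61, 69, 78],
})

# Fit a separate least-squares line, reading = slope * math + intercept, per gender.
fits = {}
for gender, group in df.groupby("gender"):
    x = group["math score"].to_numpy(dtype=float)
    y = group["reading score"].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    fits[gender] = (slope, intercept)

for gender, (slope, intercept) in fits.items():
    print(f"{gender}: reading = {slope:.3f} * math + {intercept:.3f}")
```

Comparing the two slopes shows whether reading scores rise faster with math scores for one group than for the other.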
We believe a box plot is the best visualization for comparing the scores across the categories of a specific attribute. A box plot summarizes the distribution of the data through its median, quartiles, minimum, and maximum; the mean appears only if it is explicitly marked. With box plots, we can see the middle portion of the data, where most students scored.
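The quantities a box plot draws can also be computed directly, which helps when reading the plots. The sketch below uses made-up math scores and groups them by gender; the column names are assumed to match the dataset's headers.

```python
import pandas as pd

# Made-up scores for illustration; not taken from the dataset.
df = pd.DataFrame({
    "gender": ["female"] * 5 + ["male"] * 5,
    "math score": [50, 60, 65, 70, 80, 55, 65, 70, 75, 90],
})

# describe() returns count, mean, std, min, quartiles, and max per group;
# keep only the five numbers that define the box and the range of the data.
summary = df.groupby("gender")["math score"].describe()[
    ["min", "25%", "50%", "75%", "max"]
]

# The interquartile range is the height of the box: the middle 50% of students.
summary["IQR"] = summary["75%"] - summary["25%"]
print(summary)
```

Note that Matplotlib and Seaborn by default end the whiskers at the most extreme points within 1.5 times the IQR of the box and draw anything beyond that as separate outlier points, so the whisker ends are not always the true minimum and maximum.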
Observation: The math score plot shows that male students have a higher average math score than female students. However, both the reading and writing score plots show that female students have a higher average score than male students.
Observation: Students with parents who went to college or some higher education perform better in all subjects. The interquartile range also shifts toward higher scores as the parental level of education increases.
Observation: There is less of a noticeable difference in exam scores between students of different races and ethnicities. However, we can see that in all box plots, groups D and E are usually higher than groups A, B, and C. These two groups also have means similar to each other.
Observation: Students who complete a test preparation course or do some kind of test preparation score higher than students who do not do any test preparation.
Observation: Students who have a standard lunch score higher on the exams than students who have free or reduced lunches.
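A numerical check to accompany the box plots for the hypothesis could look like the sketch below. The rows are made up, so the numbers printed say nothing about the real dataset; the category labels "completed" and "none" are assumed values for the test preparation course column.

```python
import pandas as pd

# Made-up rows for illustration; the category labels are assumptions.
df = pd.DataFrame({
    "test preparation course": ["completed", "none", "completed",
                                "none", "none", "completed"],
    "math score": [70, 60, 80, 65, 55, 75],
    "reading score": [75, 62, 85, 66, 58, 80],
    "writing score": [77, 60, 86, 64, 56, 81],
})

score_cols = ["math score", "reading score", "writing score"]
df["avgScore"] = df[score_cols].mean(axis=1)

# Group size, mean, and median of the average score for each preparation status.
by_prep = df.groupby("test preparation course")["avgScore"].agg(
    ["count", "mean", "median"]
)
print(by_prep)
```

The same groupby pattern applies to the lunch, gender, race/ethnicity, and parental level of education columns by swapping the grouping column.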
Observation: The diagonal shows the distribution of each score, while the off-diagonal plots show the relationship between two different scores. For example, the plot in the first row, middle column shows that as the reading score increases, the math score also increases. The plot in the second row, middle column shows that as the reading score increases, the writing score also increases, but the points are more tightly clustered, which means that these two scores have a stronger correlation.
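How tightly the points cluster in each off-diagonal panel can be quantified with a correlation matrix. The sketch below uses made-up scores and pandas' default Pearson correlation; it illustrates the idea and is not a computation from the report.

```python
import pandas as pd

# Made-up scores for illustration; not taken from the dataset.
df = pd.DataFrame({
    "math score": [47, 76, 40, 64, 71, 88, 58, 69],
    "reading score": [57, 78, 43, 64, 83, 95, 54, 75],
    "writing score": [44, 75, 39, 67, 78, 92, 52, 78],
})

# Pairwise Pearson correlation (the pandas default) between the score columns.
# Values closer to 1 correspond to more tightly clustered, upward-sloping panels.
corr = df.corr()
print(corr.round(3))
```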
Observation: The difference between this pair plot and the last one is that it splits the points into two parts (by gender). Some plots show distinct sections for males and females. For example, in the first row, middle column, the math score increases more slowly with the reading score for female students than for male students. Another example is in the second row, first column, where the writing score increases more slowly with the math score for male students than for female students.
Observation: This pair plot shows the correlation between the different scores. Each color represents a different group; the dataset does not say which race or ethnicity each group label corresponds to. In this pair plot, we can see that race/ethnicity does not appear to change how one score relates to another. Additionally, the diagonal of the pair plot shows the distributions, from which we can see which group would need more help in a certain subject. In this case, group A would need help with math.
Observation: For the data analysis technique, I used LinearRegression from sklearn to find the equation relating reading scores and writing scores. With the intercept set to 0, the equation came out to be y = 0.9829247088231474x. This supports our initial finding that reading and writing scores are highly correlated and that as the reading score increases, the writing score also increases. In this case, the predicted writing score is about 0.98 times the reading score.
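A minimal sketch of a zero-intercept fit like the one described above. It assumes that "intercept set to 0" corresponds to LinearRegression(fit_intercept=False), and it uses made-up scores, so the slope it prints will not match the value reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up scores for illustration; not taken from the dataset.
reading = np.array([57, 78, 43, 64, 83, 95, 54, 75], dtype=float)
writing = np.array([44, 75, 39, 67, 78, 92, 52, 78], dtype=float)

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features).
X = reading.reshape(-1, 1)

# fit_intercept=False forces the line through the origin: writing = slope * reading.
model = LinearRegression(fit_intercept=False)
model.fit(X, writing)
slope = model.coef_[0]

# Cross-check against the closed-form least-squares slope through the origin.
closed_form = (reading @ writing) / (reading @ reading)

print(f"y = {slope}x")
```

Because the line is forced through the origin, the slope alone does not measure how strong the relationship is; a correlation coefficient or the model's score on the data would be the usual complement.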
Payton Falcone
Ethan Wu















