Hypothyroidism is one of the most common endocrine disorders, yet diagnosis often requires multiple lab tests that can be costly and difficult to access. Globally, more than a billion people are iodine deficient, and in many places access to full thyroid panels is limited. This motivated me to explore whether a predictive model could identify hypothyroidism with high sensitivity while relying on fewer, less expensive features.
Working on this project taught me that effective data science depends as much on data cleaning as on modeling. Much of the dataset contained missing values, “?” placeholders, and inconsistent encodings, all of which had to be carefully addressed before building any models. I also learned that clinical priorities differ from purely machine learning metrics: accuracy alone can be misleading, and in this context it was essential to emphasize sensitivity (recall) to minimize false negatives.
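To make both lessons concrete, here is a minimal sketch of the two ideas: converting "?" placeholders to proper missing values before modeling, and why accuracy can look good while recall (the clinically important metric) is poor. The tiny DataFrame and the prediction vectors are hypothetical stand-ins, not the project's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample mirroring the dataset's quirks:
# "?" stands in for missing values and columns arrive as strings.
raw = pd.DataFrame({
    "age": ["34", "?", "58"],
    "TSH": ["1.2", "4.7", "?"],
})

# Replace "?" with NaN, then coerce each column to a numeric dtype.
clean = raw.replace("?", np.nan).apply(pd.to_numeric)
print(clean.isna().sum().sum())  # number of missing cells after cleaning

# Why accuracy misleads on imbalanced clinical data (toy labels):
# 3 true hypothyroid cases out of 10; the model catches only 1.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Here accuracy is 0.80 while recall is only 0.33: two of three sick patients are missed, which is exactly the failure mode that prioritizing sensitivity guards against.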
Another important takeaway was that a small set of well-chosen features can be just as powerful as a larger, more complex feature set. Age, Sex, TSH, and FTI alone captured most of the predictive signal, showing the value of thoughtful feature selection over brute force. Finally, I realized that interpretability is critical in medicine. While Random Forests gave strong results, Logistic Regression and Decision Trees offered transparency that helps build trust with clinicians who may rely on the model.
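As an illustration of both points, the sketch below fits a Logistic Regression on just the four features named above. The data here is synthetic (a made-up labeling rule where high TSH plus low FTI indicates hypothyroidism), so the coefficients are illustrative only; the point is that each feature's learned weight can be read off directly, which is the transparency that helps clinicians trust the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200

# Hypothetical synthetic stand-ins for the four selected features.
age = rng.uniform(20, 80, n)
sex = rng.integers(0, 2, n).astype(float)
tsh = rng.lognormal(0.5, 1.0, n)   # thyroid-stimulating hormone
fti = rng.normal(100, 20, n)       # free thyroxine index

# Toy label rule: elevated TSH with depressed FTI -> hypothyroid.
y = ((tsh > 4.0) & (fti < 100)).astype(int)

# Standardize so coefficient magnitudes are roughly comparable.
X = np.column_stack([age, sex, tsh, fti])
X = (X - X.mean(axis=0)) / X.std(axis=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unlike a Random Forest, the model is a single linear score:
# each coefficient shows the direction and strength of a feature.
for name, coef in zip(["Age", "Sex", "TSH", "FTI"], clf.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```

With this synthetic rule the TSH weight comes out positive and the FTI weight negative, matching the clinical intuition encoded in the labels.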

