Week 5: CST383 Introduction to Data Science
This week felt like a big jump from the basics into real data cleaning and actual decision-making. We focused on handling missing data, scaling features, and learning how to evaluate models properly. A big takeaway for me was understanding why we should never train and test on the same data. It seems obvious once you see it, but it would be very easy to do by accident and end up thinking your model is amazing. That made me realize how careful you have to be when separating training and testing data. Handling zeros was honestly confusing. The idea makes sense, because some zeros actually stand in for missing values, but deciding what to do about them feels less clear. I can follow the code, but I don’t feel fully confident yet making those decisions on my own or knowing what is statistically responsible. This week we also practiced cross-validation and tuning KNN parameters, which helped me see how model performance can change...
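To make the train/test point concrete, here is a minimal sketch that uses only the Python standard library, not any particular data science package. Everything in it is an assumption made for illustration: the data is synthetic, the helper names (`make_data`, `fit_preprocessor`, `knn_predict`, `accuracy`, `cross_val_accuracy`) are invented, and replacing placeholder zeros with the training median is just one possible choice, not a recommendation. It shows three things: KNN with k=1 scores perfectly on the data it was trained on, the fill value and scaling statistics are learned from training rows only, and cross-validation on the training set is used to choose k before the test set is touched.

```python
import random
from collections import Counter
from statistics import mean, median, pstdev


def make_data(n=200, seed=0):
    """Synthetic two-class rows; about 10% of the first feature is zeroed
    out to mimic a zero that really means 'missing'."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x1 = rng.gauss(100 + 15 * label, 10)    # large-scale feature
        x2 = rng.gauss(1.0 + 0.5 * label, 0.5)  # small-scale feature
        if rng.random() < 0.1:
            x1 = 0.0  # placeholder zero, not a real measurement
        rows.append(([x1, x2], label))
    return rows


def fit_preprocessor(train_rows):
    """Learn the fill value and scaling statistics from training rows only."""
    fill = median(x[0] for x, _ in train_rows if x[0] != 0.0)
    imputed = [[fill if x[0] == 0.0 else x[0], x[1]] for x, _ in train_rows]
    columns = list(zip(*imputed))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard against a zero std

    def transform(x):
        filled = [fill if x[0] == 0.0 else x[0], x[1]]
        return [(v - m) / s for v, m, s in zip(filled, means, stds)]

    return transform


def knn_predict(train, query, k):
    """Majority label among the k nearest training points (squared Euclidean)."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train_rows, eval_rows, k):
    """Fit preprocessing on train_rows, then score KNN on eval_rows."""
    transform = fit_preprocessor(train_rows)
    train = [(transform(x), y) for x, y in train_rows]
    hits = sum(knn_predict(train, transform(x), k) == y for x, y in eval_rows)
    return hits / len(eval_rows)


def cross_val_accuracy(rows, k, folds=5):
    """Average held-out accuracy over `folds` interleaved folds."""
    scores = []
    for i in range(folds):
        held_out = rows[i::folds]
        kept = [r for j, r in enumerate(rows) if j % folds != i]
        scores.append(accuracy(kept, held_out, k))
    return mean(scores)


data = make_data()
random.Random(1).shuffle(data)
split = int(0.8 * len(data))
train_rows, test_rows = data[:split], data[split:]

# Scoring on the training data itself: with k=1 every point is its own
# nearest neighbor, so the score looks perfect and says nothing useful.
train_acc_k1 = accuracy(train_rows, train_rows, k=1)

# Choose k with cross-validation on the training set only.
cv_scores = {k: cross_val_accuracy(train_rows, k) for k in (1, 3, 5, 7, 9)}
best_k = max(cv_scores, key=cv_scores.get)

# The test set is used once, at the very end.
test_acc = accuracy(train_rows, test_rows, best_k)

print("train accuracy, k=1:", train_acc_k1)
print("cross-validated accuracy by k:", cv_scores)
print("best k:", best_k, "| test accuracy:", test_acc)
```

The order of operations is the part worth noticing: the median and the scaling statistics are recomputed inside every fold, so the held-out rows never influence how the training rows are cleaned. Libraries such as scikit-learn provide ready-made versions of these steps (for example `train_test_split`, `KNeighborsClassifier`, and `cross_val_score`); the hand-rolled version here is only meant to show what is happening underneath.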