Sabbatical November Report: My Year as a Data Scientist

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist.

First, note that gasstationwithoutpumps pointed out that my explanation of a novel use of joins is, well, pointless (he was kinder about it). I haven’t yet worked out whether the problem was with my coding or my explanation of it, but I am betting it was the former while hoping it was the latter.

I am now halfway through my expected tenure as a data scientist. I have learned a lot. I have a lot more to learn.

I have moved on to building models, which is what most people think of when they think about data science (if they think about it at all). Basically, this is the part where I make predictions based on the data. Since I am trying to classify data, I am playing around with the following tools.

  • AdaBoost
  • k-Nearest Neighbors
  • Logistic Regression
  • Naive Bayes
  • Neural Networks
  • Random Forests
  • Support Vector Machines
  • XGBoost
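
A minimal sketch of what trying out several of these classifiers side by side might look like. It assumes scikit-learn and a synthetic dataset, neither of which is stated in the post, and it leaves out XGBoost because that lives in a separate package. The model settings are illustrative defaults, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for real data (an assumption for illustration only).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Distance- and gradient-based models are wrapped in a scaling pipeline,
# so the scaler is refit inside each cross-validation fold.
models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "k-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Naive Bayes": GaussianNB(),
    "Neural Network": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

# Print the models from best to worst.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {score:.3f}")
```

Accuracy is only one possible yardstick; for imbalanced classes a different `scoring` argument to `cross_val_score` would probably be a better fit.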

I have a couple of models that seem to do better than the others, and now I am trying to milk what I can out of the models to improve their performance. I will probably be working on the same thing for the rest of the month.
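
One common way to squeeze more performance out of a promising model is a hyperparameter search. The sketch below assumes scikit-learn, a random forest as the model being tuned, and a small made-up parameter grid; none of these come from the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for real data (an assumption for illustration only).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical grid; sensible values depend on the data at hand.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

# Try every combination with 5-fold cross-validation on the training set.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))

# Score the refit best model on the held-out data.
test_accuracy = search.score(X_test, y_test)
print("held-out accuracy:", round(test_accuracy, 3))
```

Keeping the test set out of the search is the main design point: the cross-validated score is used to pick parameters, so only the held-out score gives an honest estimate afterwards.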

How academia and business are different.

See the section about my feelings.

How will this experience influence my teaching?

I spoke last month about my thoughts about ungrading. I think that this sabbatical experience is reinforcing the thought that grading really isn’t good. I am having to train myself how to function at this job, not having been trained directly in what to do—just like my students will do at their first job. Lifelong learning and all.

This is not just about me being naive (which I admit to). I understand that some students won’t respond well without the grade incentives. So I am not being idealistic. Rather, I think that it is a valuable skill to be able to self-assess (which I am assuming is part of ungrading), and then learn to address your weaknesses.

My feelings about being in industry.

I am struggling a bit. I am recognizing exactly how spoiled we are as professors. I don’t get a break at Christmas. Christmas is on a Saturday, and I will be back to work on the Monday (unless I use a vacation day, which I might; I do get paid for a full day on Friday if I work a half-day, which is nice). Frankly, working in industry requires a certain endurance that I don’t exactly have right now. I will do fine, but I can tell that my body/mind/spirit is expecting a break that it will not get. I might get paid less as a professor than I would in industry, but I certainly appreciate the time off.


8 Responses to “Sabbatical November Report: My Year as a Data Scientist”

  1. Andy Rundquist Says:

    (caught up!)

    Wow I love this series! I’m so glad you keep adding to it. What a gold mine for people like me who both are interested in cool sabbatical approaches and are interested in data science as a thing for students to learn.

    I’m really struggling with some of the machine learning approaches, because, while they find cool patterns, they rarely let me understand what’s really going on. I’m using them a ton to try to understand student retention and persistence based on courses they take and I think I’m seeing patterns that we can address, but I’m not as confident as when I’m the one slowly building the model with particular terms representing particular things.

    How are you doing with that? Am I misrepresenting things with that framing?

    • bretbenesh Says:

      I am just beginning with the machine learning, but my experience largely matches yours. The degree to which my experience matches depends on the model (with Naive Bayes it is pretty easy to understand, in principle, what happens, whereas with neural networks it is nearly impossible to tell what is going on).

      I am betting that you already know this (I am pretty certain that you know a ton more about data science than I do; I am mostly writing this to remind myself in the future), but the feature_importances_ attribute (or the permutation_importance function, in some cases) combined with a knowledge of how the model works can provide some insights.
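
      A minimal sketch of both approaches, assuming scikit-learn, a random forest, and synthetic data (all chosen here for illustration). Impurity-based importances come for free with a fitted tree ensemble, while permutation importance measures how much the held-out score drops when one column is shuffled.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances: an attribute of the fitted forest,
# normalized so the values sum to 1.
impurity = model.feature_importances_

# Permutation importances: mean drop in held-out accuracy when a
# single feature column is shuffled, averaged over n_repeats shuffles.
perm = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

for i in range(X.shape[1]):
    print(
        f"feature {i}: impurity={impurity[i]:.3f}  "
        f"permutation={perm.importances_mean[i]:.3f}"
    )
```

      The two rankings can disagree, especially with correlated features, so it is probably worth looking at both before drawing conclusions.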

      But—yes—these can be a bit of a black box. Depending on details (like how much time you have), I might try approaching it from both ends. Slowly build the model yourself, build the machine learning black box along with it, and then use what you learn from the black box to inform your model (and vice versa).

