Sabbatical June Report: My Year as a Data Scientist

I am on sabbatical for the 2021–2022 school year. I am working at a bank as a data scientist. My intent is to give regular reports on how this is going. Here are several things that I intend to focus on.

  • The skills needed to be a data scientist. In particular, I will need to write up a report about my experience that can be used if we ever have a data science major.
  • How academia and business are different. What can we learn from business? Robert Talbert did something similar when he was working at Steelcase.
  • How my experience is going to influence my teaching. In particular, I am not starting this sabbatical with mad data science skills, so I get to be in a student’s role this year.
  • My feelings about being in industry. I worked for two years before I went to graduate school (20 years ago!), so I am used to (and like) industry. However, I love academia. I will let you know if I have any interesting feelings.

I just finished my first month on the job. So far, I am really enjoying it. My team is good and helpful, and I have an interesting project. I am going to generally keep the details of my projects vague, since I am working with a lot of sensitive data. My first two projects will be using data to predict how existing business loans will fare. That is, I will be trying to figure out if we can predict when businesses might default on loans.

Here are my thoughts from the first month.

  • I am using Python, mostly pandas, in Jupyter notebooks. I am doing a lot less of what I think of as traditional “coding”: I don’t think I have used any loops yet (aside from the ones that pandas uses in the background), and I have only defined two functions. The Python part of the job really is about learning how to do things with pandas. I think that my coding skills are much higher than are needed for the job so far (I am a proficient coder, but I am nowhere near great).
  • I am making SQL calls to databases. These have been pretty basic so far, and my very rudimentary SQL skills from my previous stint in industry are more than sufficient to get me going. I have definitely learned a lot of SQL in the last month, but it has been painless.
  • One of the most challenging parts of my job is learning my way through the databases. There are several, and they contain similar (but not the same) information. Some parts of each database are fake. For example, if I want to find someone’s name, I might find that a field called “Name” is largely empty and I need to look at a field called xpadsf_name__d instead.
  • The most challenging part of the job—by far—is learning about banking (loans specifically). This is something that I can’t really reason out (or Google, oftentimes). There is technical vocabulary, and I have to learn how to think like a banker. I don’t see a way to be a good data scientist without the knowledge of the industry—you need to know what data might be important to look at.
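For readers who want a concrete picture of the first three bullets, here is a minimal sketch, not the bank's actual code or data. It assumes an in-memory SQLite database as a stand-in for the real ones; the table `loans` and the columns `loan_id`, `principal`, `interest`, `total_owed`, and `best_name` are invented for illustration. Only the field names “Name” and xpadsf_name__d come from the example above.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real one.
# Every table and column name is invented, except "Name" and
# "xpadsf_name__d", which are taken from the example in the post.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE loans (loan_id INTEGER, Name TEXT, xpadsf_name__d TEXT,
                        principal REAL, interest REAL);
    INSERT INTO loans VALUES (1, NULL, 'Acme Bakery', 1000.0, 50.0);
    INSERT INTO loans VALUES (2, 'Bolt Hardware', 'Bolt Hardware', 2500.0, 125.0);
    INSERT INTO loans VALUES (3, NULL, 'Cedar Cafe', 400.0, 20.0);
    """
)

# A basic SQL call, loaded straight into a pandas DataFrame.
df = pd.read_sql_query(
    "SELECT loan_id, Name, xpadsf_name__d, principal, interest "
    "FROM loans ORDER BY loan_id",
    conn,
)
conn.close()

# Column arithmetic without typing "for" or "while":
# pandas applies the addition to every row.
df["total_owed"] = df["principal"] + df["interest"]

# Where "Name" is empty, fall back to the obscure field.
df["best_name"] = df["Name"].fillna(df["xpadsf_name__d"])

print(df[["loan_id", "best_name", "total_owed"]])
```

The point of the sketch is that both the new column and the name fallback are whole-column operations, which is why explicit loops rarely come up in this kind of work.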

I really like the job so far. My experience has been very similar to my research sabbatical—I just need to sit down each day and make progress on my project. I think that my boss has largely shielded me from other obligations (e.g. other departments asking for help), and I am grateful for that. It really does feel similar to doing research—I often feel lost, I usually make small progress, and I sometimes figure out something that makes a whole bunch of stuff come together.

I have been aware that, as a new learner, I am often completely confused, both when talking to other people and when working on my own. I deal with it by largely letting it wash over me, knowing that I am learning even though I am confused. I have been taking detailed notes as much as I can, and I have been saving text files with instructions on how to do technical things that I know I am going to need to remember how to do. The whole process has been really enjoyable.

However, I think that it is enjoyable because I have already had a lot of practice being confused. Research in mathematics is hard, and I spend most of my research time being confused about something. This is something that my students have not necessarily experienced. It might be nice to scaffold this in some way.

Similarly, and specific to data science: it might be nice to have students work on a series of databases. They might start with exactly the data that one needs, and then gradually build up (over a couple of years, perhaps) to a complicated mess full of extraneous and missing information.

Again, specific to data science: our curriculum should match something that our students can easily develop expertise in. This is not easy, and my best idea so far is to have them study school-related issues (they all have a lot of experience with schools, after all).

Again, specific to data science: it is often said that 80% of a data scientist’s time is spent cleaning data (or something close to 80%), where cleaning data means finding it, making sure that it is the right type, combining data from different sources, creating new data out of old data (e.g. averaging some numbers), etc. This is the tedious prep work (although it is enjoyable in its own way) you must do before doing the fun stuff. This has largely been true in my experience. While I don’t necessarily think that students should spend 80% (or whatever) of their time in a data science major cleaning data, students should definitely be doing this on a regular basis.
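As a rough illustration of the cleaning steps listed above (right type, combining sources, new data from old), here is a small pandas sketch. Everything in it is an assumption made up for the example: the two tables `loans` and `payments`, their columns, and the values. It is one plausible shape for this kind of prep work, not a description of any real pipeline.

```python
import pandas as pd

# Hypothetical data from two sources; all names and values are invented.
loans = pd.DataFrame(
    {
        "business_id": ["1", "2", "3"],        # IDs arrived as strings
        "balance": ["1000.50", "250", "n/a"],  # numbers arrived as text
    }
)
payments = pd.DataFrame(
    {
        "business_id": [1, 1, 2],
        "payment": [100.0, 300.0, 50.0],
    }
)

# 1. Make sure the data is the right type.
loans["business_id"] = loans["business_id"].astype(int)
# Unparseable entries such as "n/a" become NaN instead of raising an error.
loans["balance"] = pd.to_numeric(loans["balance"], errors="coerce")

# 2. Create new data out of old data: the average payment per business.
avg_payment = (
    payments.groupby("business_id", as_index=False)["payment"]
    .mean()
    .rename(columns={"payment": "avg_payment"})
)

# 3. Combine data from different sources. A left join keeps every loan,
#    even a business with no recorded payments.
clean = loans.merge(avg_payment, on="business_id", how="left")
print(clean)
```

The left join is a deliberate choice here: a business with no payments stays in the table with a missing average, so the gap is visible rather than silently dropped.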

I will end by noting that I am mindful of the fact that ethics can be an issue with data science. I will write about ethical issues when appropriate and useful.

So: this has been a great experience so far! I already miss being a professor, and it is weird to think that I am going to be away from my campus for about 2.5 years (1.5 years due to the pandemic including the summer, and one year due to my sabbatical). I am going to be excited to be back on campus, but I am enjoying my sabbatical for now!


15 Responses to “Sabbatical June Report: My Year as a Data Scientist”

  1. identityelement Says:

    This sounds like a fantastic experience! I look forward to the next installment. Have a great holiday weekend!

  2. Andy Rundquist Says:

    This sounds like such a cool experience! I’m so glad this opportunity is working out for you. I was really struck by your “I haven’t used any loops” comment. I think about that kind of thing a lot when teaching in our Computational Data Science program. I want my students to have tools available to get things done, especially that 80% stuff you’re talking about, and I think I’m going to pay close attention to what you learn this year.

    My question for you: Were you able to make any choices among the tools you’re using (SQL, python, pandas, jupyter, etc)?

    • bretbenesh Says:

      I should note that there are some implicit loops built into pandas, but I bet you already know that. I have used these. For instance, there is a loop built into a definition for a new column, as below.

      df["New Column"] = df.a + df.b

      I have definitely used those. I haven’t had to literally type “for” or “while.”

      I have had no choice of tool yet. They had everything set up initially. Fortunately, I really wanted to work with python, and I was thrilled that that is their default language.

