## May/Final Sabbatical Report: My Year as a Data Scientist

May 27, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

Summary: I am officially done working for the bank—I turned in my computer to them today. The timing was almost perfect: I was able to finish both of my projects, and I was able to document them well. I ended up creating ten videos explaining the most challenging parts of the project, so I hope that maintaining the code will be much easier.

The one open thread is automating the reports. They use Windows Task Scheduler, which gave me issues until the end. I am simply trying to schedule the running of a .bat batch file, where the .bat file calls the appropriate .py Python file. I was able to confirm that both the Python and the batch files worked when I ran them manually. However, I was not able to get them to work through Task Scheduler. Strangely, Task Scheduler caused Python to start running (which is good), followed by the Python program using a lot of RAM (which is good, since it is a data-heavy program), followed by it using a lot less RAM (which is bad), followed by it using a lot of RAM again (which is weird), followed by Python closing (which is bad). Ultimately, the Python program never produced a report. If anyone knows what is going on with this, please let me know.
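
A common culprit with scheduled Python jobs is the working directory: Task Scheduler starts the process in a folder the script does not expect (often unless the task’s “Start in” field is set), so relative paths fail silently. One way to see what is happening is a small logging wrapper; everything below (the function name, the log path) is a made-up sketch, not the actual report code.

```python
import os
import sys
import traceback

def run_logged(job, log_path):
    """Run `job` (a zero-argument callable), appending the working directory,
    interpreter path, and any traceback to `log_path`, so a scheduled run
    that dies silently still leaves evidence behind."""
    with open(log_path, "a") as log:
        log.write(f"cwd={os.getcwd()} python={sys.executable}\n")
        try:
            job()
            log.write("finished ok\n")
        except Exception:
            log.write(traceback.format_exc())
            raise

# In the scheduled script, the call might look like (names are placeholders):
#     run_logged(build_report, r"C:\jobs\report_job.log")
# Note the absolute log path: Task Scheduler will not start the process in the
# script's own folder unless the task's "Start in" field says so.
```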

I learned a ton this year. This isn’t hard given that I started the year knowing almost nothing about data science. I knew Python reasonably well, but I didn’t actually use those skills much this year. It was more about knowing a bunch of one-line Pandas commands. I now know much more about the data science process (e.g. testing and training data), and I understand a bunch of the models.

Most importantly, I have a lot of muscle memory on how to work with data in a Jupyter notebook. I think that this could be really helpful in teaching statistics. I previously used Jupyter notebooks in my statistics courses, but my use of the notebooks was probably best described as “clumsy.” I would have to Google every third step, and I wasn’t comfortable doing it on the fly in the classroom. I am very comfortable with it now.

One skill that I didn’t become fluent in is creating cool graphs and visualizations. I did a bit of this at the end of 2021, but I didn’t do enough to get fluent.

I am glad that I did this. I learned a lot, I have a lot to give to the students (particularly with respect to advising students who are interested in data science), and I made some great friends (I miss them already).

That said, I am grateful to be a professor again. I am itching to start planning my classes, and I think that I have come out of this experience a changed teacher. I will blog about that later this summer.

## April Sabbatical Report: My Year as a Data Scientist

May 4, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist.

I am finishing up my second project. It is mainly on regression (two models); however, I also used a classification model with each of the two regression models to identify outliers (which can be very naturally defined in my case). This allowed for drastically better regression models—my models basically went from garbage to good.
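
As a sketch of this two-stage idea (the post does not name the models, so the ones below are stand-ins on made-up data): a classifier first flags likely outliers, and the regression is then fit only on the rows the classifier keeps.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Invent some outliers that depend on the features, and corrupt their targets.
is_outlier = X[:, 0] > 1.5
y = y + np.where(is_outlier, 50.0, 0.0)

clf = RandomForestClassifier(random_state=0).fit(X, is_outlier)  # stage 1
keep = ~clf.predict(X)                                           # rows to keep
reg = LinearRegression().fit(X[keep], y[keep])                   # stage 2
```

With the flagged rows removed, the regression recovers the true coefficients instead of being dragged around by the corrupted targets.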

How will this experience influence my teaching? No report.

My feelings about being in industry.  I am feeling the same way I did last month: I am really looking forward to being a professor again (especially having a summer vacation), but I am going to miss this job. I have good coworkers, and I am finding the work interesting.

## March Sabbatical Report: My Year as a Data Scientist

April 4, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist.

I thought that I was done cleaning and validating my code, but it turns out that I had a bad data source. I spent a lot of last month re-doing the code so that I could get the same data from a different, reliable source.

I did build one model prior to that, which I now need to re-do. However, I am going to end up doing three or four models for this data, and the one I did first is the least important and most redundant. I did it first because it was the easiest for me to implement.

The reason that model was the easiest to implement was that it was a classification problem, which is the type of problem my first project was. The other models I am doing for Project 2 are regression.

How will this experience influence my teaching?  My manager leaving made me realize something: I was doing the project thinking that he was the main audience. He technically wasn’t—a different part of the business was going to use my code—but he was the one who was going to have to maintain the code and get all of the questions from the real end users.

I like him, and part of my motivation was to do a good job for him to make him look good. I realized that I had a dip in motivation once I found out that he was leaving. I was able to recognize this and reframe it in my mind to regain my motivation, but the dip was there (I also really like my new manager, but this is less of her project since she hasn’t been involved in it until now).

This reminds me that we are all motivated for different reasons, and some of my students might be motivated by their relationship with me. I am thinking about this generally, and I am also thinking about this in the context of Ungrading, which I am planning on using next year. Like most people, I am concerned about motivating students without the stick of grades. However, I think having a positive relationship with my students can be a carrot that replaces much of the grade stick.

My feelings about being in industry.  I am looking forward to being a professor again. I miss it. But I am now in the phase where I have 1.5 months left here, and I am starting to miss this job already. I am definitely not struggling the same way I was in November.

## February Sabbatical Report: My Year as a Data Scientist

March 4, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist. I have two main things happening. The first is that I am still trying to automate the results of my first project. There are still silly issues. I can’t do the equivalent of “pip install packagename,” which should make installing a new package take less than a minute. Instead, I have been emailing spreadsheets. IT is working on it, though.

I finished cleaning and validating the data for my second project, so I was able to do in two months what took me five months for the first project. I am guessing that roughly 30% of this was reusing code, and 70% was being more familiar with data science generally and the bank’s data in particular. I am about to start on some visualization, and I get to build a couple of models after that—we are likely to try to model three different variables this time, but that shouldn’t take as much time.

How academia and business are different. One of my colleagues at the bank is about to leave for another job. In academia, professors either are expected to leave (if they are not tenure-track) or generally don’t leave (if they are tenured), so there are very few surprises when someone leaves. It is slightly jarring, then, to hear that a professor-analogue at the bank suddenly decided to leave (I am thrilled for him, but it isn’t the sort of thing that happens often in academia).

That said, there have been a couple of tenured professors at my school who just announced they are leaving, so maybe this is not so different after all.

How will this experience influence my teaching?  No report.

My feelings about being in industry.  After struggling with missing academia in December, I am in the groove now. In fact, I am starting to preemptively feel a bit sad about leaving, since it is only three months away. But I am still looking forward to teaching again.

## January Sabbatical Report: My Year as a Data Scientist

January 28, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist. I have spent the last month starting my second project and putting the finishing touches on the first. If all goes well (which it normally doesn’t), I am about one day away from finishing the first project.

The second project is recycling a lot of the work I did for the first project, so there are no new skills yet. I am currently working on reporting out the results of the first project to the client.

I hadn’t really thought about it before, but part of the gig is getting the results from the computer to the people who are going to use them. The client is going through some workflow changes, so I am simply producing an Excel file right now. This involves putting the Python script on a virtual machine, then using Task Scheduler to run the code and email the file to him (again—Excel and email are a temporary measure).
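
The temporary Excel-and-email step might look roughly like this; the file name, column names, and addresses are placeholders, and the final smtplib send is left as a comment since it depends on the mail server.

```python
import pandas as pd
from email.message import EmailMessage

# Made-up results standing in for the real report.
results = pd.DataFrame({"customer": ["A", "B"], "score": [0.83, 0.12]})
results.to_excel("report.xlsx", index=False)  # needs openpyxl installed

# Build the email with the spreadsheet attached.
msg = EmailMessage()
msg["Subject"] = "Weekly report"
msg["From"] = "scheduler@example.com"
msg["To"] = "client@example.com"
msg.set_content("Report attached.")
with open("report.xlsx", "rb") as f:
    msg.add_attachment(
        f.read(),
        maintype="application",
        subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        filename="report.xlsx",
    )
# On the scheduled machine, something like
#     smtplib.SMTP("mailhost").send_message(msg)
# would run here.
```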

How academia and business are different. They had a professional magician at their (virtual) holiday party, which I can’t imagine my college doing (Kostya Kimlat, who was great).

How will this experience influence my teaching?  I don’t have any new stories from working at the bank, but I have a lot of ideas from several excellent books I have read. I will write a separate blog post about them at a later time.

My feelings about being in industry.  No Report

## Sabbatical December Report: My Year as a Data Scientist

January 3, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist. I am tentatively done with my first project, and I have started my second (and last planned) project. I tentatively finished the first project in mid-December, but I am waiting until the New Year to present it to the (internal) client because the end of the year is apparently busy for banks.

What I learned is that one can make huge gains by tuning hyperparameters. I started the modeling process by choosing an assortment of models. I started by using the default hyperparameters for each, and the initial results were terrible. However, after using some tools to find the best hyperparameters and then doing some manual experimentation, I actually ended up with better results than I thought I would be able to get. It was like going from a score of 15% to 85%, just by getting the right hyperparameters.
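
A minimal sketch of this kind of tuning, using scikit-learn’s grid search on a toy dataset (the actual models and grids from the project are not given here, so these are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, random_state=0)

# Start from a default model, then let cross-validated grid search pick
# better hyperparameter values from a small candidate grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Manual experimentation can then refine the grid around whatever the search finds.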

How academia and business are different. No report.

How will this experience influence my teaching? I am starting to think about how my new skills relate to the existing curriculum. It seems like there aren’t great fits. There aren’t any Mathematics courses that are relevant (I have some useful skills to help me with teaching statistics, but I am not actually using much actual statistics). Computer Science has a machine learning course, which I could fake my way through. However, I am sure that the Computer Science people are still much more qualified to teach this. We have three Data Analytics courses, and I think this is the closest match. The first course is on visualization, which is a skill I haven’t really developed yet. The second course seems to be a slightly more programming-heavy course, which might be a better fit. The third course is a capstone course, which seems to be the closest match to what I have learned—I haven’t learned a lot about any one subject (like machine learning), but I feel like I have gotten a lot of practice in solving very practical problems as they come up.

My feelings about being in industry. I was struggling with not having a winter break, but I got over that by mid-December. I am, however, grateful that the university schedule is what it is.

## Halmos’s Automathography

December 22, 2021

I just finished I Want to Be a Mathematician: An Automathography by Paul Halmos. I found the book to be really interesting, although I don’t think that everyone will. In particular, he describes his career from the 1930s to the mid-1980s, and academia was a different world back then: money was seemingly easy to get for travel, and there was a lot less bureaucracy (Halmos seems to have gotten tenure after his third year without having to apply for it). There was plenty of talk about mathematicians he knew—some I had heard of, some I hadn’t—and I learned how to pronounce his last name (as an American would—‘hal-moss’—rather than as a Hungarian would—‘hal-mush’). I also had heard several things credited to Halmos, and they came from this book.

• He is credited with inventing the notation ‘iff’ for ‘if and only if.’
• He is credited with the little box that denotes the end of a proof.
• A quote that has been circulating around the IBL types is found on Page 69: “It’s been said before and often, but it cannot be overemphasized: study actively. Don’t just read it; fight it! Ask your own questions, look for your own examples, discover your own proofs. Is the hypothesis necessary? Is the converse true? What happens in the classical special case? What about the degenerate cases? Where does the proof use the hypothesis?”

There are two things I want to share, though. The first is some of his comments about grading, and the second is a nice way to describe how to be productive in any field.

On productivity:

Archimedes taught us that a small quantity added to itself often enough becomes a large quantity (or, in proverbial terms, that every little bit helps). When it comes to accomplishing the bulk of the world’s work, and, in particular, the work of a mathematician, whether it is proving a theorem, writing a book, teaching a course, chairing a department, or editing a journal, I claim credit for the formulation of the converse: Archimedes’s way is the only way to get something done. Do a small bit, steadily every day, with no exception, with no holiday.

Page 401

I just thought that was a nice summary of how work progresses.

Analysis was taught by Steimley, who taught like a marine drill sergeant. He prepared detailed notes for his advanced calculus course and used them over and over again. He graded homework and exams promptly and fussily. Your grade was not likely to be just B or 80 or 85, but something like 83. The digit to the right of the decimal point in your average could play an important role in determining your course grade.

Page 32

Halmos makes it seem like a grade of “83” was strange for the 1930s, and it was more common for instructors to give grades without the pretense of objective precision.

So how does Halmos determine grades? He also explains this.

Assigning grades to the students in my class is part of my job; it is a necessary evil. Grading is bad because students often pay too much attention to it, because it is often regarded as more accurate than it can possibly be, and incidentally, because it often makes students feel bad. It is, however, necessary because in our present educational and social organization the teacher in a later course must know what the students learned in an earlier one, and a prospective employer wants to know how good the student is likely to be on the job. I can’t think of a way of designing an organization of learning and working in which these items of information are not needed.

I do not, however, think that the assignment of informative grades is all that hard. At the end of a course I usually have a pretty clear idea that certain students know the material well (A), and certain others don’t (F). In between, there are those who knew some of it but have gaps in knowledge—possibly big ones (B), and there are those who can use some of it, but don’t really understand it (C). Then, of course, there are those who can prove that they have been exposed to it, but for sure don’t know enough to go on to a course on a higher level (D). … Of course my “pretty clear idea” is subjective, but it’s remarkable how nearly unanimous such subjective grade assignments turn out to be: students keep getting the same sort of grades time after time, in different courses from different teachers. I don’t agree with my colleagues who advocate a more “objective” numerical grading system: Problem 4 is worth 15 points, and you get 3 points if the answer is right, and 2 points off for each of the six most obvious missteps you can take on your way to it. To my mind, it is my duty to use my best judgment about how much my students know when I’m finished with them; anything else would be an evasion of my responsibility.

Pages 136–137

The book contains a mixture of this type of opinion and a more historical account of where he was and with whom he was in contact at various points in his life. If this seems interesting to you, go ahead and read it. A book with a similar feel is Indiscrete Thoughts by Gian-Carlo Rota. I enjoyed both of them quite a bit, but they are not for everyone.

## Sabbatical November Report: My Year as a Data Scientist

December 2, 2021

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist. First, note that gasstationwithoutpumps pointed out that my explanation of a novel use of joins is, well, pointless (he was kinder about it). I haven’t yet worked out whether the problem was with my coding or my explanation of it, but I am betting it was the former while hoping it was the latter.

I am now halfway through my expected tenure as a data scientist. I have learned a lot. I have a lot more to learn.

I have moved on to building models, which is what most people think of when they think about data science (if they think about anything at all). Basically, this is the part where I make predictions based on the data. I am playing around with the following tools (I am trying to classify data).

• k-Nearest Neighbors
• Logistic Regression
• Naive Bayes
• Neural Networks
• Random Forests
• Support Vector Machine
• XGBoost
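
One nice thing about this list is that scikit-learn gives all of these models (except XGBoost, which lives in its own package but follows the same convention) a shared fit/score interface, so trying them uniformly is a short loop. The dataset below is synthetic, just to make the sketch runnable:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "k-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Neural Network": MLPClassifier(max_iter=2000, random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Support Vector Machine": SVC(),
}
# Fit each model on the training split and score it on the held-out split.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```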

I have a couple of models that seem to do better than the others, and now I am trying to milk what I can out of the models to improve their performance. I will probably be working on the same thing for the rest of the month.

How academia and business are different. See the section about my feelings.

How will this experience influence my teaching?

I spoke last month about my thoughts about ungrading. I think that this sabbatical experience is reinforcing the thought that grading really isn’t good. I am having to train myself how to function at this job, not having been trained directly in what to do—just like my students will have to do at their first jobs. Lifelong learning and all.

This is not just about me being naive (which I admit to). I understand that some students won’t respond well without the grade incentives. So I am not being idealistic. Rather, I think that it is a valuable skill to be able to self-assess (which I am assuming is part of ungrading), and then learn to address your weaknesses.

My feelings about being in industry.

I am struggling a bit. I am recognizing exactly how spoiled we are as professors. I don’t get a break at Christmas. Christmas is on a Saturday, and I will be back to work on Monday (unless I use a vacation day, which I might; I do get paid for a full day on Friday if I work a half-day, which is nice). Frankly, working in industry requires a certain endurance that I don’t exactly have right now. I will do fine, but I can tell that my body/mind/spirit is expecting a break that it will not get. I might get paid less as a professor than I would in industry, but I certainly appreciate the time off.

## Sabbatical October Report: My Year as a Data Scientist

October 29, 2021

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

It took me most of the month, but I finally finished cleaning this version of the data (five months into a 12-month sabbatical). I just started doing some graphs and charts of the data in hopes of noticing trends that can help me out.

The skills needed to be a data scientist

I will focus on two skills: a special way of using joins, which I previously alluded to, and visualization with graphs and charts. I will start with the latter, since I have less than a week’s experience with it, so this part is short: it is more difficult than I imagined it to be. From the videos online, it looks like you simply type “plt.plot(data=MyDataSet)” and it works. What I (naively) didn’t realize is that you need to make sure that MyDataSet is formatted correctly. I had to do things like learn how to melt the data so that I could graph it correctly. I hadn’t expected that.
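
Here is a small example of the melt step on made-up data: plotting libraries often want long-form data (one row per observation) rather than one column per series.

```python
import pandas as pd

# Wide form: one column per region.
wide = pd.DataFrame({
    "month": ["Jan", "Feb"],
    "east": [10, 12],
    "west": [7, 11],
})

# Long form: one row per (month, region) observation.
long = wide.melt(id_vars="month", var_name="region", value_name="sales")
# `long` is now ready for, e.g.,
#     seaborn.lineplot(data=long, x="month", y="sales", hue="region")
```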

Deciding how to represent the data (scatterplot, histogram, etc.) is also nontrivial, but I expected this. There isn’t a right answer, and it seems like it might be a bit of an art to (1) pick the right subset of the data and (2) use the right type of graph. I look forward to learning more about this.

I want to give an example of a use of a join that I hadn’t known about before. We will be talking about left joins, which you think of starting with the left table, and then attaching additional information from a second (right) table to the left table (without removing anything from the left table).

Here is some fake data. It is fake since dates aren’t single-digit integers, but all we need is to be able to put the “dates” in order. Our goal will be to find the Value that occurred most recently prior to 6 (so the Value corresponding to Date 5 if possible, otherwise Date 4, Date 3, etc.).

To my knowledge, there is no slick way of doing this (please let me know if you know of a natural way to do this). It is not obvious how a join would be useful, since there is only one table. However, we can create a second table: a new data frame that drops the Value column and any date that is not less than 6 (both of these things are easy and natural to do in Python’s Pandas package).

We can now create a third table by doing a groupby on Name that picks the maximum Date for each Name.

Now, we can do a left join where this third table is the starting table, and we append any relevant information from the original table that corresponds to the Name and the Date that are in this third table. This yields the fourth, final table that has the most recent Value prior to 6 for each Name.

So we used a join to turn a table with seven entries, five of them redundant, into a table with only the two relevant rows.
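
The four tables can be reconstructed in Pandas; since the original tables are not shown here, the data below is illustrative fake data with seven rows, and the goal is the most recent Value before Date 6 for each Name.

```python
import pandas as pd

# Original table: seven rows.
df = pd.DataFrame({
    "Name":  ["A", "A", "A", "B", "B", "B", "B"],
    "Date":  [1, 4, 7, 2, 3, 5, 8],
    "Value": [10, 20, 30, 40, 50, 60, 70],
})

# Second table: drop Value and keep only dates before 6.
before = df[df["Date"] < 6][["Name", "Date"]]

# Third table: for each Name, the latest Date before 6.
latest = before.groupby("Name", as_index=False)["Date"].max()

# Fourth table: left join back to the original to pick up the Value.
result = latest.merge(df, on=["Name", "Date"], how="left")
```

With this data, `result` has exactly the two relevant rows: Name A with Date 4, and Name B with Date 5, each carrying its Value.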

How academia and business are different. No report.

How will this experience influence my teaching?

I don’t know how much of this is influenced by my sabbatical, but I am thinking a ton about ungrading. I think this is mainly because Blum and company recently legitimized the practice. However, the sabbatical reinforces it. I am learning a ton, and I am grateful that my manager isn’t grading me on what I do. I have made a lot of mistakes that I wouldn’t have wanted to be penalized for, but I have also taken a lot of chances in how I approached the data that I likely wouldn’t have if I were seeking my manager’s approval.

Additionally, this sabbatical is going to help me when I teach statistics later. I am now fluent with what I would use in teaching the course, and I can imagine pulling out a Jupyter notebook in class to answer someone’s question on the fly. Previously, I was only really confident in preparing Jupyter notebooks for them to use on their homework (and I will be able to do that a lot better now). This is pretty huge since I teach statistics a lot.

My feelings about being in industry.

No report.

## Sabbatical September Report: My Year as a Data Scientist

October 1, 2021

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

I am continuing working on the same project. Last month, I was testing and validating the data. This month, I am still testing and validating the data. This normally should take less than a week to do, but there were complications. See the next section for an explanation.

The skills needed to be a data scientist

My testing and validating found mistakes, and I spent most of the month fixing them (which points to how essential validating the data is). The bulk of my time fixing the mistakes was spent poking around one of the databases in order to learn about it. I am going to go into a bit of detail about this, since I have a related recommendation for a Data Science major below.

I was looking at invoices (bills asking a customer to pay a certain amount of money) and payments (records of what customers actually paid). My goal was to figure out if the customers paid the amount they were required to pay.

This was a remarkably difficult goal for me to reach. Here were the issues that I had.

• A month ago, I was only looking at payments. The reason for this is that the user interface the sales people use made the payment amounts easy to find. My problem was that I mistook “the data I knew I had” for “the data I need.” I also needed to know how much the customer was supposed to pay (the invoice). I worked on this for over a month without realizing that I only had half of the data that I needed.
• I went to search for the invoices. This is nontrivial, since the database didn’t have any sort of interface beyond the user interface the sales people used. However, I was able to hack together some things to figure out how to view the data in Microsoft Access (my manager made fun of me for using Access, since using Access to work with data is apparently the data science equivalent of using Netscape to browse the web. It works, though.). I do NOT think that this is something that we need to have students do.
• Thanks to Access, I was able to find the invoices. Thanks to my super-helpful co-worker, I was able to find the invoices in the user interface.
• However, the data in Access didn’t match the data in the user interface. For instance, they would have different due dates for the same invoice. Also, the amount due for the invoices are sometimes adjusted by the bank, and the data in Access didn’t reflect this.
• The issue was that I was not looking at the natural version of the invoice. There are sub-invoices for each invoice (I am not entirely sure what these are for, although I suspect that if you take out a single loan for two items, you get a subinvoice for each item). This turns out to be the natural way of thinking about the data for what I need—it is easy to get the amount due and the amount paid for the subinvoice, and the due dates are correct (the issue with the mismatched due dates on the proper invoices is that two subinvoices for a given invoice might have different due dates. Access would show one, and the user interface would show the other).
• One annoyance is that the user interface does not naturally allow you to work with the subinvoices (e.g. you can’t search for them). However, I can get at them, and this is allowing me to proceed with the validating.

In this process, I had two big mistakes. One was conceptual—I assumed that I could do everything I need with payments, but I actually needed both invoices and payments. The other had to do with a lack of familiarity of how the database was organized.

In fact, I bet that I spent at least one week of this past month just exploring the database. I would perform mini experiments—if I saw $x$ in the user interface, I made a guess where I could find $x$ in Access. If I was correct, I would move on. If not, I would make another guess. I also spent at least one week exploring and testing the two other main databases. Thus, I have spent about one month (out of my four months on the job) just exploring and getting to know where (and how) the data lives. I don’t know if I could have done this more quickly, and I don’t know if it would have been desirable to do so—I am confident that this will be useful knowledge for my other sabbatical projects.

So my recommendation for a Data Science major is this: students should be exposed to increasingly messy databases. In the first couple of courses, it probably makes sense for them to work exclusively with databases that are well designed and contain exactly the data they need (likely in just one or two different sources). By the end of the program, students should gain experience working with more complex and messy databases.

I didn’t know this, so perhaps it is worth saying. In many (most?) companies, databases seem to grow according to the principles of evolution, rather than the principles of intelligent design (the latter of which would require a lot of time and money). So data is often in multiple places, and there are often places that seem like they should have data—but don’t.

As such, here are some qualities of databases I think seniors should work with.

• There should be many irrelevant tables.
• Within most tables, there should be many columns of irrelevant data.
• Within most tables, there should be many columns without any data.
• For some data, the required data should be stored in multiple tables (ideally with different formats).
• The naming conventions for tables and columns should not be terribly clear.
• When using these databases, students should not be given explicit instructions on what data they need. They should be given an outcome, with the expectation that they will find the data needed to achieve it.

I want to make it clear that I am not complaining about the messy databases—I think that it would be nearly impossible for a company to maintain pristine databases. I just want students to have practical skills when they graduate.

Additionally, students should be explicitly taught to (and how to) test and validate their data. If I hadn’t done that, I would have gone forward without using invoices at all, which would have made the data look very, very different from reality.