Posts Tagged ‘Data Science’

May 27, 2022

May/Final Sabbatical Report: My Year as a Data Scientist

May 4, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

Summary: I am officially done working for the bank—I turned in my computer to them today. The timing was almost perfect: I was able to finish both of my projects, and I was able to document them well. I ended up creating ten videos explaining the most challenging parts of the project, so I hope that maintaining the code will be much easier.

The one open thread is automating the reports. They use Windows Task Scheduler, which gave me issues until the end. I am simply trying to schedule the running of a .bat batch file, where the .bat file calls the appropriate .py Python file. I was able to confirm that both the Python and the batch files worked when I ran them manually. However, I was not able to get it to work through Task Scheduler. Strangely, Task Scheduler caused Python to start running (which is good), followed by the Python program using a lot of RAM (which is good, since it is a data-heavy program), followed by the Python program using a lot less RAM (which is bad), followed by Python using a lot of RAM again (which is weird), followed by Python closing (which is bad). Ultimately, the Python program never produced a report. If anyone knows what is going on with this, please let me know.
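A description like this is not enough to diagnose the failure, but a scheduled run can at least be made to leave evidence behind. The sketch below is only an illustration under two assumptions that the post does not confirm: that the scheduled task may start in a different working folder than a manual run, and that an error message is being lost because no console window is visible. The function name `run_logged` and the example paths are hypothetical.

```python
import datetime
import os
import traceback
from pathlib import Path


def run_logged(job, log_path, work_dir=None):
    """Run job() and append start, finish, or traceback lines to log_path.

    Returns 0 on success and 1 on failure, so the result can be passed
    to sys.exit() and show up as the task's exit code.
    """
    # Resolve to an absolute path first, so the log location does not
    # depend on whichever folder the scheduler starts the process in.
    log_path = Path(log_path).resolve()
    if work_dir is not None:
        os.chdir(work_dir)  # make relative data paths behave as in a manual run
    with open(log_path, "a", encoding="utf-8") as log:
        stamp = datetime.datetime.now().isoformat()
        log.write(f"{stamp} start, cwd={os.getcwd()}\n")
        try:
            job()
        except Exception:
            # Without a console, this traceback would otherwise be lost.
            log.write(traceback.format_exc())
            log.write("FAILED\n")
            return 1
        log.write("finished\n")
        return 0


# Hypothetical use at the bottom of the report script:
#   sys.exit(run_logged(build_report, r"C:\reports\run.log",
#                       work_dir=Path(__file__).parent))
```

If the log shows a traceback, the problem is inside the Python program; if it shows only a "start" line, the process was probably stopped from outside.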

I learned a ton this year. This isn’t hard given that I started the year knowing almost nothing about data science. I knew Python reasonably well, but I didn’t actually use those skills much this year. It was more about knowing a bunch of one-line Pandas commands. I now know much more about the data science process (e.g. testing and training data), and I understand a bunch of the models.

Most importantly, I have a lot of muscle memory on how to work with data in a Jupyter notebook. I think that this could be really helpful in teaching statistics. I previously used Jupyter notebooks in my statistics courses, but my use of the notebooks is probably best described as “clumsy.” I would have to Google every third step, and I wasn’t comfortable doing it on the fly in the classroom. I am very comfortable with it now.

One skill that I didn’t become fluent in is creating cool graphs and visualizations. I did a bit of this at the end of 2021, but I didn’t do enough to get fluent.

I am glad that I did this. I learned a lot, I have a lot to give to the students (particularly with respect to advising students who are interested in data science), and I made some great friends (I miss them already).

That said, I am grateful to be a professor again. I am itching to start planning my classes, and I think that I have come out of this experience a changed teacher. I will blog about that later this summer.

April Sabbatical Report: My Year as a Data Scientist

May 4, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist.

I am finishing up my second project. It is mainly on regression (two models); however, I also used a classification model for each of the two regression models to identify outliers (which can be very naturally defined in my case). This allowed for drastically better regression models: my models basically went from garbage to good.
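As a rough illustration of the idea of pairing a classifier with a regression, here is a minimal sketch on synthetic data. Everything in it is an assumption made for the example: the data, the outlier rule, and the choice of scikit-learn's `DecisionTreeClassifier` and `LinearRegression`. The post does not say which models were used at the bank.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)            # ordinary predictor
flag = rng.uniform(0, 1, n)          # feature that happens to signal outliers
is_outlier = flag > 0.9              # synthetic stand-in for a natural outlier definition
y = 2 * x + rng.normal(0, 0.5, n)    # the true slope is 2
y[is_outlier] += 50                  # outliers follow a different pattern

X = np.column_stack([x, flag])

# Stage 1: a classifier learns to flag the outliers.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, is_outlier)
keep = ~clf.predict(X).astype(bool)

# Stage 2: fit the regression only on rows the classifier did not flag.
x_col = x.reshape(-1, 1)
naive = LinearRegression().fit(x_col, y)
staged = LinearRegression().fit(x_col[keep], y[keep])

naive_r2 = naive.score(x_col, y)
staged_r2 = staged.score(x_col[keep], y[keep])
staged_slope = staged.coef_[0]
print(f"naive R^2:  {naive_r2:.2f}")
print(f"staged R^2: {staged_r2:.2f}, slope: {staged_slope:.2f}")
```

In this toy setup the regression fitted on all rows explains little of the variation, while the one fitted after removing flagged rows recovers a slope close to 2.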

How academia and business are different. No report.

How will this experience influence my teaching? No report.

My feelings about being in industry.  I am feeling the same way I did last month: I am really looking forward to being a professor again (especially having a summer vacation), but I am going to miss this job. I have good coworkers, and I am finding the work interesting.

March Sabbatical Report: My Year as a Data Scientist

April 4, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist.

I thought that I was done cleaning and validating my code, but it turns out that I had a bad data source. I spent a lot of last month re-doing the code so that I could get the same data from a different, reliable source.

I did build one model prior to that, which I now need to re-do. However, I am going to end up doing three or four models for this data, and the one I did first is the least important and most redundant. I did it first because it was the easiest for me to implement.

That model was the easiest to implement because it was a classification problem, which is what my first problem was. The other models I am doing for Project 2 are regression.

How academia and business are different. No report.

How will this experience influence my teaching?  My manager leaving made me realize something: I was doing the project thinking that he was the main audience. He technically wasn’t—a different part of the business was going to use my code—but he was the one who was going to have to maintain the code and get all of the questions from the real end users.

I like him, and part of my motivation was to do a good job for him to make him look good. I realized that I had a dip in motivation once I found out that he was leaving. I was able to realize this and re-frame it in my mind to re-gain my motivation, but the dip was there (I also really like my new manager, but this is less of her project since she hasn’t been involved in the project until now).

This reminds me that we are all motivated for different reasons, and some of my students might be motivated by their relationship with me. I am thinking about this generally, and I am also thinking about this in the context of Ungrading, which I am planning on using next year. Like most people, I am concerned about motivating students without the stick of grades. However, I think having a positive relationship with my students can be a carrot that replaces much of the grade stick.

My feelings about being in industry.  I am looking forward to being a professor again. I miss it. But I am now in the phase where I have 1.5 months left here, and I am starting to miss this job already. I am definitely not struggling the same way I was in November.

February Sabbatical Report: My Year as a Data Scientist

March 4, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist. I have two main things happening. The first is that I am still trying to automate the results of my first project. There are still silly issues. I can’t do the equivalent of “pip install packagename,” which should take less than a minute to install a new package. Instead, I have been emailing spreadsheets. IT is working on it, though.

I finished cleaning and validating the data for my second project. So I was able to do in two months what took me five months for the first project. I am guessing that roughly 30% of this was reusing code, and 70% was being more familiar with data science generally and the bank’s data in particular. I am about to start working on doing some visualization, and I get to build a couple of models after that—we are likely to try to model three different variables this time, but that shouldn’t take as much time.

How academia and business are different. One of my colleagues at the bank is about to leave for another job. In academia, professors either are expected to leave (if they are not tenure-track) or generally don’t leave (if they are tenured). So there are very few surprises when someone leaves. So it is slightly jarring to hear that a professor-analogue at the bank suddenly decided to leave (I am thrilled for him, but it isn’t the sort of thing that happens often in academia).

That said, there have been a couple of tenured professors at my school who just announced they are leaving, so maybe this is not so different after all.

How will this experience influence my teaching?  No report.

My feelings about being in industry. After struggling with missing academia in December, I am in the groove now. In fact, I am starting to preemptively feel a bit sad about leaving, since it is only three months away. But I am still looking forward to teaching again.

January Sabbatical Report: My Year as a Data Scientist

January 28, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist. I have spent the last month starting my second project and putting the finishing touches on the first. If all goes well (which it normally doesn’t), I am about one day away from finishing the first project.

The second project is recycling a lot of the work I did for the first project, so there are no new skills yet. I am currently working on reporting out the results of the first project to the client.

I hadn’t really thought about it before, but part of the gig is getting the results from the computer to the people who are going to use them. The client is going through some workflow changes, so I am simply producing an Excel file right now. This involves putting the Python script on a virtual machine, then using Task Scheduler to run the code and email the file to him (again, Excel and email are a temporary measure).

How academia and business are different. They had a professional magician at their (virtual) holiday party, which I can’t imagine my college doing (Kostya Kimlat, who was great).

How will this experience influence my teaching?  I don’t have any new stories from working at the bank, but I have a lot of ideas from several excellent books I have read. I will write a separate blog post about them at a later time.

My feelings about being in industry. No report.

Sabbatical December Report: My Year as a Data Scientist

January 3, 2022

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist. I am tentatively done with my first project, and I have started my second (and last planned) project. I tentatively finished the first project in mid-December, but I am waiting until the New Year to present it to the (internal) client due to the fact that the end of the year is apparently busy for banks.

What I learned is that one can make huge gains by tuning hyperparameters. I started the modeling process by choosing an assortment of models. I started by using the default hyperparameters for each, and the initial results were terrible. However, after using some tools to find the best hyperparameters and then doing some manual experimentation, I actually ended up with better results than I thought I would be able to get. It was like going from a score of 15% to 85%, just by getting the right hyperparameters.
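The tuning tools are not named here. As one common option, the sketch below uses scikit-learn's `GridSearchCV` to try several hyperparameter combinations with cross-validation. The synthetic data, the choice of `SVC`, and the grid values are all illustrative assumptions, and the size of any improvement will depend heavily on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic classification data standing in for the real project data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: the model with its default hyperparameters.
default_model = SVC().fit(X_train, y_train)

# Search: every combination in the grid is scored with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)

print("default test accuracy:", default_model.score(X_test, y_test))
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
print("tuned test accuracy:", search.score(X_test, y_test))
```

A grid search like this usually gives a starting point, which fits the description of following the automated search with manual experimentation.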

How academia and business are different.

No report.

How will this experience influence my teaching? I am starting to think about how my new skills relate to the existing curriculum. It seems like there aren’t great fits. There aren’t any Mathematics courses that are relevant (I have some useful skills to help me with teaching statistics, but I am not actually using much actual statistics). Computer Science has a machine learning course, which I could fake my way through. However, I am sure that the Computer Science people are still much more qualified to teach this. We have three Data Analytics courses, and I think this is the closest match. The first course is on visualization, which is a skill I haven’t really developed yet. The second course seems to be a slightly more programming-heavy course, which might be a better fit. The third course is a capstone course, which seems to be the closest match to what I have learned—I haven’t learned a lot about any one subject (like machine learning), but I feel like I have gotten a lot of practice in solving very practical problems as they come up.

My feelings about being in industry. I was struggling with not having a winter break, but I got over that by mid-December. I am, however, grateful that the university schedule is what it is.

Sabbatical November Report: My Year as a Data Scientist

December 2, 2021

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

The skills needed to be a data scientist. First, note that gasstationwithoutpumps pointed out that my explanation of a novel use of joins is, well, pointless (he was kinder about it). I haven’t yet worked out whether the problem was with my coding or my explanation of it, but I am betting it was the former while hoping it was the latter.

I am now halfway through my expected tenure as a data scientist. I have learned a lot. I have a lot more to learn.

I have moved on to building models, which is what most people think of when they think about data science (if they think about anything at all). Basically, this is the part where I make predictions based on the data. I am playing around with the following tools (I am trying to classify data).

  • AdaBoost
  • k-Nearest Neighbors
  • Logistic Regression
  • Naive Bayes
  • Neural Networks
  • Random Forests
  • Support Vector Machine
  • XGBoost

I have a couple of models that seem to do better than the others, and now I am trying to milk what I can out of the models to improve their performance. I will probably be working on the same thing for the rest of the month.

How academia and business are different.

See the section about my feelings.

How will this experience influence my teaching?

I spoke last month about my thoughts about ungrading. I think that this sabbatical experience is reinforcing the thought that grading really isn’t good. I am having to train myself how to function at this job, not having been trained directly in what to do—just like my students will do at their first job. Lifelong learning and all.

This is not just about me being naive (which I admit to). I understand that some students won’t respond well without the grade incentives. So I am not being idealistic. Rather, I think that it is a valuable skill to be able to self-assess (which I am assuming is part of ungrading), and then learn to address your weaknesses.

My feelings about being in industry.

I am struggling a bit. I am recognizing exactly how spoiled we are as professors. I don’t get a break at Christmas. Christmas is on a Saturday, and I will be back to work on the Monday (unless I use a vacation day, which I might. I do get paid for a full day on Friday if I work a half-day, which is nice). Frankly, working in industry requires a certain endurance that I don’t exactly have right now. I will do fine, but I can tell that my body/mind/spirit is expecting a break that it will not get. I might get paid less as a professor than I would in industry, but I certainly appreciate the time off.

Sabbatical October Report: My Year as a Data Scientist

October 29, 2021

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

It took me most of the month, but I finally finished cleaning this version of the data (five months into a 12-month sabbatical). I just started doing some graphs and charts of the data in hopes of noticing trends that can help me out.

The skills needed to be a data scientist

I will focus on two skills: a special way of using joins, which I previously alluded to, and visualization with graphs and charts. I will start with the latter, which will be short since I have had less than a week’s experience with it: it is more difficult than I imagined it to be. To judge from the videos online, you simply need to type “plt.plot(data=MyDataSet)” and it works. What I (naively) didn’t realize is that you need to make sure that MyDataSet is formatted correctly. I had to do things like learn how to melt the data so that I could graph it correctly. I hadn’t expected that.
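For readers who have not met "melting" before, here is a minimal sketch of what it does, using pandas' `DataFrame.melt`. The table and its column names (`month`, `north`, `south`, `region`, `sales`) are made up for illustration and have nothing to do with the bank's data.

```python
import pandas as pd

# "Wide" layout: one column per region, which is convenient to read
# but awkward for many plotting functions.
wide = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "north": [10, 12, 9],
    "south": [7, 8, 11],
})

# "Long" layout: one row per (month, region) pair.
long = wide.melt(id_vars="month", var_name="region", value_name="sales")
print(long)
```

The long table has six rows and the columns month, region, and sales. Plotting functions that take a whole data set plus column names, such as seaborn's `lineplot(data=long, x="month", y="sales", hue="region")`, generally expect this shape.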

Deciding how to represent the data (scatterplot, histogram, etc) is also nontrivial, but I expected this. There isn’t a right answer, and it seems like it might be a bit of an art to (1) pick the right subset of the data and (2) use the right type of graph. I look forward to learning more about this.

I want to give an example of a use of a join that I hadn’t known about before. We will be talking about left joins, which you can think of as starting with the left table, and then attaching additional information from a second (right) table to the left table (without removing anything from the left table).

Here is some fake data. This is fake, since dates aren’t single-digit integers, but all we need is to be able to put the “dates” in order. Our goal will be to find the Value that occurred most recently prior to 6 (so the Value corresponding to Date 5 if possible, otherwise Date 4, Date 3, etc.).

Name  Date  Value
a     0     23
a     1     20
a     4     29
a     9     31
b     1     15
b     2     19
b     7     17

To my knowledge, there is no slick way of doing this (please let me know if you know of a natural way to do this). It is not obvious how a join would be useful, since there is only one table. However, we will create a second table by creating a new data frame that drops the Value column and any date that is not less than 6 (both these things are easy and natural to do in Python’s Pandas package). This gives us a second table.

Name  Date
a     0
a     1
a     4
b     1
b     2

We can now create a third table by doing a groupby on Name that picks the maximum value for Date that goes with each Name. This creates the third table.

Name  Date
a     4
b     2

Now, we can do a left join where this third table is the starting table, and we append any relevant information from the original table that corresponds to the Name and the Date that are in this third table. This yields the fourth, final table that has the most recent Value prior to 6 for each Name.

Name  Date  Value
a     4     29
b     2     19

So we used a join to turn a table with seven rows, five of them irrelevant, into a table with only the two relevant rows.
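The four tables above can be reproduced in pandas roughly as follows. This is a sketch of the steps as described, not the code used at the bank; the variable names `original`, `second`, `third`, `fourth`, and `alternative` are just labels for this example.

```python
import pandas as pd

# The fake data from the first table.
original = pd.DataFrame({
    "Name":  ["a", "a", "a", "a", "b", "b", "b"],
    "Date":  [0, 1, 4, 9, 1, 2, 7],
    "Value": [23, 20, 29, 31, 15, 19, 17],
})

# Second table: drop Value and keep only the dates before 6.
second = original.loc[original["Date"] < 6, ["Name", "Date"]]

# Third table: the latest remaining Date for each Name.
third = second.groupby("Name", as_index=False)["Date"].max()

# Fourth table: left join back to the original to recover Value.
fourth = third.merge(original, on=["Name", "Date"], how="left")
print(fourth)

# One possible alternative without a join: look up the row label of the
# latest qualifying Date in each group, then select those rows directly.
latest_rows = original[original["Date"] < 6].groupby("Name")["Date"].idxmax()
alternative = original.loc[latest_rows]
```

Both `fourth` and `alternative` contain the rows (a, 4, 29) and (b, 2, 19). The `idxmax` version is only one candidate for a more direct approach; sorting by Date and taking the last row per group would be another.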

How academia and business are different.

No report.

How will this experience influence my teaching?

I don’t know how much of this is influenced by my sabbatical, but I am thinking a ton about ungrading. I think this is mainly because Blum and company recently legitimized the practice. However, the sabbatical reinforces it. I am learning a ton, and I am grateful that my manager isn’t grading me on what I do. I have made a lot of mistakes that I wouldn’t have wanted to be penalized for, but I have also taken a lot of chances in how I approached the data that I likely wouldn’t have if I were seeking my manager’s approval.

Additionally, this sabbatical is going to help me when I teach statistics later. I am now fluent with what I would use in teaching the course, and I can imagine pulling out a Jupyter notebook in class to answer someone’s question on the fly. Previously, I was only really confident in preparing Jupyter notebooks for them to use on their homework (and I will be able to do that a lot better now). This is pretty huge since I teach statistics a lot.

My feelings about being in industry.

No report.

Sabbatical September Report: My Year as a Data Scientist

October 1, 2021

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

I am continuing working on the same project. Last month, I was testing and validating the data. This month, I am still testing and validating the data. This normally should take less than a week to do, but there were complications. See the next section for an explanation.

The skills needed to be a data scientist

My testing and validating found mistakes, and I spent most of the month fixing them (which points out how essential validating the data is). The bulk of my time spent fixing the mistakes was spent poking around one of the databases in order to learn about it. I am going to go into a bit of detail about this, since I have a related recommendation for a Data Science major below.

I was looking at invoices (bills asking a customer to pay a certain amount of money) and payments (records of what customers actually paid). My goal was to figure out if the customers paid the amount they were required to pay.

This was a remarkably difficult goal for me to reach. Here were the issues that I had.

  • A month ago, I was only looking at payments. The reason for this is that the user interface the sales people use made the payment amounts easy to find. My problem was that I mistook “the data I knew I had” for “the data I need.” I also needed to know how much the customer was supposed to pay (the invoice). I worked on this for over a month without realizing that I only had half of the data that I needed.
  • I went to search for the invoices. This is nontrivial, since the database didn’t have any sort of interface beyond the user interface the sales people used. However, I was able to hack together some things to figure out how to view the data in Microsoft Access (my manager made fun of me for using Access, since using Access to work with data is apparently the data science equivalent of using Netscape to browse the web. It works, though.). I do NOT think that this is something that we need to have students do.
  • Thanks to Access, I was able to find the invoices. Thanks to my super-helpful co-worker, I was able to find the invoices in the user interface.
  • However, the data in Access didn’t match the data in the user interface. For instance, they would have different due dates for the same invoice. Also, the amounts due for the invoices are sometimes adjusted by the bank, and the data in Access didn’t reflect this.
  • The issue was that I was not looking at the natural version of the invoice. There are sub-invoices for each invoice (I am not entirely sure what these are for, although I suspect that if you take out a single loan for two items, you get a subinvoice for each item). This turns out to be the natural way of thinking about the data for what I need—it is easy to get the amount due and the amount paid for the subinvoice, and the due dates are correct (the issue with the mismatched due dates on the proper invoices is that two subinvoices for a given invoice might have different due dates. Access would show one, and the user interface would show the other).
  • One annoyance is that the user interface does not naturally allow you to work with the subinvoices (e.g. you can’t search for them). However, I can get at them, and this is allowing me to proceed with the validating.

In this process, I made two big mistakes. One was conceptual: I assumed that I could do everything I need with payments, but I actually needed both invoices and payments. The other had to do with a lack of familiarity with how the database was organized.

In fact, I bet that I spent at least one week of this past month just exploring the database. I would perform mini experiments—if I saw x in the user interface, I made a guess where I could find x in Access. If I was correct, I would move on. If not, I would make another guess. I also spent at least one week exploring and testing the two other main databases. Thus, I have spent about one month (out of my four months on the job) just exploring and getting to know where (and how) the data lives. I don’t know if I could have done this more quickly, and I don’t know if it would have been desirable to do so—I am confident that this will be useful knowledge for my other sabbatical projects.

So my recommendation for a Data Science major is this: students should be exposed to increasingly messy databases. In the first couple of courses, it probably makes sense for them to exclusively work with databases that are well-designed and contain exactly the data they need (likely in just one or two different sources). By the end of the program, students should gain experience working with more complex and messy databases.

I didn’t know this, so perhaps it is worth saying. In many (most?) companies, databases seem to grow according to the principles of evolution, rather than the principles of intelligent design (the latter of which would require a lot of time and money). So data is often in multiple places, and there are often places that seem like they should have data—but don’t.

As such, here are some qualities of databases I think seniors should work with.

  • There should be many irrelevant tables.
  • Within most tables, there should be many columns of irrelevant data.
  • Within most tables, there should be many columns without any data.
  • For some data, the required data should be stored in multiple tables (ideally with different formats).
  • The naming conventions for tables and columns should not be terribly clear.
  • When using these databases, there should not be explicit instructions on what data they need. They should be given an outcome, with the expectation that they will find the data needed to achieve the outcome.

I want to make it clear that I am not complaining about the messy databases—I think that it would be nearly impossible for a company to maintain pristine databases. I just want students to have practical skills when they graduate.

Additionally, students should be explicitly taught to (and how to) test and validate their data. If I hadn’t done that, I would have gone forward without using invoices at all, which would have made the data look very, very different from reality.

How academia and business are different.

No report, although I discuss some similarities in the “Feelings” section below.

How will this experience influence my teaching?

I don’t know if the sabbatical has been directly influencing this next thought, but it might be. I have been thinking a lot about course-based undergraduate research experiences (CUREs). This is where I want my teaching to go anyway, but I also think that I am thinking about it because I can see how much learning I am doing on this sabbatical. I half-heartedly tried to learn about data science a couple of years ago by taking an online course. While I learned more than I should have, it was very modular in nature—I didn’t typically have to think too hard about what skills to apply (similar to how you should strongly consider using the chain rule on your calculus homework if the problem comes from the section of the book titled “The Chain Rule”). This project has forced me to make decisions about what tools to use, as well as to seek out possible new tools when I am in need of them. I think that this is the benefit of a research-like environment—it creates a sense of need when done well. Of course, doing a CURE well isn’t easy, but those details will be sorted out next summer.

My feelings about being in industry.

I feel like my experience in industry this year has been largely the same as that of doing research during my previous sabbatical. In both cases, I am working on big problems, mostly by myself. I get stuck, make things worse, and eventually make things better. Just as I didn’t have all of the ancillary distractions (e.g. committee meetings, department meetings, etc) during my last sabbatical, I don’t have the ancillary distractions associated with banking that my current colleagues have: the data analysis group provides support for other departments with data needs, and I don’t need to do any of that work. It is a good way to allow me to immerse myself in the data science (and for the bank to insulate me from making novice mistakes that make other departments’ jobs harder and more frustrating).

This all means that my experience in industry isn’t really an honest one—I am getting the best parts of the experience, and none of the more frustrating ones. I appreciate that the bank is doing that for me.

Sabbatical August Report: My Year as a Data Scientist

September 2, 2021

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

I am continuing working on the same project. I have all of the data put together (from four different databases), and now I am testing and validating the data. This essentially means that I am looking at a random sample of my data and computing the results by hand.

The skills needed to be a data scientist

I am using the same tools. I hope to talk more about more exciting tools next month, though.

I use a lot of joins, almost all of which are outer joins (either “left” or “full”). I have temporarily understood the different joins at different points in my past, but now I am fluent. I have been using joins for three (I think) purposes.

  • I use full outer joins to concatenate two data sets that contain the same information. For instance, we have one database that stores old data and one that stores new data. If I want to look at all of the data, I do a full outer join to combine them.
  • I use left joins (you could make it a right join if you like) to append new columns to a table without adding new rows. This is how I usually think of a left join.
  • I use left joins (again, you could make these right joins easily) to filter out data. This isn’t something that I thought of prior to this. Basically, I have a main table, but it has too much data. If I can create a second table that just has the rows I want, I can do “second table LEFT JOIN main table.” I don’t do this often—I am not in this situation a lot, and filtering usually works better—but I have done it.
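The three uses in the list above might look like this in pandas. The tables and column names (`old`, `new`, `names`, `wanted`, `id`, `amount`, `customer`) are hypothetical, chosen only to keep the example small.

```python
import pandas as pd

old = pd.DataFrame({"id": [1, 2], "amount": [100, 200]})
new = pd.DataFrame({"id": [3, 4], "amount": [300, 400]})

# 1. A full outer join on all shared columns stacks two tables that have
#    the same layout. Unlike pd.concat, a row present in both tables
#    would appear only once.
combined = old.merge(new, how="outer")

# 2. A left join appends columns without adding rows, as long as the
#    keys in the right table are unique.
names = pd.DataFrame({"id": [1, 2, 3], "customer": ["Ann", "Bo", "Cy"]})
with_names = combined.merge(names, on="id", how="left")

# 3. A left join as a filter: start from the small table of wanted keys,
#    so only matching rows of the main table survive.
wanted = pd.DataFrame({"id": [2, 4]})
filtered = wanted.merge(combined, on="id", how="left")

print(combined)
print(with_names)
print(filtered)
```

In the second result, id 4 has no match in `names`, so its customer is left empty rather than the row being dropped. In the third, a plain boolean filter such as `combined[combined["id"].isin([2, 4])]` does the same job, which matches the remark that filtering usually works better.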

How academia and business are different.

In academia, I get a lot of pleasure from helping students learn. This is pretty immediate, since the students are often right in front of me. Since I strongly value learning and education, I regularly see concrete ways where I help the world, albeit in small ways each time.

My experience in business has been different, which is a bit ironic. I am working for a department in the bank that helps another department in the bank collect payments from people. There are several layers between where I am and the people I am supposed to be helping (at least one department). There is also a time delay: I am working on something that won’t be used for at least a couple of months, and that is the only thing I am working on.

However, the bank lends a lot to farmers (it was voted the best bank for agriculture—and the best bank overall—in Minnesota in 2021). The bank is doing good work for society by, say, helping farmers buy equipment so that we can have food to eat. However (and I am embarrassed to say this as a mathematician), this seems a bit too abstract for me at times, and I sometimes struggle recognizing the importance of the work. But I truly believe that much of it is important—I just don’t always feel it.

How will this experience influence my teaching?

I don’t have much to say with respect to my sabbatical (unless it is subconscious), but I have been thinking a lot about labor-based grading. I am grateful to David Clark for being willing to have me bounce half-baked, rambling ideas off of him. He opened himself up to such treatment by mentioning labor-based grading in the excellent Grading for Growth Substack (with Robert Talbert).

Actually, there is one thing: it seems like there is a great demand for data-literate people in marketing. So I might push harder to get our business majors into our Data Analytics minor.

My feelings about being in industry.

My main experience right now is sadness that I am not teaching. I was briefly back on campus yesterday, and I was happy to see all of the students roaming around—and sad that I am not directly a part of it this year. This is part of the purpose of a sabbatical—to make you appreciate the great gig that you already have. Absence makes the heart grow fonder, and all.