Sabbatical September Report: My Year as a Data Scientist

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

I am continuing to work on the same project. Last month, I was testing and validating the data. This month, I am still testing and validating the data. This should normally take less than a week, but there were complications. See the next section for an explanation.

The skills needed to be a data scientist

My testing and validating found mistakes, and I spent most of the month fixing them (which points out how essential validating the data is). The bulk of that time was spent poking around one of the databases in order to learn about it. I am going to go into a bit of detail about this, since I have a related recommendation for a Data Science major below.

I was looking at invoices (bills asking a customer to pay a certain amount of money) and payments (records of what customers actually paid). My goal was to figure out if the customers paid the amount they were required to pay.

This was a remarkably difficult goal for me to reach. Here are the issues that I had.

  • A month ago, I was only looking at payments. The reason for this is that the user interface the sales people use made the payment amounts easy to find. My problem was that I mistook “the data I knew I had” for “the data I need.” I also needed to know how much the customer was supposed to pay (the invoice). I worked on this for over a month without realizing that I only had half of the data that I needed.
  • I went to search for the invoices. This is nontrivial, since the database didn’t have any sort of interface beyond the user interface the sales people used. However, I was able to hack together some things to figure out how to view the data in Microsoft Access. (My manager made fun of me for using Access, since using Access to work with data is apparently the data science equivalent of using Netscape to browse the web. It works, though.) I do NOT think that this is something that we need to have students do.
  • Thanks to Access, I was able to find the invoices. Thanks to my super-helpful co-worker, I was able to find the invoices in the user interface.
  • However, the data in Access didn’t match the data in the user interface. For instance, they would have different due dates for the same invoice. Also, the amounts due for the invoices are sometimes adjusted by the bank, and the data in Access didn’t reflect this.
  • The issue was that I was not looking at the natural version of the invoice. There are subinvoices for each invoice (I am not entirely sure what these are for, although I suspect that if you take out a single loan for two items, you get a subinvoice for each item). This turns out to be the natural way of thinking about the data for what I need: it is easy to get the amount due and the amount paid for the subinvoice, and the due dates are correct. (The issue with the mismatched due dates on the proper invoices is that two subinvoices for a given invoice might have different due dates. Access would show one, and the user interface would show the other.)
  • One annoyance is that the user interface does not naturally allow you to work with the subinvoices (e.g. you can’t search for them). However, I can get at them, and this is allowing me to proceed with the validating.
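To make the subinvoice idea concrete, here is a minimal sketch of the kind of comparison described above, using Python's built-in sqlite3 module rather than Access. Everything in it is an assumption for illustration: the table names (`subinvoices`, `payments`), the column names, and the sample rows are made up, since the real schema is not described in this post. Amounts are stored as integer cents to avoid rounding surprises.

```python
import sqlite3

# Hypothetical schema and made-up sample rows; the real database is not shown here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subinvoices (
        subinvoice_id    INTEGER PRIMARY KEY,
        invoice_id       INTEGER,
        due_date         TEXT,
        amount_due_cents INTEGER
    );
    CREATE TABLE payments (
        payment_id        INTEGER PRIMARY KEY,
        subinvoice_id     INTEGER,
        amount_paid_cents INTEGER
    );
    INSERT INTO subinvoices VALUES
        (1, 100, '2018-09-01', 25000),
        (2, 100, '2018-09-15', 15000),
        (3, 101, '2018-09-10', 30000);
    INSERT INTO payments VALUES
        (1, 1, 25000),
        (2, 2, 10000),
        (3, 2, 2500);
""")


def reconcile(conn):
    """Return (subinvoice_id, due_date, due, paid, shortfall) for every subinvoice."""
    query = """
        SELECT s.subinvoice_id,
               s.due_date,
               s.amount_due_cents,
               COALESCE(SUM(p.amount_paid_cents), 0) AS paid_cents,
               s.amount_due_cents - COALESCE(SUM(p.amount_paid_cents), 0) AS shortfall_cents
        FROM subinvoices AS s
        -- LEFT JOIN keeps subinvoices that have no payments at all
        LEFT JOIN payments AS p ON p.subinvoice_id = s.subinvoice_id
        GROUP BY s.subinvoice_id, s.due_date, s.amount_due_cents
        ORDER BY s.subinvoice_id
    """
    return conn.execute(query).fetchall()


rows = reconcile(conn)
for row in rows:
    print(row)
```

The LEFT JOIN is the important design choice in this sketch: starting from payments alone would silently drop the third subinvoice, which has no payments, and that is essentially the "half of the data" mistake described in the first bullet.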

In this process, I made two big mistakes. One was conceptual: I assumed that I could do everything I needed with payments, but I actually needed both invoices and payments. The other had to do with a lack of familiarity with how the database was organized.

In fact, I bet that I spent at least one week of this past month just exploring the database. I would perform mini experiments—if I saw x in the user interface, I made a guess where I could find x in Access. If I was correct, I would move on. If not, I would make another guess. I also spent at least one week exploring and testing the two other main databases. Thus, I have spent about one month (out of my four months on the job) just exploring and getting to know where (and how) the data lives. I don’t know if I could have done this more quickly, and I don’t know if it would have been desirable to do so—I am confident that this will be useful knowledge for my other sabbatical projects.
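The guess-and-check exploration described above can be partly automated. The sketch below is only an illustration under assumptions: it uses an in-memory SQLite database with made-up, deliberately cryptic table and column names (`inv_hdr`, `inv_dtl`, `dt1`, and so on), not the bank's databases or Access. Given a value seen in the user interface, the hypothetical helper `find_value` reports every table and column that contains it.

```python
import sqlite3


def quote_ident(name):
    """Quote a table or column name for safe use inside SQL text."""
    return '"' + name.replace('"', '""') + '"'


def find_value(conn, value):
    """Return sorted (table, column) pairs where at least one row equals `value`."""
    hits = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        info = conn.execute(f"PRAGMA table_info({quote_ident(table)})").fetchall()
        for column in (r[1] for r in info):
            sql = (f"SELECT 1 FROM {quote_ident(table)} "
                   f"WHERE {quote_ident(column)} = ? LIMIT 1")
            if conn.execute(sql, (value,)).fetchone():
                hits.append((table, column))
    return sorted(hits)


# Made-up sample data with unhelpful names, standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inv_hdr (id INTEGER, dt1 TEXT, amt INTEGER);
    CREATE TABLE inv_dtl (id INTEGER, hdr_id INTEGER, dt1 TEXT, dt2 TEXT, amt INTEGER);
    INSERT INTO inv_hdr VALUES (100, '2018-09-01', 40000);
    INSERT INTO inv_dtl VALUES
        (1, 100, '2018-09-01', NULL, 25000),
        (2, 100, '2018-09-15', NULL, 15000);
""")

print(find_value(conn, '2018-09-15'))
print(find_value(conn, '2018-09-01'))
```

A brute-force scan like this is only practical on small or sampled databases, but it turns each "where does x live?" guess into a single call.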

So my recommendation for a Data Science major is this: students should be exposed to increasingly messy databases. In the first couple of courses, it probably makes sense for them to exclusively work with databases that are well-designed and contain exactly the data they need (likely in just one or two different sources). By the end of the program, students should gain experience working with more complex and messy databases.

I didn’t know this, so perhaps it is worth saying. In many (most?) companies, databases seem to grow according to the principles of evolution, rather than the principles of intelligent design (the latter of which would require a lot of time and money). So data is often in multiple places, and there are often places that seem like they should have data—but don’t.

As such, here are some qualities of databases I think seniors should work with.

  • There should be many irrelevant tables.
  • Within most tables, there should be many columns of irrelevant data.
  • Within most tables, there should be many columns without any data.
  • For some data, the required data should be stored in multiple tables (ideally with different formats).
  • The naming conventions for tables and columns should not be terribly clear.
  • When using these databases, there should not be explicit instructions on what data they need. They should be given an outcome, with the expectation that they will find the data needed to achieve the outcome.
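One practical first step for a student facing a database like this is to profile it before trusting it. The following is a rough sketch, assuming a SQLite database and made-up table names (`cust`, `tmp_2009`); the hypothetical helper `empty_columns` lists every column that holds no data at all, treating both NULL and the empty string as "no data".

```python
import sqlite3


def quote_ident(name):
    """Quote a table or column name for safe use inside SQL text."""
    return '"' + name.replace('"', '""') + '"'


def empty_columns(conn):
    """Return sorted (table, column) pairs whose values are all NULL or ''."""
    empties = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        info = conn.execute(f"PRAGMA table_info({quote_ident(table)})").fetchall()
        for column in (r[1] for r in info):
            sql = (f"SELECT COUNT(*) FROM {quote_ident(table)} "
                   f"WHERE {quote_ident(column)} IS NOT NULL "
                   f"AND {quote_ident(column)} <> ''")
            filled = conn.execute(sql).fetchone()[0]
            if filled == 0:
                empties.append((table, column))
    return sorted(empties)


# Made-up example: one table with dead columns and one leftover table with no rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cust (id INTEGER, nm TEXT, fax TEXT, legacy_cd TEXT);
    INSERT INTO cust VALUES (1, 'Ann', NULL, ''), (2, 'Bo', NULL, NULL);
    CREATE TABLE tmp_2009 (id INTEGER, note TEXT);
""")

for table, column in empty_columns(conn):
    print(f"{table}.{column} has no data")
```

A report like this does not say which tables are relevant, but it quickly rules out places that look like they should have data and don't.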

I want to make it clear that I am not complaining about the messy databases—I think that it would be nearly impossible for a company to maintain pristine databases. I just want students to have practical skills when they graduate.

Additionally, students should be explicitly taught to (and how to) test and validate their data. If I hadn’t done that, I would have gone forward without using invoices at all, which would have made the data look very, very different from reality.

How academia and business are different.

No report, although I discuss some similarities in the “Feelings” section below.

How will this experience influence my teaching?

I don’t know if the sabbatical has been directly influencing this next thought, but it might be. I have been thinking a lot about course-based undergraduate research experiences (CUREs). This is where I want my teaching to go anyway, but I also think that I am thinking about it because I can see how much learning I am doing on this sabbatical. I half-heartedly tried to learn about data science a couple of years ago by taking an online course. While I learned more than I should have, it was very modular in nature—I didn’t typically have to think too hard about what skills to apply (similar to how you should strongly consider using the chain rule on your calculus homework if the problem comes from the section of the book titled “The Chain Rule”). This project has forced me to make decisions about what tools to use, as well as to seek out possible new tools when I am in need of them. I think that this is the benefit of a research-like environment—it creates a sense of need when done well. Of course, doing a CURE well isn’t easy, but those details will be sorted out next summer.

My feelings about being in industry.

I feel like my experience in industry this year has been largely the same as that of doing research during my previous sabbatical. In both cases, I am working on big problems, mostly by myself. I get stuck, make things worse, and eventually make things better. Just as I didn’t have all of the ancillary distractions (e.g. committee meetings, department meetings, etc) during my last sabbatical, I don’t have the ancillary distractions associated with banking that my current colleagues have: the data analysis group provides support for other departments with data needs, and I don’t need to do any of that work. It is a good way to allow me to immerse myself in the data science (and for the bank to insulate me from making novice mistakes that make other departments’ jobs harder and more frustrating).

This all means that my experience in industry isn’t really an honest one—I am getting the best parts of the experience, and none of the more frustrating ones. I appreciate that the bank is doing that for me.

10 Responses to “Sabbatical September Report: My Year as a Data Scientist”

  3. Andy Rundquist Says:

    Sorry I missed these posts! I’m rapidly catching up.

    I want to ask you about the differences between your suggestion of exposing students to messier databases, on the one hand, and having students build data sets (potentially databases but maybe just pandas data frames) from various sources, like web scraping, on the other. (that was an awfully constructed sentence, sorry).

    We do the latter in our Computational Data Science curriculum, mostly in the course I teach, which is pretty early in the curriculum. I lean that way because I don’t want my students to think that all problems already have the data in nice forms (even if messy, using your term).

    But I wonder about your approach, because of course there are tons of decently formatted data sources out there.

    I guess I get interested in a data science approach to questions like “how does antisemitism manifest on twitter” (like one of my students approached) where there’s tons of evidence but you have to go scrape it from twitter (using their API in this case) and cram it into a pandas dataframe for later analysis.

    What do you think? Are these really the same thing? Do they complement each other?

    • bretbenesh Says:

      You raise a great point. I think that they complement each other, since they are different skills.

      Thus, I think a program would ideally teach both skills. If a program is only able to teach one, though, I would first try to figure out if one is more valuable among working data scientists (I don’t actually know right now). If it is a toss-up, I would opt for your way: I think that gathering data would seem more “real” than using a database provided by the professor (even if it contains real data).

      Thanks for the comment—my post definitely biases the single experience I am getting this year, rather than the general data science experience.
