Sabbatical October Report: My Year as a Data Scientist

This is a continuation of the series of my sabbatical reports. Here are the previous entries.

It took me most of the month, but I finally finished cleaning this version of the data (five months into a 12-month sabbatical). I have just started making some graphs and charts of the data in hopes of noticing trends that can help me out.

The skills needed to be a data scientist

I will focus on two skills: a special way of using joins, which I previously alluded to, and visualization with graphs and charts. I will start with the latter, and since I have had less than a week’s experience with it, this part is short: it is more difficult than I imagined it would be. From the videos online, it looks like you simply need to type “plt.plot(data=MyDataSet)” and it works. What I (naively) didn’t realize is that you need to make sure that MyDataSet is formatted correctly. I had to do things like learn how to melt the data so that I could graph it correctly. I hadn’t expected that.
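To make that concrete, here is a minimal sketch of what melting buys you when plotting; the DataFrame, column names, and series here are hypothetical and not my actual data.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical wide-format data: one column per series.
wide = pd.DataFrame({
    "Date": [0, 1, 2, 3],
    "SeriesA": [23, 20, 29, 31],
    "SeriesB": [15, 19, 17, 18],
})

# Melt into long format: one row per (Date, series name, value) observation.
long_form = wide.melt(id_vars="Date", var_name="Series", value_name="Value")

# Plot one line per series now that the data are in long format.
for name, group in long_form.groupby("Series"):
    plt.plot(group["Date"], group["Value"], label=name)
plt.legend()
plt.show()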

Deciding how to represent the data (scatterplot, histogram, etc) is also nontrivial, but I expected this. There isn’t a right answer, and it seems like it might be a bit of an art to (1) pick the right subset of the data and (2) use the right type of graph. I look forward to learning more about this.

I want to give an example of a use of a join that I hadn’t known about before. We will be talking about left joins, which you can think of as starting with the left table and then attaching additional information from a second (right) table to the left table (without removing anything from the left table).

Here is some fake data. It is fake, since dates aren’t single-digit integers, but all we need is to be able to put the “dates” in order. Our goal will be to find, for each Name, the Value that occurred most recently prior to Date 6 (so the Value corresponding to Date 5 if possible, otherwise Date 4, Date 3, etc.).

Name  Date  Value
a     0     23
a     1     20
a     4     29
a     9     31
b     1     15
b     2     19
b     7     17
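For anyone following along in pandas, this table might be built like so (just a sketch; the variable name df and this exact construction are mine, not from my actual code):

import pandas as pd

df = pd.DataFrame({
    "Name":  ["a", "a", "a", "a", "b", "b", "b"],
    "Date":  [0, 1, 4, 9, 1, 2, 7],
    "Value": [23, 20, 29, 31, 15, 19, 17],
})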

To my knowledge, there is no slick way of doing this (please let me know if you know of a natural way to do it). It is not obvious how a join would be useful, since there is only one table. However, we will create a second table by making a new data frame that drops the Value column and drops any row whose Date is not less than 6 (both of these things are easy and natural to do in Python’s Pandas package). This gives us a second table.

Name  Date
a     0
a     1
a     4
b     1
b     2
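Continuing the sketch above, the drop and the filter might look like this:

# Keep only the rows with Date < 6, then drop the Value column.
recent = df[df["Date"] < 6].drop(columns="Value")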

We can now create a third table by doing a groupby on Name that picks the maximum Date that goes with each Name.

Name  Date
a     4
b     2
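In the same sketch, the groupby might be:

# For each Name, keep the largest remaining Date.
latest = recent.groupby("Name", as_index=False)["Date"].max()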

Now we can do a left join where this third table is the starting (left) table, and we append any relevant information from the original table that corresponds to the Name and Date pairs in this third table. This yields the fourth and final table, which has the most recent Value prior to Date 6 for each Name.

Name  Date  Value
a     4     29
b     2     19
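Completing the sketch, the left join itself might look like this:

# Left join back to the original table on Name and Date to pull in
# the Value for each Name's most recent Date prior to 6.
result = latest.merge(df, on=["Name", "Date"], how="left")
print(result)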

So, via the join, we turned a seven-row table (five of those rows irrelevant to us) into a table with only the two rows we wanted.

How academia and business are different.

No report.

How will this experience influence my teaching?

I don’t know how much of this is influenced by my sabbatical, but I am thinking a ton about ungrading. I think this is mainly because Blum and company recently legitimized the practice. However, the sabbatical reinforces it. I am learning a ton, and I am grateful that my manager isn’t grading me on what I do. I have made a lot of mistakes that I wouldn’t have wanted to be penalized for, but I have also taken a lot of chances in how I approached the data that I likely wouldn’t have taken if I were seeking my manager’s approval.

Additionally, this sabbatical is going to help me when I teach statistics later. I am now fluent with the tools I would use in teaching the course, and I can imagine pulling out a Jupyter notebook in class to answer someone’s question on the fly. Previously, I was only really confident in preparing Jupyter notebooks for students to use on their homework (and I will be able to do that a lot better now). This is pretty huge, since I teach statistics a lot.

My feelings about being in industry.

No report.


11 Responses to “Sabbatical October Report: My Year as a Data Scientist”

  1. gasstationwithoutpumps Says:

    I don’t see any need to drop the value column and then join it back in. You can do the select and groupby on the original database—they are doing all the real work.

    The only reason to drop columns and join them back in would be if they have a lot of data in them, to reduce the size of the intermediate files.

    • bretbenesh Says:

      You are 100% right. While it is true that I was working with large data sets (the groupby probably took one minute to run), I am nearly certain this is not why I chose to drop the Values column.

      Really, I need to look to see why I dropped the Values column. I need to figure out if I stupidly made my code more complicated than it needed to be or whether I didn’t describe the full situation above.

      Thanks so much!

  2. Sabbatical November Report: My Year as a Data Scientist | Solvable by Radicals Says:

    […] « Sabbatical October Report: My Year as a Data Scientist […]

  3. Andy Rundquist Says:

    (still catching up!)

    in sql I would do this:

    select name, date, value
    from table t, (select name, max(value) as maxValue from table where value < 6 group by name) m
    where t.name = m.name and t.value = m.maxValue

    I tend to think in the progression sql -> pandas ( -> array filters in javascript || list management in Mathematica), and so my pandas approach is usually around sql thinking before figuring out how to do it in pandas.

    Does Access let you do sql?

    • bretbenesh Says:

      I think more in terms of SQL, too. I barely use Access (I only use it to view some data in an antediluvian database that doesn’t seem to work with anything else), so I am not sure. I am betting you can use SQL, though. I use Microsoft SQL Server Management Studio, which I am surprisingly happy with (I am normally not happy with Microsoft products), although I don’t use that too much, either.

  4. Sabbatical December Report: My Year as a Data Scientist | Solvable by Radicals Says:

    […] October Report […]

  5. January Sabbatical Report: My Year as a Data Scientist | Solvable by Radicals Says:

    […] October Report […]

  6. February Sabbatical Report: My Year as a Data Scientist | Solvable by Radicals Says:

    […] October Report […]

  7. March Sabbatical Report: My Year as a Data Scientist | Solvable by Radicals Says:

    […] October Report […]

  8. April Sabbatical Report: My Year as a Data Scientist | Solvable by Radicals Says:

    […] October Report […]

  9. Solvable by Radicals Says:

    […] October Report […]
