Data Ramblings: Many Passes - My Journey from TidyTuesday to a Data Science Career

Kevin Kent

Last week, Jesse Mostipak tweeted about how Twitter played a major role for them in starting a career in data science. I reflected a bit on the role it has played in my career and decided to follow up on Jesse’s gentle nudge to write about it!

What you’ll see below is my experience of learning R, doing a data science bootcamp (100% in Python!), and getting a job in data science, along with the tools, resources, and strategies I’ve found to be important for my journey. I hope it is useful for folks embarking on a similar journey.

Same. I was very novice at R until I started doing data vis in #tidytuesday. Gave me the boost to get into a data sci bootcamp and move fully into data science. https://t.co/Dq6oezg59h
— Kevin Kent (@kevin_m_kent) July 17, 2021

Starting Out

A brief bit about where I started from at the beginning of this narrative. I first encountered R while working as a research staff member in an Arizona State University psychology lab called the SoLET lab (Science of Learning and Educational Technology). The lab specialized in studying reading and writing in the context of intelligent tutoring systems. At the center of the lab’s work were natural language processing techniques. I had come into the lab with a minor in mathematics and some graduate coursework in statistics, but this was my first real exposure to a statistical programming language.

I also had (and continue to have) a strong interest in dynamical systems theory, particularly its applications to modeling the process of learning and development. Luckily, the lab at ASU also had an interest in this area and conducted some research studying writing using cross recurrence analysis.

One of the graduate students in the lab pointed me to the Text Mining with R book and the tidytext package to start out in text analysis and natural language processing to support the lab’s research. This was also my first introduction to the concept of tidy data and the tidyverse generally. I remember being legitimately blown away when I was able to unnest a text field, remove stopwords, and identify common n-grams in only a few lines of code.
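For readers who have not seen that workflow, here is a minimal sketch of the same idea: split text into words, form two-word n-grams (bigrams), drop any bigram that contains a stopword, and count what is left. It is written in Python with only the standard library rather than with tidytext, and the sample sentences, the tiny stopword list, and the function names are all made up for illustration. It is not the code from the book or from the lab.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and stopword list, invented for this example.
DOCS = [
    "Students revised the reading strategy essay.",
    "The tutor praised the reading strategy.",
    "A reading strategy helps students.",
]
STOPWORDS = {"a", "the"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_counts(docs, stopwords):
    """Count adjacent word pairs, skipping pairs that contain a stopword."""
    counts = Counter()
    for doc in docs:
        words = tokenize(doc)
        for pair in zip(words, words[1:]):
            if pair[0] not in stopwords and pair[1] not in stopwords:
                counts[pair] += 1
    return counts


counts = bigram_counts(DOCS, STOPWORDS)
print(counts.most_common(1))  # [(('reading', 'strategy'), 3)]
```

A real analysis would use a much larger stopword list and a proper tokenizer, but the shape of the pipeline is the same.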

After two years at ASU, I joined a non-profit educational research and development organization called CAST as a research associate and instructional designer. There, I supported grant-funded research projects in schools and conducted exploratory as well as inferential analyses to answer the research questions set out by the project. The data was mostly surveys, log files from online educational platforms, and pre/post tests.

After about a year and a half at CAST, I was still considering applying to PhD programs and pursuing a career in academic research. In terms of data skills, I had become more familiar with R and the tidyverse through the day-to-day research work at CAST. However, I still was not very fluent in the language. I could get by with accomplishing a certain analysis or visualization needed for a project through a large amount of googling and trial and error (just to clarify, googling is 100% normal and not a bad thing, I just felt like I was doing a lot more of it than I should have).

I had a desire to be able to translate analysis or visualization ideas to code without having a bottleneck at the command line. I thought that having this fluency would allow me to quickly explore and iterate through exploratory questions, hypotheses, and methods of representing data. The primary obstacle to getting to this point was simply lack of exposure to a variety of datasets and opportunity for practice. Many people who work in academic research have the experience of working on 1-2 projects, waiting months for data collection, and conducting one major predictive or inferential analysis project as a final step before writing up the findings in a paper. While this is great experience for pursuing a career in research, it often doesn’t give you the opportunity to practice and hone your data skills in a short amount of time.

Enter TidyTuesday

One evening I stumbled upon David Robinson’s TidyTuesday live coding videos and without exaggeration it changed the trajectory of my career in data. I saw how he asked questions of his data and was able to produce answers in seconds. It expanded the boundaries of what I thought was possible in interactive data science coding. My next question was - how do I move in this direction? How do I attain anywhere close to this level of fluency and comfort with the R tidyverse ecosystem? Naturally, I started to learn more about the TidyTuesday community project. Each week new datasets and articles describing and/or using those datasets are posted on the GitHub repo (and on Twitter), and people post their visualizations and analyses on Twitter using the #tidytuesday hashtag.

Before I knew it, I had posted for 3 months straight and become many times more comfortable with data visualization and exploratory analysis. It also pushed me to start a blog! I also was amazed to get the occasional bit of feedback from some of my newfound R idols - Garrett Grolemund is one that sticks out in my mind. It opened up this community of practice to me and I learned how generous and kind the R community is to newcomers. Additionally, seeing the variety of ways people approached a dataset that I had become familiar with was transformative. I learned little tricks and habits of data folks using R in many different contexts. These habits and exposure to datasets (importantly, different data formats) from beer ratings to NYC restaurant inspections would have taken me many years to pick up without tidytuesday.

At work this experience also drove me deeper into data science. I started being a little more adventurous with analysis and visualization, working to implement many of the techniques that I first saw in tidytuesday code. This was the most positive positive-feedback loop I had ever experienced. I was spending 3-4 hours at my RStudio console without looking up and having a blast doing it. I had reached a threshold of fluency that allowed me to translate ideas to code in minutes rather than hours. There is some sort of a tipping point to this fluency, after which it gets REALLY fun, and tidytuesday had pushed me over this edge.

A Fork in the Road

This tidytuesday journey soon provoked a new question for me - should I continue to pursue academic research or a career in data science? If I pursued a career in data science, what would be my next step? How would I translate my academic research experience to business needs?

A friend from my master’s program a few years earlier had told me about a data science bootcamp called The Data Incubator. The focus of this bootcamp is to take people with master’s and/or PhD degrees who have 90% of the data science skillset and give them the last 10% to make the jump to industry data science. Importantly, as someone who had spent my entire career up to this point in non-profits and university jobs, they offered a fellowship that made the bootcamp 100% free (it seems now it is free via an income sharing agreement, which wasn’t the case when I did the program). I had actually applied to it a few years earlier after my master’s program but couldn’t complete the first screening stage, so I had little hope that I could produce a successful application this time around.

I decided to give it a shot. I figured if I wasn’t successful, I could continue on my academic path and apply to PhD programs that fall. The application process was definitely challenging - the first stage was a mix of computer science and statistics problems as well as descriptive information about your experience in related areas. This time around, I was able to complete most of the questions and advanced to the next stage. To my delight, the second stage was almost identical to tidytuesday - they offered a dataset and prompted you to produce an analysis that answered some question you were interested in. Additionally, they required you to specify which steps you would complete next, given the opportunity.

Somehow, I was asked to do a final interview! I was both surprised and elated. The final stage involved completing the steps you proposed and presenting your findings to a group of peer applicants, answering any questions along the way. A few days later I was offered a spot as a fellow in their summer 2019 NYC cohort (just a few weeks away) and could not believe it. I was about to go all-in on a data science career. The kicker was that the bootcamp was 100% in Python.

Bootcamp and Interviews

I spent the next three weeks learning everything I could about Python and its associated data science libraries. The program offered a really nice set of resources to get up to speed in a hurry with Python. I found that there were many similarities with R and the tidyverse ecosystem (especially with pandas) that made the transition much easier. Additionally, it was helpful to make links between the generic programming concepts like control flow and functions that exist in each language. As a side note, there is really good evidence in the second language learning literature that knowing your first language really well makes learning a second language much easier via shared language concepts and vocabulary. I haven’t seen published studies but I would bet the same is true for programming languages.

The program itself lasted 8 weeks and consisted of full workdays of in-person lecture and applied assignments. There were about 10 other classmates that I worked alongside, which was one of the great benefits of the program. Getting to know others pursuing a similar transition and solving hard problems together was a fantastic learning and bonding opportunity. We covered topics from web scraping to deep learning with TensorFlow.

At the conclusion of the program it was job application time! The agreement with the program was that in exchange for free tuition, you had to apply exclusively to partner companies for three months after the program, so I applied and interviewed at many places from their partner list. I had some close calls but unfortunately did not find a job during this period. It was one of the toughest moments of my life - making a leap of faith to a new career only to be rejected everywhere I applied. I still had faith that I could start out somewhere as a data scientist but I didn’t know where or how that would come.

After this period was over I found an opening at Nuance Communications for a Senior Data Scientist on their Site Reliability Engineering team. With nothing to lose, I applied and advanced through the interviews. After a few weeks, I received a job offer. I was completely over the moon! Finally, I saw the benefits of all this hard work that started with tidytuesday and ended with completing the bootcamp. They told me that my experience with user experience research and log file analysis on an online learning platform, in addition to my data science skillset, made me interesting as a candidate. It took just the right opportunity with a company and team who saw the value in each step in my journey, not just the more recent data science push.

Today

A year and eight months later, I am still in the same position at Nuance on the Site Reliability Engineering team. It’s been an incredible place to work - everything from the work culture to the challenging business problems I’m asked to tackle on a day-to-day basis. I work on everything in the realm of cloud capacity forecasting, monitoring and alerting, root cause investigations, and deployment evaluation for Nuance’s Dragon Medical products. I feel incredibly lucky that I ended up at Nuance in a role where I’m able to use R, SQL, and Python (mainly PySpark) on a daily basis to inform business decisions.

Lessons

Here are a few of the main takeaway lessons from the last few years.

Take Many Passes

Many people transitioning to data science, including me, experience imposter syndrome. This is even true (maybe even more so) after landing a position in data science. I found it really useful to focus on dropping any ideas of perfectionism or any notion that learning data science is a destination. Data science is a vast field and no one could possibly know everything. The most important thing is to have an active (mental) map of what you know you know and what you know you don’t know, and to focus on discovering things you didn’t know you didn’t know. With time you can incrementally improve on knowledge gaps and advance your understanding of statistics, programming, etc.

The first time you read something or learn about a concept, you might remember the main pillars or the details of a few use cases. Even without revisiting this concept, you will see variations of it in other places. Each time you see a related concept, you form links between the two and see the concept in a different light. This gives you a deeper and deeper understanding of whatever you are trying to learn.

So embrace not getting everything the first time around and seek out many examples and applications of that and related concepts. But don’t be afraid to try to apply it early to a problem at work or on a personal project (just be aware of the assumptions and limitations of the approach). This is just the process of learning and how growth works. After all, experts in a field are never done learning. By definition, they need to stay apprised of the latest advancements and evidence for best practices in the field.

Observe the Process of Experts

Witnessing David Robinson live-code was transformational. Not only was I blown away by his fluency but I also learned a ton from his think-aloud of what he was doing and why he was doing it. This is one of the hardest parts of learning data science as opposed to something like tennis (although the inner game of tennis or any sport is very important as well). You see the final product of people like David Robinson, whether it is code or a neat visualization, and think that is where the magic happens. And don’t get me wrong, the products are great, but the important stuff is the thought process and strategy/reasoning in the middle. So try to find as many opportunities as possible to flip the lid on the process of data science experts.

Give Back to the Community

As others, including Jesse, have called it before me, I consider myself community-taught (as opposed to self-taught). So it’s only right to contribute to that same community and give back. I have tried to do my small part by serving as a mentor in the R for Data Science Slack community, answering questions and participating in discussions about R. However, there are also other major benefits to giving back in this way. The biggest one I can think of is having to know concepts on a deep enough level to be able to explain why something is true and give the right examples so that others can understand. Additionally, the learning science literature is clear that giving peer feedback can be just as powerful as getting feedback for promoting understanding. Like tidytuesday, answering community questions exposes you to a wider range of problems and techniques than you would encounter in your day-to-day work. This also connects to the larger point of getting and giving feedback as much as possible, a key ingredient to learning.

Find your Niche

For many people, there is something that drew them into the data science field. For me, I was interested in modeling the learning process as a dynamical system and dynamical systems theory in general. Over the past few years, I’ve continued to deepen my understanding in this area through courses and resources like this one on Nonlinear Dynamics.

Having a special angle to data science or interest is helpful in a variety of ways. I’ve found that it helps with some of the imposter syndrome I’ve described above - most people have no reason or interest to learn about this particular area, which lessens the pressure I feel to know a certain amount about it. It also helps differentiate my skillset a bit and links the different stages of my career in a coherent way. And I continue to seek out resources about dynamical and complex systems simply because it’s interesting.

This isn’t required by any means but I’ve found it really helpful for differentiation and developing a coherent thread linking the various stages of my career. It might be helpful for you too!

Establish your Information Channels

It is extremely valuable to have regular sources of information that expose you to new ideas in data science. I follow the #rstats hashtag on Twitter and actively participate in the R for Data Science Slack community. Whenever I find an interesting blog, I add it to my Feedly collection. I also recently subscribed to Medium and I follow the data science posts. I find their recommendation algorithm really helpful for finding new posts on topics in data science that I find interesting or am trying to learn more about.

Consistently Revisit the Basics (at least at first)

I have found it really useful to read different takes on introductory statistics and probability from time to time. I’m currently doing this by reading Regression and Other Stories. It offers a new perspective on a topic I feel like I know pretty well, and the Bayesian take combined with another author’s framing of statistics has been really revealing for me. I also find that at my job, 90% of problems can be solved with regression and a solid understanding of foundational statistics concepts. So revisiting these areas can be hugely beneficial in that respect as well.
