My Workflow

Author: Frederick Solt

Published: July 1, 2022

Today there was a workshop on research workflows at the Cluster of Excellence Politics of Inequality, with presentations by Maarten Buis and me. I was excited by the invitation to speak, because I think how we do our work is important: a good workflow helps us to be more efficient, minimize errors, and so make our work more credible, all things to strive for. Here are the elements of my workflow that I talked about.

Automated Backup

Sometime in 2007 or 2008, I had a hard drive fail in my main work machine. I bought some software to recover the information on the drive, which seemed to work perfectly. Only months—maybe even years—later did I realize that virtually all of my files for a big project had been lost. On the one hand, the paper was already out, so the years of effort weren’t entirely in vain. On the other, all that stuff was gone, and I’d never be able to build on it. Here, well over a decade later, it still makes me shudder.

So I got serious about backup. First, I use Resilio Sync to keep copies of all the files in my laptop’s Documents directory on another machine that’s in a closet at home. That machine uses Apple’s Time Machine to back up those files again to an external drive, and it uses Arq Backup to back them up once more to Amazon Glacier storage.1

Drives are much more robust today, but bad things—fires, floods, a lost computer bag—can still happen. Back your stuff up! There are lots of options in this space; choose one (or, like me, a couple) and set them up to run automatically. Then you can check in on those processes occasionally (the weak point in mine is that the flaky power grid shuts down the closet computer once in a while) but otherwise never worry about it again. Your career deserves this insurance policy.

Discovery

Finding out about work that will help in your research is an important step. I subscribe to a bunch of journals’ email lists so I get their tables of contents in my inbox. And of course I use Google Scholar both to search by keyword and to snowball from a work I have already found—that is, to look through the works that have cited that first piece since it was published.

Reference Management

Your absolute quickest workflow payoff is to adopt reference management software for your bibliographies. Hand-typing bibliographies takes a lot of time, it’s no fun, you’re going to make mistakes, and the computer is happy to do it for you. Perfect. Moreover, having your own library of relevant work makes it super easy to find and pull up that paper you can’t quite remember the details about. I recommend this to junior people frequently, and I’m always surprised at how reluctant many, even most, of them are. Putting a little time in on this at the front end saves a lot later.

Starting as an undergrad and then through graduate school, I used EndNote, but I moved to Bib\(\TeX\) shortly after finishing my Ph.D., when I started writing in \(\LaTeX\) rather than Word. I have used BibDesk as my Bib\(\TeX\) front end since then; like most of my workflow, it is open source, but unfortunately, it is Mac-only software. JabRef is the standard on other platforms. I keep thinking about shifting to Zotero for this. Maybe someday that will happen.

Anyway, whenever I read anything I think I might use, I download its Bib\(\TeX\) file from the journal2 and the pdf. I take a second right then to double-check that all of the information is accurate—that the authors’ full first names are there rather than just an initial, etc.—and then I never have to even think of typing that stuff ever again.

Statistical Analysis

I was trained in Stata and was totally into it for many years (and I mean totally: I repeatedly paid the seriously big bucks for the 8MP version, and I was campus representative for the company at my university). But since maybe 2012, after three or four years of slowly moving my workflow over, I have used R more or less exclusively, and I use RStudio Desktop as my IDE. As I’ve written elsewhere, I’d say R’s advantages include being free and open source, superior graphical capabilities, “the super-helpful community, the transferable job skills you can teach your students, the free part, the cutting-edge stuff available years before it’s in Stata, the way RStudio makes it dead easy to do reproducible research through dynamic documents and version control, and, once again, the free part.” I adopted the tidyverse as it grew and teach it to my students, too—in my experience, it really does make R much easier to learn and use. Here’s an outline of my data analysis workflow and what I use to get it done:

  • Scraping data: the general tools are rvest (see my how-to here) and RSelenium, but always look for a specialized package first, like icpsrdata.
  • Loading data: rio is my go-to—it will load almost anything with minimal configuration.
  • Wrangling data: the tidyverse is my starting point. The janitor package has the clean_names() function, which I use all the time. The countrycode package provides standardized country names; imo, it is indispensable for broadly cross-national work. I wrote DCPOtools for working with large numbers of survey datasets to estimate dynamic comparative public opinion, and now I use it constantly. (The first sketch after this list pulls the loading and wrangling tools together.)
  • Presenting raw data: I use ggplot2, which is part of the tidyverse, for this; my favorite guide is the Cookbook for R’s chapter on graphs.
  • Analysis: many of my projects have used multilevel models, and the lme4 package is the key tool for that (the second sketch after this list shows it in use).
  • Presenting results: As soon as I read it, I was sold on the argument of Kastellec and Leoni (2007) that we should always use plots instead of tables, and Hu Yue and I eventually wrote dotwhisker to make that easy.
  • Presenting quantities of interest: as King, Tomz, and Wittenberg (2000) wrote, making the most of statistical analyses involves displaying not only regression results but also how those results translate into what we’re really interested in. To this end, Hu Yue and I wrote the interplot package to present the marginal effects plots needed to understand the interaction terms I often use.
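
To make the loading and wrangling steps concrete, here is a minimal sketch of that part of the pipeline; the file name and the country, year, and outcome variables are hypothetical placeholders rather than code from an actual project.

```r
library(tidyverse)
library(rio)         # flexible data import
library(janitor)     # clean_names()
library(countrycode) # standardized country names and codes

# Import a hypothetical raw dataset and tidy up its variable names
raw <- import("raw_data.csv") %>%
  clean_names()

# Add standardized ISO3 country codes to ease merging across datasets
tidied <- raw %>%
  mutate(iso3c = countrycode(country,
                             origin = "country.name",
                             destination = "iso3c"))

# A quick descriptive plot of the raw data
ggplot(tidied, aes(x = year, y = outcome)) +
  geom_line() +
  facet_wrap(vars(iso3c))
```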

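And a similarly hedged sketch of the analysis and presentation steps (a multilevel model with lme4, a coefficient plot with dotwhisker, and a marginal effects plot for an interaction with interplot), continuing with the placeholder data from the first sketch:

```r
library(lme4)       # multilevel models
library(dotwhisker) # coefficient plots
library(interplot)  # marginal effects of interactions

# A multilevel model with a random intercept by country
# (variables are the placeholders from the sketch above)
m1 <- lmer(outcome ~ x1 * x2 + (1 | iso3c), data = tidied)

# Plot the estimated coefficients rather than reporting a table
# (tidying mixed models may also require the broom.mixed package)
dwplot(m1)

# Plot the estimated effect of x1 across the observed range of x2
interplot(m = m1, var1 = "x1", var2 = "x2")
```
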
I also do a lot of work using a Bayesian approach to estimate latent variables with models coded up in Stan. For a while I used the RStan package to interface with Stan, but I recently switched to CmdStanR: it’s a lot faster. Between CmdStanR and Apple’s M1 Max chip, I can do most of my jobs quickly locally rather than sending everything out to the high-performance cluster. HPC is still undeniably handy for running k-fold cross-validations, though.
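
For the Stan part of that, a minimal CmdStanR sketch looks something like this; the model file and the data list are hypothetical placeholders:

```r
library(cmdstanr)

# Compile a (hypothetical) Stan program
mod <- cmdstan_model("latent_variable_model.stan")

# Sample from it, running the four chains in parallel on local cores
fit <- mod$sample(
  data = stan_data,   # a named list matching the model's data block
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000
)

# Summarize the posterior draws
fit$summary()
```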

Dynamic Documents

Writing dynamic documents refers to having all of your statistical analysis and all of your writing together in a single document that re-runs your analyses every time you compile it, so that any changes in, say, your data are automatically reflected in your figures, tables, and even descriptions in the text. I’m a huge fan—it’s super efficient and, done well, it ensures your work is always reproducible. To take advantage of this, I moved from straight \(\LaTeX\) to Sweave, which embeds R code in \(\LaTeX\), and then switched to RMarkdown when it came out; as the name suggests, it embeds R code in the much simpler and more intuitive Markdown language. RStudio recently released Quarto, which is like a thoroughgoing revision and update of RMarkdown, and I’ll probably make that switch for some new project coming up soon.
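
To illustrate, a dynamic document is just prose and code chunks in one file. Here is a minimal Quarto-flavored sketch, with a hypothetical data file, in which the reported number of observations and the figure update automatically whenever the data change:

````markdown
---
title: "A Minimal Dynamic Document"
format: html
---

```{r, include=FALSE}
# Load and wrangle the data; this chunk runs but is not shown in the output
library(tidyverse)
dat <- read_csv("the_data.csv")  # hypothetical data file
```

The dataset includes `r nrow(dat)` observations.

```{r}
# A figure that is regenerated every time the document is compiled
ggplot(dat, aes(x = year, y = outcome)) +
  geom_point()
```
````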

My practice is to have just one document that starts by downloading the raw data, then does all the data wrangling, descriptive plots, analyses, results plots, and plotted simulations of quantities of interest right there, together with the text. Admittedly, using the HPC breaks this model a bit, and once in a while my coauthors overrule me, but this is how I vastly prefer to work: having everything in a single document ensures that everything fits together correctly and reproducibly.

Version Control

Version control is used to track changes in documents over time, particularly documents that are code-heavy. I use git and GitHub for this purpose. Like backups, version control is a very good thing, and because all of your work is in the cloud, it facilitates collaboration, too—I work with a team spread over eleven time zones, and GitHub makes it easy to have conversations about what we are doing and to keep our work integrated.3 RStudio makes all of this super easy; see Jenny Bryan’s Happy Git with R for how to get set up.
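
As a sketch of how easy RStudio and the usethis package make the setup (assuming you already have a GitHub account and credentials configured, which Happy Git with R walks through):

```r
library(usethis)

# Put the current RStudio project under git version control
use_git()

# Create a GitHub repository for it and push the initial commit
# (assumes a GitHub personal access token is already configured)
use_github()
```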

Large Dataset Storage

Sometimes I need to save datasets somewhere online so that they can be downloaded by my dynamic papers. That’s when I turn to the Open Science Framework (OSF) and start a public project for the paper. As long as the project is public, you can store even really big files there for free.
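
If you want to script the download side of this from inside a dynamic document, one option is the osfr package; here is a sketch with a hypothetical project id and file name:

```r
library(osfr)

# Retrieve a public OSF project by its id (a hypothetical id here)
project <- osf_retrieve_node("abc12")

# List the files stored in the project and download the one we need
files <- osf_ls_files(project)
osf_download(files[files$name == "big_dataset.csv", ])
```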

Slides

For a while, a decade or more ago now, I used Beamer to write my presentation slides in \(\LaTeX\), but I had always used Apple’s Keynote for teaching, and I eventually came back to the idea that Keynote was the way to go for everything. I love the fun things that xaringan made possible for RMarkdown slides, and Quarto now offers similar capabilities, and I sometimes wish for the ease of updating that comes with literate programming. But given that most of my presentation slide decks are used just once, the advantages of drag-and-drop image positioning and WYSIWYG formatting have won. For now. (Perhaps ironically, I used Quarto to write the slides I presented at the workshop, because it was so easy to trim down a copy of this blog post and have Quarto turn it into slides.)

Course Management

The University of Iowa is now a Canvas shop, and so I use Canvas for almost all of my teaching more or less by default. At the end of the semester, I use the rcanvas package in a script that downloads and combines all of the assignment grades into final grades and generates CSV files for uploading to the university’s grading system, so I don’t have to do all that data entry by hand.
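
That script is mostly ordinary tidyverse wrangling. Here is a rough sketch of the idea; the rcanvas call, the course id, and the gradebook column names are all assumptions to check against the package’s documentation and your own course:

```r
library(tidyverse)
library(rcanvas)

# Assumes the Canvas API domain and access token are already set up for rcanvas;
# the function name, course id, and column names below are assumptions to verify
gradebook <- get_course_gradebook(course_id = 12345)

# Sum each student's assignment scores into a course total
# (a real course will need its own weighting scheme)
final_grades <- gradebook %>%
  group_by(user_id) %>%
  summarize(course_total = sum(score, na.rm = TRUE)) %>%
  ungroup()

# Write a CSV for upload to the university's grading system
write_csv(final_grades, "final_grades.csv")
```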

I recorded all of my lectures in Panopto, but I sometimes think I should have just put everything on YouTube to avoid being locked into Panopto’s format—that format leaves me wondering how hard it will be to edit things (which I haven’t really tried) rather than just replacing an entire clip (which of course I’ve done). That was a decision made in the heat of the pandemic, so it may be revisited at some point. One trick that is part of my workflow for recording lectures comes from the vlogging/livestream world: always wear the same shirt. It makes you instantly recognizable—like a cartoon character—and builds comfort. (It also, potentially, makes edits less jarring.)

I have a graduate course on Computational Methods for Comparative Politics that I teach with GitHub Education, which I like a lot. I’d be tempted to move more of my teaching to that platform, but Canvas integration with the university’s enrollment system and so on wins out most of the time.

Websites

Having a website is a crucial part of getting yourself and your work known by the broader scholarly community. My very first website, from back when I was in graduate school, is recorded for posterity at the Internet Archive’s Wayback Machine, which is also a tool I end up using pretty often to prevent link rot in scripts and course assignments. Anyway, after using Jekyll and then blogdown, I now use Quarto to write my website and GitHub Pages to publish it.

Since I generate and publish a lot of data that I want people to discover, explore, and use, I include interactive graphics of these datasets on my websites. There are other options, of course, but I use Shiny to build these because it allows all the work to be done in R.
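
To give a sense of what I mean, here is a minimal Shiny sketch of such an interactive graphic: a dropdown that selects a country and a plot that redraws in response, with a hypothetical data file and columns.

```r
library(shiny)
library(tidyverse)

# Hypothetical dataset: one estimate per country-year
dat <- read_csv("estimates.csv")

ui <- fluidPage(
  selectInput("country", "Country:", choices = sort(unique(dat$country))),
  plotOutput("trend")
)

server <- function(input, output, session) {
  output$trend <- renderPlot({
    dat %>%
      filter(country == input$country) %>%
      ggplot(aes(x = year, y = estimate)) +
      geom_line()
  })
}

shinyApp(ui, server)
```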

Preprints

When I finish a draft and it’s ready to send out for review, I upload the paper to the Open Science Framework and submit it to SocArXiv at the same time. Having this preprint on SocArXiv makes the paper really easy for me to share (and for others to find and read and use) while it is under review. And after acceptance, it ensures that paywalls won’t block people from reading the work.

Replication Materials

Acceptance is also the time for me to get my replication materials online at the Harvard Dataverse. Many journals now require this, but even if you publish at a journal that doesn’t, it is good to do anyway: it increases the work’s visibility and its citations (King 1995). I also use the Dataverse to host the twice-annual updates to the SWIID.

Social Networking

For me, the last step is then to announce the publication on Twitter. Some people are really good at using Twitter to collect and test research ideas and intermediate products (e.g., Alice Evans)—real-time and continuous peer review—and you should consider trying that, too. It’s amazing. The closest I personally come to that, though, is tweeting about a presentation I gave.

Afterward

One of the best of the many excellent questions we were asked after our presentations was where to start. Maarten replied that the first step is having all of your work for a given project in a single directory and having a consistent structure that you use for all such directories. I agreed entirely—RStudio projects do that for you by default, but you have to use them—and added that next comes having automated backups and using reference management software. Other pieces can be added over time, as you find need for—and grow more comfortable with—the tools used for them.

References

Kastellec, Jonathan P., and Eduardo L. Leoni. 2007. “Using Graphs Instead of Tables in Political Science.” Perspectives on Politics 5(4): 755–71.
King, Gary. 1995. “Replication, Replication.” PS: Political Science and Politics 28(3): 444–52.
King, Gary, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science 44(2): 347–61.

Footnotes

  1. All my research is on GitHub, too, but I’ll come to that in a minute.

  2. If it’s a book, I use BibDesk’s search of the U.S. Library of Congress.

  3. In fairness, we also meet weekly on Zoom for a couple of hours.