### SWIID Version 8.0 is available!

##### Thursday, 28 February 2019

Version 8.0 of the SWIID is now available! In addition to important behind-the-scenes improvements to the estimation routine, this new release:

For more details, you can check out all the R and Stan code used to generate the estimates in the SWIID GitHub repository. As always, I encourage users of the SWIID to email me with their comments, questions, and suggestions.

### Using Cross-Validation to Evaluate the Comparability of the SWIID's Estimates

##### Wednesday, 16 January 2019

From its origins now over ten years ago, the goal of the Standardized World Income Inequality Database has been to provide estimates of income inequality for as many countries and years as possible while ensuring that these estimates are as comparable as the available data allow. That is to say, the SWIID’s first priority is breadth of coverage, and its second is comparability. The starting point for the SWIID estimates is a dataset with the complementary priorities: the Luxembourg Income Study, which aims to maximize comparability and, given that primary concern, to include as many countries and years as possible.1 Then the SWIID routine estimates the relationships between Gini indices based on the LIS and all of the other Ginis available for the same country-years, and it uses these relationships to estimate what the LIS Gini would be in country-years not included in the LIS but available from other sources.

How can we know if the SWIID’s approach works? In previous work, I provided the most stringent test I could come up with:2 I examined LIS data on country-years that had been included in previously-released versions of the SWIID. The results were reassuring in some ways–only seven percent of the differences between new LIS and old SWIID observations were statistically significant and larger than two Gini points, a far better record than that achieved by data carefully selected from the UNU-WIDER database or the All the Ginis dataset adjusted in accordance with its instructions–but less so in others. Most disappointingly, only 72% of the differences had 95% confidence intervals that included zero, suggesting that the SWIID’s standard errors were often too small. I’ve been working hard on the SWIID’s estimation routine to fix these issues since I conducted that test back in 2014, but the LIS doesn’t release new data frequently enough to allow for continuous testing of these revisions. So, instead, I’ve drawn on a technique developed in data science and machine learning, k-fold cross-validation, to assess the SWIID’s progress.

To see how k-fold cross-validation works, it helps to first understand the simpler form of cross-validation in which the available data are first divided into two groups of observations: the training set and the testing set. The model parameters are then estimated on only the training set. Finally, these results are used to predict the values of the testing set (that is, again, observations that were not used to estimate the model’s parameters). By comparing the model’s predictions against the test set, we avoid overfitting and get a good sense of how well the model performs in predicting other, as yet unknown, data.
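To make the training/testing split concrete, here is a minimal sketch in base R. It uses simulated data and a plain linear regression as a stand-in model; the variable names and the 80/20 split are assumptions chosen for illustration, not anything taken from the SWIID routine.

```r
# Simulated data; names and sizes are illustrative assumptions
set.seed(324)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

# Randomly assign 80% of the observations to the training set
train_rows <- sample(seq_len(n), size = round(0.8 * n))
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

# Estimate the model on the training set only
fit <- lm(y ~ x, data = train)

# Predict the held-out testing set and summarize the prediction error
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))
rmse
```

Because the testing observations play no part in estimating `fit`, the root mean squared error here reflects out-of-sample performance rather than in-sample fit.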

Still, that sense may be biased by the exact observations that happened to be assigned to each set. We can reduce this bias by performing the process repeatedly: this is k-fold cross-validation. The available data are divided into some number k groups. One at a time, each of the k groups is treated as the testing data, with all other groups forming the training data for estimating the model. The model’s performance is then evaluated by considering how well it predicts all of the groups, and because every observation is included in the testing data at some point, the process allows us to check whether and for which observations the model is doing particularly poorly.
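The same idea extends to k folds with a short loop. The sketch below again uses simulated data and a linear regression as a placeholder model, and the choice of k = 5 is an arbitrary assumption; it is meant only to show the mechanics of rotating each group through the testing role.

```r
# Simulated data; names and sizes are illustrative assumptions
set.seed(324)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

k <- 5
# Randomly assign each observation to one of k equal-sized folds
fold <- sample(rep(seq_len(k), length.out = n))

pred <- rep(NA_real_, n)
for (i in seq_len(k)) {
  held_out <- fold == i
  # Estimate on the other k - 1 folds, then predict the held-out fold
  fit <- lm(y ~ x, data = dat[!held_out, ])
  pred[held_out] <- predict(fit, newdata = dat[held_out, ])
}

# Every observation now has an out-of-sample prediction
errors <- dat$y - pred
cv_rmse <- sqrt(mean(errors^2))

# Row numbers of the five observations the model predicts worst
worst <- order(abs(errors), decreasing = TRUE)[1:5]
```

Since every row ends up with an out-of-sample prediction, `errors` can be inspected observation by observation to find where the model does particularly poorly.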

To provide a first assessment of the SWIID’s ability to predict the LIS, I randomly assigned the available LIS observations into groups of three, with an added check to ensure that no group included two observations from the same country.3 (Because the SWIID routine relies only on relationships observed within-country for the countries included in the LIS, the check that only a single observation from a country be assigned to the test data at a time means that the exact size of the group doesn’t really matter.)4 The figure below plots the difference between the SWIID prediction generated from this k-fold cross-validation and the LIS data for each country-year included in the LIS. Observations for which the 95% credible interval for this difference includes zero are gray; those for which it doesn’t are highlighted in blue.
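One way to build groups like these is rejection sampling: draw a candidate group at random and redraw it whenever it contains two observations from the same country. The sketch below is only an illustration of that idea on made-up data (six hypothetical countries labeled A to F); it is not the code used for the SWIID, and the rule for shrinking the last groups is my own assumption to guarantee the loop finishes.

```r
set.seed(324)
# Hypothetical stand-in for the LIS country-years
obs <- data.frame(
  country = rep(c("A", "B", "C", "D", "E", "F"), times = c(8, 6, 5, 4, 4, 3)),
  stringsAsFactors = FALSE
)
obs$group <- NA_integer_

group_size <- 3
g <- 0
while (anyNA(obs$group)) {
  remaining <- which(is.na(obs$group))
  # A group cannot be larger than the number of distinct countries left
  size <- min(group_size, length(unique(obs$country[remaining])))
  repeat {
    candidate <- remaining[sample.int(length(remaining), size)]
    # Reject any candidate group with two observations from one country
    if (!anyDuplicated(obs$country[candidate])) break
  }
  g <- g + 1
  obs$group[candidate] <- g
}

table(obs$group)
```

Larger groups get rejected more often, which is why generating them takes longer as group size grows.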

The results show that the SWIID does a very good job of predicting the LIS: the 95% credible interval for the difference between the two includes zero for 92% of these observations. The point estimates for these differences are generally small, with 86% less than 2 Gini points and 62% less than a single Gini point. It’s true that there are a few observations for which the estimated difference is quite large–on the far left of the plot, the SWIID routine underestimated the LIS Gini for Hungary in 1991 by 6 $$\pm$$ 4 points, and on the extreme right the SWIID routine overestimated that for Guatemala in 2014 by 7 $$\pm$$ 4 points–but there doesn’t really seem to me to be much pattern in which countries and years are estimated poorly.
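Summary figures of this kind can be computed from posterior draws of the differences. The following sketch simulates such draws, so the numbers it produces are not the SWIID results; the object names and the use of the posterior mean as the point estimate are assumptions made for illustration.

```r
set.seed(324)
n_obs <- 100
n_draws <- 1000

# Hypothetical posterior draws of (prediction - held-out value),
# one row per observation and one column per draw
centers <- rnorm(n_obs, mean = 0, sd = 1.5)
draws <- matrix(rnorm(n_obs * n_draws, mean = centers, sd = 1.5),
                nrow = n_obs)

# Point estimate and 95% credible interval for each observation
point <- rowMeans(draws)
ci <- t(apply(draws, 1, quantile, probs = c(0.025, 0.975)))

includes_zero <- ci[, 1] <= 0 & ci[, 2] >= 0
summary_stats <- c(
  coverage = mean(includes_zero),      # share of intervals containing zero
  within_2 = mean(abs(point) < 2),     # share of differences under 2 points
  within_1 = mean(abs(point) < 1)      # share of differences under 1 point
)
summary_stats
```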

This test, though, really only assesses how well the SWIID predicts LIS-comparable inequality figures in years without LIS data in the (now fifty) countries that are included in the LIS. We can get a better sense of how well the SWIID does predicting countries not covered by the LIS with another cross-validation that, one country at a time, excludes all of the LIS observations for that country. The results of just such a test are plotted below.
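In code, the only change from ordinary k-fold cross-validation is that the folds are defined by country rather than drawn at random. Here is a sketch with simulated data and a linear regression standing in for the model; the country labels and variables are hypothetical.

```r
set.seed(324)
# Simulated data: six hypothetical countries with ten observations each
dat <- data.frame(country = rep(LETTERS[1:6], each = 10), x = rnorm(60))
dat$y <- 30 + 3 * dat$x + rnorm(60)

pred <- rep(NA_real_, nrow(dat))
for (cc in unique(dat$country)) {
  held_out <- dat$country == cc
  # Estimate without any of this country's observations, then predict them all
  fit <- lm(y ~ x, data = dat[!held_out, ])
  pred[held_out] <- predict(fit, newdata = dat[held_out, ])
}

# Mean prediction error by country shows which countries are predicted poorly
by_country <- tapply(pred - dat$y, dat$country, mean)
by_country
```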

Overall, the plot looks very similar to the one above. With each country’s entire run of LIS data taking a turn being excluded, the 95% credible interval for the difference between the resulting SWIID estimate and the excluded LIS data contains zero 91% of the time. And here, too, most of the point estimates for these differences are small: 74% are less than 2 Gini points, and 52% are less than one Gini point.

This analysis, though, does point to a few rough spots in need of future attention. The first appears on the far left of the plot above. There we find that the largest difference is for the sole country-year for Egypt in the LIS–for 2012–which the SWIID routine underestimates by 16(!) $$\pm$$ 6 Gini points. Egypt is currently the only country in the LIS with just a single country-year observation; given that excluding the one observation is equivalent to excluding all of the country’s observations, I skipped omitting it in the first cross-validation. LIS researchers Checchi et al. report in a footnote that Egyptian income surveys before 2012 did not include any questions to capture self-employment income, and it’s also true that most of the available Ginis for Egypt are based on the distribution of consumption expenditure, which sometimes only loosely track those for the distribution of income (see, e.g., India). These factors, however, are present in many non-LIS countries as well, so I’ll continue working to come up with ways to improve the SWIID routine for such cases.

The second is that there are two other countries for which the 95% credible interval for the differences between the LIS data and the SWIID routine’s estimates for those countries when all of their LIS data are excluded does not contain zero in any of the country’s observations: Brazil and Peru. For Brazil, the cross-validation’s estimates of the country’s four LIS observations are all too high—by 2.5 $$\pm$$ 2.0 Gini points to 3.7 $$\pm$$ 2.1 Gini points. The cross-validation’s estimates for Peru’s four LIS observations, on the other hand, are all too low—by between 5.0 $$\pm$$ 2.9 and 5.4 $$\pm$$ 3.0 Gini points. So there is some room for improvement here too, and I’ll keep working on it also.

All in all, though, these k-fold cross-validation exercises show that the SWIID does a very good job of predicting the LIS, which inspires confidence that the SWIID is indeed maximizing the comparability of income inequality data across countries and over time.

## References

Checchi, Daniele, Andrej Cupak, Teresa Munzi, and Janet Gornick. 2018. “Empirical Challenges Comparing Inequality Across Countries: The Case of Middle-Income Countries from the LIS Database.” WIDER Working Paper 2018/149.

Solt, Frederick. 2016. “The Standardized World Income Inequality Database.” Social Science Quarterly 97(5): 1267–81.

1. Still, even for the LIS perfect comparability has given way to the desire to cover more middle-income countries. Teresa Munzi and Andrej Cupak recently wrote about the difficulties the LIS team encountered including middle-income countries due to the greater importance of non-monetary and self-employment income as well as the differences in direct taxation and social security contributions in these countries in comparison to high-income countries. Despite these issues, the LIS remains the most comparable income inequality data available.

2. For the initial kernel of this idea, I remain grateful to participants in the Expert Group Meeting on Reducing Inequalities in the Context of Sustainable Development, Department of Economic and Social Affairs, United Nations, New York, October 24–25, 2013.

3. The goal of this exercise is really to assess how well the SWIID works within the LIS countries, so Egypt 2012, the only LIS observation for that country, is excluded from the analysis. This is because holding out that observation makes Egypt a non-LIS country. What happens when the SWIID is used to predict all of a country’s LIS observations at once is discussed below.

4. My experiments with groups sized from just one observation each up to six observations each confirmed this. Three observations per group struck a nice balance between the time it takes to randomly generate the groups (which increases with group size because it becomes more likely for a group to be rejected for containing two observations from a single country) and the demand the work puts on UI’s high performance computing cluster (which increases with the number of groups–which is also the number of times the SWIID routine is re-run). I probably don’t really need to worry about the latter, but whatever—like I said, it doesn’t actually matter.

### How to Switch Your Workflow from Stata to R, One Bit at a Time

##### Wednesday, 15 August 2018

A recent exchange on Twitter reminded me of my switch to R from Stata. I’d started grad school in 1999, before R hit 1.0.0, so I’d been trained exclusively in Stata. By 2008, I had way more than the proverbial 10,000 in-seat hours in Stata, and I knew all the tricks to make it do just what I wanted. I was even Stata Corp.’s on-campus rep at my university. Still, I’d started dabbling in R. Then as now, there were specific things R could do that Stata couldn’t.1 But how to get those advantages without throwing out my hard-earned skills and starting over as a complete n00b? The answer was: a little bit at a time.

Fortunately, it’s not difficult to switch back and forth within a given project, so you can start bringing some R to your Stata-based workflow while leaving it mostly intact. Then, if and when you find yourself doing more in R than in Stata, you can flip and start using Stata from within R.

So, install R and let’s get you started.

## Running R from Stata

The trick to running R from within your do-file is first to save the data you want to pass to R, then call the .R file with the commands you want to run in R (the “R script”), then—if necessary—reload the R output into Stata.

While it’s also possible to use Stata’s shell command to run an R script (for illustrative purposes, let’s pretend it’s called my_script.R), Roger Newson’s rsource module makes it particularly easy. Install it as follows:

 ssc install rsource, replace

Unfortunately, the information rsource needs about your R installation is a bit different depending on your OS, but once installed, adding this platform-independent code to your do-file will run the script:

if "`c(os)'"=="MacOSX" | "`c(os)'"=="UNIX" {
    rsource using my_script.R, rpath("/usr/local/bin/R") roptions(`"--vanilla"')
}
else {  // windows
    rsource using my_script.R, rpath(`"c:\r\R-3.5.1\bin\Rterm.exe"') roptions(`"--vanilla"')  // change version number, if necessary
}

Of course, you could choose to skip the whole if-else and just include the line that runs on your machine, but that’s not doing any favors to your collaborators or anyone else trying to reproduce your results. You might also just prefer to specify the rpath and roptions in your profile do-file,2 but again, then you’ll need to let others know to do the same or they won’t be able to run your do-file.

Note, too, that if you don’t have much R code to run, it might be easiest to just keep it in your do-file rather than using a separate script. You can do this using the terminator option to rsource, though a downside to this approach is that it doesn’t allow you to if-else the rsource command by your OS. In the do-file below, I also use the regsave module to save my results to pass them to R; install it using ssc install regsave, replace.

clear
set more off

sysuse auto, clear
gen wt = weight/1000
regress mpg wt displacement foreign trunk headroom length
regsave using "~/Desktop/R_Stata/auto_results.dta", replace

rsource, terminator(END_OF_R) rpath("/usr/local/bin/R") roptions(`"--vanilla"')
// rsource, terminator(END_OF_R) rpath(`"c:\r\R-3.5.1\bin\Rterm.exe"') roptions(`"--vanilla"')  // use this line instead if you run a windows box

library(tidyverse);     # collection of all-around useful R packages
library(haven);         # for importing Stata datasets
library(dotwhisker);    # easy and beautiful regression plots, imho

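auto_results <- read_dta("~/Desktop/R_Stata/auto_results.dta") %>%   # load the results saved by regsave
    rename(term = var,
           estimate = coef,
           std.error = stderr) %>%
    filter(term != "_cons");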
dwplot(auto_results);
ggsave("~/Desktop/R_Stata/auto_results.png", width = 5, height = 4);

END_OF_R


## Running Stata from R

So maybe you’ve gotten to the point where you spend more of your time in R than in Stata, but there are still a few parts of your work that you just want (or need!) to keep in Stata. Running a do-file (my_do_file.do) from inside your R script is easy with Luca Braglia’s RStata package:

if (!require(RStata)) install.packages("RStata"); library(RStata) # this will install RStata if not already installed

stata("my_do_file.do",
stata.path = "/Applications/Stata/StataMP.app/Contents/MacOS/stata-mp", # yours probably differs: use the chooseStataBin() command on windows or linux machines; on Macs, right click on the Stata app, select "Show Package Contents", then see what's in the Contents/MacOS/ directory
stata.version = 13)  # again, specify what _you_ have

On this side as well, it’s possible to set the arguments just once, in your .Rprofile file. In my case, these two lines do the trick:

options("RStata.StataPath" = "/Applications/Stata/StataMP.app/Contents/MacOS/stata-mp")
options("RStata.StataVersion" = 13)

Since Stata isn’t free and open-source, it’s even more likely that others will have different setups anyway, so this may make the most sense. Be sure to comment your code to clue people in, though.

If you just want to use a single Stata command RStata::stata3 will do that for you, too, with no need for a do-file. From the RStata package documentation:

library("RStata")
# remember to set RStata.StataPath & RStata.StataVersion in your .Rprofile first!  See https://www.rdocumentation.org/packages/RStata/

## Data input to Stata
x <- data.frame(a = rnorm(3), b = letters[1:3])
stata("sum a", data.in = x)                         
## . sum a
##
##     Variable |       Obs        Mean    Std. Dev.       Min        Max
## -------------+--------------------------------------------------------
##            a |         3    .0086477    .9228345  -1.026655   .7447832
## Data output from Stata (e.g., obtain 'auto' dataset)
auto <- stata("sysuse auto", data.out = TRUE)
## . sysuse auto
## (1978 Automobile Data)
head(auto)
##            make price mpg rep78 headroom trunk weight length turn
## 1   AMC Concord  4099  22     3      2.5    11   2930    186   40
## 2     AMC Pacer  4749  17     3      3.0    11   3350    173   40
## 3    AMC Spirit  3799  22    NA      3.0    12   2640    168   35
## 4 Buick Century  4816  20     3      4.5    16   3250    196   40
## 5 Buick Electra  7827  15     4      4.0    20   4080    222   43
## 6 Buick LeSabre  5788  18     3      4.0    21   3670    218   43
##   displacement gear_ratio  foreign
## 1          121       3.58 Domestic
## 2          258       2.53 Domestic
## 3          121       3.08 Domestic
## 4          196       2.93 Domestic
## 5          350       2.41 Domestic
## 6          231       2.73 Domestic
## Data input/output
(y <- stata("replace a = 2", data.in = x, data.out = TRUE))
## . replace a = 2
## (3 real changes made)
##   a b
## 1 2 a
## 2 2 b
## 3 2 c

And you can embed several Stata commands in your R code as well:

data <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
stata("
sum y x1 x2
reg y x1 x2
", data.in = data)
## .
## .     sum y x1 x2
##
##     Variable |       Obs        Mean    Std. Dev.       Min        Max
## -------------+--------------------------------------------------------
##            y |       100    .0613503      .96479  -2.129139   3.194052
##           x1 |       100   -.0912194    1.035336  -3.108819   2.063537
##           x2 |       100   -.2148564    1.023588  -2.659547   2.485997
## .     reg y x1 x2
##
##       Source |       SS       df       MS              Number of obs =     100
## -------------+------------------------------           F(  2,    97) =    0.46
##        Model |  .866654914     2  .433327457           Prob > F      =  0.6324
##     Residual |  91.2844951    97  .941077269           R-squared     =  0.0094
## -------------+------------------------------           Adj R-squared = -0.0110
##        Total |  92.1511501    99  .930819697           Root MSE      =  .97009
##
## ------------------------------------------------------------------------------
##            y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##           x1 |   .0147578   .0946682     0.16   0.876    -.1731324     .202648
##           x2 |  -.0917222   .0957548    -0.96   0.341    -.2817691    .0983246
##        _cons |   .0429894    .099359     0.43   0.666    -.1542107    .2401896
## ------------------------------------------------------------------------------
## .

## Summing Up

Moving parts of your work from Stata to R is totally feasible. Lots of people (for example, in the thread that touched this post off, Steve Rodgers) really want to take advantage of the superior graphics capabilities of R, especially the ggplot ecosystem, even while sticking to Stata for most of their work. Once your feet are wet, you may then decide R’s many other benefits (the free part, the super-helpful community, the transferable job skills you can teach your students, the free part, the cutting-edge stuff available years before it’s in Stata, the way RStudio makes it dead easy to do reproducible research through dynamic documents and version control, and, once again, the free part) make switching over all the way to be worth the additional marginal effort. Or you may not.

I completed the transition in three or four years, at my own pace: when I felt comfortable moving another chunk of my workflow over to R, I did, but not before. If I were doing it over right now, with the tidyverse packages dramatically reducing the slope of the learning curve, I might move faster, but there’s no rush, really. Do what works for you.

• This post by John Ricco describing how to translate Stata data cleaning commands to the dplyr idiom will likely be helpful to those new to tidyverse-style R and wanting to move quickly.
• Matthieu Gomez’s R for Stata Users is a more detailed phrasebook that will also be useful to new switchers (H/T Arthur Yip).4
• I also ran across the Rcall package while writing this up, but I haven’t tried it. You may find it useful.
• OTOH, these 2010 slides by Oscar Torres-Reyna were definitely useful to me back in the day, but as they pre-date both the tidyverse and RStudio—the wonders of which really cannot be overstated—they’re now more likely to cause you unnecessary confusion than help you if you’re a new switcher. Better to steer clear.
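• Great complete treatments on how to do stuff in R: Winston Chang’s Cookbook for R, Chester Ismay and Albert Y. Kim’s Modern Dive, and Hadley Wickham and Garrett Grolemund’s R for Data Science (all listed in the References below).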
• RStudio’s Cheat Sheets are also great references.
• When you’re ready to take the step to using R more than Stata, you’ll want to get fully set up on RStudio, which provides a front end for running R and can integrate with git and GitHub for version control (you will want this). The best resource that I’ve found for this process is Jenny Bryan’s Happy Git and GitHub for the useR.
• The R community on StackOverflow is full of helpful people. As your Google-fu develops, you’ll find that links to StackOverflow are most likely to get you where you need to go.
• There are so many fantastic #rstats (dozens? hundreds?) follows on Twitter. With apologies to the—seriously—hundreds of others who’ve taught me tons of stuff over the years, I’m going to grit my teeth and rec just five to get you started: Mara Averick, Jenny Bryan, David Robinson, Julia Silge, and Hadley Wickham.

## References

Bryan, Jenny. 2018. “Happy Git and GitHub for the useR.” http://happygitwithr.com/.

Chang, Winston. “Cookbook for R.” http://www.cookbook-r.com.

Ismay, Chester, and Albert Y. Kim. 2018. “Modern Dive: An Introduction to Statistical and Data Sciences via R.” https://moderndive.com/.

Kastellec, Jonathan P., and Eduardo L. Leoni. 2007. “Using Graphs Instead of Tables in Political Science.” Perspectives on Politics 5(4): 755–71.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. O’Reilly. http://r4ds.had.co.nz.

1. Then, for me, it was multiple imputation, parallel computation, and the dot-and-whisker plots of regression coefficients introduced to political science by Kastellec and Leoni (2007). On this last one, see also the dotwhisker package. Now my list is different, but even longer. That’s not what I want to get into in this post, though. This post is how, not why.

2. See the technical note to the help file for rsource for details.

3. In the argot (heh), this means the stata command in the RStata package.

4. Arthur also recommends vikjam’s Mostly Harmless Replication, which replicates most of the figures and tables of Mostly Harmless Econometrics in both Stata and R (and many in Python and Julia as well). Though not intended as a guide for switchers, the site will be helpful to fans of the book looking for ways to implement its advice in R.

### SWIID Version 7.1 is available!

##### Tuesday, 14 August 2018

Version 7.1 of the SWIID is now available! In addition to important behind-the-scenes improvements to the estimation routine, this new release:

For more details, you can check out all the R and Stan code used to generate the estimates in the SWIID GitHub repository. As always, I encourage users of the SWIID to email me with their comments, questions, and suggestions.

### SWIID Version 6.2 is available!

##### Monday, 26 March 2018

Version 6.2 of the SWIID is now available! Building on the end-to-end revision accomplished in Version 6.0 last July and the update Version 6.1 last October, this new release:

For more details, you can check out all the R and Stan code used to generate the estimates in the SWIID GitHub repository. As always, I encourage users of the SWIID to email me with their comments, questions, and suggestions.

### SWIID Version 6.1 is available!

##### Friday, 27 October 2017

Version 6.1 of the SWIID is now available! Building on the end-to-end revision accomplished in Version 6.0 last July, this new release:

For more details, you can check out all the R and Stan code used to generate the estimates in the SWIID GitHub repository. As always, I encourage users of the SWIID to email me with their comments, questions, and suggestions.

### The SWIID Source Data

##### Friday, 28 July 2017

Saturday, 9 February 2019: Updated with information on the source data for SWIID Version 8.0

I have been producing the Standardized World Income Inequality Database for nearly a decade. Since 2008, the SWIID has provided estimates of the Gini index of income inequality1 for as many countries and years as possible and—given the primary goal of maximizing spatial and temporal coverage—these estimates are as comparable as the available data allow. The dataset has been used widely by academics, journalists, and policymakers. It’s been successful way beyond all my hopes.2 I’ve been adding to it, revising it, improving on it pretty much the entire time since its launch. Now, with the support of the NSF, I am scrapping all of that work and starting fresh. From scratch.

This is the first in a series of posts on how I did it. It focuses on an unheralded but foundational part of the SWIID project, the source data. The basic idea behind the SWIID is to start with the data that has been most carefully picked over to ensure its utmost cross-national comparability: the data of the fantastic Luxembourg Income Study. I’ve heard that generating a single country-year of LIS data takes an average of ten person-months of work. That’s dedication to comparability. ❤️ But the flipside of maximizing comparability is that the LIS’s coverage is pretty sparse: at last count, it includes just 351 country-years.3 To address this weakness, the SWIID routine estimates the relationships between Gini indices based on the LIS and all of the other Ginis available for the same country-years, then uses these relationships to estimate what the LIS Gini would be in country-years not included in the LIS but available from other sources.4 The critical first step to making this work is getting a lot of other, non-LIS Ginis. I call these other Ginis the SWIID’s source data. Over the years, I and my research assistants built up a big spreadsheet of data collected from international organizations, national statistical offices, and scholarly books and articles. But it seemed like whenever I checked over these source data, I would find that at least a few figures had been recently revised, or their source was seemingly no longer available, or (worst of all) they evidently had been entered incorrectly. So again: it’s time to start over from scratch.

To be included in the SWIID’s source data, observations need to encompass the entire population of a country without regard to age, location,5 or employment status.6 They need to have an identifiable welfare definition and equivalence scale (more on these below). Finally, because I want to be able to direct users to sources they can easily peruse themselves, observations need to be available online.7

Hand-entering data is tedious and error-prone work, so I automated as much of the process of data collection as practicable. Most international organizations and a few national statistical offices use APIs that facilitate downloading their data, and often the #rstats community has built R packages using these APIs to make the task even easier. I took as much advantage of these resources as possible.8 In the absence of an API, I scripted downloads of any available spreadsheets, preferring clean CSV files to Excel-formatted ones. If there was no spreadsheet, but data were available in PDF files, I automated downloading these files and then used Thomas Leeper’s tabulizer package to read the tables into R. In the absence of a file to download, I scripted the process of scraping the data from the web. Still, for a variety of reasons, a source’s data may have been consigned to being entered in a separate spreadsheet. Many sources contain just a handful or fewer observations, making the payoff to the often laborious process of data cleaning too small to justify the effort. Some sources, including most academic articles, are behind paywalls, making reproducibility a hassle anyway (though I still often used tabulizer to read the data from the PDF before cutting-and-pasting it into the spreadsheet). Finally, at least one source contains crucial information encoded in the typeface(!!) of its tables, information lost when the tables are scanned into R. All of the entries in this spreadsheet were checked repeatedly for errors,9 and I excluded repeated reports of the exact same observation from different sources. In the end, I was able to automate the collection of more than three quarters of the source data and a much higher percentage of the series that will be updated or are subject to revision, which makes it easier to incorporate these changes in future versions.

The resulting dataset comprises 15730(!) Gini coefficients from 2984 country-years in 196 countries or territories, making the coverage of the SWIID source data broader than that of any other income inequality dataset. This isn’t surprising given that, with the exceptions of the WIID (which, since it provides no original data, isn’t drawn on at all anymore) and the All the Ginis database (which provides little original data, and so isn’t drawn on much), the SWIID source data incorporates all of the data in these other datasets.

So, let’s check out what the source data look like. There is much more data available about the income distribution in some countries than in others. Which countries are most data-rich? The plot below shows the top dozen countries by the count of observations. Canada, by virtue of the excellent Statistics Canada as well as longstanding membership in the OECD and LIS, has 775 observations, many more than any other country. The United Kingdom, Germany, and the United States are next, followed by an interesting mix of countries from around the world with not surprisingly a sizable European representation. All are members of the LIS. On the other hand, eleven countries have only a single observation.

As we’ll see in later posts in this series, observations for the same country in the same year, but with different welfare definitions and equivalence scales or from different sources, are important to generating the SWIID’s cross-nationally comparable estimates. Still, we might be interested to know which countries have the most coverage of the years in the SWIID’s current 58-year timeframe, from 1960 to 2018, because the SWIID’s inequality estimates for countries with fewer country-year observations will include more interpolated values, which in turn will have more uncertainty.

The source data includes observations for Sweden and the United Kingdom in all but one of these years and for the United States in all but five. Iran and Argentina—two countries not included in the LIS—make the top 12, with 43 and 42 country-year observations respectively. The median country has observations in 16 different country-years.

We can also get a sense of the available inequality data by turning the question around and asking about coverage across countries over time. There are observations for 123 countries in 2005. Coverage is relatively good in the years from 2000 to 2016, at least 80 countries per year, before dropping to 43 countries for 2017 and just 3 for 2018. Country coverage is pretty thin each year through the 1960s and 1970s and still isn’t all that great until the late 1980s.10
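Coverage counts like these come from collapsing the source data to distinct country-years and then counting along each dimension. The sketch below does this in base R on a tiny made-up data frame; the column names and values are hypothetical and do not reflect the actual layout of the SWIID source data.

```r
# Toy source data: several Ginis can share the same country-year
source_data <- data.frame(
  country = c("A", "A", "A", "B", "B", "C"),
  year    = c(2000, 2000, 2001, 1990, 1995, 2000),
  gini    = c(31.2, 33.5, 31.0, 45.1, 44.0, 38.7)
)

# Keep one row per distinct country-year
country_years <- unique(source_data[, c("country", "year")])

# Years covered per country, and countries covered per year
years_per_country <- lengths(split(country_years$year, country_years$country))
countries_per_year <- lengths(split(country_years$country, country_years$year))

years_per_country
countries_per_year
median(years_per_country)
```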

Earlier I mentioned that to be included in the SWIID source data observations need to have an identifiable welfare definition and equivalence scale. A welfare definition is an answer to the question, this Gini measures the distribution of what? The four welfare definitions employed in the SWIID source data are market income, gross income, disposable income, and consumption. Market income is defined as the amount of money coming into the household, excluding any government cash or near-cash benefits, the so-called ‘pre-tax, pre-transfer income.’11 Gross income is the sum of market income and government transfer payments; it is ‘pre-tax, post-transfer income.’ Disposable income, in turn, is gross income minus direct taxes: ‘post-tax, post-transfer income.’12 Consumption does not refer to the money coming into the household at all but rather to the money going out.13 In the source data, Ginis of disposable income are much more common than those using other welfare definitions.
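The three income-based definitions build on one another arithmetically, which a few lines of R can show. The households and amounts below are invented for illustration only.

```r
# Three hypothetical households (amounts in an arbitrary currency)
household <- data.frame(
  market_income = c(40000, 0, 85000),   # pre-tax, pre-transfer
  transfers     = c(2000, 12000, 0),    # government cash or near-cash benefits
  direct_taxes  = c(6000, 0, 20000)
)

# Gross income: pre-tax, post-transfer
household$gross_income <- household$market_income + household$transfers

# Disposable income: post-tax, post-transfer
household$disposable_income <- household$gross_income - household$direct_taxes

household
```

Consumption is deliberately left out of this sketch: it measures money going out of the household, so it cannot be derived from the income columns.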

Equivalence scales are the ways in which the size and composition of a household are incorporated into the calculation of its members’ welfare. On the one hand, these factors can simply be ignored, with all households with the same amount of income or consumption treated as if they enjoy the same level of welfare, regardless of their size. One can improve on this household ‘scale’14 by dividing the household’s income by its number of members, that is, by using a per capita scale. But a household of two members and an income of $100,000 is better off than one with a single member and $50,000 due to economies of scale—that’s a big reason why people look for roommates. There are a variety of ways to try to account for these economies by calculating the number of “equivalent adults” in the household. Of the most commonly used adult-equivalent scales, the square-root scale is the most straightforward: one simply divides the household income by the square root of the number of members. The “OECD-modified” scale for the number of adult equivalents (which the OECD itself actually never used) counts the first adult as 1, all other adults as 0.5, and each child as 0.3. And there are plenty of other adult-equivalent scales, from the “old OECD” scale (1 for the first adult, 0.7 for each additional adult, and 0.5 for each child) to caloric-requirement-based scales (which are actually very nearly per capita, as it turns out) to a number of country-specific scales. In previous versions of the SWIID, all adult-equivalent scales were considered a single category. Now, the square-root scale and the OECD-modified scale have both been split out, leaving the remaining catch-all adult-equivalent category much smaller.
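To make the scales described above concrete, here is a minimal Python sketch that applies each one to the two households from the roommates example. The function name `equivalized_income`, its arguments, and the scale labels are my own illustrative choices, not names from the SWIID source data, and the weights are simply the ones given in the paragraph above.

```python
import math


def equivalized_income(income, adults, children, scale):
    """Return household income divided by the chosen equivalence scale.

    All names here are hypothetical; the weights follow the text above.
    """
    if adults < 1:
        raise ValueError("a household needs at least one adult")
    members = adults + children
    if scale == "household":        # size and composition ignored
        divisor = 1.0
    elif scale == "per_capita":     # divide by the number of members
        divisor = float(members)
    elif scale == "square_root":    # divide by the square root of household size
        divisor = math.sqrt(members)
    elif scale == "oecd_modified":  # 1 first adult, 0.5 other adults, 0.3 per child
        divisor = 1.0 + 0.5 * (adults - 1) + 0.3 * children
    elif scale == "old_oecd":       # 1 first adult, 0.7 other adults, 0.5 per child
        divisor = 1.0 + 0.7 * (adults - 1) + 0.5 * children
    else:
        raise ValueError(f"unknown scale: {scale}")
    return income / divisor


scales = ("household", "per_capita", "square_root", "oecd_modified", "old_oecd")
for scale in scales:
    couple = equivalized_income(100_000, adults=2, children=0, scale=scale)
    single = equivalized_income(50_000, adults=1, children=0, scale=scale)
    print(f"{scale:>13}: couple {couple:>10,.0f}   single {single:>10,.0f}")
```

Under the per capita scale the two households come out identical, while each of the adult-equivalent scales rates the two-person household as better off, which is the economies-of-scale point made above.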

Differences in the welfare definition and the equivalence scale employed constitute the biggest source of incomparability across observations in the source data, and all twenty of the possible combinations are represented. I’ll take up how we get from these incomparable observations to the SWIID estimates in the next post. In the meantime, if you’d like to see the source data, you can download it from here.

# References

Jesuit, David K., and Vincent A. Mahler. 2010. “Comparing Government Redistribution Across Countries: The Problem of Second-Order Effects.” Social Science Quarterly 91(5): 1390–1404.

Morgan, Jana, and Nathan J. Kelly. 2013. “Market Inequality and Redistribution in Latin America and the Caribbean.” Journal of Politics 75(3): 672–85.

Solt, Frederick. 2016. “The Standardized World Income Inequality Database.” Social Science Quarterly 97(5): 1267–81.

1. I think the clearest explanation of the Gini index is that it is half the average difference in income between all pairs of units—say, households—as a percentage of the mean income of those units. Okay, I said “clearest,” not necessarily “clear.” Anyway, it has a theoretical range of 0 (all households have the same income) to 100 (one household has all the income and the rest have none), but Ginis below 20 or above 60 are rare in the real world. There are good reasons to prefer other measures of inequality, and there are many options, but the Gini is by far the most widely available.

2. At the time, those hopes were admittedly concerned mostly with getting #Reviewer2 off my back so I could publish a series of manuscripts I had on how the context of inequality is related to people’s political attitudes.

3. Which is what #Reviewer2 always complained about. R2: Shouldn’t you include Ruritania and Megalomania in your sample, given the broad applicability of your theory? Me: Yes, sure, but like I wrote in the paper, there’s no LIS data for those countries, and the other available data just isn’t comparable. R2: Well then, I recommend rejection. Me: Grr.

4. If you’re thinking, “hey, multiple imputation for missing data,” cool, that’s what I was thinking too. If you’re thinking of poll aggregators and house effects, yep, it’s very similar. If you’re thinking of inequality as a latent variable, with a number of indicators of varying discrimination, that also works. If you’re thinking you need to look at some cat gifs right about meow, click here.

5. The requirement for complete territorial coverage was relaxed for minor deviations such as data on Portugal that excludes Madeira and the Azores. It was relaxed somewhat further for early series that covered only the urban population of three highly urbanized countries: Uruguay, Argentina, and South Korea. The general rule, however, is that data is excluded if it measures the income distribution of only urban or rural populations, or of only selected cities, or some other such incomplete territory.

6. This last requirement is new; it means nearly 600 observations on the distribution of wages across employed individuals that were included in the source data of previous versions of the SWIID are now excluded. Between the lack of information on those out of the workforce and on how workers formed households, these data weren’t very strongly related to the LIS anyway.

7. For scholarly articles, I preferred DOIs or JSTOR stable URLs, but if those were unavailable I used the publisher website or another repository. For books, I provide the link to the relevant page in Google Books. There were two books that I decided I had to include for which Google Books wouldn’t show the relevant pages (at least not to me); in those two cases, the links I provide just go to the entire volume. I confirmed that the cited pages can be found using Amazon’s “Look Inside” feature, so I consider my “must be available online” rule only badly bent rather than completely broken.

8. Although the sources with APIs were relatively few, they contained the most data: nearly half of the observations were collected this way.

9. Which, of course, is not to say that they are error-free. If you spot any problems, or better still, know of sources I might have missed, please let me know!

10. This is partly a result of my decision to insist on sources that are available online, but it’s just as well: so little information is available about many of the so-excluded observations on that era that I find it hard to have much confidence in them.

11. It’s important, though, to not think of the distribution of market income as ‘pre-government.’ Beyond taxes and transfers, governments seeking to shape the distribution of income have a wide array of ‘market-conditioning’ or ‘predistribution’ policy options, with minimum wage regulation and labor policy two obvious examples (see, e.g., Morgan and Kelly 2013). Further, even taxes and transfers can profoundly shape the distribution of market income through ‘second-order effects.’ Where robust public pension programs exist, for example, people save less for retirement, leaving many of the elderly without market income in old age and so raising the level of market-income inequality (see, e.g., Jesuit and Mahler 2010).

12. Note that disposable income still does not take into account, on the one hand, indirect taxes like sales taxes and VAT, or, on the other, public services and indirect government transfers such as price subsidies. There is very little information available about the distribution of such ‘final income,’ pretty much only that generated by the Commitment to Equity Institute, so I exclude it from the SWIID source data at least for the time being.

13. In previous versions of the SWIID, market and gross income were treated as a single welfare definition, and I am glad to finally be able to split them apart (cf. Solt 2016, 1272). The consumption welfare definition might now be the most heterogeneous within the SWIID source data, varying considerably in whether and how observations treat expenditures on durable goods. Another source of differences within a single welfare definition is the extent to which nonmonetary income—such as the value of food grown for the household’s own consumption or of housing that the owner occupies—is included. The SWIID source data include the variable monetary that indicates whether any nonmonetary income is taken into account, but at present this information is not incorporated into the classification of welfare definitions.

14. Scare quotes because, strictly speaking, nothing is being scaled at all; it’s simply treating the household as the unit of analysis.

### SWIID Version 6.0 is available!

##### Thursday, 27 July 2017

Version 6.0 of the SWIID is now available! It represents a complete, starting from scratch, end-to-end revision, with all the heavy lifting now done using #rstats and Stan.

As always, I encourage users of the SWIID to email me with their comments, questions, and suggestions.

### Notes for Those New to Writing Peer Reviews

##### Friday, 14 April 2017

Today we had a workshop for our graduate students on writing peer reviews. Here are the notes I spoke from:

I get asked to do a lot of reviews. At the beginning of this semester, I got seven requests within two or three weeks. I used to always say yes, but doing 35 or 40 reviews a year just took too much time. When I was first starting out, I’d take something like six or eight hours on each review, though that pretty quickly got down to four or so. Nowadays it might even be a touch less, spread over two days. I like to give the paper a close read on one day, while taking notes and maybe doing a bit of research. Then the next day, I write up my review, after my thoughts have had a chance to percolate. Anyway, now I have a two-per-month rule to protect my time, though I sometimes break it: I took four out of those seven requests back in January.

I always start my reviews with a quick summary of the piece, but as reviewers, our focus should be on theory, data, and method. For the big-three journals, the old saw is that the standard is “new theory, new data, new method—choose any two,” but regardless of the journal that has asked you to review, for a work to make a contribution, it has to be sound—not new, not super-great, just sound—on all three. Here are a couple of quick notes on each, mostly of the points I find myself most often making:

1. Theory: if you think that the authors1 have overlooked some plausible rival theory, be sure to explain and include specific citations. You don’t have to have a full bibliographic entry; author and year are probably enough, though I usually throw the journal abbreviation in too just to be sure. Reviews aren’t the place to develop your own new rival theory. If you’re really tempted to do so, plan instead on responding to this paper when it comes out in print.

2. Data: do the authors take advantage of all available data? Probably not—we can’t all look at everything all the time—but if they’ve neglected obvious things: using, for example, just that oddball third wave of the WVS instead of all the waves, or if they have very little data and you know of other sources they can draw on, say so. Of course, if they use some source and you know that there’s other, better data available, point that out to them.

3. Methods: First, are the methods appropriate? In answering this, you have to judge the methods on their own terms: NOT, oh, this study uses survey data, so tells us nothing about causality! OR this study just reports an experiment, so it has no external validity!

4. Note what you’re NOT evaluating: the results themselves. Don’t filter on statistical significance: we need to avoid contributing to publication bias and the pressure way too many people apparently feel to p-hack their way to publication. And this should go without saying, but be sure to check your own presuppositions about what the results ‘should’ show at the door.

• Nor the question asked. Don’t suggest that authors “reframe” their work around some similar (or not so similar) question. Don’t say that the question just isn’t important enough for the AJPS.3 If you’ve been in my classes, you’ve probably had me push you to ask important questions; you know I totally think that’s a big deal. But as a reviewer, as Justin Esarey argued in the TPM Special Issue on Peer Review, deciding whether the question asked was sufficiently important for publication isn’t your job. That’s for the editor maybe, but really it is for us all as a discipline, as readers.

• Nor typos, grammar, or citation formatting. If it’s really, really bad, I’ll point out that it’s something the author should be sure to work on. But don’t send in a bunch of line edits. I will always note if I see that cited works are not included in the bibliography. BibTeX is your friend, people!

1 I’ve settled on always writing reviews with the assumption that the piece is co-authored and that the appropriate pronoun is therefore “they.”

2 This is point number one on Brendan Nyhan’s “Checklist Manifesto for Peer Review” in The Political Methodologist’s Special Issue on Peer Review. Read the whole issue!

3 OTOH, you should give people credit when they take on hard questions with less-than-ideal data and methods if those data and methods are (approximately) the best available.

4 Not that jerk, #Reviewer2. In addition to checking out #BeReviewer1 on Twitter, you should also be sure to read Thomas Leeper’s manifesto that started it all.

### SWIID Version 5.1 is available!

##### Thursday, 21 July 2016

Version 5.1 of the SWIID is now available! It revises and updates the SWIID’s source data and estimates. It also includes expanded training modules explaining how to take into account the uncertainty in the estimates in both R and Stata.

As always, I encourage users of the SWIID to email me with their comments, questions, and suggestions.

##### Friday, 20 May 2016

If you download data from the Inter-university Consortium for Political and Social Research archive, you can make your research reproducible with the icpsrdata package, now available on CRAN.

### Try pewdata!

##### Friday, 13 May 2016

If you use Pew Research Center surveys, you can make your research reproducible with the pewdata package now available on CRAN.

### SWIID Wins NSF Support!

##### Monday, 10 August 2015

The National Science Foundation has awarded three years of support to update and improve the SWIID! Yay!

### Use dotwhisker for your APSA slides!

##### Thursday, 30 July 2015

With the APSA coming up, and in the interest of minimizing the number of times we hear “sorry, I know you won’t really be able to see these regression coefficients,” I thought I’d point R users to dotwhisker, a package UI Ph.D. student Yue HU and I just published to CRAN. dotwhisker makes regression plots in the style of Kastellec and Leoni’s (2007) Perspectives article quick and easy: after data entry, just two lines of R code produced the easy-to-read-even-from-the-back-of-the-room plot attached to this post. I hope you’ll find it useful, and if you have any suggestions for us, that you’ll file an issue at https://github.com/fsolt/dotwhisker, tweet to me @fredericksolt, or just send me an email.

### Now on CRAN: interplot

##### Friday, 26 June 2015

Hu Yue and I just published interplot on CRAN, our first R package. interplot makes graphing the coefficients of variables in interaction terms easy. It outputs ggplot objects, so further customization is simple. Check out the vignette and give it a try!

### Inequality in China

##### Friday, 27 March 2015

A new working paper by IMF researchers Serhan Cevik and Carolina Correa-Caro observes that sharply rising inequality has made China one of the most unequal countries in the world. Here’s a graph of SWIIDv5.0 data that illustrates their point.

### SWIID Version 5.0 is available!

##### Thursday, 2 October 2014

Version 5.0 of the SWIID is now available, and it is a major update. A new article of record (currently available as a working paper while under peer review) reviews the problem of comparability in cross-national income inequality data, explains how the SWIID addresses the issue, assesses the SWIID’s performance in comparison to the available alternatives, and explains how to use the SWIID data in cross-national analyses.

The new version also marks the debut of the SWIID web application. The web application allows users to graph the SWIID estimates of any of net-income inequality, market-income inequality, relative redistribution, or absolute redistribution in as many as four countries or to compare these measures within a single country. Its output can be downloaded with a click for use in reports or articles. I hope that it will be of particular value to policymakers, journalists, students, and others who need to make straightforward comparisons of levels and trends in income inequality.

As always, I encourage users of the SWIID to email me with their comments, questions, and suggestions.

### SWIID Version 4.0 is available!

##### Monday, 30 September 2013

Version 4.0 of the SWIID is now available here. Drawing on nearly 14,000 Gini observations in more than 3100 country-years, this version provides even better estimates of income inequality in countries around the world than in previous versions.

This version introduces two other improvements. First, many users have had trouble making appropriate use of the standard errors associated with the SWIID estimates. The uncertainty, however, can sometimes be substantial, making it crucial to incorporate in one’s analyses. Fortunately, there are now tools in Stata and R that make it quite straightforward to analyze data that is measured with error, and this version of the SWIID includes files that are pre-formatted for use with these tools. The file “Using the SWIID.pdf”, which is also included in the data download, explains how. Some additional examples of using the SWIID with Stata’s mi estimate command prefix can be found towards the end of the slides posted here.

Second, I’ve received several requests for measures of top income share, so in this version I am including estimates of the top 1 percent’s share (the variable share1), standardized to the data provided in the World Top Incomes Database: Country-years included in that dataset are reproduced without modification in the SWIID, and comparable figures for other country-years are estimated using the SWIID’s custom multiple-imputation algorithm. Like all inequality datasets, Top Incomes has tradeoffs—among other things, the share of pre-tax, pre-transfer income reported on tax returns by the richest filers may not be of much theoretical interest to many investigators—but the additional estimates the SWIID provides may prove to be useful to some.

I encourage users of the SWIID to email me with their comments, questions, and suggestions.

### My talk at the UN

##### Sunday, 29 September 2013

Earlier this month, I gave a talk previewing Version 4.0 of the SWIID to the Development Policy and Analysis Division of the United Nations’ Department of Economic and Social Affairs. I had some great conversations and got lots of useful feedback. Slides for the talk can be found here.

### SWIID Version 3.1 now available!

##### Monday, 2 January 2012

Version 3.1 of the SWIID is now available here. The primary difference introduced in Version 3.1 is that the data on which the SWIID is based have again been expanded. Now nearly 4500 Gini observations are added to those collected in the UNU-WIDER data, and for many countries the available data extend to 2010. Also, I made one semantic change: to try to avoid confusion among those who neglect to read about the data they use, the series on pre-tax, pre-transfer inequality is now labeled gini_market rather than gini_gross. Otherwise, very small revisions were made to the SWIID routine from Version 3.0. As always, I encourage users of the SWIID to email me with their comments, questions, and suggestions.

### SWIID Version 3.0 is now available!

##### Sunday, 11 July 2010

Version 3.0 of the SWIID is now available, with expanded coverage and improved estimates.

The data on which the SWIID is based have been expanded. I have collected another 2100 Gini observations (in addition to the 1500 added in v2.0), again with special attention to addressing the thinner spots in the WIID. As before, these data are available in the replication materials for those who are interested. Major sources for these data include the World Bank’s Povcalnet, the Socio-Economic Database for Latin America, Branko Milanovic’s World Income Distribution data (“All the Ginis”), and the ILO’s Household Income and Expenditure Statistics, but a multitude of national statistical offices and other sources were also consulted.

The SWIID also now incorporates the University of Texas Inequality Project’s UTIP-UNIDO dataset on differences in pay across industrial sectors. Across countries and years, these data explain only about half of the variation in net income inequality (and much less of gross income inequality) and so yield predictions with prohibitively large standard errors when employed in this way, but where there was sufficient data available, I used the UTIP data to make within-country loess predictions of both net and gross income inequality that informed the SWIID estimates.

The imputation routine used for generating the SWIID was cleaned up: the code now runs more efficiently, and a few errors were corrected.

Many researchers have asked me about using the SWIID to examine questions of redistribution, so I now include in the dataset the percentage reduction in gross income inequality (that is, the difference between the gross and net income inequality, divided by gross income inequality, multiplied by 100) as an estimate of redistribution (“redist”) as well as its associated standard error (“redist_se”). The standard errors for redistribution are particularly important to take into account, as they can often be quite large relative to the size of the estimates. Observations for redistribution are omitted for countries for which the source data do not include multiple observations of either net or gross income inequality: in such cases, although the two inequality series each still constitute the most comparable available estimates, the difference between them reflects only information from other countries, and treating it as meaningful independent information about redistribution would be unwise. Similarly, because the underlying data is often thin in the early years included in the SWIID, redistribution is only reported after 1975 for most of the advanced countries and only after 1985 for most countries in the developing world.
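The redistribution measure defined above is just a percentage change between the two Gini series, as this minimal Python sketch shows. The function name and the input values are made up for illustration; they are not SWIID estimates, and the sketch does not attempt to reproduce how the associated standard error is calculated.

```python
def relative_redistribution(gini_gross, gini_net):
    """Percentage reduction in inequality from the gross to the net series.

    Hypothetical helper: (gross - net) / gross * 100, as described above.
    """
    if gini_gross <= 0:
        raise ValueError("gini_gross must be positive")
    return (gini_gross - gini_net) / gini_gross * 100


# Made-up values for illustration only
redist = relative_redistribution(gini_gross=45.0, gini_net=30.0)
print(f"redist = {redist:.1f}")
```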

As always, I encourage users of the SWIID to email me with their comments, questions, and suggestions.

### Using the SWIID Standard Errors

##### Sunday, 20 June 2010

Incorporating the standard errors in the SWIID estimates into one’s analyses is the right thing to do, but it is not a trivial exercise. I myself have left it out of some work where I felt the model was already maxed out on complexity (though in such cases, I advise at least excluding observations with particularly large errors). The short story is that one generates a bunch of Monte Carlo simulations of the SWIID data from the estimates and standard errors, then analyses each simulation, then combines the results of the multiple analyses as one would in a multiple-imputation setup (this should be easier to do with Stata 11’s new multiple-imputation tools, but I won’t get my copy of Stata 11 until the fall–oh well). The code below does the trick.

```stata
**Using the SWIID Standard Errors: An Example**
//Load SWIID and generate fake data for example
use "SWIIDv2_0.dta", clear
set seed 4533160
gen x1 = 20*rnormal()
gen x2 = rnormal()
gen x3 = 3*rnormal()
gen y = .03*x1 + 3*x2 + .5*x3 + .05*gini_net + 5 + 20*rnormal()
reg y x1 x2 x3 gini_net

//Generate ten Monte Carlo simulations of the gini_net series
egen ccode=group(country)
tsset ccode year
set seed 3166
forvalues a = 1/10 {
    gen e0 = rnormal()
    quietly tssmooth ma e00 = e0, weights(1 1 <2> 1 1)
    quietly sum e00
    quietly gen g`a'=gini_net+e00*(1/r(sd))*gini_net_se
    drop e0 e00
}

//Perform analysis using each of the ten simulations, saving the results
local other_ivs = "x1 x2 x3"        /*to be replaced with your other IVs, that is, not including gini_net or the constant*/
local n_ivs = 5             /*to be replaced with the number of IVs, now *including* gini_net and the constant*/
matrix coef = J(`n_ivs', 10, -99)
matrix se = J(`n_ivs', 10, -99)     /*holds variances until the square root is taken below*/
matrix r_sq = J(1, 10, -99)
forvalues a = 1/10 {
    quietly reg y `other_ivs' g`a'  /*to be replaced with your analysis*/
    matrix coef[1,`a'] = e(b)'
    matrix A = e(V)
    forvalues b = 1/`n_ivs' {
        matrix se[`b', `a'] = (A[`b',`b'])
    }
    matrix r_sq[1, `a'] = e(r2)
}

local cases = e(N)

svmat coef, names(coef)
svmat se, names(se)
svmat r_sq, names(r_sq)

//Display results across all simulations
egen coef_all = rowmean(coef1-coef10)

gen ss_all = 0
forvalues a = 1/10 {
    quietly replace ss_all = ss_all + (coef`a'-coef_all)^2
}
egen se_all = rowmean(se1-se10)
replace se_all = se_all + (((1+(1/10)) * ((1/9) * ss_all))) /*Total variance, per Rubin (1987)*/
replace se_all = (se_all)^.5 /*Total standard error*/

gen t_all = coef_all/se_all
gen p_all = 2*normal(-abs(t_all))

egen r_sq_all = rowmean(r_sq1-r_sq10)

gen vars = " " in 1/`n_ivs'
local i = 0
foreach iv in `other_ivs' "Inequality" "Constant" {
    local i = `i'+1
    replace vars = "`iv'" in `i'
}
mkmat coef_all se_all p_all if coef_all~=., matrix(res_all) rownames(vars)
matrix list res_all, format(%9.3f)
quietly sum r_sq_all
local r2 = round(`r(mean)', .001)
di "R-sq = `r2'"
di "N = `cases'"
```
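For readers who want to see the combining step on its own, here is a minimal Python sketch of the same Rubin-style pooling that the Stata code applies to each coefficient. It is an illustration under stated assumptions rather than part of the SWIID materials: the function name `combine_rubin` and the example numbers are made up, and it pools a single coefficient at a time.

```python
import statistics


def combine_rubin(coefs, variances):
    """Pool one coefficient across m simulated datasets.

    coefs     -- the coefficient estimate from each simulation
    variances -- its squared standard error from each simulation
    Returns (pooled coefficient, pooled standard error).
    """
    m = len(coefs)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two simulations")
    pooled = statistics.fmean(coefs)
    within = statistics.fmean(variances)   # average sampling variance
    between = statistics.variance(coefs)   # sample variance, divides by m - 1
    total = within + (1 + 1 / m) * between  # total variance
    return pooled, total ** 0.5


# Made-up estimates from three simulations, for illustration only
coef, se = combine_rubin([0.04, 0.06, 0.05], [0.0004, 0.0004, 0.0004])
print(f"coef = {coef:.3f}, se = {se:.4f}")
```

The pooled standard error is larger than any single simulation's standard error whenever the coefficient varies across simulations, which is how the uncertainty in the SWIID estimates carries through to the final results.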

Please feel free to drop me an email if you have any questions or comments.

### SWIID Version 2.0

##### Friday, 31 July 2009

Version 2.0 of the SWIID is now available, and it is a major upgrade. It introduces two important changes from Version 1.1 (the version described in the SSQ article). First, I collected a large number (1500+) of Gini observations that are excluded from the WIID with an eye towards addressing some of the thinner spots in the SWIID’s underlying data. Second, I rewrote several parts of the missing-data algorithm. The key change is a switch from multilevel to (flat) linear regression modeling for the imputation of conversion ratios between the 21 categories of available Gini data. Given the patterns of missingness in the data, complete pooling (as occurs in a flat linear regression) proved superior to partial pooling (as occurs in multilevel modeling). The result, along with some minor improvements in coverage, is considerably smaller standard errors in the Gini index estimates, particularly in Latin America and Africa, than in Version 1.1. All SWIID users are encouraged to use these new data in their work.

### SWIID Version 1.1

##### Sunday, 12 October 2008

So much for version control. With apologies to v1.0 users, Version 1.1 is the SWIID as reported in “Standardizing the World Income Inequality Database.”

### SWIID Version 1.0

##### Saturday, 13 September 2008

“Standardizing the World Income Inequality Database” has been accepted for publication in the Social Science Quarterly. Version 0.9 of the SWIID is now released as Version 1.0 without modification.

### SWIID Version 0.9

##### Tuesday, 5 August 2008

The SWIID is currently undergoing peer review for publication.