In this tutorial we will analyze a dataset that can plausibly be treated as experimental. Along the way, we will tie together skills we have learned so far this quarter, such as graphing coefficient estimates.
We’ll use a dataset compiled by Ebonya Washington to study whether a politician’s number of daughters affects how he or she votes on women’s issues (washington.csv, available on Coursework). Each row in the dataset has information on a Member of Congress (435 in total). The key variables are:
Before we dive into analyzing the data, let's familiarize ourselves with the difference between an experiment and a quasi-experiment (also called a natural experiment). After discussing as a class, answer the following questions in your own words:
What is the key element of an experiment and why (at least as the sample size increases) does it lend itself to causal inference?
What is the key difference between an experiment and a natural experiment?
Imagine we wanted to compare couples in the U.S. who had one child. Is having a daughter (as opposed to a son) as-if random? What if we compared couples with three children: is the number of daughters as-if random?
Imagine we compared couples with three children to couples with one child. Is the number of daughters these types of couples have as-if random?
Note: By as-if random, we mean that nature assigns the gender of each child with some probability (in this case, roughly equal to 0.5) and that this assignment is exogenous to parental decisions or characteristics.
The main relationship we’re interested in is that between number of daughters and representatives’ voting score on women’s issues. We will try to come up with an appropriate model to estimate this relationship. To decide what covariates (if any) to include, we begin by looking at the relationship between the total number of children and voting score. Examine this in two ways:
Create a scatter plot of voting score against total number of children. You will see that jittering the points is helpful (see geom_jitter). Instead of overlaying a line of best fit or loess smoother, overlay the mean voting score for each number of children, and connect these points with a line. What do you see?
While a linearity assumption may or may not be valid here (based on 1), regress the voting score on total number of children. Interpret the coefficient and standard error.
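The two steps above can be sketched roughly as follows. This is only an illustration on simulated data, not the actual analysis: the data frame `sim` and the column names `nchildren` and `score` are assumptions made here, so substitute the variable names from washington.csv.

```r
library(ggplot2)

set.seed(1)
n <- 435

# Simulated stand-in for washington.csv. The column names `nchildren` and
# `score` are hypothetical; substitute the names used in the real file.
sim <- data.frame(nchildren = sample(0:6, n, replace = TRUE))
sim$score <- 60 - 3 * sim$nchildren + rnorm(n, sd = 20)

# Part 1: jittered scatter plot with the mean score at each number of children
means <- aggregate(score ~ nchildren, data = sim, FUN = mean)

p <- ggplot(sim, aes(x = nchildren, y = score)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.4) +
  geom_line(data = means, colour = "red") +
  geom_point(data = means, colour = "red", size = 3) +
  labs(x = "Total number of children", y = "Voting score on women's issues")
print(p)

# Part 2: linear regression of score on number of children
fit <- lm(score ~ nchildren, data = sim)
coef(summary(fit))  # columns include "Estimate" and "Std. Error"
```

Jittering only horizontally (height = 0) keeps each point's voting score exact while spreading out points that share the same number of children.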
Now look at the relationship between total number of children and number of daughters. There is (or should be) an obvious relationship here. Show it with an appropriate summary graph.
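One possible summary graph plots the mean number of daughters at each family size. The sketch below uses simulated data in which each child is a daughter with probability 0.5; the column name `nchildren` is hypothetical, and the real data may look different.

```r
library(ggplot2)

set.seed(2)
n <- 435

# Simulated stand-in for the real data; `nchildren` is a hypothetical name.
sim <- data.frame(nchildren = sample(0:6, n, replace = TRUE))
# Each child is a daughter with probability 0.5, independently
sim$ngirls <- rbinom(n, size = sim$nchildren, prob = 0.5)

# Mean number of daughters at each total number of children
p <- ggplot(sim, aes(x = nchildren, y = ngirls)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point", size = 3) +
  labs(x = "Total number of children", y = "Mean number of daughters")
print(p)
```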
Estimate the following model:
\(y_i = \beta_0 + \beta_1 \text{ngirls}_i + \epsilon_i\)
where \(y_i\) is the voting score on women’s issues. Interpret the coefficient.
Write down an OLS model (involving ngirls and one more covariate) that potentially provides an unbiased causal estimate of number of daughters on voting score. In light of the previous questions, explain why this model may provide a causal estimate. Then execute this model in R, and carefully interpret the coefficient and standard error on ngirls.
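The mechanics of comparing a model with and without the extra covariate might look like the sketch below. It assumes, for illustration only, that the covariate is the total number of children (hypothetical column `nchildren`), and it uses simulated data in which the true effect of one more daughter is set to 2. The numbers say nothing about the real dataset.

```r
set.seed(3)
n <- 435

# Simulated data: the covariate choice and all column names other than
# `ngirls` are assumptions made for this sketch.
nchildren <- sample(0:6, n, replace = TRUE)
ngirls <- rbinom(n, size = nchildren, prob = 0.5)
score <- 50 + 2 * ngirls - 3 * nchildren + rnorm(n, sd = 10)
sim <- data.frame(score, ngirls, nchildren)

# Model without the covariate, and model with it
fit_short <- lm(score ~ ngirls, data = sim)
fit_long <- lm(score ~ ngirls + nchildren, data = sim)

# Coefficient and standard error on ngirls in each model
b_short <- coef(summary(fit_short))["ngirls", c("Estimate", "Std. Error")]
b_long <- coef(summary(fit_long))["ngirls", c("Estimate", "Std. Error")]
print(rbind(short = b_short, long = b_long))
```

In this simulation the short model mixes the effect of daughters with the effect of family size, so its coefficient on ngirls falls below the one from the long model.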
Look up Ebonya Washington’s article “Female Socialization: How Daughters Affect Their Legislator Fathers’ Voting on Women’s Issues” to confirm that she used a model similar to yours in Question 5. Then take a look at the section “Identifying Assumptions” (p. 317) of this article. Washington identifies two threats to causal inference: (a) stopping rules and (b) constituency selections correlated with child gender. In your own words, describe the logic behind the first threat to inference (stopping rules). That is, in what way might this bias the relationship of interest? Do you find Washington’s argument that such a bias is unlikely convincing? Why or why not?
Relax the linearity assumption on number of girls by estimating the following model:
\(y_i = \beta_0 + \alpha_{j[i]} + \gamma Z_i + \epsilon_i\)
Here, \(\alpha_{j[i]}\) is what’s called a “fixed effect” for number of daughters: it is a set of dummy variables, each of which equals 1 if representative \(i\) has \(j\) daughters and 0 otherwise, for \(j>0\) (so \(j = 0\) is left as the base category). \(Z_i\) is the same covariate you should have included in Question 5.
Implement this model in R in two ways:
By treating ngirls as a factor variable instead of a numeric variable in R. R will automatically code the dummy variables for you.
By manually coding the dummy variables \(\alpha_{j[i]}\) for \(j>0\).
In each case, you should estimate a total of 8 parameters. Carefully interpret each parameter estimate in the model, and discuss whether the model supports the hypothesis that number of daughters is positively related to liberal voting on women’s issues.
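A rough sketch of the two implementations, on simulated data, is shown below. The covariate is assumed here to be total number of children (hypothetical column `nchildren`), and the dummy names `girls_1`, `girls_2`, and so on are made up for this example. The number of parameters in the sketch depends on the simulated data and need not match the real dataset.

```r
set.seed(4)
n <- 435

# Simulated stand-in data; names other than `ngirls` are assumptions.
nchildren <- sample(0:6, n, replace = TRUE)
ngirls <- rbinom(n, size = nchildren, prob = 0.5)
score <- 50 + 2 * ngirls - 3 * nchildren + rnorm(n, sd = 10)
sim <- data.frame(score, ngirls, nchildren)

# Way 1: let R build the dummies by treating ngirls as a factor
fit_factor <- lm(score ~ factor(ngirls) + nchildren, data = sim)

# Way 2: code one dummy per observed value j > 0 by hand
girl_levels <- sort(unique(sim$ngirls))
girl_levels <- girl_levels[girl_levels > 0]
dummy_names <- paste0("girls_", girl_levels)
for (j in girl_levels) {
  sim[[paste0("girls_", j)]] <- as.numeric(sim$ngirls == j)
}
fit_manual <- lm(reformulate(c(dummy_names, "nchildren"), response = "score"),
                 data = sim)

# The two fits should agree coefficient by coefficient
same <- all.equal(unname(coef(fit_factor)), unname(coef(fit_manual)))
print(same)
```

Because 0 is the lowest factor level, R uses it as the base category, which matches leaving out the dummy for \(j = 0\) in the manual version.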
Get the number of observations for each value of ngirls (e.g., using table()). In light of this, some of the estimated parameters in Question 7 should seem less credible. Why? Account for this by reestimating the model in Question 7 for categories \(j\) with at least 4 observations, and discuss what you find.
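One way to do this is sketched below on simulated data. It reads "reestimating for categories with at least 4 observations" as dropping the representatives in sparser categories, which is an assumption about the intended approach; the column `nchildren` is also hypothetical.

```r
set.seed(5)
n <- 435

# Simulated stand-in data; names other than `ngirls` are assumptions.
nchildren <- sample(0:6, n, replace = TRUE)
ngirls <- rbinom(n, size = nchildren, prob = 0.5)
score <- 50 + 2 * ngirls - 3 * nchildren + rnorm(n, sd = 10)
sim <- data.frame(score, ngirls, nchildren)

# Number of observations at each value of ngirls
counts <- table(sim$ngirls)
print(counts)

# Keep only the categories with at least 4 observations
keep <- as.numeric(names(counts)[counts >= 4])
sim_sub <- sim[sim$ngirls %in% keep, ]

# Reestimate the fixed-effects model on the restricted sample
fit_sub <- lm(score ~ factor(ngirls) + nchildren, data = sim_sub)
print(coef(summary(fit_sub)))
```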
Graph the estimated \(\alpha_{j[i]}\) from Question 8 (that is, for categories \(j\) with at least 4 observations). Include 95% confidence intervals. Organize the x-axis in a way that makes sense. Clearly label the axes, and describe what the graph means.
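A coefficient plot of this kind could be built along the following lines. The sketch runs on simulated data with a hypothetical covariate column `nchildren`, and uses confint() for the 95% intervals; treat it as one possible layout, not the required one.

```r
library(ggplot2)

set.seed(6)
n <- 435

# Simulated stand-in data; names other than `ngirls` are assumptions.
nchildren <- sample(0:6, n, replace = TRUE)
ngirls <- rbinom(n, size = nchildren, prob = 0.5)
score <- 50 + 2 * ngirls - 3 * nchildren + rnorm(n, sd = 10)
sim <- data.frame(score, ngirls, nchildren)

# Restrict to categories with at least 4 observations, then fit
counts <- table(sim$ngirls)
sim_sub <- sim[sim$ngirls %in% as.numeric(names(counts)[counts >= 4]), ]
fit <- lm(score ~ factor(ngirls) + nchildren, data = sim_sub)

# Pull out the daughter fixed effects and their 95% confidence intervals
ci <- confint(fit, level = 0.95)
is_alpha <- grepl("factor(ngirls)", names(coef(fit)), fixed = TRUE)
est <- data.frame(
  ngirls = as.numeric(sub("factor(ngirls)", "", names(coef(fit))[is_alpha],
                          fixed = TRUE)),
  estimate = unname(coef(fit)[is_alpha]),
  lower = unname(ci[is_alpha, 1]),
  upper = unname(ci[is_alpha, 2])
)

p <- ggplot(est, aes(x = ngirls, y = estimate)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  scale_x_continuous(breaks = est$ngirls) +
  labs(x = "Number of daughters (base category: 0)",
       y = "Estimated difference in voting score (95% CI)")
print(p)
```

Each point is a difference relative to representatives with no daughters, holding the covariate fixed, which is why the dashed line at zero is a useful reference.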
Continuing with the model from Question 8, graph the predicted voting scores for a representative with 4 children, varying his or her number of daughters (on the x-axis) between 0 and 4. Again include 95% confidence intervals, clearly label the axes, and describe what you see.
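Predicted values with confidence intervals can be obtained from predict() with interval = "confidence". The sketch below assumes the covariate in the model is total number of children (hypothetical column `nchildren`) and fits the model to simulated data, so only the workflow carries over to the real analysis.

```r
library(ggplot2)

set.seed(7)
n <- 435

# Simulated stand-in data; names other than `ngirls` are assumptions.
nchildren <- sample(0:6, n, replace = TRUE)
ngirls <- rbinom(n, size = nchildren, prob = 0.5)
score <- 50 + 2 * ngirls - 3 * nchildren + rnorm(n, sd = 10)
sim <- data.frame(score, ngirls, nchildren)

fit <- lm(score ~ factor(ngirls) + nchildren, data = sim)

# Hypothetical representatives: 4 children, 0 to 4 daughters
newdata <- data.frame(ngirls = 0:4, nchildren = 4)
pred <- predict(fit, newdata = newdata, interval = "confidence", level = 0.95)
pred_df <- cbind(newdata, as.data.frame(pred))  # adds fit, lwr, upr

p <- ggplot(pred_df, aes(x = ngirls, y = fit)) +
  geom_pointrange(aes(ymin = lwr, ymax = upr)) +
  labs(x = "Number of daughters (out of 4 children)",
       y = "Predicted voting score (95% CI)")
print(p)
```

Note that predict() can only handle values of ngirls that appear in the data used to fit the model, so every value in newdata must be a category that survived any earlier filtering.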