Many of you will have to produce professional reports and presentations, in which clean tables and graphs matter. The best way to automate this process is to use LaTeX and Beamer rather than Microsoft Word and PowerPoint. Alternatively, check out the RStudio team’s slick R Markdown package, which makes producing beautiful reports really simple.
However, here are some options if you still want to use Word.
You can produce tables quickly without having to copy and paste every number from R to Word.
To do so:
The text should now end up in a table that you can format in Word.
# Read data
states <- read.csv("states.csv")

# (1) Create a table of Bush support by U.S. region in 2000 (South versus Non-South):
t <- with(states, table(south, gb_win00))
t <- prop.table(t, margin = 1)  # convert counts to row proportions
t  # Bush won a large majority of Southern states:
##           gb_win00
## south      Bush win Gore win
##   Nonsouth   0.4706   0.5294
##   South      0.8750   0.1250
# (2) Write this table to a comma separated .txt file:
write.table(t, file = "bush_south.txt", sep = ",", quote = FALSE)
The .txt file will end up in your working directory. Now follow steps 3 and 4 to create a table in Word.
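One detail is worth knowing when a table has row names: by default, write.table() writes a header line with one fewer field than the data rows, which can shift the column headings once the text is converted to a table. The sketch below shows one way around this, the col.names = NA argument, which adds an empty heading above the row names. It is only an illustration: the object tab is typed in by hand from the proportions printed above, and the file goes to a temporary location rather than your working directory.

```r
# Illustrative stand-in for the table above, with the printed proportions typed in by hand
tab <- matrix(c(0.4706, 0.5294,
                0.8750, 0.1250),
              nrow = 2, byrow = TRUE,
              dimnames = list(south = c("Nonsouth", "South"),
                              gb_win00 = c("Bush win", "Gore win")))

# Write to a temporary file; col.names = NA adds an empty header cell above the row names
out_file <- tempfile(fileext = ".txt")
write.table(tab, file = out_file, sep = ",", quote = FALSE, col.names = NA)

# Inspect the raw lines: the header and the data rows now have the same number of fields
file_lines <- readLines(out_file)
n_fields <- lengths(strsplit(file_lines, ","))
print(file_lines)
print(n_fields)
```

With the header aligned this way, each heading should land above its own column when the text is turned into a table.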
Here’s another example that again uses the states.csv dataset. Say we wanted to create a table with summary statistics for five of the variables in this dataset:
# Load packages (plyr before dplyr, so that plyr does not mask dplyr functions)
library(plyr)   # to access colwise function
library(dplyr)  # to access select function

# Keep 5 variables in states dataset
states_sub <- select(states, blkpct, attend_pct, bush00, obama08, womleg)

# Find summary statistics for each variable
means <- colwise(mean)(states_sub)
stdev <- colwise(sd)(states_sub)
mins <- colwise(min)(states_sub)
maxs <- colwise(max)(states_sub)

# Create df with summary statistics, putting variables in rows using transpose function t()
df <- data.frame(t(means), t(stdev), t(mins), t(maxs))

# Clean column and row names
names(df) <- c("Mean", "SD", "Min", "Max")
row.names(df) <- c("Black (%)", "Attend Church (%)", "Bush -00 (%)",
                   "Obama -08 (%)", "Women in Legislature (%)")

# Restrict number of decimal points to 1
df <- round(df, 1)
df
##                          Mean  SD  Min  Max
## Black (%)                10.3 9.7  0.4 36.8
## Attend Church (%)        38.9 9.4 22.0 60.0
## Bush -00 (%)             50.4 8.7 31.9 67.8
## Obama -08 (%)            50.5 9.5 32.5 71.8
## Women in Legislature (%) 23.2 7.3  8.8 37.8
# Write data frame to .txt file
write.table(df, file = "sumstats.txt", sep = ",", quote = FALSE)
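If you would rather avoid extra packages, the same kind of table can be built in base R. The sketch below is one possible approach rather than the method used above: it relies on R's built-in mtcars data as a stand-in for states.csv, and the helper name summarize_var is made up for illustration.

```r
# Stand-in data: three columns of the built-in mtcars dataset (not the states data)
cars_sub <- mtcars[, c("mpg", "hp", "wt")]

# Hypothetical helper: four summary statistics for one numeric vector
summarize_var <- function(x) {
  c(Mean = mean(x), SD = sd(x), Min = min(x), Max = max(x))
}

# sapply() returns statistics in rows and variables in columns; t() flips that
sumstats <- round(t(sapply(cars_sub, summarize_var)), 1)

# Clean row names, as in the example above
rownames(sumstats) <- c("Miles per gallon", "Horsepower", "Weight (1000 lbs)")
sumstats
```

Adding another statistic, such as a median or a percentile from quantile(), only requires one more entry in the vector that the helper returns.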
hispanic. The table should include the number of observations (n), mean, median, 10th percentile, and 90th percentile of each of the variables. Put the variables in the rows of the table and the summary statistics in the columns, like we did in the example above. Format your table in Word to make it look similar to this table.
In Tutorial 2, we covered graphing with the ggplot2 package. Let’s talk about how to ensure that the graphs you produce look good when you include them in your write-ups.
Saving images as .pdf is usually your best option. PDF is a vector format, so images don’t pixelate when they are resized. (And you can insert .pdfs into Word like you do with other image file formats.)
To save a .pdf, call the pdf() function before the code that draws the image you want to save, and call dev.off() after it.
Here’s an example, again using the states.csv dataset:
states <- read.csv("states.csv")
library(ggplot2)

p <- ggplot(states, aes(x = attend_pct, y = bush00)) +
  geom_point() +
  geom_text(aes(label = stateid, y = bush00 - 0.7), size = 3) +
  geom_smooth(method = "loess", se = FALSE) +
  xlab("% in State Attending Religious Services") +
  ylab("% in State Voting for Bush in 2000")

# Save the image as a pdf:
pdf(file = "bush_religion.pdf", height = 6, width = 8)
p
dev.off()
## pdf 
##   2
Arranging graphs into a matrix of rows and columns, like we did on problem set 2, can be very useful for presentational purposes. There are two ways to do this using
Here’s an example of the first approach:
p1 <- ggplot(states, aes(x = bush00, y = bush04)) +
  geom_point() +
  geom_text(aes(label = stateid, y = bush04 - 0.7), size = 3) +
  geom_smooth(method = "loess", se = FALSE) +
  xlab("% in State Voting for Bush in 2000") +
  ylab("% in State Voting for Bush in 2004")

p2 <- ggplot(states, aes(x = bush04, y = obama08)) +
  geom_point() +
  geom_text(aes(label = stateid, y = obama08 - 0.7), size = 3) +
  geom_smooth(method = "loess", se = FALSE) +
  xlab("% in State Voting for Bush in 2004") +
  ylab("% in State Voting for Obama in 2008")

p3 <- ggplot(states, aes(x = vep04_turnout, y = bush04)) +
  geom_point() +
  geom_text(aes(label = stateid, y = bush04 - 0.7), size = 3) +
  geom_smooth(method = "loess", se = FALSE) +
  xlab("Turnout among Voting Eligible Population (2004)") +
  ylab("% in State Voting for Bush in 2004")

p4 <- ggplot(states, aes(x = vep08_turnout, y = obama08)) +
  geom_point() +
  geom_text(aes(label = stateid, y = obama08 - 0.7), size = 3) +
  geom_smooth(method = "loess", se = FALSE) +
  xlab("Turnout among Voting Eligible Population (2008)") +
  ylab("% in State Voting for Obama in 2008")

library(gridExtra)
grid.arrange(p1, p2, p3, p4,  # specify the graphs to include
             ncol = 2)        # specify the number of columns we want
Of course, you could save this graph using the pdf() function from above.
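As a rough sketch of what that could look like, the code below wraps a grid.arrange() call between pdf() and dev.off(). Everything specific in it is an assumption made for illustration: the four plots use R's built-in mtcars data rather than states.csv, the output goes to a temporary file, and the 10 by 8 inch page size is just one plausible choice for a 2 by 2 grid.

```r
# Temporary output file, so the sketch does not clutter the working directory
out_file <- tempfile(fileext = ".pdf")

# Only run if both packages are installed
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("gridExtra", quietly = TRUE)) {
  library(ggplot2)
  library(gridExtra)

  # Stand-in scatterplots from the built-in mtcars data (not the states data)
  g1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
  g2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
  g3 <- ggplot(mtcars, aes(x = disp, y = mpg)) + geom_point()
  g4 <- ggplot(mtcars, aes(x = qsec, y = mpg)) + geom_point()

  # Open the pdf device, draw the 2 x 2 grid, then close the device
  pdf(file = out_file, height = 8, width = 10)
  grid.arrange(g1, g2, g3, g4, ncol = 2)
  dev.off()
}
```

A grid of four panels usually needs a larger page than a single plot, so it is worth trying a few width and height values until the axis labels are readable.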
Using gridExtra, create four scatterplots of your choice (not the same as in the examples above) and arrange them into 2 rows and 2 columns.
Save the resulting graph using pdf() and dev.off(), specifying an appropriate width and height, and insert this image into Word.
For Problem Set 3, you will need to carry out one- and two-sample hypothesis tests. Refer to the lecture notes for the theory behind these tests. What follows is a brief discussion of how to implement these tests in R. Let’s keep working with the states.csv dataset.
Given a cross-tab, a chi-squared test essentially tests whether there is a “relation between the rows and columns”, or whether there is statistical independence given the marginal distributions of the rows and columns.
states <- read.csv("states.csv")
with(states, table(gb_win00, states$gay_policy))
##           
## gb_win00   Conservative Liberal Most conservative Most liberal
##   Bush win            7       2                20            1
##   Gore win            3      12                 0            5
# Rearrange the order of the gay policy scale
states$gay_policy <- factor(states$gay_policy,
                            levels = c("Most liberal", "Liberal",
                                       "Conservative", "Most conservative"))
with(states, table(gb_win00, states$gay_policy))
##           
## gb_win00   Most liberal Liberal Conservative Most conservative
##   Bush win            1       2            7                20
##   Gore win            5      12            3                 0
Class Exercise: What would this distribution look like if the cell values were approximately proportional to the marginal distributions?
Let’s do a chi-squared test on the actual distribution:
t <- with(states, table(gb_win00, states$gay_policy))
chisq.test(t)
## 
##  Pearson's Chi-squared test
## 
## data:  t
## X-squared = 30.63, df = 3, p-value = 1.015e-06
How do we interpret this output?
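One way to see where these numbers come from is to rebuild the test by hand. The sketch below types the observed counts in from the reordered cross-tab above (so it does not need states.csv), computes the counts we would expect if rows and columns were independent, and then forms Pearson's statistic. The object names are chosen here for illustration.

```r
# Observed counts, typed in from the reordered cross-tab above
observed <- matrix(c(1,  2, 7, 20,
                     5, 12, 3,  0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gb_win00 = c("Bush win", "Gore win"),
                                   gay_policy = c("Most liberal", "Liberal",
                                                  "Conservative", "Most conservative")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's statistic: sum of (observed - expected)^2 / expected over all cells
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

round(expected, 1)
round(x_squared, 2)
df
p_value
```

The statistic and degrees of freedom should agree with the chisq.test() output above. Note that a few of the expected counts fall below 5, a situation in which the chi-squared approximation is usually treated with some caution.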
In one-sample t-tests, we test whether an estimated mean can be statistically distinguished from a posited “true” population mean \(\mu_0\). Let’s test whether per capita income, which has an estimated mean of 31951 across states, is actually 30000. So, the null hypothesis defines \(\mu_0 =\) 30000. How weird would it be to see a sample mean of 31951 if \(\mu_0\) really were 30000?
## [1] 31951
t.test(states$prcapinc, mu = 30000)
## 
##  One Sample t-test
## 
## data:  states$prcapinc
## t = 3.101, df = 49, p-value = 0.003193
## alternative hypothesis: true mean is not equal to 30000
## 95 percent confidence interval:
##  30687 33215
## sample estimates:
## mean of x 
##     31951
How do we interpret this output?
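To connect the output to the formula, here is a minimal sketch that computes the pieces of a one-sample t-test by hand and compares them with t.test(). The data are simulated purely for illustration (the vector income is not the real per capita income variable), so the numbers will differ from the output above.

```r
# Simulated stand-in for per capita income in 50 states (not the real data)
set.seed(123)
income <- rnorm(50, mean = 32000, sd = 4500)
mu0 <- 30000  # value of the mean under the null hypothesis

# t statistic: distance of the sample mean from mu0, in standard-error units
n <- length(income)
std_error <- sd(income) / sqrt(n)
t_stat <- (mean(income) - mu0) / std_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# 95 percent confidence interval for the mean
ci <- mean(income) + c(-1, 1) * qt(0.975, df = n - 1) * std_error

# The built-in test reports the same quantities
res <- t.test(income, mu = mu0)
c(by_hand = t_stat, t.test = unname(res$statistic))
c(by_hand = p_value, t.test = res$p.value)
```

Seen this way, the p-value is the probability, if the null hypothesis were true, of a t statistic at least as far from zero as the one observed.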
In two-sample t-tests, we test whether the means of two samples (or groups) differ; under the null hypothesis, both groups share the same mean. For example, say we wanted to test whether the percentage of women in state legislatures differs across Southern and non-Southern states. Before we carry out this test, what is the null hypothesis? What is the alternative hypothesis?
The following carries out a Welch test, which doesn’t assume that the two groups have the same variances and uses the Satterthwaite-Welch adjustment of the degrees of freedom (usually resulting in non-integer degrees of freedom):
with(states, t.test(womleg ~ south))
## 
##  Welch Two Sample t-test
## 
## data:  womleg by south
## t = 3.301, df = 28.7, p-value = 0.002583
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.563 10.924
## sample estimates:
## mean in group Nonsouth    mean in group South 
##                  25.41                  18.66
How do we interpret this output?
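The non-integer degrees of freedom can look mysterious, so here is a small sketch of the Welch calculation done by hand and checked against t.test(). The two groups are simulated for illustration only (they are not the womleg data, and the group sizes and parameters are made up), so the results will not match the output above.

```r
# Simulated stand-in data (not the real states data)
set.seed(42)
nonsouth <- rnorm(34, mean = 25, sd = 7)
south    <- rnorm(16, mean = 19, sd = 5)

# Variance of each group's sample mean
v1 <- var(nonsouth) / length(nonsouth)
v2 <- var(south) / length(south)

# Welch t statistic: difference in means over its standard error
t_stat <- (mean(nonsouth) - mean(south)) / sqrt(v1 + v2)

# Satterthwaite-Welch approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(nonsouth) - 1) + v2^2 / (length(south) - 1))

# Two-sided p-value using the adjusted degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

# Built-in version; var.equal = FALSE (the Welch test) is the default
res <- t.test(nonsouth, south)
c(by_hand = t_stat, t.test = unname(res$statistic))
c(by_hand = df_welch, t.test = unname(res$parameter))
```

The adjusted degrees of freedom always lie between the smaller group's size minus one and the pooled value of n1 + n2 - 2, which is why they are rarely whole numbers.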
The code used in this tutorial is available here.