tidystats is a package to create a file containing the output of
statistical models. The goal of this package is to help researchers accompany
their manuscript with an organized data file of statistical results, in order to
greatly improve the reliability of meta-research and to reduce statistical
reporting errors.

Scientific hypotheses are tested using statistical tests. It is the output of these tests that generally determines whether a hypothesis is supported or not. In other words, the output of statistical tests is important. Yet statistical output is often treated like a second-class citizen.
This leads to two problems:
- Insufficient statistics reporting
- Incorrect statistics reporting
I believe that
tidystats can address both of these issues.
tidystats enables researchers to create a file that contains all of their statistics
output. This file is a JSON file: an organized data format that is easy for
computers to read (but less easy for humans). The upside of
storing statistics outside of a manuscript is that the researcher does not need
to worry about which analyses to report in the space-limited manuscript. An
additional benefit is that, because JSON files are easy for computers to read,
it is (relatively) easy to write software that does cool things with these files.
tidystats can be installed from CRAN, and the latest development version can be
installed from GitHub using devtools.
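The two installation routes above can be sketched as follows (the GitHub repository name is an assumption on my part):

```r
# Install the stable release from CRAN
install.packages("tidystats")

# Or install the latest development version from GitHub
# (repository name assumed; requires the devtools package)
devtools::install_github("WillemSleegers/tidystats")
```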
Please note that the package is still under heavy development; it is undergoing many changes that might break older code.
To use tidystats, load the package and start by creating an empty list to
store the results of statistical models in.

library(tidystats)

results <- list()
The main function is
add_stats(). The function has two required arguments:

- results: The list you want to add the statistical output to.
- output: The output of a statistical test you want to add to the list (e.g., the output of t.test() or lm()).
Optional, but useful, arguments include adding a unique identifier and
additional notes, using the identifier and notes arguments.
The identifier is used to identify the model (e.g., ‘weight_height_correlation’). This has to be a unique value; the function will throw an error if a duplicate identifier is provided. If you do not provide an identifier, one is automatically created for you based on the name of the output variable.
The notes argument can (and should) be used to add information that you or others may find useful. For example, you can specify that the test is a manipulation check or one of the crucial hypothesis tests. Some statistical tests produce default notes (e.g., t-tests), which are overwritten when a notes argument is supplied to the add_stats() function.
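As a minimal sketch of these arguments (the identifier and note are made up, and R’s built-in sleep dataset stands in for real data):

```r
library(tidystats)

# Start with an empty list to collect results in
results <- list()

# A t-test on the built-in sleep dataset, standing in for a real analysis
model <- t.test(extra ~ group, data = sleep)

# Add the model with a (made-up) unique identifier and a note
results <- add_stats(
  results,
  model,
  identifier = "sleep_t_test",
  notes = "Example analysis, standing in for a manipulation check"
)
```

Running add_stats() a second time with identifier = "sleep_t_test" would then throw an error, since identifiers must be unique.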
To illustrate how to use
tidystats, I will analyze the Many Labs replication
of Lorge & Curtiss (1936). Lorge and Curtiss (1936) examined how a quotation is
perceived when it is attributed to a liked or disliked individual. The quotation
of interest was:
“I hold it that a little rebellion, now and then, is a good thing, and as necessary in the political world as storms are in the physical world.”
In one condition the quotation was attributed to Thomas Jefferson and in the other it was attributed to Vladimir Lenin. Lorge and Curtiss (1936) found that people agree more with the quotation when the quotation was attributed to Jefferson than Lenin. In the Many Labs replication study, the quotation was attributed to either George Washington, the liked individual, or Osama Bin Laden, the disliked individual. We are again interested in testing whether the source of the quotation affects how it is evaluated (on a 9-point Likert scale ranging from 1 [disagreement] to 9 [agreement]).
Before getting into how
tidystats should be used, let’s first simply analyze
the data. I have designed
tidystats to be minimally invasive. In other words,
to use tidystats, you do not need to substantially change your data analysis
workflow.
We start by visualizing the data and calculating several simple descriptives.
It looks like the effect is in the expected direction: participants rated the quotation more favorably when they believed the quote to be from George Washington.
tidystats comes with its own function to calculate
descriptives, inspired by the
describe() function from the psych package:
quote_source %>%
  group_by(source) %>%
  describe_data(response)
Here we see exactly how many observations we have per group. All the other information was already present in the violin plot.
To test whether the differences between the two sources are statistically significant, we perform a t-test. Normally, we would just run the t-test like so:
t.test(response ~ source, quote_source)
However, since we want to later add this analysis to our list of analyses, we need to store the output of the t-test in a variable. We can then see the output of the t-test by simply printing the variable to the console, like so:
main_test <- t.test(response ~ source, quote_source)
main_test
[Table: t-test output, with columns t, p, df, difference, 95% CI lower, and 95% CI upper]
This shows us that there is a statistically significant effect of the quote source, consistent with the hypothesis.
Next, let’s run some additional analyses. One thing we can test is whether the effect is stronger in the US compared to non-US countries. To test this, we perform a regression analysis.
us_moderation_test <- lm(response ~ source * us_or_international, quote_source)
summary(us_moderation_test)
There appears to be a significant interaction. Since I’m terrible at interpreting interaction terms, let’s inspect the interaction with a graph.
We see that the effect of source is larger in the US. Given that the positive source was George Washington, this makes sense.
Next, let’s do one more analysis to see whether the effect is stronger in a lab setting compared to an online setting.
lab_moderation_test <- lm(response ~ source * lab_or_online, quote_source)
summary(lab_moderation_test)
We see no significant interaction in this case. This means we do not find evidence that running the study in an online setting significantly weakens the effect; good to know!
Now let’s get to
tidystats. We have three analyses we want to save: a
t-test and two regression analyses. We stored each of these analyses in
separate variables, called main_test, us_moderation_test, and
lab_moderation_test. But we do not just want to save the output, we also want
to add some additional information to each analysis. For the sake of this
example, let’s say that the t-test was our primary test. We also had a
suspicion that the location (US vs. international) would matter, but it wasn’t
our main interest. Nevertheless, we preregistered these two analyses. During
data analysis, we figured that it might also matter whether the study was
conducted in the lab or online, so we tested it. This means that this is an
exploratory analysis. With
add_stats(), we can add this information.
We will add the three variables containing the analyses to the list we previously
created, called results. This is done with the add_stats() function.
add_stats() accepts a list as its first argument, followed by a variable
containing a statistics model. In our case, this means we need to use the
add_stats() function three times, as we have three different analyses we want
to save. Since this can get pretty repetitive, we will use the piping operator
to link the three steps together and save some typing.
results <- results %>%
  add_stats(main_test, type = "primary", preregistered = TRUE) %>%
  add_stats(us_moderation_test, type = "secondary", preregistered = TRUE) %>%
  add_stats(lab_moderation_test, type = "exploratory", preregistered = FALSE)
I recommend doing this at the end of the data analysis script, in a section
called ‘tidystats’. This confines most of the tidystats code to a single
section, keeping it organized, and it keeps most of your script readable to
those unfamiliar with tidystats.
After all the analyses are added to the list, the list can be saved as a .json
file to your computer’s disk. This is done with the write_stats() function.
The function requires the list as its first argument, followed by a file path.
I’m a big fan of using RStudio Project files so that you can define relative
file paths. In this case, I create the .json file in the ‘Data’ folder of my
project.
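A sketch of that call (the file name is my own choice, and the ‘Data’ folder is assumed to exist at the project root):

```r
# Save the results list as a JSON file, using a path relative to the project
write_stats(results, path = "Data/results.json")
```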
That’s it! You can now share the output of your statistical analyses with others or start using the file to report the results in your manuscript. Want to see what this .json file looks like? You can download it here. Open the file in a text editor to see how the statistics are structured. As you will see, it is not easy for our human eyes to quickly see the results, but that’s where computers come in.
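If you would rather inspect the file in R than in a text editor, the list can be read back in with tidystats’ read_stats() function (assuming the file path used above):

```r
# Read the saved statistics back into R as a list
results <- read_stats("Data/results.json")
```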