Data Carpentry for Media Research: Key Points

Before we Start

Use RStudio to write and run R programs.
Use install.packages() to install packages (libraries).
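As a minimal sketch of the install step, the code below checks which packages are missing before calling install.packages(). The helper name missing_packages() and the package list are assumptions made for illustration; they are not part of the lesson.

```r
# Hypothetical helper: return the packages from `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Assumed package list for illustration; adjust it to your own needs.
to_install <- missing_packages(c("tidyverse", "quanteda"))

# install.packages() downloads from CRAN, so this sketch only calls it
# from an interactive session (for example the RStudio console).
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Once a package is installed, load it in each new session with library(), for example library(tidyverse).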

Introduction to R

Access individual values by location using [].
Access arbitrary sets of data using [c(...)].
Use logical operations and logical vectors to access subsets of data.
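The three ways of subsetting listed above can be sketched on a small vector. The vector name and its values are invented for illustration.

```r
# Made-up ratings for five programmes.
ratings <- c(7.2, 8.5, 6.1, 9.0, 5.4)

# A single value by position (R counts from 1).
second <- ratings[2]

# An arbitrary set of positions, built with c().
first_and_fourth <- ratings[c(1, 4)]

# A logical vector marks which elements to keep.
is_high <- ratings > 7
high_ratings <- ratings[is_high]

# Logical operators combine conditions element by element.
mid_ratings <- ratings[ratings > 6 & ratings < 9]
```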

Starting with Data

Use read_csv() from the readr package to read tabular data into R.
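A short sketch of reading a CSV file follows. To keep it self-contained it first writes a tiny file to a temporary location; the file contents and column names are invented for illustration, and in practice you would pass the path to your own data.

```r
library(readr)

# Write a small, made-up CSV file so the example runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("title,year,views",
    "Clip A,2019,1200",
    "Clip B,2021,"),
  csv_path
)

# read_csv() guesses column types and returns a tibble.
# Empty fields become NA; show_col_types = FALSE silences the type message.
clips <- read_csv(csv_path, show_col_types = FALSE)
```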

Data Wrangling with dplyr

Use the dplyr package to manipulate dataframes.
Use select() to choose variables from a dataframe.
Use filter() to choose data based on values.
Use group_by() and summarize() to work with subsets of data.
Use mutate() to create new variables.
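The verbs above are usually chained with the pipe. The following is a minimal sketch on an invented data frame; the column names and values are assumptions for illustration only.

```r
library(dplyr)

# Made-up data: one row per video clip.
clips <- tibble(
  outlet  = c("A", "A", "B", "B", "B"),
  editor  = c("x", "y", "x", "y", "z"),
  views   = c(100, 300, 50, 150, 400),
  minutes = c(2, 5, 1, 3, 8)
)

outlet_summary <- clips %>%
  select(outlet, views, minutes) %>%                 # keep three variables
  filter(views >= 100) %>%                           # keep rows by value
  mutate(views_per_minute = views / minutes) %>%     # add a new variable
  group_by(outlet) %>%                               # one group per outlet
  summarize(
    mean_rate = mean(views_per_minute),
    n_clips   = n()
  )
```

Here outlet_summary has one row per outlet, because summarize() collapses each group to a single row.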

Data Wrangling with tidyr

Use the tidyr package to change the layout of data frames.
Use pivot_wider() to go from long to wide format.
Use pivot_longer() to go from wide to long format.
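The two pivot functions are inverses of each other, which the sketch below shows by going from long to wide and back. The data frame and its column names are invented for illustration.

```r
library(tidyr)

# Made-up long-format data: one row per outlet and year.
long <- data.frame(
  outlet   = c("A", "A", "B", "B"),
  year     = c(2020, 2021, 2020, 2021),
  articles = c(10, 12, 7, 9)
)

# Long to wide: each year becomes its own column.
wide <- pivot_wider(long, names_from = year, values_from = articles)

# Wide to long: gather every column except `outlet` back into two columns.
# Note that the year values come back as character strings.
back_to_long <- pivot_longer(
  wide,
  cols      = -outlet,
  names_to  = "year",
  values_to = "articles"
)
```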

Data Visualisation with ggplot2

ggplot2 is a flexible and useful tool for creating plots in R.
The data set and coordinate system can be defined using the ggplot() function.
Additional layers, including geoms, are added using the + operator.
Boxplots are useful for visualizing the distribution of a continuous variable.
Barplots are useful for visualizing categorical data.
Faceting allows you to generate multiple plots based on a categorical variable.
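The points above can be sketched with a small invented data frame: a base plot from ggplot(), layers added with +, a boxplot, a barplot and a faceted version. The variable names and values are assumptions for illustration.

```r
library(ggplot2)

# Made-up data: clip length in minutes by platform and genre.
clips <- data.frame(
  platform = rep(c("Web", "TV"), each = 4),
  genre    = rep(c("News", "Sport"), times = 4),
  minutes  = c(2, 3, 4, 6, 10, 12, 20, 25)
)

# ggplot() sets the data and aesthetic mapping; geoms are added with +.
p_box <- ggplot(data = clips, aes(x = platform, y = minutes)) +
  geom_boxplot()

# A barplot counts the rows in each category.
p_bar <- ggplot(data = clips, aes(x = genre)) +
  geom_bar()

# Faceting draws one panel per level of a categorical variable.
p_facet <- p_box + facet_wrap(~ genre)
```

Printing any of these objects, for example p_facet, draws the plot.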

Getting started with R Markdown (Optional)

R Markdown is a useful format for creating reproducible documents that combine text and executable R code.
Specify chunk options to control the formatting of the output document.
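Chunk options can be set on each chunk individually or once for the whole document. As a sketch of the second approach, the R code below would typically sit in a setup chunk near the top of an R Markdown file; the particular option values are assumptions chosen for illustration.

```r
library(knitr)

# Set default chunk options for every chunk that follows.
opts_chunk$set(
  echo       = FALSE,  # hide the R code in the output document
  message    = FALSE,  # hide messages, such as package start-up notes
  fig.width  = 6,      # figure width in inches
  fig.height = 4       # figure height in inches
)
```

Options given on an individual chunk override these defaults for that chunk only.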

Processing JSON data (Optional)

JSON is a popular format for transferring data and is used by a great many web-based APIs.
The nested structure of a JSON document means that it cannot easily be ‘flattened’ into tabular data.
We can use R code to extract the values of interest and write them to a CSV file.
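A minimal sketch of this extraction, assuming the jsonlite package (the lesson does not say which JSON package it uses here), is shown below. The JSON text, field names and output columns are invented for illustration.

```r
library(jsonlite)

# Made-up JSON: a list of posts with a nested user object and a tag array.
json_text <- '[
  {"id": 1, "user": {"name": "ana"}, "tags": ["news", "tv"]},
  {"id": 2, "user": {"name": "ben"}, "tags": []}
]'

# simplifyVector = FALSE keeps the nested structure as plain R lists.
posts <- fromJSON(json_text, simplifyVector = FALSE)

# Pull out one value per post for each column we want to keep.
flat <- data.frame(
  id        = vapply(posts, function(p) p$id, numeric(1)),
  user_name = vapply(posts, function(p) p$user$name, character(1)),
  n_tags    = vapply(posts, function(p) length(p$tags), integer(1))
)

# Write the flattened table to a CSV file (a temporary file in this sketch).
csv_path <- tempfile(fileext = ".csv")
write.csv(flat, csv_path, row.names = FALSE)
```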

Text as data (Optional)

Using Word Frequencies to Analyse Text

Use the quanteda package to analyse text data.
Use corpus(), tokens(), dfm(), dfm_remove() and stopword lists to prepare text for analysis.
Use textstat_frequency() to investigate the most frequently used tokens or features in a dfm.
Plot frequencies using ggplot2 and the quanteda function textplot_wordcloud().
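The preparation steps above can be sketched on two invented sentences. This assumes quanteda version 3 or later, where textstat_frequency() is provided by the companion package quanteda.textstats; the texts and object names are made up for illustration.

```r
library(quanteda)
library(quanteda.textstats)  # provides textstat_frequency() in quanteda 3+

# Two made-up documents.
texts <- c(
  d1 = "The news is the news.",
  d2 = "Media research uses news data."
)

corp  <- corpus(texts)                       # build a corpus
toks  <- tokens(corp, remove_punct = TRUE)   # split into word tokens
dfmat <- dfm(toks)                           # document-feature matrix (lowercased)
dfmat <- dfm_remove(dfmat, stopwords("en"))  # drop English stopwords

# The most frequent features across all documents.
freq <- textstat_frequency(dfmat, n = 5)
```

The resulting freq object is a data frame, so it can be passed to ggplot() directly. In quanteda 3 and later, textplot_wordcloud() is provided by the quanteda.textplots package and takes the dfm as its input.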