--- title: "sitePickR: Getting Started" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{sitepickR-demo} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} options(rmarkdown.html_vignette.check_title = FALSE) knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.width=8, fig.height=5) knitr::opts_chunk$set( collapse = FALSE, comment = "#>" ) ``` ## Getting Started: Sampling AY 2017-18 California schools and districts __Elena Badillo-Goicoechea, Robert Olsen, and Elizabeth A. Stuart__ 2022-11-29 ## Introduction sitepickR is designed to select a representative sample of sites for a prospective impact evaluation, such as a randomized controlled trial (RCT). The main function in this package, [selectMatch()](https://sitepickr.github.io/sitepickR-website/reference/selectMatch.html) lets the user carry out a two-level sample selection where the possibility of an initially selected unit not wanting to participate is anticipated. The procedure aims to reduce the bias (and/or loss of generalizability to the target population) this could introduce. In selecting units and sub-units, sitepickR uses the cube method (e.g., Deville & Tillé, 2004; Tillé 2011). The cube method is a probability sampling method that is designed to satisfy criteria for balance between the sample and the population. Recent research has shown that this method performs well in simulations for studies of educational programs (Fay & Olsen, under review). To implement the cube method, sitepickerR uses the [sampling](https://cran.r-project.org/package=sampling) R package. Users have the option to select units with equal probabilities or with probabilities proportional to their "size" measured in terms of the number of sub-units nested within units. In addition, sitepickR uses statistical matching to select possible replacement units. In education RCTs, the share of selected districts that agrees to participate tends to be low. To address this challenge, sitepickR selects and ranks up to K replacement districts for each districts selected using the cube method. Replacement districts are selected using statistical matching based on propensity score methods. To implement statistical matching, sitepickR uses the [MatchIt](https://cran.r-project.org/package=MatchIt) R package. sitepickR's core sampling + matching procedure, implemented with the selectMatch() function, consists of four main steps: 1. __Study sample design__, where we: - Identify a target population with a unit : sub-unit nested structure (e.g. [district A : schools in district A]) - Set key parameter values (e.g. initial sample size, desired number of matches per unit) - Specify a set of observable covariates of interest at the unit level - Specify a set of sub-unit level observable covariates of interest 2. Select a random __sample__ of units from the target population. 3. Obtain a list of __best K matches__ for each initially selected unit, where match no. 1 is closest to a given original unit and match no. K is furthest, in terms of the covariates of interest and key parameter values. These K matches will be the potential replacement units for each initially selected unit, in case their corresponding original unit is unable to participate in the study, and they are taken (with or without repetition, which you can set with the [repFlag](https://sitepickr.github.io/sitepickR-website/reference/selectMatch.html) argument) from the pool of units that _did not_ get selected in the initial random sampling procedure. 4. Assess __balance__ and __match quality__ in terms of the covariates of interest: - Balance between initially selected ('original') units and the target population. - Balance between original units and each group of matches (1 to K, from closest to furthest). - Number of successful matches obtained by the procedure, given the restrictions imposed. - Balance between sub-units associated to each unit replacement group and the original sub-units in the population, in terms of available covariates of interest, both at the unit and sub-unit level. This vignette will guide the reader, step-by-step, on sitepickR's basic functionalities, with data from the Common Core of Data (CCD) for California schools (2017-18), using a [pre-processed dataset](https://sitepickr.github.io/sitepickR-website/reference/rawCCD.html) that is included with the package installation. Technical details on each of the package's main functions and on the sample dataset is included in the documentation, which the reader is encouraged to consult, if necessary. ## Common Core of Data The [CCD](http://nces.ed.gov/ccd/) is the U.S. Department of Education's primary database on public elementary and secondary education in the United States. CCD is a comprehensive, annual, national database of all public elementary and secondary schools and school districts. In addition to administrative data, it provides relevant information about schools and districts, disaggregated by demographics (grade, race/ethnicity, and sex) and socioeconomic factors (school-level counts of student eligible for free and reduced-price lunches). ## Package and data set up First, if needed, install the sitepickR package with the help of devtools: ```{r} if(!require(devtools)){ install.packages("devtools", repos = "http://cran.us.r-project.org") } if(!require(sitepickR)){ devtools::install_github("ElenaBadilloG/sitepickR") } ``` And load the package by calling its library: ```{r} library(sitepickR) ``` Now let's load the sample CCD-California 2017-18 dataset that comes with the package: ```{r, results='asis'} rawCCD <- sitepickR::rawCCD knitr::kable(head(rawCCD), format = "html") ``` Process the sample aggregate dataset in order for it to be in the exact format expected by selectMatch by using sitepckr [prepDF()](https://sitepickr.github.io/sitepickR-website/reference/prepDF.html) function. This simple step will re-define some key variables, and will create a new variable, 'unitSize', that will be used in the step of the selectMatch procedure where units are initially selected. Specifically, the cube sampling method implicit in selectMatch allows us to select units in a way that is not biased by its number of sub-units (which we would usually want to impose). This is explained in more detail in this section of the [sampling](https://rdrr.io/cran/sampling/man/balancedcluster.html) documentation. What matters here, is that, having created a unitSize column, with selectMatch we can let the sampling step correct by size by setting the [sizeFlag](https://sitepickr.github.io/sitepickR-website/reference/selectMatch.html) input to TRUE. ```{r, results='asis'} dfCCD <- prepDF(rawCCD, unitID="LEAID", subunitID="NCESSCH") knitr::kable(dfCCD[1:10,(ncol(dfCCD)-5-1):ncol(dfCCD)], format = "html") ``` ## Study sample design ### Define key input values Since there is some underlying random sampling in selectMatch, we can set a seed for replication purposes (optional): ```{r} seed <- 1122 ``` Set distance tolerance for restricted covariates: ```{r} calip <- 0.2 # maximum standard deviations of covariate distance around target population ``` #### Set output sample sizes Set size for: 1) initial unit ("district", in our example) sample, Nu; 2) number of desired matches per unit, K; and 3) the number of sub-units ("schools", in this example) that will be randomly selected for each potential unit: ```{r} Nu <- 100 # district sample size K <- 10 # number of matches per district Ns <- 5 # number of schools per potential district ``` ### Specify covariates of interest Now, specify unit level covariates on which you'll want to match selected units (again "districts" here) with their replacement candidates from the non-selected sample: ```{r} uSampVarsCCD <- c("w.pct.frlunch", "w.pct.black", "w.pct.hisp", "w.pct.female") ``` Similarly, specify sub-unit ("school") level covariates on which you'll want to match districts. In our CCD example, these covariates are just school level aggregates of conceptually the same district level underlying variable (e.g. percent of Hispanic students). However, these could be any other sub-unit level variables available in your own dataset, not necessarily expressing the same variable as some other unit level covariate. ```{r} suSampVarsCCD <- c("sch.pct.frlunch", "sch.pct.black", "sch.pct.hisp", "sch.pct.female") ``` You can (optionally) specify covariates on which to _exactly_ match units. These would usually be categorical covariates, with relatively few categories --otherwise the matching could fail, being too restrictive or even empty: ```{r} exactMatchVars <- c("distr.type") ``` Similarly, you can optionally specify covariates on which to match units within a given numeric caliper, in terms of standard deviations. Such covariates are expected to be numeric variables. In this case, we are interested in caliper-matching on three numeric covariates: percentage of Black, Hispanic, and female students in the district. Notice that the lower the caliper, the more restrictive the matching (which could result in ferwe proportion of successful matches per replacement group, or even empty ones): ```{r} calipMatchVars <- c("w.pct.black", "w.pct.hisp", "w.pct.female") ``` ## Get unit matches We'll now leverage sampling + MatchIt packages joint functionality, and match districts with their candidate replacements among the non-initially selected units using [Mahalanobis distance](https://en.wikipedia.org/wiki/Mahalanobis_distance). We could also do this using propensity scores, as specified on [MatchIt](https://cran.r-project.org/package=MatchIt), and the output structure would be exactly the same: ```{r} smOut <- selectMatch(df = dfCCD, # user dataset unitID = "LEAID", # column name of unit ID in user dataset subunitID = "NCESSCH", # column name of sub-unit ID in user dataset unitVars = uSampVarsCCD, # name of unit level covariate columns subunitSampVars = suSampVarsCCD, # name of sub-unit level covariate columns exactMatchVars = exactMatchVars, # unit level categorical covariates on which to match exactly calipMatchVars = calipMatchVars, # unit level numeric covariates on which to match within a radius nUnitSamp = Nu, # original unit sample size nRepUnits = K, # number of desired matches per initially selected unit nsubUnits = Ns, # number of sub-units to sample from each candidate unit calipValue = calip, # maximum distance on which to match specified unit level covariates (calipMatchVars) seedN = seed, # random seed number matchDistance = "mahalanobis", # metric used for matching units sizeFlag = TRUE, repFlag = FALSE, # pick matches without repetition writeOut = FALSE, # write out a csv file for: 1) matched units and 2) selected sub-units replacementUnitsFilename = "replacementsTable.csv", # filename for {districtA: [districtA replacement list]} table subUnitTableFilename = "schoolsDirectory.csv" # filename for {districtA: [districtA schools]} table ) ``` At this point, the two main outputs of the selectMatch procedure have been produced, and are stored inside the object we' have 've just defined as 'smOut'. In our example, these two main outputs are: 1. A table with 10 matches (replacement candidates) for each of the 100 school districts we initially selected. The matches are ordered from the most to the least 'similar' to their original district (where 'similar' is close, in terms of covariate distance). 2. A table with 5 (or less, when there are less than 5 schools available for a given district) randomly (if applicable) selected schools for each of the 100 original __and__ each of the 100 x 10 candidate districts. Of course, each of these 5 schools are sampled exclusively from those that belong to each school district: ```{r, results='asis'} districtReplTable <- smOut[[1]] knitr::kable(head(districtReplTable), format = "html") ``` ```{r, results='asis'} schoolDirectory <- smOut[[2]] knitr::kable(head(dplyr::filter(schoolDirectory, !is.na(Sub_unit5_ID))), format = "html") ``` Each of these two tables can be automatically stored as .csv files in your local computer, by simply setting writeOut = TRUE in the main function call to selectMatch. In case you use this feature, you can given those two files a specific name, using the [replacementUnitsFilename][repFlag](https://sitepickr.github.io/sitepickR-website/reference/selectMatch.html) and [subUnitTableFilename][repFlag](https://sitepickr.github.io/sitepickR-website/reference/selectMatch.html) selectMatch arguments (or just keeping their default values). ## Assess balance and match quality Now that we have our potential districts replacements, we'd like to know how 'similar' they really are to both the target population and the originally selected sample. Similarly, we'd like to know how well the schools that belong to each of these districts represent the schools of our target population, in terms of our covariates of interest. In the end, what we want is to be able to retain the original representativensess (imposed by our initial random sampling step) in our study design as much as possible, even when units "self-select". sitepickR provides the user with several functions to carry out these balance diagnostics. By default --and following the standard literature-- balance diagnostics in sitepickR are expressed in terms of standardized mean difference (SMD) between two comparison groups of interest. In our case, these two comparison groups we'll be: 1) initially selected districts (or schools) against target population; 2) initially selected districts (or schools) against each of the 10 district replacement groups. In addition, sitepickR has functions to find out how 'successful' the matching procedure was, in terms of how often was it possible to find exactly K, K-1, ..., 1, and 0 matches for the original units --possibly indicating we are asking for 'too much' in terms of maximum distance, or exact matching constraints, and thus, letting us re-adjust our parameters accordingly. ### 1. Original units vs. target population First, we want to look at the overall balance between the initially selected districts (i.e. the group of districts that were selected in the initial cube sampling step of the selectMatch procedure) and __all__ the districts in the population. While no matching has taken place at this point (so, technically, selectMatch has not done its job yet), it might still be of interest to look at this initial balance, in order to have a benchmark for the quality of the subsequent matches: ```{r fig.width=6, fig.height=5} unitLovePlot(smOut, title = "Standardized Mean Difference: \n Initially Selected Units vs. Population") ``` We can also take a look at a table with the actual SMD for each covariate of interest applying the [getSummary()](https://sitepickr.github.io/sitepickR-website/reference/getSummary.html) function, which takes the : ```{r, results='asis'} unitBalanceTab <- getSummary(smOut, diagnostic="unitBal") knitr::kable(head(unitBalanceTab, 10), format = "html") ``` ### 2. Original units vs. replacement candidates Now, to assess the quality of our resulting district matches, we look at the balance (in terms of SMD) between the group of initially selected districts and each of its K replacement districts groups, where group 1 is composed of the first closest matches ordered in terms of distance, group 2 corresponds to the second-best matches, and so on: ```{r fig.width=9, fig.height=6} matchBalance(smOut, title = "Standardized Mean Difference: \ Replacement Unit Groups (1...K) vs. Originally Selected Units") ``` Overall, we expect the SMD to increase from 1 to K. However, the decrease is not necessarily monotonic for each covariate (recall that the distance is computed over __all__ of the covariates, simultaneously!). This diagnostic is quite interesting and relevant since it lets us know the quality of the matching for each replacement group and for each covariate separately. This way we can know more about the strengths and weaknesses of our final study sample. In addition, it let's us diagnose how strict our matching restrictions are overall, given our data. A steeper curve would indicate rapidly match quality is rapidly deteriorating as we go from the K-1 to the K best replacement, while a flatter one would indicate that pretty much all of our candidate matches are similarly good. ```{r, results='asis'} matchBalanceTab <- getSummary(smOut, diagnostic="matchBal") knitr::kable(head(matchBalanceTab, 20), format = "html") ``` ### 3. Successful matches Besides inspecting balance metrics, we might want to look at how many matches were actually computed for our original units. sitepickR lets us do this in both ways: 1. [matchFreq()](https://sitepickr.github.io/sitepickR-website/reference/matchFreq.html): distribution of successful matches across all original units. So this is the frequency (from 1 to K) of computed matches for the initially selected sub-population. This gives us an idea of how successful our matching procedure was, overall. 2. [matchCount()](https://sitepickr.github.io/sitepickR-website/reference/matchCount.html): The number of successful matches between the original districts and replacement candidate districts from the non-selected pool, __for each of their K replacements groups__ (which are by default sorted from most to least similar 1...K). So for the group of best matches (i.e. Replacement group 1), how many matches were computed by selectMatch, how many for the group of second-best matches, and so on. This gives us an idea of how restrictive our parameter settings were, taken together, and given our dataset structure. More specifically, the number of matches for each replacement group is expected to be: 1) decreasing in K, the number of matches (i.e. replacement candidates) we are asking for, and inversely related to restrictions imposed by exact matching, caliper value, and inherent structure and size of the data. #### 3.1 Match histogram First, let's look at the overall match frequency the original units got, overall: ```{r fig.width=7, fig.height=6} matchFreq(smOut, title="Match Frequency per Original Unit") ``` #### 3.2 Successfully computed unit matches per unit group (1...K) Now let's examine the percent of matches for each replacement group, sorted from closest to furthest units. Here, 100% for replacement group 1 means that all units got a best-match matches were successful, 50% for replacement group 2 means that only half of all original units got a 2nd best match, and so on: ```{r fig.width=7, fig.height=6} matchCount(smOut, title="% of Successful Matches per Unit Group") ``` Again, we can, use [getSummary()](https://sitepickr.github.io/sitepickR-website/reference/getSummary.html) to access each of these stats directly in a table: ```{r, results='asis'} matchFreqTab <- getSummary(smOut, diagnostic="matchFreq") knitr::kable(matchFreqTab, format = "html") ``` ```{r, results='asis'} matchCountTab <- getSummary(smOut, diagnostic="matchCount") knitr::kable(matchCountTab, format = "html") ``` ### 4. Sub-units from original vs. candidate units Ultimately, the procedure wants to yield not just districts, but schools that are similar to the target population, in terms of all specified covariates of interest (both the district and school level ones). sitepickR's [subUnitBalance()](https://sitepickr.github.io/sitepickR-website/reference/subUnitBalance.html) function lets us assess balance precisely on this terms: ```{r fig.width=9, fig.height=6} subUnitBalance(smOut, title="Standardized Mean Difference: \n Sub-units from Original + Replacement Unit Groups vs. Population") ``` Similary, use [getSummary()](https://sitepickr.github.io/sitepickR-website/reference/getSummary.html) function to obtain balance directly in a table: ```{r, results='asis'} subUnitBalanceTab <- getSummary(smOut, diagnostic="subunitBal") knitr::kable(subUnitBalanceTab, format = "html") ```