When researchers request access to your data, they may not be granted access to the whole dataset, but only to the variables they will use in their project. In Armadillo, access is regulated at the project level, so you will need to create a view containing only these variables.
You first need to install and load the MolgenisArmadillo package to be able to create the subsets.
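A minimal sketch of this step, assuming the package is installed from CRAN under the name MolgenisArmadillo (the package that provides the armadillo.* functions used on this page):

```r
# Assumption: MolgenisArmadillo is installed from CRAN; adjust if your
# site uses a different repository or a development version.
install.packages("MolgenisArmadillo")

# Load the package so that the armadillo.* functions are available.
library(MolgenisArmadillo)
```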
In order to access the files, you need to log in using the URL of the Armadillo server. A browser window will be opened where you can identify yourself with the ID provider.
armadillo.login("https://armadillo-demo.molgenis.net/")
#> [1] "We're opening a browser so you can log in with code 5FLGYF"
A session will be created and the credentials stored in the environment.
Let’s assume you are in a consortium whose data cannot be shared in its entirety with researchers. You want to share a subset of the whole dataset with certain researchers who have applied for access to your data. There are two ways to do this.
The first is to define the subset in a .csv file. For each research project, create a .csv file containing five columns:
source_folder | source_table | target_folder | target_table | variable |
---|---|---|---|---|
2_1_core_1_0 | yearly_rep | project1 | yearly_vars | green_dist_ |
2_1_core_1_0 | yearly_rep | project1 | yearly_vars | green_size_ |
2_1_core_1_0 | yearly_rep | project1 | yearly_vars | green_access_ |
‘source_folder’ refers to a folder within the master project, ‘source_table’ to the name of a table within this folder, ‘target_folder’ to the name of the new folder within the target project, ‘target_table’ to the name of the new table within ‘target_folder’, and ‘variable’ to a variable within ‘source_table’, with one row per variable (‘source_project’ and ‘target_project’ are specified later).
Note that these columns need to be named exactly as above.
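As an illustration, the .csv file can also be written from R. The sketch below uses base R only and reuses the values from the table above; the file location (a temporary directory) and the check on the header are assumptions added for the example and are not part of Armadillo itself.

```r
# One row per variable; the single-valued columns are recycled.
vars <- data.frame(
  source_folder = "2_1_core_1_0",
  source_table  = "yearly_rep",
  target_folder = "project1",
  target_table  = "yearly_vars",
  variable      = c("green_dist_", "green_size_", "green_access_"),
  stringsAsFactors = FALSE
)

# Hypothetical location; in practice use a path such as "data/subset/vars.csv".
csv_path <- file.path(tempdir(), "vars.csv")
write.csv(vars, csv_path, row.names = FALSE)

# Read the file back and confirm that all five required column names are present.
required_cols <- c("source_folder", "source_table", "target_folder",
                   "target_table", "variable")
check <- read.csv(csv_path, stringsAsFactors = FALSE)
missing_cols <- setdiff(required_cols, names(check))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

Running a check like this before building the subset definition catches a misspelled header early.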
Once you have defined the tables, you can construct the ‘subset_definition’ object. This creates a tibble within R holding the details from the .csv file.
subset_definition <- armadillo.subset_definition(
  reference_csv = "data/subset/vars.csv"
)
subset_definition
#> # A tibble: 3 × 5
#> source_folder source_table target_folder target_table target_vars
#> <chr> <chr> <chr> <chr> <list>
#> 1 2_1-core-1_0 yearlyrep core year_rep <tibble [14 × 1]>
#> 2 1_1-outcome-1_0 yearlyrep outcome year_rep <tibble [9 × 1]>
#> 3 2_1-core-1_0 nonrep core non_rep <tibble [5 × 1]>
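The output above has one row per target table, with the selected variables nested in ‘target_vars’. To preview which views a .csv file would produce, and how many variables each would hold, the rows can be grouped by target folder and table. This is only a base R sketch of that grouping, not the package's implementation; the rows below are hypothetical, and the two ‘nonrep’ variable names are made up for illustration.

```r
# Hypothetical long-format rows, as they would appear in the .csv file.
rows <- data.frame(
  source_folder = "2_1-core-1_0",
  source_table  = c("yearlyrep", "yearlyrep", "yearlyrep", "nonrep", "nonrep"),
  target_folder = "core",
  target_table  = c("year_rep", "year_rep", "year_rep", "non_rep", "non_rep"),
  variable      = c("pets_", "cats_", "dogs_", "var_a", "var_b"),
  stringsAsFactors = FALSE
)

# Group the variables by the view they end up in ("folder/table").
view_name <- paste(rows$target_folder, rows$target_table, sep = "/")
vars_per_view <- split(rows$variable, view_name)

# Number of variables per view.
n_vars <- lengths(vars_per_view)
print(n_vars)
```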
After this, you can create the new subset by passing the subset definition to the armadillo.subset function.
armadillo.subset(
  input_source = "subset_def",
  source_project = "gecko",
  target_project = "study1",
  subset_def = subset_definition
)
#> Created project 'study1' without users
#> ✔ All views were successfully created!
#> ✔ View 'core/year_rep' successfully created
#> ✔ View 'outcome/year_rep' successfully created
#> ✔ View 'core/non_rep' successfully created
This method is generally the best choice if you need to create subsets for multiple tables.
An alternative is to specify the subset in R, via arguments to the armadillo.subset function:
armadillo.subset(
  input_source = "arguments",
  source_project = "gecko",
  source_folder = "2_1-core-1_0",
  source_table = "yearlyrep",
  target_project = "study2",
  target_folder = "core",
  target_table = "year_rep",
  target_vars = c(
    "occup_f1_", "occupcode_f2_", "edu_f1_", "edu_f1_fath", "edu_f2_",
    "edu_f2_fath", "pets_", "cats_", "cats_quant_", "dogs_"
  )
)
#> Created project 'study2' without users
#> ✔ All views were successfully created!
#> ✔ View 'core/year_rep' successfully created
This method may be easier if you only need to create one small subset.
You can now also look at the files in the Armadillo user interface by opening it in a browser window.