Code for writing out example data with entryLabels
For those of you (all?) trying to write out example data with columns matching the entryLabelIfDifferentFromFieldName instead of fieldName, I tweaked the get_wbFields function to return the entryLabels. It's a little trickier than it sounds because there are some fields that will never have an entry label (uid, dataQF, etc.). So this function returns the fieldName if the entryLabel is blank. The cool part about this is that you can put it in now, before the fulcrum app is finished and before you have entryLabels written out, and then just knit again when they are filled in. I've included the function below, but to see how I made this work for my ATBD, see the Aquatic Plant Point Counts ATBD: https://github.com/NEONInc/organismalIPT/blob/master/aquaticPlantPointTransect/apc_pointTransect_ATBD/apc_pointTransect.Rmd
ing_in <- read.delim(paste(myPathToWB, "apc_dataingest_NEON.DOC.003471.txt", sep="/"))
pub_in <- read.delim(paste(myPathToWB, "apc_datapub_NEON.DOC.003472.txt", sep="/"))

ing_in <- apply(ing_in, 2, trimws)
pub_in <- apply(pub_in, 2, trimws)

iTable <- unique(ing_in[,"table"])
pTable <- unique(pub_in[,"table"])

names(iTable) <- c("fieldData", "perTaxon", "voucher")
names(pTable) <- c("fieldData", "perTaxon", "labTaxonomy", "cleanTaxonomy", "voucher")

ingestFields <- get_wbFields(ing_in, tableNicknames = names(iTable)) # returns fieldNames
pubFields <- get_wbFields(pub_in, tableNicknames = names(pTable))
get_wbFields_entryLabel <- function(wb, tableNicknames = NULL){
  tables <- unique(wb[,"table"])
  entryLabels <- NULL
  fields <- NULL
  results <- NULL
  if (is.null(tableNicknames)){
    tableNicknames <- tables
  }
  for (i in 1:length(tables)){ # each table
    t <- tables[i]
    tn <- tableNicknames[i]
    entryLabels[[tn]] <- wb[which(wb[,"table"]==t), "entryLabelIfDifferentFromFieldName"]
    fields[[tn]] <- wb[which(wb[,"table"]==t), "fieldName"]
    results[[tn]] <- entryLabels[[tn]]
    # fall back to the fieldName wherever the entryLabel is blank (NA or empty string)
    blank <- which(is.na(entryLabels[[tn]]) | entryLabels[[tn]] == "")
    results[[tn]][blank] <- fields[[tn]][blank]
  }
  return(results)
}

ingestFields_entryLabel <- get_wbFields_entryLabel(ing_in, tableNicknames = names(iTable)) # returns entryLabels unless blank, then fieldName
Then, at the end of the ATBD, when writing out the example data, call ingestFields_entryLabel to write out the ingest tables:
write.csv(inFieldData[, c('transitionID', ingestFields_entryLabel[["fieldData"]])],
          file = paste(myPathToCIFiles, 'golden_apc_pointTransect_in.csv', sep="/"),
          row.names = FALSE, na = "")
smsFieldnames
The ingestWB uploader requires a smsFieldname for all fields that are smsOnly. If the field is a fate, put in the string 'fate'. You do not need to put in a modifier (e.g. fieldSampleIDFate), because the sampleGroups will tell which sample it applies to. If your smsOnly field is something other than a fate, contact DPS and we will tell you what other terms work.
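A minimal sketch of what this looks like in practice, assuming the workbook has been read in as a data frame; the column names ('smsOnly', 'smsFieldname') and the 'Y' flag value here are assumptions based on this post, not confirmed:

wb <- read.delim("myIngestWorkbook.txt", stringsAsFactors = FALSE)  # hypothetical file
# flag the smsOnly fate fields (assumes fate fields end in 'Fate' and smsOnly is Y/N)
fateFields <- grepl("Fate$", wb$fieldName) & wb$smsOnly == "Y"
# the uploader just needs the string 'fate' here, no modifier
wb$smsFieldname[fateFields] <- "fate"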
golden data files - naming convention
For everyone's sanity, we're trying for a single naming convention for golden data files. Put them in a single folder called CI_files, with nothing else in it. There should be one file for each table (at least; see below), both L0 and L1 tables. Name them golden_tableName.csv.
If and only if you have a small test dataset CI should start with, and then a larger dataset to run through the code once it's been written, you can have multiple files per table. Number them golden_tableName1.csv, golden_tableName2.csv, etc., and CI will start with 1.
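A minimal sketch of writing out files under this convention, assuming your golden tables are held in a named list (the object names here are hypothetical):

goldenTables <- list(fieldData = goldenFieldData, perTaxon = goldenPerTaxon)  # hypothetical objects
dir.create("CI_files", showWarnings = FALSE)
for (tn in names(goldenTables)) {
  write.csv(goldenTables[[tn]],
            file = file.path("CI_files", paste0("golden_", tn, ".csv")),
            row.names = FALSE, na = "")
}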
source & inputs columns
For the source and inputs columns in pub workbooks:
source should be the table name of the table in the ingest workbook that the pub field is derived from. inputs should be the field name that the pub field is derived from. In the case of spatial data (domainID, siteID, etc), source should still be the table name from the ingest workbook, but inputs should be namedLocation (as in, that exact text). This is because CI doesn’t look for spatial data by looking at L0 fields, they look it up in their spatial data tables, based on the named location associated with the data.
These changes have been made for records in Airtable. Note that for spatial fields, ingestInput should indicate the namedLocation field (e.g. plotID).
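As a hedged illustration of the two cases (the table and field names here are hypothetical):

pubMap <- data.frame(
  fieldName = c("percentCover", "siteID"),
  source    = c("apc_fieldData", "apc_fieldData"),  # ingest table the pub field is derived from
  inputs    = c("percentCover", "namedLocation")    # spatial fields get the literal text 'namedLocation'
)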
pubFormat = integer for fields with dataType (unsigned) integer
All publication fields in Airtable with dataType == unsigned integer or integer have had pubFormat set to integer.
STARTS_WITH and ENDS_WITH
Surprise! Corey implemented STARTS_WITH and ENDS_WITH in the parser function library v 1.0, though they weren't technically scheduled until v 2.0. Feel free to use them. parserFunctionLibrary.txt has been updated to reflect this change.
This is useful if you have sampleIDs that follow some pattern. For example, if your subsampleID is a concatenation of your sampleID plus some additional characters, then in addition to checking the regex on the overall format, you can check that
subsampleID STARTS_WITH(sampleID)
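If you want to sanity-check your golden data against that rule before handing it off, here's a minimal R sketch (the data frame and field names are assumptions):

# flag any subsampleIDs that would fail STARTS_WITH(sampleID)
bad <- !startsWith(df$subsampleID, df$sampleID)
df[bad, c("sampleID", "subsampleID")]  # rows that would fail the check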
sample fate LOVs
The sample fates post has been updated; it now describes all possible sample fates.
If you want techs to be able to enter a sample fate, instead of defaulting the value, you'll need an LOV. Think about which fates are possible at the point when the techs are working; e.g., "released" is not an option when collecting litter from traps. Also consult with others about making shared LOVs; there are probably a few common subsets of the sample fate list that will cover all the cases.
validating named locations
Update! Ross says the named location type for external labs is "External lab".
validating named locations
You can (and should!) use the named location validation to validate any named locations in your ingest, EVEN IF those locations aren't the "location of record" in the database. The way this will most often come up is when ingesting data from an external lab. The location on the data is the location where the sample was collected, let's say plotID. But you're also ingesting the lab's name, usually laboratoryName. In entryValidationRulesParser, put [NAMED_LOCATION_TYPE(External lab)]. Don't use an LOV to validate the external lab name.
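So the relevant workbook entry would look something like this (a sketch; only the rule text itself comes from this post):

fieldName:                  laboratoryName
entryValidationRulesParser: [NAMED_LOCATION_TYPE(External lab)]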
new rules for sampleGroups
New guidance from Ross on sampleGroups: we need to provide a value in sampleGroups for any field that goes to the sample management system. We had been populating sampleGroups only for sample IDs, fates, and barcodes, which is fine if that's all you're sending to SMS. But if, say, processedDate is going to SMS as well, populate sampleGroups to indicate which sample in the table it's associated with.
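A hypothetical excerpt of what that means in a workbook, where processedDate now also gets a sampleGroups value tying it to the same sample as the ID, fate, and barcode (smsFieldName values other than 'fate' are placeholders; check with DPS):

fieldName            sampleGroups   smsFieldName
fieldSampleID        1              <per DPS>
fieldSampleFate      1              fate
fieldSampleBarcode   1              <per DPS>
processedDate        1              <per DPS>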
Figuring out which fields should go to SMS is an entirely separate problem, of course. Use your best judgement for now, and in the new system, updating a workbook to populate sampleGroups and smsFieldName for a field where they were NA before shouldn't be a huge deal.
Ingest prep checklist and transposed template have been updated with this change.
Another step when updating previous ATBD template
You do not need to create separate 'error' vs. 'no error' output files for the pub. Just output one pub file that looks exactly how you expect the pub to look after running the code on your golden input (e.g., it contains rows with and without quality flags and different outcomes for the various steps of your ATBD; try to ensure that multiple cases are covered). Remove code from the ATBD template that creates anything more than your 'golden' pub output.
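In other words, the tail of your ATBD should reduce to a single write-out per pub table, something like this sketch (the object and file names are hypothetical):

# one golden pub file per table, mixing flagged and unflagged rows
write.csv(pubFieldData,
          file = paste(myPathToCIFiles, "golden_apc_fieldData_pub.csv", sep="/"),
          row.names = FALSE, na = "")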
Test datasets
ATBD authors: Some wisdom on test datasets
The current plan for testing in the new CI pipeline anticipates your test dataset will be ingested into a CI test DB. This means it must look EXACTLY like the data going in via fulcrum and/or the spreadsheet, and the outputs must look EXACTLY as they would coming out.
To this end, you will need to make the namedLocations in your golden input look like a real CI named location (e.g. UKFS_010.mammalGrid.mam, NOT UKFS_010; this is also needed to join with the spatial data via the API, see below). You will also need to make your dateTimes look as they would coming in via spreadsheet or via fulcrum (YYYY-MM-DD or YYYY-MM-DDTHH:MM). Please bear this in mind when making test sets and ATBDs going forward.
You'll also need to make your golden datasets contain spatial data that matches the expected output from CI. The only way to ensure this actually happens is to use the same spatial data. With the release of CI's new API, I wrote some code to pull spatial data directly from the CI servers, which should help with the alignment of test vs. real outputs in the algorithm testing phase. It is in devTOS->atbdlibrary->get_localityInfo. Given a set of named locations (and yes, they must be real named locations, not plotIDs), it will return the lat/long/elev and a bunch of other stuff CI has stored. Please use this going forward rather than faking and/or taking a snapshot of the spatial data. Remember: to use the new functions you'll need to reinstall the library. Note that geodeticDatum doesn't seem to be available via the API, so you may need to 'fake' that one in your test datasets too. If you want some example code/workflow, look in the rpt ATBD; it's working there.
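A hedged sketch of that workflow (the exact signature of get_localityInfo may differ; check the function in devTOS before leaning on this):

library(atbdLibrary)  # reinstall from devTOS first to pick up get_localityInfo
locs <- unique(goldenL1$namedLocation)  # must be real named locations, e.g. UKFS_010.mammalGrid.mam
spatialData <- get_localityInfo(locs)   # pulls lat/long/elev etc. from the CI API
spatialData$geodeticDatum <- "WGS84"    # not available via the API, so faked here (assumed value)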
FYI for those using Sarah's awesome post to update existing ATBDs: for step #8, if you want the code to run, you need to either remove the new term 'ADlist' or create it (see chunk 7 of the ATBD skeleton, lines 259-285).
publication "usage" update
Small update to the publication workbook: we’ve dropped “transition” as an option in the “usage” column. It’s redundant with usage=“both” and it’s not terribly useful. “both” and “publication” are the options for usage, and should be applied at the table level, i.e. usage shouldn’t have different values for different fields within a table.
How to update your existing ATBD in 8 ...or maybe 17... easy steps.
If you are starting an ATBD anew, just pull and clone the ATBD library and ignore this.  If you started an ATBD and have noticed changes in the template, here’s what you need to do to be Agile compliant without starting over.  Mostly it’s careful copy and pasting.  
1. Replace your logo.png with the new one, which is at ~devTOS\atbdLibrary\inst\rmarkdown\templates\atbdTemplate\skeleton\logo.png. If you want to check that you did it right: the new logo won't have the trademark in it.
2. Replace your first section (starting with --- fontsize: 11pt THROUGH word_document: default ---) with the corresponding section in ~devTOS\atbdLibrary\inst\rmarkdown\templates\atbdTemplate\skeleton\skeleton.Rmd
3. Replace your section (starting with [//]: TEMPLATE SECTION 1 THROUGH [//]: TEMPLATE SECTION 2) with the corresponding section in ~devTOS\atbdLibrary\inst\rmarkdown\templates\atbdTemplate\skeleton\skeleton.Rmd
4. Search for ‘Remove the next three lines for ATBDs’, and delete the next 3 lines
5. Replace your section (starting with ## PURPOSE THROUGH ## SCOPE) with the corresponding section in: ~devTOS\atbdLibrary\inst\rmarkdown\templates\atbdTemplate\skeleton\skeleton.Rmd
6. Delete the variables reported table
7. Add a final sentence to the variables reported section: Some variables described in this document may be for NEON internal use only and will not appear in downloaded data. These are indicated with **downloadPkg** = "none" in `r pubName` (`r ADlist["pub","ref"]`). You may need to adjust the reference in that sentence to match however you set up the reference to your pub workbook.
8. Copy in the new data constraints and validation sections (copy and paste from skeleton.Rmd to replace your existing text; but look before you paste over, if you want to retain any notes to your fulcrum buddy in RED). You should now have sentences about ## User Interface Specifications: all forms, not things about webUIs vs MDRs. Your new section should end with '1. All date fields can be entered as dates or dateTimes, the parser will interpret whether time is included based on the formatting.'
9. (updated 10/5/2016) smsOnly fields can occur in your example data. You may want to remove them when simulating the parser steps (e.g. before you start implementing your algorithm), since these fields will be ignored in any de-duping, etc., as they will not be available in PDR.
More steps added 10/3/2016
10. Adjust your code so it writes out the namedLocation in the L1 goldenData
11. Reformat dateTime fields as necessary to match the preferred CI formatting (for both of these steps, see the sketch after this list)
12. Make sure your namedLocations are REAL ones that exist, and that you are using the API to populate things looked up from the spatial data table.
13. Make sure your Equals-type samples [EXIST], if specified by the workflow
14. Samples -> make sure you are passing both the barcode and the id (but not the fate)
15. For any calculations/logic done on sampleIDs, paste in the example syntax for the algorithm implementation from the skeleton ('In every instance in the algorithm in which a sample tag (generally corresponding to a fieldName of the form xxxSampleID) is used to look up data records, the lookup should be first attempted via the sample barcode. If the sample barcode is not populated, proceed using the sample tag.')
16. Add the text ('Populate the location description values…') and code from the skeleton to populate the publication location-y things (domainID, plotID, locationID, etc.). Copy and paste the sentence from the template that begins 'The named location for each'
17. Make sure your de-duping says whether to treat NULL values as different, or resolve, and that the code and language match.
Updates 10/10/2016
18. Delete the section on sample creation rules, which formerly started with '## Sample creation rules'
19. It is not necessary to include a list of fields that are NOT passed from L0 to L1 (though if you have it you can keep it; it can be hard to keep up to date)
20. Add transitionID to the golden L0 and L1
21. Make sure column headers on golden_in match entryLabelIfDifferentFromFieldName
22. Specify whether you want fields that are NOT passed L0-> L1 in the dedupe check
23. Put your testing files in CI_files subdirectory and name them correctly and clean out any extra bonus files on there so there’s no confusion
24. If you have taxon fuzzing, copy in the new syntax with namedLocation instead of dXX, and where the redaction is folded in. If you are copying and pasting from the template, the sentence starts with: For each record *p* of `r pTable["id"]` where **targetTaxaPresent** is 'Yes'
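For steps 10 and 11, the gist is a couple of lines like these (a sketch only; the field names, named location pattern, and timezone are assumptions):

# step 10: write real named locations into the golden L1, not bare plotIDs
goldenL1$namedLocation <- paste(goldenL1$plotID, "mammalGrid.mam", sep = ".")  # hypothetical pattern
# step 11: reformat dateTimes to the CI-preferred format
goldenL1$collectDate <- format(as.POSIXct(goldenL1$collectDate, tz = "UTC"), "%Y-%m-%dT%H:%M")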
dataQFs, fates, and barcodes in publication workbooks & ATBDs
Updates to the publication workbook are almost complete, watch for a Git announcement. In the meantime, we've had questions about dataQFs and sample fates and barcodes.
dataQF: (1) if tables in the ingest and pub are one-to-one, pass dataQFs through to the pub. (2) if there are more tables in the pub than in the ingest, pass the dataQF to the table that makes the most sense, or replicate it if that makes sense. (3) if there are more tables in the ingest than the pub, pass all dataQFs through, but rename them so it's clear where they're coming from (e.g. subsamplingDataQF, analyticalDataQF, etc.).
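Case (3) in practice is just a rename before the ingest tables are merged into the pub table, e.g. (table names here are hypothetical):

names(subsampleTable)[names(subsampleTable) == "dataQF"] <- "subsamplingDataQF"
names(analyticalTable)[names(analyticalTable) == "dataQF"] <- "analyticalDataQF"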
barcodes: pass the barcodes through to the pub. Put downloadPkg=none so it doesn't show up on download, but this way there will be a place for it in the future.
fates: generally shouldn't go into the pub. Usually sampleCompromised or something similar is more useful to end users. If you have an exception, consult Sarah or Claire.
ATBD wording re: downloadPkg=none variables
Kim added a brief explanatory chunk to the "Variables Reported" section of the ATBD template (where the table used to be) to explain that not all fields in the ATBD may appear in downloaded data (e.g. dataQF). Here it is in Rmarkdown:
Some variables described in this document may be for NEON internal use only and will not appear in downloaded data. These are indicated with **downloadPkg** = "none" in `r pubName` (`r ADlist["pub","ref"]`).
which appears as (using ticks as an example):
Some variables described in this document may be for NEON internal use only and will not appear in downloaded data. These are indicated with downloadPkg = "none" in NEON Data Publication Workbook for Tick Sampling (AD[09]).