Code for writing out example data with entryLabels
For those of you (all?) trying to write out example data with columns matching the entryLabelIfDifferentFromFieldName instead of fieldName, I tweaked the get_wbFields function to return the entryLabels. It's a little trickier than it sounds because there are some fields that will never have an entry label (uid, dataQF, etc.). So this function returns the fieldName if the entryLabel is blank. The cool part about this is that you can put it in now, before the fulcrum app is finished and before you have entryLabels written out, and then just knit again when they are filled in. I've included the function below, but to see how I made this work for my ATBD, see the Aquatic Plant Point Counts ATBD: https://github.com/NEONInc/organismalIPT/blob/master/aquaticPlantPointTransect/apc_pointTransect_ATBD/apc_pointTransect.Rmd
ing_in <- read.delim(paste(myPathToWB, "apc_dataingest_NEON.DOC.003471.txt", sep="/"))
pub_in <- read.delim(paste(myPathToWB, "apc_datapub_NEON.DOC.003472.txt", sep="/"))

ing_in <- apply(ing_in, 2, trimws)
pub_in <- apply(pub_in, 2, trimws)

iTable <- unique(ing_in[,"table"])
pTable <- unique(pub_in[,"table"])

names(iTable) <- c("fieldData", "perTaxon", "voucher")
names(pTable) <- c("fieldData", "perTaxon", "labTaxonomy", "cleanTaxonomy", "voucher")

ingestFields <- get_wbFields(ing_in, tableNicknames = names(iTable)) # returns fieldNames
pubFields <- get_wbFields(pub_in, tableNicknames = names(pTable))
get_wbFields_entryLabel <- function(wb, tableNicknames = NULL){
  tables <- unique(wb[,"table"])
  entryLabels <- NULL
  fields <- NULL
  results <- NULL
  if (is.null(tableNicknames)){
    tableNicknames <- tables
  }
  for (i in 1:length(tables)){ # each table
    t <- tables[i]
    tn <- tableNicknames[i]
    entryLabels[[tn]] <- wb[which(wb[,"table"]==t), "entryLabelIfDifferentFromFieldName"]
    fields[[tn]] <- wb[which(wb[,"table"]==t), "fieldName"]
    results[[tn]] <- entryLabels[[tn]]
    # fall back to the fieldName wherever the entryLabel is blank (NA or empty string)
    blank <- which(is.na(entryLabels[[tn]]) | entryLabels[[tn]] == "")
    results[[tn]][blank] <- fields[[tn]][blank]
  }
  return(results)
}

ingestFields_entryLabel <- get_wbFields_entryLabel(ing_in, tableNicknames = names(iTable)) # returns entryLabels unless blank, then fieldName
Then, at the end of the ATBD, when writing out the example data, call ingestFields_entryLabel to write out the ingest tables:
write.csv(inFieldData[, c('transitionID', ingestFields_entryLabel[["fieldData"]])],
          file = paste(myPathToCIFiles, 'golden_apc_pointTransect_in.csv', sep="/"),
          row.names = FALSE, na = "")
smsFieldnames
The ingestWB uploader requires a smsFieldname for all fields that are smsOnly. If the field is a fate, put in the string 'fate'. You do not need to put in a modifier (e.g. fieldSampleIDFate), because the sampleGroups will tell which sample it applies to. If your smsOnly field is something other than a fate, contact DPS and we will tell you what other terms work.
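A minimal sketch of what this looks like in practice, assuming the workbook has been read in as a data frame; the column names ('smsOnly', 'smsFieldname') and the 'Y' flag value here are assumptions based on this post, not confirmed:

wb <- read.delim("myIngestWorkbook.txt", stringsAsFactors = FALSE)  # hypothetical file
# flag the smsOnly fate fields (assumes fate fields end in 'Fate' and smsOnly is Y/N)
fateFields <- grepl("Fate$", wb$fieldName) & wb$smsOnly == "Y"
# the uploader just needs the string 'fate' here, no modifier
wb$smsFieldname[fateFields] <- "fate"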
golden data files - naming convention
For everyone's sanity, we're trying for a single naming convention for golden data files. Put them in a single folder called CI_files, with nothing else in it. There should be one file for each table (at least; see below), both L0 and L1 tables. Name them golden_tableName.csv.
If and only if you have a small test dataset CI should start with, and then a larger dataset to run through the code once it's been written, you can have multiple files per table. Number them golden_tableName1.csv, golden_tableName2.csv, etc., and CI will start with 1.
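A minimal sketch of writing out files under this convention, assuming your golden tables are held in a named list (the object names here are hypothetical):

goldenTables <- list(fieldData = goldenFieldData, perTaxon = goldenPerTaxon)  # hypothetical objects
dir.create("CI_files", showWarnings = FALSE)
for (tn in names(goldenTables)) {
  write.csv(goldenTables[[tn]],
            file = file.path("CI_files", paste0("golden_", tn, ".csv")),
            row.names = FALSE, na = "")
}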
source & inputs columns
For the source and inputs columns in pub workbooks:
source should be the table name of the table in the ingest workbook that the pub field is derived from. inputs should be the field name that the pub field is derived from. In the case of spatial data (domainID, siteID, etc), source should still be the table name from the ingest workbook, but inputs should be namedLocation (as in, that exact text). This is because CI doesn’t look for spatial data by looking at L0 fields, they look it up in their spatial data tables, based on the named location associated with the data.
These changes have been made for records in Airtable. Note that for spatial fields, ingestInput should indicate the namedLocation field (e.g. plotID).
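As a hedged illustration of the two cases (the table and field names here are hypothetical):

pubMap <- data.frame(
  fieldName = c("percentCover", "siteID"),
  source    = c("apc_fieldData", "apc_fieldData"),  # ingest table the pub field is derived from
  inputs    = c("percentCover", "namedLocation")    # spatial fields get the literal text 'namedLocation'
)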
pubFormat = integer for fields with dataType (unsigned) integer
All publication fields in Airtable with dataType == unsigned integer or integer have had pubFormat set to integer.
STARTS_WITH and ENDS_WITH
Surprise! Corey implemented STARTS_WITH and ENDS_WITH in the parser function library v 1.0, though they weren't technically scheduled until v 2.0. Feel free to use them. parserFunctionLibrary.txt has been updated to reflect this change.
This is useful if you have sampleIDs that follow some pattern. For example, if your subsampleID is a concatenation of your sampleID plus some additional characters, then in addition to checking the regex on the overall format, you can check that
subsampleID STARTS_WITH(sampleID)
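If you want to sanity-check your golden data against that rule before handing it off, here's a minimal R sketch (the data frame and field names are assumptions):

# flag any subsampleIDs that would fail STARTS_WITH(sampleID)
bad <- !startsWith(df$subsampleID, df$sampleID)
df[bad, c("sampleID", "subsampleID")]  # rows that would fail the check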
sample fate LOVs
The sample fates post has been updated; it now describes all possible sample fates.
If you want techs to be able to enter a sample fate, instead of defaulting the value, you'll need an LOV. Think about which fates are possible at the point when the techs are working; e.g., "released" is not an option when collecting litter from traps. Also consult with others about making shared LOVs; there are probably a few common subsets of the sample fate list that will cover all the cases.
validating named locations
Update! Ross says the named location type for external labs is "External lab".
validating named locations
You can (and should!) use the named location validation to validate any named locations in your ingest, EVEN IF those locations aren't the "location of record" in the database. The way this will most often come up is when ingesting data from an external lab. The location on the data is the location where the sample was collected, let's say plotID. But you're also ingesting the lab's name, usually laboratoryName. In entryValidationRulesParser, put [NAMED_LOCATION_TYPE(External lab)]. Don't use an LOV to validate the external lab name.
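So the relevant workbook entry would look something like this (a sketch; only the rule text itself comes from this post):

fieldName:                  laboratoryName
entryValidationRulesParser: [NAMED_LOCATION_TYPE(External lab)]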
new rules for sampleGroups
New guidance from Ross on sampleGroups: we need to provide a value in sampleGroups for any field that goes to the sample management system. We had been populating sampleGroups only for sample IDs, fates, and barcodes, which is fine if that's all you're sending to SMS. But if, say, processedDate is going to SMS as well, populate sampleGroups to indicate which sample in the table it's associated with.
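A hypothetical excerpt of what that means in a workbook, where processedDate now also gets a sampleGroups value tying it to the same sample as the ID, fate, and barcode (smsFieldName values other than 'fate' are placeholders; check with DPS):

fieldName            sampleGroups   smsFieldName
fieldSampleID        1              <per DPS>
fieldSampleFate      1              fate
fieldSampleBarcode   1              <per DPS>
processedDate        1              <per DPS>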
Figuring out which fields should go to SMS is an entirely separate problem, of course. Use your best judgement for now, and in the new system, updating a workbook to populate sampleGroups and smsFieldName for a field where they were NA before shouldn't be a huge deal.
Ingest prep checklist and transposed template have been updated with this change.
Another step when updating previous ATBD template
You do not need to create separate 'error' vs. 'no error' output files for the pub. Just output one pub file that looks exactly how you expect the pub to look after running the code on your golden input (e.g., it contains rows with and without quality flags and different outcomes for the various steps of your ATBD; try to ensure that multiple cases are covered). Remove code from the ATBD template that creates anything more than your 'golden' pub output.
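In other words, the tail of your ATBD should reduce to a single write-out per pub table, something like this sketch (the object and file names are hypothetical):

# one golden pub file per table, mixing flagged and unflagged rows
write.csv(pubFieldData,
          file = paste(myPathToCIFiles, "golden_apc_fieldData_pub.csv", sep="/"),
          row.names = FALSE, na = "")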
Test datasets
ATBD authors: Some wisdom on test datasets
The current plan for testing in the new CI pipeline anticipates your test dataset will be ingested into a CI test DB. This means it must look EXACTLY like the data going in via fulcrum and/or the spreadsheet, and the outputs must look EXACTLY as they would coming out.
To this end, you will need to make the namedLocations in your golden input look like a real CI named location (e.g. UKFS_010.mammalGrid.mam, NOT UKFS_010; this is also needed to join with the spatial data via the API, see below). You will also need to make your dateTimes look as they would coming in via spreadsheet or via fulcrum (YYYY-MM-DD or YYYY-MM-DDTHH:MM). Please bear this in mind when making test sets and ATBDs going forward.
You'll also need to make your golden datasets contain spatial data that matches the expected output from CI. The only way to ensure this actually happens is to use the same spatial data. With the release of CI's new API, I wrote some code to pull spatial data directly from the CI servers, which should help with the alignment of test vs. real outputs in the algorithm testing phase. It is in devTOS->atbdlibrary->get_localityInfo. Given a set of named locations (and yes, they must be real named locations, not plotIDs), it will return the lat/long/elev and a bunch of other stuff CI has stored. Please use this going forward rather than faking and/or taking a snapshot of the spatial data. Remember: to use the new functions you'll need to reinstall the library. Note that geodeticDatum doesn't seem to be available via the API, so you may need to 'fake' that one in your test datasets too. If you want some example code/workflow, look in the rpt ATBD; it's working there.
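A hedged sketch of that workflow (the exact signature of get_localityInfo may differ; check the function in devTOS before leaning on this):

library(atbdLibrary)  # reinstall from devTOS first to pick up get_localityInfo
locs <- unique(goldenL1$namedLocation)  # must be real named locations, e.g. UKFS_010.mammalGrid.mam
spatialData <- get_localityInfo(locs)   # pulls lat/long/elev etc. from the CI API
spatialData$geodeticDatum <- "WGS84"    # not available via the API, so faked here (assumed value)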
FYI for those using Sarah's awesome post to update existing ATBDs: for step #8, if you want the code to run, you need to either remove the new term 'ADlist' or create it (see chunk 7 of the ATBD skeleton, lines 259-285).
publication "usage" update
Small update to the publication workbook: we’ve dropped “transition” as an option in the “usage” column. It’s redundant with usage=“both” and it’s not terribly useful. “both” and “publication” are the options for usage, and should be applied at the table level, i.e. usage shouldn’t have different values for different fields within a table.
How to update your existing ATBD in 8 ...or maybe 17... easy steps.
If you are starting an ATBD anew, just pull and clone the ATBD library and ignore this.  If you started an ATBD and have noticed changes in the template, here’s what you need to do to be Agile compliant without starting over.  Mostly it’s careful copy and pasting.  
1. Replace your logo.png with the new one, which is at ~devTOS\atbdLibrary\inst\rmarkdown\templates\atbdTemplate\skeleton\logo.png. If you want to check that you did it right: the new logo won't have the trademark in it.
2. Replace your first section (starting with --- fontsize: 11pt THROUGH word_document: default ---) with the corresponding section in ~devTOS\atbdLibrary\inst\rmarkdown\templates\atbdTemplate\skeleton\skeleton.Rmd
3. Replace your section (starting with [//]: TEMPLATE SECTION 1 THROUGH [//]: TEMPLATE SECTION 2) with the corresponding section in ~devTOS\atbdLibrary\inst\rmarkdown\templates\atbdTemplate\skeleton\skeleton.Rmd
4. Search for ‘Remove the next three lines for ATBDs’, and delete the next 3 lines
5. Replace your section (starting with ## PURPOSE THROUGH ## SCOPE) with the corresponding section in: ~devTOS\atbdLibrary\inst\rmarkdown\templates\atbdTemplate\skeleton\skeleton.Rmd
6. Delete the variables reported table
7. Add a final sentence to the variables reported section: Some variables described in this document may be for NEON internal use only and will not appear in downloaded data. These are indicated with **downloadPkg** = "none" in `r pubName` (`r ADlist["pub","ref"]`). You may need to adjust the reference in that sentence to match however you set up the reference to your pub workbook.
8. Copy in the new data constraints and validation sections (copy and paste from skeleton.Rmd to replace your existing text; but look before you paste over, if you want to retain any notes to your fulcrum buddy in RED). You should now have sentences about ## User Interface Specifications: all forms, not things about webUIs vs MDRs. Your new section should end with '1. All date fields can be entered as dates or dateTimes, the parser will interpret whether time is included based on the formatting.'
9. (updated 10/5/2016) smsOnly fields can occur in your example data. You may want to remove them when simulating the parser steps (e.g. before you start implementing your algorithm), since these fields will be ignored in any de-duping, etc., as they will not be available in PDR.
More steps added 10/3/2016
10. Adjust your code so it writes out the namedLocation in the L1 goldenData
11. Reformat dateTime fields as necessary to match the preferred CI formatting (for both of these steps, see the sketch after this list)
12. Make sure your namedLocations are REAL ones that exist, and that you are using the API to populate things looked up from the spatial data table.
13. Make sure your Equals-type samples [EXIST], if specified by the workflow
14. Samples -> make sure you are passing both the barcode and the id (but not the fate)
15. For any calculations/logic done on sampleIDs, paste in the example syntax for the algorithm implementation from the skeleton ('In every instance in the algorithm in which a sample tag (generally corresponding to a fieldName of the form xxxSampleID) is used to look up data records, the lookup should be first attempted via the sample barcode. If the sample barcode is not populated, proceed using the sample tag.')
16. Add the text ('Populate the location description values…') and code from the skeleton to populate the publication location-y things (domainID, plotID, locationID, etc.). Copy and paste the sentence from the template that begins 'The named location for each'
17. Make sure your de-duping says whether to treat NULL values as different, or resolve, and that the code and language match.
Updates 10/10/2016
18. Delete the section on sample creation rules, which formerly started with '## Sample creation rules'
19. It is not necessary to include a list of fields that are NOT passed from L0 to L1 (though if you have it you can keep it; it can be hard to keep up to date)
20. Add transitionID to the golden L0 and L1
21. Make sure column headers on golden_in match entryLabelIfDifferentFromFieldName
22. Specify whether you want fields that are NOT passed L0-> L1 in the dedupe check
23. Put your testing files in CI_files subdirectory and name them correctly and clean out any extra bonus files on there so there’s no confusion
24. If you have taxon fuzzing, copy in the new syntax with namedLocation instead of dXX, and where the redaction is folded in. If you are copying and pasting from the template, the sentence starts with: For each record *p* of `r pTable["id"]` where **targetTaxaPresent** is 'Yes'
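For steps 10 and 11, the gist is a couple of lines like these (a sketch only; the field names, named location pattern, and timezone are assumptions):

# step 10: write real named locations into the golden L1, not bare plotIDs
goldenL1$namedLocation <- paste(goldenL1$plotID, "mammalGrid.mam", sep = ".")  # hypothetical pattern
# step 11: reformat dateTimes to the CI-preferred format
goldenL1$collectDate <- format(as.POSIXct(goldenL1$collectDate, tz = "UTC"), "%Y-%m-%dT%H:%M")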
dataQFs, fates, and barcodes in publication workbooks & ATBDs
Updates to the publication workbook are almost complete, watch for a Git announcement. In the meantime, we've had questions about dataQFs and sample fates and barcodes.
dataQF: (1) if tables in the ingest and pub are one-to-one, pass dataQFs through to the pub. (2) if there are more tables in the pub than in the ingest, pass the dataQF to the table that makes the most sense, or replicate it if that makes sense. (3) if there are more tables in the ingest than the pub, pass all dataQFs through, but rename them so it's clear where they're coming from (e.g. subsamplingDataQF, analyticalDataQF, etc.).
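Case (3) in practice is just a rename before the ingest tables are merged into the pub table, e.g. (table names here are hypothetical):

names(subsampleTable)[names(subsampleTable) == "dataQF"] <- "subsamplingDataQF"
names(analyticalTable)[names(analyticalTable) == "dataQF"] <- "analyticalDataQF"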
barcodes: pass the barcodes through to the pub. Put downloadPkg=none so it doesn't show up on download, but this way there will be a place for it in the future.
fates: generally shouldn't go into the pub. Usually sampleCompromised or something similar is more useful to end users. If you have an exception, consult Sarah or Claire.
ATBD wording re: downloadPkg=none variables
Kim added a brief explanatory chunk to the "Variables Reported" section of the ATBD template (where the table used to be) to explain that not all fields in the ATBD may appear in downloaded data (e.g. dataQF). Here it is in Rmarkdown:
Some variables described in this document may be for NEON internal use only and will not appear in downloaded data. These are indicated with **downloadPkg** = "none" in `r pubName` (`r ADlist["pub","ref"]`).
which appears as (using ticks as an example):
Some variables described in this document may be for NEON internal use only and will not appear in downloaded data. These are indicated with downloadPkg = "none" in NEON Data Publication Workbook for Tick Sampling (AD[09]).