Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

summarize_wqdata() function throws with some datasets #163

Open
aylapear opened this issue Jan 25, 2022 · 9 comments
Open

summarize_wqdata() function throws with some datasets #163

aylapear opened this issue Jan 25, 2022 · 9 comments

Comments

@aylapear
Copy link
Contributor

Two examples using the same EMS_ID/Station but different parameters/variables and in one case the function summarize_wqdata() works and provides a summary table while in the other case the table throws an error

Example where it works properly
data_works <- tibble::tibble(
            EMS_ID = c("0200016", "0200016", "0200016"),
           Station = c("ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93"),
          Variable = c("Nitrogen Total", "Nitrogen Total", "Nitrogen Total"),
              Code = c("0114", "0114", "0114"),
             Value = c(0.844, 0.949, 0.754),
             Units = c("mg/L", "mg/L", "mg/L"),
    DetectionLimit = c(0.03, 0.03, 0.03),
      ResultLetter = c(NA, NA, NA),
              Date = c("2021-11-07", "2021-11-21", "2021-12-05"),
           Outlier = c(FALSE, FALSE, FALSE),
      Site_Renamed = c("ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93"),
       UPPER_DEPTH = as.factor(c(NA, NA, NA)),
          Detected = as.factor(c("TRUE", "TRUE", "TRUE")),
         Timeframe = as.factor(c("2021", "2021", "2021"))
)


wqbc::summarise_wqdata(
  data_works,
  by = c("EMS_ID"),
  censored = TRUE,
  na.rm = TRUE
)

Output

# A tibble: 1 × 14
  Variable       EMS_ID      n  ncen   min   max  mean median lowerQ upperQ     sd     se lowerCL upperCL
  <chr>          <chr>   <int> <int> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
1 Nitrogen Total 0200016     3     0 0.754 0.949 0.849  0.845  0.793  0.901 0.0799 0.0461   0.763   0.944
Example where it fails
data_fails <- tibble::tibble(
            EMS_ID = c("0200016", "0200016", "0200016", "0200016", "0200016"),
           Station = c("ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93"),
          Variable = c("Aluminum Total",
                       "Aluminum Total","Aluminum Total","Aluminum Total","Aluminum Total"),
              Code = c("AL-T", "AL-T", "AL-T", "AL-T", "AL-T"),
             Value = c(0.031, 0.0192, 0.397, 0.0183, 0.1),
             Units = c("mg/L", "mg/L", "mg/L", "mg/L", "mg/L"),
    DetectionLimit = c(0.5, 0.5, 0.5, 0.5, 0.5),
      ResultLetter = c(NA, NA, NA, NA, NA),
              Date = c("2020-01-05","2020-01-27",
                       "2020-02-02","2020-02-17","2020-03-01"),
           Outlier = c(FALSE, FALSE, FALSE, FALSE, FALSE),
      Site_Renamed = c("ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93"),
       UPPER_DEPTH = as.factor(c(NA, NA, NA, NA, NA)),
          Detected = as.factor(c("FALSE", "FALSE", "FALSE", "FALSE", "FALSE")),
         Timeframe = as.factor(c("2020", "2020", "2020", "2020", "2020"))
)

wqbc::summarise_wqdata(
  data_fails,
  by = c("EMS_ID"),
  censored = TRUE,
  na.rm = TRUE
)

Output

Error in names(ret) <- c("mean", "se", LCL(x), UCL(x)) : 
  'names' attribute [4] must be the same length as the vector [0]
In addition: Warning message:
In survreg.fit(X, Y, weights, offset, init = init, controlvals = control,  :
  Ran out of iterations and did not converge
Backtrace:
     ▆
  1. └─wqbc::summarise_wqdata(...)
  2.   └─plyr::ddply(...)
  3.     └─plyr::ldply(...)
  4.       └─plyr::llply(...)
  5.         ├─plyr:::loop_apply(n, do.ply)
  6.         └─plyr `<fn>`(1L)
  7.           └─wqbc .fun(piece, ...)
  8.             ├─base::mean(ml)
  9.             └─NADA::mean(ml)
 10.               └─NADA .local(x, ...)
@aylapear
Copy link
Contributor Author

@joethorley

@joethorley
Copy link
Collaborator

thanks @aylapear - I'll look into

@HeatherGranger
Copy link
Contributor

@joethorley do you remember if this is still ongoing? if so, perhaps something to examine what the dependencies are and what's worth updating in it's current format.

@aylapear
Copy link
Contributor Author

The cause of this is error when all the data points are censored, specifically when the Censored = TRUE for every row.

# real data output
EMS_ID                    Station       Variable Code Value Units DetectionLimit ResultLetter       Date
1 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-01-05
2 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-01-27
3 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-02-02
4 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-02-17
5 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-03-01
  Outlier               Site_Renamed UPPER_DEPTH Detected Timeframe Censored
1   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
2   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
3   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
4   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
5   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE

Example to show which code is throwing error

# this throws error
df <- tibble(
  Value = c(0.844, 0.949, 0.754),
  Censored = c(TRUE, TRUE, TRUE)
)

ml <- with(
  df,
  cenmle(
    Value,
    Censored,
    dist = "lognormal",
    conf.int = 0.95
  )
)
ml

est <- mean(ml)
# As long as one censored value is false it will run
df <- tibble(
  Value = c(0.844, 0.949, 0.754),
  Censored = c(FALSE, TRUE, TRUE)
)

ml <- with(
  df,
  cenmle(
    Value,
    Censored,
    dist = "lognormal",
    conf.int = 0.95
  )
)
ml

est <- mean(ml)

@aylapear
Copy link
Contributor Author

This error is caused by the internal function summarise_wqdata_by() in the summaries-wqdata.R file

@aylapear
Copy link
Contributor Author

It also fails on a single value

> df <- tibble(
+   Value = c(0.011),
+   Censored = c(TRUE)
+ )
> ml <- with(
+   df,
+   cenmle(
+     Value,
+     Censored,
+     dist = "lognormal",
+     conf.int = 0.95
+   )
+ )
> ml
Error in exp(x@survreg$coef[1]) : 
  non-numeric argument to mathematical function
# fails
df <- tibble(
  Value = c(0.011),
  Censored = c(FALSE)
)
# passes when two values even if one is censored 
df <- tibble(
  Value = c(0.011, 0.0001),
  Censored = c(FALSE, TRUE)
)

@aylapear
Copy link
Contributor Author

In these edge cases when either all values are censored or only a single value is given the function should return NA' s instead of an error.

@aylapear
Copy link
Contributor Author

This would then generate this table when the site that has no data display NA's instead of throwing an error.

# A tibble: 2 × 14
  Variable       EMS_ID      n  ncen   min   max  mean median lowerQ upperQ     sd     se lowerCL upperCL
  <chr>          <chr>   <int> <int> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
1 Nitrogen Total 0200016     3     0 0.754 0.949 0.849  0.845  0.793  0.901 0.0799 0.0461   0.763   0.944
1 Nitrogen Total 0478416     NA          NA      NA      NA       NA       NA      NA        NA       NA        NA        NA

@joethorley
Copy link
Collaborator

I agree - and when no data it should return a table with the same columns and classes and no rows.

@aylapear aylapear mentioned this issue Dec 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants