Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New section on querying SQL databases for Import and export chapter section 7.10 #146

Open
AmyMikhail opened this issue Oct 24, 2023 · 0 comments

Comments

@AmyMikhail
Copy link

Scope:

cc @nsbatra

The current Import and export chapter 7 includes a subsection, 7.10 that deals with APIs and making HTTP requests. This is helpful where APIs have been set up for large databases, however people are still likely to encounter and may need to extract data from databases that do not have APIs. A common scenario is a mysql database that is being used to hold disease surveillance data. The database may comprise of tables and views that users can access with their own user name and password.

To incorporate the querying and import of resultant data from an SQL database into R in a script or R markdown document, four things need to be demonstrated:

  1. Best practice for securing login details, e.g. by using getPass() and encrypting, or by using dedicated packages such as keychain
  2. How to connect to the database using the login details dynamically with connection strings and the DBI package or similar
  3. How to deal with decryption of data from an encrypted database in R, either during the query or after import
  4. How to create and send a query from R, using dplyr() primarily with a matched example of how this could be done with SQL from an R markdown document chunk

Ideally, it would be nice to set up a worked example which Epi R handbook readers could also interact with and try out themselves. This would mean creating an SQL database containing fake surveillance data (or publically available surveillance data e.g. from WHO) and building a scenario around it, such as creating a summary figure and table in an automated annual surveillance report. The database would need to be sandboxed and login details either provided in the text, or auto-generated and temporary login details given to the user. It would be important that they can see what their login details are, so that it is clear how to incorporate this into a connection string. The database could use some standard encryption so that decryption can be demonstrated as well (as this is a common practice in surveillance databases for columns containing personal identifying information).

Resources:

There is a nice e-book by POSIT on connecting to databases here. In this e-book there is a nice example of how to query a database using dplyr verbs - which is appealing because this approach will fit well with the rest of the handbook and also means that the user doesn't have to learn SQL to query their database. It might also be helpful to show how a query would be done with dplyr and then with SQL in R, for comparison (and in case there are some instances where for instance SQL is faster). Examples from the POSIT e-book are given here and here.

I would suggest this be added as a new sub-section after the current section 7.10.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant