Skip to content
This repository has been archived by the owner on May 25, 2021. It is now read-only.

COUCHDB-769: Store attachments in the external storage. #33

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

gilv
Copy link

@gilv gilv commented Oct 14, 2015

Initial implementation that allows CouchDB to store attachments outside of the database file.
This implementation supports OpenStack Swift and SoftLayer Object store

…mplementation, supports OpenStack Swift and SoftLayer Object store
swift_delete_container(DbName).

externalize_att(Db) ->
Res = config:get("swift","attachments_offload","false"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This option looks as boolean, but you use "external" value elsewhere in place of "true". If it pretend to be a flag, use config:get_boolean. Otherwise better pick less confusing name for default value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, was confused. So it's flag. Definitely use config:get_boolean and true/false atoms elsewhere.

@kxepal
Copy link
Member

kxepal commented Oct 14, 2015

Please fix:

  • Style: try to fit 80 chars line length, put space after comma, no tabs, correctly aligned clauses etc.
  • Reduce amount of debug logs: debug logs should help you understand what going on in your code and in which state it is in places where problems may happens, not produce noise about every logical step.

NewAtt = couch_att:store(data,N1,Att),
couch_log:debug("Swift. testing store in original length ~p~n",[AttLen]),
{ObjectSize,EtagMD5} = swift_head_object(DbName, Name),
NewAtt1 = couch_att:store([{att_external_size,ObjectSize},{att_external,"external"},{att_external_md5,EtagMD5}],NewAtt),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you store "external" as list, no at atom, or at least as binary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that "external" value is not the best choice. Since field is already carries external word in the key, better specify which exact external we assume here. Today it's only swift, tomorrow it could be also S3/NFS/whatever which will have to use different integration logic.

@kxepal
Copy link
Member

kxepal commented Oct 14, 2015

Would be nice to see some tests for this to prove that this all works.

@@ -0,0 +1,333 @@
%% @author gilv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general note that we don't include attributions in the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. And we need ASF license header here.

@rnewson
Copy link
Member

rnewson commented Oct 29, 2015

I'm not going to have time to review this in detail, but I agree it needs tests to go in, as well as considerable cleanup.

Noting here that this should not be merged before 2.0 is released whatever happens.

@gilv
Copy link
Author

gilv commented Jan 4, 2016

@kxepal @rnewson
I provided another patch, implementing most of the comments.
Here is what is not implemented in this patch:

  1. Remark on the AttFun. Not yet implemented. Original remark: “AttFun could be actually an attachment data or some fold function? Something went wrong if you have to do this.”
  2. usage of couch_util:to_hex
  3. still no unitests

++ " ~p in the container: ~n",[FileName, ContentLen,
ContainerName]),
%TO-DO: No chunk reader. All kept in the memory. Should be fixed.
Data = couch_httpd:recv(Req, ContentLen),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be in swift backend module, at least. And certainly not couch_httpd (even if these two does the same thing) - just chttpd.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kxepal . Thanks for review it.
fabric_at_handler is a layer between CouchDB and the backend store. It's not aware how backend store works or what it does internally. I completely agree with your other comments, for example "201" is something that belongs to the backend store and should be there.

Data = couch_httpd:recv(Req, ContentLen) is suppose to read data from the HTTP request, using the same mechanism as CouchDB works, so i use the same function to read data. Otherwise i will need to copy-paste the same code here.
Why is it different to function att_processor(DbName, Att)? That function also "aware" of CouchDB's internal functions, for example it uses couch_at module to update attachments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why it's different is simple: couch_httpd/chttpd are frontend modules that provides HTTP API for internal stuff like couch_att.

The reason why couch_httpd better to avoid use here as that this module is for backdoor interface while chttpd that used for the front, cluster one, may eventually start work by different rules.

The reason why chttpd/couch_httpd shouldn't be here is that fabric is a layer to run cluster wide operations after http request get processed.

Also, have you checked fabric:att_receiver/2? Looks like exactly what you need here.

{ok, Container} ->
couch_log:debug("Container ~p created", [Container]);
{error,_} ->
couch_log:debug("Container ~p creation failed", [DbNameSuffix])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't handle an error by logging, surely this is fatal?

@rnewson
Copy link
Member

rnewson commented Jan 4, 2016

there's a lot to review here. The huge number of :debug calls seems inappropriate, and often appears to be placeholders / reminders for future work (like error handling).

I note also that indentation rules are still not followed (4 spaces, no 'pretty' alignment).

{error, DbName}
end;
container(get, DbName) ->
couch_log:debug("Get container ~p~n", [DbName]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

@rnewson
Copy link
Member

rnewson commented Jan 4, 2016

Given the nature of the patch, this cannot merge without tests, assuming all style and other issues are resolved.

@gilv
Copy link
Author

gilv commented Jan 4, 2016

@rnewson what is the style convention i should use? For example, there was a comment that my code should use 80 chars per line, but there is bunch of existing code with more than 80 in a line.
Is there any template for Eclipse so i can use it for style? spaces, tabs, etc..

@gilv
Copy link
Author

gilv commented Jan 5, 2016

@kxepal @rnewson

  1. Please ignore the debug statements in my code. When my code will be stable enough - i will just remove those debug prints. Since i am changing code rapidly - i still need those prints.
  2. You both wrote that the code style is not good. Can you please tell me what should be the code style? Is there any template i can use? Is there any documentation in CouchDB where it's written what should be coding style? Perhaps i missed it.
  3. Tests: What would you like to have exactly? To fully test that my patch is working there is need to access the Swift cluster. I have some functional tests, that upload attachments to CouchDB. Those tests access Swift cluster directly and verify that attachments stored correctly. But i assume it's not what you mean. Can you please point me to some existing tests in CouchDB so i will see what i should implement in my case?

@kxepal
Copy link
Member

kxepal commented Jan 6, 2016

@gilv
2. I'm afraid there is no any document that describes whole set of rules. I want to setup elvis to define and control that, but it's a future. For now, better just follow the comments.
3. Tests need to cover whole attachments API to ensure that everything works same as before + I think some test with replication would be nice to have to see how it goes. Here is the main question is how to ensure that swift integration code is really works. I'm not sure that it's possible to verify that without Open Stack cluster. Hopefully, today, we have docker thing for the resque.

@gilv
Copy link
Author

gilv commented Jan 20, 2016

@kxepal @rnewson
Few notes related recent patch

  1. Please ignore all debug / info print statements. At some point i will remove them at once.
  2. I addressed most of the comments.
  3. Remark on the AttFun. Not yet implemented. Original remark: “AttFun could be actually an attachment data or some fold function? Something went wrong if you have to do this.”
  4. Still no unitests / functional tests.
  5. Attachment retrieval from the object store - keep all data in memory. I need to figure out how to use streaming there.

@gilv
Copy link
Author

gilv commented Jan 29, 2016

@kxepal Can you please review the recent code?

@lazedo
Copy link

lazedo commented Feb 4, 2016

@gilv i think att_store/2 should be att_store/3 with Options param so that we can have different drivers for different databases and/or documents. get_backend_driver should try to read it from Options mentioned above and get it from config ( maybe add a section external_storage_drivers for mapping) with default to active_store. the Options would also allow passing specific driver options.
thoughts ?

@kxepal
Copy link
Member

kxepal commented Feb 4, 2016

@gilv I can read the code and leave few more comments about styling, naming and else not-much-important-but-annoying bits, but I think that would be much better see any tests and the way to try this at home. That's more important than yet-another-spacing issue.

@kxepal
Copy link
Member

kxepal commented Feb 4, 2016

@lazedo I think there is no practically need: these options could be fetched within att_store/2 transparently for the API user during driver resolve and without abstraction leaks.

Assume we have per-db driver configuration. Where such options will be stored? First case: in config. Then we do Driver = get_backend_driver(Db) and Options = get_backend_driver_options(Db, Driver) right before do any driver API call. Another case: db options are stored in _metadata db - but this changes nothing, only configuration backend. What else Options could be passed here?

@lazedo
Copy link

lazedo commented Feb 4, 2016

@kxepal i'm thinking in a way to have different provider per attachment/document. the use case is a document with two or more attachments where one could stored in a public accessible manner (dropbox, gdrive) and others more sensitive or private could be stored in another like evernote.
the per-db driver configuration makes sense but i believe it doesn't respond to this?

@lazedo
Copy link

lazedo commented Feb 4, 2016

@kxepal passing the Options to the driver would also allow for example to for form the complete url in a google drive driver implementation (folder name).

@kxepal
Copy link
Member

kxepal commented Feb 4, 2016

@lazedo
Options argument assumes that it will be passed from the outside. Who will form it and how? And if it contains some url to there attachment have to be stored (by your example, as I get it right), how it would be preserved on the following attachment read? Have to be we need to store these options somewhere and link them by the database-document-attachment triple id. In att_store we already have all these information that need to lookup such options (assuming that they are stored outside of the document).

Per-attachment configuration involves custom attachment stub fields feature, so they all will be available for the passed Doc argument (in anyway it's a reasonable to store them within the document itself). But that would be a question how to store private information there and how they should be replicated.

@lazedo
Copy link

lazedo commented Feb 4, 2016

@kxepal maybe i'm missing the objective of this feature. as i see it, a external attachment would still create the stub information about how to reach the real attachment. the stubwould have the information needed to get the attachment back, so, you would need to store the driver in case the configuration is changed and you start using another driver but still want to reach the already stored attachments. the options would allow more flexibility to the driver creating the necessary stub information needed to get the attachment back and also to choose which driver we want for the attachment.

@kxepal
Copy link
Member

kxepal commented Feb 4, 2016

@lazedo I think I miss the case as well. Anyway, to give the proper answer on this need to implement few more different drivers against very different backends for point of their API. Then we can shape our API more-or-less right. Now it's more like a proof-of-concept, but need to start from something.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants