Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to migrate/include legacy repositories #76

Closed
hagenw opened this issue Apr 12, 2023 · 32 comments · Fixed by #130
Closed

How to migrate/include legacy repositories #76

hagenw opened this issue Apr 12, 2023 · 32 comments · Fixed by #130

Comments

@hagenw
Copy link
Member

hagenw commented Apr 12, 2023

As discussed in #58 the only solution to removing the ext argument from all our method requires that we store artifacts differently on the servers (see #62), namely:

<host>/<repository>/<folder>/<version>/<name>

instead of

<host>/<repository>/<folder>/<name-wo-ext>/<version>/<name-wo-ext>-<version>.<ext>

This means we need a strategy how to handle already existing repositories that store artifacts in the old structure.
One solution was already proposed in #58 which we follow up on here.

Proposal:

  • Rename existing repositories and add -legacy at their end, e.g. data-public-local-legacy and make them read only (might need to be temporarily changed if data has to be deleted)
  • Make sure the affected Python packages are able to read from different repositories
  • Implement an artifactory-legacy backend in audbackend
  • Register the old and new repositories with the Python packages, e.g. with an entry in .audb.yaml
  • Add an upload exclude pattern to the new repositories on Artifactory to make sure users that use an old version of audbackend are not able to upload new artifacts.
  • Add an upload exclude pattern to the legacy repositories on Artifactory to make sure users can only upload using the artifactory-legacy backend or an old version of audbackend (they repos should be readonly, but it might be needed to access files to delete single artifacts).
@frankenjoe
Copy link
Collaborator

frankenjoe commented Apr 12, 2023

Mhh, I thought we simply rename the files on the repository and then apply the include pattern?

@hagenw
Copy link
Member Author

hagenw commented Apr 12, 2023

The solution with the legacy repositories would have the advantage that we don't need to republish any data that was published before.

I would maybe to an exception for our public Artifactory audb repo, as it only contains emodb anyway at the moment.
But let's see.

@hagenw
Copy link
Member Author

hagenw commented Apr 12, 2023

Mhh, I though we simply rename the files on the repository and then apply the include pattern?

I would not do this as it would break mostly all existing code that has used audb, audmodel, etc.

@frankenjoe
Copy link
Collaborator

I don't see a big problem here, users simply have to update to the latest package and it will work again.

@hagenw
Copy link
Member Author

hagenw commented Apr 12, 2023

Mh, but renaming the repos will also break old code. Maybe we have to create new repos and stick with the old names for the existing ones?

@frankenjoe
Copy link
Collaborator

And don't forget - for old code, databases are (usually) loaded from the cache, which is not affected by the change. So in the end it will only break code when a database needs to be loaded to the cache again.

@hagenw
Copy link
Member Author

hagenw commented Apr 12, 2023

I don't see a big problem here, users simply have to update to the latest package and it will work again.

It might easily happen that you have code that only works with an old numpy or pandas version and you cannot update to the new audb version that requires a newer pandas version (the same holds for more complicated requriements.txt.lock files where the main purpose is that you can reproduce the results). In that case you will have to update a lot of your code until it works again. This doesn't look like a solution to me at all.

@frankenjoe
Copy link
Collaborator

If your code really depends on such old versions of a package then it's probably a very good idea to update your code anyway :) And if it is really not possible to update your code, then you only have to make sure that the databases your projects depends on are cached. So to me it doesn't sound like a big problem.

@hagenw
Copy link
Member Author

hagenw commented Apr 12, 2023

If your code really depends on such old versions of a package then it's probably a very good idea to update your code anyway :)

Yes and no. If the code simply produced an output (e.g. a model) that for whatever reason you have lost in the mean time, you just want to rerun the code and don't spend any time changing it.

On the other hand it might indeed be not as bad as I first have thought as it might be enough to only update the audbackend package and leave all other packages the same. The only requirement here would be that the other packages are already new enough to use audbackend and do not communicate directly with Artifactory.

@hagenw
Copy link
Member Author

hagenw commented Apr 12, 2023

Mhh, I thought we simply rename the files on the repository and then apply the include pattern?

This would indeed be a completely different solution to what is proposed here and in #58.

It would be:

  • Remove read and write access to the repos for all users besides us, then start renaming all the artifacts to the new structure (somehow making sure that we can continue where we have stopped if the connection is lost)
  • Use exclude (and maybe include) patterns to make sure old versions of audbackend or other packages can no longer read/write files with the old structure
  • Enable read/write access to all users and tell them they have to update audbackend in order to access the repos, but no config files need to be changed
  • We don't need to implement an artifactory-legacy backend

@hagenw
Copy link
Member Author

hagenw commented May 9, 2023

As we can now update the first packages to use the new audbackend implementation (audeering/audb#305), we should further consolidate the migration plan here.

I would propose the following:

  • Add a conversion script to audbackend that can convert an old repository structure to the new one
  • For each existing repository, create a new one and migrate all the published artifacts to the new one
  • Set all old repositories to be read-only
  • Add an upload exclude pattern to the new repositories on Artifactory to make sure users that use an old version of audbackend are not able to upload new artifacts

Advantages:

  • Old and new code will work
  • We don't need to implement legacy backends

Disadvantages:

  • Will double the needed disc space on the hosts

If we consider our current data-public on https://audeering.jfrog.io used by audb this would mean we need to find a name for the new repo, e.g. data. When finished it will have the advantage that we can arrive at a stage where the tests for main (using he old backend) and dev (using the new backend) will still work.

@frankenjoe
Copy link
Collaborator

If we consider our current data-public on https://audeering.jfrog.io used by audb this would mean we need to find a name for the new repo, e.g. data. When finished it will have the advantage that we can arrive at a stage where the tests for main (using he old backend) and dev (using the new backend) will still work.

Makes sense.

@frankenjoe
Copy link
Collaborator

  • Will double the needed disc space on the hosts

An obvious solution would be to create symbolic links from the old to the new repository. But it seems that Artifactory does not provide symbolic links. However, while googling I came across this issue:

https://stackoverflow.com/questions/40661376/artifactory-symlink-or-aliasing-an-artifact-url

and apparently it can be solved with a user plugin. So maybe this provides a way to redirect requests with old audb versions to the new location? Maybe even on-the-fly without actually creating symbolic links?

@hagenw
Copy link
Member Author

hagenw commented May 10, 2023

This might indeed be a solution, the only problem is that you need to write it in Groovy and it is not so easy to test what it is actually doing ;)

@frankenjoe
Copy link
Collaborator

@hagenw
Copy link
Member Author

hagenw commented May 10, 2023

I'm still not completely convinced by our solutions how to migrate, especially as there is no need at all to migrate Artifactory repositories. The main reason we removed the ext argument (#58) was to support other backends that do not have a folder structure, besides that it makes it much easier to understand the methods in audbackend.

So maybe we should instead of migrating add legacy backends and simply continue publishing using those.
This would still lead to some problems when we change the name of the backend, e.g. to artifactory-legacy.
The alternative would be to use different names for the new backends, which is of cause not so easy as artifactory2 also looks ugly.

Another alternative would be to continue publishing with the old structure in the Artifactory backend, which would work for audb (#58 (comment)), but not for audmodel (#58 (comment)) and audbenchmark (#58 (comment)). But maybe it will be easier to add workarounds to those packages instead of having to introduce two different Artifactory backends.

@frankenjoe
Copy link
Collaborator

frankenjoe commented May 10, 2023

So maybe we should instead of migrating add legacy backends and simply continue publishing using those.

Mhh, but how do we guarantee users will use the legacy backends with those repositories?

But maybe it will be easier to add workarounds to those packages instead of having to introduce two different Artifactory backends.

This feels like we are simply postponing the problem for now :)


But how about that plugin solution I mentioned before. It might allow us to move the legacy problem from the backend implementation to the repository. E.g. we could keep the current structure of the repositories and on-the-fly convert requests with the new file structure to the old one.

@hagenw
Copy link
Member Author

hagenw commented May 10, 2023

So maybe we should instead of migrating add legacy backends and simply continue publishing using those.

Mhh, but how do we guarantee users will use the legacy backends with those repositories?

Agree, so I would vote for the other solution of continuing to publish the artifacts on the Artifactory backend as before.

But maybe it will be easier to add workarounds to those packages instead of having to introduce two different Artifactory backends.

This feels like we are simply postponing the problem for now :)

Not directly, we can try to solve those issue in audmodel and audbenchmark and might migrate repositories there.
At least I would take a look if we can solve the issue there easier than with audb.

@hagenw
Copy link
Member Author

hagenw commented May 10, 2023

But how about that plugin solution I mentioned before. It might allow us to move the legacy problem from the backend implementation to the repository. E.g. we could keep the current structure of the repositories and on-the-fly convert requests with the new file structure to the old one.

You still have the problem then that a user might use the new audb version with the old repository and wonders why she is not allowed to publish databases there. In addition, it sounds not very performant if we need to do redirects for each single operation of the new audb. Which means most likely you would be faster if you simply continue to use an older version of audb.

@hagenw
Copy link
Member Author

hagenw commented May 10, 2023

One solution for audmodel might be to do the breaking change when switching to a more general audstore package. Then we could do it in a way that audmodel will never support audbackend>=1.0.0 and after audstore is released we could set the old model repository to read-only and you can no longer publish with audmodel. But then updating the code should be much easier as you don't need to configure any new backend/repository, but simply replace audmodel by audstore in your code.

@frankenjoe
Copy link
Collaborator

frankenjoe commented May 10, 2023

You still have the problem then that a user might use the new audb version with the old repository and wonders why she is not allowed to publish databases there.

If there is a beforeDownloadRequest I am sure there is also a beforeUploadRequest, so publishing should also be possible with old and new audb.

In addition, it sounds not very performant if we need to do redirects for each single operation of the new audb

Mhh, I don't think you will even notice a difference in performance. It's basically a regex operation (to find out if the path is old or new) and if it's a new path you return convert the path to the old format, otherwise you simply forward the path.

@frankenjoe
Copy link
Collaborator

frankenjoe commented May 10, 2023

The main reason we removed the ext argument (#58) was to support other backends that do not have a folder structure

The main reason was that the same backend path can translate into different paths on Artifactory.

@hagenw
Copy link
Member Author

hagenw commented May 11, 2023

The main reason we removed the ext argument (#58) was to support other backends that do not have a folder structure

The main reason was that the same backend path can translate into different paths on Artifactory.

True, forgot about that. But for this we also do not need to change how we store artifacts on Artifactory. The main problem we phase is that in audmodel we misuse the structure and store two files with different file extension in the same folder, something that we shouldn't have done.

@frankenjoe
Copy link
Collaborator

frankenjoe commented May 11, 2023

in audmodel we misuse the structure and store two files with different file extension in the same folder, something that we shouldn't have done.

Yes, if you want so. But of course it makes sense we wanted the files /sub/file.txt and /sub/file.tar.gz to be stored in the same folder on Artifactory (although one could argue that how something is stored is irrelevant to the user as she only sees the backend path).


So I would very much prefer a solution where we in the future use <host>/<repository>/<folder>/<version>/<name> instead of the unnecessary complex <host>/<repository>/<folder>/<name-wo-ext>/<version>/<name-wo-ext>-<version>.<ext>. For our public repository I don't see a big problem - we only host one database at the moment and it's just us publishing there, so we can ensure we use the latest audb package. There are also only two public projects at the moment that read from that server and I think it's reasonable if we simply update the dependencies there.

So the big issue are our internal repositories. Moving the files to the new structure might indeed not be a solution - it's risky we have a data loss somehow and also we would lose the author information. So here I would say we at least explore a bit if a plugin solution could work that allows us to keep the old structure and supports reading / writing with the old and new backend implementation. If it turns out it is not possible or creates too much of an overhead, I guess we need to go with a ArtifactoryLegacy backend. I guess we can write a single one that supports not only audb, but also audmodel and audbenchmark if we hard code the extensions of certain files. In addition we have to use include/exclude patterns to avoid users are publishing with the new Artifactory implementation. And of course we have to instruct our colleagues to update their .yaml config files.

@hagenw
Copy link
Member Author

hagenw commented May 11, 2023

use <host>/<repository>/<folder>/<version>/<name> instead of the unnecessary complex <host>/<repository>/<folder>/<name-wo-ext>/<version>/<name-wo-ext>-<version>

The second one looks indeed unnecessary complex, but it's not when you browse the files in the web UI, e.g.

image

It also has an advantage that the version number is part of the filename when downloading a single file through the web UI.

But I'm also fine with changing it, I just wanted to make the point that the old structure is not as ridiculous as it might appear.

@frankenjoe
Copy link
Collaborator

I agree, version number being part of the filename is indeed a nice feature and probably the reason they invented this file structure. But as soon as you have file names where the extension includes one or more dots, it becomes problematic.

@frankenjoe
Copy link
Collaborator

frankenjoe commented May 11, 2023

So here I would say we at least explore a bit if a plugin solution could work that allows us to keep the old structure and supports reading / writing with the old and new backend implementation

Looking at https://jfrog.com/help/r/jfrog-integrations-documentation/user-plugins the callbacks beforeUploadRequest and beforeDownloadRequest should allow us to redirect the download / upload path. But I am not sure how to handle a exists, checksum and search request.

@frankenjoe
Copy link
Collaborator

frankenjoe commented May 12, 2023

Ok, here's one more solution:

We add Artifactory.get_legacy() and Artifactory.put_legacy() and in packages where we want to rely on the old Artifacory file structure we do:

if isinstance(backend, Artifactory):
    backend.get_legacy(..., ext=...)
else:
    backend.get(...)

@frankenjoe
Copy link
Collaborator

Or actually we would need to generally stick to the old Artifactory structure (i.e. <host>/<repository>/<folder>/<name-wo-ext>/<version>/<name-wo-ext>-<version>.<ext>) and use the legacy functions in places where want to control the extension.

@hagenw
Copy link
Member Author

hagenw commented May 12, 2023

Or actually we would need to generally stick to the old Artifactory structure (i.e. <host>/<repository>/<folder>/<name-wo-ext>/<version>/<name-wo-ext>-<version>.<ext>) and use the legacy functions in places where want to control the extension.

This sounds like a good solution to me. This is basically in line what I proposed in #76 (comment) and your legacy methods would provide solutions that we could apply in audmodel and audbackend.

@frankenjoe
Copy link
Collaborator

#130 implements a solution which is even more elegant I think. Instead of introducing those legacy functions it offers a way to set custom extensions. In packages where we have used the ext argument before, we only have to add ext to Artifactory.extensions after creating the backend. Compared to the previous implementation it has the limitation that you cannot handle files with the same extensions in different ways, e.g. you cannot store file.tar.gz once under the extension tar.gz and another time under gz, but I guess that's fine.

@hagenw
Copy link
Member Author

hagenw commented May 15, 2023

So here I would say we at least explore a bit if a plugin solution could work that allows us to keep the old structure and supports reading / writing with the old and new backend implementation

Looking at https://jfrog.com/help/r/jfrog-integrations-documentation/user-plugins the callbacks beforeUploadRequest and beforeDownloadRequest should allow us to redirect the download / upload path. But I am not sure how to handle a exists, checksum and search request.

One disadvantage of using user-plugins is that we make us more vulnerable to Artifactory updates as it might happen that the code of the plugin becomes deprecated (which seems to have happened at least ones if you look at the example repo) and then we need to update it before we can access the data again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants