Git repositories seem like an elegant solution for package registry data. Pull requests for governance, version history for free, distributed by design. But as registries grow, the cracks appear.
GitHub explicitly asked Homebrew to stop using shallow clones. Updating them was “an extremely expensive operation” due to the tree layout and traffic of homebrew-core and homebrew-cask.
I’m not going through the PR to understand what’s breaking, since it’s not immediately apparent from a quick skim. But three possible problems based on what people are mentioning there.
The problem is the cost of the shallow clone
Assuming that the workload here is always --depth=1 and they aren’t doing commits at a high rate relative to clones, and that’s an expensive operation for git, I feel like for GitHub, a better solution would be some patch to git that allows it to cache a shallow clone for depth=1 for a given hashref.
The problem is the cost of unshallowing the shallow clone
If the actual problem isn’t the shallow clone, that a regular clone would be fine, but that unshallowing is a problem, then a patch to git that allows more-efficient unshallowing should be a better solution. I mean, I’d think that unshallowing should only need a time-ordered index of commits referenced blobs up to a given point. That shouldn’t be that expensive for git to maintain an index of, if it doesn’t already have it.
The problem is that Homebrew has users repeatedly unshallowing a clone off GitHub and then blowing it away and repeating
If the problem is that people keep repeatedly doing a clone off GitHub — that is, a regular, non-shallow clone would also be problematic — I’d think that a better solution would be to have Homebrew do a local bare clone as a cache, and then just do a pull on that cache and then use it as a reference to create the new clone. If Homebrew uses the fresh clone as read-only and the cache can be relied upon to remain, then they could use --reference alone. If not, then add --dissociate. I’d think that that’d lead to better performance anyway.
I’m not going through the PR to understand what’s breaking, since it’s not immediately apparent from a quick skim. But three possible problems based on what people are mentioning there.
The problem is the cost of the shallow clone
Assuming that the workload here is always
--depth=1and they aren’t doing commits at a high rate relative to clones, and that’s an expensive operation for git, I feel like for GitHub, a better solution would be some patch to git that allows it to cache a shallow clone for depth=1 for a given hashref.The problem is the cost of unshallowing the shallow clone
If the actual problem isn’t the shallow clone, that a regular clone would be fine, but that unshallowing is a problem, then a patch to git that allows more-efficient unshallowing should be a better solution. I mean, I’d think that unshallowing should only need a time-ordered index of commits referenced blobs up to a given point. That shouldn’t be that expensive for git to maintain an index of, if it doesn’t already have it.
The problem is that Homebrew has users repeatedly unshallowing a clone off GitHub and then blowing it away and repeating
If the problem is that people keep repeatedly doing a clone off GitHub — that is, a regular, non-shallow clone would also be problematic — I’d think that a better solution would be to have Homebrew do a local bare clone as a cache, and then just do a pull on that cache and then use it as a reference to create the new clone. If Homebrew uses the fresh clone as read-only and the cache can be relied upon to remain, then they could use
--referencealone. If not, then add--dissociate. I’d think that that’d lead to better performance anyway.