Conversation
@beyang: Added what you wrote on Slack to the PR description.
I suppose that makes some sense, if looking only at the name of the setting. Navigating to the repo isn't an automatic background action, it's a manual user action.
Yes! This is indeed part of the changes in #2025. We now store all repository metadata in each repo row, including clone URLs. However, it should be the case today even that repo-updater is the source of truth for repository metadata. That's what the repo lookup endpoint is for (which also serves clone URLs). The difference with the new syncer is that we don't have to contact the upstream codehost for this metadata anymore, since it's all cached in the database and updated with every background sync.
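To make the "cached in the database" point above concrete, here's a rough sketch of what the repo lookup flow looks like once metadata lives in each repo row. The names (`RepoInfo`, `lookup`) are hypothetical, and a plain map stands in for the Postgres repo table that the background syncs keep fresh; the real endpoint obviously does much more.

```go
package main

import "fmt"

// RepoInfo is a stand-in for the metadata repo-updater caches per repo row,
// including the clone URL that previously required a codehost round-trip.
type RepoInfo struct {
	Name     string
	CloneURL string
}

// lookup returns cached metadata without contacting the upstream codehost.
// The map stands in for the Postgres repo table kept fresh by background syncs.
func lookup(cache map[string]RepoInfo, name string) (RepoInfo, bool) {
	info, ok := cache[name]
	return info, ok
}

func main() {
	cache := map[string]RepoInfo{
		"github.com/sourcegraph/sourcegraph": {
			Name:     "github.com/sourcegraph/sourcegraph",
			CloneURL: "https://github.com/sourcegraph/sourcegraph.git",
		},
	}
	if info, ok := lookup(cache, "github.com/sourcegraph/sourcegraph"); ok {
		fmt.Println(info.CloneURL)
	}
}
```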
Thanks for doing this research. We had it on our plate in 3.4 to untangle this.
Definitely. We're changing that in #3345 to take in a simple repo ID.
I'm sorry you had to go through all this pain. It's a testament to the fact that this is high-priority tech debt we must address in the next release cycle.
There's the repo metadata which is stored in Postgres by the sync loops and there's the git update scheduler which is a completely separate component that stores all its state in-memory. Whenever a sync happens, the scheduler is updated. But the git update scheduler can be updated via the repo updater HTTP API explicitly, too.
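The two update paths described above (background sync vs. explicit HTTP API call) can be sketched like this. This is a toy model with made-up names, not the actual scheduler code; the real scheduler tracks per-repo update intervals rather than a flat set.

```go
package main

import (
	"fmt"
	"sync"
)

// scheduler is a toy stand-in for the in-memory git update scheduler:
// it is fed by background metadata syncs, but can also be poked
// explicitly, the way the repo-updater HTTP API does.
type scheduler struct {
	mu    sync.Mutex
	known map[string]bool // repos on the background update schedule
	queue []string        // repos bumped for an immediate update
}

func newScheduler() *scheduler {
	return &scheduler{known: map[string]bool{}}
}

// UpdateFromSync replaces the schedule with the synced set of repos.
// This is the path taken whenever a metadata sync happens.
func (s *scheduler) UpdateFromSync(repos []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.known = map[string]bool{}
	for _, r := range repos {
		s.known[r] = true
	}
}

// EnqueueNow is the explicit path: bump a repo for an immediate update.
func (s *scheduler) EnqueueNow(repo string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.known[repo] = true
	s.queue = append(s.queue, repo)
}

func main() {
	s := newScheduler()
	s.UpdateFromSync([]string{"a/b", "c/d"})
	s.EnqueueNow("a/b")
	fmt.Println(len(s.known), s.queue)
}
```

Note that because all of this state is in-memory, it is rebuilt from scratch on restart by the next sync, which is part of why the sync loop and the scheduler interact the way they do.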
It should not do any more than that. Looking at all these references confirms that. It affects only the code paths that trigger git updates due to background metadata syncs: https://sourcegraph.com/search?q=repo:%5Egithub.com/sourcegraph/sourcegraph%24+DisableAutoGitUpdates+
However, the repo can have its metadata synced AND THEN be cloned via a succession of calls.
And finally, I have some questions to try to clarify how this issue was affecting $CUSTOMER with the new release, but not the one they were running before.
@keegancsmith: Added you as a reviewer because a lot of the answers I gave here are intimately related with our upcoming work on #3345. |
keegancsmith left a comment:
@tsenart all your information is correct here and awesome! Kind of wish it could live somewhere else that's easier to find. One addition: repo-updater doesn't have all metadata in Postgres. Crucially, the cloneURL is not stored. Instead we rely on caching the responses from codehost APIs. However, with our new syncer changes we now store the cloneURL, so this is certainly gonna be cleaned up in 3.4 (finally!!!)
👍 this seems like the most important thing that was missing? I wonder why this wasn't set before.
From the docstring, it looks like the method was just added as a placeholder. It can't actually fully fetch Gitolite repo metadata, because doing so would require invoking the gitserver endpoint, since currently only gitserver has the Gitolite SSH credentials. My guess is that the author didn't want to call that endpoint here, but still needed to return something to indicate the repository exists.
BTW, I think we can fix this whole issue by making the Gitolite SSH credentials be an explicit part of config, then using the golang stdlib ssh package, instead of shelling out to ssh to get the Gitolite repo metadata. Then we can have repo-updater hit the Gitolite API directly.
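If repo-updater held the SSH credentials itself (via the Go ssh package, i.e. golang.org/x/crypto/ssh), the remaining piece would be parsing the output of `ssh git@host info` to enumerate repos. Here's a rough sketch of just the parsing half; the function name is made up, and the assumed format (permission flags, a tab, then the repo name, after a greeting line) is based on typical Gitolite `info` output — a real parser would need to handle wildcard repos and other edge cases.

```go
package main

import (
	"fmt"
	"strings"
)

// parseGitoliteInfo extracts repo names from the output of `ssh git@host info`.
// Assumed line format: permission flags, a tab, then the repo name; lines
// without a tab (the greeting, blank lines) are skipped.
func parseGitoliteInfo(out string) []string {
	var repos []string
	for _, line := range strings.Split(out, "\n") {
		i := strings.IndexByte(line, '\t')
		if i < 0 {
			continue // greeting or blank line
		}
		if name := strings.TrimSpace(line[i+1:]); name != "" {
			repos = append(repos, name)
		}
	}
	return repos
}

func main() {
	out := "hello alice, this is gitolite\n R W\tgitolite-admin\n R W\tproject/foo\n"
	fmt.Println(parseGitoliteInfo(out))
}
```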
> BTW, I think we can fix this whole issue by making the Gitolite SSH credentials be an explicit part of config, then using the golang stdlib ssh package, instead of shelling out to ssh to get the Gitolite repo metadata. Then we can have repo-updater hit the Gitolite API directly.
We should probably do that as part of integrating Gitolite external services with the new syncer.
Answers to @tsenart's questions:
A few follow-up questions/clarifications:
The syncer's responsibility is to converge the set of mirrored repositories with what's defined in external service configuration and to keep their meta-data up-to-date. The Git update scheduler subscribes to syncs and updates its own schedule with the full set of mirrored repos. Eventually, we'd like to move the Git update scheduler to git-server itself and use a unified log (e.g. using Redis Streams) for the data integration between repo-updater and git-server.
Yes.
The syncer publishes diffs to a channel which the git update scheduler reads from to manage its own schedule. As for (4), I believe Zoekt periodically polls all existing repos in the database via the frontend API. (Pull, not push). Can you confirm this @keegancsmith? Our idea of using a unified log for data integration would also apply here. See https://brandur.org/redis-streams.
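The publish/subscribe shape described here — syncer computes diffs, scheduler consumes them off a channel — can be sketched roughly like this. `Diff`, `runSyncer`, and the consumer loop are simplified stand-ins, not the actual syncer types.

```go
package main

import "fmt"

// Diff is a stand-in for the repo diff the syncer computes on each run.
type Diff struct {
	Added, Deleted []string
}

// runSyncer publishes one diff per sync to the channel; the git update
// scheduler would range over this channel and adjust its schedule.
func runSyncer(diffs chan<- Diff, synced []Diff) {
	defer close(diffs)
	for _, d := range synced {
		diffs <- d
	}
}

func main() {
	diffs := make(chan Diff)
	go runSyncer(diffs, []Diff{{Added: []string{"a/b"}}, {Deleted: []string{"c/d"}}})

	// The scheduler side: consume diffs and update the schedule.
	schedule := map[string]bool{"c/d": true}
	for d := range diffs {
		for _, r := range d.Added {
			schedule[r] = true
		}
		for _, r := range d.Deleted {
			delete(schedule, r)
		}
	}
	fmt.Println(schedule) // map[a/b:true]
}
```

A unified log (e.g. Redis Streams) would replace the in-process channel here, letting consumers like gitserver or Zoekt replay the same diff stream independently.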
It will only be necessary to pass the repo ID. Nothing else.
Zoekt is poll-based, and the polling is pretty aggressive. Given the time and resources it takes to index a repo, the slight delay we get from polling is negligible compared to the indexing work it does in a busy org.
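The poll-based model amounts to a ticker loop that repeatedly asks the frontend API for the current repo list and reacts to new entries. A minimal sketch, with `pollOnce` as a hypothetical stand-in for the frontend API call and a much shorter interval than the real one:

```go
package main

import (
	"fmt"
	"time"
)

// pollOnce stands in for Zoekt asking the frontend API for the current
// repo list; here it just returns a fixed slice.
func pollOnce() []string { return []string{"a/b", "c/d"} }

func main() {
	ticker := time.NewTicker(10 * time.Millisecond) // real interval is much longer
	defer ticker.Stop()

	seen := map[string]bool{}
	for i := 0; i < 3; i++ { // a few iterations in place of a long-running loop
		<-ticker.C
		for _, r := range pollOnce() {
			if !seen[r] {
				seen[r] = true // newly discovered repo: would trigger indexing
			}
		}
	}
	fmt.Println(len(seen))
}
```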
What I've learned:
• disableAutoGitUpdates doesn't affect the behavior of navigating directly to a repo to get it to clone.
• The need for a clone URL to be set the first time gitserver clones a repository is documented in the code, but it's unclear where this clone URL actually comes from. Perhaps it would be useful for repo-updater to expose this and serve as the source of truth for such repository metadata? If that shouldn't be repo-updater, what should it be?
• Three code paths that might require a new repo to be cloned to gitserver are (1) the repo-updater loop, (2) direct user navigation to a repo page, and (3) triggering a repo update via the $REPO/-/refresh endpoint. Any others we might have to cover?
• Maybe it's worth changing the interface of EnqueueRepoUpdate (https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/pkg/repoupdater/client.go#L145:0) to make it more explicit when the expectation is that a repo is cloned if it doesn't exist already. The current behavior of needing to set Repo.URL is a bit hidden.

Fixes #3336