return clone URL for Gitolite repositories by beyang · Pull Request #3358 · sourcegraph/sourcegraph-public-snapshot

beyang · 2019-04-11T03:57:28Z

What I've learned:
• disableAutoGitUpdates doesn't affect the behavior of navigating directly to a repo to get it to clone
• The need for a clone URL to be set the first time gitserver clones a repository is documented in the code, but it's unclear where this clone URL actually comes from. Perhaps it would be useful for repo-updater to expose this and serve as the source of truth for such repository metadata? If that shouldn't be repo-updater, what should it be?
• Three code paths that might require a new repo to be cloned to gitserver are (1) repo-updater loop, (2) direct user navigation to a repo page, (3) triggering a repo-update by the $REPO/-/refresh endpoint. Any others we might have to cover?
• Maybe it's worth changing the interface of EnqueueRepoUpdate (https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/pkg/repoupdater/client.go#L145:0) to make it more explicit when the expectation is that a repo is cloned if it doesn't exist already. The current behavior of needing to set Repo.URL is a bit hidden.

Fixes #3336

codecov · 2019-04-11T04:18:00Z

Codecov Report

Merging #3358 into master will decrease coverage by <.01%.
The diff coverage is 42.1%.

Impacted Files	Coverage Δ
cmd/repo-updater/repos/gitolite.go	`0% <0%> (ø)`	⬆️
pkg/extsvc/gitolite/codehost.go	`0% <0%> (ø)`	⬆️
cmd/frontend/internal/httpapi/repo_refresh.go	`71.42% <80%> (+11.42%)`	⬆️

tsenart · 2019-04-11T08:48:39Z

@beyang: Added what you wrote on Slack to the PR description.

tsenart · 2019-04-11T09:04:42Z

disableAutoGitUpdates doesn't affect the behavior of navigating directly to a repo to get it to clone

I suppose that makes some sense, if looking only at the name of the setting. Navigating to the repo isn't an automatic background action, it's a manual user action.

Perhaps it would be useful for repo-updater to expose this and serve as the source of truth for such repository metadata? If that shouldn't be repo-updater, what should it be?

Yes! This is indeed part of the changes in #2025. We now store all repository metadata in each repo row, including clone URLs.

However, it should be the case today even that repo-updater is the source of truth for repository metadata. That's what the repo lookup endpoint is for (which also serves clone URLs).

The difference with the new syncer is that we don't have to contact the upstream codehost for this metadata anymore, since it's all cached in the database and updated with every background sync.

Three code paths that might require a new repo to be cloned to gitserver are (1) repo-updater loop, (2) direct user navigation to a repo page, (3) triggering a repo-update by the $REPO/-/refresh endpoint. Any others we might have to cover?

Thanks for doing this research. We had it on our plate in 3.4 to untangle this EnqueueRepoUpdate mess as part of: #3345. So, we are familiar with these two first code paths, but not the last one. There seem to be a few more

Maybe it's worth changing the interface of EnqueueRepoUpdate (https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/pkg/repoupdater/client.go#L145:0) to make it more explicit when the expectation is that a repo is cloned if it doesn't exist already. The current behavior of needing to set Repo.URL is a bit hidden.

Definitely. We're changing that in #3345 to take in a simple repo ID.

tsenart · 2019-04-11T09:08:31Z

I'm sorry you had to go through all this pain. This EnqueueRepoUpdate call graph is in very poor condition as I realised two days ago.

It's a testament that this is high-priority tech debt we must address in the next release cycle.

tsenart · 2019-04-11T09:48:38Z

What are the main pieces of state that repo-updater maintains (e.g., repo queue)? Is there a way to tell what the current state is?

There's the repo metadata which is stored in Postgres by the sync loops and there's the git update scheduler which is a completely separate component that stores all its state in-memory. Whenever a sync happens, the scheduler is updated. But the git update scheduler can be updated via the repo updater HTTP API explicitly, too.

There are a number of places where DisableAutoGitUpdates is referenced. Its documentation says "Disable periodically fetching git contents for existing repositories" but it appears it may do more than that?

It should not do any more than that. Looking at all these references confirms that. It affects only the code paths that trigger git updates due to background metadata syncs: https://sourcegraph.com/search?q=repo:%5Egithub.com/sourcegraph/sourcegraph%24+DisableAutoGitUpdates+

I tried repro'ing with GitHub repositories (instead of Gitolite). AFAICT, the bug exists there, but is mitigated by the fact that the GitHub repo is eventually cloned in the background. But doesn't this violate "DisableAutoGitUpdates": true?

With DisableAutoGitUpdates set to true, the automatic git update scheduler is off. This means that after a repo updater metadata sync, updating this schedule is a no-op as you can see:

Here for the old metadata syncing code path.
And here for the new metadata syncing code path.

However, the repo can have its metadata synced AND THEN be cloned via a succession of calls to RepoLookup and EnqueueRepoUpdate.

The RepoLookup endpoint fetches the metadata from the code host and stores it back in Postgres via the Internal HTTP API.

The EnqueueRepoUpdate endpoint, then, is usually called with the results of a RepoLookup via the backend.GitRepo function.

The result of the RepoLookup should include the CloneURL, so it can then be passed to EnqueueRepoUpdate which works regardless of the value of DisableAutoGitUpdates, since it's considered a manual update.

EnqueueRepoUpdate works without CloneURL ONLY IF gitserver had cloned this repo before.

tsenart · 2019-04-11T09:50:48Z

And finally, I have some questions to try to clarify how this issue was affecting $CUSTOMER with the new release, but not the one they were running before.

Has $CUSTOMER changed disableAutoGitUpdates in their upgrade process?
Did $CUSTOMER have the experimental updateScheduler2 enabled before upgrading?

tsenart · 2019-04-11T09:52:11Z

@keegancsmith: Added you as a reviewer because a lot of the answers I gave here are intimately related with our upcoming work on #3345.

tsenart

Looks good to me.

keegancsmith

@tsenart all your information is correct here and awesome! Kind of wish it could live somewhere else easier to find. One addition, repo-updater doesn't have all metadata in postgres. Crucially the cloneURL is not stored. Instead we rely on caching the responses from codehost APIs. However, with our new syncer changes we now store the cloneURL so this certainly gonna be cleaned up in 3.4 (finally!!!)

keegancsmith · 2019-04-11T13:05:06Z

👍 this seems like the most important thing that was missing? I wonder why this wasn't set before.

From the docstring, it looks like the method was just added a placeholder. It can't actually fully fetch Gitolite repo metadata, because it would have to invoke the gitserver endpoint, which is only necessary because currently only gitserver has the Gitolite SSH credentials. My guess is that the author didn't want to call that endpoint here, but still needed to return something to indicate the repository exists.

BTW, I think we can fix this whole issue by making the Gitolite SSH credentials be an explicit part of config, then using the golang stdlib ssh package, instead of shelling out to ssh to get the Gitolite repo metadata. Then we can have repo-updater hit the Gitolite API directly.

BTW, I think we can fix this whole issue by making the Gitolite SSH credentials be an explicit part of config, then using the golang stdlib ssh package, instead of shelling out to ssh to get the Gitolite repo metadata. Then we can have repo-updater hit the Gitolite API directly.

We should probably do that as part of integrating Gitolite external services with the new syncer.

beyang · 2019-04-11T17:04:05Z

Answers to @tsenart's questions:

Has $CUSTOMER changed disableAutoGitUpdates in their upgrade process?
No.

Did $CUSTOMER have the experimental updateScheduler2 enabled before upgrading?
No. Also, they are upgrading from 2.x to 3.2.

A few follow-up questions/clarifications:

The syncer's responsibility is to collect the set of repositories from the code host. Then it should delete the outdated ones from the DB and enqueue the existing ones by calling the scheduler?
The syncer will store the clone URL in the DB after 3.4?
Which of the following does a repo-updater repo update cover? (1) Creating the metadata rows in the DB (2) Updating metadata in the DB (3) Syncing gitserver contents (4) Zoekt reindex. To me, it looks like it only covers (3), but maybe the syncer triggers 1 + 2? What triggers (4)?
After EnqueueRepoUpdate's interface is changed (enqueue-repo-update should use an ID #3345), will it still be necessary for the client to pass the clone URL or will repo-updater obtain that on its own if it is needed (i.e., if gitserver has not yet cloned the repo).

tsenart · 2019-04-12T09:37:06Z

The syncer's responsibility is to collect the set of repositories from the code host. Then it should delete the outdated ones from the DB and enqueue the existing ones by calling the scheduler?

The syncer's responsibility is to converge the set of mirrored repositories with what's defined in external service configuration and to keep their meta-data up-to-date.

The Git update scheduler subscribes to syncs and updates its own schedule with the full set of mirrored repos. Eventually, we'd like to move the Git update scheduler to git-server itself and use a unified log (e.g. using Redis Streams) for the data integration between repo-updater and git-server.

The syncer will store the clone URL in the DB after 3.4?

Yes.

3.3: GitHub, GitLab, Bitbucket Server and possibly "Other external service kind" which is in a PR (repo-updater: Syncing improvements for OTHER external services #3368)
3.4: Gitolite and AWS Code Commit

Which of the following does a repo-updater repo update cover? (1) Creating the metadata rows in the DB (2) Updating metadata in the DB (3) Syncing gitserver contents (4) Zoekt reindex. To me, it looks like it only covers (3), but maybe the syncer triggers 1 + 2? What triggers (4)?

The EnqueueRepoUpdate endpoint only covers (3), but, as written before, it's usually preceded at call site by a RepoLookup call which syncs metadata.

The syncer publishes diffs to a channel which the git update scheduler reads from and manages its own schedule.

As for (4), I believe Zoekt periodically polls all existing repos in the database via the frontend API. (Pull, not push). Can you confirm this @keegancsmith?

Our idea of using a unified log for data integration would also apply here. See https://brandur.org/redis-streams.

After EnqueueRepoUpdate's interface is changed (#3345), will it still be necessary for the client to pass the clone URL or will repo-updater obtain that on its own if it is needed (i.e., if gitserver has not yet cloned the repo).

It will only be necessary to pass the repo id. Nothing else.

keegancsmith · 2019-04-12T11:55:02Z

Zoekt is poll based. It is pretty aggressive polling. Given the time and resources it takes to index a repo, the slight delay we get from polling is far less than the amount of indexing it does in a busy org.

beyang force-pushed the bl/3.2-patch branch from 9383965 to 2f3d84e Compare April 11, 2019 04:07

beyang force-pushed the bl/3.2-patch branch from 2f3d84e to 9b1285e Compare April 11, 2019 04:22

tsenart self-requested a review April 11, 2019 09:04

tsenart added the team/core-services label Apr 11, 2019

tsenart requested a review from keegancsmith April 11, 2019 09:51

tsenart approved these changes Apr 11, 2019

View reviewed changes

tsenart changed the title ~~return clone URL for Gitolite repositories, fix #3336~~ return clone URL for Gitolite repositories Apr 11, 2019

keegancsmith reviewed Apr 11, 2019

View reviewed changes

beyang force-pushed the bl/3.2-patch branch from 9b1285e to 638412c Compare April 11, 2019 17:30

beyang added 3 commits April 11, 2019 17:31

return clone URL for Gitolite repositories, fix #3336

9fc38c0

fix repo-refresh endpoint

cdbd08a

fix test

3d91dbd

beyang force-pushed the bl/3.2-patch branch from e2dff3c to 3d91dbd Compare April 12, 2019 00:31

beyang merged commit 6b5c442 into master Apr 12, 2019

beyang deleted the bl/3.2-patch branch April 12, 2019 05:27

beyang added a commit that referenced this pull request Apr 12, 2019

return clone URL for Gitolite repositories (#3358)

48d5c9d

Conversation

beyang commented Apr 11, 2019 • edited by tsenart Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Apr 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

tsenart commented Apr 11, 2019

Uh oh!

tsenart commented Apr 11, 2019

Uh oh!

tsenart commented Apr 11, 2019

Uh oh!

tsenart commented Apr 11, 2019

Uh oh!

tsenart commented Apr 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tsenart commented Apr 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tsenart left a comment

Choose a reason for hiding this comment

Uh oh!

keegancsmith left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

keegancsmith Apr 11, 2019

Choose a reason for hiding this comment

Uh oh!

beyang Apr 11, 2019

Choose a reason for hiding this comment

Uh oh!

tsenart Apr 12, 2019

Choose a reason for hiding this comment

Uh oh!

beyang commented Apr 11, 2019

Uh oh!

tsenart commented Apr 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

keegancsmith commented Apr 12, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

beyang commented Apr 11, 2019 •

edited by tsenart

Loading

codecov Bot commented Apr 11, 2019 •

edited

Loading

tsenart commented Apr 11, 2019 •

edited

Loading

tsenart commented Apr 11, 2019 •

edited

Loading

tsenart commented Apr 12, 2019 •

edited

Loading