Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

return clone URL for Gitolite repositories#3358

Merged
beyang merged 3 commits intomasterfrom
bl/3.2-patch
Apr 12, 2019
Merged

return clone URL for Gitolite repositories#3358
beyang merged 3 commits intomasterfrom
bl/3.2-patch

Conversation

@beyang
Copy link
Copy Markdown
Member

@beyang beyang commented Apr 11, 2019

What I've learned:
disableAutoGitUpdates doesn't affect the behavior of navigating directly to a repo to get it to clone
• The need for a clone URL to be set the first time gitserver clones a repository is documented in the code, but it's unclear where this clone URL actually comes from. Perhaps it would be useful for repo-updater to expose this and serve as the source of truth for such repository metadata? If that shouldn't be repo-updater, what should it be?
• Three code paths that might require a new repo to be cloned to gitserver are (1) repo-updater loop, (2) direct user navigation to a repo page, (3) triggering a repo-update by the $REPO/-/refresh endpoint. Any others we might have to cover?
• Maybe it's worth changing the interface of EnqueueRepoUpdate (https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/pkg/repoupdater/client.go#L145:0) to make it more explicit when the expectation is that a repo is cloned if it doesn't exist already. The current behavior of needing to set Repo.URL is a bit hidden.

Fixes #3336

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 11, 2019

Codecov Report

Merging #3358 into master will decrease coverage by <.01%.
The diff coverage is 42.1%.

Impacted Files Coverage Δ
cmd/repo-updater/repos/gitolite.go 0% <0%> (ø) ⬆️
pkg/extsvc/gitolite/codehost.go 0% <0%> (ø) ⬆️
cmd/frontend/internal/httpapi/repo_refresh.go 71.42% <80%> (+11.42%) ⬆️

@tsenart
Copy link
Copy Markdown
Contributor

tsenart commented Apr 11, 2019

@beyang: Added what you wrote on Slack to the PR description.

@tsenart
Copy link
Copy Markdown
Contributor

tsenart commented Apr 11, 2019

disableAutoGitUpdates doesn't affect the behavior of navigating directly to a repo to get it to clone

I suppose that makes some sense, if looking only at the name of the setting. Navigating to the repo isn't an automatic background action, it's a manual user action.

Perhaps it would be useful for repo-updater to expose this and serve as the source of truth for such repository metadata? If that shouldn't be repo-updater, what should it be?

Yes! This is indeed part of the changes in #2025. We now store all repository metadata in each repo row, including clone URLs.

However, it should be the case today even that repo-updater is the source of truth for repository metadata. That's what the repo lookup endpoint is for (which also serves clone URLs).

The difference with the new syncer is that we don't have to contact the upstream codehost for this metadata anymore, since it's all cached in the database and updated with every background sync.

Three code paths that might require a new repo to be cloned to gitserver are (1) repo-updater loop, (2) direct user navigation to a repo page, (3) triggering a repo-update by the $REPO/-/refresh endpoint. Any others we might have to cover?

Thanks for doing this research. We had it on our plate in 3.4 to untangle this EnqueueRepoUpdate mess as part of: #3345. So, we are familiar with these two first code paths, but not the last one. There seem to be a few more

Maybe it's worth changing the interface of EnqueueRepoUpdate (https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/pkg/repoupdater/client.go#L145:0) to make it more explicit when the expectation is that a repo is cloned if it doesn't exist already. The current behavior of needing to set Repo.URL is a bit hidden.

Definitely. We're changing that in #3345 to take in a simple repo ID.

@tsenart tsenart self-requested a review April 11, 2019 09:04
@tsenart
Copy link
Copy Markdown
Contributor

tsenart commented Apr 11, 2019

I'm sorry you had to go through all this pain. This EnqueueRepoUpdate call graph is in very poor condition as I realised two days ago.

It's a testament that this is high-priority tech debt we must address in the next release cycle.

@tsenart
Copy link
Copy Markdown
Contributor

tsenart commented Apr 11, 2019

What are the main pieces of state that repo-updater maintains (e.g., repo queue)? Is there a way to tell what the current state is?

There's the repo metadata which is stored in Postgres by the sync loops and there's the git update scheduler which is a completely separate component that stores all its state in-memory. Whenever a sync happens, the scheduler is updated. But the git update scheduler can be updated via the repo updater HTTP API explicitly, too.

There are a number of places where DisableAutoGitUpdates is referenced. Its documentation says "Disable periodically fetching git contents for existing repositories" but it appears it may do more than that?

It should not do any more than that. Looking at all these references confirms that. It affects only the code paths that trigger git updates due to background metadata syncs: https://sourcegraph.com/search?q=repo:%5Egithub.com/sourcegraph/sourcegraph%24+DisableAutoGitUpdates+

I tried repro'ing with GitHub repositories (instead of Gitolite). AFAICT, the bug exists there, but is mitigated by the fact that the GitHub repo is eventually cloned in the background. But doesn't this violate "DisableAutoGitUpdates": true?

With DisableAutoGitUpdates set to true, the automatic git update scheduler is off. This means that after a repo updater metadata sync, updating this schedule is a no-op as you can see:

  1. Here for the old metadata syncing code path.
  2. And here for the new metadata syncing code path.

However, the repo can have its metadata synced AND THEN be cloned via a succession of calls to RepoLookup and EnqueueRepoUpdate.

The RepoLookup endpoint fetches the metadata from the code host and stores it back in Postgres via the Internal HTTP API.

The EnqueueRepoUpdate endpoint, then, is usually called with the results of a RepoLookup via the backend.GitRepo function.

The result of the RepoLookup should include the CloneURL, so it can then be passed to EnqueueRepoUpdate which works regardless of the value of DisableAutoGitUpdates, since it's considered a manual update.

EnqueueRepoUpdate works without CloneURL ONLY IF gitserver had cloned this repo before.

@tsenart
Copy link
Copy Markdown
Contributor

tsenart commented Apr 11, 2019

And finally, I have some questions to try to clarify how this issue was affecting $CUSTOMER with the new release, but not the one they were running before.

  1. Has $CUSTOMER changed disableAutoGitUpdates in their upgrade process?
  2. Did $CUSTOMER have the experimental updateScheduler2 enabled before upgrading?

@tsenart tsenart requested a review from keegancsmith April 11, 2019 09:51
@tsenart
Copy link
Copy Markdown
Contributor

tsenart commented Apr 11, 2019

@keegancsmith: Added you as a reviewer because a lot of the answers I gave here are intimately related with our upcoming work on #3345.

Copy link
Copy Markdown
Contributor

@tsenart tsenart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@tsenart tsenart changed the title return clone URL for Gitolite repositories, fix #3336 return clone URL for Gitolite repositories Apr 11, 2019
Copy link
Copy Markdown
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsenart all your information is correct here and awesome! Kind of wish it could live somewhere else easier to find. One addition, repo-updater doesn't have all metadata in postgres. Crucially the cloneURL is not stored. Instead we rely on caching the responses from codehost APIs. However, with our new syncer changes we now store the cloneURL so this certainly gonna be cleaned up in 3.4 (finally!!!)

Comment thread cmd/frontend/internal/httpapi/repo_refresh.go Outdated
Comment thread cmd/repo-updater/repos/gitolite.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this seems like the most important thing that was missing? I wonder why this wasn't set before.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the docstring, it looks like the method was just added a placeholder. It can't actually fully fetch Gitolite repo metadata, because it would have to invoke the gitserver endpoint, which is only necessary because currently only gitserver has the Gitolite SSH credentials. My guess is that the author didn't want to call that endpoint here, but still needed to return something to indicate the repository exists.

BTW, I think we can fix this whole issue by making the Gitolite SSH credentials be an explicit part of config, then using the golang stdlib ssh package, instead of shelling out to ssh to get the Gitolite repo metadata. Then we can have repo-updater hit the Gitolite API directly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I think we can fix this whole issue by making the Gitolite SSH credentials be an explicit part of config, then using the golang stdlib ssh package, instead of shelling out to ssh to get the Gitolite repo metadata. Then we can have repo-updater hit the Gitolite API directly.

We should probably do that as part of integrating Gitolite external services with the new syncer.

@beyang
Copy link
Copy Markdown
Member Author

beyang commented Apr 11, 2019

Answers to @tsenart's questions:

Has $CUSTOMER changed disableAutoGitUpdates in their upgrade process?
No.

Did $CUSTOMER have the experimental updateScheduler2 enabled before upgrading?
No. Also, they are upgrading from 2.x to 3.2.

A few follow-up questions/clarifications:

  • The syncer's responsibility is to collect the set of repositories from the code host. Then it should delete the outdated ones from the DB and enqueue the existing ones by calling the scheduler?
  • The syncer will store the clone URL in the DB after 3.4?
  • Which of the following does a repo-updater repo update cover? (1) Creating the metadata rows in the DB (2) Updating metadata in the DB (3) Syncing gitserver contents (4) Zoekt reindex. To me, it looks like it only covers (3), but maybe the syncer triggers 1 + 2? What triggers (4)?
  • After EnqueueRepoUpdate's interface is changed (enqueue-repo-update should use an ID #3345), will it still be necessary for the client to pass the clone URL or will repo-updater obtain that on its own if it is needed (i.e., if gitserver has not yet cloned the repo).

@beyang beyang merged commit 6b5c442 into master Apr 12, 2019
@beyang beyang deleted the bl/3.2-patch branch April 12, 2019 05:27
@tsenart
Copy link
Copy Markdown
Contributor

tsenart commented Apr 12, 2019

The syncer's responsibility is to collect the set of repositories from the code host. Then it should delete the outdated ones from the DB and enqueue the existing ones by calling the scheduler?

The syncer's responsibility is to converge the set of mirrored repositories with what's defined in external service configuration and to keep their meta-data up-to-date.

The Git update scheduler subscribes to syncs and updates its own schedule with the full set of mirrored repos. Eventually, we'd like to move the Git update scheduler to git-server itself and use a unified log (e.g. using Redis Streams) for the data integration between repo-updater and git-server.

The syncer will store the clone URL in the DB after 3.4?

Yes.

Which of the following does a repo-updater repo update cover? (1) Creating the metadata rows in the DB (2) Updating metadata in the DB (3) Syncing gitserver contents (4) Zoekt reindex. To me, it looks like it only covers (3), but maybe the syncer triggers 1 + 2? What triggers (4)?

The EnqueueRepoUpdate endpoint only covers (3), but, as written before, it's usually preceded at call site by a RepoLookup call which syncs metadata.

The syncer publishes diffs to a channel which the git update scheduler reads from and manages its own schedule.

As for (4), I believe Zoekt periodically polls all existing repos in the database via the frontend API. (Pull, not push). Can you confirm this @keegancsmith?

Our idea of using a unified log for data integration would also apply here. See https://brandur.org/redis-streams.

After EnqueueRepoUpdate's interface is changed (#3345), will it still be necessary for the client to pass the clone URL or will repo-updater obtain that on its own if it is needed (i.e., if gitserver has not yet cloned the repo).

It will only be necessary to pass the repo id. Nothing else.

@keegancsmith
Copy link
Copy Markdown
Member

Zoekt is poll based. It is pretty aggressive polling. Given the time and resources it takes to index a repo, the slight delay we get from polling is far less than the amount of indexing it does in a busy org.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Repository shows up as empty when "disableAutoGitUpdates" is true

3 participants