Re: Scraping Gitiles WebPage

Dave Borowitz

Apr 9, 2021, 11:24:31 AM

to Anurag Aravala, repo-discuss

Hi Anurag,

I'll take a shot at answering, but I don't work on Gitiles anymore, so if you need further support, I recommend repo-discuss (cc'ed).

In general, scraping via the (undocumented) JSON API is supported. Usually, it's as simple as adding "?format=JSON" to a URL you see in the browser. For example, I browsed to https://android.googlesource.com/platform/frameworks/base/+/refs/heads/master/data/, then confirmed that JSON works:

curl 'https://android.googlesource.com/platform/frameworks/base/+/refs/heads/master/data/?format=JSON&recursive'
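The same request can be sketched in Python. This is a minimal sketch, not a documented client: the helper names (to_json_url, parse_gitiles_json, list_paths) are made up for illustration, and the response shape (an "entries" list whose items carry "type" and "name") is an assumption based on what the tree endpoint returns in practice. Gitiles normally prefixes JSON bodies with the anti-XSSI marker )]}' which has to be stripped before parsing; the code tolerates its absence.

```python
import json
from urllib.parse import urlsplit, urlunsplit

# Marker Gitiles normally puts in front of JSON bodies (assumed here;
# the parser below also accepts bodies without it).
XSSI_PREFIX = ")]}'"


def to_json_url(browser_url, recursive=False):
    """Add format=JSON (and optionally recursive) to a Gitiles browser URL."""
    parts = urlsplit(browser_url)
    params = [p for p in parts.query.split("&") if p]
    params.append("format=JSON")
    if recursive:
        params.append("recursive")
    # urlsplit/urlunsplit leave the literal "+" in the path untouched.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, "&".join(params), parts.fragment)
    )


def parse_gitiles_json(body):
    """Strip the anti-XSSI prefix, if present, and decode the JSON body."""
    text = body.lstrip()
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)


def list_paths(tree):
    """Return the names of file (blob) entries in a decoded tree listing."""
    # Assumption: each entry is a dict with "type" and "name" keys.
    return [e["name"] for e in tree.get("entries", []) if e.get("type") == "blob"]
```

With the requests library, the body would come from something like requests.get(to_json_url(url, recursive=True)).text, passed to parse_gitiles_json. Whatever authentication your server needs has to be added to that call and is not covered here.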

The error message you got is very generic, and just means that something in the format of your URL is wrong. What URL are you using? Off the top of my head, another thing that might be happening is that your HTTP client is replacing the literal "+" sign with a space " " or "%20", which would make the URL invalid.
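One way to check for that kind of "+" mangling is to look at what standard URL helpers do to the path. The sketch below uses only Python's urllib.parse; the path and project name are hypothetical, and plus_separator_intact is an illustrative helper, not part of any library.

```python
from urllib.parse import quote, unquote_plus, urlsplit


def plus_separator_intact(url):
    """Return True if the URL path still contains the literal '/+/' separator."""
    return "/+/" in urlsplit(url).path


# Hypothetical Gitiles path; "my-project" is a placeholder.
path = "/plugins/gitiles/my-project/+/refs/heads/master"

# quote() with default arguments percent-encodes "+" as %2B.
encoded_default = quote(path)

# Listing "+" as a safe character keeps it literal.
encoded_safe = quote(path, safe="/+")

# Form-style decoding turns "+" into a space, which is the failure
# mode described above.
form_decoded = unquote_plus(path)
```

If you build URLs by hand, checking plus_separator_intact on the final URL (for requests, response.request.url shows what was actually sent) is a cheap way to rule this out.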

I also can't help you with authentication, since that's pretty dependent on your Gitiles setup. But based on the error message, I'm guessing the auth part is working, since that message probably indicates 400/404 rather than 401/403.
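To confirm that guess, it can help to look at the status code rather than the HTML body. A rough sketch of the distinction drawn above, with a hypothetical helper name and labels chosen only for illustration:

```python
def likely_cause(status_code):
    """Map an HTTP status code to the rough category discussed above."""
    if status_code in (401, 403):
        return "auth"   # credentials missing or rejected
    if status_code in (400, 404):
        return "url"    # request reached the server, but the URL is bad
    if 200 <= status_code < 300:
        return "ok"
    return "other"
```

With requests, the input would be response.status_code. Checking response.history as well shows whether the request was redirected along the way, which some login setups do.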

Hope this helps.

Dave

On Fri, Apr 9, 2021 at 12:01 AM Anurag Aravala <anurag....@gmail.com> wrote:

Hi Dave,
We have a use case where we need to get only the filenames from around 1000 repos in Gerrit at a time. We tried the Gerrit REST API, but it only gives us the files modified/added/deleted in a change, not all the filenames in a repo. So we thought of scraping the Gitiles web page to obtain the filenames, but when we use Python's requests.get with authentication, the HTTP response is HTML containing <h1>Cannot Parse as Gitiles URL</h1>. Can you please help me resolve this and get the page source?

Thanks and Regards,
Anurag Aravala

Dave Borowitz

Apr 14, 2021, 6:25:17 PM

to Anurag Aravala, repo-discuss

+repo-discuss

Sorry, I don't have the necessary context to help you more than what was in my first email. Perhaps someone on the repo-discuss list can help.

On Fri, Apr 9, 2021 at 9:27 AM Anurag Aravala <anurag....@gmail.com> wrote:

Thanks for the information. The Gitiles URL I'm using is "https://ec-gerrit.<company>.com/plugins/gitiles/<project name>/+/refs/heads/master". This URL uses SAML for authentication and returns the web page listing the files under a branch of a project in Gerrit. Up to the point of listing a project's branches, the URL is a plain Gerrit URL with no "gitiles" in it; after clicking on a particular branch ID, the files are listed and the URL takes the format above. I'm using the Python requests library to get the HTML. I have attached the web page I get back when I send the request with requests.get for the above URL (I created the HTML file from the response; if I open the URL in a browser, it works fine). You mentioned the possible issues in your email, but I'm sending this mail to explain the error in detail, with the URL, so that you can help me diagnose the problem. Please let me know what could go wrong when requesting the page source. I have already posted this on repo-discuss and didn't get any reply about the scraping issue; people suggested cloning the repos to get the filenames, which is impossible in our case. Please help me.

Thanks & Regards,
Anurag

Anurag Aravala.

Anurag Aravala

Apr 15, 2021, 2:29:18 AM

to Dave Borowitz, repo-discuss

Hi Dave, thanks for the information. I found that we have access to OpenGrok, which can help us with this task.
