Querying the GitHub GraphQL API
For the very impatient, here’s a gist with the full script.
GraphQL Query
If you’re new to GraphQL, the basic idea is that you can fetch data objects, and then use the relationships between them to fetch related data in the same request. The Introduction to GraphQL from GitHub is a good place to look if you want to learn more.
In my example, I started by fetching a user by their login (their username or handle). Then I used the handy ContributionsCollection field to get information about the user’s activity. This is one of my favourite data features of the GitHub v4 GraphQL API, and I have used it to build a few different dashboards because it has some cool data in it that would otherwise be laborious to extract from GitHub (although I do really like their v3 REST API too!).
From ContributionsCollection, I can pick the PullRequestReviewContributions, as that’s all I’m interested in here: I’m looking for review comment activity on repositories where the user has push access. Note that there’s no way to tell if the user has previously had push access. For example, my own account doesn’t show any of the Redocly repositories despite that being a big part of my open source activity in recent years, because I don’t have maintainer rights now that I don’t work there any more.
Finally I also collect information about the repository that the pull request is in, because my end goal is to collect a list of projects.
Here’s the query:
query ($login: String!) {
  user(login: $login) {
    login
    contributionsCollection {
      pullRequestReviewContributions(last: 100) {
        nodes {
          pullRequestReview {
            authorCanPushToRepository
          }
          pullRequest {
            repository {
              name
              owner {
                login
              }
            }
          }
        }
      }
    }
  }
}
This query uses a $login variable, which is supplied separately when we run the query.
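To make "supplied separately" concrete: a GraphQL request body is a JSON object with the query text under query and the variable values under variables. Here is a minimal sketch using a shortened query; the login octocat is only a placeholder value for illustration.

```python
import json

# A shortened version of the query above, for illustration only.
query = """
query ($login: String!) {
  user(login: $login) {
    login
  }
}
"""

# Keys match the variable names in the query, without the dollar sign.
variables = {"login": "octocat"}  # placeholder login, not a real input

# The query text and the variables travel side by side in one JSON body.
payload = {"query": query, "variables": variables}
body = json.dumps(payload)
print(body)
```

The query text itself never changes between runs; only the variables object does.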
There’s a handy explorer tool, but I found it much easier to quickly write a Python wrapper script so I could edit the query and re-run the script using my local development tools. YMMV, but the graphical web-based interfaces don’t play that well with my accessibility tools so it was quite literally a painful process for me personally!
Python Script
Why Python? Why not? I don’t write Python every day but I wish I did and it’s a decent choice for something like this to fetch some data and manipulate it.
I’m using the script as a CLI entry point, including passing in a variable, so the first thing is to grab the input, or print a usage message and exit if there isn’t any. That section looks like this:
import sys

# expect exactly one argument: the GitHub login to look up
if len(sys.argv) != 2:
    print("Usage: python3 github-api.py name")
    sys.exit(1)
name = sys.argv[1]
# variables for query
variables = {
    "login": name
}
I don’t need to pass in any other variables, so the variables dictionary is ready to pass into the API call when we make it. I also define a query variable that contains the query shown in the first section of the blog post (so I’m not pasting it again here!).
I’ve got my GitHub access token in an environment variable named GH_KEY, mostly so that I don’t accidentally paste it when I make the gist!! Using an environment variable is good practice: each user maintains their own and there’s no risk of checking something into the repository that shouldn’t be there. If you use a tool like dotenv, remember to add its files to your .gitignore file to avoid mishaps.
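If the variable isn’t set, a plain os.environ lookup fails with a bare KeyError, which isn’t much help to whoever runs the script. Here is a small sketch of a friendlier lookup; get_token is a hypothetical helper name and is not part of the original script.

```python
import os


def get_token(env=None, name="GH_KEY"):
    """Return the access token from the environment, or raise with a hint.

    env defaults to os.environ; passing a plain dict makes this easy to try out.
    """
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {name} environment variable to a GitHub access token"
        )
    return token
```

In the script this would be called once at startup, for example token = get_token(), so a missing token is reported before any request is made.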
Calling the API itself is a bit of an anticlimax. Here’s the Python code:
import os
import requests

url = 'https://api.github.com/graphql'
headers = {'Authorization': 'Bearer ' + os.environ['GH_KEY']}
response = requests.post(url, json={'query': query, 'variables': variables}, headers=headers)
Here I’m setting the URL in its own variable: this is super useful in case you ever want to send the request somewhere else, through a debugging or testing tool or something. I’m also setting the header to authenticate myself. I’m only accessing publicly-available data here, but GitHub’s GraphQL API expects a token on every request, so this step is needed even for public data. On the REST API, authenticating also gives higher rate limits than anonymous requests, which didn’t matter for this project but might be useful for you to know.
Finally, the rest of the script checks the data, picks out the projects the user has been working on, puts them in a list, and prints some output.
# be careful: GraphQL errors still come back with response status 200
if response.status_code == 200:
    data = response.json()
    if 'errors' in data:
        print("Errors:")
        print(data['errors'])
    elif 'data' in data and 'user' in data['data'] and data['data']['user'] is not None:
        login = data['data']['user']['login']
        projects = []
        for pr_review in data['data']['user']['contributionsCollection']['pullRequestReviewContributions']['nodes']:
            pr = pr_review['pullRequest']
            repo = pr['repository']['name']
            owner = pr['repository']['owner']['login']
            can_push = pr_review['pullRequestReview']['authorCanPushToRepository']
            project_string = f"{owner}/{repo}"
            if can_push and (project_string not in projects):
                projects.append(project_string)
        print(f"{login} | {projects}")
    else:
        print("Error: User data not found in the response.")
else:
    print(f"Error: Request failed with status code {response.status_code}")
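As a variation, the same checks can be wrapped in a function that works on the parsed JSON body, which makes it possible to try the logic against a hand-written response without calling the API. This is only a sketch: collect_projects is a hypothetical name, and the sample data is made up to match the shape of the query result.

```python
def collect_projects(data):
    """Return (login, projects) from a parsed GraphQL response body.

    Raises ValueError when the body carries errors or contains no user.
    """
    if data.get("errors"):
        raise ValueError(f"GraphQL errors: {data['errors']}")
    user = (data.get("data") or {}).get("user")
    if user is None:
        raise ValueError("User data not found in the response")

    projects = []
    nodes = user["contributionsCollection"]["pullRequestReviewContributions"]["nodes"]
    for node in nodes:
        repo = node["pullRequest"]["repository"]
        project = f"{repo['owner']['login']}/{repo['name']}"
        can_push = node["pullRequestReview"]["authorCanPushToRepository"]
        # Keep first-seen order and skip repositories already listed.
        if can_push and project not in projects:
            projects.append(project)
    return user["login"], projects


def _node(owner, name, can_push):
    """Build one made-up review contribution in the shape the query returns."""
    return {
        "pullRequestReview": {"authorCanPushToRepository": can_push},
        "pullRequest": {"repository": {"name": name, "owner": {"login": owner}}},
    }


# Made-up response data for illustration.
sample = {
    "data": {
        "user": {
            "login": "example-user",
            "contributionsCollection": {
                "pullRequestReviewContributions": {
                    "nodes": [
                        _node("example-org", "docs", True),
                        _node("example-org", "docs", True),
                        _node("other-org", "tool", False),
                    ]
                }
            },
        }
    }
}

print(collect_projects(sample))  # ('example-user', ['example-org/docs'])
```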
The main “gotcha” is that the request can be successful with status 200 but, if you did something wrong in your query, there can still be errors in the response body. I’m sure there are other error cases that this script doesn’t handle, but the only one I’m aware of is that it got VERY confused when I tried to fetch data about someone who had deleted their account! I also didn’t implement pagination because I didn’t need it, but it’s worth considering if you’re doing something with a GraphQL API and might have more results than fit in one page.
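For anyone who does need pagination, here is a rough sketch of cursor-based paging. It is not what the script does: it swaps last: 100 for first: 100 with an after cursor, and it assumes the connection exposes the usual pageInfo fields (hasNextPage and endCursor). The names PAGED_QUERY, fetch_all_nodes and run_query are hypothetical; run_query stands in for whatever function sends the request and returns the parsed JSON body.

```python
# Assumed variant of the query: pages forward with first/after and pageInfo.
PAGED_QUERY = """
query ($login: String!, $cursor: String) {
  user(login: $login) {
    contributionsCollection {
      pullRequestReviewContributions(first: 100, after: $cursor) {
        pageInfo { hasNextPage endCursor }
        nodes {
          pullRequestReview { authorCanPushToRepository }
          pullRequest { repository { name owner { login } } }
        }
      }
    }
  }
}
"""


def fetch_all_nodes(run_query, login):
    """Collect the nodes from every page of review contributions.

    run_query is a hypothetical callable taking (query, variables) and
    returning the parsed JSON body, e.g. a thin wrapper around requests.post.
    """
    nodes = []
    cursor = None  # a null cursor asks for the first page
    while True:
        body = run_query(PAGED_QUERY, {"login": login, "cursor": cursor})
        if body.get("errors"):
            raise ValueError(f"GraphQL errors: {body['errors']}")
        user = (body.get("data") or {}).get("user")
        if user is None:
            raise ValueError("User data not found in the response")
        page = user["contributionsCollection"]["pullRequestReviewContributions"]
        nodes.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return nodes
        # Ask for the page that starts after the last item we received.
        cursor = page["pageInfo"]["endCursor"]
```

Passing the request function in as an argument keeps the paging loop separate from the HTTP call, so the loop can be exercised with canned responses.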
REST vs GraphQL
The naming of the REST API as v3 and the GraphQL API as v4 is a bit misleading in my opinion, since both have their own strengths and I can’t imagine GitHub deprecating their REST API when it’s so widely used! However, it is always worth checking what’s available in each one, as some things are only in one API or the other; discussions are GraphQL only (or they were last time I implemented something that needed them), so take care.
Good luck with your GraphQL projects. Drop me a comment and let me know what you build!