Back Up GitHub and GitLab Repositories Using Golang
Want to learn Golang and build something useful? Learn how to write a tool to back up your GitHub and GitLab repositories.
GitHub and GitLab are two popular Git repository hosting services that are used to host and manage open-source projects. They also have become an easy way for content creators to invite others to share and collaborate without having to set up their own infrastructure.
Using hosted services that you don't manage yourself, however, comes with a downside. Systems fail, services go down and disks crash. Content hosted on remote services can simply vanish. Wouldn't it be nice if you could have an easy way to back up your git repositories periodically into a place you control?
If you follow along with this article, you will write a Golang program to back up git repositories from GitHub and GitLab (including custom GitLab installations). Being familiar with Golang basics will be helpful, but not required. Let's get started!
Hello Golang
The latest stable release of Golang at the time of this writing is 1.8. The package name is usually golang, but if your Linux distro doesn't have this release, you can download the Golang compiler and other tools for Linux. Once downloaded, extract it to /usr/local:
$ sudo tar -C /usr/local -xzf <filename-from-above>
$ export PATH=$PATH:/usr/local/go/bin
Opening a new terminal and typing go version should show the following:
$ go version
go version go1.8 linux/amd64
Let's write your first program. Listing 1 shows a program that expects a -name flag (or argument) when run and prints a greeting using the specified name. Compile and run the program as follows:
$ go build listing1.go
$ ./listing1 -name "Amit"
Hello Amit
$ ./listing1
2017/02/18 22:48:25 Please specify your name using -name
$ echo $?
1
If you don't specify the -name argument, the program prints a message and exits with a non-zero exit code. You can combine compiling and running the program using go run:
$ go run listing1.go -name Amit
2017/03/04 23:08:11 Hello Amit
Listing 1. Example Program listing1.go
package main

import (
    "flag"
    "log"
)

func main() {
    name := flag.String("name", "", "Your Name")
    flag.Parse()
    if len(*name) != 0 {
        log.Printf("Hello %s", *name)
    } else {
        log.Fatal("Please specify your name using -name")
    }
}
The first line in the program declares the package for the program. The main package is special, and any executable Go program must live in the main package. Next, the program imports two packages from the Golang standard library using the import statement:
import (
    "flag"
    "log"
)
The "flag" package is used to handle command-line arguments to programs, and the "log" package is used for logging. Next, the program defines the main() function where the program execution starts:
func main() {
    name := flag.String("name", "", "Your Name")
    flag.Parse()
    if len(*name) != 0 {
        log.Printf("Hello %s", *name)
    } else {
        log.Fatal("Please specify your name using -name")
    }
}
Unlike other functions you'll write, the main function doesn't return anything nor does it take any arguments. The first statement in the main() function above defines a string flag, "name", with a default value of an empty string and "Your Name" as the help message. The return value of the function is a string pointer stored in the variable, name. The := is a shorthand notation of declaring a variable where its type is inferred from the value being assigned to it. In this case, it is of type *string, a reference or pointer to a string value.
The Parse() function parses the flags and makes the specified flag values available via the returned pointer. If a value has been provided to the "-name" flag when executing the program, the value will be stored in "name" and is accessible via *name (recall that name is a string pointer). Hence, you can check whether the length of the string referred to via name is non-zero, and if so, print a greeting via the Printf() function of the log package. If, however, no value was specified, you use the Fatal() function to print a message. The Fatal() function prints the specified message and terminates the program execution.
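The flag handling described above can also be tried out on an explicit argument slice. The following is a minimal sketch rather than part of the article's listings: the helper name greet and the use of a private flag.FlagSet (instead of the package-level functions used in Listing 1) are assumptions made so the logic can be called more than once, and io.Discard assumes Go 1.16 or later.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// greet parses args the way Listing 1 parses the command line, but on
// a private FlagSet so it can be called repeatedly (hypothetical helper).
func greet(args []string) (string, error) {
	fs := flag.NewFlagSet("listing1", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text off the terminal in this sketch
	name := fs.String("name", "", "Your Name")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	// name is a *string; dereference it to read the parsed value.
	if len(*name) == 0 {
		return "", errors.New("please specify your name using -name")
	}
	return "Hello " + *name, nil
}

func main() {
	msg, err := greet([]string{"-name", "Amit"})
	fmt.Println(msg, err)
}
```

Returning an error instead of calling log.Fatal() leaves the decision to exit with the caller, which is one way to keep such a helper testable.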
Structures, Slices and Maps
The program shown in Listing 2 demonstrates the following different things:
- Defining a struct data type.
- Creating a map.
- Creating a slice and iterating over it.
- Defining a user-defined function.
Listing 2. Structures, Slices and Maps Example
package main

import (
    "log"
)

type Repository struct {
    GitURL string
    Name   string
}

func getRepo(id int) Repository {
    repos := map[int]Repository{
        1: Repository{GitURL: "ssh://github.com/amitsaha/gitbackup",
            Name: "gitbackup"},
        2: Repository{GitURL: "ssh://github.com/amitsaha/lj_gitbackup",
            Name: "lj_gitbackup"},
    }
    return repos[id]
}

func backUp(r *Repository) {
    log.Printf("Backing up %s\n", r.Name)
}

func main() {
    var repositories []Repository
    repositories = append(repositories, getRepo(1))
    repositories = append(repositories, getRepo(2))
    repositories = append(repositories, getRepo(3))
    for _, r := range repositories {
        if (Repository{}) != r {
            backUp(&r)
        }
    }
}
At the beginning, you define a new struct data type Repository as follows:
type Repository struct {
    GitURL string
    Name   string
}
The structure Repository has two members: GitURL and Name, both of type string. You can define a variable of this structure type using r := Repository{"git+ssh://git.mydomain.com/myrepo", "myrepo"}. You can choose to leave one or both members out when defining a structure variable. For example, you can leave the GitURL unset using r := Repository{Name: "myrepo"}, or you even can leave both out. When you leave a member unset, the value defaults to the zero value for that type: 0 for int, empty string for string type.
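To make the zero-value behaviour concrete, here is a small sketch. The helper isEmpty is a name introduced only for this illustration; it relies on the fact that structs whose fields are all comparable can be compared with ==.

```go
package main

import "fmt"

type Repository struct {
	GitURL string
	Name   string
}

// isEmpty reports whether every field of r still holds its zero value
// (hypothetical helper for this sketch).
func isEmpty(r Repository) bool {
	return r == Repository{}
}

func main() {
	partial := Repository{Name: "myrepo"} // GitURL is left at its zero value, ""
	var unset Repository                  // both fields are ""
	fmt.Printf("%q %q\n", partial.GitURL, partial.Name)
	fmt.Println(isEmpty(partial), isEmpty(unset))
}
```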
Next, you define a function, getRepo, which takes an integer as argument and returns a value of type Repository:
func getRepo(id int) Repository {
    repos := map[int]Repository{
        1: Repository{GitURL: "ssh://github.com/amitsaha/gitbackup",
            Name: "gitbackup"},
        2: Repository{GitURL: "ssh://github.com/amitsaha/lj_gitbackup",
            Name: "lj_gitbackup"},
    }
    return repos[id]
}
In the getRepo() function, you create a map, or hash table, of key-value pairs, with an integer as the key and a value of type Repository. The map is initialized with two key-value pairs. The function returns the Repository that corresponds to the specified integer. If a specified key is not found in a map, a zero value of the value's type is returned. In this case, if an integer other than 1 or 2 is supplied, a value of type Repository is returned with both the members set to empty strings.
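Go also offers a two-value form of the map lookup that tells a missing key apart from a stored zero value. The sketch below shows it; lookupRepo is a hypothetical variant of getRepo() written for this illustration and is not part of the listings.

```go
package main

import "fmt"

type Repository struct {
	GitURL string
	Name   string
}

// lookupRepo is a hypothetical variant of getRepo that also reports
// whether the key was present in the map.
func lookupRepo(id int) (Repository, bool) {
	repos := map[int]Repository{
		1: {GitURL: "ssh://github.com/amitsaha/gitbackup", Name: "gitbackup"},
		2: {GitURL: "ssh://github.com/amitsaha/lj_gitbackup", Name: "lj_gitbackup"},
	}
	r, ok := repos[id] // ok is false when id is not a key in the map
	return r, ok
}

func main() {
	for _, id := range []int{1, 3} {
		r, ok := lookupRepo(id)
		fmt.Println(id, ok, r.Name)
	}
}
```

With this form, the caller can skip missing entries by checking ok instead of comparing against an empty Repository.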
Next, you define a function backUp(), which accepts a pointer to a variable of type Repository as an argument and prints the Name of the repository. In the final program, this function actually will create a backup of a repository.
Finally, there is the main() function:
func main() {
    var repositories []Repository
    repositories = append(repositories, getRepo(1))
    repositories = append(repositories, getRepo(2))
    repositories = append(repositories, getRepo(3))
    for _, r := range repositories {
        if (Repository{}) != r {
            backUp(&r)
        }
    }
}
In the first statement, you create a slice, repositories, that will store elements of type Repository. A slice in Golang is a dynamically sized array, similar to a list in Python. You then call the getRepo() function to obtain a repository corresponding to the key 1 and store the returned value in the repositories slice using the append() function. You do the same in the next two statements. When you call the getRepo() function with the key, 3, you get back an empty value of type Repository.
You then use a for loop with the range clause to iterate over the elements of the slice, repositories. On each iteration, range yields the index of the element and the element itself. The index isn't needed here, so it is assigned to the blank identifier _, which discards it, and the element is referred to via the r variable. You check if the element is not an empty Repository variable, and if it isn't, you call the backUp() function, passing the address of r, which holds a copy of the element. It is worth mentioning that there is no reason to pass an address here; you could have passed the element's value itself. However, passing by address is a good practice when a structure has a large number of members.
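One consequence of how range works is that the loop variable is a copy of each element, so changes made through it do not reach the slice. The following sketch contrasts that with indexing into the slice; both helper names are made up for this illustration.

```go
package main

import "fmt"

type Repository struct {
	GitURL string
	Name   string
}

// renameByValue changes the loop variable, which is a copy of each
// element, so the slice itself is left untouched.
func renameByValue(repos []Repository, suffix string) {
	for _, r := range repos {
		r.Name += suffix
	}
}

// renameByIndex changes the elements in place through the index.
func renameByIndex(repos []Repository, suffix string) {
	for i := range repos {
		repos[i].Name += suffix
	}
}

func main() {
	repos := []Repository{{Name: "gitbackup"}, {Name: "lj_gitbackup"}}
	renameByValue(repos, ".bak")
	fmt.Println(repos[0].Name, repos[1].Name)
	renameByIndex(repos, ".bak")
	fmt.Println(repos[0].Name, repos[1].Name)
}
```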
When you build and run this program, you'll see the following output:
$ go run listing2.go
2017/02/19 19:44:32 Backing up gitbackup
2017/02/19 19:44:32 Backing up lj_gitbackup
Goroutines and Channels
Consider the previous program (Listing 2). You call the backUp() function with every repo in the repositories serially. When you actually create a backup of a large number of repositories, doing them serially can be slow. Since each repository backup is independent of any other, they can be run in parallel. Golang makes it really easy to have multiple simultaneous units of execution in a program using goroutines.
A goroutine is what other programming languages refer to as lightweight threads or green threads. By default, a Golang program is said to be executing in a main goroutine, which can spawn other goroutines. A main goroutine can wait for all the spawned goroutines to finish before finishing up using a variable of WaitGroup type, as you'll see next.
Listing 3 modifies the previous program such that the backUp() function is called in a goroutine. The main() function declares a variable, wg of type WaitGroup defined in the sync package, and then sets up a deferred call to the Wait() function of this variable. The defer statement is used to execute any function just before the current function returns. Thus, you ensure that you wait for all the goroutines to finish before exiting the program.
Listing 3. Goroutine Example
package main

import (
    "log"
    "sync"
)

type Repository struct {
    GitURL string
    Name   string
}

func getRepo(id int) Repository {
    repos := map[int]Repository{
        1: Repository{GitURL: "ssh://github.com/amitsaha/gitbackup",
            Name: "gitbackup"},
        2: Repository{GitURL: "ssh://github.com/amitsaha/lj_gitbackup",
            Name: "lj_gitbackup"},
    }
    return repos[id]
}

func backUp(r *Repository, wg *sync.WaitGroup) {
    defer wg.Done()
    log.Printf("Backing up %s\n", r.Name)
}

func main() {
    var wg sync.WaitGroup
    defer wg.Wait()
    var repositories []Repository
    repositories = append(repositories, getRepo(1))
    repositories = append(repositories, getRepo(2))
    repositories = append(repositories, getRepo(3))
    for _, r := range repositories {
        if (Repository{}) != r {
            wg.Add(1)
            go func(r Repository) {
                backUp(&r, &wg)
            }(r)
        }
    }
}
The other primary change in the main() function is how you call the backUp() function. Instead of calling this function directly, you call it in a new goroutine as follows:
wg.Add(1)
go func(r Repository) {
    backUp(&r, &wg)
}(r)
You call the Add() function with an argument 1 to indicate that you'll be creating a new goroutine that you want to wait for before you exit. Then, you define an anonymous function taking an argument, r of type Repository, which calls the function backUp() with an additional argument, a reference to the variable wg, the WaitGroup variable declared earlier.
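The Add(), Done() and Wait() calls can be seen working together in the sketch below. It is a simplified illustration, not the article's code: the helper backUpAll is hypothetical, and it records a string per repository instead of backing anything up so that the result can be inspected after Wait() returns.

```go
package main

import (
	"fmt"
	"sync"
)

// backUpAll starts one goroutine per name and returns only after all
// of them have called Done (hypothetical helper; nothing is cloned).
func backUpAll(names []string) []string {
	var wg sync.WaitGroup
	done := make([]string, len(names)) // one slot per goroutine, so no locking is needed
	for i, name := range names {
		wg.Add(1) // register the goroutine before it starts
		go func(i int, name string) {
			defer wg.Done()
			done[i] = "backed up " + name
		}(i, name)
	}
	wg.Wait() // blocks until the counter drops back to zero
	return done
}

func main() {
	for _, line := range backUpAll([]string{"gitbackup", "lj_gitbackup"}) {
		fmt.Println(line)
	}
}
```

Passing i and name as arguments gives each goroutine its own copies, the same reason Listing 3 passes r to the anonymous function.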
Consider the scenario where you have a large number of elements in your repositories list—a very realistic scenario for this backup tool. Spawning a goroutine for each element in the repository can easily lead to having an uncontrolled number of goroutines running concurrently. This can lead to the program hitting per-process memory and file-descriptor limits imposed by the operating system.
Thus, you would want to regulate the maximum number of goroutines spawned by the program and spawn a new goroutine only when the ones executing have finished. Channels in Golang allow you to achieve this and other synchronization operations among goroutines. Listing 4 shows how you can regulate the maximum number of goroutines spawned.
Listing 4. Channels Example
package main

import (
    "log"
    "sync"
)

type Repository struct {
    GitURL string
    Name   string
}

func getRepo(id int) Repository {
    repos := map[int]Repository{
        1: Repository{GitURL: "ssh://github.com/amitsaha/gitbackup",
            Name: "gitbackup"},
        2: Repository{GitURL: "ssh://github.com/amitsaha/lj_gitbackup",
            Name: "lj_gitbackup"},
        3: Repository{GitURL: "ssh://github.com/amitsaha/gitbackup",
            Name: "gitbackup"},
        4: Repository{GitURL: "ssh://github.com/amitsaha/lj_gitbackup",
            Name: "lj_gitbackup"},
        5: Repository{GitURL: "ssh://github.com/amitsaha/gitbackup",
            Name: "gitbackup"},
        6: Repository{GitURL: "ssh://github.com/amitsaha/lj_gitbackup",
            Name: "lj_gitbackup"},
        7: Repository{GitURL: "ssh://github.com/amitsaha/gitbackup",
            Name: "gitbackup"},
        8: Repository{GitURL: "ssh://github.com/amitsaha/lj_gitbackup",
            Name: "lj_gitbackup"},
        9: Repository{GitURL: "ssh://github.com/amitsaha/gitbackup",
            Name: "gitbackup"},
        10: Repository{GitURL: "ssh://github.com/amitsaha/lj_gitbackup",
            Name: "lj_gitbackup"},
    }
    return repos[id]
}

func backUp(r *Repository, wg *sync.WaitGroup) {
    defer wg.Done()
    log.Printf("Backing up %s\n", r.Name)
}

func main() {
    var wg sync.WaitGroup
    defer wg.Wait()
    var repositories []Repository
    repositories = append(repositories, getRepo(1))
    repositories = append(repositories, getRepo(2))
    repositories = append(repositories, getRepo(3))
    repositories = append(repositories, getRepo(4))
    repositories = append(repositories, getRepo(5))
    repositories = append(repositories, getRepo(6))
    repositories = append(repositories, getRepo(7))
    repositories = append(repositories, getRepo(8))
    repositories = append(repositories, getRepo(9))
    repositories = append(repositories, getRepo(10))
    // Create a channel of capacity 5
    tokens := make(chan bool, 5)
    for _, r := range repositories {
        if (Repository{}) != r {
            wg.Add(1)
            // Get a token
            tokens <- true
            go func(r Repository) {
                backUp(&r, &wg)
                // release the token
                <-tokens
            }(r)
        }
    }
}
You create a channel of capacity 5 and use it to implement a token system. The channel is created using make:
tokens := make(chan bool, 5)
The above statement creates a "buffered channel": a channel with a capacity of 5 that can store only values of type "bool". If a buffered channel is full, writes to it will block, and if a channel is empty, reads from it will block. This property allows you to implement your token system.
Before you can spawn a goroutine, you write a boolean value, true, into the channel ("taking" a token) and then read it back once you are done with it ("releasing" the token). If the channel is full, it means the maximum number of goroutines are already running and, hence, your attempt to write will block and a new goroutine will not be spawned. The write operation is performed via:
tokens <- true
After the control is returned from the backUp() function, you read a value from the channel and, hence, release the token:
<-tokens
The above mechanism ensures that you never have more than five goroutines running simultaneously, and each goroutine releases its token before it exits so that the next goroutine may run. The file, listing5.go in the GitHub repository mentioned at the end of the article uses the runtime package to print the number of goroutines running using this mechanism, essentially allowing you to verify your implementation.
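One way to check the limit yourself is to count how many goroutines are inside the guarded section at the same time. The sketch below is not the listing5.go file mentioned above (which uses the runtime package); it is an independent illustration that uses sync/atomic counters, and the function name peakConcurrency and the 5 ms sleep standing in for real work are assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakConcurrency runs jobs tasks with at most limit running at once
// and returns the highest number observed running simultaneously.
func peakConcurrency(jobs, limit int) int64 {
	var wg sync.WaitGroup
	var running, peak int64
	tokens := make(chan bool, limit)
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		tokens <- true // blocks while limit goroutines hold tokens
		go func() {
			defer wg.Done()
			n := atomic.AddInt64(&running, 1)
			// Record a new peak if this goroutine pushed the count higher.
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			time.Sleep(5 * time.Millisecond) // stand-in for real work
			atomic.AddInt64(&running, -1)
			<-tokens // release the token only after leaving the counted section
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak concurrent goroutines:", peakConcurrency(20, 5))
}
```

Because the counter is decremented before the token is released, the observed peak can never exceed the channel's capacity.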
gitbackup—Backing Up GitHub and GitLab Repositories
In the example programs so far, I haven't explored using any third-party packages. Although Golang's built-in tools completely support having an application using third-party repositories, you'll use a tool called gb for developing your "gitbackup" project. One main reason I like gb is how easy it makes fetching and updating third-party dependencies via its "vendor" plugin. It also does away with the need to have your go application in your GOPATH, a requirement that the built-in go tools assume.
Next, you'll fetch and build gb:
$ go get github.com/constabulary/gb/...
The compiled binary gb is placed in the directory $GOPATH/bin. Add $GOPATH/bin to the $PATH environment variable, start a new shell session and type in gb:
$ gb
gb, a project based build tool for the Go programming language.
Usage:
gb command [arguments]
..
Next, install the gb-vendor plugin:
$ go get github.com/constabulary/gb/cmd/gb-vendor
gb works on the notion of projects. A project has an "src" subdirectory inside it, with one or more packages in their own sub-directories. Clone the "gitbackup" project from https://github.com/amitsaha/gitbackup and you will notice the following directory structure:
$ tree -L 1 gitbackup
gitbackup
|--src
|  |--gitbackup
|     |--main.go
|     |--main_test.go
|     |--remote.go
..
The "gitbackup" application is composed of only a single package, "gitbackup", and it has two program files and unit tests. Let's take a look at the remote.go file first. Right at the beginning, you import third-party packages in addition to a few from the standard library:
- github.com/google/go-github/github: this is the Golang interface to the GitHub API.
- golang.org/x/oauth2: used to send authenticated requests to the GitHub API.
- github.com/xanzy/go-gitlab: Golang interface to the GitLab API.
You define a struct of type Response, which matches the Response structure implemented by both the GitHub and GitLab libraries above. The struct Repository describes each repository that you fetch from either GitLab or GitHub. It has two string fields: GitURL, representing the git clone URL of the repository, and Name, the name of the repository.
The NewClient() function accepts the service name (github or gitlab) as a parameter and returns the corresponding client, which then will be used to interface with the service. The return type of this function is interface{}, a special Golang type indicating that this function can return a value of any type. Depending on the service name specified, it either will be of type *github.Client or *gitlab.Client. If a different service name is specified, it will return nil. To be able to fetch your list of repositories before you can back them up, you will need to specify an access token via an environment variable.
The token for GitLab is specified via the GITLAB_TOKEN environment variable and for GitHub via the GITHUB_TOKEN environment variable. In this function, you check if the correct environment variable has been specified using the Getenv() function from the os package. The function returns the value of the environment variable if specified and an empty string if the specified environment variable wasn't found. If the corresponding environment variable isn't found, you log a message and exit using the Fatal() function from the log package.
The NewClient() function is used by the getRepositories() function, which returns a slice of Repository objects obtained via an API call to the service. There are two conditional blocks in the function to account for the two supported services. The first conditional block handles repository listing for GitHub via the Repositories.List() function implemented by the github.com/google/go-github package. The first argument to this function is the GitHub user name whose repositories you want to fetch. If you leave it as an empty string, it returns the repositories of the currently authenticated user. The second argument to this function is a value of type github.RepositoryListOptions, which allows you to specify the type of repositories you want returned via the Type field. The call to the function Repositories.List() is as follows:
repos, resp, err := client.(*github.Client).Repositories.List("", &options)
Recall that the NewClient() function returns a value of type interface{}, which is an empty interface. Hence, if you attempt to make your function call as client.Repositories.List(), the compiler will complain with an error message:
# gitbackup
remote.go:70: client.Repositories undefined (type interface {} is interface with no methods)
So, you need to perform a "type assertion" through which you get access to the underlying value of client, which is either of the *github.Client or *gitlab.Client type.
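The sketch below shows a type assertion on an interface{} value without pulling in the third-party libraries. The types githubClient and gitlabClient, their methods, and the functions newStubClient and listNames are all stand-ins invented for this illustration; only the assertion syntax itself mirrors what the article describes.

```go
package main

import "fmt"

// githubClient and gitlabClient are stand-ins for *github.Client and
// *gitlab.Client so that the sketch has no third-party dependencies.
type githubClient struct{}
type gitlabClient struct{}

func (c *githubClient) ListRepos() []string    { return []string{"gitbackup"} }
func (c *gitlabClient) ListProjects() []string { return []string{"lj_gitbackup"} }

// newStubClient mirrors the shape of NewClient(): it returns interface{}.
func newStubClient(service string) interface{} {
	switch service {
	case "github":
		return &githubClient{}
	case "gitlab":
		return &gitlabClient{}
	}
	return nil
}

// listNames recovers the concrete type before calling its methods.
func listNames(client interface{}) ([]string, error) {
	// The two-value form of a type assertion reports failure via ok
	// instead of panicking.
	if gh, ok := client.(*githubClient); ok {
		return gh.ListRepos(), nil
	}
	if gl, ok := client.(*gitlabClient); ok {
		return gl.ListProjects(), nil
	}
	return nil, fmt.Errorf("unsupported client type %T", client)
}

func main() {
	names, err := listNames(newStubClient("gitlab"))
	fmt.Println(names, err)
}
```

The single-value form, client.(*github.Client), panics if the underlying type is different, so the two-value form is the safer choice when the type isn't already known.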
You query the list of repositories from the service in an infinite loop indicated by the for loop:
for {
    // This is an infinite loop
}
The function returns three values: the first is a list of repositories, the second is an object of type Response, and the third is an error value. If the function call was successful, the value of err is nil. You then iterate over each of the returned objects, create a Repository object containing two fields you care about and append it to the slice repositories. Once you have exhausted the list of repositories returned, you check the NextPage field of the resp object to check whether it is equal to 0. If it is equal to 0, you know there isn't anything else to read; you break from the loop and return from the function with the list of repositories you have so far. If you have a non-zero value, you have more repositories, so you set the Page field in the ListOptions structure to this value:
options.ListOptions.Page = resp.NextPage
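The pagination loop can be sketched without a network call. In the following illustration, the page type and the fetchPage function are hypothetical stand-ins for an API response and an API request; only the NextPage convention (0 means no more pages) is taken from the description above.

```go
package main

import "fmt"

// page is a stand-in for one API response: some items plus the number
// of the next page, where 0 means there is nothing left to fetch.
type page struct {
	Items    []string
	NextPage int
}

// fetchPage is a hypothetical fetcher backed by an in-memory table.
func fetchPage(n int) page {
	pages := map[int]page{
		1: {Items: []string{"a", "b"}, NextPage: 2},
		2: {Items: []string{"c", "d"}, NextPage: 3},
		3: {Items: []string{"e"}, NextPage: 0},
	}
	return pages[n]
}

// fetchAll keeps requesting pages until NextPage is 0.
func fetchAll() []string {
	var all []string
	current := 1
	for {
		p := fetchPage(current)
		all = append(all, p.Items...)
		if p.NextPage == 0 {
			break // nothing else to read
		}
		current = p.NextPage
	}
	return all
}

func main() {
	fmt.Println(fetchAll())
}
```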
The handler for the "gitlab" service is almost the same as the "github" service with one additional detail. "gitlab" is an open-source project, and you can have a custom installation running on your own host. You can handle it here via this code:
if len(gitlabUrl) != 0 {
    gitlabUrlPath, err := url.Parse(gitlabUrl)
    if err != nil {
        // Fatalf is needed for the %s verb to be expanded
        log.Fatalf("Invalid gitlab URL: %s", gitlabUrl)
    }
    gitlabUrlPath.Path = path.Join(gitlabUrlPath.Path, "api/v3")
    client.(*gitlab.Client).SetBaseURL(gitlabUrlPath.String())
}
If the value in gitlabUrl is a non-empty string, you assume that you need to query the GitLab hosted at this URL. You attempt to parse it first using the Parse() function from the "url" package and exit with an error message if the parsing fails. The GitLab API lives at <DNS of gitlab installation>/api/v3, so you update the Path field of the parsed URL and then call the function SetBaseURL() of the gitlab.Client to set this as the base URL.
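The URL handling on its own looks like the sketch below. The function name apiBaseURL is hypothetical, and the sketch assumes the address is given with a scheme such as https://; it does not call SetBaseURL(), which belongs to the third-party client.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// apiBaseURL appends the API path to the address of a GitLab
// installation (hypothetical helper; assumes the input has a scheme).
func apiBaseURL(gitlabURL string) (string, error) {
	u, err := url.Parse(gitlabURL)
	if err != nil {
		return "", fmt.Errorf("invalid gitlab URL %q: %v", gitlabURL, err)
	}
	// path.Join cleans the result, so a trailing slash in the input is harmless.
	u.Path = path.Join(u.Path, "api/v3")
	return u.String(), nil
}

func main() {
	base, err := apiBaseURL("https://git.mydomain.com")
	fmt.Println(base, err)
}
```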
Next, let's look at the main.go file. First though, you should learn where "gitbackup" creates the backup of the git repositories. You can pass the location via the -backupdir flag. If not specified, it defaults to $HOME/.gitbackup. Let's refer to it as BACKUP_DIR. The repositories are backed up in BACKUP_DIR/gitlab/ or BACKUP_DIR/github. If a repository is not found in BACKUP_DIR/<service_name>/<repo>, you know you'll have to make a new clone of the repository (git clone). If the repository exists, you update it (git pull). This operation is performed in the backUp() function in main.go:
func backUp(backupDir string, repo *Repository, wg *sync.WaitGroup) {
    defer wg.Done()
    repoDir := path.Join(backupDir, repo.Name)
    _, err := os.Stat(repoDir)
    if err == nil {
        log.Printf("%s exists, updating. \n", repo.Name)
        cmd := exec.Command("git", "-C", repoDir, "pull")
        err = cmd.Run()
        if err != nil {
            log.Printf("Error pulling %s: %v\n", repo.GitURL, err)
        }
    } else {
        log.Printf("Cloning %s \n", repo.Name)
        cmd := exec.Command("git", "clone", repo.GitURL, repoDir)
        err := cmd.Run()
        if err != nil {
            log.Printf("Error cloning %s: %v", repo.Name, err)
        }
    }
}
The function takes three arguments: the first is a string that points to the location of the backup directory, followed by a reference to a Repository object and a reference to a WaitGroup. You set up a deferred call to Done() on the WaitGroup. The next two lines then check whether the repository already exists in the backup directory using the Stat() function in the os package. This function will return a nil error value if the directory exists, so you execute the git pull command by using the Command() function from the exec package. If the directory doesn't exist, you execute a git clone command instead.
The main() function sets up the flags for the "gitbackup" program:
- backupdir: the backup directory. If not specified, it defaults to $HOME/.gitbackup.
- github.repoType: GitHub repo types to back up; all will back up all of your repositories. Other options are owner and member.
- gitlab.projectVisibility: visibility level of GitLab projects to clone. It defaults to internal, which refers to projects that can be cloned by any logged in user. Other options are public and private.
- gitlab.url: DNS of the GitLab service. If you are creating a backup of your repositories on a custom GitLab installation, you can just specify this and ignore specifying the "service" option.
- service: the service name for the Git service from which you are backing up your repositories. Currently, it recognizes "gitlab" and "github".
In the main() function, if the backupdir is not specified, you default to use the $HOME/.gitbackup/<service_name> directory. To find the home directory, use the package github.com/mitchellh/go-homedir. In either case, you create the directory tree using the MkdirAll() function if it doesn't exist.
You then call the getRepositories() function defined in remote.go to fetch the list of repositories you want to back up. Limit the maximum number of concurrent clones to 20 by using the token system I described earlier.
Let's now build and run the project from the clone of the "gitbackup" repository you created earlier:
$ pwd
/Users/amit/work/github.com/amitsaha/gitbackup
$ gb build
..
$ ./bin/gitbackup -help
Usage of ./bin/gitbackup:
  -backupdir string
        Backup directory
  -github.repoType string
        Repo types to backup (all, owner, member) (default "all")
  -gitlab.projectVisibility string
        Visibility level of Projects to clone (default "internal")
  -gitlab.url string
        DNS of the GitLab service
  -service string
        Git Hosted Service Name (github/gitlab)
Before you can back up repositories from either GitHub or GitLab, you need to obtain an access token for each. To be able to back up a GitHub repository, obtain a GitHub personal access token from here with only the "repo" scope. For GitLab, you can get an access token from https://<location of gitlab>/profile/personal_access_tokens with the "api" scope.
The following command will back up all repositories from github:
$ GITHUB_TOKEN=my$token ./bin/gitbackup -service github
Similarly, to back up repositories from a GitLab installation to a custom location, do this:
$ GITLAB_TOKEN=my$token ./bin/gitbackup -gitlab.url git.mydomain.com -backupdir /mnt/disk/gitbackup
See the project README to learn more; I welcome improvements to it via pull requests. In the time between the writing of this article and its publication, gitbackup has changed a bit. The code discussed in this article is available in the tag https://github.com/amitsaha/gitbackup/releases/tag/lj-0.1. To learn about the changes since this tag in the current version of the repository, see my blog post.
Conclusion
I covered some key Golang features in this article and applied them to write a tool to back up repositories from GitHub and GitLab. Along the way, I explored interfaces, goroutines and channels, passing command-line arguments via flags and working with third-party packages.
The code listings discussed in the article are available here. See the Resources section to learn more about Golang, GitHub and the GitLab API.
Resources
Getting Started with Golang and gb: https://bit.ly/2lKEJJm
Golang by Example: https://gobyexample.com
Golang Type Assertions: https://golang.org/doc/effective_go.html#interface_conversions
GitHub Repos API: https://developer.github.com/v3/repos
GitLab Projects API: https://docs.gitlab.com/ce/api/projects.html
Golang Interface for GitHub: https://github.com/google/go-github
Golang Interface for GitLab: https://github.com/xanzy/go-gitlab
gb: https://getgb.io