The Practice of Golang in Jike's Backend

sorcererxw•March 1, 2021

Background¶

As the business evolved, Jike's backend services accumulated a large amount of outdated code, driving up maintenance costs, so refactoring or even rewriting was put on the agenda. Compared to Node.js, Golang has certain advantages. Jike's backend had already been split into well-defined services, and other business lines already had some experience with Go, so rewriting some Jike services directly in Go was a feasible choice. Along the way, we could compare the two languages on the same business logic and improve the supporting infrastructure for Go.

Renovation Results¶

Up to now, the Jike recommendation stream and user filtering services have been rewritten and launched using Go. Compared to the original services, the overhead of the new version has been significantly reduced:

Interface response time reduced by 50%
[Figure: response time of the old service interface]
[Figure: response time of the new service interface]
Memory usage reduced by 95%
CPU usage reduced by 90%

Note: The performance data above comes from the user filtering service, a read-heavy service with a single responsibility. Some optimizations to the original implementation were also made during the rewrite, so the data is for reference only and does not fully represent a real performance comparison between Go and Node.

Renovation Plan¶

Step One: Rewrite the Service

While ensuring that the external interface remains unchanged, it is necessary to rewrite the entire core business logic. However, during the rewriting process, some problems were encountered:

Since the previous Node services mostly did not explicitly declare the input and output types of the interface, it was necessary to find all relevant fields when rewriting.
Since the majority of the previous code did not include unit tests, it was necessary to understand the business requirements and design unit tests after rewriting.
The old code heavily used the any type, and it took some effort to clarify all possible types. Many types in Node don't need to be very strict, but in Go, there is no room for deviation.

In summary, rewriting is not translation: it requires a deep understanding of the business and a fresh implementation of the code.

Step Two: Correctness Verification

Since many services do not have complete regression tests, relying solely on unit tests is far from sufficient to ensure correctness.

Generally speaking, the correctness of read-only interfaces can be verified by data comparison, that is, comparing the output of the new and old services with the same input. For small-scale datasets, tests can be conducted by launching two services locally. However, once the data scale is large enough, it is impossible to fully test locally, and one solution is traffic replication testing.

Due to the complexity and performance impact of cross-environment calls between services, we use message queues to replicate requests for asynchronous matching.

For each response, the original service packages the input and output into a message and sends it to the message queue.
In the testing environment, a consumer service receives the messages and resends each input to the new version of the service.
After the new service responds, the consumer compares the old and new responses and logs any difference.
Finally, we download the logs and fix the code case by case according to the recorded data.

Step Three: Gradually Replace the Old Service through Grayscale Deployment

Once we are confident in the business correctness, we can roll out the new version gradually. Thanks to the service decomposition, the replacement is transparent to upstream and downstream callers: we simply swap the old containers for new ones step by step.

Engineering Practice¶

Repository Structure¶

The project structure is a monorepo based on Standard Go Project Layout:

.
├── build: build-related files; can be symlinked to an external location
├── tools: project-specific tools
├── pkg: shared code
│   ├── util
│   └── ...
├── app: microservices directory
│   ├── hello: example service
│   │   ├── cmd
│   │   │   ├── api
│   │   │   │   └── main.go
│   │   │   ├── cronjob
│   │   │   │   └── main.go
│   │   │   └── consumer
│   │   │       └── main.go
│   │   ├── internal: all business code lives under internal so that other services cannot import it
│   │   │   ├── config
│   │   │   ├── controller
│   │   │   ├── service
│   │   │   └── dao
│   │   └── Dockerfile
│   ├── user: example of a large business domain split into multiple sub-services
│   │   ├── internal: code shared between the sub-services
│   │   ├── account: account service
│   │   │   ├── main.go
│   │   │   └── Dockerfile
│   │   └── profile: user profile page service
│   │       ├── main.go
│   │       └── Dockerfile
│   └── ...
├── .drone.yml
├── .golangci.yaml
├── go.mod
└── go.sum

The app directory contains all service code, and the hierarchy can be freely divided.
All shared code for the services is placed in the pkg in the root directory.
All external dependencies are declared in the go.mod in the root directory.
Each service or group of services keeps its code under an internal directory, which Go's import rules make inaccessible to other services.

The benefits brought by this pattern:

During development, you only need to focus on a single code repository, which improves development efficiency.
All service code lives together, from a large set of services implementing a whole feature down to a small service for a single operational campaign. With a reasonable hierarchy, all of it can be maintained clearly under the app directory.
When shared code is modified, compatibility with every service that depends on it can be checked immediately. Even when a change is incompatible, the refactoring features of the IDE make it easy to update all callers.

Continuous Integration and Build¶

Static Check

The project uses golangci-lint for static checks. Each time the code is pushed, Github Action will automatically run golangci-lint, which is very fast and convenient. If an error occurs, the warning will be directly commented on the PR.

golangci-lint does not define lint rules of its own; instead, it integrates a wide range of linters to provide very detailed static checks, nipping potential errors in the bud.

Test + Build Image

We use Drone for testing and building images. We did try building on GitHub Actions, whose matrix feature supports a monorepo well. However, building images is relatively time-consuming and consumes a large share of the GitHub Actions quota; once the quota is used up, normal development work is affected.

We finally settled on Drone; through the Drone Configuration Extension, we can also customize complex build strategies. Ideally, the CI system's build strategy would be smart enough to work out automatically which code needs to be built and which needs to be tested. Early in development I thought so too: a script analyzed the whole project's dependency topology, combined it with the changed files to find every affected package, and then ran tests and builds for them. It looks ideal, but in reality, once shared code is modified, almost every service gets rebuilt, which is a nightmare. This approach is probably better suited to deciding which unit tests to run than to packaging.

So I have now chosen a more straightforward strategy that uses the Dockerfile as the build marker: if a directory contains a Dockerfile, the directory is "buildable"; once any file under that directory changes (is added or modified), its Dockerfile is "to be built". Drone starts one pipeline for each Dockerfile that is to be built.
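
A minimal sketch of this rule, assuming the list of buildable directories and the list of changed files have already been collected (for example from the repository tree and the commit diff). The function name `dirsToBuild` is hypothetical and is not part of Drone or its configuration extension.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// dirsToBuild returns the buildable directories (those containing a Dockerfile)
// that have at least one changed file underneath them.
func dirsToBuild(buildable, changed []string) []string {
	seen := map[string]bool{}
	for _, file := range changed {
		for _, dir := range buildable {
			// The trailing slash keeps "app/hello2/..." from matching "app/hello".
			if strings.HasPrefix(file, dir+"/") {
				seen[dir] = true
			}
		}
	}
	out := make([]string, 0, len(seen))
	for dir := range seen {
		out = append(out, dir)
	}
	sort.Strings(out)
	return out
}

func main() {
	buildable := []string{"app/hello", "app/user/account", "app/user/profile"}
	changed := []string{
		"app/hello/internal/service/greet.go",
		"app/user/internal/model.go", // shared code: no Dockerfile directory contains it
		"pkg/util/strings.go",
	}
	fmt.Println(dirsToBuild(buildable, changed)) // [app/hello]
}
```

Note that changes to shared code outside a service directory select nothing here, which is exactly the limitation of this strategy.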

There are a few points worth noting:

During the build process, not only the code of the current service needs to be copied, but also the shared code. Therefore, the context directory needs to be set to the root directory during the build, and the service directory is passed in as a parameter for easy construction:
```
docker build --file app/hello/Dockerfile --build-arg TARGET="./app/hello" .
```
By default, the image name is the concatenation of the folder names from innermost to outermost; for example, building ./app/live/gift/Dockerfile produces an image of the form {registry}/gift-live-app:{branch}-{commitId}.
All build steps (including downloading dependencies and compiling) are defined in the Dockerfile, which keeps logic out of the main CI pipeline and preserves flexibility. Docker's own caching mechanism also makes builds extremely fast.
One issue is that once the shared code outside the service directory changes, Drone cannot perceive and build the affected services. The solution is to add a specific field in the git commit message to inform Drone to execute the corresponding build.
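
The default naming rule mentioned above can be illustrated with a short sketch. The helper `imageName` and the registry, branch, and commit values are made up for the example; only the naming scheme itself comes from the text.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// imageName builds "{registry}/{folders from innermost to outermost}:{branch}-{commit}"
// from the path of a Dockerfile inside the repository.
func imageName(registry, dockerfile, branch, commit string) string {
	dir := path.Dir(path.Clean(dockerfile)) // "./app/live/gift/Dockerfile" -> "app/live/gift"
	parts := strings.Split(dir, "/")
	// Reverse the folder names in place.
	for i, j := 0, len(parts)-1; i < j; i, j = i+1, j-1 {
		parts[i], parts[j] = parts[j], parts[i]
	}
	return fmt.Sprintf("%s/%s:%s-%s", registry, strings.Join(parts, "-"), branch, commit)
}

func main() {
	fmt.Println(imageName("registry.example.com", "./app/live/gift/Dockerfile", "main", "abc123"))
	// registry.example.com/gift-live-app:main-abc123
}
```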

Configuration Management¶

In Node projects, we usually use node-config to define different settings for different environments. There is no ready-made tool in the Go ecosystem that does exactly the same job, but we can take the opportunity to drop that approach altogether.

As advocated by the Twelve-Factor Principles, we should configure services through environment variables as much as possible, rather than multiple different configuration files. In fact, in Node projects, apart from the local development environment, we often dynamically configure through environment variables, with most of the test.json/beta.json directly referencing production.json.

We divide the configuration into two parts:

Single configuration file
We define a complete set of configurations in the form of a file within the service, serving as the basic configuration, which can be used during local development.
Dynamic Environment Variables
After the service is deployed online, on the basis of the basic configuration, we inject environment variables into the configuration.

We can write a config.toml file (choose any configuration format you like) in the service directory, and write the basic configuration for use during local development.

# config.toml
port=3000
sentryDsn="https://[email protected]"

[mongodb]
url="mongodb://localhost:27017"
database="db"

When running online, we also need to inject environment variables into the configuration. We can use Netflix/go-env to inject environment variables into the configuration data structure:

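import (
	_ "embed" // enables the //go:embed directive below

	// Assumed import paths; the calls below match these two libraries.
	"github.com/BurntSushi/toml"
	env "github.com/Netflix/go-env"
)

type MongoDBConfig struct {
	// MONGO_URL or MONGO_URL_ACCOUNT, when set, overrides the value from the file.
	URL      string `toml:"url" env:"MONGO_URL,MONGO_URL_ACCOUNT"`
	Database string `toml:"database"`
}

type Config struct {
	Port      int            `toml:"port" env:"PORT,default=3000"`
	SentryDSN string         `toml:"sentryDsn"`
	MongoDB   *MongoDBConfig `toml:"mongodb"`
}

// configToml holds the contents of config.toml, embedded at compile time.
//
//go:embed config.toml
var configToml string

func ParseConfig() (*Config, error) {
	var cfg Config
	// Step 1: load the base configuration from the embedded file.
	if _, err := toml.Decode(configToml, &cfg); err != nil {
		return nil, err
	}
	// Step 2: overlay values from environment variables.
	if _, err := env.UnmarshalFromEnviron(&cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}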

The code above also uses the embed feature introduced in Go 1.16, which packages any file into the final binary with a single compiler directive. Building an image then only requires copying a single executable, which reduces the complexity of build and release.

Service Invocation¶

Code Management

Jike's backend runs services in multiple languages (Node/Java/Go), and defining the same types repeatedly in each service wastes effort and invites inconsistency. Therefore, we define types with ProtoBuf, generate the corresponding code with protoc, and maintain the clients for all languages in a single repository.

.
├── go
│   ├── internal: internal implementation, e.g. the HTTP client wrapper
│   ├── service
│   │   ├── user
│   │   │   ├── api.go: interface definition and implementation
│   │   │   ├── api_mock.go: interface mock generated by gomock
│   │   │   └── user.pb.go: type file generated by protoc
│   │   ├── hello
│   │   └── ...
│   ├── go.mod
│   ├── go.sum
│   └── Makefile
├── java
├── proto
│   ├── user.proto
│   ├── hello.proto
│   └── ...
└── Makefile

Each service exposes interfaces through an independent package, and each service consists of four parts:

The interface definition
The concrete invocation code that implements the interface
A mock implementation generated by gomock from the interface definition
Type code generated from the proto files

API Design

Without relying on code generation, each interface accepts optional parameters for options such as degradation (fallback), retry, and timeout.

result, err := userservice.DefaultClient.IsBetaUser(
	context.Background(),
	[]string{"guoguo"},
	option.WithRetries(3), // retry three times
	option.WithDowngrade(func() interface{} { return map[string]bool{"guoguo": false} }), // fallback result when the call is degraded
	option.WithTimeout(3*time.Second), // timeout control; context.WithTimeout can also be used directly
)

ProtoBuf

As mentioned above, in order to reduce the cost of internal interface docking and maintenance, we chose to use ProtoBuf to define types and generate Go types. Although we use ProtoBuf for definition, the services still pass data through JSON, and data serialization and deserialization have become an issue.

To simplify conversion between ProtoBuf and JSON, Google provides a package called jsonpb. Building on the native json package, it converts between an enum's Name (string) and Value (int32), which keeps it compatible with traditional string enums, and it also supports oneof types. Go's native json can do neither: if it is used to serialize proto types, enums cannot be output as strings and oneof fields cannot be output at all.

So, does this mean we should replace every use of the native json with jsonpb in our code? Not exactly: jsonpb only supports serializing proto types:

func (jsm *Marshaler) Marshal(w io.Writer, m proto.Message) error

Unless all external read and write interfaces are defined with ProtoBuf, you cannot use jsonpb all the way.

However, every cloud has a silver lining. Go's native json defines two interfaces:

// Marshaler is the interface implemented by types that
// can marshal themselves into valid JSON.
type Marshaler interface {
	MarshalJSON() ([]byte, error)
}

// Unmarshaler is the interface implemented by types
// that can unmarshal a JSON description of themselves.
// The input can be assumed to be a valid encoding of
// a JSON value. UnmarshalJSON must copy the JSON data
// if it wishes to retain the data after returning.
//
// By convention, to approximate the behavior of Unmarshal itself,
// Unmarshalers implement UnmarshalJSON([]byte("null")) as a no-op.
type Unmarshaler interface {
	UnmarshalJSON([]byte) error
}

Any type that implements these two interfaces has its own logic invoked when it is (de)serialized, much like hook functions. So all we need is to implement these two interfaces for every proto type: when json (de)serializes such a type, it delegates to jsonpb.

func (msg *Person) MarshalJSON() ([]byte, error) {
	var buf bytes.Buffer
	err := (&jsonpb.Marshaler{
		EnumsAsInts:  false,
		EmitDefaults: false,
		OrigName:     false,
	}).Marshal(&buf, msg)
	return buf.Bytes(), err
}

func (msg *Person) UnmarshalJSON(b []byte) error {
	return (&jsonpb.Unmarshaler{
		AllowUnknownFields: true,
	}).Unmarshal(bytes.NewReader(b), msg)
}

After some searching, we finally found a protoc plugin protoc-gen-go-json: it can implement json.Marshaler and json.Unmarshaler for all types while generating proto types. In this way, there is no need to compromise for serialization compatibility, and it does not intrude on the existing code at all.

Release

As it is an independently maintained repository, it needs to be introduced into the project in the form of Go module. Thanks to the design of Go module, version release can be seamlessly integrated with Github, which is highly efficient.

Test version
go mod can pull the code of a given branch directly as a dependency, so there is no need to manually release an alpha version. Simply run go get -u github.com/iftechio/jike-sdk/go@{branch} in the caller's module directory to download the latest commit of the corresponding development branch.
Official Version
When changes are merged into the main branch, a stable version can be released simply through Github Release (or you can tag it locally with git tag), and you can pull the corresponding repository snapshot by the specific version number: go get github.com/iftechio/jike-sdk/go@{version}
Since go get essentially downloads code, and our code is hosted on GitHub, pulling dependencies may fail due to network issues when building on Alibaba Cloud in China (private modules cannot be pulled through a public goproxy). Therefore, we modified goproxy and deployed an instance within the cluster:
- Public repositories are pulled through goproxy.cn.
- Private repositories are pulled directly from GitHub via a proxy, and goproxy also handles authentication for GitHub private repositories.
We only need to run the following command to download dependencies through the internal goproxy:
GOPROXY="http://goproxy.infra:8081" GONOSUMDB="github.com/iftechio" go mod download

Context¶

Context provides a means of transmitting deadlines, caller cancellations, and other request-scoped values across API boundaries and between processes.

Context plays a special role in Go: like a bridge, it ties an entire business flow together, allowing data and signals to pass between the upstream and downstream of the call chain. Our project uses context in several ways:

Cancellation Signal

Every http request carries a context, and once the request times out or the client actively closes the connection, the outermost layer will pass a cancel signal through the context to the entire link, and all downstream calls will end immediately. If the entire link follows this specification, once the upstream closes the request, all services will cancel the current operation, which can reduce a large amount of unnecessary consumption.

During development, it is important to note:

When a task is cancelled, it usually returns the context.Canceled error so that the caller can notice and exit. However, the RPC circuit breaker also sees this error and records it as a failure. In an extreme scenario, a client that continuously initiates requests and cancels them immediately can trip the service's circuit breakers one after another, making the service unstable. The solution is to modify the circuit breaker so that it still returns such errors but does not record them as failures.

In distributed scenarios, the vast majority of data writes cannot use transactions, so we must consider whether eventual consistency still holds if an operation is cancelled halfway. For operations with high consistency requirements, the cancel signal must be proactively blocked before execution:

// DetachedContext returns a context that only forwards Value lookups:
// it keeps the data stored in ctx but ignores its cancel signal.
func DetachedContext(ctx context.Context) context.Context {
	return &detachedContext{Context: context.Background(), orig: ctx}
}

type detachedContext struct {
	context.Context // Background: never cancelled, no deadline
	orig            context.Context
}

func (c *detachedContext) Value(key interface{}) interface{} {
	return c.orig.Value(key)
}

func storeUserInfo(ctx context.Context, info interface{}) {
	// Both writes should complete even if the request is cancelled midway.
	ctx = DetachedContext(ctx)
	saveToDB(ctx, info)
	updateCache(ctx, info)
}

Context Propagation

Whenever a request comes in, the http request context carries various information about the current request, such as traceId, user information. This data can be propagated all the way through the business chain with the context. The monitoring data collected during this process will be associated with these data, facilitating the aggregation of monitoring data.

Context.Value should inform, not control.

The most important thing to note when using context to pass data is: the data in the context is only used for monitoring, and should not be used for business logic. As the saying goes, "explicit is better than implicit". Since context does not directly expose any internal data, using context to pass business data makes the program very inelegant and difficult to test. In other words, even if a function is passed an emptyCtx, it should not affect its correctness.

Error Collection¶

Errors are just values

An error in Go is an ordinary value (from the outside, it is just a string), which makes error collection harder: when collecting an error, we need to know not only the error message but also the context in which the error occurred.

Go1.13 introduced the concept of error wrapping. Through the design of Wrap/Unwrap, an error can be transformed into a singly linked list structure, where each node can store custom context information. Moreover, an error can be used as the head of the list to read all subsequent error nodes.

For a single error, the stacktrace of the error is one of the most important pieces of information. Go implements stacktrace collection through runtime.Callers:

Callers fills the slice pc with the return program counters of function invocations on the calling goroutine's stack.

As you can see, Callers can only collect the call stack within a single goroutine. If you want to collect a complete error trace, you need to include the stacktrace in the error when passing errors across goroutines. At this point, you can use errors.WithStack or errors.Wrap from the third-party library pkg/errors to achieve this. They will create a new error node and store the current call stack:

// WithStack annotates err with a stack trace at the point WithStack was called.
// If err is nil, WithStack returns nil.
func WithStack(err error) error {
	if err == nil {
		return nil
	}
	return &withStack{
		err,
		callers(),
	}
}

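func main() {
	ch := make(chan error)
	go func() {
		err := doSomething()
		// Record the stack inside the goroutine where the error occurred.
		ch <- errors.WithStack(err)
	}()
	err := <-ch
	// %+v prints the message together with the stack trace recorded by pkg/errors.
	fmt.Printf("%+v\n", err)
}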

The final error collection (often on the root web middleware) can directly use Sentry:

sentry.CaptureException(err)

Sentry will, based on the errors.Unwrap interface, extract the error from each layer. Sentry can automatically export the error stack for each layer of error. Since stacktrace is not a formal standard, Sentry has proactively adapted several mainstream Stacktrace schemes, including that of pkg/errors.

Eventually, you can view the complete error information through the Sentry backend. As shown in the figure below, each large section is an error layer, and each section contains the context information within this error.

Reference Links¶

TJ Discusses the Productivity Advantages of Go Over Node

Standard Go Project Layout

The Twelve-Factor App

Go Wiki - Module: Releasing Modules (V2 or Higher)

How to correctly use context.Context in Go 1.7

Dave Cheney - Don’t just check errors, handle them gracefully

Uber Go Coding Guidelines