cover

String Interning in Go

sorcererxw•

Introduction

In Java, the JVM defaults to creating a cache pool for strings to avoid frequently creating new string objects. However, Go does not implement this layer of optimization (Go only caches strings during compilation), so when dealing with a large number of dynamic strings, it can bring significant performance loss. So, can we imitate Java to implement a string pool to locally cache hotspot strings? This is the String Interning technique.

source: https://stackabuse.com/guide-to-string-interning-in-python/

Implementation

Let's take a look at the internal structure of Go string:

package reflect

// StringHeader is the runtime representation of a string.
// It cannot be used safely or portably and its representation may
// change in a later release.
// Moreover, the Data field is not sufficient to guarantee the data
// it references will not be garbage collected, so programs must keep
// a separate, correctly typed pointer to the underlying data.
type StringHeader struct {
	Data uintptr
	Len  int
}

As we can see, a string is essentially a memory pointer plus length. Due to the immutable nature of Go strings, we can consider it safe to reuse strings. Here is a simple implementation:

type stringInterner map[string]string

func (si stringInterner) InternBytes(b []byte) string {
	if interned, ok := si[string(b)]; ok {
		return interned
	}
	s := string(b)
	si[s] = s
	return s
}

The above stringInterner checks whether the string has been interned while converting []byte to string. If so, it directly returns the old string; otherwise, it interns the new string. This function seems very counterintuitive. Compared with the direct string(b), it obviously does so many more operations, but it performs better. So, let's analyze it:

The principle of string(b) is to directly use b as the Data of the string. If the program continuously reads in new byte streams, it will continuously generate new string objects, and the space occupied by the string will grow linearly. In the function above, only []byte is used as the basis for comparison. The newly generated string(b) (allocated on the stack) will be recycled as the function ends, and only the missed strings will be cached on the heap.

Another point worth noting is that we know Go passes parameters and returns results entirely by value copying when calling functions. Why is the returned string the interned string, rather than copying a new one to return? Whether to copy data determines whether this function holds. According to the explanation of Go Internal ABI:

The string type is a sequence of a *[len]byte pointer to the string backing store, and an int giving the len of the string.

It can be known that whether in parameter passing or result returning, strings are pointed to the actual data by reference, and there will be no data copying when passing strings.

In summary, the above-mentioned stringInterning can reduce memory consumption to a certain extent when there is a lot of repeated text. However, when there is almost no repetition in the text, the execution efficiency may be slightly reduced due to the additional comparison operations.

Experiment

Now we can design a real scenario to verify the actual effect of stringInterning.

Suppose we need to deserialize a JSON data, which contains a large number of repeated strings:

[
		{"year":"2021","month":"January","day":"1"}
		{"year":"2021","month":"January","day":"2"}
		...
		{"year":"2021","month":"December","day":"31"}
]


// calendar generate JSON of 2021
func calendarJSON() []byte {
	var calendar []interface{}
	d := time.Date(2021, 0, 0, 0, 0, 0, 0, time.UTC)
	for d.Year() < 2022 {
		d = d.Add(time.Hour * 24)
		calendar = append(calendar, map[string]string{
			"year":  strconv.Itoa(d.Year()),
			"month": d.Month().String(),
			"day":   strconv.Itoa(d.Day()),
		})
	}
	ret, _ := json.Marshal(calendar)
	return ret
}

Write two functions separately:

// SimpleString wraps go string.
type SimpleString struct {
	data string
}

func (i *SimpleString) UnmarshalJSON(b []byte) error {
	i.data = string(b)
	return nil
}

func BenchmarkUnmarshal(b *testing.B) {
	raw := calendarJSON()

	var dest []struct {
		Year     *SimpleString `json:"year"`
		Calendar *SimpleString `json:"calendar"`
		Day      *SimpleString `json:"day"`
	}

	for i := 0; i < b.N; i++ {
		json.Unmarshal(raw, &dest)
	}
}

var si = make(stringInterner)

// InterningString wraps go string with string interning.
type InterningString struct {
	data string
}

func (i *InterningString) UnmarshalJSON(b []byte) error {
	i.data = si.InternBytes(b)
	return nil
}

func BenchmarkUnmarshalWithInterning(b *testing.B) {
	raw := calendarJSON()

	var dest []struct {
		Year     *InterningString `json:"year"`
		Calendar *InterningString `json:"calendar"`
		Day      *InterningString `json:"day"`
	}

	for i := 0; i < b.N; i++ {
		json.Unmarshal(raw, &dest)
	}
}

To control variables, the strings of both functions are encapsulated with structures. The only difference is that InterningString will do interning once when it is read.

go test -bench=. -benchmem
goos: darwin
goarch: amd64
pkg: github.com/sorcererxw/demo
cpu: Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz
BenchmarkUnmarshal-4                        4778            249779 ns/op            6060 B/op        800 allocs/op
BenchmarkUnmarshalWithInterning-4           4764            252346 ns/op             324 B/op          6 allocs/op

As can be seen, there is no significant difference in the running speed between the two, but the memory usage significantly decreases after using interning.

Summary

String Interning can help Go programs reduce memory usage and improve performance in specific scenarios.

At the same time, this technique has been widely used. For example, easyjson can mark intern on the json tag to indicate that this field will have a large number of repeated values, allowing easyjson to introduce String Interning when deserializing this field:

GitHub - mailru/easyjson: Fast JSON serializer for golang.
Fast JSON serializer for golang. Contribute to mailru/easyjson development by creating an account on GitHub.
https://github.com/mailru/easyjson#string-interning