Parsing strings with Go
Parsing strings is arguably one of the more basic operations in any language. Sometimes this means comma-separated values from user input on a web page, sometimes it means parsing application arguments from os.Args, and sometimes it means parsing line-based input, like lines from IRC, Slack, Discord or some other chat system, for a bot.
A discussion developed on the #golang Discord channel, where a person nicknamed #JavaScriptForEverything (Omar H.) asked if the function on this playground link could be made better. The function parses a slice of strings in a way where the strings will be joined back together if they are quoted.
Input: []string{".hi", "\"My", "name", "is", "Omar\"", "\"123\""}
Output: []string{".hi", "My name is Omar", "\"123\""}
Coming from a PHP background, I'm used to string primitives like explode, implode and trim. As it happens, Go has the excellent strings package in the standard library, which provides us with the equivalents: strings.Split, strings.Join and strings.TrimSpace. Maybe matching the quoted strings would be a good job for the regexp package, and I could use the other functions to split the non-quoted strings.
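Just to illustrate the mapping between those PHP primitives and their Go equivalents, here's a quick runnable sketch (the inputs are made up for the example):

package main

import (
	"fmt"
	"strings"
)

func main() {
	parts := strings.Split("a,b,c", ",") // PHP: explode(",", "a,b,c")
	line := strings.Join(parts, "-")     // PHP: implode("-", $parts)
	clean := strings.TrimSpace(" hi ")   // PHP: trim(" hi ")
	fmt.Println(parts, line, clean)      // prints: [a b c] a-b-c hi
}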
Naively, I came up with this version:
func GetStringFromQuotes(parts []string) []string {
	var newParts []string
	re := regexp.MustCompile(`("[^"]+")`)
	str := strings.Join(parts, " ")
	// Find the start/end offsets of every quoted match.
	indexes := re.FindAllStringSubmatchIndex(str, -1)
	if len(indexes) == 0 {
		return parts
	}
	// Words before the first quoted match are split by space.
	if indexes[0][0] > 0 {
		words := strings.Split(str[:indexes[0][0]-1], " ")
		newParts = append(newParts, words...)
	}
	for key, index := range indexes {
		// Words between the previous match and this one are split by space.
		if key > 0 {
			end := indexes[key-1][1]
			if end+1 < index[0]-1 {
				words := strings.Split(str[end+1:index[0]-1], " ")
				newParts = append(newParts, words...)
			}
		}
		// The quoted match itself is appended without the quotes.
		newParts = append(newParts, str[index[0]+1:index[1]-1])
	}
	// Words after the last quoted match are split by space.
	if len(str) > indexes[len(indexes)-1][1] {
		words := strings.Split(str[indexes[len(indexes)-1][1]+1:], " ")
		newParts = append(newParts, words...)
	}
	return newParts
}
I join the slice back into a full string, before performing a regex match for the quoted strings inside. The regular expression ("[^"]+") finds a match of a string between two quotes, including the quotes themselves. The rest of the code deals with splitting the non-quoted strings by space.
For example, this part of the code finds all the matches for the quoted strings, and returns the indexes of where they are located in the original string:
indexes := re.FindAllStringSubmatchIndex(str, -1)
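To make those indexes concrete, here is what the call returns for the sample input from the top of the post. Each entry holds the start/end offsets of the full match, followed by the start/end offsets of the capture group:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`("[^"]+")`)
	str := `.hi "My name is Omar" "123"`
	fmt.Println(re.FindAllStringSubmatchIndex(str, -1))
	// prints: [[4 21 4 21] [22 27 22 27]]
}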
And immediately after, we check if the first match is at the beginning of the string. If it starts later in the string, this means we have one or more words before the first quoted string, which need to be split by space.
if indexes[0][0] > 0 {
	words := strings.Split(str[:indexes[0][0]-1], " ")
	newParts = append(newParts, words...)
}
This was a very naive implementation, with an obvious pitfall: it is much slower and less readable than the original version. Luckily for us, we can improve it by extending the regular expression and using the more appropriate FindAllString function:
func GetStringFromQuotes(parts []string) []string {
	re := regexp.MustCompile(`"[^"]+"|\S+`)
	str := strings.Join(parts, " ")
	return re.FindAllString(str, -1)
}
This is closer to what I had in mind when I was thinking of regular expressions. The function is much improved in readability, but care should be taken when calling it very frequently. Without caching the result of regexp.MustCompile, this function is about 20x slower than the original, and with caching it is still about 4x slower.
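Caching here simply means compiling the expression once, at package level, rather than on every call. A minimal sketch (the variable name is my own):

// Compiled once, when the package is initialized.
var quotedOrWord = regexp.MustCompile(`"[^"]+"|\S+`)

func GetStringFromQuotes(parts []string) []string {
	return quotedOrWord.FindAllString(strings.Join(parts, " "), -1)
}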
Non-regexp version
Obviously, if one doesn't want the performance penalty of regular expressions, there must be a better way to parse these strings than the initial example. There sure is! The strings package exposes strings.FieldsFunc for exactly such purposes:
FieldsFunc splits the string s at each run of Unicode code points c satisfying f(c) and returns an array of slices of s. If all code points in s satisfy f(c) or the string is empty, an empty slice is returned. FieldsFunc makes no guarantees about the order in which it calls f(c). If f does not return consistent results for a given c, FieldsFunc may crash.
This is a somewhat technical way of saying that it will check each character in the string, and that you decide at which characters the string should be split into slices. We want to start a new slice item every time we find a space, unless we're inside a quoted string.
func GetStringFromQuotes(parts []string) []string {
	str := strings.Join(parts, " ")
	inQuote := false
	f := func(c rune) bool {
		switch {
		case c == '"':
			inQuote = !inQuote
			return false
		case inQuote:
			return false
		default:
			return c == ' '
		}
	}
	return strings.FieldsFunc(str, f)
}
This playground link has the working code if you want to take it for a spin. The readability is much improved over the original, but we introduced a critical issue: our closure keeps state in inQuote, so it doesn't return consistent results for a given character, which is exactly the case the documentation warns about. The FieldsFunc function may panic if, for example, you pass a string which isn't quoted properly. I'm not a fan of handling panics, especially when you can write a bit of code to avoid the panic completely. So I wrote my own version of FieldsFunc that would still produce a slice, but wouldn't panic. Obviously, an improperly quoted input will still not be parsed correctly, but handling that is a job for a different part of the code.
func MyFieldsFunc(str string, f func(c rune) bool) []string {
	r := []string{}
	part := ""
	for _, c := range str {
		if f(c) {
			// Only append non-empty fields, like strings.FieldsFunc does.
			if len(part) > 0 {
				r = append(r, part)
			}
			part = ""
		} else {
			part = part + string(c)
		}
	}
	if len(part) > 0 {
		r = append(r, part)
	}
	return r
}
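Wiring it into the parser is then just a matter of swapping the strings.FieldsFunc call in the earlier version for our own (a sketch, reusing the same closure):

func GetStringFromQuotes(parts []string) []string {
	str := strings.Join(parts, " ")
	inQuote := false
	f := func(c rune) bool {
		switch {
		case c == '"':
			inQuote = !inQuote
			return false
		case inQuote:
			return false
		default:
			return c == ' '
		}
	}
	// Same closure as before, but with the panic-free variant.
	return MyFieldsFunc(str, f)
}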
In MyFieldsFunc, I loop through all the runes in str, appending to the result slice as new substrings are detected. I'm pretty sure that this function is terribly inefficient, as it allocates memory with the append calls (possibly even over-allocating). In fact:
BenchmarkGetStringFromQuotesShort-4     1000000    1443 ns/op
BenchmarkGetStringFromQuotesLong-4      1000000    1588 ns/op
BenchmarkGetStringFromQuotes2Short-4     300000    3839 ns/op
BenchmarkGetStringFromQuotes2Long-4      200000    8305 ns/op
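These numbers come from standard go test -bench benchmarks; a minimal harness of that kind would look something like this (a sketch with an illustrative input and package name; the real benchmarks are in Omar's gist mentioned below):

package parse

import (
	"strings"
	"testing"
)

func BenchmarkGetStringFromQuotesShort(b *testing.B) {
	input := strings.Split(`.hi "My name is Omar" "123"`, " ")
	for i := 0; i < b.N; i++ {
		GetStringFromQuotes(input)
	}
}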
It runs about 2.6x-5.2x slower than the original version. If you want to look at the benchmarks, Omar put a version in a gist that you can inspect. The original version is faster mostly because it's already working on a slice of individual words, checking only two characters of each. My last version is checking each character, so I'm dealing with O(len(line)) complexity, while the original version has a complexity of O(2*len(words)). Using bytes.Buffer has a negligible impact on performance, so we can rewrite the code a bit for clarity if needed.
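For illustration, a bytes.Buffer variant of MyFieldsFunc might look like this (a sketch; the function name is my own):

func MyFieldsFuncBuffer(str string, f func(c rune) bool) []string {
	r := []string{}
	var part bytes.Buffer
	for _, c := range str {
		if f(c) {
			if part.Len() > 0 {
				r = append(r, part.String())
				part.Reset()
			}
		} else {
			// Accumulate the rune in a buffer instead of
			// concatenating strings.
			part.WriteRune(c)
		}
	}
	if part.Len() > 0 {
		r = append(r, part.String())
	}
	return r
}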
So, is there really a take-away from all of this? I think the best thing I learned is that there are many ways to perform string processing in Go, and you should think about which approach is the most appropriate for your use case. While you might like the more functional approach of PHP, or the object-oriented approach of Java, Go holds its own quite well, with a strong standard library dedicated to this task.