UnmarshalBSONValue Help

Topher_Sterling · June 17, 2022, 8:30pm

We have a field in mongo which is stored in old data as a string and new data as a uint, so I attempted to support that pattern by using UnmarshalBSONValue as such:

type MachineScoreType uint8
func (m *MachineScoreType) UnmarshalBSONValue(t bsontype.Type, b []byte) (err error) {
	if len(b) == 0 {
		*m = NOT_SET
		return nil
	}
	if len(b) == 5 && b[0] == 1 && b[1] == 0 && b[2] == 0 && b[3] == 0 {
		// Empty string case - bizarre what mongo passes us, but this is the empty string
		*m = NOT_SET
		return nil
	}
	if (b[0] == 0x0003 || b[0] == 0x000a || b[0] == 0x000b || b[0] == 0x000c || b[0] == 0x000d || b[0] == 0x000e) && len(b) > 4 {
		// If this is a string machine score, it will be preceded by a few NUL/whitespace chars
		// e.g. ' offTopicScore'
		// It also has a trailing space
		// bytes hex before trimming: 0E0000006F6666546F70696353636F726500
		s := string(b[4 : len(b)-1])
		*m = MachineScoreTypeFromString(s)
	} else {
		// Otherwise, this is a uint8
		*m = MachineScoreType(binary.LittleEndian.Uint16(b))
	}

	// Now we should have a uint8 - check to see if it's valid
	if !m.IsValid() {
		*m = INVALID_SCORE
		return errors.WithStack(fmt.Errorf("'%s' (%X) is not a valid score type. valid choices: [%s]", string(b), b, strings.Join(machineScoreStrings, ", ")))
	}
	return nil
}

This works, but the fact that there is a range of control characters before a string (e.g. it could be 0x000a-0x000e) is confusing. With UnmarshalJSON, a string is prefixed and suffixed with a ".

First question - can you help me understand why there are different NUL/control chars before a string?
Second and most important question - how should I be doing this in a more standard way? Decoding either as a uint8 or a string? (note - I realize mongo stores uint16 rather than uint8, but just want to be clear about what we’re ultimately trying to get from the function.)

Matt_Dale · June 18, 2022, 1:17am

Hey @Topher_Sterling thanks for the question! The extra bytes you’re seeing are the BSON binary encoding for a string. Here’s how the BSON specification describes the string format:

string ::= int32 (byte*) "\x00" String - The int32 is the number bytes in the (byte*) + 1 (for the trailing '\x00'). The (byte*) is zero or more UTF-8 encoded characters.

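The first 4 bytes are a little-endian 32-bit integer that gives the length of the string in bytes, including the terminator. Bytes 4 through N-1 are the actual string bytes, and the last byte (index N) is always the string terminating character \x00. In your example, 0E000000 is the length 14: the 13 bytes of "offTopicScore" plus the terminator. That is also why the first byte looked like a different control character each time: it is the low byte of the length, so it changes with the length of the string.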

purpose: length   string     terminator
bytes:   [0 ... 3][4 ... N-1][N]
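
As a rough sketch of that layout in plain Go, using only the standard library: decodeBSONString is a hypothetical helper name, not something the driver provides, and checking the length prefix against the slice length is a defensive choice made here for illustration rather than something the driver requires of you.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
)

// decodeBSONString is a hypothetical helper (not part of the Go driver).
// It reads a BSON string value: int32 length, string bytes, trailing 0x00.
func decodeBSONString(b []byte) (string, error) {
	// The smallest valid value is the empty string: 01 00 00 00 00.
	if len(b) < 5 {
		return "", errors.New("BSON string needs at least 5 bytes")
	}
	// The length prefix is little-endian and counts the terminator.
	n := int32(binary.LittleEndian.Uint32(b[:4]))
	if n < 1 || int64(n) != int64(len(b)-4) {
		return "", fmt.Errorf("length prefix %d does not match %d payload bytes", n, len(b)-4)
	}
	if b[len(b)-1] != 0x00 {
		return "", errors.New("BSON string is missing its 0x00 terminator")
	}
	// Drop the 4 length bytes and the terminator.
	return string(b[4 : len(b)-1]), nil
}

func main() {
	// Bytes taken from the question: 0E000000 is the length (14).
	raw, err := hex.DecodeString("0E0000006F6666546F70696353636F726500")
	if err != nil {
		panic(err)
	}
	s, err := decodeBSONString(raw)
	fmt.Printf("%q %v\n", s, err)
}
```

The empty string case from the question (01 00 00 00 00) goes through the same path: the length is 1, which covers only the terminator, so the decoded string is "".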

A more robust implementation of UnmarshalBSONValue should switch logic depending on the bsontype.Type parameter. For BSON strings, it’s probably safe to ignore the length bytes because bson.Unmarshal already trims the input bytes when calling UnmarshalBSONValue to only the bytes in the field being decoded (i.e. you already know the length). For BSON integer values, BSON only supports 32-bit and 64-bit signed integers. A Go uint8 value is encoded as a BSON Int32, so we should read the full 4-byte signed integer and make sure it can safely be converted to a uint8.

For example:

func (m *MachineScoreType) UnmarshalBSONValue(t bsontype.Type, b []byte) (err error) {
	if len(b) == 0 {
		*m = NOT_SET
		return nil
	}

	switch t {
	case bsontype.String:
		// Ignore the first 4 bytes because they specify the length of the
		// string field, which has already been trimmed for us. Ignore the
		// last byte because it is always the terminating character '\x00'.
		s := string(b[4 : len(b)-1])
		if s == "" {
			*m = NOT_SET
			return nil
		}
		*m = MachineScoreTypeFromString(s)
	case bsontype.Int32:
		// Go uint8 values are encoded as BSON Int32 by default.
		// Decode the value as an int32 and make sure it can fit in
		// a uint8.
		i := int32(binary.LittleEndian.Uint32(b))
		if i < 0 || i > math.MaxUint8 {
			*m = INVALID_SCORE
			return fmt.Errorf("%d cannot be stored as a uint8 value and is not a valid score", i)
		}
		*m = MachineScoreType(i)
	default:
		*m = INVALID_SCORE
		return fmt.Errorf("unsupported BSON type %v", t)
	}

	// Now we should have a uint8 - check to see if it's valid
	if !m.IsValid() {
		*m = INVALID_SCORE
		return errors.WithStack(fmt.Errorf("'%s' (%X) is not a valid score type. valid choices: [%s]", string(b), b, strings.Join(machineScoreStrings, ", ")))
	}
	return nil
}
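
If you want to exercise the integer range check on its own, here is a minimal standalone sketch using only the standard library. int32ToUint8 is a hypothetical helper name, not part of the driver, and the explicit 4-byte length guard is an extra precaution added for this sketch.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// int32ToUint8 is a hypothetical helper (not part of the Go driver).
// It decodes a BSON Int32 value and rejects anything outside 0..255.
func int32ToUint8(b []byte) (uint8, error) {
	// A BSON Int32 is exactly 4 little-endian bytes.
	if len(b) != 4 {
		return 0, fmt.Errorf("BSON Int32 needs 4 bytes, got %d", len(b))
	}
	// Reinterpret the unsigned read as signed so negatives are detected.
	i := int32(binary.LittleEndian.Uint32(b))
	if i < 0 || i > math.MaxUint8 {
		return 0, fmt.Errorf("%d does not fit in a uint8", i)
	}
	return uint8(i), nil
}

func main() {
	for _, raw := range [][]byte{
		{0x07, 0x00, 0x00, 0x00}, // 7
		{0x00, 0x01, 0x00, 0x00}, // 256, too large
		{0xFF, 0xFF, 0xFF, 0xFF}, // -1, negative
	} {
		v, err := int32ToUint8(raw)
		fmt.Println(v, err)
	}
}
```

Reading all 4 bytes matters here: reading only 2 bytes with Uint16 and converting to a uint8, as in the original question, would silently truncate a stored 256 to 0 instead of reporting an error.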

Topher_Sterling · June 20, 2022, 9:05pm

Beautiful. This was excellent. I hadn’t thought to look at the bson spec for the offset bytes. I had just figured that the value that came in was the actual value already trimmed and I just missed it.

Thank you for both the pointer to the spec and for explaining it. You are appreciated.
