A couple of days ago a colleague of mine ran into a problem: he had some objects defining a configuration that needed to be stored in the database.
He would serialize the data with the following code:
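The listing itself is not reproduced here, so the following is only a minimal sketch of a helper that behaves the way this post describes. The class name `BomSerializerSketch` and the generic `Serialize<T>` signature are assumptions for illustration, not the actual project code.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical helper: serializes a value to an XML string via a MemoryStream.
public static class BomSerializerSketch
{
    public static string Serialize<T>(T value)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var stream = new MemoryStream())
        using (var writer = new XmlTextWriter(stream, Encoding.UTF8))
        {
            serializer.Serialize(writer, value);
            writer.Flush();
            // Encoding.UTF8.GetString decodes every byte it is given,
            // including a leading BOM, which becomes the char U+FEFF.
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```

Calling `BomSerializerSketch.Serialize(new[] { "a", "b" })` should give a string whose first character is U+FEFF rather than `<`.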
And we would deserialize the string with these lines:
When the string was then read back from the database, we got an error:
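Again only a sketch under assumptions, since the listing is not shown: a reader that hands the string to `XmlSerializer` through a `StringReader`. The names `BomDeserializerSketch` and `Deserialize<T>` are hypothetical.

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical helper: deserializes an XML string back into a value.
public static class BomDeserializerSketch
{
    public static T Deserialize<T>(string xml)
    {
        var serializer = new XmlSerializer(typeof(T));
        // StringReader passes the characters through untouched,
        // so a leading U+FEFF reaches the XML parser as-is.
        using (var reader = new StringReader(xml))
        {
            return (T)serializer.Deserialize(reader);
        }
    }
}
```

With a clean string such as `"<int>42</int>"`, `BomDeserializerSketch.Deserialize<int>(...)` returns the value as expected.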
Let’s check why.
I think the exception is in the first 2 characters of the XML:
Nothing weird there… However, let’s look at the characters themselves:
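If you want to do this check without a debugger, a small sketch like the one below prints the code points of the first few characters. The helper name `LeadingCharsSketch.Describe` is made up for this example.

```csharp
using System;
using System.Linq;

// Hypothetical helper: renders the first `count` chars as U+XXXX code points.
public static class LeadingCharsSketch
{
    public static string Describe(string text, int count)
    {
        return string.Join(" ",
            text.Take(count).Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

For a string that starts with a BOM followed by `<?`, `Describe(text, 3)` gives `U+FEFF U+003C U+003F`, which makes the invisible character visible.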
What the? What’s that thing there at the front? Who put it there?
Well, according to some web documentation it’s the ZERO WIDTH NO-BREAK SPACE, U+FEFF (what’s with the screaming anyway…), also known as the BOM, the byte order mark!
So what is the BOM? Wikipedia explains it better than I can. The only thing you need to know is that it is optional.
So where is it coming from? And why is this an issue?
Let’s investigate our serializer helper. We run the code again, but with a breakpoint in the method that actually serializes the entity, and inspect the memoryStream:
As we can see, after carefully reading the Wikipedia page, these first 3 bytes (EF BB BF) are actually the UTF-8 BOM. So the issue is not our reader, it’s our writer.
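The same thing can be checked in code. This is a small sketch, with the hypothetical names `PreambleSketch`, `Utf8PreambleHex` and `StartsWithUtf8Bom`, that asks the encoding for its preamble and tests whether a byte array starts with it.

```csharp
using System;
using System.Text;

public static class PreambleSketch
{
    // The preamble is the byte sequence the encoding writes as its BOM.
    public static string Utf8PreambleHex()
    {
        return BitConverter.ToString(Encoding.UTF8.GetPreamble());
    }

    // True when the data begins with the three UTF-8 BOM bytes.
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (data.Length < bom.Length) return false;
        for (int i = 0; i < bom.Length; i++)
        {
            if (data[i] != bom[i]) return false;
        }
        return true;
    }
}
```

`Utf8PreambleHex()` returns `EF-BB-BF`, and passing `memoryStream.ToArray()` to `StartsWithUtf8Bom` tells you whether the writer emitted the BOM.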
Let’s get back to the question: why is this an issue? Why can’t our reader interpret this as just a UTF-8 string and parse it? Why do we have to do a TrimStart() manually?
It seems that the StringReader used in our code just passes the string through as a stream to our deserializer, without caring whether the string starts with a BOM or not. So we can’t do anything there except call the TrimStart() method as mentioned above.
Since we control both sides, let’s do something less nasty. The documentation mentions that the BOM is not required, and our reader doesn’t play nice with it, so we’re not going to save it at all.
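For completeness, here is a minimal sketch of that workaround on the reading side. The name `BomTrimSketch.StripLeadingBom` is hypothetical. One caveat, as far as I know: since .NET Framework 4 the parameterless `TrimStart()` no longer treats U+FEFF as white space, so the sketch passes the character explicitly.

```csharp
// Hypothetical helper: removes a leading BOM character before parsing.
public static class BomTrimSketch
{
    public static string StripLeadingBom(string xml)
    {
        // Only leading U+FEFF characters are removed; everything else stays.
        return xml.TrimStart('\uFEFF');
    }
}
```

It works, but it patches the symptom on every read instead of fixing the writer.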
Let’s check the XmlTextWriter constructor: it accepts 2 parameters, a stream and an encoding. So what’s with this encoding? Apparently, if we use Encoding.UTF8, it emits the BOM. We can avoid this by passing new UTF8Encoding(false) instead, which prevents the BOM from being emitted.
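A minimal sketch of that fix, assuming the same kind of MemoryStream-based helper as before. The names `NoBomSerializerSketch` and `Serialize<T>` are hypothetical.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical helper: same idea as before, but with a BOM-less UTF-8 encoding.
public static class NoBomSerializerSketch
{
    public static string Serialize<T>(T value)
    {
        var serializer = new XmlSerializer(typeof(T));
        // false = do not emit the UTF-8 identifier (the BOM).
        var utf8NoBom = new UTF8Encoding(false);
        using (var stream = new MemoryStream())
        using (var writer = new XmlTextWriter(stream, utf8NoBom))
        {
            serializer.Serialize(writer, value);
            writer.Flush();
            return utf8NoBom.GetString(stream.ToArray());
        }
    }
}
```

The resulting string starts directly with `<`, so the StringReader-based deserializer has nothing to trip over.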
Another way, which I find cleaner, is this one:
This doesn’t involve any encoding being manipulated / used on our side, so it uses a bit more magic, but it’s way more straightforward. Please add your experience to the comments.
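The listing for this variant is not shown here. One option that fits the description, and this is an assumption rather than necessarily the exact code from the project, is to let XmlSerializer write straight into a StringWriter, so no byte stream and no encoding are involved on our side. The names `StringWriterSerializerSketch` and `Serialize<T>` are hypothetical.

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical helper: serializes directly to a string, no bytes in between.
public static class StringWriterSerializerSketch
{
    public static string Serialize<T>(T value)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var writer = new StringWriter())
        {
            // Characters go directly into the string buffer, so there is
            // no preamble and therefore no BOM to worry about.
            serializer.Serialize(writer, value);
            return writer.ToString();
        }
    }
}
```

Note that with this approach the XML declaration reports utf-16, because a .NET string is UTF-16 internally; reading it back through a StringReader still works.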
You can find the full code on GitHub here!
Have a good one,