While browsing through the programming newsgroup today, I came across a post from Erland Sommarskog, a short backgrounder on code pages and collations. I’ve never seen code pages described so coherently and in so few words, so I asked Erland if I could quote his text in my blog (no, Erland doesn’t blog 🙂 ). The text quoted below is included with Erland’s kind permission.
For those of you who want to know more about Erland or read some of his great deep-dive articles, check out http://www.sommarskog.se/.
“To start with, if we are to be picky, there are no ASCII characters >= 128.
There are, however, a lot of other character sets that define this area.
Way back in the 80s, vendors started to explore the area 128-255, and
just about every vendor came up with its own character set(s). The contribution
from the IBM/Microsoft combo that ran MS-DOS was a number of code
pages, of which CP437 was one of the oldest. Later, they realized that
they did not support all languages in Western Europe, and they defined
CP850 which served Western Europe better.
Meanwhile, HP had Roman-8 and Digital had their DEC Multinational Character
Set. Eventually, ISO settled on composing a standard, and they worked
from DEC MCS – or DEC were smart to work from the ISO drafts, I don’t know
which. This resulted in ISO-8859, a family of originally eight 8-bit
character sets, which has since grown to 15 sets.
By the time Microsoft divorced from IBM, they abandoned CP437 and
CP850 as the character set for Windows, and went with ISO-8859, at
least for Western Europe. Except that they added some printable
characters in the range 128-159 where Latin-1 has only control characters.
This became code page 1252, and CP1252 is the code page typically
used for 8-bit Windows applications on a computer installed in Western
Europe or the Americas. However, CP437/CP850 still live on in Windows
today; the command-line window uses a so-called OEM character set, which
is one of these.
If you have a Windows program that uses CP1252, and the server collation
uses CP437, the client API will convert the data for you, so if you pass
for instance Ö, which has character code 214 in CP1252, the byte that
gets stored in SQL Server will be another. When you retrieve data,
data will be converted in the other direction. However, since CP1252
and CP437 do not include the same characters, the conversion may
not round-trip. For instance, Ø is not in CP437, so an Ø from
CP1252 will typically become O, and will remain O when you retrieve it.
<TiborComment>Here I removed a section which was only relevant for the newsgroup thread in question</TiborComment>
Finally, all collations have 255 characters for varchar, and at least
65535 characters for nvarchar.”
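The conversion described in the quote can be tried out with a small Python sketch. This is only an illustration and not part of Erland's text: it uses Python's built-in `cp1252`, `cp437` and `latin-1` codecs rather than SQL Server or the Windows client API, and the helper names (`code_point`, `convert`, `round_trips`, `is_control`) are made up for the example. One difference to keep in mind is that Python substitutes `?` for a character the target code page lacks, whereas the Windows conversion may pick a look-alike letter instead.

```python
import unicodedata


def code_point(ch: str, code_page: str) -> int:
    """Return the single-byte value of ch in the given code page."""
    # Raises UnicodeEncodeError if the code page has no such character.
    return ch.encode(code_page)[0]


def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one code page to another.

    Characters missing from the target code page become b'?'
    (Python's 'replace' behaviour, assumed here for illustration).
    """
    return data.decode(source).encode(target, errors="replace")


def round_trips(ch: str, client: str = "cp1252", server: str = "cp437") -> bool:
    """True if ch survives client -> server -> client conversion unchanged."""
    original = ch.encode(client)
    stored = convert(original, client, server)
    return convert(stored, server, client) == original


def is_control(byte: int, code_page: str) -> bool:
    """True if the byte decodes to a control character in the code page."""
    return unicodedata.category(bytes([byte]).decode(code_page)) == "Cc"


# The same letter is stored as a different byte in each code page.
print(code_point("Ö", "cp1252"))  # 214
print(code_point("Ö", "cp437"))   # 153

# Ö exists in both code pages; Ø exists only in CP1252.
print(round_trips("Ö"))  # True
print(round_trips("Ø"))  # False

# Byte 128 is a control character in Latin-1 but printable in CP1252.
print(is_control(0x80, "latin-1"))  # True
print(is_control(0x80, "cp1252"))   # False
```

The round-trip check is the part that matters in practice: once a character has been replaced on the way in, nothing on the way out can bring it back.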
For those of you who want to dive deep into collations and related topics, check out http://msdn.microsoft.com/en-us/library/bb330962.aspx.