Monday, January 12, 2009

Difference between NVARCHAR and VARCHAR in SQL Server

Difference between NVARCHAR and VARCHAR in SQL Server

The broad range of data types in SQL Server can sometimes throw people through a loop, especially when the data types seem to be highly interchangeable. Two in particular that constantly spark questions areVARCHAR and NVARCHAR: what's the difference between the two, and how important is the difference?

VARCHAR is an abbreviation for variable-length character string. It's a string of text characters that can be as large as the page size for the database table holding the column in question. The size for a table page is 8,196 bytes, and no one row in a table can be more than 8,060 characters. This in turn limits the maximum size of a VARCHAR to 8,000 bytes.

The "N" in NVARCHAR means uNicode. Essentially, NVARCHAR is nothing more than a VARCHAR that supports two-byte characters. The most common use for this sort of thing is to store character data that is a mixture of English and non-English symbols — in my case, English and Japanese.

The key difference between the two data types is how they're stored. VARCHAR is stored as regular 8-bit data. But NVARCHAR strings are stored in the database as UTF-16 — 16 bits or two bytes per character, all the time — and converted to whatever codepage is being used by the database connection on output (typically UTF-8). That said, NVARCHAR strings have the same length restrictions as their VARCHAR cousins — 8,000 bytes. However, since NVARCHARs use two bytes for each character, that means a given NVARCHAR can only hold 4,000 characters (not bytes) maximum. So, the amount of storage needed for NVARCHAR entities is going to be twice whatever you'd allocate for a plain old VARCHAR.

Because of this, some people may not want to use NVARCHAR universally, and may want to fall back on VARCHAR — which takes up less space per row — whenever possible.

Here's an example of how to mix and match the use of the two types. Let's say we have a community website where people log in with a username, but can also set a public "friendly" name to be more easily identified by other users. The login name can be a VARCHAR, which means it must be 8-bit ASCII (and it can be constrained further to conventional alphanumerics with a little more work, typically on the front end). The friendly name can be an NVARCHAR to allow Unicode entities. This way you're allowing support for Unicode, but only in the place where it matters most — both for the users, and where the extra storage space is going to be put to the best possible use.

VARCHAR and NVARCHAR in SQL Server 2005

One fairly major change to both VARCHAR and NVARCHAR in SQL Server 2005 is the creation of the VARCHAR(MAX) and NVARCHAR(MAX) data types. If you create a VARCHAR(MAX) column, it can hold up to 2^31 bytes of data, or 2,147,483,648 characters; NVARCHAR(MAX) can hold 2^30 bytes, or 1,073,741,823 characters.

These new data types are essentially replacements for the Large Object or LOB data types such as TEXT and NTEXT, which have a lot of restrictions. They can't be passed as variables in a stored procedure, for instance. The(MAX) types don't have those restrictions; they just work like very large string types. Consequently, if you're in the process of re-engineering an existing data design for SQL Server 2005, it might make sense to migrate some (although not all!) TEXT / NTEXT fields to VARCHAR(MAX) / NVARCHAR(MAX) types when appropriate.

The big difference between VARCHAR and NVARCHAR is a matter of need. If you need Unicode support for a given data type, either now or soon enough, go with NVARCHAR. If you're sticking with 8-bit data for design or storage reasons, go with VARCHAR. Note that you can always migrate from VARCHAR to NVARCHAR at the cost of some room -- but you can't go the other way 'round. Also, because NVARCHAR involves fetching that much more data, it may prove to be slower depending on how many table pages must be retrieved for any given operation.

No comments: