Representing Text

Robert P. Webber, Scott McElfresh, Don Blaheta, Longwood University

ASCII

ASCII (American Standard Code for Information Interchange) is a code used to represent characters inside a computer. A character is any single keystroke on the keyboard. Characters are usually indicated by including them in single quotes. For example, ‘A’ and ‘a’ are characters. So are ‘!’, ‘.’, and ‘6’.

A byte is the unit of storage for a character. In computers, 1 byte = 8 bits, so ASCII is a 1-byte, or 8-bit, code. There is a separate eight-bit combination of 0’s and 1’s for every individual character on a computer keyboard. (As a historical note, ASCII was originally defined as a seven-bit code. The left-most bit is not used in true ASCII, and we will always write it as 0.) Many computer books have tables of ASCII values, and you can find them in lots of locations on the Web. Here is the address of a helpful one. It gives the codes in their base 10 (dec), base 16 (hx), and base 8 (oct) forms.

http://www.lookuptables.com/

For example, the string hi in ASCII would be 68 69₁₆, or 01101000 01101001. Note that Hi would be 48 69₁₆, or 01001000 01101001. Upper case letters have different codes from lower case letters.
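If you have Python available, you can check values like these without a table. The following is a minimal sketch; the function name ascii_codes is made up for this illustration, while ord and format are built into Python.

```python
def ascii_codes(text):
    """Return (character, decimal, hex, 8-bit binary) for each character."""
    rows = []
    for ch in text:
        code = ord(ch)  # numeric code of the character
        if code > 127:
            raise ValueError(f"{ch!r} is not an ASCII character")
        # "02X" gives two hex digits, "08b" gives eight binary digits
        rows.append((ch, code, format(code, "02X"), format(code, "08b")))
    return rows

for word in ("hi", "Hi"):
    for ch, dec, hx, bits in ascii_codes(word):
        print(ch, dec, hx, bits)
```

The left-most bit of every binary value printed is 0, as described above.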

As you can see, the binary ASCII code representation of a string of characters can get very long and difficult to read. Consequently, the codes are often shown in their hexadecimal or decimal forms.

Example: Decode the message 43 4D 53 43 20 31 32 31, which is written in hexadecimal ASCII.

Solution: Look up the codes in the ASCII table to find the message CMSC 121. Notice that the space character has its own ASCII code.
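The table lookup in this example can also be sketched in Python. The function name decode_hex_ascii is an assumption made for this illustration; bytes.fromhex and decode are standard Python.

```python
def decode_hex_ascii(message):
    """Decode a string of two-digit hexadecimal ASCII codes into text."""
    # bytes.fromhex skips the spaces between the pairs of hex digits
    raw = bytes.fromhex(message)
    # decoding as "ascii" fails if any code is above 7F
    return raw.decode("ascii")

print(decode_hex_ascii("43 4D 53 43 20 31 32 31"))  # prints CMSC 121
```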

Computer documentation often gives such things as memory addresses and the contents of memory locations in hexadecimal. Here is a way to actually see some ASCII code. On a PC that includes the DEBUG program (it ships only with older, 32-bit versions of Windows), type and save a short file using a text editor such as Notepad. Give it a short name such as temp, and do not use any extension. It does not matter what you put in the file. Next, open a command window and open the file you saved using the command DEBUG followed by the file name. DEBUG shows a hyphen (-) as its prompt. At the prompt, type d (for display). The contents of your file will be shown in ASCII using hexadecimal notation. To exit DEBUG and return to command level, type q (for quit) at the prompt.
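If DEBUG is not available on your computer, a similar display can be sketched in Python. This is only an illustration, not a copy of DEBUG's output format; the function name hex_dump and the sample bytes are assumptions made for this example.

```python
def hex_dump(data, width=16):
    """Return display lines: offset, hex codes, then printable characters."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(format(b, "02X") for b in chunk)
        # show a dot for codes that are not printable ASCII characters
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # pad the hex column so the text column lines up on every row
        lines.append(f"{offset:04X}  {hex_part.ljust(width * 3 - 1)}  {text_part}")
    return lines

# Sample contents: the text "CMSC 121" followed by a Windows line ending
sample = b"CMSC 121\r\n"
for line in hex_dump(sample):
    print(line)
```

To display a file you saved, read it in binary mode and pass its contents to the function, for example hex_dump(open("temp", "rb").read()), where temp is the file name suggested above.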

ISO-8859

The ISO-8859 standard (including ISO-8859-1, aka ISO-Latin-1) included additional characters, to encode languages other than English, but still couldn't represent certain pairs of languages (like German and Greek) at the same time, and it couldn't represent the East Asian languages at all. It is less important now than it used to be, but it was long a default encoding on the web and can still be found in many places. This page has tables of the different ISO-8859 encodings:
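The German and Greek problem can be seen in a short Python sketch. The helper name fits and the sample words are assumptions made for this illustration; the encoding names iso8859_1 and iso8859_7 are the names Python uses for the ISO-8859-1 and ISO-8859-7 (Greek) tables.

```python
def fits(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

german = "Grüße"   # needs ü and ß
greek = "αλφα"     # needs Greek letters

# For each table: does the German word fit, the Greek word, both together?
for name in ("ascii", "iso8859_1", "iso8859_7"):
    print(name, fits(german, name), fits(greek, name),
          fits(german + " " + greek, name))

# The same byte value means different characters in different tables.
byte = bytes([0xE1])
print(byte.decode("iso8859_1"), byte.decode("iso8859_7"))
```

Each word fits in its own table, but no single table in this sketch holds both words at once. The last line shows why a reader has to know which ISO-8859 table was used: the byte E1 is á in ISO-8859-1 and α in ISO-8859-7.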

http://czyborra.com/charsets/iso8859.html

Unicode

Unicode resolves many of the limitations of ASCII and ISO-Latin-1. This is a site with lookup tables for Unicode:

http://www.unicode-table.com/

You can also look up characters by block.
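Python's standard unicodedata module can report a character's Unicode name, and ord gives its code point. Here is a minimal sketch; the function name describe and the three sample characters are choices made for this illustration.

```python
import unicodedata

def describe(ch):
    """Return the code point (U+ number) and Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# One ASCII character, one from ISO-8859-1, and one found in neither
for ch in ("A", "\u00e9", "\u2603"):
    print(ch, describe(ch))
```

The first two characters keep the same numbers they have in ASCII and ISO-8859-1 (41 and E9 in hexadecimal), while the third has a code point far beyond the range of a single byte.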

Exercises

Convert the following short phrases into ASCII. Use the decimal form (so, capital A would be 65). You don't need to encode the quotation marks.
1. "GO LU"
2. "fnord"
3. "So, so what?"
Convert the following ASCII sequences (given in decimal form) into text:
1. 65-108-111-104-97-33
2. 70-105-115-104-32-38-32-99-104-105-112-115
3. 84-73-69-32-102-105-103-104-116-101-114-58-32-60-61-62
Look at the table for ISO-Latin-1. Write a word or phrase in a language you know (perhaps even in English) that requires one or more of the characters in ISO-8859-1 that aren't part of ASCII. Which character(s) is(/are) not in ASCII? What number represents it(/them) in ISO-Latin-1? How would you "fake it" if you only had access to ASCII? What information is lost?
Write a word that can't even be represented in ISO-Latin1, but requires one of the other ISO-8859 tables. (Hint: googling for information about a language often turns up a few words in that language.) Give this information about the word: its language, the "problem" character(s) that aren't in ASCII or ISO-Latin1, the ISO-8859 set required, the numbers that represent the "problem" character(s) in the correct standard, and what those numbers represent in the Latin1 table.
Look through the Unicode tables and find a character that is not in ASCII or any of the ISO-8859 tables. Give its Unicode name and its Unicode "code point" (the U+ number), and draw a picture of it. What block is it in? What language or context is it used for?

Credits and licensing

This article is by Robert P. Webber, Scott McElfresh, and Don Blaheta, licensed under a Creative Commons BY-SA 3.0 license.

Version 2015-Sep-22 23:20