What is a word? Myths and assumptions

Dictionaries expose us to the rich variety of concepts we can express with the English language. But when writing dictionaries, we are often challenged by the variety of ways English can be written. We don't just have to deal with the twenty-six letters of the alphabet.

Sometimes people rely on myths about the way English words are written: these false assumptions can cause problems, especially when computers are told to make these assumptions.

English words do not use accents

English has more than a hundred words with accents and diacritics in relatively common use. One of the most frequent is café, a word with an acute accent which many learners will come across in their first year of learning English. Most of these words come from other languages. From French we have café as well as many others: crèche (which has a grave accent), crêpe (with a circumflex), façade (the cedilla tells us that the c is pronounced like an s), and naïve (the diaeresis tells us the i must be pronounced separately from the a). Spanish gives us jalapeño (the tilde indicates the letter is a sort of ny sound); doppelgänger is a German loan-word with an umlaut indicating the vowel is to be pronounced more like an 'e' than an 'a'.

There are even a few occasions where diacritics have been added after a word has entered the language - coöperate uses a diaeresis to indicate the two o vowels are separate syllables; nowadays, however, a hyphen to separate the vowels is more common (co-operate).

Accents do not matter in English

Many words written with accents can also be spelt without them: naïve is probably more usually spelt naive. Spelling façade without the cedilla is usually considered incorrect, but at least there is no word to confuse it with. However, there are some occasions where the accent does distinguish between two words: perhaps the most frequent is résumé, a summary of work you have done that you show to a potential employer; written without accents it is indistinguishable from resume, a verb meaning to start doing something that you did before but stopped. The meat spread pâté should also not be written pate, as that refers to the top of someone's head.
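To see why this matters when computers are told to ignore accents, here is a minimal Python sketch. It assumes a tool that "simplifies" text by decomposing characters and dropping the combining marks; the function name `strip_accents` is hypothetical and is not part of any dictionary software described here.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Each accented word collapses onto a different English word or spelling.
for word in ["résumé", "pâté", "naïve", "façade"]:
    print(word, "->", strip_accents(word))
```

After stripping, résumé can no longer be told apart from resume, nor pâté from pate, so any index built this way has to keep the original spelling alongside the simplified one.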

English words contain only the letters A-Z and accents

Numbers are increasingly becoming an integral part of the English language. 3G is almost never written or pronounced "third generation", at least when it refers to mobile connections. Many chemical formulas with subscript numerals are now in common use, such as H₂O for water and CO₂ for carbon dioxide, sometimes in spoken English too. Measurements also use superscripts: cm² stands for "centimetres squared" or "square centimetres".
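For software, one consequence is that the same term can arrive either with plain digits or with dedicated Unicode subscript and superscript characters. The following Python sketch assumes you want both forms to match in a search index. The helper name `search_key` is hypothetical, and compatibility normalization (NFKC) is just one possible choice, not a description of how any particular dictionary works.

```python
import unicodedata

def search_key(term: str) -> str:
    """Fold compatibility characters (such as subscript digits) to plain ones, then casefold."""
    return unicodedata.normalize("NFKC", term).casefold()

# The subscript and plain-digit spellings produce the same key.
print(search_key("H\u2082O") == search_key("H2O"))
print(search_key("cm\u00b2") == search_key("cm2"))
```

The trade-off is that the key no longer records which form the writer used, so the original text should be kept for display.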

English words contain only letters A-Z, accents and numbers

The English apostrophe is used to show that a letter has been left out, e.g. in I'm or can't, or with s to show possession, e.g. Anna's. It can also occur at the beginning of words, as in 'twas, or at the end, as in the builders' tools (= the tools of the builders). Some words like 'phone can be written with an apostrophe (indicating they are short for telephone, though this use is increasingly rare). When telling the time, o'clock is used to refer to a time which is exactly on the hour. It's short for of the clock though the long form is no longer used.

In American English in particular, dots (or periods) are used after letters to indicate when they are part of an abbreviation, whereas in British English it is more usual not to: the U.N. (American) vs the UN (British). Where the remaining letter is lower case, full stops are more normal (c., short for circa, meaning about). Units of measurement, on the other hand, usually do without: 10 mm, 7 lb.

Slashes are used in a similar way in a few specific cases - c/o meaning care of (used in addresses when the person you are writing to is staying at someone else's home). w/ for with and w/o for without are sometimes seen. Slashes can also be used in units of measurement, e.g. km/h, a written abbreviation for kilometres per hour.

It is rare to find a word which is not abbreviated or inflected in any way which nevertheless has punctuation. The programming language C++ is one.

In addition to words you might find in a dictionary, if you're analysing real English text, you'll also come across dates, times (08:30), currency (£2.50p), large numbers (1,337), fractions (1/4 or 0.25 or 25%) and other things (a 2:1; a 4×4; email and web addresses; smileys) with characters that behave more like part of a word than like punctuation.
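The effect of the "letters A-Z only" assumption is easy to demonstrate. The Python sketch below compares a deliberately naive tokenizer with a slightly more careful one that splits on spaces and trims a small set of sentence punctuation. Both functions and the sample sentence are made up for illustration; neither is a recommended or complete tokenizer.

```python
import re

NAIVE_WORD = re.compile(r"[A-Za-z]+")   # the myth: words are runs of A-Z
EDGE_PUNCTUATION = ".,;:!?\"()"         # trimmed from token edges only

def naive_tokens(text):
    """Return every unbroken run of unaccented letters."""
    return NAIVE_WORD.findall(text)

def whitespace_tokens(text):
    """Split on whitespace, then trim sentence punctuation from each end."""
    tokens = (chunk.strip(EDGE_PUNCTUATION) for chunk in text.split())
    return [t for t in tokens if t]

sample = "It's 8 o'clock: the café serves crêpes w/o C++ jokes."
print(naive_tokens(sample))
# ['It', 's', 'o', 'clock', 'the', 'caf', 'serves', 'cr', 'pes', 'w', 'o', 'C', 'jokes']
print(whitespace_tokens(sample))
# ["It's", '8', "o'clock", 'the', 'café', 'serves', 'crêpes', 'w/o', 'C++', 'jokes']
```

The second version is still imperfect: trimming full stops would turn U.N. into U.N and c. into c, so abbreviations need their own handling.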

English words are always single words

Many words in English don't occur on their own at all, or sometimes occur in phrases where the constituent parts don't keep their original meaning. Language experts sometimes disagree on where words begin or end, especially as some words can be written as two words or one.

The compound noun is made up of two or more words which together are taken as a single noun. Some are fairly easy to decode: rock salt is salt from a rock; lemon grass is a grass that smells like lemon (though it is not related to the lemon fruit). But many are much less intuitive: a red admiral is not a naval officer, but a butterfly. A grease monkey is someone who repairs engines, and a dead ringer is not a deceased campanologist but someone who looks a lot like someone else.

Some words, especially of foreign origin, only appear in compounds: cappella only exists in English in a cappella (a type of singing). The a in this case is from Italian and unrelated to English a, an, and is an inseparable part of the compound.

Hyphenated compounds are usually easier to spot (if the writer has remembered the hyphen!). left-handed is an adjective describing something done with the left hand, someone who uses their left hand more than their right, or things designed for use with the left hand. The verb to hot-desk is to share one or more desks with others, usually where there are fewer desks than people.

Phrasal verbs are another sort of multi-word unit: broken machines and naughty children are said to play up. Often the object of the phrasal verb divides the verb from the preposition, for instance to take something down, meaning to make a written note of something.

All sorts of other phrases are defined in the dictionary. In our dictionaries we refer to many of these constructions as idioms, for example: birds of a feather (people who are similar in how they think or act).


If that weren't complicated enough, we must also remember that the tricky problems above may be found together in a single word!

For instance, many accented words are found in compounds of some sort: a tête-à-tête is a private, often informal, meeting between two people. Regular or formal meetings of this sort, especially in the workplace, are known as one-to-ones, sometimes written as 1-2-1s.

A black-eyed pea is a compound noun in which one part is a hyphenated compound.

A will-o'-the-wisp is a hyphenated compound with one word ending in an apostrophe. It's a sort of ghostly spirit. A maître d' is a compound which ends with an apostrophe, and refers to a manager of a restaurant or its waiters. In English, maître is found only in this context, never as a separate word.

Why are these important?
  • When writing a dictionary, we need to think about where to put these so that the user can find them.
  • When using a dictionary, we need to think about how best to look the word up.
  • Increasingly, the use of online tools means we must think about how those tools work out where word boundaries are.
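As an illustration of the word-boundary problem, here is a minimal Python sketch of one common approach: greedily merging the longest run of tokens that matches a known multi-word entry. The function name `group_multiword` and the tiny `LEXICON` are hypothetical, chosen only to reuse examples from this article.

```python
def group_multiword(tokens, lexicon, max_len=4):
    """Greedily merge the longest run of tokens that appears in `lexicon`."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in lexicon:
                out.append(candidate)
                i += n
                break
        else:
            # No multi-word match starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return out

LEXICON = {"red admiral", "a cappella", "dead ringer", "birds of a feather"}
tokens = "she is a dead ringer for the a cappella singer".split()
print(group_multiword(tokens, LEXICON))
# ['she', 'is', 'a', 'dead ringer', 'for', 'the', 'a cappella', 'singer']
```

A sketch like this only finds units whose parts are adjacent. It would miss a phrasal verb split by its object, as in take something down, and it cannot tell a butterfly from a naval officer who happens to be red.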

Changes to entry structure

At the end of August we changed the way dictionary entries are split up in some of our titles. We are bringing related definitions together, which we hope will improve the user experience. Hopefully, this will be a 'silent' improvement, but we know our API consumers like to be kept updated. Below we've provided answers to the questions we're anticipating.


Q: What will this mean for users?

A: Users are more likely to find the definition in the first result; there will be fewer but longer entries.

We have found that users (including API consumers) want more information on one page and prefer not to click through search results. So, we are merging the entries which relate to a single headword.

This means that users will not have to choose the sense they want in order to get the definition: all definitions for a single headword will be displayed on one page. In the main British (Advanced Learner's) and American English datasets, definitions of a single word are currently spread across multiple entries, e.g. the results of a search for "find" begin with separate entries such as:

find verb (DISCOVER)
find verb (JUDGE)
find noun

In the future, these will be one entry with separate definitions for all of these meanings. NB: many entries already have multiple meanings (look at "find verb (DISCOVER)").

Q: What is a headword?

A: For us, a headword is a word with a particular spelling and pronunciation.

All senses in the same headword will be brought together in one entry. Some words will continue to be separate, for instance record (verb) and record (noun) because they have different pronunciations. We will keep this under review.

Phrasal verbs and idioms will remain separate entries as well.

Q: Will the API change?

A: The API itself will not change; this is a content change.

The XML and HTML content of the entries will change, but the XML/JSON format of API responses and the endpoints will not. The search results will also change, because there will be fewer entries.

Q: Will this break my application?

A: If your code does not make any assumptions about the structure of entries, it should continue to work.

If you have stored references to specific entryIds, those entries may no longer exist; run a new search to obtain the current entryIds. Entries will conform to the same DTD, but if your application reads from the XML structure directly, be advised that the path to senses may now include an additional block element (pos-block).
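One way to avoid depending on the exact path is to search for senses at any depth. The Python sketch below uses two invented entry fragments: only the pos-block element name comes from this post, while `entry`, `sense`, `def` and the placeholder definitions are assumptions for illustration and may not match the real DTD.

```python
import xml.etree.ElementTree as ET

# Invented fragments: senses directly under the entry vs. inside a pos-block.
OLD_STYLE = "<entry><sense><def>first definition</def></sense></entry>"
NEW_STYLE = (
    "<entry><pos-block>"
    "<sense><def>first definition</def></sense>"
    "<sense><def>second definition</def></sense>"
    "</pos-block></entry>"
)

def sense_definitions(entry_xml):
    """Collect definition text from every sense, however deeply it is nested."""
    root = ET.fromstring(entry_xml)
    return [s.findtext("def", default="") for s in root.iter("sense")]

print(sense_definitions(OLD_STYLE))   # ['first definition']
print(sense_definitions(NEW_STYLE))   # ['first definition', 'second definition']

# A fixed path that expects senses as direct children finds nothing here.
print(len(ET.fromstring(NEW_STYLE).findall("sense")))   # 0
```

Code written like `sense_definitions` makes no assumption about intermediate block elements, so it should keep working if another wrapper is introduced.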

Please let us know if you have any other questions about these exciting changes!

Best wishes,

Cambridge Dictionaries Online team

English Profile Levels

If you're using the API with the dataset 'british', you might be interested to know we've improved the coverage and added a new feature, English Profile Levels, to the XML.

Senses may now have 'lvl' tags (usually at 'def-info' level), with information from the English Vocabulary Profile project. This categorises words and their senses into levels according to when learners are likely to be familiar with them. While everyone's learning path is different, it is a useful way of identifying how important the word is to learn and can also be used to determine how a learner is doing.

The six levels, A1-C2, have been designed to fit with the Common European Framework of Reference for Languages (CEFR), so they are comparable with levels of learning across different languages. They are based on investigations into the language students actually produce in examinations, using, among other things, the Cambridge Learner Corpus, and also a growing English Profile corpus of the language students are using in schools.

So from the entry paint, you can see that the sense of making a picture is very well understood by learners, but the sense of covering a room in paint may not be used by learners until a little bit later.
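As a rough sketch of how an application might read these levels, the Python below pulls the 'lvl' value out of each sense and filters senses by level. The sample XML is invented: the `lvl` and `def-info` names come from this post, but the `entry`, `sense` and `def` elements, the `id` attribute, the definitions and the levels shown are placeholders, not real data for any word.

```python
import xml.etree.ElementTree as ET

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Invented sample: the third sense has no level information at all.
SAMPLE = """
<entry>
  <sense id="s1"><def-info><lvl>A2</lvl></def-info><def>placeholder one</def></sense>
  <sense id="s2"><def-info><lvl>B1</lvl></def-info><def>placeholder two</def></sense>
  <sense id="s3"><def>placeholder three</def></sense>
</entry>
"""

def sense_levels(entry_xml):
    """Return (sense id, level or None) for every sense in the entry."""
    root = ET.fromstring(entry_xml)
    result = []
    for sense in root.iter("sense"):
        lvl = sense.find(".//lvl")
        level = lvl.text.strip() if lvl is not None and lvl.text else None
        result.append((sense.get("id"), level))
    return result

def senses_up_to(entry_xml, max_level):
    """Ids of senses whose level is at or below `max_level`; unlevelled senses are skipped."""
    limit = CEFR_ORDER.index(max_level)
    return [sid for sid, lvl in sense_levels(entry_xml)
            if lvl in CEFR_ORDER and CEFR_ORDER.index(lvl) <= limit]

print(sense_levels(SAMPLE))         # [('s1', 'A2'), ('s2', 'B1'), ('s3', None)]
print(senses_up_to(SAMPLE, "A2"))   # ['s1']
```

Because senses only "may" have a level, the sketch treats a missing 'lvl' tag as unknown rather than as an error.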

How will you use this new information?

