Title Case or Capitalize a Sentence or Word

Published on 26-09-2016 by Luis Lopez

This is my first “language-agnostic” #DailyAlgorithm post. It covers both converting a sentence to “title case” and capitalising a string (a single word or a full sentence) using a simple strategy based on string splits and, where needed, RegExp patterns. The code is written in JavaScript, but the strategy itself is language-agnostic; some languages, like Elixir, already ship a method to capitalize a string.

Capitalizing a Word or Sentence

Capitalising a string doesn’t require much thought. Picture your word and remember what capitalising actually means: the first letter should be uppercase and the rest lowercase. Whether you lowercase the rest depends on how much you trust the input and on whether you want to preserve uppercase abbreviations. That way, words like what, HeLlOo and daRKness become What, Hello and Darkness respectively. Most programming languages have a helper function to uppercase or lowercase a letter or string, and another helper to extract portions of a string.

function capitalize(str) {
  if (str.length) {
    return str[0].toUpperCase() + str.slice(1).toLowerCase();
  } else {
    return '';
  }
}

And here’s the ES6 version with a ternary operator just in case:

const capitalize = str => str.length
  ? str[0].toUpperCase() +
    str.slice(1).toLowerCase()
  : '';

The function first checks whether the string passed to it is empty, in which case it returns an empty string. If it has at least one character, it uppercases the first letter and appends the rest of the string in lowercase (the lowercasing is optional, but you do need to append the rest).

Title Casing a Sentence

I title case every subtitle in my posts. As you can see above, “Title Casing a Sentence” has every word capitalized, meaning the first letter of each word is uppercase. For some reason, most title-casing algorithms capitalize every single word, including short ones like “or” and “a”. That’s really up to the needs of the developer, but I find it a good challenge to account for the cases where you do NOT want those short words capitalized. I’ve seen bad attempts at using RegExp, so I’m going to describe one method I came up with while struggling with this algorithm.

Some edge cases

  • What if I want to split my sentence not only on spaces but also on dashes, opening exclamation and question marks (for Spanish and other languages), forward slashes or even underscores? See what happens if I don’t take this edge case into account: hi super-man, you look amazing/cool... would turn into Hi Super-man, You Look Amazing/cool... instead of the desired Hi Super-Man, You Look Amazing/Cool... Check this example as well: hola, ¿dónde Está el ¡¡baño!!? would become Hola, ¿dónde Está El ¡¡baño!!? instead of the obviously expected Hola, ¿Dónde Está El ¡¡Baño!!?.

  • I don’t want to capitalize filler words like “for, to, and, or, onto, in, on, into” and so on. You can learn more about these words here.
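
The first edge case can be reproduced with a whitespace-only splitter. This is just a sketch to show the problem; naiveTitleCase is a hypothetical name used for illustration.

```javascript
const capitalize = str => str.length
  ? str[0].toUpperCase() + str.slice(1).toLowerCase()
  : '';

// Naive version: a "word" is whatever sits between runs of whitespace
const naiveTitleCase = sentence => sentence
  .split(/\s+/)
  .map(capitalize)
  .join(' ');

console.log(naiveTitleCase('hi super-man, you look amazing/cool...'));
// -> Hi Super-man, You Look Amazing/cool...

console.log(naiveTitleCase('hola, ¿dónde Está el ¡¡baño!!?'));
// -> Hola, ¿dónde Está El ¡¡baño!!?
```

Words glued together by a dash or slash, or hidden behind an opening ¿ or ¡, are treated as a single chunk, so only the first character of the chunk gets uppercased.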

The first edge case can be solved easily with a simple RegExp (regular expression pattern) and a little helper function that escapes the separators so they can be used inside a RegExp character list, such as /[abc]/. I won’t look into the second case here, but if you’re up for the challenge, go ahead!

Tip: Use an array of filler words you want to avoid capitalising inside the capitalize function.
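
One possible way to follow this tip is sketched below. The FILLER_WORDS list and the titleCaseWithFillers name are assumptions made for illustration, and so is the rule that the first word is always capitalized. The check lives in the map callback rather than inside capitalize because it needs the word’s position.

```javascript
// Hypothetical list built from the words mentioned above; extend it as needed
const FILLER_WORDS = ['a', 'and', 'for', 'in', 'into', 'on', 'onto', 'or', 'to'];

const capitalize = str => str.length
  ? str[0].toUpperCase() + str.slice(1).toLowerCase()
  : '';

// Lowercases filler words and capitalizes everything else.
// Assumption of this sketch: the first word is capitalized even if it is a filler.
const titleCaseWithFillers = sentence => sentence
  .split(/\s+/)
  .map((word, index) => {
    const lower = word.toLowerCase();
    return index > 0 && FILLER_WORDS.includes(lower) ? lower : capitalize(word);
  })
  .join(' ');

console.log(titleCaseWithFillers('title case or capitalize a sentence or word'));
// -> Title Case or Capitalize a Sentence or Word

console.log(titleCaseWithFillers('a guide to splitting and joining strings'));
// -> A Guide to Splitting and Joining Strings
```

This sketch only splits on whitespace, so it does not cover the separators from the first edge case.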

By Using Replace and RegExp

If you’re not familiar with String.prototype.replace(), I recommend reading MDN’s documentation article about it. This strategy is simple but can get messy if you don’t know the basics of regular expressions. What we are going to replace are runs of characters that don’t contain any of your separators. For example, if your separators are “_ -/¿¡” and your string is ¿¿ dun, hello world-hehe yo/hey na_naa, the RegExp will match the following groups: ['dun,', 'hello', 'world', 'hehe', 'yo', 'hey', 'na', 'naa']. These groups are the words to be capitalized!
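
The grouping can be checked quickly with String.prototype.match(). The pattern below is written by hand as a literal for this sketch, with the dash and the forward slash escaped.

```javascript
const sentence = '¿¿ dun, hello world-hehe yo/hey na_naa';

// Negated character list: one or more characters that are NOT a separator
const wordPattern = /[^_ \-\/¿¡]+/g;

console.log(sentence.match(wordPattern));
// -> [ 'dun,', 'hello', 'world', 'hehe', 'yo', 'hey', 'na', 'naa' ]
```

Note that the comma stays attached to dun, because the comma is not in the separator list.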

If we call replace on our sentence, the RegExp will match these words and capitalize them, as long as we pass capitalize as the callback function (the second argument). There is one little problem, though: we first need to escape the separators with a backslash, which is why I declared another helper function that puts a backslash in front of every separator. Well, in reality the source code has two backslashes, because inside a JavaScript string the backslash is itself the escape character (for quotation marks and any other character, really), so a literal backslash must be written as \\.

const mySentence = "I'm the ¡very-best ¿¿agreed, you/ya aLL?";
// ... capitalize() goes here

// Put a backslash in front of every character of the separator string
const escape = str => str.replace(/./g, c => `\\${c}`);

// Capitalize every run of characters that contains no separator
const titleCase = (sentence, seps = ' _-¡¿/') => {
  const wordPattern = new RegExp(`[^${escape(seps)}]+`, 'g');
  return sentence.replace(wordPattern, capitalize);
};

console.log( titleCase(mySentence) );
// Outputs -> I'm The ¡Very-Best ¿¿Agreed, You/Ya All?

The escape helper function replaces EVERY single character in the string with itself, preceded by a backslash. String interpolation is nice in ES6; for those who don’t use ES6, the template literal is just the equivalent of '\\' + c. The g flag means “global”; without it, only the first occurrence of the pattern would be replaced.

Our new function declares a wordPattern that determines how words are selected from the string. The RegExp constructor assembles this behind the scenes: /[^\ \_\-\¡\¿\/]+/g. The caret right after the opening bracket tells the RegExp engine to match anything except the characters in the list; it’s an inverse selector that basically means “grab one or more characters that aren’t these”.
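
The sketch below prints what escape returns and checks the resulting pattern on a short string. It also shows a caveat worth knowing: a backslash in front of a letter or digit can change its meaning inside a RegExp. The escapeForClass helper is a hypothetical alternative, not part of the original code, that only escapes the characters that are special inside a character list.

```javascript
// escape() exactly as defined in the post
const escape = str => str.replace(/./g, c => `\\${c}`);
const seps = ' _-¡¿/';

console.log(escape(seps));
// -> \ \_\-\¡\¿\/

const wordPattern = new RegExp(`[^${escape(seps)}]+`, 'g');
console.log('very-best'.match(wordPattern));
// -> [ 'very', 'best' ]

// Caveat: with 'd' as the separator the list becomes [^\d],
// which means "anything but a digit", not "anything but the letter d"
const digitPattern = new RegExp(`[^${escape('d')}]+`, 'g');
console.log('odd 42'.match(digitPattern));
// -> [ 'odd ' ]

// Hypothetical alternative: escape only backslash, closing bracket, caret and dash
const escapeForClass = str => str.replace(/[\\\]^-]/g, '\\$&');
const safePattern = new RegExp(`[^${escapeForClass('d')}]+`, 'g');
console.log('odd 42'.match(safePattern));
// -> [ 'o', ' 42' ]
```

For punctuation-only separators like the default ones, both helpers lead to the same matches.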

Conclusion

My two cents? For me, the mapping and joining approach isn’t that bad. Nowadays, computers have plenty of memory to handle extra arrays, and the use cases of this algorithm aren’t likely to need any significant optimization. Besides, RegExp can be daunting for newcomers. Also, the first strategy is complicated if you don’t understand recursion; the second one seems more elegant to me and, in my opinion, handles extreme cases. If you have more edge cases to consider or any other suggestion, let me know. For now, I’ll put the code that only handles splitting and joining by whitespace down below:

const capitalize = str => str.length
  ? str[0].toUpperCase() +
    str.slice(1).toLowerCase()
  : '';
const titleCase = str => str
  .split(/\s+/)
  .map(capitalize)
  .join(' ');
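
A quick usage sketch of the code above, showing how whitespace is handled. Calling trim() first is a suggestion of this sketch, not something the original code does.

```javascript
const capitalize = str => str.length
  ? str[0].toUpperCase() + str.slice(1).toLowerCase()
  : '';
const titleCase = str => str
  .split(/\s+/)
  .map(capitalize)
  .join(' ');

// Runs of whitespace collapse into a single space
console.log(titleCase('hello   wORLD'));
// -> Hello World

// Leading whitespace produces an empty first chunk, so one space survives
console.log(JSON.stringify(titleCase('  hi there')));
// -> " Hi There"

// Trimming first avoids the stray space
console.log(JSON.stringify(titleCase('  hi there'.trim())));
// -> "Hi There"
```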

In modern languages, things get easier or more difficult depending on the implementation of functional programming concepts and native string methods. For example, in Elixir I could do something like this:

my_sentence = "I'm a good mama's boy"

my_sentence
|> String.split()
|> Stream.map(&String.capitalize/1)
|> Enum.join(" ")
|> IO.puts()

Beautiful, isn’t it? If you’re new to Elixir but are currently learning it, the code just prints the result of taking the sentence, splitting it by whitespace, mapping it to the capitalized words and then joining the words with a space. If you want to see the full implementation of strategies one and two in Elixir, give me a heads up in the comment section.

