Strings

Apart from numbers, R also has the character class, which holds text: any value enclosed in single or double quotes, even if it is made up of digits. A string can be made up of one character or several characters.

library(stringr)
"hello world"

[1] "hello world"

Although the above contains several characters, it is one string, so its length as a vector is 1.

a <- "hello world"
length(a)

[1] 1

To count the number of characters within a string, we use the nchar function from base R, or str_length and str_count from stringr:

str_count(a)

[1] 11

str_length(a)  # length of the string, in characters

[1] 11

nchar(a)  # number of characters

[1] 11

To create a vector of strings, we use the c function:

b <- c("HELLO WORLD", "HeLlo WoRlD", "hello WORLD", "HELLO world")

The vector b has 4 elements. Since R is case sensitive, all the elements are considered to be different.

unique(b)

[1] "HELLO WORLD" "HeLlo WoRlD" "hello WORLD" "HELLO world"
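To make the distinction between the two counts concrete, here is a minimal sketch using the same vector b. The names n_elements and n_chars are illustrative only. length counts the elements of the vector, while nchar is vectorised and returns one character count per element.

```r
# Same vector as in the text above
b <- c("HELLO WORLD", "HeLlo WoRlD", "hello WORLD", "HELLO world")

n_elements <- length(b)  # number of strings in the vector
n_chars <- nchar(b)      # number of characters in each string (spaces included)

print(n_elements)
print(n_chars)
```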

In day-to-day work, we come across many strings that need to be manipulated before we can obtain meaningful information from them. We will tackle some of the methods used to manipulate strings. Note that although the vector b above has 4 unique elements, the elements are just variations of the string hello world. We can manipulate the vector so that it contains only 1 unique string.
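As a preview, one possible way to reduce b to a single unique string is to convert every element to the same case before calling unique. This is a minimal sketch using base R's tolower, which is covered in the next section; the name normalised is illustrative only.

```r
# Same vector as in the text above
b <- c("HELLO WORLD", "HeLlo WoRlD", "hello WORLD", "HELLO world")

# Convert every element to lower case, then keep the distinct values
normalised <- unique(tolower(b))

print(normalised)
```

After the case conversion, all four elements are identical, so unique returns a vector of length 1.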

String Methods

tolower – Converts the string/character to lowercase

tolower('A')

[1] "a"

tolower(b)

[1] "hello world" "hello world" "hello world" "hello world"

str_to_lower(b)

[1] "hello world" "hello world" "hello world" "hello world"

toupper – Converts the string/character to uppercase

toupper(b)

[1] "HELLO WORLD" "HELLO WORLD" "HELLO WORLD" "HELLO WORLD"

str_to_upper(b)

[1] "HELLO WORLD" "HELLO WORLD" "HELLO WORLD" "HELLO WORLD"

casefold – Converts the string to the specified case, either upper or lower. By default, it converts to lower case.

casefold(b)

[1] "hello world" "hello world" "hello world" "hello world"

casefold(b, upper = TRUE)

[1] "HELLO WORLD" "HELLO WORLD" "HELLO WORLD" "HELLO WORLD"

paste – pastes strings together

a <- c("I", "am")
b <- c("you", "are")
paste(a, b)

[1] "I you"  "am are"

paste(a, b, sep = '_')

[1] "I_you"  "am_are"

paste(a, b, collapse = ' ')

[1] "I you am are"

paste(a, b, sep = ' ', collapse = ', ')

[1] "I you, am are"

The same can be accomplished using str_c:

str_c(a, b, sep = ' ')

[1] "I you"  "am are"

str_c(a, b, sep = '_')

[1] "I_you"  "am_are"

str_c(a, b, sep = ' ', collapse = ' ')

[1] "I you am are"

str_c(a, b, sep = ' ', collapse = ', ')

[1] "I you, am are"
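The difference between sep and collapse is easy to mix up: sep goes between the pieces that are pasted together element by element, while collapse joins the resulting vector into a single string. The sketch below uses made-up inputs (not from the examples above) and also shows that base R's paste recycles a shorter argument, and that paste0 is the base R shorthand for sep = "".

```r
# sep is placed between the arguments, element by element;
# the single string "item" is recycled against 1:3
labels <- paste("item", 1:3, sep = "_")

# collapse then joins the elements of the result into one string
joined <- paste(labels, collapse = ", ")

# paste0 is paste with sep = ""
compact <- paste0("x", 1:2)

print(labels)
print(joined)
print(compact)
```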

strrep – repeats a string n times to create another string

strrep("abc", 4)

[1] "abcabcabcabc"

str_dup("abc", 4)

[1] "abcabcabcabc"

toString – pastes the strings of a vector into one string separated by commas

b <- c("banana", "orange","strawberry", "lemon")
toString(b)

[1] "banana, orange, strawberry, lemon"

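We can also split each string on a letter with strsplit or str_split. The result is a list with one element per input string. Note that str_split keeps the trailing empty string that strsplit drops.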

strsplit(b, 'a')

[[1]]
[1] "b" "n" "n"

[[2]]
[1] "or"  "nge"

[[3]]
[1] "str"    "wberry"

[[4]]
[1] "lemon"

str_split(b, 'a')

[[1]]
[1] "b" "n" "n" "" 

[[2]]
[1] "or"  "nge"

[[3]]
[1] "str"    "wberry"

[[4]]
[1] "lemon"

(d <- strsplit(b, ''))

[[1]]
[1] "b" "a" "n" "a" "n" "a"

[[2]]
[1] "o" "r" "a" "n" "g" "e"

[[3]]
 [1] "s" "t" "r" "a" "w" "b" "e" "r" "r" "y"

[[4]]
[1] "l" "e" "m" "o" "n"
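Because strsplit returns a list, the result usually needs one more step before it can be used as an ordinary vector. The following is a minimal sketch of a few common follow-ups; the names letters_per_word, first_letters, and words are illustrative, and the sentence "hello world" is reused from the start of this section.

```r
b <- c("banana", "orange", "strawberry", "lemon")
d <- strsplit(b, "")

# Number of pieces in each list element (here, letters per word)
letters_per_word <- lengths(d)

# Pick the first piece from every list element
first_letters <- sapply(d, `[`, 1)

# For a single string, unlist turns the one-element list into a plain vector
words <- unlist(strsplit("hello world", " "))

print(letters_per_word)
print(first_letters)
print(words)
```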

Check whether a pattern is included in a string:

str_detect(b, 'a')

[1]  TRUE  TRUE  TRUE FALSE

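Replace certain patterns (str_replace changes only the first match in each string, while str_replace_all changes every match):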

str_replace(b, "a", "3")

[1] "b3nana"     "or3nge"     "str3wberry" "lemon"

str_replace_all(b, "a", "3")

[1] "b3n3n3"     "or3nge"     "str3wberry" "lemon"

Remove certain patterns (str_remove drops only the first match in each string, while str_remove_all drops every match):

str_remove(b, "a")

[1] "bnana"     "ornge"     "strwberry" "lemon"

str_remove_all(b, "a")

[1] "bnn"       "ornge"     "strwberry" "lemon"

Get the matched pattern:

str_extract(b, "a")

[1] "a" "a" "a" NA

str_extract_all(b, "a")

[[1]]
[1] "a" "a" "a"

[[2]]
[1] "a"

[[3]]
[1] "a"

[[4]]
character(0)

Notice that in many of the examples above, we have two functions that do the same task. This is because some of the functions are from base R while others are from the stringr package, which is part of the tidyverse.

Function	Description	Similar To
`str_length`	Number of characters	`nchar`
`str_c`	String concatenation	`paste`
`str_sub`	Extracts substrings	`substr` / `substring`
`str_sub_all`	Extracts substrings
`str_detect`	Detects a pattern	`grepl`
`str_replace`	Replaces the first match of a pattern	`sub`
`str_replace_all`	Replaces all matches of a pattern	`gsub`
`str_split`	Splits a string	`strsplit`
`str_dup`	Duplicates a string	`strrep`
`str_trim`	Removes leading and/or trailing whitespace	`trimws`
`str_remove`	Removes the first match of a pattern	`sub`
`str_remove_all`	Removes all matches of a pattern	`gsub`
`str_extract`	Extracts the first match from a string	`regmatches` + `regexpr`
`str_extract_all`	Extracts all matches from a string	`regmatches` + `gregexpr`
`str_wrap`	Wraps a string paragraph	`strwrap`
`str_pad`	Pads a string
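To illustrate the right-hand column, here is a minimal sketch that reproduces some of the stringr results shown earlier using only base R. The variable names are illustrative, and the comments name the stringr call each line parallels.

```r
b <- c("banana", "orange", "strawberry", "lemon")

has_a <- grepl("a", b)             # parallels str_detect(b, "a")
first_swapped <- sub("a", "3", b)  # parallels str_replace(b, "a", "3")
all_swapped <- gsub("a", "3", b)   # parallels str_replace_all(b, "a", "3")
all_removed <- gsub("a", "", b)    # parallels str_remove_all(b, "a")

# parallels str_extract_all(b, "a"): a list with one element per string
all_matches <- regmatches(b, gregexpr("a", b))

print(has_a)
print(first_swapped)
print(all_swapped)
print(all_removed)
print(all_matches)
```

One difference worth keeping in mind: regmatches combined with regexpr returns only the strings that matched, whereas str_extract returns NA for strings without a match, as the lemon example above shows.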

More examples and documentation can be found in the stringr package documentation.

Regular Expressions

One main application of string manipulation is pattern matching. Finding patterns in text is useful for data validation, data scraping, text parsing, filtering search results, etc. A regular expression (or regex) is a set of symbols that describes a text pattern. More formally, a regular expression is a pattern that describes a set of strings. Regular expressions are a formal language in the sense that the symbols have a defined set of rules to specify the desired patterns. The best way to learn the syntax and become fluent with regular expressions is to practice.

Applications of Regular Expressions

Some common applications of regular expressions:

Test if a phone number has the correct number of digits
Test if a date follows a specific format (e.g. mm/dd/yy)
Test if an email address is in a valid format
Test if a password has numbers and special characters
Search a document for gray spelled either as “gray” or “grey”
Search a document and replace all occurrences of “Will”, “Bill”, or “W.” with “William”
Count the number of times in a document that the word “analysis” is immediately preceded by the words “data”, “computer”, or “statistical”
Convert a comma-delimited file into a tab-delimited file
Find duplicate words in a text

and so many more …
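A few of the applications listed above can be sketched with base R's grepl, sub, and gsub. All input strings below are made up for illustration, and the 10-digit phone format is an assumption, not a rule stated in the text.

```r
# "gray" or "grey": a character class accepts either vowel
texts <- c("a gray cat", "a grey dog", "a green frog")
has_gray <- grepl("gr[ae]y", texts)

# Phone number with the correct number of digits (assumed here: exactly 10)
phones <- c("5551234567", "555-1234", "55512345678")
valid_phone <- grepl("^[0-9]{10}$", phones)

# Replace "Will", "Bill", or "W." at the start of a name with "William"
names_in <- c("Will Smith", "Bill Smith", "W. Smith")
names_out <- sub("^(Will|Bill|W\\.)", "William", names_in)

# Convert a comma-delimited line into a tab-delimited line
csv_line <- "a,b,c"
tsv_line <- gsub(",", "\t", csv_line)

# Detect a word that is immediately repeated, using a backreference
sentences <- c("this is is a test", "this is a test")
has_duplicate <- grepl("\\b(\\w+) \\1\\b", sentences, perl = TRUE)

print(has_gray)
print(valid_phone)
print(names_out)
print(has_duplicate)
```

The patterns are deliberately simple. Real phone numbers, names, and CSV files (for example, fields that contain quoted commas) would need stricter patterns or a dedicated parser.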