Strings

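Apart from numbers, R also has the character class, which holds text. Any value enclosed in single or double quotes is a character value, commonly called a string, even if the quoted text consists only of digits. A string can be made up of one character or several characters.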

library(stringr)
"hello world"
[1] "hello world"

Although the value above contains several characters, it is a single string, so its length as a vector is 1.

a <- "hello world"
length(a)
[1] 1

To count the number of characters within a string, we use the nchar function from base R, or str_length and str_count from stringr:

str_count(a)
[1] 11
str_length(a) # length of string
[1] 11
nchar(a) # number of characters
[1] 11
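
The difference between length and nchar is easier to see on a vector with more than one element. The following is a minimal sketch in base R; the vector words is made up for illustration and is not part of the examples above.

```r
# `words` is a hypothetical example vector
words <- c("hello world", "hi", "")

n_elements <- length(words)  # number of elements in the vector: 3
n_chars <- nchar(words)      # number of characters in each element: 11 2 0

n_elements
n_chars
```

length answers "how many strings are there?", while nchar answers "how long is each string?".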

To create a vector of strings, we use the c function:

b <- c("HELLO WORLD", "HeLlo WoRlD", "hello WORLD", "HELLO world")

The vector b has 4 elements. Since R is case sensitive, all the elements are considered to be different.

unique(b)
[1] "HELLO WORLD" "HeLlo WoRlD" "hello WORLD" "HELLO world"

In day-to-day work, we come across many strings that must be manipulated before we can obtain meaningful information from them. We will look at some of the methods used to manipulate strings. Note that although the vector b above has 4 unique elements, they are all variations of the string hello world. We can manipulate the vector so that it contains only 1 unique string.
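
As a preview of that idea, here is a minimal sketch in base R. It assumes that converting everything to lowercase is an acceptable way to normalise the values; the variable name normalised is chosen only for illustration.

```r
# the same four variations of "hello world" used above
b <- c("HELLO WORLD", "HeLlo WoRlD", "hello WORLD", "HELLO world")

# convert to a common case first, then drop the duplicates
normalised <- unique(tolower(b))

normalised          # "hello world"
length(normalised)  # 1
```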

String Methods

  1. tolower – Converts a string to lowercase

    tolower('A')
    [1] "a"
    tolower(b)
    [1] "hello world" "hello world" "hello world" "hello world"
    str_to_lower(b)
    [1] "hello world" "hello world" "hello world" "hello world"
  2. toupper – Converts a string to uppercase

    toupper(b)
    [1] "HELLO WORLD" "HELLO WORLD" "HELLO WORLD" "HELLO WORLD"
    str_to_upper(b)
    [1] "HELLO WORLD" "HELLO WORLD" "HELLO WORLD" "HELLO WORLD"
  3. casefold – Converts a string to the specified case. It converts to lowercase by default and to uppercase when upper = TRUE

    casefold(b)
    [1] "hello world" "hello world" "hello world" "hello world"
    casefold(b, upper = TRUE)
    [1] "HELLO WORLD" "HELLO WORLD" "HELLO WORLD" "HELLO WORLD"
  4. paste – Pastes strings together element by element. sep is placed between the paired elements, and collapse joins the results into a single string

    a <- c("I", "am")
    b <- c("you", "are")
    paste(a, b)
    [1] "I you"  "am are"
    paste(a, b, sep = '_')
    [1] "I_you"  "am_are"
    paste(a, b, collapse = ' ')
    [1] "I you am are"
    paste(a, b, sep = ' ', collapse = ', ')
    [1] "I you, am are"

    The same can be accomplished using str_c:

    str_c(a, b, sep = ' ')
    [1] "I you"  "am are"
    str_c(a, b, sep = '_')
    [1] "I_you"  "am_are"
    str_c(a, b, sep = ' ', collapse = ' ')
    [1] "I you am are"
    str_c(a, b, sep = ' ', collapse = ', ')
    [1] "I you, am are"
  5. strrep – Repeats a string n times to create another string

    strrep("abc", 4)
    [1] "abcabcabcabc"
    str_dup("abc", 4)
    [1] "abcabcabcabc"
  6. toString – Pastes the elements of a vector into one string separated by commas

    b <- c("banana", "orange","strawberry", "lemon")
    toString(b)
    [1] "banana, orange, strawberry, lemon"
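
The pasting functions above are closely related. The following base R sketch shows how they connect; the names fruits, same, labels and bars are hypothetical and used only for illustration.

```r
# same values as the vector b defined just above
fruits <- c("banana", "orange", "strawberry", "lemon")

# toString(x) gives the same result as paste(x, collapse = ", ")
same <- identical(toString(fruits), paste(fruits, collapse = ", "))

# paste0 is paste with sep = ""; the shorter argument is recycled
labels <- paste0("fruit_", seq_along(fruits))

# strrep is vectorised, so each repeat count produces its own string
bars <- strrep("*", 1:3)

same
labels
bars
```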

We can split each string at a given letter. Both strsplit and str_split return a list with one element per input string.

strsplit(b, 'a')
[[1]]
[1] "b" "n" "n"

[[2]]
[1] "or"  "nge"

[[3]]
[1] "str"    "wberry"

[[4]]
[1] "lemon"

str_split(b, 'a') # stringr version; it keeps the trailing "" for "banana"
[[1]]
[1] "b" "n" "n" "" 

[[2]]
[1] "or"  "nge"

[[3]]
[1] "str"    "wberry"

[[4]]
[1] "lemon"

(d <- strsplit(b, '')) # an empty pattern splits into single characters
[[1]]
[1] "b" "a" "n" "a" "n" "a"

[[2]]
[1] "o" "r" "a" "n" "g" "e"

[[3]]
 [1] "s" "t" "r" "a" "w" "b" "e" "r" "r" "y"

[[4]]
[1] "l" "e" "m" "o" "n"
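
Because the result of a split is a list, it usually needs one more step before it can be used. The following is a minimal base R sketch of some common follow-up operations; the names parts, first, counts, flat and pieces are chosen for illustration only.

```r
b <- c("banana", "orange", "strawberry", "lemon")
parts <- strsplit(b, "a")

first <- parts[[1]]      # pieces of the first string: "b" "n" "n"
counts <- lengths(parts) # number of pieces per string: 3 2 2 1
flat <- unlist(parts)    # all pieces flattened into one character vector

# The split pattern is a regular expression by default, so "." would match
# any character. fixed = TRUE treats the separator literally instead.
pieces <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]

first
counts
flat
pieces
```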

Check whether a pattern is included in a string:

str_detect(b, 'a')
[1]  TRUE  TRUE  TRUE FALSE

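Replace certain patterns (str_replace changes only the first match in each string, while str_replace_all changes every match):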

str_replace(b, "a", "3")
[1] "b3nana"     "or3nge"     "str3wberry" "lemon"     
str_replace_all(b, "a", "3")
[1] "b3n3n3"     "or3nge"     "str3wberry" "lemon"     

Remove certain patterns:

str_remove(b, "a")
[1] "bnana"     "ornge"     "strwberry" "lemon"    
str_remove_all(b, "a")
[1] "bnn"       "ornge"     "strwberry" "lemon"    

Get the matched pattern:

str_extract(b, "a")
[1] "a" "a" "a" NA 
str_extract_all(b, "a")
[[1]]
[1] "a" "a" "a"

[[2]]
[1] "a"

[[3]]
[1] "a"

[[4]]
character(0)

Notice that in most of the examples above, two functions perform the same task. This is because one comes from base R while the other comes from the stringr package, which is part of the tidyverse. The table below pairs each stringr function with its closest base R counterpart.

Function          Description                                   Similar To
str_length        number of characters                          nchar
str_c             string concatenation                          paste
str_sub           extracts substrings                           substr / substring
str_sub_all       extracts substrings
str_detect        detects a pattern                             grepl
str_replace       replaces the first match                      sub
str_replace_all   replaces all matches                          gsub
str_split         splits a string                               strsplit
str_dup           duplicates a string                           strrep
str_trim          removes leading and/or trailing whitespace    trimws
str_remove        removes the first match of a pattern          sub
str_remove_all    removes all matches of a pattern              gsub
str_extract       extracts the first match from a string        regmatches + regexpr
str_extract_all   extracts all matches from a string            regmatches + gregexpr
str_wrap          wraps a string paragraph                      strwrap
str_pad           pads a string
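
To make the pairing concrete, the sketch below reruns some of the earlier stringr examples with their base R counterparts. Note that in base R the pattern comes first and the string second. The variable names are hypothetical and chosen for illustration.

```r
b <- c("banana", "orange", "strawberry", "lemon")

detected    <- grepl("a", b)      # like str_detect(b, "a")
first_repl  <- sub("a", "3", b)   # like str_replace(b, "a", "3")
all_repl    <- gsub("a", "3", b)  # like str_replace_all(b, "a", "3")
all_removed <- gsub("a", "", b)   # like str_remove_all(b, "a")

# like str_extract_all(b, "a"): a list with all matches per string
all_matches <- regmatches(b, gregexpr("a", b))

# regmatches with regexpr returns only the strings that matched, so
# "lemon" is dropped here instead of appearing as NA as in str_extract
first_matches <- regmatches(b, regexpr("a", b))

prefix  <- substr(b, 1, 3)       # like str_sub(b, 1, 3)
trimmed <- trimws("  lemon  ")   # like str_trim("  lemon  ")
```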

More examples and documentation can be found in the stringr package documentation.

Regular Expressions

One main application of string manipulation is pattern matching. Finding patterns in text is useful for data validation, data scraping, text parsing, filtering search results, etc. A regular expression (or regex) is a set of symbols that describes a text pattern. More formally, a regular expression is a pattern that describes a set of strings. Regular expressions are a formal language in the sense that the symbols follow a defined set of rules for specifying the desired patterns. The best way to learn the syntax and become fluent with regular expressions is to practice.
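
As a first taste of the syntax, here is a minimal base R sketch using grepl. The input vectors are made up for illustration.

```r
# hypothetical example words
words <- c("gray", "grey", "gravy", "groy")

# [ae] matches exactly one character that is either "a" or "e"
is_gray <- grepl("gr[ae]y", words)

# ^ and $ anchor the pattern to the start and end of the string,
# and [0-9]{3} means exactly three digits
three_digits <- grepl("^[0-9]{3}$", c("123", "1234", "12a"))

is_gray       # TRUE TRUE FALSE FALSE
three_digits  # TRUE FALSE FALSE
```

The single pattern gr[ae]y describes a set of two strings, "gray" and "grey", which is what is meant by a pattern describing a set of strings.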

Applications of Regular Expressions

Some common applications of regular expressions:

  • Test if a phone number has the correct number of digits

  • Test if a date follows a specific format (e.g. mm/dd/yy)

  • Test if an email address is in a valid format

  • Test if a password has numbers and special characters

  • Search a document for gray spelled either as “gray” or “grey”

  • Search a document and replace all occurrences of “Will”, “Bill”, or “W.” with “William”

  • Count the number of times in a document that the word “analysis” is immediately preceded by the words “data”, “computer”, or “statistical”

  • Convert a comma-delimited file into a tab-delimited file

  • Find duplicate words in a text

and so many more …
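
A few of the applications listed above can be sketched in base R as follows. All inputs are made-up examples, the 10-digit phone format is an assumption, and the patterns are deliberately simple illustrations rather than production-grade validators. perl = TRUE is used where the pattern relies on \\b, \\w or a backreference.

```r
# phone number with exactly 10 digits (the required length is an assumption)
phones <- c("0712345678", "12345", "07123x5678")
valid_phone <- grepl("^[0-9]{10}$", phones)

# date in mm/dd/yy form (checks the shape only, not whether the date exists)
dates <- c("03/15/24", "3/15/2024", "13-01-24")
valid_date <- grepl("^[0-9]{2}/[0-9]{2}/[0-9]{2}$", dates)

# "gray" or "grey"
has_gray <- grepl("gr[ae]y", c("a grey cat", "gravy", "gray skies"))

# replace "Will", "Bill" (as whole words) or "W." with "William"
renamed <- gsub("\\b(Will|Bill)\\b|\\bW\\.", "William",
                "Bill met W. Smith and Will", perl = TRUE)

# turn a comma-delimited line into a tab-delimited line
tabbed <- gsub(",", "\t", "name,age,city")

# detect a word that is immediately repeated
has_dup <- grepl("\\b(\\w+) \\1\\b",
                 c("this is is a test", "this is a test"), perl = TRUE)
```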
