21  Strings

A string is a sequence of characters; it may contain a single character, many characters, or none at all. It can be enclosed in single quotes ('This is a string') or in double quotes ("This is also a string"). Either way, R stores the same string and prints it with double quotes. As good practice, you should keep your use of quotation marks consistent throughout your code.

Before advancing to stringr, let us introduce how to handle strings in Base R. Once you’ve mastered Base R, you should find stringr similar and even easier to use.

21.1 Rules for strings in R

A string that starts with a single quote must end with a single quote. Double quotes can appear inside it as they are, and a single quote can be included by preceding it with the escape character (\).

'data science'
[1] "data science"
'Mike"s favorite course'
[1] "Mike\"s favorite course"
'Mike\'s favorite course'
[1] "Mike's favorite course"

A string that starts with a double quote must end with a double quote. Single quotes can appear inside it as they are, and a double quote can be included by preceding it with the escape character (\).

"data science"
[1] "data science"
"Mike's favorite course"
[1] "Mike's favorite course"
"Mike\"s favorite course"
[1] "Mike\"s favorite course"
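The backslashes in the printed results above belong to R's printed representation, not to the string itself. The following is a minimal sketch of that distinction; it assumes writeLines(), a base R function not otherwise used in this chapter, and the variable name quote_inside is only illustrative.

```r
quote_inside <- 'Mike"s favorite course'

# print() shows the escaped representation, wrapped in double quotes
print(quote_inside)
# [1] "Mike\"s favorite course"

# writeLines() writes the characters the string actually contains
writeLines(quote_inside)
# Mike"s favorite course

# The backslash is not counted as a character of the string
n_chars <- nchar(quote_inside)
n_chars
# [1] 22
```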

Exercise A

Enter a few lines from Lewis Carroll’s Alice’s Adventures in Wonderland. Alice has just arrived at the tea party…

Q1

“No room! No room!” they cried out when they saw Alice coming.

Q2

“There’s plenty of room!” said Alice indignantly, and she sat down in a large arm-chair at one end of the table.

21.2 Concatenation of strings

Concatenation joins two or more strings into one. In R, this is done with the paste() and paste0() functions:

arcadia <- "Arcadia"
uni <- "University"
paste(arcadia, uni)
[1] "Arcadia University"
paste(arcadia, uni, sep = "-")
[1] "Arcadia-University"
paste0(arcadia, uni)
[1] "ArcadiaUniversity"

The leading arguments are one or more character vectors, or objects that R converts to character vectors. sep specifies the separator placed between them; in paste() it defaults to a single space.
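As a small sketch of these two points, the calls below pass a number to paste() and use separators other than the default. The values and variable names are made up for illustration.

```r
# Numbers are converted to character strings before being joined
module_label <- paste("Module", 6)
module_label
# [1] "Module 6"

# sep may be any string, not only a single character
iso_date <- paste("2024", "01", "15", sep = "-")
iso_date
# [1] "2024-01-15"

arrow <- paste("input", "output", sep = " -> ")
arrow
# [1] "input -> output"
```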

If the arguments are vectors, they are concatenated term-by-term to give a character vector result.

myvar1 <- c("CS", "Data")
myvar2 <- c("229", "Science")
paste(myvar1, myvar2)
[1] "CS 229"       "Data Science"

If a value is specified for collapse, the values in the result are then concatenated into a single string, with the elements being separated by the value of collapse.

paste(myvar1, myvar2, collapse = "-")
[1] "CS 229-Data Science"
paste0(myvar1, myvar2, collapse = "-")
[1] "CS229-DataScience"

Here, the output contains a - between "229" and "Data" because collapse joins the elements of the result vector ("CS 229" and "Data Science") into a single string, using its value as the separator. In contrast, sep separates the pieces within each element, and its default value in paste() is a single space. paste0() behaves like paste() with sep = "", so that space disappears.
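The two separators can also be set together. A brief sketch, with illustrative variable names and separators chosen only to make the two levels easy to tell apart:

```r
dept <- c("CS", "Data")
code <- c("229", "Science")

# sep joins the pieces within each element
paired <- paste(dept, code, sep = "_")
paired
# [1] "CS_229"       "Data_Science"

# collapse then joins those elements into one string
combined <- paste(dept, code, sep = "_", collapse = " | ")
combined
# [1] "CS_229 | Data_Science"

# Without collapse the result has two elements; with collapse it has one
length(paired)
# [1] 2
length(combined)
# [1] 1
```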

If the arguments are vectors of different lengths, paste() recycles the elements of the shorter vectors until they match the length of the longest one.

students <- c("Amy", "Blake", "Charlie")
paste(students, "is a sophomore.")
[1] "Amy is a sophomore."     "Blake is a sophomore."  
[3] "Charlie is a sophomore."

We can also add the collapse argument to obtain

paste(students, "is a sophomore", collapse = ", and ")
[1] "Amy is a sophomore, and Blake is a sophomore, and Charlie is a sophomore"
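Recycling also applies when the shorter argument has more than one element. A sketch with made-up names, in which a two-element vector is reused across four players:

```r
players <- c("Amy", "Blake", "Charlie", "Dana")
teams <- c("Red", "Blue")

# "plays for" (length 1) and teams (length 2) are both recycled to length 4
assignments <- paste(players, "plays for", teams)
assignments
# "Amy plays for Red", "Blake plays for Blue",
# "Charlie plays for Red", "Dana plays for Blue"
```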

Exercise B

Q1

Each unique customer ID is composed of 3 letters followed by 2 numbers. Given the data set below, you are required to combine the letterID and numberID columns to form a complete customer ID for each individual. Add this as a new column in the customers data frame and name it UniqueID.

customers <- data.frame(name = c("John", "Tom", "William"),
                        letterID = c("ABC", "OPQ", "XYZ"), 
                        numberID = c(35, 68, 97))

Your task is to manipulate the customers data frame by creating the UniqueID column, ensuring it accurately reflects the combination of each customer’s letters and numbers as their unique identifier.

Q2

Create a vector of ordinal number strings — ["1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th", "9th", "10th"] — utilizing the paste() or paste0() function in R. Ensure you incorporate the numeric vector 1:10 as part of your function arguments to generate this sequence. Your final output should precisely match the target string vector format.

21.3 Finding the length

To find the total number of characters in a given string, we can use the nchar() function, NOT the length() function; length() returns the number of elements in a vector.

nchar("Data")
[1] 4
nchar("Science")
[1] 7

If you apply nchar to a vector of strings, it returns the length of each string:

students <- c("Amy", "Blake", "Charlie")
nchar(students)
[1] 3 5 7
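To see why length() is the wrong tool here, the sketch below calls both functions on the same inputs.

```r
students <- c("Amy", "Blake", "Charlie")

# length() counts the elements of the vector
n_elements <- length(students)
n_elements
# [1] 3

# nchar() counts the characters in each element
n_characters <- nchar(students)
n_characters
# [1] 3 5 7

# A single string is a vector with one element, whatever its content
single_length <- length("Data")
single_length
# [1] 1
```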

The nzchar() function checks whether a string is non-empty: nzchar(x) returns TRUE if x contains at least one character, and FALSE if x is the empty string. For example,

nzchar(" ")
[1] TRUE
nchar(" ")
[1] 1
nzchar("")
[1] FALSE
nchar("")
[1] 0

As can be seen above, an empty string has a length of 0.
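One possible use of nzchar() is to drop empty strings from a vector. A minimal sketch, using a made-up vector of survey responses:

```r
responses <- c("yes", "", "no", " ", "")

# TRUE for every element that contains at least one character
has_text <- nzchar(responses)
has_text
# [1]  TRUE FALSE  TRUE  TRUE FALSE

# Logical indexing keeps only the non-empty strings;
# the single space is kept because it is one character long
non_empty <- responses[has_text]
non_empty
# [1] "yes" "no"  " "
```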

Exercise C

Q1

Determine the lengths of the following strings: Arcadia University, CS229 Module 6, and Strings&Dates. Present your results in a vector that contains exactly three values, each representing the respective length of these strings.

21.4 Changing to upper and lower cases

We can easily modify the cases of characters using the toupper() and tolower() functions. As their names suggest, toupper() changes all the characters present to uppercase, while tolower() changes all the characters present to lowercase.

toupper("Every letter is changed to UPPER case.")
[1] "EVERY LETTER IS CHANGED TO UPPER CASE."
tolower("Every letter is changed to LOWER case.")
[1] "every letter is changed to lower case."
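Both functions also work on vectors, which makes them handy for comparing strings while ignoring case. A small sketch with made-up answers:

```r
answers <- c("Yes", "YES", "yes", "No")

# A direct comparison is case-sensitive
exact_match <- answers == "yes"
exact_match
# [1] FALSE FALSE  TRUE FALSE

# Converting to one case first makes the comparison case-insensitive
any_case_match <- tolower(answers) == "yes"
any_case_match
# [1]  TRUE  TRUE  TRUE FALSE
```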

Exercise D

Q1

In our data frame, there is a column containing country names; however, the cases of these entries are inconsistent. For example, the first 10 entries are as follows:

["United States", "germany", "United states", "CHINA", "SPain", "UNITED KINGDOM", "Australia", "GERMANY", "United Kingdom", "SPAIN"].

To simplify data cleaning, try to convert all these 10 entries to lowercase.

21.5 Splitting a string according to a delimiter

Next, we consider splitting a string into substrings that are separated by a delimiter. To do this, we can use strsplit(), which takes two main arguments: the string and the delimiter of the substrings.

Let us take a look at an example. It is common for a string to contain multiple substrings separated by the same delimiter. One example is a filepath, whose components are separated by slashes /:

path <- "/Users/weihong_ni/MyProject/inputdata/training.csv"

We can split that path into its components by using strsplit() according to a delimiter of /:

strsplit(path, "/")
[[1]]
[1] ""             "Users"        "weihong_ni"   "MyProject"    "inputdata"   
[6] "training.csv"

Notice that the first “component” is actually an empty string because nothing preceded the first slash.

Also notice that strsplit() returns a list and that each element of the list is a vector of substrings. This two-level structure is necessary because the first argument can be a vector of strings. Each string is split into its substrings (a vector), and then those vectors are returned in a list.

If you are operating only on a single string, you can pop out the first element like this:

unlist(strsplit(path, "/"))
[1] ""             "Users"        "weihong_ni"   "MyProject"    "inputdata"   
[6] "training.csv"
# or
strsplit(path, "/")[[1]]
[1] ""             "Users"        "weihong_ni"   "MyProject"    "inputdata"   
[6] "training.csv"

The following example splits three file paths and returns a three-element list:

paths <- c(
  "/Users/weihong_ni/MyProject/inputdata/training.csv",
  "/Users/weihong_ni/MyProject/outputdata/results.csv",
  "/Users/weihong_ni/MyProject/Rscripts/clean.R")
strsplit(paths, "/", fixed = TRUE)
[[1]]
[1] ""             "Users"        "weihong_ni"   "MyProject"    "inputdata"   
[6] "training.csv"

[[2]]
[1] ""            "Users"       "weihong_ni"  "MyProject"   "outputdata" 
[6] "results.csv"

[[3]]
[1] ""           "Users"      "weihong_ni" "MyProject"  "Rscripts"  
[6] "clean.R"   
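Because the result is a list, a common next step is to pull one component out of each vector. The sketch below is one possible way to do so, assuming the base R functions vapply() and lengths(), which are not otherwise covered in this section.

```r
paths <- c(
  "/Users/weihong_ni/MyProject/inputdata/training.csv",
  "/Users/weihong_ni/MyProject/outputdata/results.csv",
  "/Users/weihong_ni/MyProject/Rscripts/clean.R")
parts <- strsplit(paths, "/", fixed = TRUE)

# Take the last component of each vector in the list
file_names <- vapply(parts, function(p) p[length(p)], character(1))
file_names
# [1] "training.csv" "results.csv"  "clean.R"

# Number of components per path, counting the leading empty string
n_parts <- lengths(parts)
n_parts
# [1] 6 6 6
```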

The second argument of strsplit() (the delimiter argument, named split) is actually much more powerful than these examples indicate. It can be a regular expression, letting you match patterns far more complicated than a simple string. In fact, to turn off the regular expression feature (and its interpretation of special characters), you must include the fixed=TRUE argument.
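The difference matters as soon as the delimiter is a character with a special meaning in regular expressions, such as the period. A short sketch with made-up strings:

```r
dotted <- "a.b.c"

# As a regular expression, "." matches any character,
# so every character is treated as a delimiter and no letters survive
regex_split <- strsplit(dotted, ".")[[1]]

# With fixed = TRUE, "." is taken literally
literal_split <- strsplit(dotted, ".", fixed = TRUE)[[1]]
literal_split
# [1] "a" "b" "c"

# A regular expression can match several alternative delimiters at once:
# here a hyphen, a space, or a colon
stamp <- "2024-01-15 10:30"
stamp_parts <- strsplit(stamp, "[- :]")[[1]]
stamp_parts
# [1] "2024" "01"   "15"   "10"   "30"
```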

Exercise E

Q1

Split the following strings at the delimiter _, then extract and compile the text following each _ into a single vector. The strings are:

["Customer_ID", "Customer_Name", "Customer_Gender", "Customer_Race", "Customer_Age"]

Consequently, your output should display the following in the console:

[1] "ID" "Name" "Gender" "Race" "Age"

This task requires you to effectively separate each string at the specified delimiter and collate the segments found after this point.

21.6 Extracting and replacing a character string

We will first learn the substr() and substring() functions for extracting and replacing a character string. They take three main arguments, named x, start, and stop in substr() and text, first, and last in substring():

  • x or text: The character string (or vector of strings) to operate on.
  • start or first: An integer giving the position of the first character to be returned.
  • stop or last: An integer giving the position of the last character to be returned.

In the context of our course, distinguishing between substr() and substring() is not essential, as their functionality is largely overlapping for our practical applications. Therefore, we will use these two functions interchangeably throughout the curriculum.

For instance,

substring("football", 5, 8)
[1] "ball"

The above code prints out the 5th to the 8th character in the string “football”.

In fact, all the arguments can be vectors, in which case substr() will treat them as parallel vectors, recycling the shorter ones as needed. From each string, it extracts the substring delimited by the corresponding entries in the starting and ending points. For instance,

substr(rep("abcdef", 4), 1:4, 4:5)
[1] "abcd" "bcde" "cd"   "de"  

This can facilitate some useful tricks. For example, the following code snippet extracts the last two characters from each string; each substring starts on the penultimate character of the original string and ends on the final character:

cities <- c("Philadelphia, PA", "New York, NY", "Los Angeles, CA")
substr(cities, nchar(cities) - 1, nchar(cities))
[1] "PA" "NY" "CA"

Let’s see another example where the characters get replaced.

word <- "football"
substring(word, 1, 4) <- "hand"
word
[1] "handball"
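In the example above, the replacement has exactly as many characters as the span it replaces. When the lengths differ, this form of replacement never changes the length of the string, as the following sketch with made-up replacements suggests:

```r
word_long <- "football"
# The replacement is longer than the 4-character span: only "bask" is used
substr(word_long, 1, 4) <- "basket"
word_long
# [1] "baskball"

word_short <- "football"
# The replacement is shorter than the span: only the first 2 characters change
substr(word_short, 1, 4) <- "no"
word_short
# [1] "nootball"
```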

Often, we want to replace text according to what it contains rather than where it sits in the string. For this kind of replacement, we need the sub() and gsub() functions. Both functions need three inputs:

  • pattern: instance of old substring
  • replacement: new substring
  • x: target string

The sub() function finds the first instance of the old substring within the target string and replaces it with the new substring.

str <- "Amy loves watching football. Amy plays football, too."
sub("football", "basketball", str)
[1] "Amy loves watching basketball. Amy plays football, too."

gsub() does the same thing, but it replaces all instances of the substring (a global replace), not just the first.

gsub("football", "basketball", str)
[1] "Amy loves watching basketball. Amy plays basketball, too."

To remove a substring altogether, simply set the new substring to be empty.

sub(", too", "", str)
[1] "Amy loves watching football. Amy plays football."

The pattern argument can be a regular expression, which allows you to match patterns much more complicated than a simple string. This is actually assumed by default, so you must set the fixed=TRUE argument if you don’t want sub() and gsub() to interpret the pattern argument as a regular expression.
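The default can give surprising results when the pattern contains characters that are special in regular expressions. A brief sketch with made-up strings, using "$" (end of string) and "." (any character):

```r
price <- "$5.99"

# As a regular expression, "$" matches the end of the string,
# so the dollar sign itself is left untouched
regex_price <- sub("$", "", price)
regex_price
# [1] "$5.99"

# With fixed = TRUE, "$" is an ordinary character and is removed
fixed_price <- sub("$", "", price, fixed = TRUE)
fixed_price
# [1] "5.99"

version <- "3.14.1"

# As a regular expression, "." matches every character
regex_version <- gsub(".", "_", version)
regex_version
# [1] "______"

# With fixed = TRUE, only the literal periods are replaced
fixed_version <- gsub(".", "_", version, fixed = TRUE)
fixed_version
# [1] "3_14_1"
```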

Often, we encounter tasks requiring the removal of either letters or numbers from a series of license plates. Consider the following vector:

license <- c("AC372", "EAGLES", "KNL9270", "MPC2553")

To remove all letters (keeping only the numbers), we use the gsub() function, targeting uppercase letters as follows:

gsub("[A-Z]", "", license)
[1] "372"  ""     "9270" "2553"

Conversely, if we need to remove the numbers (keeping only the letters), we apply gsub() differently, as shown below:

gsub("[0-9]", "", license)
[1] "AC"     "EAGLES" "KNL"    "MPC"   

In each case, gsub() searches for patterns (letters or numbers) and replaces them with an empty string, effectively removing the targeted characters.

Exercise F

Q1

Apply the sub() or gsub() function to the following vector of strings:
["Customer_ID", "Customer_Name", "Customer_Gender", "Customer_Race", "Customer_Age"]

Your task is to manipulate the strings so that you extract only the portions after the underscore "_", thereby obtaining the following result:
[1] "ID" "Name" "Gender" "Race" "Age"