Strings and dates are common objects to work on in data processing. As soon as you read files or print reports, you need strings. When you work with real-world problems, you need dates very often.
R has facilities for both strings and dates. Base R already provides a solid set of string and date operations, but it has limitations. Some of these limitations are addressed by the tidyverse packages stringr and lubridate. However, in this course, we will focus on handling strings and dates using Base R.
Below is a list of functions and objects to be covered in this section.
paste(), paste0()
    sep
nchar()
tolower(), toupper()
strsplit()
substr(), substring()
sub(), gsub()
Date, POSIXct, POSIXlt
Sys.Date(), Sys.time()
as.Date()
ISOdate()
format()
    b, B, d, m, y, Y
julian()
as.POSIXlt()
seq()
    from, to, by, length.out
A string is a sequence of zero or more characters. It can be enclosed in single quotes ('This is a string') or in double quotes ("This is also a string"). Internally, however, R represents strings with double quotes, which is why printed output always shows them. As good practice, keep your use of quotation marks consistent throughout your code.
Before advancing to stringr, let us introduce how to handle strings in Base R. Once you’ve mastered Base R, you should find stringr similar and even easier to use.
A string that starts with a single quote must end with a single quote. Double quotes can appear inside it freely, and a single quote can be included by escaping it with a backslash (\).
'data science'
[1] "data science"
'Mike"s favorite course'
[1] "Mike\"s favorite course"
'Mike\'s favorite course'
[1] "Mike's favorite course"
Likewise, a string that starts with a double quote must end with a double quote. Single quotes can appear inside it freely, and a double quote can be included by escaping it with a backslash (\).
"data science"
[1] "data science"
"Mike's favorite course"
[1] "Mike's favorite course"
"Mike\"s favorite course"
[1] "Mike\"s favorite course"
Enter a few lines from Lewis Carroll’s Alice’s Adventures in Wonderland. Alice has just arrived at the tea party…
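When checking strings that contain quotes, it helps to remember that print() shows the escaped form, while writeLines() shows the characters that are actually stored. The short sketch below illustrates the difference; the variable name s is only for illustration.

```r
# A string containing a double quote, written with an escape sequence
s <- "Mike\"s favorite course"

print(s)        # console representation: the backslash is shown
writeLines(s)   # actual content: no backslash, just the quote

# The backslash is not part of the string, so it is not counted
nchar(s)
```

The escape character only exists in the source code and in the printed representation; the stored string is the same as 'Mike"s favorite course'.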
Concatenation joins two or more strings into one. In R, this is done with paste() and paste0():
arcadia <- "Arcadia"
uni <- "University"
paste(arcadia, uni)
[1] "Arcadia University"
paste(arcadia, uni, sep = "-")
[1] "Arcadia-University"
paste0(arcadia, uni)
[1] "ArcadiaUniversity"
The leading arguments are one or more character vectors, or objects that R converts to character vectors. The sep argument specifies the separator placed between them (a single space by default in paste(), none in paste0()).
If the arguments are vectors, they are concatenated term-by-term to give a character vector result.
myvar1 <- c("CS", "Data")
myvar2 <- c("229", "Science")
paste(myvar1, myvar2)
[1] "CS 229" "Data Science"
If a value is specified for collapse, the values in the result are then concatenated into a single string, with the elements being separated by the value of collapse.
paste(myvar1, myvar2, collapse = "-")
[1] "CS 229-Data Science"
paste0(myvar1, myvar2, collapse = "-")
[1] "CS229-DataScience"
Here, the output contains a - between “229” and “Data” because collapse joins the elements of the result vector (“CS 229” and “Data Science”) into a single string. In other words, sep separates the arguments within each element, while collapse is the top-level separator between elements. The default sep in paste() is a single space; paste0() uses no separator at all.
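To see the two separators working together, here is a small sketch that sets both sep and collapse in one call, and also shows that non-character arguments such as numbers are converted to character first. The separator strings and the names combined and labels are arbitrary choices for illustration.

```r
myvar1 <- c("CS", "Data")
myvar2 <- c("229", "Science")

# sep joins the arguments within each element,
# collapse then joins the resulting elements into one string
combined <- paste(myvar1, myvar2, sep = "_", collapse = " | ")
combined

# Numbers are coerced to character before joining
labels <- paste0("item", 1:3)
labels
```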
If the arguments are vectors of different lengths, paste() recycles the elements of the shorter vectors until they match the length of the longest one.
students <- c("Amy", "Blake", "Charlie")
paste(students, "is a sophomore.")
[1] "Amy is a sophomore." "Blake is a sophomore."
[3] "Charlie is a sophomore."
We can also add the collapse argument to obtain
paste(students, "is a sophomore", collapse = ", and ")
[1] "Amy is a sophomore, and Blake is a sophomore, and Charlie is a sophomore"
To find the total number of characters in a given string, we can use the nchar() function, NOT the length() function.
nchar("Data")
[1] 4
nchar("Science")
[1] 7
If you apply nchar to a vector of strings, it returns the length of each string:
students <- c("Amy", "Blake", "Charlie")
nchar(students)
[1] 3 5 7
The related function nzchar() checks whether a string is non-empty: nzchar(x) returns TRUE if x contains at least one character and FALSE if x is the empty string. Note that a string consisting of a single space is not empty. For example,
nzchar(" ")
[1] TRUE
nchar(" ")
[1] 1
nzchar("")
[1] FALSE
nchar("")
[1] 0
As can be seen above, an empty string has a length of 0.
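The reason length() is the wrong tool is that it counts the elements of a vector, not the characters inside each element. A minimal comparison, reusing the students vector from above:

```r
students <- c("Amy", "Blake", "Charlie")

n_elements   <- length(students)  # number of strings in the vector
n_characters <- nchar(students)   # number of characters in each string

n_elements
n_characters

# A single string is a vector of length 1, however long the string is
length("Data")
```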
We can easily modify the cases of characters using the toupper() and tolower() functions. As their names suggest, toupper() changes all the characters present to uppercase, while tolower() changes all the characters present to lowercase.
toupper("Every letter is changed to UPPER case.")
[1] "EVERY LETTER IS CHANGED TO UPPER CASE."
tolower("Every letter is changed to LOWER case.")
[1] "every letter is changed to lower case."
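One common use of these functions is comparing strings without regard to case. The sketch below assumes a hypothetical vector typed holding the same city name entered in different ways:

```r
# Hypothetical user input with inconsistent capitalization
typed <- c("philadelphia", "PHILADELPHIA", "Philadelphia")

# A direct comparison is case-sensitive
typed == "Philadelphia"

# Normalizing the case first makes every entry match
matches <- tolower(typed) == "philadelphia"
matches
```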
Here we are considering splitting a string into substrings. The substrings are separated by a delimiter. To do it, we can use strsplit(), which takes two arguments: the string and the delimiter of the substrings.
Let us take a look at an example. It is common for a string to contain multiple substrings separated by the same delimiter. One example is a filepath, whose components are separated by slashes /:
path <- "/Users/weihong_ni/MyProject/inputdata/training.csv"
We can split that path into its components by using strsplit() with / as the delimiter:
strsplit(path, "/")
[[1]]
[1] "" "Users" "weihong_ni" "MyProject" "inputdata"
[6] "training.csv"
Notice that the first “component” is actually an empty string because nothing preceded the first slash.
Also notice that strsplit() returns a list and that each element of the list is a vector of substrings. This two-level structure is necessary because the first argument can be a vector of strings. Each string is split into its substrings (a vector), and then those vectors are returned in a list.
If you are operating only on a single string, you can pop out the first element like this:
unlist(strsplit(path, "/"))
[1] "" "Users" "weihong_ni" "MyProject" "inputdata"
[6] "training.csv"
# or
strsplit(path, "/")[[1]]
[1] "" "Users" "weihong_ni" "MyProject" "inputdata"
[6] "training.csv"
The following example splits three file paths and returns a three-element list:
paths <- c(
"/Users/weihong_ni/MyProject/inputdata/training.csv",
"/Users/weihong_ni/MyProject/outputdata/results.csv",
"/Users/weihong_ni/MyProject/Rscripts/clean.R")
strsplit(paths, "/", fixed = TRUE)
[[1]]
[1] "" "Users" "weihong_ni" "MyProject" "inputdata"
[6] "training.csv"
[[2]]
[1] "" "Users" "weihong_ni" "MyProject" "outputdata"
[6] "results.csv"
[[3]]
[1] "" "Users" "weihong_ni" "MyProject" "Rscripts"
[6] "clean.R"
The second argument of strsplit() (the delimiter argument) is actually much more powerful than these examples indicate. It can be a regular expression, letting you match patterns far more complicated than a simple string. In fact, regular expressions are the default: to turn off the regular expression feature (and its interpretation of special characters), you must include the fixed = TRUE argument.
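The difference matters as soon as the delimiter is a character with special meaning in regular expressions. In the sketch below, the file name and the second example string are made up for illustration; the dot is the classic case, because as a regular expression it matches any character.

```r
fname <- "training.data.csv"

# "." as a regular expression matches every character,
# so every character is treated as a delimiter
as_regex <- strsplit(fname, ".")[[1]]
as_regex

# fixed = TRUE treats "." as a literal dot
as_fixed <- strsplit(fname, ".", fixed = TRUE)[[1]]
as_fixed

# A regular expression is useful when the delimiter varies:
# split on a comma or semicolon followed by optional spaces
mixed <- strsplit("a, b;c", "[,;] *")[[1]]
mixed
```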
We will first learn the substr() and substring() functions for extracting and replacing parts of a character string. Each takes three main arguments (the first name in each pair belongs to substr(), the second to substring()):
x or text: a character string.
start or first: an integer giving the position of the first character to return.
stop or last: an integer giving the position of the last character to return.
In the context of our course, distinguishing between substr() and substring() is not essential, as their functionality largely overlaps for our purposes. Therefore, we will use the two functions interchangeably.
For instance,
substring("football", 5, 8)
[1] "ball"
The above code prints out the 5th to the 8th character in the string “football”.
In fact, all the arguments can be vectors, in which case substr() will treat them as parallel vectors. From each string, it extracts the substring delimited by the corresponding entries in the starting and ending points. For instance,
substr(rep("abcdef", 4), 1:4, 4:5)
[1] "abcd" "bcde" "cd" "de"
This can facilitate some useful tricks. For example, the following code snippet extracts the last two characters from each string; each substring starts on the penultimate character of the original string and ends on the final character:
cities <- c("Philadelphia, PA", "New York, NY", "Los Angeles, CA")
substr(cities, nchar(cities) - 1, nchar(cities))
[1] "PA" "NY" "CA"
Let’s see another example where the characters get replaced.
word <- "football"
substring(word, 1, 4) <- "hand"
word
[1] "handball"
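Note that this kind of replacement works position by position and never changes the length of the string. The sketch below shows what happens when the replacement is longer than the slot, and one possible way around it by rebuilding the string with paste0(); the name longer is just for illustration.

```r
word <- "football"

# The slot covers positions 1 to 4, so only "bask" is used
substr(word, 1, 4) <- "basket"
word

# To swap in a replacement of a different length, rebuild the string:
# substring() with no end position runs to the end of the string
longer <- paste0("basket", substring("football", 5))
longer
```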
Replacement is often driven by a pattern that occurs in the string rather than by position. For this we use the sub() and gsub() functions. Both take three inputs:
pattern: the old substring to look for
replacement: the new substring
x: the target string
The sub() function finds the first instance of the old substring within the string and replaces it with the new substring.
str <- "Amy loves watching football. Amy plays football, too."
sub("football", "basketball", str)
[1] "Amy loves watching basketball. Amy plays football, too."
gsub() does the same thing, but it replaces all instances of the substring (a global replace), not just the first.
gsub("football", "basketball", str)
[1] "Amy loves watching basketball. Amy plays basketball, too."
To remove a substring altogether, simply set the new substring to be empty.
sub(", too", "", str)
[1] "Amy loves watching football. Amy plays football."
The pattern argument can be a regular expression, which allows you to match patterns much more complicated than a simple string. This is actually assumed by default, so you must set fixed = TRUE if you don’t want sub() and gsub() to interpret the pattern argument as a regular expression.
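A quick way to see why fixed = TRUE matters is to try removing periods. The string price below is a made-up example; as a regular expression, "." matches any character, so the first call removes everything.

```r
price <- "Total: 3.50 USD (approx.)"

# "." is a regular expression matching any character: nothing is left
as_regex <- gsub(".", "", price)
as_regex

# fixed = TRUE treats the pattern as the literal character "."
as_fixed <- gsub(".", "", price, fixed = TRUE)
as_fixed

# Escaping the dot is an alternative that keeps regular expressions on
escaped <- gsub("\\.", "", price)
escaped
```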
Often, we encounter tasks requiring the removal of either letters or numbers from a series of license plates. Consider the following list:
license <- c("AC372", "EAGLES", "KNL9270", "MPC2553")
To remove all letters (keeping only the numbers), we use the gsub() function, targeting uppercase letters as follows:
gsub("[A-Z]", "", license)
[1] "372" "" "9270" "2553"
Conversely, if we need to remove the numbers (keeping only the letters), we apply gsub() differently, as shown below:
gsub("[0-9]", "", license)
[1] "AC" "EAGLES" "KNL" "MPC"
In each case, gsub() searches for patterns (letters or numbers) and replaces them with an empty string, effectively removing the targeted characters.
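If the plates might contain lowercase letters, a pattern written for uppercase letters may miss them. The sketch below uses a hypothetical vector mixed_case and shows two possible variations: the ignore.case argument of gsub(), and a negated character class that removes everything except digits.

```r
# Hypothetical plates with inconsistent capitalization
mixed_case <- c("ac372", "Eagles", "KNL9270")

# Match letters regardless of case
digits_only <- gsub("[A-Z]", "", mixed_case, ignore.case = TRUE)
digits_only

# Alternatively, remove every character that is not a digit
digits_only2 <- gsub("[^0-9]", "", mixed_case)
digits_only2
```

The negated class is often the more robust choice when the goal is simply to keep the digits, since it also drops spaces and punctuation.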
Here, we will focus on how to handle dates and times using the Base R functions.
R has a variety of classes for working with dates and times, which is nice if you prefer having a choice but annoying if you prefer living simply. There is a critical distinction among the classes: some are date-only classes, some are datetime classes. All classes can handle calendar dates (e.g., March 15, 2019), but not all can represent a datetime (e.g., 11:45 AM on March 1, 2019).
The following classes are included in the base distribution of R:
Date
The Date class can represent a calendar date but not a clock time. It is a solid, general-purpose class for working with dates, including conversions, formatting, basic date arithmetic, and time-zone handling.
POSIXct
This is a datetime class, and it can represent a moment in time with an accuracy of one second. Internally, the datetime is stored as the number of seconds since January 1, 1970, and so it’s a very compact representation. This class is recommended for storing datetime information (e.g., in data frames).
POSIXlt
This is also a datetime class, but the representation is stored in a nine-element list that includes the year, month, day, hour, minute, and second. This representation makes it easy to extract date parts, such as the month or hour. Obviously, this is much less compact than the POSIXct class; hence, it is normally used for intermediate processing and not for storing data.
The base distribution also provides functions for easily converting between representations: as.Date(), as.POSIXct(), and as.POSIXlt().
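The following sketch illustrates the internal representations described above and the conversions between them. The date and time are arbitrary, and the time zone is set to UTC explicitly so that the results do not depend on the machine.

```r
d  <- as.Date("2024-03-15")
ct <- as.POSIXct("2024-03-15 11:45:00", tz = "UTC")

# Date: number of days since 1970-01-01
days_since_epoch <- unclass(d)
days_since_epoch

# POSIXct: number of seconds since 1970-01-01 00:00:00 UTC
secs_since_epoch <- as.numeric(ct)
secs_since_epoch

# POSIXlt: a list whose parts can be extracted by name
lt <- as.POSIXlt(ct)
lt$hour
lt$min

# Converting a datetime to Date drops the clock time
as.Date(ct)
```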
If you prefer more advanced tools to handle dates and times, refer to the following helpful packages that are available for downloading from CRAN:
chron
The chron package can represent both dates and times but without the added complexities of handling time zones and Daylight Saving Time. It’s therefore easier to use than Date but less powerful than POSIXct and POSIXlt. It would be useful for work in econometrics or time series analysis.
lubridate
This is a tidyverse package designed to make working with dates and times easier while keeping the important bells and whistles such as time zones. It’s especially clever regarding datetime arithmetic. This package introduces some helpful constructs like durations, periods, and intervals. lubridate is part of the tidyverse, so it is installed when you run install.packages('tidyverse'), but it is not part of “core tidyverse”, so it does not get loaded when you run library(tidyverse). This means you must explicitly load it by running library(lubridate).
mondate
This is a specialized package for handling dates in units of months in addition to days and years. It can be helpful in accounting and actuarial work, for example, where month-by-month calculations are needed.
timeDate
This is a high-powered package with well-thought-out facilities for handling dates and times, including date arithmetic, business days, holidays, conversions, and generalized handling of time zones. It was originally part of the Rmetrics software for financial modeling, where precision in dates and times is critical. If you have a demanding need for date facilities, consider this package.
Which class should you select? The article “Date and Time Classes in R” by Gabor Grothendieck and Thomas Petzoldt offers this general advice:
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
You may encounter the term “POSIXt” in the output. In the R programming environment, “POSIXt” is a virtual class from which the “POSIXct” and “POSIXlt” classes inherit. This structure facilitates operations, such as subtraction, that require the interaction of the two derived classes.
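As a small illustration of that shared parent class, the sketch below subtracts a POSIXct value from a POSIXlt value. The two timestamps are arbitrary, and UTC is specified so the result does not depend on the local time zone.

```r
start  <- as.POSIXct("2024-03-15 08:00:00", tz = "UTC")
finish <- as.POSIXlt("2024-03-15 11:45:00", tz = "UTC")

# Both objects inherit from the virtual class POSIXt
inherits(start, "POSIXt")
inherits(finish, "POSIXt")

# Subtraction works across the two classes and yields a difftime
elapsed <- finish - start
elapsed

# difftime() lets you choose the units explicitly
elapsed_mins <- difftime(finish, start, units = "mins")
elapsed_mins
```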
You can use the as.Date() function to convert a string, such as “2021-01-01”, to a Date object. However, you must know the format of the string. By default, as.Date() assumes the string looks like yyyy-mm-dd. To handle other formats, you must specify the format parameter of as.Date(). Use format = "%m/%d/%Y" if the date is in American style, for instance.
The following example shows the default format assumed by as.Date(), which is the ISO 8601 standard format of yyyy-mm-dd.
as.Date("2021-01-01")
[1] "2021-01-01"
The as.Date() function returns a Date object that is being converted here back to a string for printing; this explains the double quotes around the output.
The string can be in other formats, but you must provide a format argument so that as.Date() can interpret your string. Americans often mistakenly try to convert the usual American date format (mm/dd/yyyy) into a Date object, with these unhappy results:
as.Date("01/01/2021")
[1] "1-01-20"
Here is the correct way to convert an American-style date:
as.Date("01/01/2021", format = "%m/%d/%Y")
[1] "2021-01-01"
Observe that the Y in the format string is capitalized to indicate a four-digit year. If you’re using two-digit years, specify a lowercase y.
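A few more conversions, as a sketch with made-up input strings: a two-digit year with %y, a day-first format with dots, and a format that does not fit the data, which yields NA rather than an error.

```r
# Two-digit year: use lowercase %y
two_digit <- as.Date("01/31/21", format = "%m/%d/%y")
two_digit

# Day-first format with dots as separators
day_first <- as.Date("31.01.2021", format = "%d.%m.%Y")
day_first

# Wrong format for the data: there is no month 31, so the result is NA
mismatch <- as.Date("01/31/2021", format = "%d/%m/%Y")
mismatch
```

Checking for NA after a conversion is a simple way to catch strings that did not match the format you supplied.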
It is common for the input data to contain dates encoded as three numbers: year, month, and day. The ISOdate() function can combine them into a POSIXct object:
ISOdate(2020, 2, 29)
[1] "2020-02-29 12:00:00 GMT"
You can keep your date in the POSIXct format. However, when working with pure dates (not dates and times), we often convert to a Date object and truncate the unused time information:
as.Date(ISOdate(2020, 2, 29))
[1] "2020-02-29"
Trying to convert an invalid date results in NA:
ISOdate(2023, 2, 29) # 2023 is not a leap year
[1] NA
When we have a date represented by its year, month, and day in different variables and would like to merge these elements into a single Date object representation, we should think of the ISOdate() function. Its output is a POSIXct object that you can convert into a Date object:
year <- 2024
month <- 12
day <- 31
# the output of ISOdate is POSIXct
class(ISOdate(year, month, day))
[1] "POSIXct" "POSIXt"
# we can further convert it to Date
as.Date(ISOdate(year, month, day))
[1] "2024-12-31"
ISOdate() can process entire vectors of years, months, and days, which is quite handy for mass conversion of input data. The following example starts with the year/month/day numbers and then combines them all into Date objects.
years <- rep(2024, 5)
months <- 1:5
days <- c(15, 21, 20, 18, 17)
ISOdate(years, months, days)
[1] "2024-01-15 12:00:00 GMT" "2024-02-21 12:00:00 GMT"
[3] "2024-03-20 12:00:00 GMT" "2024-04-18 12:00:00 GMT"
[5] "2024-05-17 12:00:00 GMT"
# convert to Date
as.Date(ISOdate(years, months, days))
[1] "2024-01-15" "2024-02-21" "2024-03-20" "2024-04-18" "2024-05-17"
Purists will note that the vector of years is redundant and that the last expression can therefore be further simplified by invoking the Recycling Rule:
as.Date(ISOdate(2024, months, days))
[1] "2024-01-15" "2024-02-21" "2024-03-20" "2024-04-18" "2024-05-17"
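You can also extend this recipe to handle year, month, day, hour, minute, and second data by using the ISOdatetime() function (see the help page for details). Unlike ISOdate(), which defaults to GMT, ISOdatetime() uses your local time zone unless you pass the tz argument, so the zone printed below depends on the machine:
ISOdatetime(2024, 10, 01, 21, 30, 59)
[1] "2024-10-01 21:30:59 UTC"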
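To make the result independent of the machine it runs on, you can state the time zone explicitly. A minimal sketch, assuming UTC is the zone you want:

```r
# Passing tz makes the result reproducible across machines
utc <- ISOdatetime(2024, 10, 1, 21, 30, 59, tz = "UTC")
utc

# The time zone is stored as an attribute of the POSIXct object
attr(utc, "tzone")

# Same instant as the ISOdate() call with explicit hour, minute, second
same_instant <- utc == ISOdate(2024, 10, 1, 21, 30, 59)
same_instant
```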
Sometimes we also want to convert a Date object into a character string, usually because we want to print the date. Either the function format() or as.character() would help us with this.
format(Sys.Date())
[1] "2024-12-02"
as.character(Sys.Date())
[1] "2024-12-02"
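format() accepts a format argument that controls the formatting. (as.character() accepted the same argument in older versions of R, but recent versions warn when it is used, so format() is the safer choice.) We can use format = "%m/%d/%Y" to get American-style dates, for example:
format(Sys.Date(), format = "%m/%d/%Y")
[1] "12/02/2024"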
The format argument defines the appearance of the resulting string. Normal characters, such as slash (/) or hyphen (-), are simply copied to the output string. Each two-character combination of a percent sign (%) followed by another character has special meaning. Some common ones are:
%b: Abbreviated month name (“Jan”)
%B: Full month name (“January”)
%d: Day as a two-digit number
%m: Month as a two-digit number
%y: Year without century (00–99)
%Y: Year with century
See the help page for the strftime() function for a complete list of formatting codes.
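The codes can be combined freely with ordinary characters. The sketch below uses a fixed date rather than Sys.Date() so the results are reproducible; note that the month names produced by %b and %B depend on the locale of your R session, so no output is shown for that line.

```r
d <- as.Date("2024-12-02")

# Day.month.two-digit-year
short_form <- format(d, "%d.%m.%y")
short_form

# Four-digit year first, with slashes
slash_form <- format(d, "%Y/%m/%d")
slash_form

# Full month name: the text depends on the session locale
format(d, "%B %d, %Y")
```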
Given a Date object, we can extract the Julian date (which, in R, is the number of days since January 1, 1970) using either the as.integer() or the julian() function:
as.integer(as.Date("2024-01-01"))
[1] 19723
julian(as.Date("2024-01-01"))
[1] 19723
attr(,"origin")
[1] "1970-01-01"
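Because a Date is stored as a day count, date arithmetic reduces to ordinary arithmetic on those counts. A short sketch with two arbitrary dates:

```r
d1 <- as.Date("2024-01-01")
d2 <- as.Date("2024-12-31")

# Subtracting two dates gives a difference in days
gap <- d2 - d1
gap

# The same number falls out of the underlying day counts
gap_int <- as.integer(d2) - as.integer(d1)
gap_int

# Adding a number to a date moves it forward by that many days
later <- d1 + 30
later
```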
Given a Date object, you can extract a date part such as the day of the week, the day of the year, the calendar day, the calendar month, or the calendar year.
To do this, conversion to a POSIXlt object would be helpful. Recall that POSIXlt stores date elements in a list. Extracting the desired part in a date is just like extracting it from the list:
dt <- as.Date("2023-12-31")
plt <- as.POSIXlt(dt)
# Day of the month
plt$mday
[1] 31
# Month (0 = January)
plt$mon
[1] 11
# Year
plt$year + 1900
[1] 2023
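The same object also carries the day of the week and the day of the year, both counted from zero. The sketch below extracts them for the same date, and shows format() as one possible alternative when you would rather avoid the zero-based month and the 1900 offset.

```r
dt  <- as.Date("2023-12-31")
plt <- as.POSIXlt(dt)

# Day of the week (0 = Sunday)
plt$wday

# Day of the year (0 = January 1)
plt$yday

# Alternative: format the part you need and convert it to a number
month_num <- as.integer(format(dt, "%m"))
year_num  <- as.integer(format(dt, "%Y"))
month_num
year_num
```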
The POSIXlt object represents a date as a list of date parts. Convert your Date object to POSIXlt by using the as.POSIXlt() function, which will give you a list with these members:
sec: Seconds (0-61)
min: Minutes (0-59)
hour: Hours (0-23)
mday: Day of the month (1-31)
mon: Month (0-11)
year: Years since 1900
wday: Day of the week (0-6, 0 = Sunday)
yday: Day of the year (0-365)
isdst: Daylight Saving Time flag
Sometimes we need to create a sequence of dates, such as a sequence of daily, monthly, or annual dates. The seq() function is a generic function that has a version for Date objects. It can create a Date sequence similarly to the way it creates a sequence of numbers.
A typical use of seq() specifies a starting date (from), ending date (to), and increment (by). An increment of 1 indicates daily dates:
fromdt <- as.Date("2020-02-01")
todt <- as.Date("2020-03-01")
seq(from = fromdt, to = todt, by = 1)
[1] "2020-02-01" "2020-02-02" "2020-02-03" "2020-02-04" "2020-02-05"
[6] "2020-02-06" "2020-02-07" "2020-02-08" "2020-02-09" "2020-02-10"
[11] "2020-02-11" "2020-02-12" "2020-02-13" "2020-02-14" "2020-02-15"
[16] "2020-02-16" "2020-02-17" "2020-02-18" "2020-02-19" "2020-02-20"
[21] "2020-02-21" "2020-02-22" "2020-02-23" "2020-02-24" "2020-02-25"
[26] "2020-02-26" "2020-02-27" "2020-02-28" "2020-02-29" "2020-03-01"
Another typical use specifies a starting date (from), increment (by), and number of dates (length.out):
seq(from = fromdt, by = 1, length.out = 7)
[1] "2020-02-01" "2020-02-02" "2020-02-03" "2020-02-04" "2020-02-05"
[6] "2020-02-06" "2020-02-07"
The increment (by) is flexible and can be specified in days, weeks, months, or years:
fromdt <- as.Date("2024-01-01")
# First day of each month in one year
seq(from = fromdt, by = "month", length.out = 12)
[1] "2024-01-01" "2024-02-01" "2024-03-01" "2024-04-01" "2024-05-01"
[6] "2024-06-01" "2024-07-01" "2024-08-01" "2024-09-01" "2024-10-01"
[11] "2024-11-01" "2024-12-01"
# Quarterly dates for one year
seq(from = fromdt, by = "3 months", length.out = 4)
[1] "2024-01-01" "2024-04-01" "2024-07-01" "2024-10-01"
# Year-start dates for one decade
seq(from = fromdt, by = "year", length.out = 10)
[1] "2024-01-01" "2025-01-01" "2026-01-01" "2027-01-01" "2028-01-01"
[6] "2029-01-01" "2030-01-01" "2031-01-01" "2032-01-01" "2033-01-01"
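Two further variations, sketched with arbitrary dates: weekly steps between a start and an end date, and a negative increment that counts backwards from the starting date.

```r
# Weekly dates between two endpoints
weekly <- seq(as.Date("2024-01-01"), as.Date("2024-01-29"), by = "week")
weekly

# A negative increment steps backwards in time
backwards <- seq(as.Date("2024-01-05"), by = "-1 day", length.out = 3)
backwards
```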
Be careful with by = "month" near month-end. In the following example, the end of February overflows into March, which is probably not what you wanted:
seq(as.Date("2023-01-29"), by = "month", length.out = 3)
[1] "2023-01-29" "2023-03-01" "2023-03-29"
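If what you actually want is the last day of each month, one possible workaround is to generate the first day of each following month, which never overflows, and then step back one day. The names month_starts and month_ends are only for illustration.

```r
# First days of February, March, and April 2023
month_starts <- seq(as.Date("2023-02-01"), by = "month", length.out = 3)

# One day earlier gives the last days of January, February, and March
month_ends <- month_starts - 1
month_ends
```

This relies only on Date arithmetic in days, so it handles months of different lengths and leap years without special cases.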