A vector is essentially a collection of elements of the same kind. There are two types of vectors: atomic vectors and generic vectors. We will first discuss those that are one-dimensional and atomic. By atomic we mean that the elements within the vector cannot themselves hold other elements. Because of this simple structure, we can manipulate all the elements of a vector simultaneously.
R provides several functions for creating vectors. The most common is the c function, which is used for concatenation. For example, to create a vector that contains the elements 1 to 5, we could do the following.
c(1, 2, 3, 4, 5)
[1] 1 2 3 4 5
c(10, 100, 20, -4)
[1] 10 100 20 -4
c(2, 2, 2)
[1] 2 2 2
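As a small illustration of "elements of the same kind", the sketch below checks the type of a vector and shows what c does when it is given mixed inputs. The names nums and mixed are made up for this example.

```r
nums <- c(1, 2, 3)
typeof(nums)      # "double": every element has the same type
is.atomic(nums)   # TRUE

# Mixing numbers, text and logicals: c() converts everything to one type
mixed <- c(1, "a", TRUE)
typeof(mixed)     # "character"
mixed             # "1" "a" "TRUE"
```

Because a vector can hold only one kind of element, R silently converts the inputs to the most flexible type among them, here character.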
At times we need to generate a sequence of numbers. The : operator is used for this task.
1:5
[1] 1 2 3 4 5
0.1:5.1
[1] 0.1 1.1 2.1 3.1 4.1 5.1
Notice that the sequence has an increment of 1. This is too restrictive. What if we need a sequence with an increment of, say, 2 or even 0.2? We then use the function seq.
seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10
seq(1, 2, 0.1)
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
seq(1, 2, length.out = 11)
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
seq(c(1,4,6,8,-2))
[1] 1 2 3 4 5
sequence(1:3)
[1] 1 1 2 1 2 3
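The increments of 2 and 0.2 mentioned above can be requested with the by argument of seq. The sketch below also shows seq_len and seq_along, two related base R helpers that are not used elsewhere in these notes.

```r
seq(1, 10, by = 2)            # 1 3 5 7 9
seq(0, 1, by = 0.2)           # 0.0 0.2 0.4 0.6 0.8 1.0
seq(10, 1, by = -3)           # 10 7 4 1 (a negative increment counts down)
seq_len(4)                    # 1 2 3 4
seq_along(c(1, 4, 6, 8, -2))  # 1 2 3 4 5, like seq() on that vector
```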
Other vector functions include rep for repetition and length to determine the length.
Examples:
rep(2, 4)
[1] 2 2 2 2
rep(c(2,3), 4)
[1] 2 3 2 3 2 3 2 3
rep(c(2,3), each = 4)
[1] 2 2 2 2 3 3 3 3
rep(c(2,3), c(4,4))
[1] 2 2 2 2 3 3 3 3
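The examples above repeat every element the same number of times. As a further sketch (the name r is made up here), rep also accepts a different count per element or a target length, and length reports the size of the result.

```r
r <- rep(c(2, 3), times = c(1, 3))  # 2 once, then 3 three times
r                                   # 2 3 3 3
length(r)                           # 4
rep(c(2, 3), length.out = 5)        # 2 3 2 3 2 (stops once 5 values exist)
```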
With vectors, we can easily carry out arithmetic manipulations for each element simultaneously. That is because most of R’s functions are VECTORIZED, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time. This makes writing code more concise, easy to read, and less error prone.
c(1,2,3) + c(6,7,8)
[1] 7 9 11
log(c(-4, 10, 6, 8))
Warning in log(c(-4, 10, 6, 8)): NaNs produced
[1] NaN 2.302585 1.791759 2.079442
Notice that log(-4) produced NaN while the other elements gave results. This shows that the computation for the rest of the elements does not depend on the one that triggered the warning. Also, NaN is a numeric value that stands for "not a number".
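To make the idea of vectorization concrete, the sketch below writes the same addition twice: once with an explicit loop and once in vectorized form. The names x, y and out are chosen only for this illustration.

```r
x <- c(1, 2, 3)
y <- c(6, 7, 8)

# Explicit loop: add one pair of elements at a time
out <- numeric(length(x))
for (i in seq_along(x)) {
  out[i] <- x[i] + y[i]
}
out                    # 7 9 11

# Vectorized form: a single expression gives the same result
identical(out, x + y)  # TRUE
```

The loop needs bookkeeping (preallocating out and tracking the index i) that the vectorized expression avoids.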
Vector recycling occurs when we perform an operation such as addition or subtraction on two vectors of unequal length. When the vectors have equal length, the first value of vector 1 is combined with the first value of vector 2, the second with the second, and so on. When the lengths differ, the shorter vector is repeated until it matches the length of the longer vector, and the operation then proceeds element by element. This repetition of the shorter vector is known as vector recycling. Let us see vector recycling in action.
# creating a vector with
# the values 1 to 6
vec1 <- 1:6
# creating a vector with
# the values 1 to 2
vec2 <- 1:2
# adding vec1 and vec2
print(vec1 + vec2)
[1] 2 4 4 6 6 8
In vector recycling, the length of the longer vector should be a multiple of the length of the shorter vector. If it is not, we get a warning that the longer object length is not a multiple of the shorter object length. In the example above, the longer length (6) is a multiple of the shorter length (2), so we did not get a warning. In the example below it is not:
# creating a vector with the values 10 to 14
vec1 <- 10:14
# creating a vector with the values 3 to 5
vec2 <- 3:5
# adding vec1 and vec2
print(vec1 + vec2)
Warning in vec1 + vec2: longer object length is not a multiple of shorter
object length
[1] 13 15 17 16 18
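One way to picture recycling is as an implicit call to rep on the shorter vector. The sketch below uses made-up names (long, short, recycled) and is only a mental model, not a description of how R implements recycling internally.

```r
long  <- 1:6
short <- 1:2

# Repeat the shorter vector until it is as long as the longer one
recycled <- rep(short, length.out = length(long))
recycled                                  # 1 2 1 2 1 2

# Adding the recycled copy gives the same result as adding short directly
identical(long + short, long + recycled)  # TRUE

# A single number is a vector of length 1, so it is recycled as well
long * 10                                 # 10 20 30 40 50 60
```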
Vector recycling is one of the core ideas of the R programming language.
vec1 <- c(-3,-2,-2,-1,-1,1,2,3,3)
length of vector:
length(vec1)
[1] 9
Absolute value for each element of the vector
abs(vec1)
[1] 3 2 2 1 1 1 2 3 3
sum all the elements in the vector
sum(vec1)
[1] 0
mean of all the elements in the vector
mean(vec1)
[1] 0
median of all the elements in the vector
median(vec1)
[1] -1
minimum of all the elements in the vector
min(vec1)
[1] -3
Position of the minimum in the vector
which.min(vec1)
[1] 1
maximum value of all the elements in the vector
max(vec1)
[1] 3
Position of the maximum in the vector
which.max(vec1)
[1] 8
variance
var(vec1)
[1] 5.25
standard deviation
sd(vec1)
[1] 2.291288
covariance between vec1 and vec1
cov(vec1, vec1)
[1] 5.25
correlation
cor(vec1, vec1)
[1] 1
frequency table of the values in the vector
table(vec1)
vec1
-3 -2 -1 1 2 3
1 2 2 1 1 2
Sorting the vector in ascending order
sort(vec1)
[1] -3 -2 -2 -1 -1 1 2 3 3
sorting the vector in descending order
sort(vec1, decreasing = TRUE)
[1] 3 3 2 1 -1 -1 -2 -2 -3
ranking the vector
rank(vec1)
[1] 1.0 2.5 2.5 4.5 4.5 6.0 7.0 8.5 8.5
ranking vector and in case of ties, we take the minimum rank
rank(vec1,ties.method = 'min')
[1] 1 2 2 4 4 6 7 8 8
The indices that would arrange the vector in ascending order, so that vec1[order(vec1)] is sorted. Since vec1 is already sorted, the result is simply 1 to 9
order(vec1)
[1] 1 2 3 4 5 6 7 8 9
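Because vec1 is already sorted, the output above does not show how order, sort and rank differ. The sketch below uses a small unsorted vector u, a name made up for this illustration.

```r
u <- c(30, 10, 20)

order(u)     # 2 3 1: the smallest value is at position 2, then 3, then 1
u[order(u)]  # 10 20 30: indexing by order(u) sorts the vector
sort(u)      # 10 20 30
rank(u)      # 3 1 2: the position each element would take once sorted
```

In short, order answers "which element comes first?" while rank answers "where does this element end up?".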
Position that satisfy a condition:
which(vec1>=3) # Indices of vec1 where the value is greater than or equal to 3
[1] 8 9
First difference: the difference between the current value and the previous value, for all the elements in a vector
diff(vec1)
[1] 1 0 1 0 2 1 1 0
The unique values of a vector
unique(vec1)
[1] -3 -2 -1 1 2 3
Return a logical vector indicating whether each element duplicates an earlier element
duplicated(vec1)
[1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
Are all the values in the vector positive ie all greater than 0?
all(vec1>0)
[1] FALSE
Is any of the values in the vector greater than 0?
any(vec1>0)
[1] TRUE
Let us do some matching:
vec2 <- c(5,-2,-1,1,3,-3,7)
To obtain the position of elements of vec1 inside vec2
match(vec1, vec2) #Position of vec1 in vec2. Why NA?
[1] 6 2 2 3 3 4 NA 5 5
Ensure any element in vec1 that is not in vec2 is given the position value 0
match(vec1, vec2, 0)
[1] 6 2 2 3 3 4 0 5 5
Which values in vec1 are in vec2?
vec1 %in% vec2
[1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
match(vec1, vec2, 0) > 0
[1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
Which values in vec2 are in vec1?
vec2 %in% vec1
[1] FALSE TRUE TRUE TRUE TRUE TRUE FALSE
inner product
vec1 %*% vec1
[,1]
[1,] 42
sum(vec1^2)
[1] 42
euclidean norm/ magnitude of a vector
sqrt(sum(vec1^2))
[1] 6.480741
sqrt(vec1 %*% vec1)
[,1]
[1,] 6.480741
norm(vec1, '2')
[1] 6.480741
Element-wise minimum and maximum
pmin(-3:3, 0)
[1] -3 -2 -1 0 0 0 0
pmin(c(1,3,5,10), c(-1,4,3,11))
[1] -1 3 3 10
pmax(c(1,3,5,10), c(-1,4,3,11))
[1] 1 4 5 11
Need more functions? - we shall discuss this in class
findInterval(vec1, c(-2,0,2))
[1] 0 1 1 1 1 2 3 3 3
In computing, an attribute is defined as a piece of information which determines the properties of a field or tag in a database or a string of characters in a display. This is quite a lot of jargon. For now, understand an attribute to be extra information contained within an object. This information is not the main data; rather, the object carries it along, and it describes the object further. NB: So far we have not covered what an object is. In this lesson, take a vector to be the object.
The most common attribute that a vector can possess is element names. That is, each element in the vector can have a name. Let's look at the example below.
vec3 <- c(a = 1, b = 3, c = 5)
vec3
a b c
1 3 5
The first element of the vector above is named a while the last element is named c. These names are NOT the values of the vector. The vector still has the values 1, 3, 5 and in addition each element is named.
Notice that we can still do math manipulation on the vector as the values are numeric:
vec3 * 5
a b c
5 15 25
How can we access the names? By using the names function:
names(vec3)
[1] "a" "b" "c"
It is also possible to set the names to a vector that does not contain names:
vec_4 <- c(1, 3, 5)
names(vec_4) <- c("a", "b", "c")
vec_4
a b c
1 3 5
To remove the names, we simply set the names to NULL
names(vec_4) <- NULL
vec_4
[1] 1 3 5
Of course, we can add attributes to a vector by using the attr or the attributes functions.
point <- c(3,4,5)
attr(point, 'names') <- c('a', 'b', 'c')
point
a b c
3 4 5
Notice that R realized the metadata we added and printed the information accordingly.
Sometimes we need to add metadata that is not recognized by R. For example, assume we are calculating the value of a function but at the same time need the gradient score at that particular point. We could save this extra information as an attribute.
point <- c(3,4,5)
attr(point, 'gradient') <- 10
point
[1] 3 4 5
attr(,"gradient")
[1] 10
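The attributes function is mentioned above but not demonstrated. The sketch below shows one way it might be used to list all attributes at once and to remove them. The name all_attrs is made up for this example.

```r
point <- c(3, 4, 5)
attr(point, "names") <- c("a", "b", "c")
attr(point, "gradient") <- 10

# attributes() returns every attribute as a named list
all_attrs <- attributes(point)
names(all_attrs)         # contains "names" and "gradient"
attr(point, "gradient")  # 10: a single attribute by name

# Assigning NULL removes all attributes, leaving only the values
attributes(point) <- NULL
point                    # 3 4 5
```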
More information on this later.
Subsetting is the process of extracting components/elements from a vector to obtain a smaller vector. Notice that in R, an atomic object of length 1 is still considered a vector of length 1.
To accomplish this, we use the extraction functions, i.e. [ or [[ or even getElement, together with either the element position or, where the elements are named, the element name.
vec3 <- c(a = 1, b = 3, c = 5, d = NA) # d is a missing value
vec3[1] # Get the first element
a
1
vec3['a'] # Get the element named a
a
1
vec3[-2] # Remove the second element
a c d
1 5 NA
vec3[-c(2,4)]
a c
1 5
getElement(vec3, 'b')
[1] 3
getElement(vec3, 2)
[1] 3
vec3[vec3>3]
c <NA>
5 NA
vec3[vec3<=2]
a <NA>
1 NA
vec3[vec3>6]
<NA>
NA
vec3[!is.na(vec3)]
a b c
1 3 5
na.omit(vec3)
a b c
1 3 5
attr(,"na.action")
d
4
attr(,"class")
[1] "omit"
vec3 above contains a missing value represented by NA. Notice that I do not have quotes around NA, as it is a special value in R.
How would I compute the sum of vec3?
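The extra <NA> entries in the logical subsets above come from how comparisons treat missing values. A short sketch of that behaviour:

```r
NA > 3              # NA: the result of comparing an unknown value is unknown
NA == NA            # NA, so == cannot be used to test for missing values
is.na(c(1, NA, 5))  # FALSE TRUE FALSE: the right way to detect NA

# With quotes, "NA" is an ordinary character string, not a missing value
is.na("NA")         # FALSE
```

Since vec3 > 3 is NA for the element d, the subset vec3[vec3 > 3] contains an NA entry in that position.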
sum(vec3)
[1] NA
sum(na.omit(vec3))
[1] 9
sum(vec3, na.rm = TRUE)
[1] 9
max(vec3, na.rm = TRUE)
[1] 5
Note that the getElement function was introduced recently and does not support extracting more than one element.
For vectors, [[ is used when you want to drop the attributes (such as names). For example, notice the difference between the following two commands:
vec3[1]
a
1
vec3[[1]]
[1] 1
or even
vec3[['a']]
[1] 1
The extraction functions can also be used to replace values in a vector:
# Remind ourselves what vec3 is
vec3
a b c d
1 3 5 NA
vec3[3] <- 10
vec3['c'] <- 5
What happens if you use a position that does not exist?
vec3[-10]
a b c d
1 3 5 NA
vec3[0]
named numeric(0)
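The two cases above use a negative index and a zero index. As a further sketch, a positive index or a name that does not exist behaves differently under [ and [[. The names v and res are made up for this example.

```r
v <- c(a = 1, b = 3, c = 5)

v[10]   # NA (named <NA>): [ returns a missing value for a missing position
v["z"]  # NA as well, for a name that does not exist

# [[ is stricter: it raises an error instead of returning NA.
# tryCatch turns that error into a value so the script keeps running.
res <- tryCatch(v[[10]], error = function(e) "error caught")
res     # "error caught"
```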
Replace many at once
index <- vec3>1 & !is.na(vec3)
vec3[index] <- vec3[index] + 10
vec3
a b c d
1 13 15 NA
Suppose 5 exams were taken by 2 students. The scores of the exams are 98,90,70,92,87. Suppose you have a second vector which identifies the student who took each exam, i.e. 1,2,1,1,2, whereby student 1 scored the grades 98,70,92 and student 2 scored the grades 90,87. How can we find the mean for each student? What about the sum, sd, max, etc.?
This is considered grouping of data. Many functions can be used for this. The best known is tapply:
tapply(your_vector, grouping_vector, your_function)
marks <- c(98,90,70,92,87)
student <- c(1,2,1,1,2)
tapply(marks, student, mean)
1 2
86.66667 88.50000
tapply(marks, student, max)
1 2
98 90
Suppose we wanted to replace the values with their average instead of only computing the means?
One way we could do that is get the students, match them against their grade means then replace:
marks <- c(98,90,70,92,87)
student <- c(1,2,1,1,2)
means <- tapply(marks, student, mean)
means[student]
1 2 1 1 2
86.66667 88.50000 86.66667 86.66667 88.50000
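Indexing means by student works here because the student ids happen to be 1 and 2, which coincide with positions in means. The sketch below assumes hypothetical ids 10 and 20 to show a lookup by name, which does not rely on that coincidence.

```r
marks   <- c(98, 90, 70, 92, 87)
student <- c(10, 20, 10, 10, 20)  # hypothetical ids that are not 1 and 2

means <- tapply(marks, student, mean)
names(means)                      # "10" "20": the groups become names

# Look each student up by name rather than by position
by_name <- means[as.character(student)]
by_name

# Indexing by the raw ids asks for positions 10 and 20, which do not exist
by_position <- means[student]     # all NA
```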
Alternatively, we could use a function known as ave:
marks <- c(98,90,70,92,87)
student <- c(1,2,1,1,2)
ave(marks, student)
[1] 86.66667 88.50000 86.66667 86.66667 88.50000
What if there was an NA? How would you approach that?
marks <- c(98,90,70,92,87, NA)
student <- c(1,2,1,1,2,2)
tapply(marks, student, mean, na.rm = TRUE) # na.rm = TRUE is passed on to mean; it is not an argument of tapply itself
1 2
86.66667 88.50000
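The same question arises for ave. Its grouping vectors are passed through ..., so the function has to be supplied by name through FUN. A minimal sketch, with with_na and without_na as made-up names:

```r
marks   <- c(98, 90, 70, 92, 87, NA)
student <- c(1, 2, 1, 1, 2, 2)

# Default FUN is mean, so every entry of student 2 becomes NA
with_na <- ave(marks, student)
with_na

# Supplying a function that drops NA before averaging
without_na <- ave(marks, student, FUN = function(x) mean(x, na.rm = TRUE))
without_na
```

Here the last position, which held NA, is filled with the mean of student 2's observed marks.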
Sequence generation: Write R code to generate the following sequences:
1,1,2,1,2,3,1,2,3,4,1,2,3,4,5
1,2,3,1,2,3,1,2,3,1,2
1,1,1,2,2,2,3,3,3,1,1
Set Operations: Given that the first vector A contains the values 3,3,4,4,4,10,-2 while the second vector B contains the values 4,3,6,-1, obtain the following in R:
a. The unique values of A and of B
b. The frequency of A, i.e. a table showing the number of times each unique element occurs, e.g. 3 occurs 2 times
c. Which values in A are in B? What about values in B that are in A?
d. Obtain the position of the elements in A in the vector B ie the first element in A is in position 2 in B
e. What is the intersection, union, set difference of the unique values of A and B?
f. Obtain the position of the duplicated values in A
g. Find the cumulative sum, product, cumulative minimum and maximum of A
Given the data 4,7,2,8,1,1,2 compute the standard deviation, \(sd = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)^2}\) where \(\bar x = \frac{1}{n}\sum_{i=1}^nx_i\)
Triangle numbers: Obtain the first 5 triangle numbers.
Nearest neighbor: Given a vector x, for every element in x find the closest value in THAT SAME vector excluding the element in question. Let x be the vector below:
1 5 6 2 3 0 5 2 1 9
The results should be
1 5 5 2 2 1 5 2 1 6
I.e. the first number, 1, is closest to 1, which is the second-last element. The last number, 9, is closest to 6.
Write a function named nearest_neighbor that would solve the above for any vector x.
Dense Ranking: Suppose 8 students did an exam and the results were as follows: 98, 98, 96, 93, 85, 80, 85, 91. Rank the students to obtain the following results.
Notice that normal ranking would give us the results:
1 1 3 4 6 8 6 5
And the dense rank would give us:
1 1 2 3 5 6 5 4
Use R to obtain the results above.
Write an R function that performs the dense rank on any given vector. Name the function dense_rank; it should take two parameters: a numeric vector x and a logical parameter decreasing.
Value redistribution: Suppose we have a vector with many zeros.
v <- c(3,0,0,5,0,0,0,10,0,0,0,0)
We want to distribute the nonzero numbers forward: each nonzero number and the zeros that follow it (up to the next nonzero number) are replaced by their average. For example, (3,0,0) should be replaced by (1,1,1), since
(3+0+0)/3=1
v should become
(1,1,1,1.25,1.25,1.25,1.25,2,2,2,2,2)
How can we solve this using R?
searchsorted: Given two vectors, a and v, find the indices where elements should be inserted to maintain order. I.e. find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved.
a <- 1:5
v <- c(-10,10,2,3)
The result should be:
1, 6, 2, 3
or, counting positions from zero, 0, 5, 1, 2
Write a function named search_sorted that would accomplish the task above given any two arrays. Let the first array be the sorted array a and the second array be v. Refer to the week 1 notes on how to write a basic function.
(Continuation from 7) Using the idea above, assume the vector is not sorted. You need to determine the index of the first element in vector a that is greater than the element in question in vector v.
Write a function named search_unsorted to accomplish this. Test it on the two vectors below.
Example: suppose we have the data below:
x <- c(1, -3, 5, 10, 13, 4, 8, 20, 24)
y <- c(2, 17, 23, -10, 12)
We see that the results should be 3, 8, 9, 1, 5
or 2, 7, 8, 0, 4
How? First, \(2\le5\), so the index of 5 is 3. Then \(17\le20\), so the index of 20 is 8, etc.
Write code to obtain the results above.
Run Length Encoding: Given a vector \(x\) obtain its RLE. RLE is defined as a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run.(Wikipedia)
Example:
Suppose we have: 1, 1, 1, 3, 3, 3, 3, 1, 1, 1, 1, 1, 2, 2, 3, 2, 2, 2
We see that we have 3 1’s, followed by 4 3’s, followed by 5 1’s, then 2 2’s, 1 3 and finally 3 2’s.
Write a function named my_rle that would solve the task at hand. Note that you should output the vector of values, with a lengths attribute that contains the lengths.
Compare your output to the output of the function call below:
x <- c(1,1,1,3,3,3,3,1,1,1,1,1,2,2,3,2,2)
rle(x)
Run Length Encoding
lengths: int [1:6] 3 4 5 2 1 2
values : num [1:6] 1 3 1 2 3 2
(Continuation from 9): Write a function named rle_id that would create a grouping vector for each run of a vector \(x\), i.e. when you run your function on the vector x above, i.e. rle_id(x), your results should be: 1,1,1,2,2,2,2,3,3,3,3,3,4,4,5,6,6
Also write a function named row_id that would create an element id within each rle group, so that calling it on x above, i.e. row_id(x), should result in: 1,2,3,1,2,3,4,1,2,3,4,5,1,2,1,1,2