Vectors

A vector is essentially an ordered collection of elements of the same kind. There are two types of vectors: atomic vectors and generic vectors (lists). We will first discuss those that are one-dimensional and atomic. By atomic we mean that the elements within the vector cannot themselves hold other elements. Because of this simple structure, we can manipulate all the elements within a vector simultaneously.

Vector Creation

R provides several functions for creating vectors. The most common is the c function, which is used for concatenation. For example, to create a vector that contains the elements 1 to 5, we could do the following.

c(1, 2, 3, 4, 5)

[1] 1 2 3 4 5

c(10, 100, 20, -4)

[1]  10 100  20  -4

c(2, 2, 2)

[1] 2 2 2
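
Since all elements of an atomic vector must be of the same kind, it helps to see what c does when we mix kinds or nest calls. The following is a small sketch; the variable names mixed and nested are only for illustration.

```r
# Mixing kinds: every element is coerced to the most flexible kind present
mixed <- c(1, "a", TRUE)
class(mixed)        # "character"
mixed               # "1" "a" "TRUE"

# Logical combined with numeric becomes numeric (TRUE becomes 1)
class(c(TRUE, 2))   # "numeric"

# Nested calls to c() are flattened into a single one-dimensional vector
nested <- c(c(1, 2), c(3, c(4, 5)))
length(nested)      # 5
```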

At times we need to generate a sequence of numbers. The : operator is used for this task.

1:5

[1] 1 2 3 4 5

0.1:5.1

[1] 0.1 1.1 2.1 3.1 4.1 5.1

Notice that the sequence always has an increment of 1, which is often too restrictive. What if we need a sequence with an increment of, say, 2 or even 0.2? We then use the function seq.

seq(10)

 [1]  1  2  3  4  5  6  7  8  9 10

seq(1, 10)

 [1]  1  2  3  4  5  6  7  8  9 10

seq(1, 2, 0.1)

 [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0

seq(1, 2, length.out = 11)

 [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0

seq(c(1,4,6,8,-2))

[1] 1 2 3 4 5

sequence(1:3)

[1] 1 1 2 1 2 3
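
The increments of 2 and 0.2 mentioned earlier can be obtained with the by argument of seq. The sketch below also shows seq_len and seq_along, two helpers from base R that are often safer than the : operator when a length may be zero. The variable names are only for illustration.

```r
# Increment of 2
step_two <- seq(1, 10, by = 2)      # 1 3 5 7 9

# Increment of 0.2
step_fine <- seq(0, 1, by = 0.2)    # 0.0 0.2 0.4 0.6 0.8 1.0

# A negative increment counts downwards
step_down <- seq(10, 1, by = -3)    # 10 7 4 1

# seq_len(n) gives 1..n and seq_along(x) gives 1..length(x)
seq_len(4)                          # 1 2 3 4
seq_along(c(1, 4, 6, 8, -2))        # 1 2 3 4 5

# With a length of 0, seq_len returns an empty vector while 1:0 counts down
seq_len(0)                          # integer(0)
1:0                                 # 1 0
```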

Other vector functions include rep for repetition and length to determine the length.

Examples:

rep(2, 4)

[1] 2 2 2 2

rep(c(2,3), 4)

[1] 2 3 2 3 2 3 2 3

rep(c(2,3), each = 4)

[1] 2 2 2 2 3 3 3 3

rep(c(2,3), c(4,4))

[1] 2 2 2 2 3 3 3 3
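
A few more variations of rep, together with length, are sketched below. The arguments times, each and length.out all belong to the base R rep function; the variable names are only for illustration.

```r
# Repeat each element a different number of times
uneven <- rep(c(2, 3), times = c(1, 3))        # 2 3 3 3

# each is applied first, then the whole result is repeated times times
both <- rep(c(2, 3), each = 2, times = 2)      # 2 2 3 3 2 2 3 3

# Repeat until a desired length is reached
cut_off <- rep(c(2, 3), length.out = 5)        # 2 3 2 3 2

# length counts the number of elements
length(rep(c(2, 3), each = 4))                 # 8
```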

With vectors, we can easily carry out arithmetic manipulations on every element simultaneously. That is because most of R’s functions are VECTORIZED, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time. This makes code more concise, easier to read, and less error prone.

c(1,2,3) + c(6,7,8)

[1]  7  9 11
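
To see what vectorization saves us, here is a sketch of the same addition written as an explicit loop. The names x, y, loop_sum and vec_sum are only for illustration.

```r
x <- c(1, 2, 3)
y <- c(6, 7, 8)

# Explicit loop: add the elements one position at a time
loop_sum <- numeric(length(x))
for (i in seq_along(x)) {
  loop_sum[i] <- x[i] + y[i]
}

# Vectorized: a single expression does the same work
vec_sum <- x + y

identical(loop_sum, vec_sum)   # TRUE
```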

log(c(-4, 10, 6, 8))

Warning in log(c(-4, 10, 6, 8)): NaNs produced

[1]      NaN 2.302585 1.791759 2.079442

Notice that log(-4) produced NaN while the others gave results. This shows that the computation for the rest of the elements does not depend on the one that raised the warning. Also, NaN is a special numeric value that stands for "not a number".
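
If we want to detect the NaN values, or avoid producing them in the first place, one possible approach is sketched below. The names vals, logs, positive and safe_logs are only for illustration, and marking the non-positive entries as NA is just one choice among several.

```r
vals <- c(-4, 10, 6, 8)

# Compute the logs, silencing the warning we already know about
logs <- suppressWarnings(log(vals))

# is.nan flags the elements that are NaN
is.nan(logs)                 # TRUE FALSE FALSE FALSE

# Alternative: only take logs of the positive values, leave the rest as NA
positive <- vals > 0
safe_logs <- rep(NA_real_, length(vals))
safe_logs[positive] <- log(vals[positive])
```

Note that is.na returns TRUE for both NA and NaN, while is.nan returns TRUE only for NaN.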

Vector Recycling

Vector recycling occurs when we perform an operation such as addition or subtraction on two vectors of unequal length. When the vectors have equal length, the operation is carried out element by element: the first value of vector 1 is combined with the first value of vector 2, the second with the second, and so on. When the lengths differ, the elements of the shorter vector are repeated until it matches the length of the longer vector, and the operation then proceeds element by element. This repetition of the shorter vector is known as vector recycling, and it is a built-in property of vectors in R. Let us see vector recycling in action.

# create a vector with the values 1 to 6
vec1 <- 1:6

# create a vector with the values 1 to 2
vec2 <- 1:2

# add vec1 and vec2
print(vec1 + vec2)

[1] 2 4 4 6 6 8

For recycling to work cleanly, the length of the longer vector should be a multiple of the length of the shorter vector. If it is not, R still recycles but issues a warning that the longer object length is not a multiple of the shorter object length. In the example above, the longer length (6) is a multiple of the shorter length (2), so we did not get a warning. In the example below it is not:

# creating vector with 10 to 14 values
vec1 <- 10:14

# creating vector with 3 to 5 values
vec2 <- 3:5

# adding vector1 and vector2
print(vec1 + vec2)

Warning in vec1 + vec2: longer object length is not a multiple of shorter
object length

[1] 13 15 17 16 18

Vector recycling is one of the core ideas behind the R programming language.
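
To make the recycling explicit, the sketch below builds the repeated shorter vector by hand with rep_len and checks that it gives the same result as letting R recycle. The name recycled is only for illustration.

```r
vec1 <- 1:6
vec2 <- 1:2

# What R does behind the scenes: stretch vec2 to the length of vec1
recycled <- rep_len(vec2, length(vec1))     # 1 2 1 2 1 2

identical(vec1 + vec2, vec1 + recycled)     # TRUE

# A single number is a vector of length 1, so it is recycled too
vec1 * 10                                   # 10 20 30 40 50 60
```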

Vector Manipulation

vec1 <- c(-3,-2,-2,-1,-1,1,2,3,3)

length of vector:

length(vec1)

[1] 9

Absolute value for each element of the vector

abs(vec1)

[1] 3 2 2 1 1 1 2 3 3

sum all the elements in the vector

sum(vec1)

[1] 0

mean of all the elements in the vector

mean(vec1)

[1] 0

median of all the elements in the vector

median(vec1)

[1] -1

minimum of all the elements in the vector

min(vec1)

[1] -3

Position of the minimum in the vector

which.min(vec1)

[1] 1

maximum value of all the elements in the vector

max(vec1)

[1] 3

Position of the maximum in the vector

which.max(vec1)

[1] 8

variance

var(vec1)

[1] 5.25

standard deviation

sd(vec1)

[1] 2.291288

covariance between vec1 and vec1

cov(vec1, vec1)

[1] 5.25

correlation

cor(vec1, vec1)

[1] 1

frequency table of the values in the vector

table(vec1)

vec1
-3 -2 -1  1  2  3 
 1  2  2  1  1  2

Sorting the vector in ascending order

sort(vec1)

[1] -3 -2 -2 -1 -1  1  2  3  3

sorting the vector in descending order

sort(vec1, decreasing = TRUE)

[1]  3  3  2  1 -1 -1 -2 -2 -3

ranking the vector

rank(vec1)

[1] 1.0 2.5 2.5 4.5 4.5 6.0 7.0 8.5 8.5

ranking vector and in case of ties, we take the minimum rank

rank(vec1,ties.method = 'min')

[1] 1 2 2 4 4 6 7 8 8

The indices that would sort the vector, ie vec1[order(vec1)] gives the sorted vector. Since vec1 is already sorted, the result here is simply 1 to 9

order(vec1)

[1] 1 2 3 4 5 6 7 8 9
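
Because vec1 is already sorted, the output of order above is not very revealing. The sketch below uses a small unsorted vector, with the hypothetical name scores, to contrast order with rank.

```r
scores <- c(30, 10, 20)

# order: which positions to pick, in turn, to get a sorted vector
order(scores)              # 2 3 1

# rank: where each element would land after sorting
rank(scores)               # 3 1 2

# Indexing by order() reproduces sort()
scores[order(scores)]      # 10 20 30
identical(scores[order(scores)], sort(scores))   # TRUE
```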

Positions that satisfy a condition:

which(vec1>=3) # Indices of vec1 where it is greater than or equal to 3

[1] 8 9

First difference: The difference between the current value and the previous value for all the elements in a vector

diff(vec1)

[1] 1 0 1 0 2 1 1 0

The unique values of a vector

unique(vec1)

[1] -3 -2 -1  1  2  3

Return a logical value for each element indicating whether it duplicates an earlier element

duplicated(vec1)

[1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE

Are all the values in the vector positive ie all greater than 0?

all(vec1>0)

[1] FALSE

Is any of the values in the vector greater than 0?

any(vec1>0)

[1] TRUE

Let us do some matching:

vec2 <- c(5,-2,-1,1,3,-3,7)

To obtain the position of elements of vec1 inside vec2

match(vec1, vec2) #Position of vec1 in vec2. Why NA?

[1]  6  2  2  3  3  4 NA  5  5

Give any element in vec1 that is not in vec2 the position value 0

match(vec1, vec2, 0)

[1] 6 2 2 3 3 4 0 5 5

Which values in vec1 are in vec2?

vec1 %in% vec2

[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

match(vec1, vec2, 0) > 0

[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

Which values in vec2 are in vec1?

vec2 %in% vec1

[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

inner product

vec1 %*% vec1

     [,1]
[1,]   42

sum(vec1^2)

[1] 42

Euclidean norm (magnitude) of a vector

sqrt(sum(vec1^2))

[1] 6.480741

sqrt(vec1 %*% vec1)

         [,1]
[1,] 6.480741

norm(vec1, '2')

[1] 6.480741

Element-wise minimum and maximum

pmin(-3:3, 0)

[1] -3 -2 -1  0  0  0  0

pmin(c(1,3,5,10), c(-1,4,3,11))

[1] -1  3  3 10

pmax(c(1,3,5,10), c(-1,4,3,11))

[1]  1  4  5 11

Need more functions? - we shall discuss this in class

findInterval(vec1, c(-2,0,2))

[1] 0 1 1 1 1 2 3 3 3
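
As a hint of what findInterval is doing: for each value it reports how many of the break points are less than or equal to that value. The sketch below checks this on a shorter vector; the names breaks, x and counts are only for illustration.

```r
breaks <- c(-2, 0, 2)
x <- c(-3, -2, -1, 1, 2, 3)

findInterval(x, breaks)      # 0 1 1 2 3 3

# The same result by counting, for each value, the breaks that are <= it
counts <- sapply(x, function(v) sum(breaks <= v))
counts                       # 0 1 1 2 3 3
```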

Vector Attributes

In computing, an attribute is defined as a piece of information which determines the properties of a field or tag in a database or a string of characters in a display. This is quite a lot of jargon. For now, understand an attribute to be extra information contained within an object. This information is not the main data; rather, the object carries it along, and it describes the object further. NB: So far we have not tackled what an object is. In this lesson, take a vector to be the object.

The most common attribute that a vector can possess is element names. That is, each element in the vector can have a name. Let's look at the example below.

vec3 <- c(a = 1, b = 3, c = 5)
vec3

a b c 
1 3 5

The first element of the vector above is named a while the last element is named c. These names are NOT the values of the vector. The vector still has the values 1, 3, 5 and in addition each element is named.

Notice that we can still do math manipulation on the vector as the values are numeric:

vec3 * 5

 a  b  c 
 5 15 25

How can we access the names? By using the names function:

names(vec3)

[1] "a" "b" "c"

It is also possible to set names on a vector that does not have names:

vec_4 <- c(1, 3, 5)
names(vec_4) <- c("a", "b", "c")
vec_4

a b c 
1 3 5

To remove the names, we simply set the names to NULL

names(vec_4) <- NULL
vec_4

[1] 1 3 5

Of course we can add attributes to a vector by using the attr or the attributes functions.

point <- c(3,4,5)
attr(point, 'names') <- c('a', 'b', 'c')
point

a b c 
3 4 5

Notice that R recognized the metadata we added and printed the information accordingly.

Sometimes we need to add metadata that is not recognized by R. For example, assume we are calculating the value of a function but at the same time need the gradient score at that particular point. We could save this extra information as an attribute.

point <- c(3,4,5)
attr(point, 'gradient') <- 10
point

[1] 3 4 5
attr(,"gradient")
[1] 10

More information on this later.
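
In the meantime, here is a short sketch of the attributes function, which lists everything an object carries along. It also shows that subsetting with [ keeps the names but drops a custom attribute such as gradient, and that setting an attribute to NULL removes it.

```r
point <- c(a = 3, b = 4, c = 5)
attr(point, "gradient") <- 10

# All attributes at once, as a named list
attributes(point)

# A single attribute by name
attr(point, "gradient")            # 10

# Subsetting keeps names but drops the custom attribute
attributes(point[1:2])             # only $names remains

# Setting an attribute to NULL removes it
attr(point, "gradient") <- NULL
names(attributes(point))           # "names"
```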

Vector Sub-setting

This is the process of extracting components/elements from a vector to obtain a smaller vector. Notice that in R, a single atomic value is still considered a vector of length 1.

To accomplish this, we use the extraction functions, ie [ or [[ or even getElement, together with either the element position or, where the elements have names, the element name.

Using Position and Names

vec3 <- c(a = 1, b = 3, c = 5, d = NA) # d is a missing value
vec3[1] #Get the first element

a 
1

vec3['a'] #Get element named a

a 
1

vec3[-2] #remove the second element

 a  c  d 
 1  5 NA

vec3[-c(2,4)]

a c 
1 5

getElement(vec3, 'b')

[1] 3

getElement(vec3, 2)

[1] 3

vec3[vec3>3]

   c <NA> 
   5   NA

vec3[vec3<=2]

   a <NA> 
   1   NA

vec3[vec3>6]

<NA> 
  NA

vec3[!is.na(vec3)]

a b c 
1 3 5

na.omit(vec3)

a b c 
1 3 5 
attr(,"na.action")
d 
4 
attr(,"class")
[1] "omit"

vec3 above contains a missing value represented by NA. Notice that there are no quotes around NA, as it is a special value in R.
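
The NA entries that appeared when subsetting with a condition come from the comparison itself: comparing NA with anything gives NA. The sketch below shows two ways of keeping them out of the result. It redefines vec3 so that it can be run on its own; the names via_which and via_isna are only for illustration.

```r
vec3 <- c(a = 1, b = 3, c = 5, d = NA)

# The comparison is NA wherever the value is NA
vec3 > 3                             # FALSE FALSE TRUE NA

# which() keeps only the positions that are TRUE, so the NA is dropped
via_which <- vec3[which(vec3 > 3)]

# Or exclude missing values explicitly in the condition
via_isna <- vec3[!is.na(vec3) & vec3 > 3]

via_which                            # c = 5
identical(via_which, via_isna)       # TRUE
```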

How would we compute the sum of vec3?

sum(vec3)

[1] NA

sum(na.omit(vec3))

[1] 9

sum(vec3, na.rm = TRUE)

[1] 9

max(vec3, na.rm = TRUE)

[1] 5

Note that the getElement function extracts exactly one element; it does not support extracting more than one element at a time.

For vectors, [[ extracts a single element and drops its name. For example, notice the difference between the following two commands

vec3[1]

a 
1

vec3[[1]]

[1] 1

or even

vec3[['a']]

[1] 1

The extraction functions can also be used to replace values in a vector

vec3 # Remind ourselves what vec3 is

 a  b  c  d 
 1  3  5 NA

vec3[3] <- 10
vec3['c'] <- 5

What happens if you use a position that does not exist?

vec3[-10]

 a  b  c  d 
 1  3  5 NA

vec3[0]

named numeric(0)
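
The cases above use a negative position and a zero position. For completeness, the sketch below shows what happens with a positive position or a name that does not exist. It redefines vec3 so that it can be run on its own; the name out_of_range is only for illustration.

```r
vec3 <- c(a = 1, b = 3, c = 5, d = NA)

# [ with a position or name that does not exist returns NA
vec3[10]
vec3["z"]

# [[ is stricter and raises a "subscript out of bounds" error,
# which we catch here so that the script keeps running
out_of_range <- tryCatch(vec3[[10]], error = function(e) "error")
out_of_range                 # "error"
```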

Replace many at once

index <- vec3>1 & !is.na(vec3)
vec3[index] <- vec3[index] + 10
vec3

 a  b  c  d 
 1 13 15 NA

Advanced vector functions

Suppose 5 exams were taken by 2 students. The scores of the exams are 98,90,70,92,87. Suppose you have a second vector which records the student who took each exam, ie 1,2,1,1,2, whereby student 1 scored 98,70,92 and student 2 scored 90,87. How can we find the mean for each student? What about the sum, sd, max, etc.?

This is known as grouping of data. Many functions can be used for it. The best known one is tapply:

tapply(your_vector, grouping_vector, your_function)

marks <- c(98,90,70,92,87)
student <- c(1,2,1,1,2)
tapply(marks, student, mean)

       1        2 
86.66667 88.50000

tapply(marks, student, max)

 1  2 
98 90

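Suppose we wanted to replace each mark with the average of the corresponding student instead of only computing the means?

One way to do that is to take the student vector, match it against the vector of means, then replace. Indexing means by student works here because the student ids 1 and 2 coincide with the positions of their means: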

marks <- c(98,90,70,92,87)
student <- c(1,2,1,1,2)
means <- tapply(marks, student, mean)
means[student]

       1        2        1        1        2 
86.66667 88.50000 86.66667 86.66667 88.50000
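
When the student ids are not simply 1, 2, ... the positional lookup no longer works, and we can look the means up by name instead. The sketch below assumes hypothetical ids 10 and 20; the names ids, id_means, by_position and by_name are only for illustration.

```r
marks <- c(98, 90, 70, 92, 87)
ids <- c(10, 20, 10, 10, 20)        # hypothetical student ids

id_means <- tapply(marks, ids, mean)

# Positional lookup asks for positions 10 and 20, which do not exist
by_position <- id_means[ids]        # all NA

# Lookup by name uses the group labels "10" and "20"
by_name <- id_means[as.character(ids)]
as.vector(by_name)                  # 86.66667 88.50000 86.66667 86.66667 88.50000
```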

Alternatively, we could use the function ave:

marks <- c(98,90,70,92,87)
student <- c(1,2,1,1,2)
ave(marks, student)

[1] 86.66667 88.50000 86.66667 86.66667 88.50000

What if there was an NA? How would you approach that?

marks <- c(98,90,70,92,87, NA)
student <- c(1,2,1,1,2,2)
tapply(marks, student, mean, na.rm =TRUE) # The na.rm =TRUE is for the mean and not tapply

       1        2 
86.66667 88.50000
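
The same idea carries over to ave, which takes the function through its FUN argument. The argument must be named, since unnamed arguments are treated as grouping vectors. The sketch below also fills the missing mark with the student's mean, which is only one possible way of handling it; the names group_means and filled are only for illustration.

```r
marks <- c(98, 90, 70, 92, 87, NA)
student <- c(1, 2, 1, 1, 2, 2)

# Without na.rm, every mark of student 2 becomes NA
ave(marks, student)

# Pass a function that ignores missing values
group_means <- ave(marks, student, FUN = function(x) mean(x, na.rm = TRUE))
group_means      # 86.66667 88.50000 86.66667 86.66667 88.50000 88.50000

# One possible use: replace the missing mark with the student's mean
filled <- ifelse(is.na(marks), group_means, marks)
filled           # 98.0 90.0 70.0 92.0 87.0 88.5
```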

Exercise 5

Sequence generation: Write R code to generate the following sequences:
1. 1,1,2,1,2,3,1,2,3,4,1,2,3,4,5
2. 1,2,3,1,2,3,1,2,3,1,2
3. 1,1,1,2,2,2,3,3,3,1,1
Set Operations: Given that the first vector A contains the values 3,3,4,4,4,10,-2 while the second vector B contains the values 4,3,6,-1, obtain the following in R:

a. The unique values of A and of B

b. The frequency of `A` ie a table showing the number of times each unique element occurs. eg 3 occurs 2 times

c. Which values in A are in B? What about values in B that are in A?

d. Obtain the position of the elements in A in the vector B ie the first element in A is in position 2 in B

e. What is the intersection, union, set difference of the unique values of A and B?

f. Obtain the position of the duplicated values in A

g. Find the cumulative sum, product, cumulative minimum and maximum of A

Given the data 4,7,2,8,1,1,2 compute the standard deviation. \(sd = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)^2}\) where \(\bar x = \frac{1}{n}\sum_{i=1}^nx_i\)
Triangle numbers: Obtain the first 5 triangle numbers.
Nearest neighbor: Given a vector x, for every element in x find the closest value in THAT SAME vector excluding the element in question. Let x be the vector below:
```
1 5 6 2 3 0 5 2 1 9
```
The results should be
```
1 5 5 2 2 1 5 2 1 6
```
That is, the first number, 1, is closest to the other 1, which is second to last. The last number, 9, is closest to 6.

Write a function named nearest_neighbor that would solve the above for any vector x.
Dense Ranking: Suppose 8 students did an exam and the results were as follows: 98, 98, 96, 93, 85, 80, 85, 91. Rank the students to obtain the following results.

Notice that normal ranking would give us the results:
```
1 1 3 4 6 8 6 5
```
And the dense rank would give us:
```
1 1 2 3 5 6 5 4
```
Use R to obtain the results above.

Write an R function that would perform the dense rank on any given vector.

Name the function dense_rank. It should take two parameters: a numeric vector x and a logical parameter decreasing.
Value redistribution: Suppose we have a vector with many zeros.
```
v <- c(3,0,0,5,0,0,0,10,0,0,0,0)
```
We want to distribute each nonzero number forward: every nonzero number and the zeros that follow it should be replaced by their average. For example (3,0,0) should be replaced by (1,1,1).

(3+0+0)/3=1

v should become
```
(1,1,1,1.25,1.25,1.25,1.25,2,2,2,2,2)
```
How can we solve this using R?
searchsorted: Given two vectors, a and v, find the indices where the elements of v should be inserted into a to maintain order, ie find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved.
```
a <- 1:5
v <- c(-10,10,2,3)
```
The result should be:

1, 6, 2, 3 (1-based positions) or 0, 5, 1, 2 (0-based positions)

Write a function named search_sorted that would accomplish the task above given any two arrays. Let the first array be the sorted array a and the second array be v. Refer to week 1 notes on how to write a basic function.
(Continuation from 7) Using the idea above, assume the vector a is not sorted. You need to determine the index of the first element in vector a that is greater than or equal to the element in question in vector v:

Write a function named search_unsorted to accomplish this. Test it on the two vectors below

Example: suppose we have the data below:
```
x <- c(1, -3, 5, 10, 13, 4, 8, 20, 24) 
y <- c(2, 17, 23, -10, 12) 
```
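We see that the results should be 3, 8, 9, 1, 5 (1-based positions) or 2, 7, 8, 0, 4 (0-based positions)

How? First, 5 is the first element of x with \(2\le5\), so the result is the index of 5, which is 3. Then 20 is the first element with \(17\le20\), so the result is the index of 20, which is 8, etc.

Write code to obtain the results above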
Run Length Encoding: Given a vector \(x\) obtain its RLE. RLE is defined as a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. (Wikipedia)

Example:

Suppose we have: 1, 1, 1, 3, 3, 3, 3, 1, 1, 1, 1, 1, 2, 2, 3, 2, 2, 2

We see that we have 3 1’s, followed by 4 3’s followed by 5 1’s then 2 2’s, 1 3 and finally 3 2’s.

Write a function named my_rle that would solve the task at hand. Note that you should output the vector of values, with a lengths attribute that contains the lengths.

Compare your output to the output of the function call below:
```
x <- c(1,1,1,3,3,3,3,1,1,1,1,1,2,2,3,2,2)
rle(x)
```
```
Run Length Encoding
  lengths: int [1:6] 3 4 5 2 1 2
  values : num [1:6] 1 3 1 2 3 2
```
(Continuation from 9): Write a function named rle_id that creates a grouping vector for each run of a vector \(x\), ie when you run your function on the vector x above, rle_id(x), your results should be: 1,1,1,2,2,2,2,3,3,3,3,3,4,4,5,6,6

Also write a function named row_id that creates an element id within each rle group, ie when called on x above, row_id(x) should result in: 1,2,3,1,2,3,4,1,2,3,4,5,1,2,1,1,2