An unexpected challenge
After becoming familiar with R and the tidyverse and enjoying their power and expressiveness, one naturally wants to expand the use of R to more general-purpose programming tasks. Once the data is in tibble format, surely there is a handy helper function for processing each row, for example to perform a SQL update or a web service lookup. This is such a common pattern that one assumes it is a pretty simple task for the mighty R tidyverse.
Let's imagine we have a tibble of employees that looks like this:
library(tidyverse)
df.employees <- tribble(
~id, ~first_name, ~last_name, ~dob, ~gender, ~title,
'101A', 'Bob', 'Francis', '1983-06-12', 'M', 'Director',
'102C', 'Susan', 'Bluebell', '1990-04-21', 'F', 'Assistent Director',
'201C', 'Emily', 'Rosen', '1971-11-07', 'F', 'CTO',
'301X', 'Ashley', 'Emerson', '2001-01-25', 'F', 'CFO'
)
Let's start with an intuitive but naive approach to iterating: pipe the employees to an anonymous function using purrr's map function. The function itself just returns its parameter.
df.employees |>
purrr::map(function(x) {
x
})
The output might not be what one would expect coming from a general programming background:
$id
[1] "101A" "102C" "201C" "301X"
$first_name
[1] "Bob" "Susan" "Emily" "Ashley"
$last_name
[1] "Francis" "Bluebell" "Rosen" "Emerson"
$dob
[1] "1983-06-12" "1990-04-21" "1971-11-07" "2001-01-25"
$gender
[1] "M" "F" "F" "F"
$title
[1] "Director" "Assistent Director" "CTO" "CFO"
Instead of iterating over the rows it iterated over the columns! Indeed, if we look at the data type of the tibble:
typeof(df.employees)
[1] "list"
We see that a tibble is essentially a list (albeit of columns).
str(df.employees)
tibble [4 × 6] (S3: tbl_df/tbl/data.frame)
 $ id        : chr [1:4] "101A" "102C" "201C" "301X"
 $ first_name: chr [1:4] "Bob" "Susan" "Emily" "Ashley"
 $ last_name : chr [1:4] "Francis" "Bluebell" "Rosen" "Emerson"
 $ dob       : chr [1:4] "1983-06-12" "1990-04-21" "1971-11-07" "2001-01-25"
 $ gender    : chr [1:4] "M" "F" "F" "F"
 $ title     : chr [1:4] "Director" "Assistent Director" "CTO" "CFO"
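Because a tibble is a list of columns, ordinary list operations see columns rather than rows. Here is a minimal sketch of that idea, using a small stand-in tibble (`df.demo` and the other variable names are made up for illustration and are not part of the example above):

```r
library(tibble)
library(purrr)

# A small stand-in tibble; df.demo is a hypothetical name for illustration.
df.demo <- tribble(
  ~id,    ~first_name,
  '101A', 'Bob',
  '102C', 'Susan'
)

# length() counts the list elements, i.e. the columns, not the rows.
n.elements <- length(df.demo)

# [[ ]] extracts a single list element, which is a whole column vector.
first.column <- df.demo[[1]]

# map_chr() visits each list element, so class() runs once per column.
column.classes <- map_chr(df.demo, class)
```

Here `n.elements` is the number of columns, and `column.classes` has one entry per column, which matches the column-wise behavior of map seen earlier.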
So map iterated over the list, but not in the row-by-row manner we were hoping for. Surely the mighty tidyverse has an elegant way of handling this.
pmap, a map variant, to the rescue
If we look in the documentation for map, we may notice the variant pmap, or parallel map, which has a special behavior when passed a data frame: it iterates row by row.
A data frame is an important special case of .l. It will cause .f to be called once for each row.
This sounds like it is exactly what we want.
df.employees |>
purrr::pmap(function(...) {
tb.row <- tibble(...)
tb.row
})
Each row becomes a one-row tibble, from which we can access any column we want.
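Since pmap passes each column value to the function as a named argument, one possible variation is to name only the columns needed and let `...` absorb the rest, instead of rebuilding a tibble for every row. This is a sketch under that assumption; `df.demo` and `full.names` are hypothetical names used only for illustration:

```r
library(tibble)
library(purrr)

# A small stand-in tibble; df.demo is a hypothetical name for illustration.
df.demo <- tribble(
  ~id,    ~first_name, ~last_name, ~gender,
  '101A', 'Bob',       'Francis',  'M',
  '102C', 'Susan',     'Bluebell', 'F'
)

# Each row is passed as named arguments matching the column names.
# The function picks first_name and last_name; `...` swallows id and gender.
full.names <- df.demo |>
  pmap_chr(function(first_name, last_name, ...) {
    paste(first_name, last_name)
  })
```

The `...` in the function signature matters: without it, pmap would raise an error for the columns the function does not name.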
df.employees |>
purrr::pmap(function(...) {
tb.row <- tibble(...)
tb.row$IsFemale <- tb.row$gender == 'F'
tb.row
}) |>
bind_rows()
We can pipe the output into bind_rows() so that we get one tibble, as we had originally. We have also appended a new column, which shows how the dataset can be enriched with the result of our row-by-row processing, if desired.
# A tibble: 4 × 7
  id    first_name last_name dob        gender title              IsFemale
1 101A  Bob        Francis   1983-06-12 M      Director           FALSE
2 102C  Susan      Bluebell  1990-04-21 F      Assistent Director TRUE
3 201C  Emily      Rosen     1971-11-07 F      CTO                TRUE
4 301X  Ashley     Emerson   2001-01-25 F      CFO                TRUE
This is precisely the kind of row-by-row iteration that I was hoping for, and the syntax is pretty clean.
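For the motivating cases from the introduction, a SQL update or a web service lookup, the row function is called for its side effect rather than its return value. purrr's pwalk is the side-effect counterpart of pmap, so a sketch could look like the following. Note that `update_employee` is a hypothetical stand-in that only records what it was asked to do; a real version would talk to a database or service:

```r
library(tibble)
library(purrr)

# A small stand-in tibble; df.demo is a hypothetical name for illustration.
df.demo <- tribble(
  ~id,    ~first_name, ~title,
  '101A', 'Bob',       'Director',
  '201C', 'Emily',     'CTO'
)

# Records of the simulated updates, appended to by update_employee().
sent <- character()

# Hypothetical stand-in for a real side effect such as a SQL update.
update_employee <- function(id, title) {
  sent <<- c(sent, paste0(id, ": ", title))
}

# pwalk() calls the function once per row and discards the return values.
pwalk(df.demo, function(id, title, ...) update_employee(id, title))
```

After the call, `sent` holds one entry per row, in row order.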
Conclusion
With a little digging and experimentation one can use R and the tidyverse as a general purpose language and reap a lot of its benefits: expressiveness and conciseness in getting a lot of work done without much code. One of R's charms is that it doesn't always do things the way one would expect, but that isn't necessarily a bad thing.
While investigating this topic I've found there are a lot of possible variations on this theme.
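One such variation, sketched here as an illustration rather than a recommendation, uses the typed variant pmap_lgl to compute the new column directly as a logical vector, which avoids building a tibble per row and binding them back together. The names `df.demo` and `is.female` are hypothetical:

```r
library(tibble)
library(purrr)

# A small stand-in tibble; df.demo is a hypothetical name for illustration.
df.demo <- tribble(
  ~id,    ~gender,
  '101A', 'M',
  '102C', 'F'
)

# pmap_lgl() calls the function once per row and returns a logical vector.
is.female <- pmap_lgl(df.demo, function(gender, ...) gender == 'F')

# Attach the result as a new column on the original tibble.
df.demo$IsFemale <- is.female
```

For a simple comparison like this, a vectorized expression on the column would be shorter; the row-wise form earns its keep when the per-row work cannot be vectorized.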