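While tinkering with Shiny recently, I came across vroom, a very handy package for reading data. These are my notes for future reference.
vroom detects the delimiter automatically, so CSV and TSV files are both read with the same call, vroom("xxx.csv").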
library(vroom)
data <- vroom("flights.tsv")
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
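This prints a long message describing the type of each column. If you don't need it, you can silence it by passing the column specification back in through col_types: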
s <- spec(data)
data <- vroom("flights.tsv", col_types = s)
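As a self-contained illustration of the two points above (one call for both CSV and TSV, and silencing the message), here is a minimal sketch. The toy files and the names csv_path, tsv_path, from_csv, from_tsv and again are made up for the example, and show_col_types = FALSE assumes a reasonably recent vroom release (1.5.0 or later, as far as I know).

```r
library(vroom)

# Hypothetical toy data written to temporary files (not the flights data).
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id,name", "1,a", "2,b"), csv_path)
writeLines(c("id\tname", "1\ta", "2\tb"), tsv_path)

# The same call reads both files; the delimiter is guessed from the contents.
# show_col_types = FALSE suppresses the column-type message.
from_csv <- vroom(csv_path, show_col_types = FALSE)
from_tsv <- vroom(tsv_path, show_col_types = FALSE)

# Reuse the guessed specification so that a later read prints no message.
s <- spec(from_csv)
again <- vroom(csv_path, col_types = s)
```

Passing the spec back in also pins the column types, which is useful if the same file is read repeatedly in a Shiny app.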
Reading many files in a single call is one of vroom's highlights.
files <- fs::dir_ls(glob = "flights_*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv
#> flights_YV.tsv
data <- vroom(files)
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
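When several files are combined like this, it is often useful to know which file each row came from. The sketch below uses vroom's id argument for that; the two toy files, the directory demo_dir and the column name source_file are assumptions made for the example, and base R's list.files() stands in for fs::dir_ls() only to keep the example dependency-free.

```r
library(vroom)

# Hypothetical toy files in a fresh temporary directory.
demo_dir <- tempfile("flights_demo_")
dir.create(demo_dir)
writeLines(c("carrier\tflight", "AA\t1", "AA\t2"),
           file.path(demo_dir, "flights_AA.tsv"))
writeLines(c("carrier\tflight", "UA\t3"),
           file.path(demo_dir, "flights_UA.tsv"))

# Collect the matching files (base R equivalent of fs::dir_ls(glob = ...)).
files <- list.files(demo_dir, pattern = "^flights_.*\\.tsv$", full.names = TRUE)

# Read all files into one tibble; `id` adds a column holding the source path.
combined <- vroom(files, id = "source_file", show_col_types = FALSE)
```

All files need to share the same columns for this to work.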
vroom_write()
vroom_write() can write compressed files directly; the compression is chosen from the file extension.
vroom_write(flights, "flights.tsv.gz")
# Check file sizes to show file is compressed
fs::file_size(c("flights.tsv", "flights.tsv.gz"))
#> 29.62M 7.87M
# Read the file back in
data <- vroom("flights.tsv.gz")
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
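The write-then-read round trip can be tried without the flights data. This is a minimal sketch using the built-in mtcars data frame; gz_path, magic and back are names invented for the example, and the check on the first two bytes is just one way to confirm that the output really is gzip.

```r
library(vroom)

# Write a built-in data frame to a temporary .gz file; the extension
# tells vroom_write() to compress the output.
gz_path <- tempfile(fileext = ".tsv.gz")
vroom_write(mtcars, gz_path)

# A gzip stream starts with the two bytes 0x1f 0x8b.
magic <- readBin(gz_path, what = "raw", n = 2)

# Read the compressed file back; decompression is handled automatically.
back <- vroom(gz_path, show_col_types = FALSE)
```

Note that row names are not written, so the car names in mtcars are lost in this round trip.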
vroom can also read a file straight from a URL.
file <- "https://raw.githubusercontent.com/r-lib/vroom/master/inst/extdata/mtcars.csv"
data <- vroom(file)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
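This one feels a bit like magic: vroom can read from a pipe connection, so the file can be filtered by a shell command before it is parsed. For this kind of preprocessing it can take the place of a Perl script.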
# Return only flights on United Airlines
data <- vroom(pipe("grep -w UA flights.tsv"), col_names = names(flights))
#> Observations: 58,665
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
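Here is a small self-contained version of the same idea. It assumes a Unix-like system with grep on the PATH; the toy file and the names path, cmd and ua are made up for the example. Because grep drops the header line, the column names are supplied by hand.

```r
library(vroom)

# Hypothetical toy file with a header and three data rows.
path <- tempfile(fileext = ".tsv")
writeLines(c("carrier\tflight", "UA\t1", "AA\t2", "UA\t3"), path)

# Build the shell command; shQuote() protects the path from the shell.
cmd <- paste("grep -w UA", shQuote(path))

# Only the lines matched by grep reach vroom. The header is filtered out,
# so col_names is given explicitly and the first line is treated as data.
ua <- vroom(pipe(cmd),
            delim = "\t",
            col_names = c("carrier", "flight"),
            show_col_types = FALSE)
```

The filtering happens outside R, so rows that do not match are never parsed.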
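pigz
Writing through a pipe to pigz (a parallel gzip implementation) is faster than the built-in gzip compression: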
bench::workout({
vroom_write(flights, "flights.tsv.gz")
vroom_write(flights, pipe("pigz > flights.tsv.gz"))
})
#> # A tibble: 2 x 3
#> exprs process real
#> <bch:expr> <bch:tm> <bch:tm>
#> 1 vroom_write(flights, "flights.tsv.gz") 3.5s 2.69s
#> 2 vroom_write(flights, pipe("pigz > flights.tsv.gz")) 1.54s 975.09ms
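col_select reads only the columns you ask for; columns can be selected by name, excluded with a minus sign, or renamed.
data <- vroom("flights.tsv", col_select = c(year, flight, tailnum))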
#> Observations: 336,776
#> Variables: 3
#> chr [1]: tailnum
#> dbl [2]: year, flight
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
data <- vroom("flights.tsv", col_select = c(-dep_time, -air_time:-time_hour))
#> Observations: 336,776
#> Variables: 13
#> chr [4]: carrier, tailnum, origin, dest
#> dbl [9]: year, month, day, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr...
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
data <- vroom("flights.tsv", col_select = list(plane = tailnum, everything()))
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
data
#> # A tibble: 336,776 x 19
#> plane year month day dep_time sched_dep_time dep_delay arr_time
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 N142… 2013 1 1 517 515 2 830
#> 2 N242… 2013 1 1 533 529 4 850
#> 3 N619… 2013 1 1 542 540 2 923
#> 4 N804… 2013 1 1 544 545 -1 1004
#> 5 N668… 2013 1 1 554 600 -6 812
#> 6 N394… 2013 1 1 554 558 -4 740
#> 7 N516… 2013 1 1 555 600 -5 913
#> 8 N829… 2013 1 1 557 600 -3 709
#> 9 N593… 2013 1 1 557 600 -3 838
#> 10 N3AL… 2013 1 1 558 600 -2 753
#> # … with 336,766 more rows, and 11 more variables: sched_arr_time <dbl>,
#> # arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
#> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>
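The three col_select forms above can be checked on a tiny file. This is only a sketch: the one-row file and the names kept, without_dep and renamed are made up for the example.

```r
library(vroom)

# Hypothetical toy file with four columns and a single data row.
path <- tempfile(fileext = ".tsv")
writeLines(c("year\tflight\ttailnum\tdep_time",
             "2013\t1545\tN14228\t517"), path)

# Keep only the named columns.
kept <- vroom(path, col_select = c(year, flight), show_col_types = FALSE)

# Drop one column and keep the rest.
without_dep <- vroom(path, col_select = c(-dep_time), show_col_types = FALSE)

# Rename while selecting: tailnum is returned as `plane`.
renamed <- vroom(path, col_select = list(plane = tailnum, year),
                 show_col_types = FALSE)
```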
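In most cases vroom guesses the column types correctly, but it occasionally gets one wrong, and then you can specify the type by hand. You can also fix the types afterwards with dplyr, although that is slightly more work.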
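For the "fix it afterwards" route, a minimal sketch with dplyr might look like the following. The toy file and the names guessed and fixed are assumptions for the example; the conversions shown (integer and factor) are just two common cases.

```r
library(vroom)
library(dplyr)

# Hypothetical toy file; vroom guesses `year` as double and `carrier` as character.
path <- tempfile(fileext = ".tsv")
writeLines(c("year\tcarrier", "2013\tUA", "2013\tAA"), path)
guessed <- vroom(path, show_col_types = FALSE)

# Convert the columns after reading.
fixed <- guessed %>%
  mutate(year = as.integer(year),
         carrier = factor(carrier))
```

This works, but the data is first parsed as one type and then converted, which is why specifying the type at read time is usually the tidier option.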
Type reference; the character in [ ] is the abbreviation actually used in col_types.
col_logical() [l]: containing only T, F, TRUE, FALSE, 1 or 0.
col_integer() [i]: integer values.
col_double() [d]: floating point values.
col_number() [n]: numbers containing the grouping_mark.
col_date(format = "") [D]: with the locale's date_format.
col_time(format = "") [t]: with the locale's time_format.
col_datetime(format = "") [T]: ISO8601 date times.
col_factor(levels, ordered) [f]: a fixed set of values.
col_character() [c]: everything else.
col_skip() [_, -]: don't import this column.
col_guess() [?]: parse using the "best" type based on the input.
Examples:
# read the 'year' column as an integer
data <- vroom("flights.tsv", col_types = c(year = "i"))
# also skip reading the 'time_hour' column
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_"))
# also read the carrier as a factor
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_", carrier = "f"))
data <- vroom("flights.tsv",
col_types = list(year = col_integer(), time_hour = col_skip(), carrier = col_factor())
)
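To confirm that the abbreviations do what the table says, the resulting column classes can be inspected on a small file. This is a sketch with made-up data; path and typed are names invented for the example.

```r
library(vroom)

# Hypothetical toy file with the three columns used in the examples above.
path <- tempfile(fileext = ".tsv")
writeLines(c("year\tcarrier\ttime_hour",
             "2013\tUA\t2013-01-01 05:00:00",
             "2013\tAA\t2013-01-01 06:00:00"), path)

# year as integer, carrier as factor, time_hour skipped entirely.
typed <- vroom(path,
               col_types = c(year = "i", carrier = "f", time_hour = "_"))

# Inspect the classes of the columns that were read.
classes <- vapply(typed, function(col) class(col)[1], character(1))
```

Columns not named in col_types are still guessed, so only the exceptions need to be listed.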
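In a word: fast! It is a great fit for machine learning work, where datasets routinely run to several gigabytes.
The figure below compares the time each package takes to read and write 1.55 GB of data.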