DTeam Team Blog


Mainstream Data Analysis Tools in the Julia World

胡键, posted on Aug 21, 2019

In Julia, there are two mainstream families of data analysis tools: the DataFrames.jl family and the JuliaDB family.

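They can fairly be regarded as the Pandas of the Julia world. I personally prefer the syntactic sugar they provide, which makes code somewhat easier to write. Here are a few examples to give a feel for it (the two families offer similar features in this respect, so only the JuliaDB family is shown):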

# Assumes `flights` is a JuliaDB table that has already been loaded
using JuliaDB, JuliaDBMeta, Statistics

# Filter rows: flights on January 1st
# Using JuliaDB
filter((:month => x -> x .== 1, :day => x -> x .== 1), flights);

# Using JuliaDBMeta
@filter flights :month .== 1 && :day .== 1

# Select columns
select(flights, (:year, :month, :day))

# Group by tail number and summarize each group
# Using JuliaDB
delay = groupby(
    (
        count = length,
        dist = :distance => x -> mean(skipmissing(x)),
        delay = :arr_delay => x -> mean(skipmissing(x))
    ),
    flights,
    :tailnum
); # remove ; to see result

# Using JuliaDBMeta
delay = @groupby flights :tailnum {
    count = length(_),
    dist = mean(skipmissing(:distance)),
    delay = mean(skipmissing(:arr_delay))
}

Since the two packages address similar problem domains, how do they differ? For an answer, see this Stack Overflow question:

What is the difference between DataFrames.jl and JuliaDB.jl

  1. JuliaDB.jl supports distributed parallelism; normal use of DataFrames.jl assumes that data fits into memory (you can work around this using SharedArray but this is not a part of the design) and if you want to parallelise computations you have to do it manually;
  2. JuliaDB.jl supports indexing while DataFrames.jl currently does not;
  3. Column types of JuliaDB.jl are stable and for DataFrames.jl currently they are not. The consequences are:
    • when using JuliaDB.jl each time a new type of data structure is created all functions that are applied over this type have to be recompiled (which for large data sets can be ignored but when working with many heterogeneous small data sets can have a visible performance impact);
    • when using DataFrames.jl you have to use special techniques ensuring type inference to achieve high performance in some situations (most notably barrier functions as discussed here).
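
The last point mentions barrier functions. As a rough illustration in plain Julia (no DataFrames.jl involved; the `columns` dictionary and both function names are hypothetical stand-ins for a table whose column types the compiler cannot see), the idea can be sketched like this:

```julia
# A container whose column types are hidden from the compiler
# (hypothetical stand-in for a table with type-unstable columns).
columns = Dict{Symbol,Any}(:distance => [100.0, 250.0, 75.5])

# Kernel: Julia compiles a specialized version of this function for each
# concrete vector type it is called with, so the loop runs on typed code.
function total(col::AbstractVector)
    s = zero(eltype(col))
    for x in col
        s += x
    end
    return s
end

# Outer function: `col` is only known as `Any` here, but passing it to
# `total` triggers dynamic dispatch once. That call is the function barrier.
function total_distance(cols)
    col = cols[:distance]
    return total(col)
end

result = total_distance(columns)
```

The cost of the type uncertainty is paid once at the call to `total`, rather than on every iteration of the loop.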

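In short, JuliaDB is designed with larger-than-memory data in mind and supports distributed parallel processing. The JuliaDB home page also lists a comparison with other popular data processing packages.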


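One last note: in an earlier post on time series analysis, I said I had not found a package that supports autocorrelation. That problem is now solved! The OnlineStats package supports it, and better still, JuliaDB integrates with it!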
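
As a minimal sketch of what this could look like with OnlineStats alone (assuming the package is installed and that its `AutoCov` statistic and `autocor` function behave as described in its documentation; the random-walk series is made up for illustration):

```julia
using OnlineStats  # third-party package, assumed to be installed

# Hypothetical series: a random walk, which is strongly autocorrelated.
y = cumsum(randn(1_000))

# Track the autocovariance for lags 0 through 5 in a single pass.
o = AutoCov(5)
fit!(o, y)

# Autocorrelation estimates for lags 0..5; the lag-0 entry is 1 by definition.
acf = autocor(o)
```

Because the statistic is updated one observation at a time, it never needs the whole series in memory, which fits JuliaDB's larger-than-memory focus.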