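📝R Language

R language summary

A language for statistics

The R Project for Statistical Computing

A GNU project for statistical computing and graphics output.

Its core is a functional language, but code can also be written in an object-oriented or procedural style.

The statistical-analysis side is modeled on the S language developed at AT&T Bell Laboratories.
- R uses lexical scoping, whereas S resolves free variables among global variables.
The data-processing side is influenced by Scheme.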

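Syntax

The basic building blocks are as follows.

Environment … a set of symbol-object pairs
Object … a thing (a value)
Function … something that operates on objects (a function object)
Symbol … a name given to an object (an identifier)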

Basic

Immutable

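In R,

assignment behaves as a copy of the object (value semantics)
higher-order functions … the value returned by a function is likewise a copy of the function object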
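
A minimal sketch of the copy behaviour described above. The variable and function names (x, y, add_one) are illustrative only and do not come from the original note.

```r
x <- c(1, 2, 3)
y <- x          # y behaves as an independent copy of x
y[1] <- 100     # modifying y ...
x               # ... leaves x unchanged

# A function that modifies its argument only changes its local copy.
add_one <- function(v) {
  v[1] <- v[1] + 1
  v
}
z <- add_one(x)
x               # still the original values
z               # the modified copy
```

The caller's object stays untouched unless the result is assigned back explicitly.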

Binding to Symbol

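Functions can be bound to symbols (this is the functional-paradigm part).

When it resolves a symbol, the interpreter

first searches the global environment,
then searches the namespaces of the packages on the search list.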

Lexical Scoping

make.power <- function (n) {
    pow <- function (x) { x^n }
    pow
}
 
cube <- make.power (3)
square <- make.power (2)
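
As a small usage sketch of the closures above, each returned function keeps the n it was created with. The inspection with environment () and get () is an addition for illustration, not part of the original note.

```r
make.power <- function(n) {
  pow <- function(x) { x^n }
  pow
}

cube <- make.power(3)
square <- make.power(2)

cube(3)    # 27
square(3)  # 9

# The free variable n is looked up in the environment where pow was defined.
ls(environment(cube))
get("n", environment(cube))  # 3
```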

print

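explicit printing

Use the print function.
```
x <- 1
print (x)
```
```
[1] 1
```

auto-printing

Typing an object's name at the prompt prints it automatically, in a format chosen from its type.
```
x
```
```
[1] 1
```
```
msg <- "hello"
msg
```
```
[1] "hello"
```
```
x <- 1:5
x
```
```
[1] 1 2 3 4 5
```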

Functions

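Characteristics

Designed with an emphasis on interactive use from the command line.

Basics

A function is created with the function () declaration. Once created, it is held as an R object of class "function".

f <- function (<arguments>) {
  ## function body
}

R functions are first-class objects.

They can be passed as arguments to other functions.
Functions can be nested (defined inside other functions).
The return value is the result of the last expression evaluated in the body.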

Arguments:

Arguments are matched in the following order.

exact name match
partial name match
positional match
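
The matching order above can be sketched as follows. The function area and its arguments are hypothetical, introduced only for this illustration.

```r
# Hypothetical function used only to illustrate argument matching.
area <- function(width, height, scale = 1) {
  width * height * scale
}

a1 <- area(height = 2, width = 3)  # exact names, order does not matter
a2 <- area(3, 2, sc = 10)          # "sc" partially matches "scale"
a3 <- area(h = 2, 3)               # "h" matches "height"; 3 goes to width by position

c(a1, a2, a3)  # 6 60 6
```

Partial matching is convenient interactively, but spelling names out in full tends to be safer in scripts.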

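Lazy Evaluation

R supports lazy evaluation: an argument is evaluated only when it is actually used.

In the example below, a is evaluated and printed; b is never supplied, but the error is raised only when print (b) is reached.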

f <- function (a, b) {
    print (a)
    print (b)
}
f (45)

example

add2 <- function (x, y) {
    x + y
}
 
above10 <- function (x) {
    use <- x > 10
    x[use]
}
 
above <- function (x, n = 10) {
    use <- x > n
    x[use]
}
x <- 1:20
above (x, 12)

columnmean <- function (y, removeNA = TRUE) {
    nc <- ncol (y)
    means <- numeric (nc)
    for (i in seq_len (nc)) {
        # mean of the i-th column
        means[i] <- mean (y[, i], na.rm = removeNA)
    }
    means
}

Control Structures

The control statements resemble Ruby's.

if

if (x > 3) {
    y  <- 10
} else {
    y <- 0
}
 
# cf. the ternary operator in C: y = (condition) ? foo : bar;
y <- if (x > 3) {
    10
} else {
    0
}

For loops

for (i in 1:10) {
    print (i)
}

Loop over a matrix as follows.

x <- matrix (1:6, 2, 3)
 
for (i in seq_len (nrow (x))) {
    for (j in seq_len (ncol (x))) {
        print (x[i, j])
    }
}

while loops

count <- 0
while (count < 10) {
   print (count)
   count <- count + 1
}

repeat/break/next

repeat creates an infinite loop. It is used together with break and next.
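
A minimal sketch of repeat with break and next. The loop below, with illustrative variable names, sums the odd numbers below 10.

```r
total <- 0
i <- 0
repeat {
  i <- i + 1
  if (i %% 2 == 0) next   # skip even numbers and continue with the next iteration
  if (i > 9) break        # leave the infinite loop
  total <- total + i
}
total  # 25 (1 + 3 + 5 + 7 + 9)
```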

Objects (Data)

Atomic Classes of Objects

R has five atomic classes of objects.

character
numeric (real number)
integer
complex
logical (TRUE/FALSE)

Integer

To get an integer explicitly, append L to the number (for example 1L).

NaN

An undefined value (not a number).

Inf

Inf … infinity
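
A short sketch of the L suffix and of the special values NaN and Inf, using base R only.

```r
class(1)    # "numeric"
class(1L)   # "integer"

1 / 0       # Inf
-1 / 0      # -Inf
0 / 0       # NaN

is.infinite(1 / 0)  # TRUE
is.nan(0 / 0)       # TRUE
is.na(NaN)          # TRUE: NaN also counts as missing
is.nan(NA)          # FALSE: NA is not NaN
```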

Basic Objects

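variables
```
x <- 5
```

vector

Create a vector with c ().

a <- c (0.5, 0.6)    # numeric
b <- c (TRUE, FALSE) # logical
c <- 0:5             # integer
d <- c ("a", "b", "c") # character

Mixing types is allowed, but a vector is not a tuple: the elements are coerced to a single common type.

a <- c (1.7, "a")   # character
b <- c (TRUE, "a")  # character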

x <- 0:6
class (x)

integer
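
A small sketch of how vector elements are coerced, both implicitly by c () and explicitly by the as.* functions. The values are arbitrary examples.

```r
class(c(1.7, "a"))    # "character": the number becomes a string
class(c(TRUE, 2))     # "numeric": TRUE becomes 1
class(c(TRUE, "a"))   # "character"

# Explicit coercion
as.integer(c("1", "2", "3"))  # 1 2 3
as.logical(c(0, 1, 2))        # FALSE TRUE TRUE
as.character(1:3)             # "1" "2" "3"
```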

list

A special kind of vector whose elements may have different types.
```
x <- list (1, "a", TRUE, 1 + 4i)
x
```
```
1	a	TRUE	1+4i
```

Matrices

A vector with a dimension attribute. Created with the matrix function.
```
m <- matrix (nrow = 2, ncol = 3)
m
```
```
NA	NA	NA
NA	NA	NA
```
```
m <- matrix (1:6, nrow = 2, ncol = 3)
m
```
```
1	3	5
2	4	6
```
- dim
  
  Assigning to dim () gives a vector a dimension attribute.
```
m <- 1:10
dim (m) <- c (2,5)
m
```
```
1	3	5	7	9
2	4	6	8	10
```
- cbind-ing and rbind-ing
  
  cbind and rbind can also be used to build a matrix from vectors.
```
x <- 1:3
y <- 10:12
cbind (x, y)
```
```
1	10
2	11
3	12
```
```
rbind (x,y)
```
```
1	2	3
10	11	12
```

Factors

A special kind of vector for handling categorical data.

Think of it as an integer vector in which each integer carries a label.

It can also be described as an enum (enumerated type). Created with the factor function.
```
x <- factor (c ("yes", "no", "no", "yes", "no"), levels = c ("yes", "no"))
table (x)
```
```
yes	2
no	3
```
- Converting between numeric and factor | Siguniang’s Blog

Data Frame

A list made up of several vectors.
- Data frame tips compendium - RjpWiki
- Ruby - About R data frames (data.frame) - Qiita
A special kind of list: a list of vectors.
- Every element of the list must have the same length.
- Each element of the list is treated as a column.
- The position within each element is treated as the row.
- Usually created by read.table () or read.csv ().
- Can be converted to a matrix with data.matrix (x).
```
x <- data.frame (foo = 1:4, bar = c (TRUE, TRUE, FALSE, FALSE))
x
```
```
1	TRUE
2	TRUE
3	FALSE
4	FALSE
```
- Get the column labels
```
names (data)
```
- Extract the rows that match a condition
```
adaltAnimalData <- animaldata[animaldata$Age.Intake>=1,]
```
- Extract a vector from a data frame
```
distance <- student$distance
```

names

Objects can be given names, which improves readability.

x <- 1:3
names (x) <- c ("foo", "bar", "norf")
 
m <- matrix (1:4, nrow = 2, ncol = 2)
dimnames (m) <- list (c ("a", "b"), c ("c", "d"))

split

Splits a data frame by category.

R.4.05. Data frame type | R Financial & Marketing Library
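
A minimal sketch of split on a data frame. The data frame df and its columns group and value are made up for this illustration.

```r
# Hypothetical data: two categories, four rows.
df <- data.frame(group = c("a", "b", "a", "b"), value = 1:4)

parts <- split(df, df$group)   # a named list with one data frame per category
names(parts)                   # "a" "b"
parts$a                        # the rows where group == "a"

# Each piece can then be processed separately.
sapply(parts, function(d) sum(d$value))
```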

Reading/Writing Data

Reading

read.csv reads from a CSV file.

data <- read.csv ("foo.csv")

read.table reads tabular text; R works out the column types by itself.

data <- read.table ("foo.txt")

Read only the first 100 rows.

initial <- read.table ("datatable.txt", nrows = 100)

Writing

dput and dump write objects out as text (R code) that can be read back in.

y <- data.frame (a = 1, b = "a")
dput (y)   # prints a structure (...) expression that recreates y
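
A sketch of a round trip through text files, assuming temporary files created with tempfile (). dget and source are the base R counterparts for reading back what dput and dump wrote; the file and object names are illustrative.

```r
y <- data.frame(a = 1, b = "a")

# dput writes a single object as an R expression; dget reads it back.
obj_file <- tempfile(fileext = ".R")
dput(y, file = obj_file)
y_restored <- dget(obj_file)
all.equal(y, y_restored)

# dump writes assignments for several named objects; source re-creates them.
x <- "foo"
dump_file <- tempfile(fileext = ".R")
dump(c("x", "y"), file = dump_file)
rm(x, y)
source(dump_file)
x
```

Unlike a CSV file, this text form keeps metadata such as classes and column types.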

Outside

Interfaces to the outside world (connections).

file
gzfile
bzfile
url

A file can also be opened through a connection.

con <- file ("hw1_data.csv", "r")
data <- read.csv (con)
close (con)

Data can also be fetched from a website by giving its URL.

con <- url ("http://www.jhsph.edu", "r")
x <- readLines (con)   # an HTML page is read line by line, not as CSV
close (con)

Subsetting

Extracting subsets of objects.

vector

x <- c ("a", "b", "c", "c", "d", "a")
x[1:4]

a
b
c
c

Elements can also be extracted by a condition.

x[x > "a"]

b
c
c
d

list

x <- list (foo = 1:4, bar = 0.6)
 
# by index (returns a list containing the element)
x[1]
 
# by name with $ (returns the element itself)
x$bar
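
A short sketch of the difference between [, [[ and $ on a list. The variable name key is an addition for illustration.

```r
x <- list(foo = 1:4, bar = 0.6)

class(x[1])     # "list": single brackets keep the container
class(x[[1]])   # "integer": double brackets return the element itself
x[["bar"]]      # 0.6, same as x$bar

# [[ accepts a computed name, which $ does not.
key <- "foo"
x[[key]]

# Single brackets can select several elements at once.
x[c(1, 2)]
```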

Matrix

x <- matrix (1:6, 2, 3)

1	3	5
2	4	6

Leaving one side of the comma empty extracts a whole row or column as a vector.

x[1,]

1
3
5

Removing NA values

Use complete.cases to find the positions where no value is missing.

x <- c (1, 2, NA, 4, NA, 5)
y <- c ("a", "b", NA, "d", NA, "f")
good <- complete.cases (x, y)
good

   TRUE
   TRUE
   FALSE
   TRUE
   FALSE
   TRUE

x[good]

Apply Functions

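In R, the idiomatic approach is often to use the apply family instead of for loops.

apply processes matrix-type data
tapply processes data group by group
lapply and sapply process the elements of a vector or list one after another
mapply takes one element at a time from each of several vectors or lists and processes them together.

Think of it as something like doing a matrix calculation.

Bookmarks
- R-Source
- R programs (TAKENAKA’s Web Page)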

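apply (X, MARGIN, FUN, …)

Applies a function over the margins of a matrix or array and returns the result as a vector, array, or list.

The target is specified with MARGIN.

MARGIN = 1 for rows
MARGIN = 2 for columns
MARGIN = c (1, 2) for every element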
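
The three MARGIN settings above can be sketched on a small matrix. The matrix m and the result names are illustrative.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

row_sums  <- apply(m, 1, sum)    # one value per row: 9 12
col_means <- apply(m, 2, mean)   # one value per column: 1.5 3.5 5.5
squared   <- apply(m, c(1, 2), function(v) v^2)  # applied to every element

row_sums
col_means
squared
```

For plain sums and means, rowSums, colSums, rowMeans and colMeans do the same job and are usually faster.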

lapply (X, FUN, …)

Applies a function to each element of a list and returns a list of the results.

x <- list (a = 1:5, b = rnorm (10))
lapply (x, mean)

Anonymous functions can be applied as well.

x <- list (a = matrix (1:4, 2, 2), b = matrix (1:6, 3, 2))
lapply (x, function (elt) elt[,1])

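sapply (X, FUN, …)

Applies a function to each element of a list and simplifies the result where possible, returning one of the following.

a vector with a names attribute (when every result has length 1)
a matrix with names (when every result has the same length greater than 1)

Otherwise it returns a list, just as lapply does.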
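
A sketch of the three shapes sapply can return, using a small made-up list.

```r
x <- list(a = 1:5, b = c(2, 4, 6))

means  <- sapply(x, mean)    # every result has length 1 -> named vector
ranges <- sapply(x, range)   # every result has length 2 -> matrix, one column per element
ragged <- sapply(x, function(v) v[v > 2])  # lengths differ -> stays a list

means
ranges
ragged
```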

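tapply (X, INDEX, FUN, …)

Applies a function to each group of a grouped variable. INDEX is a factor, or a list of factors, that assigns each element of X to a group (a character vector is coerced to a factor); the result of applying the function to each group is returned as a vector or a list.

x <- c (rnorm (10), runif (10), rnorm (10, 1))
f <- gl (3,10)
tapply (x, f, mean)

Think of something like a pivot table in Excel.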

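mapply (FUN, x, y, z, …)

A multivariate version of sapply (). Several vectors or lists x, y, z, … can be given; FUN is called on their first elements, then on their second elements, and so on, and the results are returned as a list, or simplified to a vector or matrix where possible.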
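
A minimal sketch of mapply walking over two vectors in parallel. The result names are illustrative.

```r
# rep (1, 3), rep (2, 2), rep (3, 1): results of different lengths stay a list
reps <- mapply(rep, 1:3, 3:1)
reps

# results of length 1 are simplified to a vector
sums <- mapply(function(x, y) x + y, 1:3, c(10, 20, 30))
sums  # 11 22 33
```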

Operations

vector

x <- 1:4; y <- 6:9
x + y
x * y
x / y

matrix

x <- matrix (1:4, 2, 2)

1	3
2	4

y <- matrix (rep (10, 4), 2, 2)

10	10
10	10

x * y

10	30
20	40
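
The * operator above works element by element. As a complementary sketch, the true matrix product uses the %*% operator, shown here on the same x and y.

```r
x <- matrix(1:4, 2, 2)
y <- matrix(rep(10, 4), 2, 2)

elementwise <- x * y    # multiplies matching cells
product <- x %*% y      # matrix multiplication

elementwise
product   # rows: (40, 40) and (60, 60)
```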

Calculations

Counts

table (adaltAnimalData$Animal.Type)

Median, mean, standard deviation

# maximum
max (maleage)
 
# mean
mean (animaldata$Age.Intake)
 
# median
median (animaldata$Age.Intake)
 
# standard deviation
sd (animaldata$Age.Intake)
 
# five-number summary
fivenum (animaldata$Age.Intake)
 
# rounding
# to 2 decimal places
round (data, 2)

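cor: correlation coefficient

Introduction to statistical analysis in R: computing the correlation coefficients among several variables at once (梶山喜一郎)

cor (bull$YearsPro, bull$BuckOuts)

Converting categorical data to numeric data
```
val <- as.numeric (var)
```

Building a correlation matrix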

myvars <- c ('YearsPro', 'Events', 'BuckOuts')
cor (bull[,myvars])

Tests

zscore: Z test

zcat <- (13- mean (catWeight))/sd (catWeight)
1-pnorm (zcat)

Probability

table

Builds a contingency table by counting how often each value occurs.

gtab <- table (acl$Grammy)

N	67
Y	49

prop

Converts a table of counts into a table of proportions (relative frequencies).

prop.table (gtab)

N	0.577586206896552
Y	0.422413793103448

gtab2 <- table (acl$Grammy, acl$Gender)
prop.table (gtab2)

0.181034482758621	0.396551724137931
0.120689655172414	0.301724137931034
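
For a two-way table, prop.table also takes a margin argument. The sketch below uses small made-up vectors (genre and live are hypothetical, not the acl data set used above).

```r
# Hypothetical categorical data.
genre <- c("rock", "rock", "pop", "pop", "pop", "jazz")
live  <- c("Y", "N", "Y", "Y", "N", "N")
tab <- table(genre, live)

overall <- prop.table(tab)               # each cell divided by the grand total
by_row  <- prop.table(tab, margin = 1)   # each row sums to 1
by_col  <- prop.table(tab, margin = 2)   # each column sums to 1

overall
by_row
by_col
```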

Plotting

There are three plotting systems.

Base: “artist’s palette” model
Lattice: Entire plot specified by one function; conditioning
ggplot2: Mixes elements of Base and Lattice

Base

plot (x, y), hist (x)

R-Source

plot: draws a scatter plot

# plot
plot (bull$YearsPro, bull$BuckOuts, xlab='Years Pro', ylab='Buckouts', main='Plot of Years Buckouts')
# with
with (airquality, plot (Ozone, Wind))

abline: adds a fitted (regression) line

abline (lm (bull$BuckOuts~bull$YearsPro))

hist: histogram

hist (animaldata$Age.Intake, main="Histogram of Intake Ages",
       xlab="Age at Intake")

boxplot: box-and-whisker plot

boxplot (Ozone ~ Month, airquality, xlab="Month", ylab="Ozone (ppb)")

parameters
- `pch`: the plotting symbol (default is open circle)
- `lty`: the line type (default is solid line), can be dashed, dotted, etc.
- `lwd`: the line width, specified as an integer multiple
- `col`: the plotting color, specified as a number, string, or hex code; the `colors ()` function gives you a vector of colors by name
- `xlab`: character string for the x-axis label
- `ylab`: character string for the y-axis label
The `par ()` function is used to specify global graphics parameters that affect all plots in an R session. These parameters can be overridden when specified as arguments to specific plotting functions.
- `las`: the orientation of the axis labels on the plot
- `bg`: the background color
- `mar`: the margin size
- `oma`: the outer margin size (default is 0 for all sides)
- `mfrow`: number of plots per row, column (plots are filled row-wise)
- `mfcol`: number of plots per row, column (plots are filled column-wise)
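
A sketch that combines several of the parameters above, using the airquality data set that ships with R. The particular colors, margins and titles are arbitrary choices for illustration.

```r
# par () returns the previous settings, so they can be restored afterwards.
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), las = 1)

# Left panel: scatter plot with a dashed regression line.
with(airquality, plot(Wind, Ozone, pch = 20, col = "blue",
                      xlab = "Wind", ylab = "Ozone", main = "Ozone vs. Wind"))
abline(lm(Ozone ~ Wind, data = airquality), lwd = 2, lty = 2)

# Right panel: histogram.
hist(airquality$Ozone, col = "grey", xlab = "Ozone", main = "Ozone")

par(old)  # restore the previous global settings
```

Restoring the saved settings keeps later plots in the same session from inheriting the two-panel layout.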

Tables

Use the xtable library.
- LaTeX - RjpWiki
- Making LaTeX and HTML tables - 迷途覚路夢中行 - Yahoo! Blog

Lattice

A library that is useful for visualizing relationships between variables.

xyplot (y ~ x | f * g, data)

Lattice Panel Function

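Useful for looking at the relationship between x and y.

Many panels can be laid out side by side to look at trends and find patterns.

ggplot2

An improved take on plot (): qplot ().

With qplot, sensible defaults are derived from the data set and a decent-looking graph is produced automatically, whereas with plot everything has to be specified by hand.

About the R graphics package "ggplot2" | Colorless Green Ideas

qplot
- Summary of ggplot2's qplot function - ぬいぐるみライフ (仮)
- ggplot2 notes on bar charts and stacked bar charts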

Graphics File Devices

vector formats

Suited to line graphics.
- pdf
- svg
- win.metafile
- postscript

bitmap format

Suited to plots with a large number of points, such as dense scatter plots.
- png
- jpeg
- tiff
- bmp
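
A minimal sketch of writing a plot to a file device: open the device, draw, then close it with dev.off (). The output path is a temporary file and the faithful data set ships with R; both are choices made for this illustration.

```r
out <- tempfile(fileext = ".pdf")

pdf(file = out, width = 6, height = 4)   # open a vector-format file device
with(faithful, plot(eruptions, waiting))
title(main = "Old Faithful Geyser data")
dev.off()                                # close the device so the file is written

file.exists(out)
```

The bitmap devices (png, jpeg, tiff, bmp) follow the same open, draw, close pattern.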

Bookmarks

coursera

https://github.com/DataScienceSpecialization/courses/blob/master/04_ExploratoryAnalysis/PlottingBase/index.md

Regression analysis

Linear regression line: linFit

Use linFit.

linFit (mens800$Year, mens800$Record)

Exponential regression curve

Use expFit.

expFit (time, mv)
 
# predict the value some years ahead
expFitPred (time, mv, 12)

Logistic regression curve

Use logisticFit.

logisticFit (time, mv)
 
# predict the value some years ahead
logisticFitPred (time, mv, 12)

Displaying the three regression curves together

Use tripleFit.

tripleFit (time, mv)

Sampling

In each of 1000 trials, draw a sample of size 10 and record its mean.

xbar10 <- rep (NA, 1000)
for (i in 1:1000) {
    x <- sample (survey$name_letters, size = 10)
    xbar10[i] <- mean (x)
}

t-testing

t.test (age, mu=30)
t.test (age, mu=30, alternative = 'less')
t.test (age, mu=30, alternative = 'greater')

aggregate

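Used when you want to summarize data group by group in R.

Used to compute a statistic T (X), where X = (x1, x2, x3, …).

aggregate (x, by, FUN, …)

Takes the data x,
groups it by the list by,
and summarizes each group into a statistic with the function FUN.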

averages <- aggregate (x=list (steps=data$steps),
                       by=list (interval=data$interval),
                       FUN=mean)

aggregate (formula, data, FUN)

Takes a formula
and a data frame (data),
and summarizes each group into a statistic with the function FUN.

Reference: formula, R-Source
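
A sketch of the formula interface, computing the same kind of per-interval average as the list-based call above. The small data frame is made up for this illustration.

```r
# Hypothetical data with the same column names as the example above.
data <- data.frame(interval = c(0, 0, 5, 5),
                   steps    = c(10, 20, 0, 4))

# "steps ~ interval" reads: summarize steps for each value of interval.
averages <- aggregate(steps ~ interval, data = data, FUN = mean)
averages   # interval 0 -> 15, interval 5 -> 2
```

The formula form keeps the original column names in the result without the extra list () wrappers.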

Data cleaning

sort

Use order / sort.list.

How to sort a data frame in R - XXXannex

stateData <- stateData[order (stateData[,col]),]

Removing invalid values

Values that cannot be converted to numeric become NA.

data[, 11] <- as.numeric (data[, 11])

Removing duplicates

Use unique.

u <- unique (d)

Simulation

Random Numbers

dnorm
pnorm
qnorm
rnorm

dnorm (x, mean = 0, sd = 1, log = FALSE)
pnorm (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm (p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm (n, mean = 0, sd = 1)
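
A short sketch of how the four functions relate for the standard normal distribution: d is the density, p the cumulative probability, q the quantile (the inverse of p), and r draws random values.

```r
dnorm(0)        # density at 0, equal to 1 / sqrt(2 * pi)
pnorm(0)        # 0.5: half of the probability mass lies below the mean
pnorm(1.96)     # roughly 0.975
qnorm(0.975)    # roughly 1.96, the inverse of pnorm

# qnorm undoes pnorm
pnorm(qnorm(0.3))   # 0.3 up to floating-point error

draws <- rnorm(5)   # five random values from N(0, 1)
length(draws)
```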

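Draw 10 random values from the standard normal distribution (mean 0, standard deviation 1).

x <- rnorm (10)

Specify the mean and the standard deviation.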

x <- rnorm (10, 20, 2)

Calling set.seed with the same seed makes the sequence of random numbers reproducible: every run yields the same values.
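
A minimal sketch of what set.seed does. The seed value 42 and the variable names are arbitrary.

```r
set.seed(42)
first <- rnorm(5)

set.seed(42)          # resetting the same seed ...
second <- rnorm(5)    # ... reproduces exactly the same draws

identical(first, second)  # TRUE

third <- rnorm(5)     # without resetting, the sequence simply continues
identical(first, third)   # FALSE
```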

ex: Linear Models

y = b0 + b1*x + e

set.seed (20)
x <- rnorm (100)
e <- rnorm (100, 0, 2)
y <- 0.5 + 2 * x + e

Random Sampling

The sample function draws a random sample from a population.

set.seed (1)
sample (1:10, 4)
sample (letters, 4)

Debug

ess-tracebug

Use ess-tracebug.

ess-tracebug - Tracing and debugging R code in ESS. - Google Project Hosting

Breakpoint commands: ess-bp-xxx

str

Compactly displays the internal structure of an object.

str (str)

summary

Displays a summary of an object's contents.

system.time

Reports how long an expression took to evaluate (user, system, and elapsed time).

Rprof

R's profiler.

Rprof ()
summaryRprof ()
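
A small sketch of system.time wrapped around an arbitrary block of code. The loop is only a stand-in workload chosen for this illustration.

```r
timing <- system.time({
  total <- 0
  for (i in 1:100000) {
    total <- total + sqrt(i)
  }
})

timing                 # user, system and elapsed times in seconds
timing[["elapsed"]]    # wall-clock time only
```

When the elapsed time is much larger than the user time, the code is probably waiting on something other than the CPU, such as disk or network.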

👨Hadley Wickham

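The man who led R to the throne of data science by developing dplyr and one outstanding related library after another.

💡Revered as the "Hadley faith" (羽鳥教) and the "Hadley god" (羽鳥神).

https://github.com/hadley

🔧tidyverse/dplyr

💡Hadley faith (羽鳥教)

I remember joining the faith myself, back when I dreamed of becoming a data scientist.

Tools

🔧data.table

📝A pandas-like data frame.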

CRAN

The package repository. Specifying a domestic (Japanese) mirror:

options (repos="http://cran.md.tsukuba.ac.jp")

Written in ~/.Rprofile, this is loaded at every startup.

How to use the CRAN domestic mirrors - RjpWiki

tags. 🔖ProgLang 🔖DataScience
