<!DOCTYPE html>

<html>

<head>

<title>Writing Efficient R Code</title>

window.onload = function() {

var imgs = document.getElementsByTagName('img'), i, img;

for (i = 0; i < imgs.length; i++) {

img = imgs[i];

// center an image if it is the only element of its parent

if (img.parentElement.childElementCount === 1)

img.parentElement.style.textAlign = 'center';

}

};

</script>

pre .operator,

pre .paren {

color: rgb(104, 118, 135)

}

pre .literal {

color: #990073

}

pre .number {

color: #099;

}

pre .comment {

color: #998;

font-style: italic

}

pre .keyword {

color: #900;

font-weight: bold

}

pre .identifier {

color: rgb(0, 0, 0);

}

pre .string {

color: #d14;

}

</style>

hljs.initHighlightingOnLoad();

</script>

</script>

body, td {

font-family: sans-serif;

background-color: white;

font-size: 13px;

}

body {

max-width: 800px;

margin: auto;

padding: 1em;

line-height: 20px;

}

tt, code, pre {

font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;

}

h1 {

font-size:2.2em;

}

h2 {

font-size:1.8em;

}

h3 {

font-size:1.4em;

}

h4 {

font-size:1.0em;

}

h5 {

font-size:0.9em;

}

h6 {

font-size:0.8em;

}

a:visited {

color: rgb(50%, 0%, 50%);

}

pre, img {

max-width: 100%;

}

pre {

overflow-x: auto;

}

pre code {

display: block; padding: 0.5em;

}

code {

font-size: 92%;

border: 1px solid #ccc;

}

code[class] {

background-color: #F8F8F8;

}

table, td, th {

border: none;

}

blockquote {

color:#666666;

margin:0;

padding-left: 1em;

border-left: 0.5em #EEE solid;

}

hr {

height: 0px;

border-bottom: none;

border-top-width: thin;

border-top-style: dotted;

border-top-color: #999999;

}

@media print {

* {

background: transparent !important;

color: black !important;

filter:none !important;

-ms-filter: none !important;

}

body {

font-size:12pt;

max-width:100%;

}

a, a:visited {

text-decoration: underline;

}

hr {

visibility: hidden;

page-break-before: always;

}

pre, blockquote {

padding-right: 1em;

page-break-inside: avoid;

}

tr, img {

page-break-inside: avoid;

}

img {

max-width: 100% !important;

}

@page :left {

margin: 15mm 20mm 15mm 10mm;

}

@page :right {

margin: 15mm 10mm 15mm 20mm;

}

p, h2, h3 {

orphans: 3; widows: 3;

}

h2, h3 {

page-break-after: avoid;

}

</style>

</head>

<body>

<h1>Writing Efficient R Code</h1>

Chris Paciorek, Department of Statistics, UC Berkeley

<h1>0) This Tutorial</h1>

This tutorial covers strategies for writing efficient R code by taking advantage of the underlying structure of how R works. In addition it covers tools and strategies for timing and profiling R code.

While some of the strategies covered here are specific to R, many are built on principles that can guide your coding in other languages.

You should be able to work through this tutorial in any working R installation, including through RStudio. To work through it using R linked to a fast linear algebra package, you may want to use a virtual machine developed here at Berkeley, the <a href="http://bce.berkeley.edu">Berkeley Common Environment (BCE)</a>. BCE is a virtual Linux machine - basically it is a Linux computer that you can run within your own computer, regardless of whether you are using Windows, Mac, or Linux. This provides a common environment so that things behave the same for all of us. However, note that BCE has not been updated in a while.

This tutorial assumes you have a working knowledge of R.

Materials for this tutorial, including the R Markdown file and associated code files that were used to create this document, are available on GitHub at <a href="https://github.com/berkeley-scf/tutorial-efficient-R">https://github.com/berkeley-scf/tutorial-efficient-R</a>. You can download the files by doing a git clone from a terminal window on a UNIX-like machine, as follows:

<pre><code class="r">git clone https://github.com/berkeley-scf/tutorial-efficient-R

</code></pre>

To create this HTML document, simply compile the corresponding R Markdown file in R as follows (the following will work from within BCE after cloning the repository as above).

<pre><code class="r">Rscript -e "library(knitr); knit2html('efficient-R.Rmd')"

</code></pre>

This tutorial by Christopher Paciorek is licensed under a Creative Commons Attribution 3.0 Unported License.

<h1>1) Background</h1>

In part because R is an interpreted language and in part because R

is very dynamic (objects can be modified essentially arbitrarily after

being created), R can be slow. Hadley Wickham's Advanced R book has

a section called Performance that discusses this in detail. However, there

are a variety of ways that one can write efficient R code.

In general, try to make use of R's built-in functions (including matrix

operations and linear algebra), as these tend to be implemented

internally (i.e., via compiled code in C or Fortran). Sometimes you

can figure out a trick to take your problem and transform it to make

use of the built-in functions.

Before you spend a lot of time trying to make your code go faster,

it's best to first write transparent, easy-to-read code to help avoid

bugs. Then if it doesn't run fast enough, time the different parts of

the code (profiling) to assess where the bottlenecks are. Concentrate

your efforts on those parts of the code. Try out different

specifications, checking that the results are the same as your

original code. And as you gain more experience, you'll get some

intuition for what approaches might improve speed, but even with

experience I find myself often surprised by what matters and what

doesn't.

Section 2 of this document discusses the use of fast linear algebra libraries, Section 3 discusses tools for timing and profiling code, and Section 4 discusses core strategies for writing efficient R code.

<h1>2) Fast linear algebra</h1>

One way to speed up a variety of operations in R (sometimes by as much as an order of magnitude) is to make sure your installation of R uses an optimized BLAS (Basic Linear Algebra Subprograms). The BLAS underlies all linear algebra, including costly calculations such as matrix-matrix multiplication and matrix decompositions such as the SVD and Cholesky decomposition. Some optimized BLAS packages are:

<ul>

<li>Intel's MKL</li>

<li>OpenBLAS</li>

<li>vecLib for Macs</li>

</ul>

To use an optimized BLAS, talk to your systems administrator, see <a href="https://cran.r-project.org/manuals.html">Section A.3 of the R Installation and Administration Manual</a>, or see <a href="http://statistics.berkeley.edu/computing/blas">these instructions to use vecLib BLAS on your own Mac</a>.

Any calls to BLAS or to the LAPACK libraries that use BLAS to do higher-level linear algebra calculations will be nearly as fast as if you used C/C++ or Matlab, because R is using the compiled code from the BLAS and LAPACK libraries.

In addition, the BLAS libraries above are threaded – they can use more than one core, and often will do so by default. More details in the tutorial on parallel programming.
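To see which linear algebra libraries your installation is using, something like the following minimal sketch may help. It assumes R 3.4.0 or later, where extSoftVersion reports the BLAS library and La_library reports the LAPACK library; either may be an empty string if R cannot determine the path.

<pre><code class="r"># Version of LAPACK that R is linked against
lapackVersion <- La_version()

# Paths to the BLAS and LAPACK shared libraries, when R can determine them
# (assumes R >= 3.4.0; the strings may be empty on some platforms)
blasPath <- extSoftVersion()[["BLAS"]]
lapackPath <- La_library()

print(c(LAPACK.version = lapackVersion, BLAS = blasPath, LAPACK = lapackPath))
</code></pre>

A path containing a name such as openblas or mkl suggests an optimized BLAS. sessionInfo() reports the same paths.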

<h1>3) Tools for assessing efficiency</h1>

<h2>3.1) Benchmarking</h2>

system.time is very handy for comparing the speed of different

implementations. Here's a basic comparison of the time to calculate the row means of a matrix using a for loop compared to the built-in rowMeans function.

<pre><code class="r">n <- 10000

m <- 1000

x <- matrix(rnorm(n*m), nrow = n)

system.time({

mns <- rep(NA, n)

for(i in 1:n) mns[i] <- mean(x[i , ])

})

</code></pre>

<pre><code>## user system elapsed

## 0.232 0.016 0.246

</code></pre>

<pre><code class="r">system.time(rowMeans(x))

</code></pre>

<pre><code>## user system elapsed

## 0.024 0.000 0.024

</code></pre>

In general, user time gives the CPU time spent by R and system time gives the CPU time spent by the kernel (the operating system) on behalf of R. Operations that fall under system time include opening files, doing input or output, starting other processes, etc.
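As a small illustration of the difference between CPU time and elapsed time, here is a sketch that times a call that mostly waits rather than computes. The 0.2 second pause is an arbitrary choice.

<pre><code class="r"># Sys.sleep() suspends R, so almost no CPU time is used while the clock runs
tm <- system.time(Sys.sleep(0.2))

# The result is a named numeric vector; extract components by name
cpuTime <- tm[["user.self"]] + tm[["sys.self"]]
wallTime <- tm[["elapsed"]]
print(c(cpu = cpuTime, elapsed = wallTime))
</code></pre>

Here the elapsed time should be about 0.2 seconds while the CPU time should be close to zero. For compute-bound code the two are usually similar, unless threaded code pushes the CPU time above the elapsed time.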

To time code that runs very quickly, you may want to use the microbenchmark

package. Of course one would generally only care about such timing if a larger operation does the quick calculation very many times. Here's a comparison of different ways of accessing an element of a data frame.

<pre><code class="r">library(microbenchmark)

df <- data.frame(vals = 1:3, labs = c('a','b','c'))

microbenchmark(

df[2,1],

df$vals[2],

df[2, 'vals']

)

</code></pre>

<pre><code>## Unit: microseconds
##           expr    min      lq     mean  median     uq    max neval cld
##       df[2, 1] 12.532 13.2260 13.79943 13.6360 13.946 18.980   100   b
##     df$vals[2]  6.731  7.9160  8.71348  8.3565  8.641 52.525   100  a
##  df[2, "vals"] 12.670 13.5355 14.36814 13.7850 14.136 44.447   100   b

</code></pre>

The rbenchmark package provides a nice wrapper function, benchmark,

that automates timings and comparisons.

<pre><code class="r">library(rbenchmark)

# speed of one calculation

n <- 1000

x <- matrix(rnorm(n^2), n)

benchmark(crossprod(x), replications = 10,

columns=c('test', 'elapsed', 'replications'))

</code></pre>

<pre><code>## test elapsed replications

## 1 crossprod(x) 0.481 10

</code></pre>

<pre><code class="r"># comparing different approaches to a task

benchmark(

{mns <- rep(NA, n); for(i in 1:n) mns[i] <- mean(x[i , ])},

rowMeans(x),

replications = 10,

columns=c('test', 'elapsed', 'replications'))

</code></pre>

<pre><code>## test

## 1 {\n mns <- rep(NA, n)\n for (i in 1:n) mns[i] <- mean(x[i, ])\n}

## 2 rowMeans(x)

## elapsed replications

## 1 0.196 10

## 2 0.024 10

</code></pre>

In general, it's a good idea to repeat (replicate) your timing, as there is some stochasticity in how fast your computer will run a piece of code at any given moment.
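If you'd rather not depend on a package, one simple way to replicate a timing uses base R only. This is a sketch; the matrix size and the five replications are arbitrary.

<pre><code class="r">x <- matrix(rnorm(2000 * 500), nrow = 2000)

# Repeat the timing and keep only the elapsed time from each run
elapsed <- replicate(5, system.time(rowMeans(x))[["elapsed"]])

# Summarize the run-to-run variability
print(summary(elapsed))
</code></pre>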

You might also check out the tictoc package.

<h2>3.2) Profiling</h2>

The Rprof function will show you how much time is spent in

different functions, which can help you pinpoint bottlenecks in your

code. The output from Rprof can be hard to decipher, so you

may want to use the proftools package functions, which make use of

Rprof under the hood.

Here's a function that works with a correlation matrix such as one

might have for time series data. Basically, it creates a matrix of time lags (dd) and

computes the correlation between the outcome for all pairs of times, based on the time

lag between the pair of times. Then it computes the Cholesky factor of the correlation

matrix so that it can generate a random time series in the last line. The question

one might ask is which part(s) of the code take the most time.

<pre><code class="r">makeTS <- function(param, len){

times <- seq(0, 1, length = len)

dd <- rdist(times)

C <- exp(-dd/param)

U <- chol(C)

white <- rnorm(len)

return(crossprod(U, white))

}

## old approach:

library(fields)

if(FALSE) { # not running this, just for illustration

Rprof("makeTS.prof", interval = 0.005, line.profiling = TRUE)

out <- makeTS(0.1, 3000)

Rprof(NULL)

summaryRprof("makeTS.prof")

}

## using proftools instead:

library(proftools)

pd <- profileExpr(makeTS(0.1, 3000))

hotPaths(pd)

</code></pre>

Here's the result for the makeTS function:

<pre><code> path                       total.pct self.pct
 makeTS                     100.00     0.00
 . ?? (#File 1: :4)          56.06    56.06
 . chol (#File 1: :5)        30.30     0.00
 . . standardGeneric         30.30     0.00
 . . . chol                  30.30     0.00
 . . . . chol.default        30.30    30.30
 . rdist (#File 1: :3)       12.12     0.00
 . . .Call                   12.12    12.12
 . crossprod (#File 1: :7)    1.52     0.00
 . . crossprod                1.52     0.00
 . . . base::crossprod        1.52     1.52

</code></pre>

Note the nesting of the results. For example, 12 percent of the time was spent in the call to rdist, essentially all of it in .Call, which is a call out to C code.

In this case, the results are not fully helpful, as 56 percent of the time is spent in other computations within makeTS that are not shown individually (see the “??” line).

As we increase the number of time points,

the time taken up by the Cholesky would increase since that calculation

is order of $n^{3}$ while the others are order $n^{2}$ (more in the Linear Algebra unit).

In this case, since the Cholesky and the main calculations in rdist, as well as exp,

are all done in compiled C or Fortran code, there is probably not much we can do

to speed this up (apart from using an optimized BLAS, which is essential). But in other cases profiling may reveal the slow steps in a piece of code.
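To get a rough feel for that scaling, here is a sketch that times the Cholesky step alone for two sizes. It assumes the same exponential correlation as in makeTS but builds the distance matrix with base R's outer instead of rdist. The helper name makeCorr and the sizes 500 and 1000 are arbitrary choices made for illustration. In theory doubling the number of time points multiplies the cost of the Cholesky by about eight, but for matrices this small the ratio you observe may be quite different.

<pre><code class="r"># Exponential correlation matrix on an equally-spaced grid
# (outer() stands in for fields::rdist here)
makeCorr <- function(len, param = 0.1) {
  times <- seq(0, 1, length.out = len)
  exp(-abs(outer(times, times, "-")) / param)
}

for (len in c(500, 1000)) {
  C <- makeCorr(len)
  tm <- system.time(U <- chol(C))[["elapsed"]]
  cat("n =", len, " elapsed for chol:", tm, "seconds\n")
}

# U is upper triangular with t(U) %*% U equal to C (up to rounding)
print(all.equal(crossprod(U), C))
</code></pre>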

Note that Rprof works by sampling - every little while (the interval argument) during a calculation it finds out what function R is in and saves that information to the file given as the argument to Rprof. So if you try to profile code that finishes really quickly, there's not enough opportunity for the sampling to represent the calculation accurately and you may get spurious results.
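To see the sampling at work, here is a minimal sketch that profiles a loop kept busy for about half a second and then reports roughly how many samples were taken. The temporary file, the 0.01 second interval, and the loop body are arbitrary choices, and the sketch assumes your R was built with profiling support (the default).

<pre><code class="r">profFile <- tempfile(fileext = ".prof")

Rprof(profFile, interval = 0.01)
start <- proc.time()[["elapsed"]]
# Keep R busy for roughly half a second so there is something to sample
while (proc.time()[["elapsed"]] - start < 0.5) {
  tmp <- sort(runif(1e4))
}
Rprof(NULL)

prof <- summaryRprof(profFile)
# Approximate number of samples = time sampled / sampling interval
nSamples <- prof$sampling.time / prof$sample.interval
print(nSamples)
</code></pre>

With only a few dozen samples, the percentages attributed to individual functions are coarse, and code that finishes in a few milliseconds may produce no samples at all.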

You might also check out profvis as an alternative way of displaying the profiling information generated by Rprof.

Warning: Rprof conflicts with threaded linear algebra,

so you may need to set OMP_NUM_THREADS to 1 to disable threaded

linear algebra if you profile code that involves linear algebra.

<h1>4) Strategies for improving efficiency</h1>

<h2>4.1) Pre-allocate memory</h2>

It is very inefficient to iteratively add elements to a vector, matrix,

data frame, array or list (e.g., using c, cbind,

rbind, etc. to add elements one at a time). Instead, create the full object in advance

(this is equivalent to variable initialization in compiled languages)

and then fill in the appropriate elements. The reason is that when

R appends to an existing object, it creates a new copy and as the

object gets big, most of the computation involves the repeated

memory allocation to create the new objects. Here's

an illustrative example, but of course we would not fill a vector

like this using loops because we would in practice use vectorized calculations.

<pre><code class="r">n <- 10000

z <- rnorm(n)

fun1 <- function(vals) {

n <- length(vals)  # use the length of the input rather than the global n

x <- exp(vals[1])

for(i in 2:n) x <- c(x, exp(vals[i]))

return(x)

}

fun2 <- function(vals) {

n <- length(vals)

x <- rep(as.numeric(NA), n)

for(i in 1:n) x[i] <- exp(vals[i])

return(x)

}

fun3 <- function(vals) {

x <- exp(vals)

return(x)

}

benchmark(fun1(z), fun2(z), fun3(z),

replications = 20, columns=c('test', 'elapsed', 'replications'))

</code></pre>

<pre><code>## test elapsed replications

## 1 fun1(z) 2.347 20

## 2 fun2(z) 0.021 20

## 3 fun3(z) 0.006 20

</code></pre>

It's not necessary to use as.numeric above though it saves

a bit of time. Challenge: figure out why I have <code>as.numeric(NA)</code>

and not just <code>NA</code>.

In some cases, we can speed up the initialization by initializing a vector of length one and then changing its length and/or dimension, although in many practical

circumstances this would be overkill.

For example, for matrices, start with a vector of length one, change the length, and then change the dimensions:

<pre><code class="r">nr <- nc <- 2000

benchmark(

x <- matrix(as.numeric(NA), nr, nc),

{x <- as.numeric(NA); length(x) <- nr * nc; dim(x) <- c(nr, nc)},

replications = 10, columns=c('test', 'elapsed', 'replications'))

</code></pre>

<pre><code>## test

## 2 {\n x <- as.numeric(NA)\n length(x) <- nr * nc\n dim(x) <- c(nr, nc)\n}

## 1 x <- matrix(as.numeric(NA), nr, nc)

## elapsed replications

## 2 0.130 10

## 1 0.275 10

</code></pre>

For lists, we can do this:

<pre><code class="r">myList <- vector("list", length = n)

</code></pre>
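Once the list exists, you can fill in its elements by position. A small sketch (the contents of each element are arbitrary):

<pre><code class="r">n <- 5
myList <- vector("list", length = n)   # n empty (NULL) slots

# Fill each pre-allocated slot; elements may have different lengths
for (i in seq_len(n)) {
  myList[[i]] <- seq_len(i)
}
print(lengths(myList))
</code></pre>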

<h2>4.2) Vectorized calculations</h2>

One key way to write efficient R code is to take advantage of R's

vectorized operations.

<pre><code class="r">n <- 1e6

x <- rnorm(n)

benchmark(

x2 <- x^2,

{ x2 <- as.numeric(NA)

length(x2) <- n

for(i in 1:n) { x2[i] <- x[i]^2 } },

replications = 10, columns=c('test', 'elapsed', 'replications'))

</code></pre>

<pre><code>## test

## 2 {\n x2 <- as.numeric(NA)\n length(x2) <- n\n for (i in 1:n) {\n x2[i] <- x[i]^2\n }\n}

## 1 x2 <- x^2

## elapsed replications

## 2 11.231 10

## 1 0.017 10

</code></pre>

So what is different in how R handles the calculations above that

explains the huge disparity in efficiency? The vectorized calculation is being done natively

in C in a for loop. The explicit R for loop involves executing the for

loop in R with repeated calls to C code at each iteration. This involves a lot

of overhead because of the repeated processing of the R code inside the loop. For example,

in each iteration of the loop, R is checking the types of the variables because it's possible

that the types might change, such as in this loop:

<pre><code>x <- 3

for( i in 1:n ) {

if(i == 7) {

x <- 'foo'

}

y <- x^2

}

</code></pre>

You can

usually get a sense for how quickly an R call will pass things along

to C or Fortran by looking at the body of the relevant function(s) being called

and looking for .Primitive, .Internal, .C, .Call,

or .Fortran. Let's take a look at the code for <code>+</code>,

mean.default, and chol.default.

<pre><code class="r">`+`

</code></pre>

<pre><code>## function (e1, e2) .Primitive("+")

</code></pre>

<pre><code class="r">mean.default

</code></pre>

<pre><code>## function (x, trim = 0, na.rm = FALSE, ...)

## {

## if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {

## warning("argument is not numeric or logical: returning NA")

## return(NA_real_)

## }

## if (na.rm)

## x <- x[!is.na(x)]

## if (!is.numeric(trim) || length(trim) != 1L)

## stop("'trim' must be numeric of length one")

## n <- length(x)

## if (trim > 0 && n) {

## if (is.complex(x))

## stop("trimmed means are not defined for complex data")

## if (anyNA(x))

## return(NA_real_)

## if (trim >= 0.5)

## return(stats::median(x, na.rm = FALSE))

## lo <- floor(n * trim) + 1

## hi <- n + 1 - lo

## x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]

## }

## .Internal(mean(x))

## }

## <bytecode: 0x2cb7598>

## <environment: namespace:base>

</code></pre>

<pre><code class="r">chol.default

</code></pre>

<pre><code>## function (x, pivot = FALSE, LINPACK = FALSE, tol = -1, ...)

## {

## if (is.complex(x))

## stop("complex matrices not permitted at present")

## .Internal(La_chol(as.matrix(x), pivot, tol))

## }

## <bytecode: 0x6a3f188>

## <environment: namespace:base>

</code></pre>
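You can also check this programmatically rather than by eye. The helper callsInternal below is a hypothetical name used for illustration; it simply searches the deparsed body of a function for .Internal.

<pre><code class="r"># Primitive functions are implemented entirely in C and have no R body
print(is.primitive(`+`))

# Hypothetical helper: does the R code of a closure mention .Internal?
callsInternal <- function(f) {
  any(grepl(".Internal(", deparse(body(f)), fixed = TRUE))
}
print(callsInternal(mean.default))
print(callsInternal(chol.default))
</code></pre>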

Many R functions allow you to pass in vectors, and operate on those

vectors in vectorized fashion. So before writing a for loop, look

at the help information on the relevant function(s) to see if they

operate in a vectorized fashion. Functions might take vectors for one or more of their arguments.

<pre><code class="r">address <- c("Four score and seven years ago our fathers brought forth",

" on this continent, a new nation, conceived in Liberty, ",

"and dedicated to the proposition that all men are created equal.")

nchar(address)

</code></pre>

<pre><code>## [1] 56 56 64

</code></pre>

<pre><code class="r"># use a vector in the 2nd and 3rd arguments, but not the first

startIndices = seq(1, by = 3, length = nchar(address[1])/3)

startIndices

</code></pre>

<pre><code>## [1] 1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55

</code></pre>

<pre><code class="r">substring(address[1], startIndices, startIndices + 1)

</code></pre>

<pre><code>## [1] "Fo" "r " "co" "e " "nd" "se" "en" "ye" "rs" "ag" " o" "r " "at" "er"

## [15] " b" "ou" "ht" "fo" "th"

</code></pre>

Challenge: Consider the chi-squared statistic involved in

a test of independence in a contingency table:

$$\chi^{2}=\sum_{i}\sum_{j}\frac{(y_{ij}-e_{ij})^{2}}{e_{ij}},\,\,\,\, e_{ij}=\frac{y_{i\cdot}y_{\cdot j}}{y_{\cdot\cdot}}$$

where $y_{i\cdot}=\sum_{j}y_{ij}$, $y_{\cdot j} = \sum_{i} y_{ij}$, and $y_{\cdot\cdot}=\sum_{i}\sum_{j}y_{ij}$. Write this in a vectorized way without any loops. Note that 'vectorized' calculations also work with matrices and arrays.

Built-in functions are not always the fastest choice (ifelse in particular is notoriously slow): a simple vectorized operation can sometimes beat them, and a clever vectorized calculation can be faster still, though the code is sometimes uglier. Here's an example of setting all negative values in a vector to zero.

<pre><code class="r">x <- rnorm(1000000)

benchmark(

truncx <- ifelse(x > 0, x, 0),

{truncx <- x; truncx[x < 0] <- 0},

truncx <- x * (x > 0),

replications = 10, columns=c('test', 'elapsed', 'replications'))

</code></pre>

<pre><code>## test elapsed replications

## 1 truncx <- ifelse(x > 0, x, 0) 1.571 10

## 2 {\n truncx <- x\n truncx[x < 0] <- 0\n} 0.164 10

## 3 truncx <- x * (x > 0) 0.083 10

</code></pre>

Additional tips:

<ul>

<li>If you do need to loop over dimensions of a matrix or array, if possible

loop over the smallest dimension and use the vectorized calculation

on the larger dimension(s). For example if you have a 10000 by 10 matrix, try to set

up your problem so you can loop over the 10 columns rather than the 10000 rows.</li>

<li>In general, looping over columns is likely to be faster than looping over rows

given R's column-major ordering (matrices are stored in memory as a long array in which values in a column are adjacent to each other) (see more in Section 4.6 on the cache).</li>

<li>You can use direct arithmetic operations to add/subtract/multiply/divide

a vector by each column of a matrix, e.g. <code>A*b</code> does element-wise multiplication of

each column of A by a vector b. If you need to operate

by row, you can do it by transposing the matrix. </li>

</ul>
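As a small illustration of the last tip, here is a sketch using a 3 by 2 matrix; the values are arbitrary.

<pre><code class="r">A <- matrix(1:6, nrow = 3)      # columns are (1, 2, 3) and (4, 5, 6)
b <- c(10, 100, 1000)           # one value per row of A

# Each column of A is multiplied element-wise by b
byCol <- A * b

# To multiply each row element-wise by a vector with one value per column,
# transpose, multiply, and transpose back
w <- c(1, -1)
byRow <- t(t(A) * w)

print(byCol)
print(byRow)
</code></pre>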

Caution: relying on R's recycling rule in the context of vectorized operations, such as when direct-multiplying a matrix by a vector to scale the rows relative to each other, can be dangerous because the code is not transparent and is more prone to bugs. In some cases you may want to first write the code transparently and then check that the more efficient code gives the same results. It's also a good idea to comment your code in such cases.
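Here is a sketch of how recycling can silently give the wrong answer. The intent is to scale each column of a 4 by 2 matrix by its own weight; sweep serves as the transparent reference.

<pre><code class="r">A <- matrix(1, nrow = 4, ncol = 2)
w <- c(1, 2)                       # intended: one weight per column

# No error or warning: w is recycled down the columns, not across them
wrong <- A * w

# Transparent version and an efficient version that matches it
reference <- sweep(A, 2, STATS = w, FUN = "*")
efficient <- t(t(A) * w)

print(isTRUE(all.equal(wrong, reference)))
print(isTRUE(all.equal(efficient, reference)))
</code></pre>

The first comparison fails and the second succeeds, which is exactly the kind of check that catches a recycling mistake before it propagates.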

<h2>4.3) Using apply and specialized functions</h2>

Another core efficiency strategy is to use the apply functionality. Even better than apply for calculating sums or means of columns or rows are rowSums, colSums, rowMeans, and colMeans, which can also be used with arrays.

<pre><code class="r">n <- 3000; x <- matrix(rnorm(n * n), nrow = n)

benchmark(

out <- apply(x, 1, mean),

out <- rowMeans(x),

replications = 10, columns=c('test', 'elapsed', 'replications'))

</code></pre>

<pre><code>## test elapsed replications

## 1 out <- apply(x, 1, mean) 2.615 10

## 2 out <- rowMeans(x) 0.220 10

</code></pre>
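These functions also accept arrays through their dims argument. A short sketch with an arbitrary 2 by 3 by 4 array:

<pre><code class="r">arr <- array(1:24, dim = c(2, 3, 4))

# Sum over all but the first dimension: one value per 'row'
s1 <- rowSums(arr)                 # same as rowSums(arr, dims = 1)

# Average over the first two dimensions: one value per slice arr[ , , k]
m3 <- colMeans(arr, dims = 2)

# Check against the equivalent, slower apply() calls
print(all.equal(s1, as.numeric(apply(arr, 1, sum))))
print(all.equal(m3, apply(arr, 3, mean)))
</code></pre>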

We can 'sweep' out a summary statistic, such as subtracting off a mean from each column, using sweep:

<pre><code class="r">system.time(out <- sweep(x, 2, STATS = colMeans(x), FUN = "-"))

</code></pre>

<pre><code>## user system elapsed

## 0.124 0.040 0.162

</code></pre>

Here's a trick for doing the sweep based on vectorized calculations, remembering

that if we subtract a vector from a matrix, it subtracts each element

of the vector from all the elements in the corresponding ROW. Hence the

need to transpose twice.

<pre><code class="r">system.time(out2 <- t(t(x) - colMeans(x)))

</code></pre>

<pre><code>## user system elapsed

## 0.276 0.048 0.324

</code></pre>

<pre><code class="r">identical(out, out2)

</code></pre>

<pre><code>## [1] TRUE

</code></pre>

<h3>Are apply, lapply, sapply, etc. faster than loops?</h3>

Using apply with matrices and versions of apply with lists may or may not be faster

than looping but generally produces cleaner code. Whether looping

is slower will depend on whether a substantial part of the work is

in the overhead involved in the looping or in the time required by the function

evaluation on each of the elements. If you're worried about speed,

it's a good idea to benchmark the apply variant against looping.

Here's an example where apply is not faster than a loop. Similar

examples can be constructed where lapply or sapply are not faster

than writing a loop.

<pre><code class="r">n <- 500000; nr <- 10000; nCalcs <- n/nr

mat <- matrix(rnorm(n), nrow = nr)

times <- 1:nr

system.time(

out1 <- apply(mat, 2, function(vec) {

mod = lm(vec ~ times)

return(mod$coef[2])

}))

</code></pre>

<pre><code>## user system elapsed

## 0.288 0.016 0.302

</code></pre>

<pre><code class="r">system.time({

out2 <- rep(NA, nCalcs)

for(i in 1:nCalcs){

out2[i] = lm(mat[ , i] ~ times)$coef[2]

}

})

</code></pre>

<pre><code>## user system elapsed

## 0.300 0.016 0.312

</code></pre>

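And here's an example where sapply comes out much faster in the timings below. The core function evaluation at each iteration is very fast, so looping overhead matters, but interpret the gap with care: the loop in fun2 runs over the global n (set to 500000 above) rather than over length(vals), which is 10000, so x is repeatedly extended past its pre-allocated length, and that accounts for at least part of the difference. For a like-for-like comparison, loop over seq_along(vals) and rerun the benchmark.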

<pre><code class="r">z <- rnorm(10000)

fun2 <- function(vals) {

x <- as.numeric(NA)

length(x) <- length(vals)

for(i in 1:n) x[i] <- exp(vals[i])  # caution: n is the global n (500000 here), not length(vals)

return(x)

}

fun4 <- function(vals) {

x <- sapply(vals, exp)

return(x)

}

benchmark(fun2(z), fun4(z),

replications = 10, columns=c('test', 'elapsed', 'replications'))

</code></pre>

<pre><code>## test elapsed replications

## 1 fun2(z) 2.302 10

## 2 fun4(z) 0.032 10

</code></pre>
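For a like-for-like comparison, here is a sketch in which the loop runs over the elements of its input rather than over a global n. The names loopExp and sapplyExp are hypothetical, and no timings are shown because they depend on your machine and version of R, so run the comparison yourself.

<pre><code class="r">z <- rnorm(10000)

loopExp <- function(vals) {
  x <- rep(as.numeric(NA), length(vals))      # pre-allocate
  for (i in seq_along(vals)) x[i] <- exp(vals[i])
  x
}
sapplyExp <- function(vals) sapply(vals, exp)

# Both approaches give the same answer as the vectorized call
print(all.equal(loopExp(z), exp(z)))
print(all.equal(sapplyExp(z), exp(z)))

# Base R timing; benchmark() from rbenchmark could be used instead
print(system.time(for (r in 1:10) loopExp(z)))
print(system.time(for (r in 1:10) sapplyExp(z)))
</code></pre>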

You'll notice if you look at the R code for lapply (sapply just calls lapply) that it calls directly out to C code, so the for loop is executed in compiled code.

<pre><code class="r">print(lapply)

</code></pre>

<pre><code>## function (X, FUN, ...)

## {

## FUN <- match.fun(FUN)

## if (!is.vector(X) || is.object(X))

## X <- as.list(X)

## .Internal(lapply(X, FUN))

## }

## <bytecode: 0xa2fd48>

## <environment: namespace:base>

</code></pre>

<h2>4.4) Matrix algebra efficiency</h2>

Often calculations that are not explicitly linear algebra calculations

can be done as matrix algebra. For example, we can sum the rows of a matrix by multiplying by a vector of ones. Given the extra computation involved in actually multiplying each number by one, it's surprising that this is faster than using R's heavily optimized rowSums function. It might be at least partially related to cache effects as colSums is more than twice as fast as rowSums in this case.

<pre><code class="r">mat <- matrix(rnorm(500*500), 500)

benchmark(apply(mat, 1, sum),

mat %*% rep(1, ncol(mat)),

rowSums(mat),

replications = 10, columns=c('test', 'elapsed', 'replications'))

</code></pre>

<pre><code>## test elapsed replications

## 1 apply(mat, 1, sum) 0.038 10

## 2 mat %*% rep(1, ncol(mat)) 0.002 10

## 3 rowSums(mat) 0.006 10

</code></pre>
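When swapping one approach for another it's worth confirming that they agree. A small sketch, noting that the matrix product returns a one-column matrix rather than a vector:

<pre><code class="r">mat <- matrix(rnorm(500 * 500), 500)

viaProduct <- mat %*% rep(1, ncol(mat))   # 500 x 1 matrix
viaRowSums <- rowSums(mat)                # plain vector of length 500

print(dim(viaProduct))
# drop() removes the extra dimension; the values agree up to rounding error
print(all.equal(drop(viaProduct), viaRowSums))
</code></pre>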

On the other hand, big matrix operations can be slow. Challenge: Suppose you

want a new matrix that computes the differences between successive

columns of a matrix of arbitrary size. How would you do this as matrix

algebra operations? It's possible to write it as multiplying the matrix

by another matrix that contains 0s, 1s, and -1s in appropriate places.

Here it turns out that the

for loop is much faster than matrix multiplication. However,

there is a way to do it faster as matrix direct subtraction.

When doing matrix algebra, the order in which you do operations can

be critical for efficiency. How should I order the following calculation?

<pre><code class="r">n <- 5000

A <- matrix(rnorm(n * n), n)

B <- matrix(rnorm(n * n), n)

x <- rnorm(n)

system.time(

res1 <- A %*% B %*% x

)

</code></pre>

<pre><code>## user system elapsed

## 22.203 5.396 3.966

</code></pre>

<pre><code class="r">system.time(

res2 <- A %*% (B %*% x)

)

</code></pre>

<pre><code>## user system elapsed

## 0.209 0.000 0.209

</code></pre>

Why is the second order much faster?
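A rough way to see the answer is to count multiplications. <code>%*%</code> is evaluated left to right, so the first version forms the full $n \times n$ product of the two matrices (on the order of $n^3$ multiplications) before multiplying by the vector, while the second version only ever does two matrix-vector products (on the order of $n^2$ each). The sketch below uses a smaller <code>n</code> than above, chosen by me so that it runs quickly, and the counts ignore additions and any BLAS-level optimizations.

```r
n <- 500  # smaller than the 5000 used above so this runs quickly
set.seed(1)
A <- matrix(rnorm(n^2), n)
B <- matrix(rnorm(n^2), n)
x <- rnorm(n)

# %*% associates left to right, so this forms the n x n product A %*% B first
res1 <- A %*% B %*% x
# Parentheses force two matrix-vector products instead
res2 <- A %*% (B %*% x)

all.equal(res1, res2)  # same answer up to floating point error

# Rough multiplication counts for each ordering
multsLeftToRight <- n^3 + n^2  # matrix-matrix, then matrix-vector
multsVectorFirst <- 2 * n^2    # two matrix-vector products
multsLeftToRight / multsVectorFirst
```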

We can use the matrix direct product (i.e., <code>A*B</code>) to do

some manipulations much more quickly than using matrix multiplication.

Challenge: How can I use the direct product to find the trace

of a matrix, $XY$?
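One possible answer, sketched under the assumption that $X$ and $Y$ are both square matrices of the same size: since the $i$th diagonal element of $XY$ is $\sum_j X_{ij} Y_{ji}$, the trace is the sum of all elements of the direct product of $X$ and the transpose of $Y$. The variable names below are illustrative.

```r
set.seed(1)
n <- 200
X <- matrix(rnorm(n^2), n)
Y <- matrix(rnorm(n^2), n)

# Naive: form the full product (order n^3 work), then sum its diagonal
trNaive <- sum(diag(X %*% Y))

# Direct product: tr(XY) = sum_i sum_j X[i, j] * Y[j, i] (order n^2 work)
trDirect <- sum(X * t(Y))

all.equal(trNaive, trDirect)
```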

Finally, when working with diagonal matrices, you can generally get much faster results by exploiting their structure. The operations $X+D$, $DX$, and $XD$ are mathematically the sum of two matrices and products of two matrices, but we can do the computations without ever forming $D$ as a full matrix.

Challenge: How?

<pre><code class="r">n <- 1000

X <- matrix(rnorm(n^2), n)

diagvals <- rnorm(n)

D <- diag(diagvals)

# the following lines are very inefficient

summedMat <- X + D

prodMat1 <- D %*% X

prodMat2 <- X %*% D

# How can we do each of those operations much more quickly?

</code></pre>
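One possible set of answers, as a sketch that relies on R's column-major storage and vector recycling. The names <code>summedFast</code>, <code>prodFast1</code>, and <code>prodFast2</code> are mine, and <code>n</code> is reduced so the check against the full-matrix versions runs quickly.

```r
set.seed(1)
n <- 300
X <- matrix(rnorm(n^2), n)
diagvals <- rnorm(n)
D <- diag(diagvals)  # formed here only to check the fast versions

# X + D: only the diagonal of X changes
summedFast <- X
diag(summedFast) <- diag(X) + diagvals

# D %*% X scales row i of X by diagvals[i]; recycling a length-n vector
# down the columns of an n x n matrix does exactly that
prodFast1 <- diagvals * X

# X %*% D scales column j of X by diagvals[j]
prodFast2 <- X * rep(diagvals, each = n)

all.equal(summedFast, X + D)
all.equal(prodFast1, D %*% X)
all.equal(prodFast2, X %*% D)
```

Each fast version does on the order of $n^2$ (or, for the sum, $n$) operations rather than the $n^3$ of a general matrix multiplication.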

More generally, sparse matrices and structured matrices (such as block diagonal matrices) can be handled MUCH more efficiently with specialized methods than by treating them as arbitrary dense matrices. The R packages spam (for arbitrary sparse matrices), bdsmatrix (for block-diagonal matrices), and Matrix (for a variety of sparse matrix types) can help, as can specialized code available in other languages, such as C and Fortran packages.

<h2>4.5) Fast mapping/lookup tables</h2>

Sometimes you need to map between two vectors. For example, $y_{ij}\sim\mathcal{N}(\mu_{j},\sigma^{2})$ is a basic ANOVA-type structure, where multiple observations in group $j$ are associated with a common mean, $\mu_j$.

How can we quickly look up the mean associated with each observation? A good strategy is to create a vector, <code>grp</code>, that gives a numeric mapping of the observations to their group. Then you can access the $\mu$ value relevant for each observation as <code>mus[grp]</code>. This requires that <code>grp</code> map to the right elements of <code>mus</code>.

The <code>match</code> function can help in creating numeric indices that can then be used for lookups. Here's how you would create an index vector, <code>grp</code>, if it doesn't already exist.

<pre><code class="r">df <- data.frame(

id = 1:5,

clusterLabel = c('C', 'B', 'B', 'A', 'C'))

info <- data.frame(

grade = c('A', 'B', 'C'),

numGrade = c(95, 85, 75),

fail = c(FALSE, FALSE, TRUE) )

grp <- match(df$clusterLabel, info$grade)

df$numGrade <- info$numGrade[grp]

df

</code></pre>

<pre><code>## id clusterLabel numGrade

## 1 1 C 75

## 2 2 B 85

## 3 3 B 85

## 4 4 A 95

## 5 5 C 75

</code></pre>
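The same pattern applies to the ANOVA-type structure described at the start of this section. The sketch below uses made-up group means and labels (the values in <code>mus</code>, <code>groupLabel</code>, and <code>sigma</code> are hypothetical) to show <code>mus[grp]</code> supplying one mean per observation.

```r
set.seed(1)
# Hypothetical group means, named by group label
mus <- c(A = 10, B = 20, C = 30)
groupLabel <- c('C', 'B', 'B', 'A', 'C')

# Integer index of each observation's group within mus
grp <- match(groupLabel, names(mus))

# Vectorized lookup: one mean per observation, no loop needed
muObs <- mus[grp]

# Simulate y_ij ~ N(mu_j, sigma^2) using the per-observation means
sigma <- 1
y <- rnorm(length(groupLabel), mean = muObs, sd = sigma)
```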

R allows you to look up elements of a vector by name. For example:

<pre><code class="r">vals <- rnorm(10)

names(vals) <- letters[1:10]

select <- c("h", "h", "a", "c")

vals[select]

</code></pre>

<pre><code>## h h a c

## 0.2511293 0.2511293 -0.4466439 0.2898673

</code></pre>

You can similarly look up by name using the dimension names of matrices and arrays, the row and column names of data frames, and the names of lists.
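A brief sketch of each of those cases, using small made-up objects (all names and values here are illustrative):

```r
# Matrix with row and column names
m <- matrix(1:6, nrow = 2,
            dimnames = list(c('r1', 'r2'), c('a', 'b', 'c')))
m['r2', c('c', 'a')]

# Data frame: rows by row name, columns by column name
d <- data.frame(score = c(90, 80), row.names = c('ann', 'bob'))
d['bob', 'score']

# Named list
lst <- list(alpha = 1:3, beta = 'text')
lst[['beta']]
```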

However, looking things up by name can be slow relative to looking up by index.

Here's a toy example where we have a vector or list with a million elements and

the character names of the elements are just the character versions of the

indices of the elements.

<pre><code class="r">n <- 1000000

x <- 1:n

xL <- as.list(x)

nms <- as.character(x)

names(x) <- nms

names(xL) <- nms

benchmark(

x[500000], # index lookup in vector

x["500000"], # name lookup in vector

xL[[500000]], # index lookup in list

xL[["500000"]], # name lookup in list

replications = 10, columns=c('test', 'elapsed', 'replications'))

</code></pre>

<pre><code>## test elapsed replications

## 2 x["500000"] 0.062 10

## 1 x[5e+05] 0.000 10

## 4 xL[["500000"]] 0.058 10

## 3 xL[[5e+05]] 0.000 10

</code></pre>

Lookup by name is slow because R needs to scan through the names one by one until it finds the one it is looking for. In contrast, to look up by index, R can go directly to the position of interest.

Lookup by name in an environment, however, is very fast, because environments in R use hashing, which allows for fast lookup that does not require scanning through all of the names in the environment. In fact, this is how R itself looks for values when you specify variables in R code.

<pre><code class="r">xEnv <- as.environment(xL) # convert from a named list

xEnv$"500000"

</code></pre>

<pre><code>## [1] 500000

</code></pre>

<pre><code class="r"># Quotes are needed above because the name starts with a digit; otherwise xEnv$nameOfObject is fine

xEnv[["500000"]]

</code></pre>

<pre><code>## [1] 500000

</code></pre>
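An environment can also be built up incrementally and used as a lookup table, rather than converted from an existing list. The following is a minimal sketch using base R functions; the table <code>gradeEnv</code> and its contents are hypothetical.

```r
# Hypothetical lookup table mapping grade labels to numeric grades
gradeEnv <- new.env(hash = TRUE)
assign('A', 95, envir = gradeEnv)
assign('B', 85, envir = gradeEnv)
gradeEnv[['C']] <- 75  # equivalent to assign()

# Single lookups
get('B', envir = gradeEnv)
gradeEnv[['C']]

# Several lookups at once; mget returns a named list
looked <- unlist(mget(c('C', 'B', 'B', 'A'), envir = gradeEnv))

# Check for a key without searching enclosing environments
exists('D', envir = gradeEnv, inherits = FALSE)
```

Note that environments are modified in place rather than copied on assignment, which is part of why they suit this role but also means changes are visible wherever the environment is referenced.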

<pre><code class="r">benchmark(

x[500000],

xL[[500000]],

xEnv[["500000"]],

xEnv$"500000",

replications = 10000, columns=c('test', 'elapsed', 'replications'))

</code></pre>

<pre><code>## test elapsed replications

## 1 x[5e+05] 0.026 10000
