Thursday, May 11, 2017

Conditional probability, Bayes Theorem (and how it’s derived)

Given two events A and B, with their respective probabilities P(A) and P(B), the conditional probability of A given B (the probability of event A with the knowledge that event B occurred) is defined as:

The intuition behind is: We already know that event B occurred, therefore we cannot consider anymore all possible outcomes - our outcome space reduces to just B. In the actual reduced outcome space we must consider the event B to be sure thing (i.e. P*(B)=100%), so we have to scale down all theoretical probabilities by dividing by P(B). Therefore also probability of intersection A∩B is divided by P(B).

Two events A and B are independent (by definition) if:
It follows that if two events are independent, then probability of intersection A∩B is equal to product of the two probabilities:
Note that independence is symmetric:
If two events A and B are independent (and P(A),P(B)>0), they can still occur in one trial simultaneously. If it is the case that the two events cannot happen in one trial simultaneously, they are said to be mutually exclusive (disjoint), then:
The Bayes theorem puts into relation conditional probabilities P(A|B) and P(B|A):
Bayes theorem can be derived from the condition probabilities:
It is also useful to rewrite the theorem with the complement rule:


Typical example of Bayes theorem application is the cancer screening case. Event A is defined as cancer disease and event B is defined as positive result of the screening test. Suppose that prevalence of cancer in the population is P(A)=1%. Suppose that there is knowledge that for a person with cancer, the screening test is positive at P(B|A)=80% and for a healthy person it tends to be positive at P(B|Ac)=9.6%. The question is: what is the chance of having cancer if the test was positive?

Tuesday, May 2, 2017

Large data in R (the ff package)

Since the R runtime environment is storing all data structures in RAM memory and it also performs all operations in there, it might consume whole available free memory and stop execution with a memory related error (e.g.: Error: cannot allocate vector of length…). This is sometimes the case when large data needs to be processed.
 
To see the current consumption of memory by R:
 
i=1:1e9 # try to create an one billion long integer vector (integer is stored in 4 bytes)
format(object.size(x=i),units="Mb") # (approximate) size of the object (in Mb)
memory.size(max=FALSE) # memory amount currently in use by R (Mb)
memory.size(max=TRUE) # maximum memory amount used during R session (Mb)
 
The ff package enables working with large data. In the core stands the ff class which represents large data (vector or array/matrix…). It is implemented in such way, that raw data are stored as a flat file on disk and the R object contains merely metadata (dimensions, virtual storage modes etc.) When operating on the ff class object, data are exchanged between disk and memory via “memory pages”, size of which depends on the setup (pagesize option).
 
To see and change the size of memory page:
 
library("ff")
getOption("ffpagesize") # see page size (in b)
options(ffpagesize=2^16) # set page size (in b)
 
The ff class object can be of one of standard atomic types (logical, integer, double, raw), but also some nonstandard types are supported via the virtual storage mode functionality (vmode).
 
To see currently supported virtual modes:
 
data.frame(MODE=names(.vmode),IMPLEMENTED=.vimplemented[names(.vmode)],STORAGE=.rammode[names(.vmode)],MIN=.vmin[names(.vmode)],MAX=.vmax[names(.vmode)],NAs=ifelse(is.na(.vNA[names(.vmode)]),"Yes","No"),RAMbytes=.rambytes[names(.vmode)],FFbytes=.ffbytes[names(.vmode)])
 
In addition, also following compound data types are supported:
  • factor
  • ordered
  • custom compounds
  • Date
  • POSIXct


To create a one billion long ff vector which contains sequence of numbers 1:1e9, one can try:


f <- ff(initdata=1:1e9,vmode="integer") # try creating ff vector


However the above statement execution will probably fail due to memory constraints, because the initdata sequence 1,2,...1e9 is standard R operation on integers and it may not fit into memory at once. One has to create an empty one billion long ff vector of integers first and then populate it in a batch approach. The size of the batch can be set either with parameter BATCHSIZE (maximum number of elements to be processed at once) or with parameter BATCHBYTES (maximum number of bytes to be processed at once):


f <- ff(length=1e9,vmode="integer",caching="mmeachflush") # define an empty one billion long ff vector of integers, the „each flush“ caching secures that memory pages are flushed at each step
ffvecapply(f[i1:i2]<-i1:i2, X=f, ,BATCHSIZE = 1e5,VERBOSE=T) # the i1 and i2 are standard variables supplied by the ffvecapply function and mark the start and end of the records processed in each batch
class(f) # "ff_vector" "ff"
class(f[1:10]) # but if using [] result is standard R object (i.e. vector of integers)


To define a ff matrix(array) and populate it by columns or by rows:


m <- ff(length=3e8,vmode="quad",caching="mmeachflush",levels=c("A","B","C")) # define a long ff vector of quads (up to 4 values) and define levels
class(m) # "ff_vector" "ff"
dim(m)<-c(1e8,3) # define matrix 1e8x3
class(m) # "ff_matrix" "ff_array" ff"
class(m[1:10,]) # but if using [] result is standard R object (i.e. matrix of factors)
ffrowapply(m[i1:i2,,bydim=c(2,1)]<-c("A","B","C"), X=m, ,BATCHSIZE = 3e5,VERBOSE=T) # the i1 and i2 are standard variables supplied by the ffrowapply function and mark the first and last row of the batch, use bydim to set row-based ordering


To see the attributes of a ff class object:


physical(m) # shows the physical attributes of the R object
virtual(m) # shows the virtual attributes of the R object


To coerce the ff object to a ram object (if enough memory):


ramclass(m)
as.ram(m)


And to coerce ram object to an ff object:


as.ff(i)


To iterate through the ff object elements to perform some operation, the range index approach can be exploited, as the chunk.bit() function returns range indexes for each chunk. Either the for loop can be used or lapply functionality as well:


chunk.bit(f,BATCHBYTES = 1024^2) # range indexes by 1MB chunks


for (i in chunk.bit(f,BATCHBYTES = 10*1024^2)) f[i] <- rnorm(n=sum(as.bit(i))) # generate random numbers, by chunks


l=lapply(chunk(f,BATCHBYTES = 1024^2),function(i) max(f[i])) # get maximum for each chunk
print(crbind(l)) # bind results of all chunks by rows together


It is possible (and even wished) that multiple ff objects use same flat file.
To find out which flat file is attached by which R objects:


b1 <- as.bit(sample(c(FALSE,TRUE), 1000, TRUE))
class(b1) # bit
f1 <- as.ff(b1)
class(f1) # "ff_vector" "ff"
b2 <- as.bit(f1)
f2 <- as.ff(b2)
filename(f1)
filename(f2) # yet another object attached to the same flat file
f1[1]=T # set first object of first ff to TRUE
f2[1]=F # set first object of second ff to FALSE
f1[1]==f2[1] # since both use same file, the first object is equal (FALSE)


If there is need to physically copy whole ff object, use instead:


f2=clone(f1)
f1[1]=T
f2[1]=F
f1[1]==f2[1] # returns FALSE since they use different files


Similar to data.frame is the ffdf class, which is constructed as container of ff vectors:


m <- matrix(1:12, 3, 4, dimnames=list(c("r1","r2","r3"), c("m1","m2","m3","m4")))
v <- 1:3
ffm <- as.ff(m)
ffv <- as.ff(v)
ffd <- ffdf(ffm, v=ffv, row.names=row.names(ffm))
filename(ffm)
filename(ffv)
filename(ffd)
ffd
physical(ffd)


The storage of vectors can be controlled. Either split them into more or join them into fewer.


ffd <- ffdf(ffm, v=ffv, row.names=row.names(ffm), ff_join=list(newff=c(1,2)))
ffd <- ffdf(ffm, v=ffv, row.names=row.names(ffm), ff_split=1)






To sort the dataframe:


d2 <- ffdfsort(d) # sort directly


i <- ffdforder(d2) # sort through order index
d[i,]






To save ff objects:


ffsave(df, add=TRUE, file="d:/tmp/y", rootpath="d:/tmp")