---
title: "Multi-Objective Optimization case study"
author: "Max Conway"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Multi-Objective Optimization case study}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Multi Objective Optimization tutorial

This tutorial is an in depth example of the use of this package in the context of an evolutionary optimization approach. It also shows the use of `gene_eval` to calculate a reaction speed from gene expressions.

First, we need to load the appropriate libraries. You may also need to load your optimization library of choice.

```{r, results="hide"}
library(dplyr)
library(purrr)
library(stringr)
library(fbar)
```

This code block loads a model, then extracts the list of genes from the model.
The model takes the form of a tabular list of reactions.

```{r}
data("ecoli_core")
model <- ecoli_core %>%
  mutate(geneAssociation = geneAssociation %>%
           str_replace_all('and','&') %>%
           str_replace_all('or','|'))

genes_in_model <- model$geneAssociation %>%
  str_split('[()|& ]+') %>%
  flatten_chr() %>%
  discard(is.na) %>%
  discard(~ str_length(.x)==0)
```

The evaluation function is where the actual metabolic simulations are performed.
This has four main stages:

- The gene-reaction associations (`geneAssociation`) are evaluated in the context of which genes are present in this iteration (`genome`).
- We conduct a round of FBA, optimizing for maximum biomass.
- Having found the maximum biomass production, we fix the biomass at this value (+/-1%).
- With the biomass value fixed, we the optimize to maximize the synthetic objective.

The technique of fixing the biomass followed by maximizing the synthetic objective is important because there could still be slack in the model after the first optimization stage, and we wish to have a reliable synthetic objective estimate.

```{r}
evaluation_function <- function(genome){
  
  res <- model %>%
    mutate(activation = gene_eval(geneAssociation, names(genome), genome),
           activation = coalesce(activation, 1),
           uppbnd = uppbnd*activation,
           lowbnd = lowbnd*activation) %>%
    find_fluxes_df(do_minimization = FALSE) %>%
    mutate(lowbnd = ifelse(abbreviation=='BIOMASS_Ecoli_core_w_GAM', flux*0.99, lowbnd),
           uppbnd = ifelse(abbreviation=='BIOMASS_Ecoli_core_w_GAM', flux*1.01, uppbnd),
           obj_coef = 1*(abbreviation=='EX_ac_e')) %>%
    find_fluxes_df(do_minimization = FALSE)
  
  return(list(bm = filter(res, abbreviation=='BIOMASS_Ecoli_core_w_GAM')$flux, 
              synth = filter(res, abbreviation=='EX_ac_e')$flux))
}
```

Non-domination sorting is the first stage of the selection procedure in NSGA-II.
The code might be quite opaque, but the idea is as follows:

- We perform an inner_join in order to compare every point against every other point.
- For each point (`id.x`), we see if there exists any second point (`id.y`) that has a higher value than it in all objectives. Where such a second point exists, we term the original point 'dominated'.
- We find the set of points which have no dominating point, and term this the first non-dominated front.
- We repeat this procedure, but ignoring points in the first non-dominated front, to find the second on-dominated front, and so on.

```{r}
non_dom_sort <- function(input){
  input_long <- input %>%
    gather(property, value, -id) %>%
    mutate(front=NA)
  
  currentfront <- 1
  
  while(any(is.na(input_long$front))){
    
    input_long <- input_long %>%
      inner_join(.,., by='property') %>%
      group_by(id.x,id.y) %>%
      mutate(dominance = ifelse(all(value.x>=value.y), 
                                'xdomy', 
                                ifelse(all(value.y>=value.x), 
                                       'ydomx', 
                                       'nondom'
                                       )
                                )
      ) %>%
      group_by(id.x) %>%
      mutate(front = ifelse(all(dominance[is.na(front.y)] %in% c('xdomy', 'nondom')), 
                            pmin(currentfront, front.x, na.rm=TRUE), 
                            NA
                            )
      ) %>%
      group_by(id = id.x, property = property, front, value = value.x) %>%
      summarise()
    
    currentfront <- currentfront + 1
  }
  
  return(
    input_long %>%
      spread(property, value)
  )
  
}
```

The second part of the NSGA-II evaluation procedure is finding the crowding distance. This is used to break ties between points in the same non-dominated front.
In for each front, for each dimension, this function sorts the points into order along the dimension, and finds the normalized distance between the proceeding point and succeeding point. These values are summed up across each dimension to find the value for the point.

```{r}
crowding_distance <- function(input){
  return(
    input %>%
      gather(property, value, -id, -front) %>%
      group_by(front, property) %>%
      arrange(value) %>%
      mutate(crowding = (lead(value)-lag(value))/(max(value)-min(value)),
             crowding = ifelse(is.na(crowding),Inf, crowding)) %>%
      group_by(id) %>%
      mutate(crowding = sum(crowding)) %>%
      spread(property, value)
  )
}
```


This is the genetic loop of the algorithm. It is explained by code comments, but follows a normal pattern of evaluating, sorting, selecting from and mutating the population.


```{r, eval=FALSE}
start_genome <- set_names(rep_along(genes_in_model, TRUE), genes_in_model)
pop <- list(start_genome)

popsize = 50

for(i in 1:50){
  results <- map_df(pop, evaluation_function) %>% # Evaluate all the genomes
    mutate(bm=signif(bm), synth=signif(synth)) %>% # Round results
    unique() %>% # Throw away duplicates
    mutate(id = 1:n()) %>% # label the results
    sample_frac() %>% # Shuffle
    non_dom_sort() %>% # Find the non-dominated fronts
    crowding_distance %>% # Find the crowding distances
    arrange(front, desc(crowding)) # Sort by front, breaking ties by crowding distance
  
  selected <- results %>%
    filter(row_number() <= popsize/2) %>% # Keep the best half of the population
    getElement('id')
  
  kept_pop <- pop[selected]
  altered_pop <- kept_pop %>%
    sample(popsize-length(selected), TRUE) %>% # Select a random portion of the population as parents
    map(function(genome){xor(genome, runif(length(genome))>0.98)}) # Mutate a small number of genes from the parent population.
  
  pop <- unique(c(kept_pop, altered_pop)) # Combine the ofspring and parent populations
}
```


Now that we have results, the set of all non-dominated points, known as the Pareto Front. This describes the tradeoff between biomass production and our synthetic objective.

```{r, eval=FALSE}
library(ggplot2)

results %>%
  filter(front==1) %>%
  ggplot(aes(x=bm, y=synth, colour=front)) + 
  geom_point() + 
  geom_step(direction='vh') +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw()
```