quick start

If you are impatient: you can convert a probe-Ensembl map like the following [1]

probe2ensembl <- tibble::tribble(
  ~ID,           ~Ensembl,
  "200064_at",   "ENSG00000096384",
  "200066_at",   "ENSG00000113141",
  "200068_s_at", "ENSG00000127022",
  "200069_at",   "ENSG00000075856",
  "200071_at",   "ENSG00000119953",
  "200076_s_at", "ENSG00000105700",
  "200077_s_at", "ENSG00000104904",
  "200078_s_at", "ENSG00000117410",
  "200082_s_at", "ENSG00000171863 /// ENSG00000183405 /// ENSG00000213326",
  "200084_at",   "ENSG00000110696"
)

# A tibble: 10 x 2
   ID          Ensembl                                                
   <chr>       <chr>                                                  
 1 200064_at   ENSG00000096384                                        
 2 200066_at   ENSG00000113141                                        
 3 200068_s_at ENSG00000127022                                        
 4 200069_at   ENSG00000075856                                        
 5 200071_at   ENSG00000119953                                        
 6 200076_s_at ENSG00000105700                                        
 7 200077_s_at ENSG00000104904                                        
 8 200078_s_at ENSG00000117410                                        
 9 200082_s_at ENSG00000171863 /// ENSG00000183405 /// ENSG00000213326
10 200084_at   ENSG00000110696                                        

to a probe-symbol map [2]

probe2_symbol <- probe2ensembl %>% melt_map("ID", "Ensembl", " /// ") %>%
    dplyr::mutate("symbol" = as_symbol_from_ensembl(Ensembl)) %>%
    cast_map("ID", "symbol", " /// ")

# A tibble: 10 x 2
   ID          symbol          
   <chr>       <chr>           
 1 200064_at   HSP90AB1        
 2 200066_at   IK              
 3 200068_s_at CANX            
 4 200069_at   SART3           
 5 200071_at   SMNDC1          
 6 200076_s_at KXD1            
 7 200077_s_at OAZ1            
 8 200078_s_at ATP6V0B         
 9 200082_s_at RPS7 /// RPS7P11
10 200084_at   C11orf58        

Other common IDs, such as Entrez Gene ID, UniGene ID, and RefSeq accession, are also supported.


the problem

probe2id <- tibble::tribble(
    ~probe, ~id,
    "probe1", "id1 | id2",
    "probe2", "id3"
)

probe2symbol <- tibble::tribble(
    ~probe, ~symbol,
    "probe1", "symbol1 | symbol2",
    "probe2", "symbol3"
)

To master this package, you need to understand the problem it aims to solve. That is, to turn something like

# A tibble: 2 x 2
  probe  id       
  <chr>  <chr>    
1 probe1 id1 | id2
2 probe2 id3

into

# A tibble: 2 x 2
  probe  symbol           
  <chr>  <chr>            
1 probe1 symbol1 | symbol2
2 probe2 symbol3          

solve a simpler one

probe2id_easy <- tibble::tribble(
    ~probe, ~id,
    "probe4", "id4",
    "probe5", "id5",
    "probe6", "id6"
)

probe2symbol_easy <- tibble::tribble(
    ~probe, ~symbol,
    "probe4", "symbol4",
    "probe5", "symbol5",
    "probe6", "symbol6"
)

To pinpoint the key difficulty, let’s contrast it with a simpler problem: turning something like

# A tibble: 3 x 2
  probe  id   
  <chr>  <chr>
1 probe4 id4  
2 probe5 id5  
3 probe6 id6

into

# A tibble: 3 x 2
  probe  symbol 
  <chr>  <chr>  
1 probe4 symbol4
2 probe5 symbol5
3 probe6 symbol6

That’s quite easy: you just need an id-symbol map,

id2symbol <- tibble::tibble(  
    id = paste0("id", 1:6),
    symbol = paste0("symbol", 1:6)
) %>% dplyr::sample_frac()

# A tibble: 6 x 2
  id    symbol 
  <chr> <chr>  
1 id3   symbol3
2 id5   symbol5
3 id1   symbol1
4 id4   symbol4
5 id2   symbol2
6 id6   symbol6

and use the following code [3]

dplyr::transmute(probe2id_easy, probe, symbol = id2symbol$symbol[match(id, id2symbol$id)])
# A tibble: 3 x 2
  probe  symbol 
  <chr>  <chr>  
1 probe4 symbol4
2 probe5 symbol5
3 probe6 symbol6

In the above code, we map probe to symbol in three steps:

  1. dplyr::transmute preserves the probe2id_easy$probe - probe2id_easy$id relationship by position
  2. match() finds the probe2id_easy$id - id2symbol$id relationship by value
  3. [] finds the id2symbol$id - id2symbol$symbol relationship by position

Let’s walk through the concrete example of "probe4" to "symbol4":

  1. 1st element of probe2id_easy$probe -> 1st element of probe2id_easy$id:
    "probe4" is the 1st element of probe2id_easy$probe, so we look at the 1st element of probe2id_easy$id, which is "id4".

  2. "id4" in probe2id_easy$id -> "id4" in id2symbol$id:
    "id4" is the 1st element of probe2id_easy$id, so we look at the 1st element of the match() result, which gives the position of each probe2id_easy$id in id2symbol$id (c(4, 2, 6) with the row order printed above). We get 4, so we look at the 4th element of id2symbol$id, which is exactly "id4".

  3. 4th element of id2symbol$id -> 4th element of id2symbol$symbol:
    finally, "id4" is the 4th element of id2symbol$id, so we look at the 4th element of id2symbol$symbol, which is "symbol4".
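
The same three steps can be traced with base R alone. This is only a minimal sketch: the data frames below hard-code id2symbol in the row order printed above (the real order is random because of the shuffle), and nothing here relies on dplyr or on this package.

```r
# id2symbol, hard-coded in the row order printed above
id2symbol <- data.frame(
    id     = c("id3", "id5", "id1", "id4", "id2", "id6"),
    symbol = c("symbol3", "symbol5", "symbol1", "symbol4", "symbol2", "symbol6"),
    stringsAsFactors = FALSE
)
probe2id_easy <- data.frame(
    probe = c("probe4", "probe5", "probe6"),
    id    = c("id4", "id5", "id6"),
    stringsAsFactors = FALSE
)

# step 2: position of each probe2id_easy$id inside id2symbol$id (by value)
pos <- match(probe2id_easy$id, id2symbol$id)
pos
#> [1] 4 2 6

# step 3: use those positions to pick the symbols (by position)
symbol <- id2symbol$symbol[pos]

# step 1: probe and symbol stay aligned because the row order never changed
data.frame(probe = probe2id_easy$probe, symbol = symbol)
```

Shuffling id2symbol differently changes pos but not the final symbol column, which is the point of looking IDs up by value.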

key difficulty

Back to the original problem: it’s fairly easy to “replace” "id4" with "symbol4", "id5" with "symbol5", etc. (thanks to match()). But how can you “replace” the "id1" and "id2" inside "id1 | id2"?

That is exactly what we face in the 9th row of probe2ensembl.

probe2ensembl %>% dplyr::slice(9)
# A tibble: 1 x 2
  ID          Ensembl                                                
  <chr>       <chr>                                                  
1 200082_s_at ENSG00000171863 /// ENSG00000183405 /// ENSG00000213326

If you think it’s a piece of cake, consider two points:

  • A computer cannot convert "id1 | id2" to "symbol1 | symbol2" the way you can without even thinking. Done naively, the program has to search for "id1" inside "id1 | id2" and replace it with "symbol1" if found, then do the same for "id2", "id3", and so on. Trying every ID against every row like this performs very poorly.

  • You cannot simply replace every "id" with "symbol" either. I use id2symbol only to simplify the problem; a real-world id-symbol map usually looks like this:

    hgnc::ensembl2symbol %>% dplyr::sample_n(3)
    # A tibble: 3 x 2
      ensembl         symbol 
      <chr>           <chr>  
    1 ENSG00000182612 TSPAN10
    2 ENSG00000184258 CDR1   
    3 ENSG00000207577 MIR587 
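
To make the first point concrete, here is a rough base R sketch of the naive search-and-replace route. It only illustrates why that route is unattractive and is not something this package does: every ID in the map is tried against every row, so the work grows with (number of IDs) x (number of rows). The helper name naive_replace is hypothetical, and the toy id2symbol is the one from the simpler problem.

```r
id2symbol <- data.frame(
    id     = paste0("id", 1:6),
    symbol = paste0("symbol", 1:6),
    stringsAsFactors = FALSE
)
wide <- c("id1 | id2", "id3")

# hypothetical helper: try every id in the map against every string in x
naive_replace <- function(x, map) {
    for (i in seq_len(nrow(map))) {
        # \\b keeps "id1" from matching inside a longer id such as "id10"
        pattern <- paste0("\\b", map$id[i], "\\b")
        x <- gsub(pattern, map$symbol[i], x, perl = TRUE)
    }
    x
}

naive_replace(wide, id2symbol)
```

With six toy IDs this is instant, but a real map has tens of thousands of rows, and real IDs would also need regex escaping, which is why a lookup by value with match() is preferable.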


Inspired by reshape2, I choose to melt the wide map

probe2id_wide <- probe2id

# A tibble: 2 x 2
  probe  id       
  <chr>  <chr>    
1 probe1 id1 | id2
2 probe2 id3      

to a long map.

probe2id_long <- probe2id_wide %>% melt_map("probe", "id", " \\| ")

# A tibble: 3 x 2
  probe  id   
  <chr>  <chr>
1 probe1 id1  
2 probe1 id2  
3 probe2 id3  

Then map id to symbol to get a new long map, the same way we solved the simpler problem above.

probe2symbol_long <- probe2id_long %>% 
    dplyr::transmute(probe, symbol = id2symbol$symbol[match(id, id2symbol$id)])

# A tibble: 3 x 2
  probe  symbol 
  <chr>  <chr>  
1 probe1 symbol1
2 probe1 symbol2
3 probe2 symbol3

Finally cast to a new wide map

probe2symbol_wide <- probe2symbol_long %>% cast_map("probe", "symbol", " /// ")

probe2symbol_wide   # same content as probe2symbol, but joined with " /// " instead of " | "
# A tibble: 2 x 2
  probe  symbol             
  <chr>  <chr>              
1 probe1 symbol1 /// symbol2
2 probe2 symbol3            

In short, A-B wide map -> A-B long map -> A-C long map -> A-C wide map.
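
For readers who want to see the whole pipeline without this package, the four stages can be approximated in base R. This is a sketch under assumptions, not the implementation of melt_map() or cast_map(): melt_sketch and cast_sketch are hypothetical names, the separator is treated as a fixed string rather than a regular expression, and IDs without a symbol are simply dropped.

```r
probe2id_wide <- data.frame(
    probe = c("probe1", "probe2"),
    id    = c("id1 | id2", "id3"),
    stringsAsFactors = FALSE
)
id2symbol <- data.frame(
    id     = paste0("id", 1:6),
    symbol = paste0("symbol", 1:6),
    stringsAsFactors = FALSE
)

# hypothetical stand-in for melt_map(): one row per (key, single value)
melt_sketch <- function(df, key, value, sep) {
    pieces <- strsplit(df[[value]], sep, fixed = TRUE)
    out <- data.frame(
        rep(df[[key]], lengths(pieces)),
        unlist(pieces),
        stringsAsFactors = FALSE
    )
    names(out) <- c(key, value)
    out
}

# hypothetical stand-in for cast_map(): one row per key, values pasted together
cast_sketch <- function(df, key, value, sep) {
    df <- df[!is.na(df[[value]]), ]   # drop values that could not be mapped
    keys <- unique(df[[key]])
    collapsed <- vapply(
        keys,
        function(k) paste(df[[value]][df[[key]] == k], collapse = sep),
        character(1)
    )
    out <- data.frame(keys, unname(collapsed), stringsAsFactors = FALSE)
    names(out) <- c(key, value)
    out
}

# A-B wide map -> A-B long map
long <- melt_sketch(probe2id_wide, "probe", "id", " | ")
# A-B long map -> A-C long map
long$symbol <- id2symbol$symbol[match(long$id, id2symbol$id)]
# A-C long map -> A-C wide map
probe2symbol_wide <- cast_sketch(long, "probe", "symbol", " | ")
probe2symbol_wide
```

The only step that touches the ID-to-symbol relationship is the middle one, and it is the same match() idiom as in the simpler problem; melting and casting just reshape the table around it.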

real world

In the above discussion, I abstracted away many details to focus on the core idea. Things get more complicated in the real world:

  • The separator is not fixed (the maps above already use both " /// " and " | "), so I use reverse match.
  • I omitted where id2symbol comes from (it does not fall from the sky). In practice, I melt hgnc_complete_set.txt.gz to create entrez2symbol, ensembl2symbol, etc.
  • I use symbol = id2symbol$symbol[match(id, id2symbol$id)] in the above code for generality, but it looks quite obscure even though I have explained it in great detail. Thus I added some syntactic sugar (see ?as_symbol), so that you can write the simple and beautiful code shown at the beginning.
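
The idea behind the syntactic sugar can be illustrated with a small wrapper. The function below is a hypothetical sketch, not the actual as_symbol() from this package (whose signature is not shown in this document); it only hides the match() idiom behind a name, using a toy map in place of the HGNC-derived ones.

```r
# toy map; the package's real maps are derived from the HGNC data
id2symbol <- data.frame(
    id     = paste0("id", 1:6),
    symbol = paste0("symbol", 1:6),
    stringsAsFactors = FALSE
)

# hypothetical wrapper around the match() idiom
as_symbol_sketch <- function(id, map = id2symbol) {
    map$symbol[match(id, map$id)]
}

as_symbol_sketch(c("id2", "id5", "not_an_id"))
#> [1] "symbol2" "symbol5" NA
```

An ID that is absent from the map comes back as NA, so the caller can decide whether to drop or keep it.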

Armed with the tools above, the package can serve as the workhorse of rGEO, transforming any user-supplied GPL file into a standard chip file ready for GSEA.

  1. How I created probe2ensembl:

    # read probe annotation from a GSE SOFT file
    soft_table <- system.file("extdata/GSE19161_family.soft.gz", package = "rGEO") %>% 
        rGEO::parse_gse_soft(verbose = FALSE) %>% {.$table}
    # subset part of the table for the purpose of demonstration
    probe2ensembl <- soft_table %>% dplyr::select(1, 9) %>% dplyr::slice(41:50)

  2. Ensembl IDs that don’t have a corresponding HUGO symbol are discarded.

  3. I deliberately shuffled the rows of id2symbol to show that id2symbol only needs to provide the correct relationship between id and symbol, i.e., it does not need to be in the same order as probe2id.