Inspect & map identifiers

To make data queryable by an entity identifier, one needs to ensure that identifiers comply with a chosen standard.

Bionty enables this by mapping metadata against versioned ontologies using inspect().

For terms that are not directly mappable, Bionty offers map_synonyms(), lookup(), and fuzzy_match() (also see /lookup).

import bionty as bt
import pandas as pd

Inspect and map synonyms of gene identifiers

To illustrate, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.

data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "hgnc id": ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "ENSGcorrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")

df_orig

	gene symbol	hgnc id
ensembl_gene_id
ENSG00000148584	A1CF	HGNC:24086
ENSG00000121410	A1BG	HGNC:5
ENSG00000188389	FANCD1	HGNC:1101
ENSGcorrupted	corrupted	corrupted

First we can check whether any of our values are mappable against the ontology reference.

Tip: available fields are accessible via gene_bionty.fields

gene_bionty = bt.Gene()

gene_bionty

Gene
Species: human
Source: ensembl, release-108

📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of ontology terms
🎯 Gene.fuzzy_match(): fuzzy match against ontology terms
🧐 Gene.inspect(): check if identifiers are mappable
👽 Gene.map_synonyms(): map synonyms to standardized names
🔗 Gene.ontology: Pronto.Ontology object

gene_bionty.inspect(df_orig.index, gene_bionty.ensembl_gene_id)

✅ 3 terms (75.0%) are mapped.

🔶 1 terms (25.0%) are not mapped.

{'mapped': ['ENSG00000148584', 'ENSG00000121410', 'ENSG00000188389'],
 'not_mapped': ['ENSGcorrupted']}
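Conceptually, this kind of inspection is a membership check of each identifier against a reference column. The following is a minimal sketch of that idea in plain Python, not Bionty's implementation; the function name `inspect_identifiers` and the three-entry reference list are assumptions made for illustration.

```python
def inspect_identifiers(identifiers, reference):
    """Split identifiers into those found in the reference and those not found."""
    reference_set = set(reference)  # a set gives constant-time membership checks
    mapped = [i for i in identifiers if i in reference_set]
    not_mapped = [i for i in identifiers if i not in reference_set]
    n_total = len(identifiers)
    pct_mapped = 100 * len(mapped) / n_total if n_total else 0.0
    return {"mapped": mapped, "not_mapped": not_mapped}, pct_mapped


# Hypothetical reference excerpt; a real reference table holds every gene in the release.
reference_ids = ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389"]
query_ids = [
    "ENSG00000148584",
    "ENSG00000121410",
    "ENSG00000188389",
    "ENSGcorrupted",
]

result, pct_mapped = inspect_identifiers(query_ids, reference_ids)
print(result)
print(f"{pct_mapped:.1f}% mapped")
```

The sketch preserves the input order in both lists, which makes it easy to line the result up with the original index.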

The same procedure is available for gene symbols. First, we inspect which symbols are mappable against the ontology.

gene_bionty.inspect(df_orig["gene symbol"], gene_bionty.symbol)

🔶 The identifiers contain synonyms!

💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'

✅ 2 terms (50.0%) are mapped.

🔶 2 terms (50.0%) are not mapped.

{'mapped': ['A1CF', 'A1BG'], 'not_mapped': ['FANCD1', 'corrupted']}

Only 2 of the 4 gene symbols map directly. Bionty also warns us that some of our symbols are synonyms that can be mapped to standardized symbols.

Mapping synonyms returns a list of standardized terms:

mapped_symbol_synonyms = gene_bionty.map_synonyms(
    df_orig["gene symbol"], gene_bionty.symbol
)

mapped_symbol_synonyms

['A1CF', 'A1BG', 'BRCA2', 'corrupted']

Optionally, return only a mapper of {synonym: standardized name}:

gene_bionty.map_synonyms(df_orig["gene symbol"], gene_bionty.symbol, return_mapper=True)

{'FANCD1': 'BRCA2'}
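To make the two return shapes concrete, here is a minimal sketch of how a synonym mapper could be built and applied in plain Python. It is not Bionty's implementation: the function `build_synonym_mapper`, the pipe-separated synonym layout, and the small reference excerpt are all assumptions for illustration.

```python
def build_synonym_mapper(identifiers, reference):
    """Return {synonym: standardized name} for identifiers that are known synonyms.

    `reference` maps each standardized name to a pipe-separated synonym string
    (an assumed layout for this sketch).
    """
    synonym_to_name = {}
    for name, synonyms in reference.items():
        for synonym in synonyms.split("|"):
            if synonym:  # skip empty entries for names without synonyms
                synonym_to_name[synonym] = name
    # Identifiers that are already standardized names need no mapping.
    return {
        i: synonym_to_name[i]
        for i in identifiers
        if i not in reference and i in synonym_to_name
    }


# Hypothetical reference excerpt; the synonym strings are illustrative only.
reference = {"A1CF": "", "A1BG": "", "BRCA2": "FANCD1"}
symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]

mapper = build_synonym_mapper(symbols, reference)
# Applying the mapper: values without an entry pass through unchanged.
standardized = [mapper.get(s, s) for s in symbols]
print(mapper)
print(standardized)
```

Values that are neither a standardized name nor a known synonym, such as `corrupted`, are left untouched, so they still surface as unmapped in a later inspection.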

We can use the standardized symbols as the new index:

df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms

df_curated

	ensembl_gene_id	gene symbol	hgnc id
A1CF	ENSG00000148584	A1CF	HGNC:24086
A1BG	ENSG00000121410	A1BG	HGNC:5
BRCA2	ENSG00000188389	FANCD1	HGNC:1101
corrupted	ENSGcorrupted	corrupted	corrupted

inspect() can also return a DataFrame with a boolean column indicating whether each identifier is mappable:

gene_bionty.inspect(df_curated.index, gene_bionty.symbol, return_df=True)

✅ 3 terms (75.0%) are mapped.

🔶 1 terms (25.0%) are not mapped.

	__mapped__
A1CF	True
A1BG	True
BRCA2	True
corrupted	False

Standardize and look up unmapped CellMarker identifiers

Depending on how the data was collected and which terminology was used, it is not always possible to curate values. Some values might have used a different standard or be corrupted.

This section will demonstrate how to look up unmatched terms and curate them using CellMarker.

First, we take an example DataFrame whose index contains valid and invalid cell markers (antibody targets) and an additional feature (Time) from a flow cytometry dataset.

markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127a",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)

Let’s instantiate the CellMarker ontology with the default database and version.

cellmarker_bionty = bt.CellMarker()

cellmarker_bionty

CellMarker
Species: human
Source: cellmarker, 2.0

📖 CellMarker.df(): ontology reference table
🔎 CellMarker.lookup(): autocompletion of ontology terms
🎯 CellMarker.fuzzy_match(): fuzzy match against ontology terms
🧐 CellMarker.inspect(): check if identifiers are mappable
👽 CellMarker.map_synonyms(): map synonyms to standardized names
🔗 CellMarker.ontology: Pronto.Ontology object

Now let’s check which cell markers from the DataFrame index can be found in the reference:

cellmarker_bionty.inspect(markers.index, cellmarker_bionty.name)

🔶 The identifiers contain synonyms!

💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'

✅ 7 terms (50.0%) are mapped.

🔶 7 terms (50.0%) are not mapped.

{'mapped': ['CCR7', 'CD14', 'CD8', 'CD45RA', 'CD4', 'CD3', 'CD66b'],
 'not_mapped': ['KI67',
  'CD127a',
  'PD1',
  'Invalid-1',
  'Invalid-2',
  'Siglec8',
  'Time']}

Logging suggests we map synonyms:

synonyms_mapper = cellmarker_bionty.map_synonyms(
    markers.index, cellmarker_bionty.name, return_mapper=True
)

This maps 3 additional terms:

synonyms_mapper

{'KI67': 'Ki67', 'PD1': 'PD-1', 'Siglec8': 'SIGLEC8'}

Let’s replace the synonyms with standardized names in the markers DataFrame:

markers.rename(index=synonyms_mapper, inplace=True)

Inspecting again, the logging shows that 4 terms are still not found in the reference.

Among them, Time, Invalid-1 and Invalid-2 are non-marker channels, which CellMarker will not curate.

cellmarker_bionty.inspect(markers.index, cellmarker_bionty.name)

✅ 10 terms (71.4%) are mapped.

🔶 4 terms (28.6%) are not mapped.

{'mapped': ['Ki67',
  'CCR7',
  'CD14',
  'CD8',
  'CD45RA',
  'CD4',
  'CD3',
  'PD-1',
  'CD66b',
  'SIGLEC8'],
 'not_mapped': ['CD127a', 'Invalid-1', 'Invalid-2', 'Time']}

CD127a is not found in the reference, so let’s check the lookup with auto-completion:

lookup = cellmarker_bionty.lookup()

lookup.cd127

CellMarker(name='CD127', ncbi_gene_id='3575', gene_symbol='IL7R', gene_name='interleukin 7 receptor', uniprotkb_id='P16871', synonyms=None)

Indeed, the correct name is CD127; CD127a was a typo.

Now let’s fix the markers so all of them can be linked:

Tip

Using the lookup object instead of typing a string helps eliminate possible typos!

curated_df = markers.rename(index={"CD127a": lookup.cd127.name})
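The idea behind an auto-completing lookup can be sketched in a few lines: each reference name is exposed as an attribute under a sanitized, lowercase key, so a REPL or editor can complete it. This is only an illustration under assumed conventions, not how Bionty builds its lookup; `build_lookup`, `name_lookup`, and the key sanitization rules are hypothetical, and the sketch returns plain name strings rather than full records.

```python
import re
from types import SimpleNamespace


def build_lookup(names):
    """Expose each name as an attribute under a lowercase, identifier-safe key."""
    keys = {}
    for name in names:
        # Replace every character that is not valid in an attribute name.
        key = re.sub(r"\W", "_", name.lower())
        if not key:
            continue  # nothing to expose for an empty name
        if key[0].isdigit():
            key = "_" + key  # attribute names cannot start with a digit
        keys[key] = name
    return SimpleNamespace(**keys)


# Hypothetical excerpt of standardized marker names.
marker_names = ["CD127", "PD-1", "Ki67", "SIGLEC8"]
name_lookup = build_lookup(marker_names)

print(name_lookup.cd127)
print(name_lookup.pd_1)
```

Because the value is read from the reference rather than typed by hand, a misspelled attribute fails loudly with an `AttributeError` instead of silently introducing a wrong name.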

Optionally, run a fuzzy match:

cellmarker_bionty.fuzzy_match("CD127a", return_ranked_results=True).head(5)

	ncbi_gene_id	gene_symbol	gene_name	uniprotkb_id	synonyms	__ratio__
name
CD127	3575	IL7R	interleukin 7 receptor	P16871	None	90.909091
CD167a	None	None	None	None	None	83.333333
CD107a	3916	LAMP1	lysosomal associated membrane protein 1	A0A024RDY3	None	83.333333
CD172a	None	None	None	None	None	83.333333
CD120a	7132	TNFRSF1A	TNF receptor superfamily member 1A	P19438	None	83.333333
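A ranked fuzzy match of this kind can be approximated with the standard library. The sketch below scores each candidate with `difflib.SequenceMatcher` and sorts by similarity; it is a stand-in for illustration, since the scorer Bionty uses internally is not stated here, and `rank_candidates` and the candidate list are assumptions.

```python
from difflib import SequenceMatcher


def rank_candidates(query, candidates, limit=5):
    """Score each candidate against the query (0 to 100) and return the best matches."""
    scored = [
        (name, 100 * SequenceMatcher(None, query, name).ratio())
        for name in candidates
    ]
    # Highest similarity first; the sort is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Hypothetical excerpt of reference names.
candidates = ["CD14", "CD127", "CD107a", "CD120a", "CD3"]
ranked = rank_candidates("CD127a", candidates)

for name, score in ranked:
    print(f"{name}\t{score:.2f}")
```

A similarity ranking only suggests candidates: several names can score close to each other, so the top hit should be confirmed against the reference record before renaming.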

Now we can run inspect() again: all cell markers are mapped, and only the three non-marker channels remain unmapped.

cellmarker_bionty.inspect(curated_df.index, cellmarker_bionty.name)

✅ 11 terms (78.6%) are mapped.

🔶 3 terms (21.4%) are not mapped.

{'mapped': ['Ki67',
  'CCR7',
  'CD14',
  'CD8',
  'CD45RA',
  'CD4',
  'CD3',
  'CD127',
  'PD-1',
  'CD66b',
  'SIGLEC8'],
 'not_mapped': ['Invalid-1', 'Invalid-2', 'Time']}