Interact with Inception and the Inception API

Inception is an annotation platform. blackbar uses it to generate the training data needed to build named entity recognition models for the anonymization of texts.

Inception provides an API which allows you to interact with the environment (see the Inception docs).

  • Connectivity to the API is handled by the Python package pycaprio and by setting the environment variables INCEPTION_HOST, INCEPTION_USERNAME and INCEPTION_PASSWORD
  • Once these are set, you can perform the operations documented on this page
from rlike import *
environ = {
    "INCEPTION_HOST": "https://inception.datatailor.be/",
    "INCEPTION_USERNAME": "XXXXXXXXXX", 
    "INCEPTION_PASSWORD": "XXXXXXXXXX"}
Sys_setenv(environ)
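
If rlike is not at hand, the same three variables can be set with the standard library. This is a minimal sketch, not part of blackbar: the `settings` and `missing` names are illustrative, and the credential values are the same placeholders as above.

```python
import os

# Placeholder values: replace them with your own host and credentials.
settings = {
    "INCEPTION_HOST": "https://inception.datatailor.be/",
    "INCEPTION_USERNAME": "XXXXXXXXXX",
    "INCEPTION_PASSWORD": "XXXXXXXXXX",
}
os.environ.update(settings)

# Sanity check: list any variable that is still empty or unset.
missing = [name for name in settings if not os.environ.get(name)]
print("missing variables:", missing)
```

An empty `missing` list means all three variables are visible to the current Python process.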

Connect with the API / list the projects / export

blackbar.inception.inception_client(host=None, user=None, password=None)

blackbar.inception.inception_list_projects(client=None)

Get a list of projects in Inception using the Inception API

Parameters:

Name Type Description Default
client Pycaprio

A Pycaprio client as defined in pycaprio

None

Returns:

Type Description
DataFrame

a pandas dataframe with columns project_id, project_name

Examples:

>>> from blackbar import inception_list_projects
>>> x = inception_list_projects()

blackbar.inception.inception_export(client=None, id=None, folder=os.getcwd(), format=None)

Export a project from Inception in XMI format as a zip file

Parameters:

Name Type Description Default
client Pycaprio

A Pycaprio client as defined in pycaprio

None
id int

The project_id of the Inception project as indicated by inception_list_projects

None
folder str

The folder where the zip file will be extracted to. Defaults to the current working directory

getcwd()
format str

The output format. Defaults to XMI format

None

Returns:

Type Description
str

A str with the path to the exported zip file, which is named after the project

Examples:

>>> from blackbar import inception_list_projects, inception_export
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> path = inception_export(id = project_id)
>>> path = inception_export(id = x['project_id'][0])
>>> path = inception_export(id = x['project_id'][1])

blackbar.inception.inception_import_project(client=None, path=None)

Import an Inception project which was exported as a zip file

Parameters:

Name Type Description Default
client Pycaprio

A Pycaprio client as defined in pycaprio

None
path str

string with the path to a zip file

None

Returns:

Type Description

a dict with elements: project_id and project_name

Examples:

>>> from blackbar import inception_list_projects, inception_export, inception_import_project
>>> x = inception_list_projects()
>>> project_id = x['project_id'][x['project_name'] == 'blackbar-example']
>>> project_id = int(project_id.iloc[0])
>>> path = inception_export(id = project_id)
>>> proj = inception_import_project(path = path)

blackbar.inception.inception_read_eventlog(path)

Read the event log from an Inception export

Parameters:

Name Type Description Default
path str

Path to a .zip file exported with inception_export

required

Returns:

Type Description
DataFrame

A pandas dataframe with columns 'created', 'details', 'event', 'id', 'user', 'document_name', 'annotator'

Examples:

>>> from blackbar import inception_list_projects, inception_export, inception_read_eventlog
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> path = inception_export(id = project_id)
>>> events = inception_read_eventlog(path)
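
To make the idea of reading an event log from an export archive concrete, here is a minimal standard-library sketch. It rests on assumptions that are not confirmed by this page: that the archive holds a member called `event.log` with one JSON object per line, and the event names in the stand-in data are purely illustrative. The helper `read_event_lines` is hypothetical and is not the blackbar implementation.

```python
import io
import json
import zipfile

def read_event_lines(zip_bytes, member="event.log"):
    """Parse one JSON object per non-empty line of `member` inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        raw = archive.read(member).decode("utf-8")
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

# Build a tiny in-memory archive as a stand-in for a real export (assumed layout).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    lines = [
        json.dumps({"id": 1, "event": "DocumentOpenedEvent", "user": "admin"}),
        json.dumps({"id": 2, "event": "SpanCreatedEvent", "user": "admin"}),
    ]
    archive.writestr("event.log", "\n".join(lines) + "\n")

events = read_event_lines(buffer.getvalue())
print(len(events), "events read")
```

Each parsed record is a plain dict, so a list of them can be passed to a pandas DataFrame constructor if a tabular view is needed.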

List the annotations and read in the data

blackbar.inception.inception_list_documents(client=None, id=None, trace=0, text=True, n_max=0, encoding=None, encoding_errors='replace')

Get a list of documents for a project in Inception using the Inception API

Parameters:

Name Type Description Default
client Pycaprio

A Pycaprio client as defined in pycaprio

None
id int

The project_id of the Inception project as indicated by inception_list_projects

None
trace int

Integer. If greater than zero, a progress message is shown each time trace documents have been extracted

0
text bool

Boolean indicating whether to also fetch the text of the document. Defaults to True.

True
n_max int

Integer. If greater than zero, only the first n_max documents are fetched. Defaults to all documents.

0
encoding str

String indicating how to decode the text. Possible values are None (return bytes), 'utf-8', 'latin-1'

None
encoding_errors str

How to handle encoding errors: one of 'strict', 'replace' or 'ignore'. Defaults to 'replace'

'replace'

Returns:

Type Description
DataFrame

A pandas dataframe with columns project_id, doc_id, text, label and state. If text is set to False, the text column is not included.

Examples:

>>> from blackbar import inception_list_projects, inception_list_documents
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> docs = inception_list_documents(id = project_id)
>>> #docs = inception_list_documents(id = x['project_id'][0], trace = 10)
>>> #docs = inception_list_documents(id = x['project_id'][x['project_name'] == 'penningkohieren-aalst'], text = False)
>>> #docs = inception_list_documents(id = x['project_id'][x['project_name'] == 'penningkohieren-aalst'], text = True, trace = 10)
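
The encoding and encoding_errors values listed above match the arguments of Python's own `bytes.decode`. The sketch below shows how the three error modes behave on bytes that are not valid UTF-8. It uses only the standard library and assumes, without confirmation from this page, that blackbar passes these values on to a decode call of this kind.

```python
# 'café' stored as latin-1 is not valid UTF-8 (the last byte is 0xE9).
raw = "café".encode("latin-1")

replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # bad byte is dropped
latin = raw.decode("latin-1")                     # correct codec, no error

try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True                          # 'strict' raises instead

print(repr(replaced), repr(ignored), repr(latin), strict_failed)
```

With 'replace' or 'ignore' the extraction never fails, but characters can be lost silently, so choosing the right encoding up front is preferable when it is known.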

blackbar.inception.inception_list_annotations(client=None, id=None, document=None, trace=0)

Get the annotations performed on documents for a project in Inception using the Inception API

Parameters:

Name Type Description Default
client Pycaprio

A Pycaprio client as defined in pycaprio

None
id int

The project_id of the Inception project as indicated by inception_list_projects

None
document int

The document ids (a single id or a list of ids) for which the annotations need to be extracted. If not provided, the annotations of all documents are fetched.

None
trace int

Integer. If greater than zero, a progress message is shown each time trace documents have been extracted

0

Returns:

Type Description
DataFrame

A pandas dataframe with columns project_id, document_id, user_name, annotation_state and timestamp containing all annotations which are not in the 'NEW' state

Examples:

>>> from blackbar import inception_list_projects, inception_list_documents, inception_list_annotations
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'penningkohieren-aalst']['project_id'].iloc[0])
>>> docs = inception_list_documents(id = project_id, text = False)
>>> docs = list(docs["doc_id"])
>>> progress = inception_list_annotations(id = project_id, document = [88, 89])
>>> progress = inception_list_annotations(id = project_id)
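
The trace argument can be read as "report progress every trace documents". The following sketch illustrates that reading with a hypothetical helper, `process_with_trace`. The exact message format and timing used by blackbar are not documented here, so both are assumptions.

```python
def process_with_trace(items, trace=0):
    """Loop over items and emit a progress message every `trace` items (if trace > 0)."""
    messages = []
    for i, _item in enumerate(items, start=1):
        # ... the real per-document work would happen here ...
        if trace > 0 and i % trace == 0:
            message = f"processed {i}/{len(items)} documents"
            print(message)
            messages.append(message)
    return messages

messages = process_with_trace(list(range(25)), trace=10)
silent = process_with_trace(list(range(25)), trace=0)
```

With 25 documents and trace = 10, two messages are emitted (after documents 10 and 20); with trace = 0 nothing is reported.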

blackbar.inception.inception_types(x)

Get the types available in an Inception CAS or TypeSystem object

Parameters:

Name Type Description Default
x Cas or TypeSystem

A CAS object or a TypeSystem object

required

Returns:

Type Description
DataFrame

A pandas dataframe with columns name, description, features

Examples:

>>> from blackbar import inception_list_projects, inception_list_documents, inception_list_annotations, inception_cas, inception_types
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["annotation"] = inception_cas(x = progress, as_xmi = True)
>>> types = inception_types(progress["annotation"][0])

blackbar.inception.inception_cas(client=None, x=None, as_xmi=False, trace=0)

Get the detailed CAS (Common Analysis System) annotations for a set of annotations

Parameters:

Name Type Description Default
client Pycaprio

A Pycaprio client as defined in pycaprio

None
x Dataframe

A pandas dataframe with columns document_id and user_name as returned by inception_list_annotations

None
as_xmi bool

Boolean, indicating if the resulting object needs to be read in with read_xmi

False
trace int

Integer. If greater than zero, a progress message is shown each time trace documents have been extracted

0

Returns:

Type Description
List[Cas]

A list with the deserialised annotations of type cassis.cas.Cas

Examples:

>>> from blackbar import inception_list_projects, inception_list_documents, inception_list_annotations, inception_cas
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["anno_raw"] = inception_cas(x = progress, as_xmi = False)
>>> progress["anno"] = inception_cas(x = progress, as_xmi = True)

blackbar.inception.read_xmi(data, type='cas')

Read a CAS (Common Analysis System) XMI file exported from Inception. The CAS is a data structure representing an object to be enriched with annotations.

Parameters:

Name Type Description Default
data bytes

a set of annotations as returned by client.api.annotation

required
type str

a str with only possible value: 'cas'

'cas'

Returns:

Type Description
Cas

The deserialized CAS of type cassis.cas.Cas

Examples:

>>> from blackbar import inception_list_projects, inception_list_documents, inception_list_annotations, inception_cas, read_xmi
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["anno_raw"] = inception_cas(x = progress, as_xmi = False)
>>> progress["anno"] = [read_xmi(el) for el in progress["anno_raw"]]

Utilities to get the labelled entities

blackbar.blackbar.blackbar_inception_entities(cas=None, entity_type='custom.Span', entity_type_attr='label', trace=0, type='document-entities')

Extract the tokens with the entities from an Inception CAS (Common Analysis System). These are entities which should be defined as 'custom.Span' and have an extra field: 'entiteitsdetail' or 'entity_detail'

Parameters:

Name Type Description Default
cas Cas

an object of type cassis.cas.Cas as returned by read_xmi

None
entity_type str

the Cas type in which the span is stored. Defaults to "custom.Span"

'custom.Span'
entity_type_attr str

the Cas attribute in which the content of the span is stored. Defaults to "label"

'label'
trace int

logging argument. By default (0), nothing is printed when the "custom.Span" type has not been found.

0
type str

either 'document-entities' to get all entities across sentences or 'sentence-entities' to get entities which are not crossing sentence boundaries. Defaults to 'document-entities'

'document-entities'

Returns:

Type Description

A pandas DataFrame with tokens containing the annotated data with columns: doc_id_inception, sentence, sentence_start, sentence_end, token, start, end, entity, entity_detail

Examples:

>>> from blackbar import inception_list_projects, inception_list_documents, inception_list_annotations, inception_cas, blackbar_inception_entities
>>> x = inception_list_projects()
>>> #project_id = int(x['project_id'][x['project_name'] == 'deid-sample202112'].iloc[0])
>>> project_id = int(x['project_id'][x['project_name'] == 'dieetadvies'].iloc[0])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["anno"] = inception_cas(x = progress, as_xmi = True)
>>> anno = blackbar_inception_entities(progress["anno"][0], type = "document-entities")
>>> anno = blackbar_inception_entities(progress["anno"][0], type = "sentence-entities")
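
The difference between the two type values comes down to whether an entity span fits inside a single sentence. The sketch below illustrates that check on plain offsets. It is not the blackbar implementation: the helper `within_one_sentence`, the sample spans, and the convention of 0-based offsets with an exclusive end are all assumptions made for illustration.

```python
def within_one_sentence(entity, sentences):
    """True if the entity span lies completely inside one (start, end) sentence span."""
    return any(
        s_start <= entity["start"] and entity["end"] <= s_end
        for s_start, s_end in sentences
    )

# Two sentences separated by a newline at offset 20 (hypothetical offsets).
sentences = [(0, 20), (21, 33)]
entities = [
    {"entity": "A", "start": 10, "end": 20},  # fully inside the first sentence
    {"entity": "B", "start": 16, "end": 25},  # crosses the sentence boundary
]

sentence_entities = [e for e in entities if within_one_sentence(e, sentences)]
print([e["entity"] for e in sentence_entities])
```

Under this reading, a document-level extraction keeps both entities, while a sentence-level extraction keeps only entity A.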

blackbar.modelling.blackbar_inception_annotations(projects=None, annotation_state=None, entity_type='custom.Span', entity_type_attr='label', n_max=0, trace=0, encoding=None, encoding_errors='replace', docid_type='default')

Get the annotations performed in Inception on projects

Parameters:

Name Type Description Default
projects [List[str], str, None]

a list of project names, or a single project name

None
annotation_state [List[str], None]

a list of annotation states, e.g. ["COMPLETE", "IN-PROGRESS"], to which the annotated entities are limited

None
entity_type str

the Cas type in which the span is stored. Defaults to "custom.Span"

'custom.Span'
entity_type_attr str

the Cas attribute in which the content of the span is stored. Defaults to "label"

'label'
n_max int

Integer. If greater than zero, only the first n_max documents are fetched. Defaults to all documents.

0
trace int

logging argument, passed on to inception_list_documents and inception_list_annotations. Defaults to 0

0
encoding str

String indicating how to decode the text. Possible values are None (return bytes), 'utf-8', 'latin-1'. Passed on to inception_list_documents.

None
encoding_errors str

How to handle encoding errors: one of 'strict', 'replace' or 'ignore'. Defaults to 'replace'. Passed on to inception_list_documents.

'replace'
docid_type str

String, either 'default' or 'label', indicating whether to keep the doc_id (default) or recode it to the label.

'default'

Returns:

Type Description

A dictionary with elements 'docs', 'anno' and 'entities' where

  • docs is a pandas dataframe with columns 'project_id', 'doc_id', 'state', 'text', 'label'
  • anno is a pandas dataframe with columns 'project_id', 'doc_id', 'annotation_nr', 'annotation_state', 'annotation_date', 'annotation_timepoint', 'user_name', 'cas'
  • entities is a pandas dataframe with columns 'project_id', 'doc_id', 'annotation_nr', 'user_name', 'sentence', 'sentence_start', 'sentence_end', 'token', 'start', 'end', 'entity', 'entity_detail'

Examples:

>>> from blackbar import inception_list_projects, blackbar_inception_annotations
>>> x = inception_list_projects()
>>> anno = blackbar_inception_annotations(['testuzb'])
>>> anno = blackbar_inception_annotations(['testuzb'], annotation_state = ["COMPLETE", "IN-PROGRESS"])
>>> anno = blackbar_inception_annotations(['testuzb'], annotation_state = ["COMPLETE", "IN-PROGRESS"], encoding = "utf-8")

blackbar.blackbar.blackbar_inception_read_export(path, type='curation', entity_type='custom.Span', entity_type_attr='label')

Read an exported zip file from Inception containing XMI annotations/curations of a named entity project

Parameters:

Name Type Description Default
path str

A str with the path to a zip file containing the annotations and curations

required
type str

A str indicating whether to read in the 'curation' or the 'annotation' data. Defaults to 'curation'.

'curation'
entity_type str

the Cas type in which the span is stored. Defaults to "custom.Span"

'custom.Span'
entity_type_attr str

the Cas attribute in which the content of the span is stored. Defaults to "label"

'label'

Returns:

Type Description

A dictionary with elements

  • source: 'inception'
  • project: a dictionary with elements project, users, documents, anno, tagset
  • docs: a pandas dataframe with columns 'project_name', 'document_name', 'user_name', 'document_state', 'annotation_state', 'timestamp'
  • anno: a pandas dataframe with columns 'project_id', 'doc_id', 'document_name', 'text', 'user_name', 'cas'
  • entities: a pandas dataframe with columns 'project_id', 'doc_id', 'doc_id_inception', 'user_name', 'sentence', 'sentence_start', 'sentence_end', 'token', 'start', 'end', 'entity', 'entity_detail'

Examples:

>>> from blackbar import inception_list_projects, inception_export, blackbar_inception_read_export
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> #project_id = int(x[x['project_name'] == 'getuigenissen']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> path = inception_export(id = project_id)
>>> # path = "radiologie.zip"
>>> # path = "blackbar-example_project_2024-11-11_1626.zip"
>>> d = blackbar_inception_read_export(path, type = "curation")
>>> d = blackbar_inception_read_export(path, type = "annotation")
>>> # d["docs"]
>>> # d["anno"]
>>> # d["entities"]

blackbar.blackbar.line_spans(x)

Split text on newlines, where each line is considered a sentence, and identify the locations of these sentences in x

Parameters:

Name Type Description Default
x str

a text string

required

Returns:

Type Description
DataFrame

A pandas DataFrame with columns sentence_id, start, end, sentence, text, nchar, is_space indicating the location of the sentence in x

Examples:

>>> from blackbar import line_spans
>>> x = "I want to break free\nGive me more"
>>> locs = line_spans(x)
>>> locs['sentence'][0]
'I want to break free'
>>> locs['sentence'][1]
'Give me more'
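
The idea behind line_spans can be sketched without blackbar: split on newlines and keep a running offset. The helper `newline_spans` below is hypothetical and returns plain dicts instead of a DataFrame. It assumes 0-based offsets with an exclusive end, as in Python slicing, which may differ from the convention blackbar uses.

```python
def newline_spans(text):
    """Return one record per line with its position in `text` (0-based, end exclusive)."""
    spans = []
    start = 0
    for sentence_id, line in enumerate(text.split("\n")):
        end = start + len(line)
        spans.append(
            {"sentence_id": sentence_id, "start": start, "end": end, "sentence": line}
        )
        start = end + 1  # skip the newline character itself
    return spans

text = "I want to break free\nGive me more"
spans = newline_spans(text)
for span in spans:
    print(span)
```

Slicing the original text with the recorded start and end of a span gives back exactly that sentence, which is a convenient consistency check.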

blackbar.blackbar.token_spans(doc, detailed=False)

Extract a spacy parsed document as a pandas DataFrame

Parameters:

Name Type Description Default
doc Doc

A spacy parsed document

required
detailed bool

Boolean indicating whether to also return the detailed token attributes

False

Returns:

Type Description
DataFrame

A pandas DataFrame with columns token, term_id, lemma, start, end, upos, xpos, dep, dep_term_id, entity, entity_iob and optionally (if detailed is set to True) columns spaces, is_stop, is_alpha, is_digit, is_punct, is_title, like_email, like_num, like_url, is_sent_end, is_sent_start

Examples:

>>> from blackbar import token_spans
>>> from rlike import substr
>>> import spacy
>>> text = "I want to break free\nGive me more"
>>> nlp = spacy.load("nl_core_news_md")
>>> doc = nlp(text, disable = ['ner'])   
>>> x = token_spans(doc)
>>> doc = nlp(text)   
>>> x = token_spans(doc, detailed = True)
>>> list(x["token"]) == substr(text, start = list(x["start"]), end = list(x["end"]))
True
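
The final check in the example above (tokens equal the substrings at their start and end offsets) can be reproduced without spaCy. The sketch below uses a simple whitespace tokenizer as a stand-in for the spaCy pipeline, so the tokens differ from what spaCy would produce. The helper `whitespace_token_spans` is hypothetical, and the offsets are 0-based with an exclusive end, which is an assumption rather than the documented blackbar convention.

```python
import re

def whitespace_token_spans(text):
    """Return tokens (runs of non-whitespace) with their start/end offsets in `text`."""
    return [
        {"token": m.group(), "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\S+", text)
    ]

text = "I want to break free\nGive me more"
tokens = whitespace_token_spans(text)

# Round trip: every token must equal the slice of the text at its offsets.
roundtrip_ok = all(text[t["start"]:t["end"]] == t["token"] for t in tokens)
print(len(tokens), roundtrip_ok)
```

Keeping character offsets next to each token is what makes it possible to map detected entities back onto the original text later on.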

blackbar.blackbar.token_entity_spans(doc, doc_id=None)

Extract the locations of the detected entities of a spacy parsed document as a pandas DataFrame

Parameters:

Name Type Description Default
doc Doc

A spacy parsed document

required
doc_id str

an identifier of the document which will be put into column doc_id

None

Returns:

Type Description
DataFrame

A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length where term_search and similarity will always be empty

Examples:

>>> from blackbar import token_entity_spans
>>> from blackbar.data import blackbar_example
>>> from rlike import substr
>>> import spacy
>>> nlp = spacy.load("nl_core_news_md")
>>> text = blackbar_example()
>>> doc = nlp(text)   
>>> x = token_entity_spans(doc)
>>> x = token_entity_spans(doc, doc_id = "abc-123-xyz")
>>> list(x["term_detected"]) == substr(text, start = list(x["start"]), end = list(x["end"]))
True

blackbar.blackbar.merge_chunkranges(data, text, include_end=False)

Combine overlapping chunk ranges. Overlapping ranges are handled as follows. The ranges are first ordered by their start positions. If the start value of a chunk falls within the previous chunk range (it is smaller than the previous end value, so the two overlap), the previous end value is extended to the new end value and the longest search string (term_search) is kept, on the assumption that the detected chunk ranges come from a Smith-Waterman alignment.

Parameters:

Name Type Description Default
data DataFrame

A data frame with columns doc_id, label_, term_search, start, end, similarity indicating the location of detected entities, where the end value is the end position in the text + 1

required
text str

a text string based on which the entities in data were found

required
include_end bool

Boolean indicating whether the end position is included

False

Returns:

Type Description

A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length indicating the locations of the merged chunk ranges

Examples:

>>> from pandas import DataFrame
>>> from rlike import substr
>>> from blackbar.data import blackbar_example
>>> text = blackbar_example()
>>> ent = {'doc_id': text, 'similarity': 0, 'term_search': ['A', 'AA', 'B', 'B', 'C', 'CC', 'D', 'DD'], 'start': [10, 10+5, 93, 242, 343 + 4, 343, 384, 384+4], 'end': [19, 19, 98, 247, 343 + 7, 359, 400, 400+2], 'label_': ['Date', 'Date', 'Date', 'Date', 'Name', 'Name', 'Name', 'Name']} 
>>> ent = DataFrame(ent)
>>> substr(ent["doc_id"], start = list(ent["start"]), end = list(ent["end"] - 1))
['10/4/2021', '2021', '30/03', '09/04', 'Jan', 'Dr. Jan Janssens', 'Dr. Jan Janssens', 'Jan Janssens |']
>>> chunkranges = merge_chunkranges(ent, text)
>>> list(chunkranges["term_detected"])
['10/4/2021', '30/03', '09/04', 'Dr. Jan Janssens', 'Dr. Jan Janssens |']
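
The merging rule described for merge_chunkranges can be sketched on plain tuples. This is an illustration of the described rule, not the blackbar code: the helper `merge_ranges` is hypothetical, it works on (start, end, term_search) tuples instead of a DataFrame, and it extends the previous end with the maximum of both end values, which is a small assumption on top of the description.

```python
def merge_ranges(ranges):
    """Merge overlapping (start, end, term) ranges, keeping the longest term."""
    merged = []
    for start, end, term in sorted(ranges, key=lambda r: (r[0], r[1])):
        if merged and start < merged[-1][1]:
            # Overlap: extend the previous range and keep the longest search string.
            prev_start, prev_end, prev_term = merged[-1]
            longest = term if len(term) > len(prev_term) else prev_term
            merged[-1] = (prev_start, max(prev_end, end), longest)
        else:
            merged.append((start, end, term))
    return merged

# Same start/end/term_search values as in the example above.
starts = [10, 15, 93, 242, 347, 343, 384, 388]
ends = [19, 19, 98, 247, 350, 359, 400, 402]
terms = ["A", "AA", "B", "B", "C", "CC", "D", "DD"]

merged = merge_ranges(list(zip(starts, ends, terms)))
for row in merged:
    print(row)
```

The eight input ranges collapse into five, in line with the five detected terms returned by the example above.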