Interact with Inception and the Inception API
Inception is an annotation platform used in blackbar to generate training data for building named entity recognition models for the anonymization of texts.
Inception provides an API that allows you to interact with the environment (see the inception docs).
- Connectivity to the API is handled by the python package pycaprio and is configured through the environment variables INCEPTION_HOST, INCEPTION_USERNAME and INCEPTION_PASSWORD
- Once these are set, you can perform the operations below
```python
from rlike import *
environ = {
    "INCEPTION_HOST": "https://inception.datatailor.be/",
    "INCEPTION_USERNAME": "XXXXXXXXXX",
    "INCEPTION_PASSWORD": "XXXXXXXXXX"}
Sys_setenv(environ)
```
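If the rlike helper is not available, the same variables can be set with only the standard library (the host URL and credentials below are placeholders):

```python
import os

# Equivalent of Sys_setenv: export the Inception connection
# settings into the process environment.
os.environ.update({
    "INCEPTION_HOST": "https://inception.datatailor.be/",
    "INCEPTION_USERNAME": "XXXXXXXXXX",
    "INCEPTION_PASSWORD": "XXXXXXXXXX",
})
```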
Connect with the API / list the projects / export
blackbar.inception.inception_client(host=None, user=None, password=None)
blackbar.inception.inception_list_projects(client=None)
Get a list of projects in Inception using the Inception API
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |

Returns:

Type | Description |
---|---|
DataFrame | a pandas dataframe with columns project_id, project_name |
Examples:
>>> from blackbar import inception_list_projects
>>> x = inception_list_projects()
blackbar.inception.inception_export(client=None, id=None, folder=os.getcwd(), format=None)
Export a project from Inception in XMI format as a zip file
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |
id | int | The project_id of the Inception project as indicated by inception_list_projects | None |
folder | str | The folder where the zip file will be extracted to. Defaults to the current working directory | getcwd() |
format | str | The output format. Defaults to XMI format | None |

Returns:

Type | Description |
---|---|
str | A str with the path to the zip file, which will have the name of the project |
Examples:
>>> from blackbar import inception_list_projects, inception_export
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> path = inception_export(id = project_id)
>>> path = inception_export(id = x['project_id'][0])
>>> path = inception_export(id = x['project_id'][1])
blackbar.inception.inception_import_project(client=None, path=None)
Import an Inception project which was exported as a zip file
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |
path | str | string with the path to a zip file | None |

Returns:

Type | Description |
---|---|
dict | a dict with elements: project_id and project_name |
Examples:
>>> from blackbar import inception_list_projects, inception_export, inception_import_project
>>> x = inception_list_projects()
>>> project_id = x['project_id'][x['project_name'] == 'blackbar-example']
>>> project_id = int(project_id.iloc[0])
>>> path = inception_export(id = project_id)
>>> proj = inception_import_project(path = path)
blackbar.inception.inception_read_eventlog(path)
Read the event log from an Inception export
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | str | Path to a .zip file extracted with inception_export | required |

Returns:

Type | Description |
---|---|
DataFrame | A pandas dataframe with columns 'created', 'details', 'event', 'id', 'user', 'document_name', 'annotator' |
Examples:
>>> from blackbar import inception_list_projects, inception_export, inception_read_eventlog
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> path = inception_export(id = project_id)
>>> events = inception_read_eventlog(path)
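inception_read_eventlog does the parsing internally; as a rough illustration of what reading a member out of such an export involves, the standard library's zipfile module suffices. The member name "event.log" below is purely illustrative, not a guarantee about the export layout:

```python
import io
import zipfile

def read_zip_member(path, member):
    """Return the raw bytes of one member of a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return zf.read(member)

# Build a tiny in-memory archive to demonstrate; a real Inception
# export would be the zip file written by inception_export.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("event.log", b'{"event": "SpanCreatedEvent"}')
data = read_zip_member(buf, "event.log")
```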
List the annotations and read in the data
blackbar.inception.inception_list_documents(client=None, id=None, trace=0, text=True, n_max=0, encoding=None, encoding_errors='replace')
Get a list of documents for a project in Inception using the Inception API
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |
id | int | The project_id of the Inception project as indicated by inception_list_projects | None |
trace | int | Integer, if bigger than zero, shows the trace of the extraction every trace number of extracted documents | 0 |
text | bool | Boolean, indicating to also fetch the text of the document. Defaults to True. | True |
n_max | int | Integer, if bigger than zero, get only the first n_max documents. Defaults to all documents. | 0 |
encoding | str | string, indicating how to decode the text. Possible values are None (return bytes), 'utf-8', 'latin-1' | None |
encoding_errors | str | what to do in case of encoding errors. Either 'strict', 'replace' or 'ignore'. Defaults to 'replace' | 'replace' |

Returns:

Type | Description |
---|---|
DataFrame | A pandas dataframe with columns project_id, doc_id, text, label and state. In case text is set to False, the text of the document is not included. |
Examples:
>>> from blackbar import inception_list_projects, inception_list_documents
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> docs = inception_list_documents(id = project_id)
>>> #docs = inception_list_documents(id = x['project_id'][0], trace = 10)
>>> #docs = inception_list_documents(id = x['project_id'][x['project_name'] == 'penningkohieren-aalst'], text = False)
>>> #docs = inception_list_documents(id = x['project_id'][x['project_name'] == 'penningkohieren-aalst'], text = True, trace = 10)
blackbar.inception.inception_list_annotations(client=None, id=None, document=None, trace=0)
Get the annotations performed on documents for a project in Inception using the Inception API
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |
id | int | The project_id of the Inception project as indicated by inception_list_projects | None |
document | int | The document id's for which the annotations need to be extracted. If not provided, will fetch all documents. | None |
trace | int | Integer, if bigger than zero, shows the trace of the extraction every trace number of extracted documents | 0 |

Returns:

Type | Description |
---|---|
DataFrame | A pandas dataframe with columns project_id, document_id, user_name, annotation_state and timestamp containing all annotations which are not in the 'NEW' state |
Examples:
>>> from blackbar import inception_list_projects, inception_list_documents, inception_list_annotations
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'penningkohieren-aalst']['project_id'].iloc[0])
>>> docs = inception_list_documents(id = project_id, text = False)
>>> docs = list(docs["doc_id"])
>>> progress = inception_list_annotations(id = project_id, document = [88, 89])
>>> progress = inception_list_annotations(id = project_id)
blackbar.inception.inception_types(x)
Get the types available in an Inception CAS or TypeSystem object
Parameters:

Name | Type | Description | Default |
---|---|---|---|
x | Cas or TypeSystem | A CAS object or a TypeSystem object | required |

Returns:

Type | Description |
---|---|
DataFrame | A pandas dataframe with columns name, description, features |
Examples:
>>> from blackbar import inception_list_projects, inception_list_annotations, inception_cas, inception_types
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["annotation"] = inception_cas(x = progress, as_xmi = True)
>>> types = inception_types(progress["annotation"][0])
blackbar.inception.inception_cas(client=None, x=None, as_xmi=False, trace=0)
Get the detailed CAS (Common Analysis System) annotations for a set of annotations
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |
x | DataFrame | A pandas dataframe with columns document_id and user_name as returned by inception_list_annotations | None |
as_xmi | bool | Boolean, indicating if the resulting object needs to be read in with read_xmi | False |
trace | int | Integer, if bigger than zero, shows the trace of the extraction every trace number of extracted documents | 0 |

Returns:

Type | Description |
---|---|
List[Cas] | A list with the deserialised annotations of type cassis.cas.Cas |
Examples:
>>> from blackbar import inception_list_projects, inception_list_documents, inception_list_annotations, inception_cas
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["anno_raw"] = inception_cas(x = progress, as_xmi = False)
>>> progress["anno"] = inception_cas(x = progress, as_xmi = True)
blackbar.inception.read_xmi(data, type='cas')
Read a CAS (Common Analysis System) XMI file exported from Inception. The CAS is a data structure representing an object to be enriched with annotations.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
data | bytes | a set of annotations as returned by client.api.annotation | required |
type | str | a str with only possible value: 'cas' | 'cas' |

Returns:

Type | Description |
---|---|
Cas | The deserialized CAS of type cassis.cas.Cas |
Examples:
>>> from blackbar import inception_list_projects, inception_list_annotations, inception_cas, read_xmi
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["anno_raw"] = inception_cas(x = progress, as_xmi = False)
>>> progress["anno"] = [read_xmi(el) for el in progress["anno_raw"]]
Utilities to get the labelled entities
blackbar.blackbar.blackbar_inception_entities(cas=None, entity_type='custom.Span', entity_type_attr='label', trace=0, type='document-entities')
Extract the tokens with the entities from an Inception CAS (Common Analysis System). These are entities which should be defined as 'custom.Span' and have an extra field: 'entiteitsdetail' or 'entity_detail'
Parameters:

Name | Type | Description | Default |
---|---|---|---|
cas | Cas | an object of type cassis.cas.Cas as returned by read_xmi | None |
entity_type | str | the Cas type where the span is stored in. Defaults to "custom.Span" | 'custom.Span' |
entity_type_attr | str | the Cas attribute where the content of the span is stored in. Defaults to "label" | 'label' |
trace | int | logging argument. Defaults to not printing if the "custom.Span" has not been found. | 0 |
type | str | either 'document-entities' to get all entities across sentences or 'sentence-entities' to get entities which are not crossing sentence boundaries. Defaults to 'document-entities' | 'document-entities' |

Returns:

Type | Description |
---|---|
DataFrame | A pandas DataFrame with tokens containing the annotated data with columns: doc_id_inception, sentence, sentence_start, sentence_end, token, start, end, entity, entity_detail |
Examples:
>>> from blackbar import inception_list_projects, inception_list_documents, inception_list_annotations, inception_cas, blackbar_inception_entities
>>> x = inception_list_projects()
>>> project_id = int(x['project_id'][x['project_name'] == 'deid-sample202112'])
>>> project_id = int(x['project_id'][x['project_name'] == 'dieetadvies'])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["anno"] = inception_cas(x = progress, as_xmi = True)
>>> anno = blackbar_inception_entities(progress["anno"][0], type = "document-entities")
>>> anno = blackbar_inception_entities(progress["anno"][0], type = "sentence-entities")
blackbar.modelling.blackbar_inception_annotations(projects=None, annotation_state=None, entity_type='custom.Span', entity_type_attr='label', n_max=0, trace=0, encoding=None, encoding_errors='replace', docid_type='default')
Get the annotations performed in Inception on projects
Parameters:

Name | Type | Description | Default |
---|---|---|---|
projects | [List[str], str, None] | a list of projects | None |
annotation_state | [List[str], None] | a list of annotation states - e.g. ["COMPLETE", "IN-PROGRESS"] to limit the annotation entities to | None |
entity_type | str | the Cas type where the span is stored in. Defaults to "custom.Span" | 'custom.Span' |
entity_type_attr | str | the Cas attribute where the content of the span is stored in. Defaults to "label" | 'label' |
n_max | int | Integer, if bigger than zero, get only the first n_max documents. Defaults to all documents. | 0 |
trace | int | logging argument, passed on to inception_list_documents and inception_list_annotations. Defaults to 0 | 0 |
encoding | str | string, indicating how to decode the text. Possible values are None (return bytes), 'utf-8', 'latin-1'. Passed on to inception_list_documents. | None |
encoding_errors | str | what to do in case of encoding errors. Either 'strict', 'replace' or 'ignore'. Defaults to 'replace'. Passed on to inception_list_documents. | 'replace' |
docid_type | str | string - either 'default' or 'label'. Indicating to keep the doc_id (default) or recode it to the label. | 'default' |

Returns:

Type | Description |
---|---|
dict | A dictionary with elements 'docs', 'anno' and 'entities' |
Examples:
>>> from blackbar import inception_list_projects, blackbar_inception_annotations
>>> x = inception_list_projects()
>>> anno = blackbar_inception_annotations(['testuzb'])
>>> anno = blackbar_inception_annotations(['testuzb'], annotation_state = ["COMPLETE", "IN-PROGRESS"])
>>> anno = blackbar_inception_annotations(['testuzb'], annotation_state = ["COMPLETE", "IN-PROGRESS"], encoding = "utf-8")
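Conceptually, restricting on annotation_state is a row filter. The sketch below illustrates that filter with plain dicts; the actual function works on the pandas dataframes returned by the API helpers, and the rows shown are made-up sample data:

```python
# Simplified stand-in for filtering annotations on their state.
rows = [
    {"document_id": 88, "user_name": "jan", "annotation_state": "COMPLETE"},
    {"document_id": 89, "user_name": "jan", "annotation_state": "NEW"},
    {"document_id": 90, "user_name": "an",  "annotation_state": "IN-PROGRESS"},
]
keep = ["COMPLETE", "IN-PROGRESS"]
selected = [r for r in rows if r["annotation_state"] in keep]
```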
blackbar.blackbar.blackbar_inception_read_export(path, type='curation', entity_type='custom.Span', entity_type_attr='label')
Read an exported zip file from Inception containing XMI annotations/curations of a named entity project
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | str | A str with the path to a zip file containing the annotations and curations | required |
type | str | A str indicating to read in either the 'curation' or the 'annotation'. Defaults to 'curation'. | 'curation' |
entity_type | str | the Cas type where the span is stored in. Defaults to "custom.Span" | 'custom.Span' |
entity_type_attr | str | the Cas attribute where the content of the span is stored in. Defaults to "label" | 'label' |

Returns:

Type | Description |
---|---|
dict | A dictionary with elements 'docs', 'anno' and 'entities' |
Examples:
>>> from blackbar import inception_list_projects, inception_export, blackbar_inception_read_export
>>> x = inception_list_projects()
>>> #project_id = x['project_id'][x['project_name'] == 'deid-sample202112']
>>> project_id = x['project_id'][x['project_name'] == 'getuigenissen']
>>> project_id = x['project_id'][x['project_name'] == 'testuzb']
>>> path = inception_export(id = project_id)
>>> # path = "radiologie.zip"
>>> # path = "blackbar-example_project_2024-11-11_1626.zip"
>>> d = blackbar_inception_read_export(path, type = "curation")
>>> d = blackbar_inception_read_export(path, type = "annotation")
>>> # d["docs"]
>>> # d["anno"]
>>> # d["entities"]
blackbar.blackbar.line_spans(x)
Split text on newlines, treating each line as a sentence, and identify the locations of these sentences in x
Parameters:

Name | Type | Description | Default |
---|---|---|---|
x | str | a text string | required |

Returns:

Type | Description |
---|---|
DataFrame | A pandas DataFrame with columns sentence_id, start, end, sentence, text, nchar, is_space indicating the location of the sentence in x |
Examples:
>>> from blackbar import line_spans
>>> x = "I want to break free\nGive me more"
>>> locs = line_spans(x)
>>> locs['sentence'][0]
'I want to break free'
>>> locs['sentence'][1]
'Give me more'
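The underlying offset bookkeeping can be sketched with plain Python. This is a simplified stand-in for line_spans that returns only (start, end, line) tuples instead of the full DataFrame:

```python
def line_offsets(text):
    """Locate each newline-separated line as (start, end, line),
    such that text[start:end] == line."""
    spans, start = [], 0
    for line in text.split("\n"):
        end = start + len(line)
        spans.append((start, end, line))
        start = end + 1  # skip over the newline character
    return spans

x = "I want to break free\nGive me more"
spans = line_offsets(x)
```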
blackbar.blackbar.token_spans(doc, detailed=False)
Extract a spacy parsed document as a pandas DataFrame
Parameters:

Name | Type | Description | Default |
---|---|---|---|
doc | Doc | A spacy parsed document | required |
detailed | bool | Boolean indicating to return detailed token information | False |

Returns:

Type | Description |
---|---|
DataFrame | A pandas DataFrame with columns token, term_id, lemma, start, end, upos, xpos, dep, dep_term_id, entity, entity_iob and optionally (if detailed is set to True) columns spaces, is_stop, is_alpha, is_digit, is_punct, is_title, like_email, like_num, like_url, is_sent_end, is_sent_start |
Examples:
>>> from blackbar import token_spans
>>> from rlike import substr
>>> import spacy
>>> text = "I want to break free\nGive me more"
>>> nlp = spacy.load("nl_core_news_md")
>>> doc = nlp(text, disable = ['ner'])
>>> x = token_spans(doc)
>>> doc = nlp(text)
>>> x = token_spans(doc, detailed = True)
>>> list(x["token"]) == substr(text, start = list(x["start"]), end = list(x["end"]))
True
blackbar.blackbar.token_entity_spans(doc, doc_id=None)
Extract the locations of the detected entities of a spacy parsed document as a pandas DataFrame
Parameters:

Name | Type | Description | Default |
---|---|---|---|
doc | Doc | A spacy parsed document | required |
doc_id | str | an identifier of the document which will be put into column doc_id | None |

Returns:

Type | Description |
---|---|
DataFrame | A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length where term_search and similarity will always be empty |
Examples:
>>> from blackbar import token_entity_spans
>>> from blackbar.data import blackbar_example
>>> from rlike import substr
>>> import spacy
>>> nlp = spacy.load("nl_core_news_md")
>>> text = blackbar_example()
>>> doc = nlp(text)
>>> x = token_entity_spans(doc)
>>> x = token_entity_spans(doc, doc_id = "abc-123-xyz")
>>> list(x["term_detected"]) == substr(text, start = list(x["start"]), end = list(x["end"]))
True
blackbar.blackbar.merge_chunkranges(data, text, include_end=False)
Combine overlapping chunk ranges. Overlapping ranges are handled as follows: the ranges are first ordered by their start positions. If the start value of a chunk falls within the previous chunk range (i.e. it is smaller than the previous end value, so they overlap), the previous end value is extended to the new end value and the longest search string (term_search) is kept, as the detected chunk ranges are assumed to come from a Smith-Waterman alignment.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
data | DataFrame | A data frame with columns doc_id, label_, term_search, start, end, similarity indicating the location of detected entities, where the end value is the end position in the text + 1 | required |
text | str | a text string based on which the entities in data were found | required |
include_end | bool | bool indicating to include the end position | False |

Returns:

Type | Description |
---|---|
DataFrame | A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length indicating the merged entity ranges |
Examples:
>>> from pandas import DataFrame
>>> from rlike import substr
>>> from blackbar.data import blackbar_example
>>> text = blackbar_example()
>>> ent = {'doc_id': text, 'similarity': 0, 'term_search': ['A', 'AA', 'B', 'B', 'C', 'CC', 'D', 'DD'], 'start': [10, 10+5, 93, 242, 343 + 4, 343, 384, 384+4], 'end': [19, 19, 98, 247, 343 + 7, 359, 400, 400+2], 'label_': ['Date', 'Date', 'Date', 'Date', 'Name', 'Name', 'Name', 'Name']}
>>> ent = DataFrame(ent)
>>> substr(ent["doc_id"], start = list(ent["start"]), end = list(ent["end"] - 1))
['10/4/2021', '2021', '30/03', '09/04', 'Jan', 'Dr. Jan Janssens', 'Dr. Jan Janssens', 'Jan Janssens |']
>>> chunkranges = merge_chunkranges(ent, text)
>>> list(chunkranges["term_detected"])
['10/4/2021', '30/03', '09/04', 'Dr. Jan Janssens', 'Dr. Jan Janssens |']
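The merging rule described above (sort by start, extend the previous end on overlap) can be sketched for plain (start, end) pairs. This is a simplified illustration only; the blackbar implementation additionally tracks term_search, labels and similarity:

```python
def merge_ranges(ranges):
    """Merge overlapping (start, end) ranges, with end-exclusive ends.
    Ranges are first sorted by start position; a range whose start falls
    before the previous range's end extends that previous range."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start < merged[-1][1]:
            # Overlap: stretch the previous range to the furthest end.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

# Start/end pairs mimicking the Date/Name example above.
result = merge_ranges([(10, 19), (15, 19), (343, 359), (347, 350), (384, 400)])
```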