Interact with Inception and the Inception API
Inception is an annotation platform used in blackbar to generate training data for building named entity recognition models for the anonymization of texts.
Inception provides an API that allows you to interact with the environment (see the inception docs).
- Connectivity to the API is handled by the python package pycaprio and is configured through the environment variables INCEPTION_HOST, INCEPTION_USERNAME and INCEPTION_PASSWORD
- Once these are set, you can perform the operations below
```python
from rlike import *
environ = {
    "INCEPTION_HOST": "https://inception.datatailor.be/",
    "INCEPTION_USERNAME": "XXXXXXXXXX",
    "INCEPTION_PASSWORD": "XXXXXXXXXX"}
Sys_setenv(environ)
```
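If the rlike helper is not available, the same variables can be set with only the standard library (the host URL and credentials below are placeholders):

```python
import os

# Equivalent of Sys_setenv: export the Inception connection
# settings into the process environment.
os.environ.update({
    "INCEPTION_HOST": "https://inception.datatailor.be/",
    "INCEPTION_USERNAME": "XXXXXXXXXX",
    "INCEPTION_PASSWORD": "XXXXXXXXXX",
})
```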
Connect with the API / list the projects / export
blackbar.inception.inception_client(host=None, user=None, password=None)
blackbar.inception.inception_list_projects(client=None)
Get a list of projects in Inception using the Inception API
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |

Returns:

Type | Description |
---|---|
DataFrame | a pandas dataframe with columns project_id, project_name |
Examples:
>>> from blackbar import inception_list_projects
>>> x = inception_list_projects()
blackbar.inception.inception_export(client=None, id=None, folder=os.getcwd(), format=None)
Export a project from Inception in XMI format as a zip file
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |
id | int | The project_id of the Inception project as indicated by inception_list_projects | None |
folder | str | The folder where the zip file will be extracted to. Defaults to the current working directory | getcwd() |
format | str | The output format. Defaults to XMI format | None |

Returns:

Type | Description |
---|---|
str | A str with the path to the zip file, which will have the name of the project |
Examples:
>>> from blackbar import inception_list_projects, inception_export
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> path = inception_export(id = project_id)
>>> path = inception_export(id = x['project_id'][0])
>>> path = inception_export(id = x['project_id'][1])
blackbar.inception.inception_import_project(client=None, path=None)
Import an Inception project which was exported as a zip file
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |
path | str | string with the path to a zip file | None |

Returns:

Type | Description |
---|---|
dict | a dict with elements: project_id and project_name |
Examples:
>>> from blackbar import inception_list_projects, inception_export, inception_import_project
>>> x = inception_list_projects()
>>> project_id = x['project_id'][x['project_name'] == 'blackbar-example']
>>> project_id = int(project_id.iloc[0])
>>> path = inception_export(id = project_id)
>>> proj = inception_import_project(path = path)
blackbar.inception.inception_read_eventlog(path)
Read the event log from an Inception export
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | str | Path to a .zip file extracted with inception_export | required |

Returns:

Type | Description |
---|---|
DataFrame | A pandas dataframe with columns 'created', 'details', 'event', 'id', 'user', 'document_name', 'annotator' |
Examples:
>>> from blackbar import inception_list_projects, inception_export, inception_read_eventlog
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> path = inception_export(id = project_id)
>>> events = inception_read_eventlog(path)
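inception_read_eventlog does the parsing internally; as a rough illustration of what reading a member out of such an export involves, the standard library's zipfile module suffices. The member name "event.log" below is purely illustrative, not a guarantee about the export layout:

```python
import io
import zipfile

def read_zip_member(path, member):
    """Return the raw bytes of one member of a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return zf.read(member)

# Build a tiny in-memory archive to demonstrate; a real Inception
# export would be the zip file written by inception_export.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("event.log", b'{"event": "SpanCreatedEvent"}')
data = read_zip_member(buf, "event.log")
```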
List the annotations and read in the data
blackbar.inception.inception_list_documents(client=None, id=None, trace=0, text=True, n_max=0, encoding=None, encoding_errors='replace')
Get a list of documents for a project in Inception using the Inception API
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |
id | int | The project_id of the Inception project as indicated by inception_list_projects | None |
trace | int | Integer, if bigger than zero, shows the trace of the extraction every trace number of extracted documents | 0 |
text | bool | Boolean, indicating to also fetch the text of the document. Defaults to True. | True |
n_max | int | Integer, if bigger than zero, get only the first n_max documents. Defaults to all documents. | 0 |
encoding | str | string, indicating how to decode the text. Possible values are None (return bytes), 'utf-8', 'latin-1' | None |
encoding_errors | str | what to do in case of encoding errors. Either 'strict', 'replace' or 'ignore'. Defaults to 'replace' | 'replace' |

Returns:

Type | Description |
---|---|
DataFrame | A pandas dataframe with columns project_id, doc_id, text, label and state. In case text is set to False, the text of the document is not included. |
Examples:
>>> from blackbar import inception_list_projects, inception_list_documents
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> docs = inception_list_documents(id = project_id)
>>> #docs = inception_list_documents(id = x['project_id'][0], trace = 10)
>>> #docs = inception_list_documents(id = x['project_id'][x['project_name'] == 'penningkohieren-aalst'], text = False)
>>> #docs = inception_list_documents(id = x['project_id'][x['project_name'] == 'penningkohieren-aalst'], text = True, trace = 10)
blackbar.inception.inception_list_annotations(client=None, id=None, document=None, trace=0)
Get the annotations performed on documents for a project in Inception using the Inception API
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |
id | int | The project_id of the Inception project as indicated by inception_list_projects | None |
document | int | The document id's for which the annotations need to be extracted. If not provided, will fetch all documents. | None |
trace | int | Integer, if bigger than zero, shows the trace of the extraction every trace number of extracted documents | 0 |

Returns:

Type | Description |
---|---|
DataFrame | A pandas dataframe with columns project_id, document_id, user_name, annotation_state and timestamp containing all annotations which are not in the 'NEW' state |
Examples:
>>> from blackbar import inception_list_projects, inception_list_documents, inception_list_annotations
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'penningkohieren-aalst']['project_id'].iloc[0])
>>> docs = inception_list_documents(id = project_id, text = False)
>>> docs = list(docs["doc_id"])
>>> progress = inception_list_annotations(id = project_id, document = [88, 89])
>>> progress = inception_list_annotations(id = project_id)
blackbar.inception.inception_types(x)
Get the types available in an Inception CAS or TypeSystem object
Parameters:

Name | Type | Description | Default |
---|---|---|---|
x | Cas or TypeSystem | A CAS object or a TypeSystem object | required |

Returns:

Type | Description |
---|---|
DataFrame | A pandas dataframe with columns name, description, features |
Examples:
>>> from blackbar import inception_list_projects, inception_list_annotations, inception_cas, inception_types
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["annotation"] = inception_cas(x = progress, as_xmi = True)
>>> types = inception_types(progress["annotation"][0])
blackbar.inception.inception_cas(client=None, x=None, as_xmi=False, trace=0)
Get the detailed CAS (Common Analysis System) annotations for a set of annotations
Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | Pycaprio | A Pycaprio client as defined in pycaprio | None |
x | DataFrame | A pandas dataframe with columns document_id and user_name as returned by inception_list_annotations | None |
as_xmi | bool | Boolean, indicating if the resulting object needs to be read in with read_xmi | False |
trace | int | Integer, if bigger than zero, shows the trace of the extraction every trace number of extracted documents | 0 |

Returns:

Type | Description |
---|---|
List[Cas] | A list with the deserialised annotations of type cassis.cas.Cas |
Examples:
>>> from blackbar import inception_list_projects, inception_list_documents, inception_list_annotations, inception_cas
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["anno_raw"] = inception_cas(x = progress, as_xmi = False)
>>> progress["anno"] = inception_cas(x = progress, as_xmi = True)
blackbar.inception.read_xmi(data, type='cas')
Read a CAS (Common Analysis System) XMI file exported from Inception. The CAS is a data structure representing an object to be enriched with annotations.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
data | bytes | a set of annotations as returned by client.api.annotation | required |
type | str | a str with only possible value: 'cas' | 'cas' |

Returns:

Type | Description |
---|---|
Cas | The deserialized CAS of type cassis.cas.Cas |
Examples:
>>> from blackbar import inception_list_projects, inception_list_annotations, inception_cas, read_xmi
>>> x = inception_list_projects()
>>> #project_id = int(x[x['project_name'] == 'deid-sample202112']['project_id'].iloc[0])
>>> project_id = int(x[x['project_name'] == 'testuzb']['project_id'].iloc[0])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["anno_raw"] = inception_cas(x = progress, as_xmi = False)
>>> progress["anno"] = [read_xmi(el) for el in progress["anno_raw"]]
Utilities to get the labelled entities
blackbar.blackbar.blackbar_inception_entities(cas=None, entity_type='custom.Span', entity_type_attr='label', trace=0, type='document-entities')
Extract the tokens with the entities from an Inception CAS (Common Analysis System). These are entities which should be defined as 'custom.Span' and have an extra field: 'entiteitsdetail' or 'entity_detail'
Parameters:

Name | Type | Description | Default |
---|---|---|---|
cas | Cas | an object of type cassis.cas.Cas as returned by read_xmi | None |
entity_type | str | the Cas type where the span is stored in. Defaults to "custom.Span" | 'custom.Span' |
entity_type_attr | str | the Cas attribute where the content of the span is stored in. Defaults to "label" | 'label' |
trace | int | logging argument. Defaults to not printing if the "custom.Span" has not been found. | 0 |
type | str | either 'document-entities' to get all entities across sentences or 'sentence-entities' to get entities which are not crossing sentence boundaries. Defaults to 'document-entities' | 'document-entities' |

Returns:

Type | Description |
---|---|
DataFrame | A pandas DataFrame with tokens containing the annotated data with columns: doc_id_inception, sentence, sentence_start, sentence_end, token, start, end, entity, entity_detail |
Examples:
>>> from blackbar import inception_list_projects, inception_list_documents, inception_list_annotations, inception_cas, blackbar_inception_entities
>>> x = inception_list_projects()
>>> project_id = int(x['project_id'][x['project_name'] == 'deid-sample202112'])
>>> project_id = int(x['project_id'][x['project_name'] == 'dieetadvies'])
>>> progress = inception_list_annotations(id = project_id)
>>> progress["anno"] = inception_cas(x = progress, as_xmi = True)
>>> anno = blackbar_inception_entities(progress["anno"][0], type = "document-entities")
>>> anno = blackbar_inception_entities(progress["anno"][0], type = "sentence-entities")
blackbar.modelling.blackbar_inception_annotations(projects=None, annotation_state=None, entity_type='custom.Span', entity_type_attr='label', n_max=0, trace=0, encoding=None, encoding_errors='replace', docid_type='default')
Get the annotations performed in Inception on projects
Parameters:

Name | Type | Description | Default |
---|---|---|---|
projects | [List[str], str, None] | a list of projects | None |
annotation_state | [List[str], None] | a list of annotation states - e.g. ["COMPLETE", "IN-PROGRESS"] to limit the annotation entities to | None |
entity_type | str | the Cas type where the span is stored in. Defaults to "custom.Span" | 'custom.Span' |
entity_type_attr | str | the Cas attribute where the content of the span is stored in. Defaults to "label" | 'label' |
n_max | int | Integer, if bigger than zero, get only the first n_max documents. Defaults to all documents. | 0 |
trace | int | logging argument, passed on to inception_list_documents and inception_list_annotations. Defaults to 0 | 0 |
encoding | str | string, indicating how to decode the text. Possible values are None (return bytes), 'utf-8', 'latin-1'. Passed on to inception_list_documents. | None |
encoding_errors | str | what to do in case of encoding errors. Either 'strict', 'replace' or 'ignore'. Defaults to 'replace'. Passed on to inception_list_documents. | 'replace' |
docid_type | str | string - either 'default' or 'label'. Indicating to keep the doc_id (default) or recode it to the label. | 'default' |

Returns:

Type | Description |
---|---|
dict | A dictionary with elements 'docs', 'anno' and 'entities' |
Examples:
>>> from blackbar import inception_list_projects, blackbar_inception_annotations
>>> x = inception_list_projects()
>>> anno = blackbar_inception_annotations(['testuzb'])
>>> anno = blackbar_inception_annotations(['testuzb'], annotation_state = ["COMPLETE", "IN-PROGRESS"])
>>> anno = blackbar_inception_annotations(['testuzb'], annotation_state = ["COMPLETE", "IN-PROGRESS"], encoding = "utf-8")
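Conceptually, restricting on annotation_state is a row filter. The sketch below illustrates that filter with plain dicts; the actual function works on the pandas dataframes returned by the API helpers, and the rows shown are made-up sample data:

```python
# Simplified stand-in for filtering annotations on their state.
rows = [
    {"document_id": 88, "user_name": "jan", "annotation_state": "COMPLETE"},
    {"document_id": 89, "user_name": "jan", "annotation_state": "NEW"},
    {"document_id": 90, "user_name": "an",  "annotation_state": "IN-PROGRESS"},
]
keep = ["COMPLETE", "IN-PROGRESS"]
selected = [r for r in rows if r["annotation_state"] in keep]
```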
blackbar.blackbar.blackbar_inception_read_export(path, type='curation', entity_type='custom.Span', entity_type_attr='label')
Read an exported zip file from Inception containing XMI annotations/curations of a named entity project
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | str | A str with the path to a zip file containing the annotations and curations | required |
type | str | A str indicating to read in either the 'curation' or the 'annotation'. Defaults to 'curation'. | 'curation' |
entity_type | str | the Cas type where the span is stored in. Defaults to "custom.Span" | 'custom.Span' |
entity_type_attr | str | the Cas attribute where the content of the span is stored in. Defaults to "label" | 'label' |

Returns:

Type | Description |
---|---|
dict | A dictionary with elements 'docs', 'anno' and 'entities' |
Examples:
>>> from blackbar import inception_list_projects, inception_export, blackbar_inception_read_export
>>> x = inception_list_projects()
>>> #project_id = x['project_id'][x['project_name'] == 'deid-sample202112']
>>> project_id = x['project_id'][x['project_name'] == 'getuigenissen']
>>> project_id = x['project_id'][x['project_name'] == 'testuzb']
>>> path = inception_export(id = project_id)
>>> # path = "radiologie.zip"
>>> # path = "blackbar-example_project_2024-11-11_1626.zip"
>>> d = blackbar_inception_read_export(path, type = "curation")
>>> d = blackbar_inception_read_export(path, type = "annotation")
>>> # d["docs"]
>>> # d["anno"]
>>> # d["entities"]
blackbar.blackbar.line_spans(x)
Split text on newlines, treating each line as a sentence, and identify the locations of these sentences in x
Parameters:

Name | Type | Description | Default |
---|---|---|---|
x | str | a text string | required |

Returns:

Type | Description |
---|---|
DataFrame | A pandas DataFrame with columns sentence_id, start, end, sentence, text, nchar, is_space indicating the location of the sentence in x |
Examples:
>>> from blackbar import line_spans
>>> x = "I want to break free\nGive me more"
>>> locs = line_spans(x)
>>> locs['sentence'][0]
'I want to break free'
>>> locs['sentence'][1]
'Give me more'
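The underlying offset bookkeeping can be sketched with plain Python. This is a simplified stand-in for line_spans that returns only (start, end, line) tuples instead of the full DataFrame:

```python
def line_offsets(text):
    """Locate each newline-separated line as (start, end, line),
    such that text[start:end] == line."""
    spans, start = [], 0
    for line in text.split("\n"):
        end = start + len(line)
        spans.append((start, end, line))
        start = end + 1  # skip over the newline character
    return spans

x = "I want to break free\nGive me more"
spans = line_offsets(x)
```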
blackbar.blackbar.token_spans(doc, detailed=False)
Extract a spacy parsed document as a pandas DataFrame
Parameters:

Name | Type | Description | Default |
---|---|---|---|
doc | Doc | A spacy parsed document | required |
detailed | bool | Boolean indicating to return detailed token information | False |

Returns:

Type | Description |
---|---|
DataFrame | A pandas DataFrame with columns token, term_id, lemma, start, end, upos, xpos, dep, dep_term_id, entity, entity_iob and optionally (if detailed is set to True) columns spaces, is_stop, is_alpha, is_digit, is_punct, is_title, like_email, like_num, like_url, is_sent_end, is_sent_start |
Examples:
>>> from blackbar import token_spans
>>> from rlike import substr
>>> import spacy
>>> text = "I want to break free\nGive me more"
>>> nlp = spacy.load("nl_core_news_md")
>>> doc = nlp(text, disable = ['ner'])
>>> x = token_spans(doc)
>>> doc = nlp(text)
>>> x = token_spans(doc, detailed = True)
>>> list(x["token"]) == substr(text, start = list(x["start"]), end = list(x["end"]))
True
blackbar.blackbar.token_entity_spans(doc, doc_id=None)
Extract the locations of the detected entities of a spacy parsed document as a pandas DataFrame
Parameters:

Name | Type | Description | Default |
---|---|---|---|
doc | Doc | A spacy parsed document | required |
doc_id | str | an identifier of the document which will be put into column doc_id | None |

Returns:

Type | Description |
---|---|
DataFrame | A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length where term_search and similarity will always be empty |
Examples:
>>> from blackbar import token_entity_spans
>>> from blackbar.data import blackbar_example
>>> from rlike import substr
>>> import spacy
>>> nlp = spacy.load("nl_core_news_md")
>>> text = blackbar_example()
>>> doc = nlp(text)
>>> x = token_entity_spans(doc)
>>> x = token_entity_spans(doc, doc_id = "abc-123-xyz")
>>> list(x["term_detected"]) == substr(text, start = list(x["start"]), end = list(x["end"]))
True
blackbar.blackbar.merge_chunkranges(data, text, include_end=False)
Combine overlapping chunk ranges. Overlapping ranges are handled as follows: the ranges are first ordered by their start positions. If the start value of a chunk falls within the previous chunk range (i.e. it is smaller than the previous end value, so they overlap), the previous end value is extended to the new end value and the longest search string (term_search) is kept, as the detected chunk ranges are assumed to come from a Smith-Waterman alignment.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
data | DataFrame | A data frame with columns doc_id, label_, term_search, start, end, similarity indicating the location of detected entities, where the end value is the end position in the text + 1 | required |
text | str | a text string based on which the entities in data were found | required |
include_end | bool | bool indicating to include the end position | False |

Returns:

Type | Description |
---|---|
DataFrame | A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length indicating the merged entity ranges |
Examples:
>>> from pandas import DataFrame
>>> from rlike import substr
>>> from blackbar.data import blackbar_example
>>> text = blackbar_example()
>>> ent = {'doc_id': text, 'similarity': 0, 'term_search': ['A', 'AA', 'B', 'B', 'C', 'CC', 'D', 'DD'], 'start': [10, 10+5, 93, 242, 343 + 4, 343, 384, 384+4], 'end': [19, 19, 98, 247, 343 + 7, 359, 400, 400+2], 'label_': ['Date', 'Date', 'Date', 'Date', 'Name', 'Name', 'Name', 'Name']}
>>> ent = DataFrame(ent)
>>> substr(ent["doc_id"], start = list(ent["start"]), end = list(ent["end"] - 1))
['10/4/2021', '2021', '30/03', '09/04', 'Jan', 'Dr. Jan Janssens', 'Dr. Jan Janssens', 'Jan Janssens |']
>>> chunkranges = merge_chunkranges(ent, text)
>>> list(chunkranges["term_detected"])
['10/4/2021', '30/03', '09/04', 'Dr. Jan Janssens', 'Dr. Jan Janssens |']
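The merging rule described above (sort by start, extend the previous end on overlap) can be sketched for plain (start, end) pairs. This is a simplified illustration only; the blackbar implementation additionally tracks term_search, labels and similarity:

```python
def merge_ranges(ranges):
    """Merge overlapping (start, end) ranges, with end-exclusive ends.
    Ranges are first sorted by start position; a range whose start falls
    before the previous range's end extends that previous range."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start < merged[-1][1]:
            # Overlap: stretch the previous range to the furthest end.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

# Start/end pairs mimicking the Date/Name example above.
result = merge_ranges([(10, 19), (15, 19), (343, 359), (347, 350), (384, 400)])
```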