Frequently Asked Questions

Why is the project named blackbar?

It refers to the black bar placed over people's eyes to anonymize them in an interview, a television broadcast or a comic strip.

How good is the pseudonymization and the anonymization?

In tests at a local hospital, where people were asked to label whether a text had been pseudonymized or not, they could no longer tell a real text from a pseudonymized one.
If you perform anonymization using NLP-only methods, generally around 92% to 98% of the entities are retrieved (recall), with lower recall for addresses because these are the most complex entities.
If you have internal information about the persons involved in the texts (the addresses of patients and doctors, and which doctors are treating which patient), the Smith-Waterman algorithm can be used as well. That increases recall on names and addresses to 98-99%, so that about 1-2% of the names and addresses in the texts go undetected.
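To illustrate why known names and addresses help, here is a minimal sketch of Smith-Waterman local alignment used to find a misspelled name inside a text. It is not the blackbar implementation: the function name and the scoring values (match, mismatch, gap) are assumptions chosen for the example.

```python
def smith_waterman(text, pattern, match=2, mismatch=-1, gap=-1):
    """Best local alignment of `pattern` inside `text`.

    Returns (score, start, end) so that text[start:end] is the aligned span.
    The scoring values are illustrative assumptions, not blackbar defaults.
    """
    rows, cols = len(pattern) + 1, len(text) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            same = pattern[i - 1] == text[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap      # pattern character left unmatched
            left = score[i][j - 1] + gap    # text character left unmatched
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        same = pattern[i - 1] == text[j - 1]
        if score[i][j] == score[i - 1][j - 1] + (match if same else mismatch):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


def similarity(text, pattern, **scoring):
    """Alignment score divided by the score of a perfect match, in [0, 1]."""
    best, _, _ = smith_waterman(text, pattern, **scoring)
    return best / (scoring.get("match", 2) * len(pattern))
```

With a known patient name such as "janssens" as the pattern, a call like smith_waterman(note.lower(), "janssens") returns the span of the closest local match in the note, and similarity() can be compared against a threshold of your choosing to decide whether that span should be treated as the name.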

Do you have an existing model which we can use to anonymize?

Yes, existing models are available upon request to speed up your setup. Contact us here.

Can we run this locally without cloud access?

Yes. Your data does not need to be in the cloud. You can run the software on your premises or in the cloud, whichever you prefer.

Does the server need public internet access?

The server where you run the anonymization/pseudonymization should have access to the Prefect scheduler at api.prefect.io if you use Prefect Cloud as the scheduler.
The server which runs the applications should also have access to the container registries at registry.datatailor.be and ghcr.io/bnosac to obtain the software.
If you build the Docker images yourself instead of pulling them from the above registries, you need access to the git repositories listed in the architecture section.
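Before installing, you may want to verify that the server can reach these hosts. The following is a minimal sketch: the host names come from the answer above, while the use of port 443 (HTTPS), the timeout and the function names are assumptions. A successful TCP connection only shows that the host is reachable, not that your credentials for the registry are valid.

```python
import socket

# Hosts named in this FAQ; port 443 (HTTPS) is an assumption.
REQUIRED_HOSTS = ["api.prefect.io", "registry.datatailor.be", "ghcr.io"]


def can_reach(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def check_hosts(hosts, port=443, timeout=3.0):
    """Map each host to True (reachable) or False (not reachable)."""
    return {host: can_reach(host, port, timeout) for host in hosts}
```

Calling check_hosts(REQUIRED_HOSTS) on the server returns a dictionary which you can print or log. A False value usually points to a firewall or proxy rule that needs to be opened.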

In which database can we store the texts?

We have tested scenarios where the texts are stored in an IRIS database, a SQLite database and a PostgreSQL database. Other databases are possible but have not been tested.

What do the textCvt codes mean?

The status codes keep track of progress. If you have large volumes of data, they indicate what has already been done for each set of texts.

0: anonymization is in progress
1: anonymization is done
2: pseudonymization is done
x: any other code, which you can use for your own tracking
NULL: no anonymization is done / redo anonymization & pseudonymization
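
As an illustration of how these codes can drive batch processing, here is a minimal sketch using SQLite. The table name texts, the columns id and body, and storing textCvt as an integer column are assumptions made for the example. Only the meaning of the codes comes from the list above.

```python
import sqlite3


def make_demo_db():
    """In-memory database with a hypothetical `texts` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE texts (id INTEGER PRIMARY KEY, body TEXT, textCvt INTEGER)"
    )
    conn.executemany(
        "INSERT INTO texts (body) VALUES (?)",
        [("note one",), ("note two",), ("note three",)],
    )
    conn.commit()
    return conn


def claim_pending(conn, limit=100):
    """Set up to `limit` untreated texts (NULL) to 0 (in progress)."""
    rows = conn.execute(
        "SELECT id FROM texts WHERE textCvt IS NULL ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    ids = [row[0] for row in rows]
    mark(conn, ids, 0)
    return ids


def mark(conn, ids, status):
    """Set the status code for the given texts; None resets them to NULL."""
    conn.executemany(
        "UPDATE texts SET textCvt = ? WHERE id = ?", [(status, i) for i in ids]
    )
    conn.commit()


def progress(conn):
    """Count texts per status code; the key None stands for NULL."""
    return dict(
        conn.execute("SELECT textCvt, COUNT(*) FROM texts GROUP BY textCvt")
    )
```

A batch would call claim_pending(), then mark(conn, ids, 1) once anonymization is done and mark(conn, ids, 2) once pseudonymization is done. Calling mark(conn, ids, None) resets texts so that they are processed again.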

Can we have access to the code?

Certainly. Contact us here to request a token. If you want to clone the repositories to look at the code, you can use the token, for example as follows.

git clone https://<username>:${BLACKBAR_GITHUB_PAT}@github.com/bnosac/blackbar-py.git
git clone https://<username>:${BLACKBAR_GITHUB_PAT}@github.com/bnosac/textalignment.git
git clone https://<username>:${BLACKBAR_GITHUB_PAT}@github.com/bnosac/rlike.git
git clone https://<username>:${BLACKBAR_GITHUB_PAT}@github.com/bnosac/blackbar-docker.git

Can we make suggestions for improvements?

Certainly. Contact us here.