For ease of use, we developed a simple full-stack demo application (a Flask-based API back end with a jQuery front end) to analyze job descriptions and predict relevant occupations, skills, and qualifications using the entity-linking model.
First, activate the virtual environment as explained here. Then, from the root directory, run the Flask application:
python app/server/matching.py
Or set the Flask application environment variable and use the Flask command:
export FLASK_APP=app/server/matching.py
flask run --host=0.0.0.0 --port=5001
Open a browser and navigate to http://127.0.0.1:5001/.
Paste a job description into the provided text area.
Click the "Analyze Job" button to send the job description to the /match endpoint.
View the results under "Predicted Occupations," "Predicted Skills," and "Predicted Qualifications."
This app is for demonstration purposes only. If you wish to deploy this model, use a more robust and secure deployment strategy.
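For scripted access, the demo endpoint can also be called directly. The sketch below assumes the /match endpoint accepts a JSON body with a text field; this is an assumption based on the UI described above, so verify the actual contract in app/server/matching.py:

```python
import json
import urllib.request

def build_match_request(job_description: str, url: str = "http://127.0.0.1:5001/match"):
    """Build a POST request for the demo's /match endpoint.
    The {"text": ...} payload shape is an assumption, not the confirmed API."""
    payload = json.dumps({"text": job_description}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )

req = build_match_request("We are looking for a Head Chef who can plan menus.")
print(req.full_url)
```

Sending the request with urllib.request.urlopen(req) would return the predictions, assuming the server from the steps above is running.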
If you encounter any bugs or want to contribute to this project, please use the following templates to open an issue on GitHub.
Bug report
Create a report to help us improve
Describe the bug A clear and concise description of what the bug is.
To Reproduce Steps to reproduce the behavior:
Go to '...'
Click on '....'
Scroll down to '....'
See error
Expected behavior A clear and concise description of what you expected to happen.
Screenshots If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
OS: [e.g. iOS]
Browser [e.g. chrome, safari]
Version [e.g. 22]
Smartphone (please complete the following information):
Device: [e.g. iPhone6]
OS: [e.g. iOS8.1]
Browser [e.g. stock browser, safari]
Version [e.g. 22]
Additional context Add any other context about the problem here.
Feature request
Suggest an idea for this project
Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
Describe the solution you'd like A clear and concise description of what you want to happen.
Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.
Additional context Add any other context or screenshots about the feature request here.
The Tabiya Livelihoods Classifier provides an easy-to-use implementation of the entity-linking paradigm to support job description heuristics.
Using state-of-the-art transformer neural networks, this tool can extract five entity types: Occupation, Skill, Qualification, Experience, and Domain. For Occupations and Skills, ESCO-related entries are retrieved. The procedure consists of two discrete steps: entity extraction and similarity vector search.
The tool is intended for specialists in workforce analytics, recruitment technologies, and human capital management, as well as researchers focused on labor markets and job data. It caters to organizations and professionals who deal with analyzing, categorizing, and optimizing job descriptions or resumes at scale. Ideal users include those in HR technology, career advisory services, and data-driven talent solutions, particularly those seeking to enhance precision in identifying roles, skills, and qualifications for improved decision-making in hiring or workforce planning. Technical users interested in integrating standardized frameworks into their analysis will also find this tool highly relevant.
Prerequisites
A recent version of git (e.g. ^2.37)
Note: To install Poetry, consult the Poetry documentation. Install Poetry system-wide (not in a virtualenv).
This tool uses Git LFS for handling large files. Before using it, you need to install and set up Git LFS on your local machine. See https://git-lfs.com/ for installation instructions.
After Git LFS is set up, follow these steps to clone the repository:
git clone https://github.com/tabiya-tech/tabiya-livelihoods-classifier.git
If you already cloned the repository without Git LFS, run:
git lfs pull
Set up virtualenv
In the root directory of the backend project (so, the same directory as this README file), run the following commands:
# create a virtual environment
python3 -m venv venv
# activate the virtual environment
source venv/bin/activate
# Use the version of the dependencies specified in the lock file
poetry lock --no-update
# Install missing and remove unreferenced packages
poetry install --sync
Note: Install the dependencies for the training using:
# Use the version of the dependencies specified in the lock file
poetry lock --no-update
# Install missing and remove unreferenced packages
poetry install --sync --with train
Note: Before running any tasks, activate the virtual environment so that the installed dependencies are available:
# activate the virtual environment
source venv/bin/activate
To deactivate the virtual environment, run:
# deactivate the virtual environment
deactivate
Start Python and download the NLTK punkt package, which is required by the sentence tokenizer. You only need to download punkt once.
python <<EOF
import nltk
nltk.download('punkt')
EOF
The tool uses the following environment variable:
HF_TOKEN: To use the project, you need access to the HuggingFace 🤗 entity extraction model. Contact the administrators via [[email protected]]. From there, you must create a read access token to use the model. Find or create your read access token here. The backend supports the use of a .env file to set the environment variable. Create a .env file in the root directory of the backend project and set the environment variable as follows:
# .env file
HF_TOKEN=<YOUR_HF_TOKEN>
ATTENTION: The .env file should be kept secure and not shared with others as it contains sensitive information.
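How the .env file is read depends on the project's tooling (commonly the python-dotenv package); a minimal stdlib sketch of the same idea, for illustration only:

```python
import os

def load_dotenv_minimal(path: str = ".env") -> None:
    """Parse KEY=VALUE lines from a .env file into os.environ.
    A sketch: real projects typically use python-dotenv instead."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and malformed lines.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Do not overwrite variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())

# After loading, the token is available to the HuggingFace client:
# load_dotenv_minimal()
# token = os.environ.get("HF_TOKEN")
```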
The inference pipeline extracts occupations and skills from a job description and matches them to the most similar entities in the ESCO taxonomy.
First, activate the virtual environment as explained here.
Then, start a Python interpreter in the root directory and run the following commands:
Load the EntityLinker class, create an instance of the class, and perform inference on any text with the following code:
from inference.linker import EntityLinker
pipeline = EntityLinker(k=5)
text = 'We are looking for a Head Chef who can plan menus.'
extracted = pipeline(text)
print(extracted)
After running the commands above, you should see the following output:
[
{'type': 'Occupation', 'tokens': 'Head Chef', 'retrieved': ['head chef', 'industrial head chef', 'head pastry chef', 'chef', 'kitchen chef']},
{'type': 'Skill', 'tokens': 'plan menus', 'retrieved': ['plan menus', 'plan patient menus', 'present menus', 'plan schedule', 'plan engineering activities']}
]
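The returned value is a plain list of dictionaries, so standard Python is enough to post-process it. For example, grouping the top retrieved ESCO entries by entity type, using the sample output above as data:

```python
from collections import defaultdict

# Sample output from the pipeline above, used here as plain data.
extracted = [
    {'type': 'Occupation', 'tokens': 'Head Chef',
     'retrieved': ['head chef', 'industrial head chef', 'head pastry chef', 'chef', 'kitchen chef']},
    {'type': 'Skill', 'tokens': 'plan menus',
     'retrieved': ['plan menus', 'plan patient menus', 'present menus', 'plan schedule', 'plan engineering activities']},
]

def top_matches(entities, k=1):
    """Return the top-k retrieved labels, grouped by entity type."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity['type']].extend(entity['retrieved'][:k])
    return dict(grouped)

print(top_matches(extracted))
# → {'Occupation': ['head chef'], 'Skill': ['plan menus']}
```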
You can use the French version of the Entity Linker with the following code:
from inference.linker import FrenchEntityLinker
pipeline = FrenchEntityLinker(entity_model='tabiya/camembert-large-job-ner', similarity_model='intfloat/multilingual-e5-base')
text = 'Nous recherchons un chef de cuisine capable de planifier les menus.'
extracted = pipeline(text)
print(extracted)
You should see the following output:
[
{'type': 'Occupation', 'tokens': 'chef de cuisine', 'retrieved': ['chef de cuisine', 'chef de marque', 'chef mécanicien', 'chef cuisinier/cheffe cuisinière', 'chef de train']},
{'type': 'Skill', 'tokens': 'planifier les menus', 'retrieved': ['planifier les menus', 'présenter des menus', 'établir les menus des patients', 'préparer des plannings', 'préparer des plats préparés']}
]
Load the Evaluator class and print the results:
from inference.evaluator import Evaluator
results = Evaluator(entity_type='Skill', entity_model='tabiya/roberta-base-job-ner', similarity_model='all-MiniLM-L6-v2', crf=False, evaluation_mode=True)
print(results.output)
This class inherits from the EntityLinker, with the main difference being the entity_type flag.
Minimum requirements: 4 GB of CPU/GPU RAM.
The code runs on GPU if available. Ensure your machine has CUDA installed if running on GPU.
This page gives further details about the classes and functions in the GitHub repository.
inference/linker.py
Creates a pipeline of an entity recognition transformer and a sentence transformer for embedding text.
entity_model : str, default='tabiya/roberta-base-job-ner'
Path to a pre-trained AutoModelForTokenClassification model or an AutoModelCrfForNer model. This model is used for entity recognition within the input text.
similarity_model : str, default='all-MiniLM-L6-v2'
Path or name of a sentence transformer model used for embedding text. The sentence transformer computes embeddings for the extracted entities and the reference sets. The model 'all-mpnet-base-v2' is available but not cached, so it should be used with the parameter from_cache=False at least the first time.
crf : bool, default=False
A flag indicating whether to use an AutoModelCrfForNer model instead of a standard AutoModelForTokenClassification. CRF (Conditional Random Field) models are used when the task requires sequential predictions with dependencies between the outputs.
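The benefit of the CRF layer is that each tag decision depends on the neighboring tags (for example, an I-Skill tag should not directly follow O). A toy Viterbi decode illustrates this sequential dependency; the tags follow the project's BIO scheme, but all scores here are made up and unrelated to the actual models:

```python
def viterbi(emissions, transitions, tags):
    """Find the highest-scoring tag sequence. emissions[i][t] scores tag t
    at step i; transitions[(a, b)] scores tag b following tag a."""
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (emissions[0][t], [t]) for t in tags}
    for step in emissions[1:]:
        best = {
            t: max(
                ((score + transitions[(prev, t)] + step[t], path + [t])
                 for prev, (score, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in tags
        }
    return max(best.values(), key=lambda x: x[0])[1]

tags = ["O", "B-Skill", "I-Skill"]
# Toy transition scores: strongly penalize I-Skill directly after O,
# and reward I-Skill after B-Skill.
transitions = {(a, b): 0.0 for a in tags for b in tags}
transitions[("O", "I-Skill")] = -10.0
transitions[("B-Skill", "I-Skill")] = 1.0
emissions = [{"O": 1.0, "B-Skill": 0.2, "I-Skill": 0.0},
             {"O": 0.1, "B-Skill": 0.8, "I-Skill": 0.0},
             {"O": 0.1, "B-Skill": 0.0, "I-Skill": 0.6}]
print(viterbi(emissions, transitions, tags))
# → ['O', 'B-Skill', 'I-Skill']
```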
evaluation_mode : bool, default=False
If set to True, the linker will return the cosine similarity scores between the embeddings. This mode is useful for evaluating the quality of the linkages.
k : int, default=32 Specifies the number of items to retrieve from the reference sets. This parameter limits the number of top matches to consider when linking entities.
from_cache : bool, default=True
If set to True, precomputed embeddings are loaded from cache to save time. If set to False, the embeddings are computed on the fly, which requires GPU access for efficiency and can be time-consuming.
output_format : str, default='occupation'
Specifies the output format for occupations: occupation, preferred_label, esco_code, uuid, or all to get all the columns. The uuid option is also available for skills.
text : str
An arbitrary job vacancy-related string.
linking : bool, default=True
Specifies whether the model performs entity linking to the taxonomy.
French version of the entity linker. To use it, the reference databases must be replaced with the French version of ESCO.
inference/evaluator.py
Evaluator class that inherits from the EntityLinker. It computes the queries, corpus, inverted corpus, and relevant documents for the InformationRetrievalEvaluator, performs entity linking, and computes the Information Retrieval metrics.
entity_type: str Occupation, Skill, or Qualification to determine the exact evaluation set to be used.
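The Information Retrieval metrics can be illustrated independently of the library. Below is a minimal precision@k and reciprocal rank, assuming nothing about the Evaluator's internals; the example lists are toy data:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in retrieved[:k] if item in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ['plan menus', 'plan patient menus', 'present menus']
relevant = {'plan menus'}
print(precision_at_k(retrieved, relevant, k=1), reciprocal_rank(retrieved, relevant))
# → 1.0 1.0
```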
util/transformersCRF.py
Implemented from here.
A class that creates a linear Conditional Random Field model.
Configuration class that inherits from PretrainedConfig HuggingFace class.
A general class that inherits from PreTrainedModel HuggingFace class. The model_type is detected automatically.
model_type: str
Possible options include BertCrfForNer, RobertaCrfForNer, and DebertaCrfForNer.
Custom class used for configuring BERT for CRF.
BERT-based CRF model that inherits from PreTrainedModel HuggingFace class.
Same as PreTrainedModel HuggingFace.
Same as PreTrainedModel HuggingFace, except for special_tokens_mask, default: None. We use this option from HuggingFace as a small hack to implement the special mask needed for CRF.
Custom class used for configuring RoBERTa for CRF.
RoBERTa-based CRF model that inherits from PreTrainedModel HuggingFace class.
Same as PreTrainedModel HuggingFace.
Same as PreTrainedModel HuggingFace, except for special_tokens_mask, default: None. We use this option from HuggingFace as a small hack to implement the special mask needed for CRF.
Custom class used for configuring DeBERTa for CRF.
DeBERTa-based CRF model that inherits from the PreTrainedModel HuggingFace class.
Same as PreTrainedModel HuggingFace.
Same as PreTrainedModel HuggingFace, except for special_tokens_mask, default: None. We use this option from HuggingFace as a small hack to implement the special mask needed for CRF.
util/utilfunctions.py
Configuration class for the training hyperparameters.
A class that loads tensors onto the CPU.
Train your entity extraction model using PyTorch.
First, activate the virtual environment as explained here.
Configure the necessary hyperparameters in the config.json file. The defaults are:
{
"model_name": "bert-base-cased",
"crf": false,
"dataset_path": "tabiya/job_ner_dataset",
"label_list": ["O", "B-Skill", "B-Qualification", "I-Domain", "I-Experience", "I-Qualification", "B-Occupation", "B-Domain", "I-Occupation", "I-Skill", "B-Experience"],
"model_max_length": 128,
"batch_size": 32,
"learning_rate": 1e-4,
"epochs": 4,
"weight_decay": 0.01,
"save": false,
"output_path": "bert_job_ner"
}
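The training script reads these values from config.json. The sketch below only shows the generic pattern of loading and sanity-checking such a file; the key names come from the defaults above, while the required-key selection is an illustrative assumption:

```python
import json

def load_train_config(path="config.json"):
    """Load training hyperparameters and check a few required keys.
    The set of required keys here is an assumption for illustration."""
    with open(path) as fh:
        config = json.load(fh)
    for key in ("model_name", "label_list", "batch_size", "epochs"):
        if key not in config:
            raise KeyError(f"config.json is missing required key: {key}")
    return config
```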
To train the model, run the following script in the train directory:
python train.py
The training script is based on the official HuggingFace token classification tutorial.
Configure the necessary hyperparameters in the sbert_train function in the sbert_train.py file:
sbert_train(model_id='all-MiniLM-L6-v2', dataset_path='your/dataset/path', output_path='your/output/path')
To train the similarity model, run the following script in the train directory:
python sbert_train.py
The dataset should be formatted as a CSV file with two columns, such as 'title' and 'esco_label', where each row contains a pair of related textual data points to be used during the training process. Make sure there are no missing values in your dataset to ensure successful training of the model. Here's an example of how your CSV file might look:
| title | esco_label |
| --- | --- |
| Senior Conflict Manager | public institution director |
| etc | etc |
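Since missing values break training, it may help to check the CSV beforehand. A small stdlib sketch, where the column names match the example above:

```python
import csv
import io

def find_missing(csv_text, columns=("title", "esco_label")):
    """Return (row_number, column) pairs where a required cell is empty."""
    problems = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        for col in columns:
            if not (row.get(col) or "").strip():
                problems.append((i, col))
    return problems

sample = ("title,esco_label\n"
          "Senior Conflict Manager,public institution director\n"
          "Data Analyst,\n")
print(find_missing(sample))
# → [(2, 'esco_label')]
```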
More information can be found here.
1. What is the Tabiya Livelihoods Classifier? The Tabiya Livelihoods Classifier is a tool that leverages advanced transformer-based neural networks to extract and categorize key entities from job descriptions. It supports tasks like occupation and skill classification using frameworks like ESCO.
2. Who can benefit from using this tool? It is designed for HR professionals, recruiters, career advisors, labor market researchers, and developers working on job-matching technologies or workforce analytics.
3. What types of entities can the tool extract? The classifier identifies and categorizes five entity types: Occupation, Skill, Qualification, Experience, and Domain.
4. Is this tool compatible with any specific standards or frameworks? Yes, it retrieves ESCO-related entries for Occupations and Skills, aligning with widely used European job classification systems. With minimal work, other taxonomies, such as O*NET, could be integrated.
5. How does the Tabiya Livelihoods Classifier work? The process involves two main steps:
Entity Extraction: Identifies relevant entities in job descriptions.
Similarity Vector Search: Matches extracted entities to entries in pre-defined frameworks or datasets.
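The second step can be illustrated with plain cosine similarity over embedding vectors: embed the extracted entity and every reference entry, then keep the nearest neighbors. The vectors below are toy values; the tool itself uses sentence-transformer embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k(query_vec, reference, k=2):
    """reference: mapping of label -> embedding. Returns the k nearest labels."""
    ranked = sorted(reference, key=lambda label: cosine(query_vec, reference[label]),
                    reverse=True)
    return ranked[:k]

# Toy 2-dimensional "embeddings" for three reference labels.
reference = {"head chef": [1.0, 0.1], "kitchen chef": [0.9, 0.3], "data analyst": [0.0, 1.0]}
print(top_k([1.0, 0.2], reference, k=2))
# → ['head chef', 'kitchen chef']
```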
6. Does the tool use machine learning models? Yes, it utilizes transformer-based models, which represent the state-of-the-art in natural language processing.
7. Can I customize the classifier for specific industries or datasets? The tool supports customization, allowing users to adapt the similarity search or integrate custom datasets to suit specific domains and use cases.
8. What is the difference between entity extraction and similarity vector search? Entity extraction identifies relevant entities from text, such as a job title or skill. Similarity vector search then matches these entities to related entries in a knowledge base, like ESCO, for standardization.
9. How do I install and use the classifier? Detailed installation and setup instructions are available in the user guide.
10. Can the classifier be integrated into existing HR systems? Yes, it is designed to be easily integrated into workflows or systems through APIs or library functions.
11. Are there any prerequisites for using this tool? A working knowledge of Python is recommended for setup and integration. Familiarity with natural language processing concepts is beneficial but not mandatory.
12. How accurate is the entity classification? The classifier achieves state-of-the-art entity recognition results based on the dataset released by Green et al. However, as with any machine learning model, the Entity Linker is not perfect. If you encounter bugs or inappropriate use cases, please open an issue on GitHub!
13. Does the tool handle multilingual job descriptions? We are currently working on expanding the tool's capabilities to more languages. At the moment, the tool supports only English and French.
14. Are there limitations on the size of input text? The model uses the NLTK sentence tokenizer function to handle large texts, so theoretically, there is no limit to the input text size. In the current version, the BERT-based models used for entity extraction have a limit of 128 tokens (roughly 100 words). You can use the training script to retrain the model to fit your needs.
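The sentence-first chunking idea from the answer above can be sketched as follows; a naive period-based splitter and whitespace token counts stand in for NLTK and the model tokenizer here:

```python
def chunk_sentences(text, max_tokens=128):
    """Split text into sentences and flag whether each sentence's naive
    whitespace token count fits within the model limit."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [(s, len(s.split()) <= max_tokens) for s in sentences]

text = "We are hiring. The role requires planning menus."
print(chunk_sentences(text))
```

Each sentence that fits the limit can then be passed to the entity extraction model independently.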
15. Can I contribute to or extend the tool? Yes, developers are welcome to customize and extend the tool. Refer to the contributing guide in the documentation for guidelines.
16. Where can I find support or report issues? Support is available through the official repository or customer service channels. Issues can be reported on the GitHub issues page or via email.
17. Are updates and new features planned? Yes, the tool is actively maintained, with plans for additional features and improved integrations based on user feedback.
Location: inference/files/occupations_augmented.csv
Source:
Description: ESCO (European Skills, Competences, Qualifications and Occupations) is the European multilingual classification of Skills, Competences, and Occupations. This dataset includes information relevant to the occupations.
License: Creative Commons Attribution 4.0 International; see DATA_LICENSE for details.
Modifications: The columns retained are alt_label, preferred_label, esco_code, and uuid. Each alternative label has been separated into individual rows.
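The modification described above (one row per alternative label) can be sketched with the csv module. The column names follow the list above; the '|' separator between alternative labels is an assumption for illustration:

```python
import csv
import io

def explode_alt_labels(csv_text, sep="|"):
    """Emit one row per alternative label. The separator between
    alt labels is an assumption, not the documented format."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for alt in row["alt_label"].split(sep):
            rows.append({**row, "alt_label": alt.strip()})
    return rows

# Toy input with two alternative labels packed into one cell.
sample = "alt_label,preferred_label,esco_code\nchef|cook,head chef,0000.0\n"
print(explode_alt_labels(sample))
```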
Location: inference/files/skills.csv
Source:
Description: ESCO (European Skills, Competences, Qualifications and Occupations) is the European multilingual classification of Skills, Competences and Occupations. This dataset includes information relevant to the skills.
License: Creative Commons Attribution 4.0 International; see Data License for details.
Modifications: The columns retained are preferred_label and uuid.
Location: inference/files/qualifications.csv
Source:
Description: This dataset contains EQF (European Qualifications Framework) relevant information extracted from the official EQF comparison website. It includes data strings, country information, and EQF levels. Non-English text was ignored.
License: Please refer to the original source.
Modifications: Non-English text was removed, and the remaining information was formatted into a structured database.
For the French version of the tool, we use the French version of ESCO v1.1.1, as well as, a translation of the qualifications, using the Google Translation API.
Location:
Source:
Description: This dataset provides a comprehensive benchmark suite for Entity Recognition (ER) in job descriptions. Developed to fill the significant gap in resources for extracting key entities like skills from job descriptions, the dataset features 18.6k annotated entities across five categories: Skill, Qualification, Experience, Occupation, and Domain.
License: CC-BY-NC-4.0
Modifications: No modifications were made to the original dataset. It was only converted to HuggingFace format.
Location: TBD
Source:
Description:
The hahu_test.csv file is the original file provided by Hahu Jobs with the following fields:
title: The title of the job position, indicating the specific role and/or position within the organization.
esco_label: The preferred or alternative label provided by ESCO, matching the corresponding ESCO code.
esco_code: The ESCO code associated with the job, facilitating standardized classification and comparison across different job listings.
License: CC-BY-NC-4.0
Modifications: Extracted Occupation title and relevant ESCO code and matched with preferred and alternative labels.
Location: inference/files/eval/redacted_hahu_test_with_id.csv
Source:
Description: This dataset consists of 542 entries chosen at random from the 11 general classes of the Ethiopian Hahu Jobs platform's classification system; 50 entries were selected from each class to create the final dataset.
License: Creative Commons Attribution 4.0 International; see Data License for details.
Modifications: No modifications were made to the selected entries.
Location:
inference/files/eval/house_test_annotations.csv
inference/files/eval/house_validation_annotations.csv
inference/files/eval/tech_test_annotations.csv
inference/files/eval/tech_validation_annotations.csv
Source: Provided by
Description: The dataset includes the HOUSE and TECH extensions of the SkillSpan Dataset. In the original work by Decorte et al., the test and development entities of the SkillSpan Dataset were annotated into the ESCO model.
License: MIT, Please refer to the original source.
Modifications: The datasets were used as provided without further modifications.
Location: inference/files/eval/qualification_mapping.csv
Source: Extended from the Qualifications
Description: This dataset maps the Green Benchmark Qualifications to the appropriate EQF levels. Two annotators tagged the qualifications, resulting in a Cohen's Kappa agreement of 0.45, indicating moderate agreement.
License: Creative Commons Attribution 4.0 International; see Data License for details.
Modifications: Extended the dataset to include EQF level mappings; the annotations were verified by two annotators.
To use these datasets, ensure you comply with the original datasets' licenses and terms of use. Any modifications you make should be documented and appropriately attributed in your project.