
Web Application

For ease of use, we developed a simple full-stack application (a Flask-based API back end and a jQuery front end) to analyze job descriptions and predict relevant occupations, skills, and qualifications using the entity-linking model.

Usage

First, activate the virtual environment as explained here. Then run the following commands from the root directory.

Running the API

Run the Flask application:

python app/server/matching.py

Or set the Flask application environment variable and use the Flask command:

export FLASK_APP=app/server/matching.py
flask run --host=0.0.0.0 --port=5001

Example Usage

  1. Open the browser and navigate to http://127.0.0.1:5001/.

  2. Paste a job description into the provided text area.

  3. Click the "Analyze Job" button to send the job description to the /match endpoint.

  4. View the results under "Predicted Occupations," "Predicted Skills," and "Predicted Qualifications."
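The same call the "Analyze Job" button makes can be issued programmatically. Below is a minimal sketch using only the standard library; note that the JSON field name `text` is an assumption, so inspect the request the front end actually sends (browser dev tools) to confirm the payload shape:

```python
import json
import urllib.request

def build_match_request(base_url: str, job_description: str) -> urllib.request.Request:
    """Build a POST request for the /match endpoint.

    NOTE: the JSON field name 'text' is an assumption; check the
    payload the front end sends for the real shape.
    """
    payload = json.dumps({"text": job_description}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/match",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_match_request("http://127.0.0.1:5001", "We are looking for a Head Chef.")
# urllib.request.urlopen(req) would send it once the Flask app is running.
print(req.full_url)
```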

This app is for demonstration purposes only. If you wish to deploy the model, use a more reliable and secure serving strategy.

Contributing Guide

If you encounter any bugs or want to contribute to this project, please use the following templates to open an issue on GitHub.

Bug report template

name: Bug report
about: Create a report to help us improve
title:
labels:
assignees: ApostolosBenisis

Describe the bug: A clear and concise description of what the bug is.

To Reproduce: Steps to reproduce the behavior:

  1. Go to '...'

  2. Click on '....'

  3. Scroll down to '....'

  4. See error

Expected behavior: A clear and concise description of what you expected to happen.

Screenshots: If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. iOS]

  • Browser: [e.g. Chrome, Safari]

  • Version: [e.g. 22]

Smartphone (please complete the following information):

  • Device: [e.g. iPhone 6]

  • OS: [e.g. iOS 8.1]

  • Browser: [e.g. stock browser, Safari]

  • Version: [e.g. 22]

Additional context: Add any other context about the problem here.

Feature request template

name: Feature request
about: Suggest an idea for this project
title:
labels:
assignees:

Is your feature request related to a problem? Please describe: A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like: A clear and concise description of what you want to happen.

Describe alternatives you've considered: A clear and concise description of any alternative solutions or features you've considered.

Additional context: Add any other context or screenshots about the feature request here.

Livelihoods Classifier

The Tabiya Livelihoods Classifier provides an easy-to-use implementation of the entity-linking paradigm to support job description heuristics.

Using state-of-the-art transformer neural networks, this tool can extract five entity types: Occupation, Skill, Qualification, Experience, and Domain. For Occupations and Skills, ESCO-related entries are retrieved. The procedure consists of two discrete steps: entity extraction and similarity vector search.
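On toy data, the two steps can be sketched as follows (the vectors and reference entries below are illustrative stand-ins, not real model embeddings):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Step 1 (entity extraction) would yield spans such as "Head Chef";
# here we assume an embedding for that span has already been computed.
extracted_embedding = [0.9, 0.1, 0.2]

# Step 2 (similarity vector search): rank reference entries by cosine similarity.
reference = {
    "head chef": [0.88, 0.12, 0.18],
    "data analyst": [0.05, 0.9, 0.3],
    "kitchen chef": [0.8, 0.2, 0.25],
}
ranked = sorted(reference, key=lambda k: cosine(extracted_embedding, reference[k]), reverse=True)
print(ranked[0])  # the closest reference entry: "head chef"
```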

Model's Architecture

Target Audience

The tool is intended for specialists in workforce analytics, recruitment technologies, and human capital management, as well as researchers focused on labor markets and job data. It caters to organizations and professionals who deal with analyzing, categorizing, and optimizing job descriptions or resumes at scale. Ideal users include those in HR technology, career advisory services, and data-driven talent solutions, particularly those seeking to enhance precision in identifying roles, skills, and qualifications for improved decision-making in hiring or workforce planning. Technical users interested in integrating standardized frameworks into their analysis will also find this tool highly relevant.

You can find all related code on Tabiya's GitHub page.

Job Entity Linking Pipeline

Getting Started

Installation

Prerequisites

  • A recent version of git (e.g. ^2.37)

  • Python 3.10 or higher

  • Poetry 1.8 or higher

    Note: to install Poetry consult the Poetry documentation

    Note: Install poetry system-wide (not in a virtualenv).

  • Git LFS

Using Git LFS

This tool uses Git LFS for handling large files. Before using it, you need to install and set up Git LFS on your local machine. See https://git-lfs.com/ for installation instructions.

After Git LFS is set up, follow these steps to clone the repository:

git clone https://github.com/tabiya-tech/tabiya-livelihoods-classifier.git

If you already cloned the repository without Git LFS, run:

git lfs pull

Install the dependencies

Set up virtualenv

In the root directory of the backend project (so, the same directory as this README file), run the following commands:

# create a virtual environment
python3 -m venv venv

# activate the virtual environment
source venv/bin/activate
# Use the version of the dependencies specified in the lock file
poetry lock --no-update
# Install missing and remove unreferenced packages
poetry install --sync

Note: Install the dependencies for the training using:

# Use the version of the dependencies specified in the lock file
poetry lock --no-update
# Install missing and remove unreferenced packages
poetry install --sync --with train

Note: Before running any tasks, activate the virtual environment so that the installed dependencies are available:

# activate the virtual environment
source venv/bin/activate

To deactivate the virtual environment, run:

# deactivate the virtual environment
deactivate

Start Python and download the NLTK punkt package, which the sentence tokenizer requires. You only need to download punkt once.

python <<EOF
import nltk
nltk.download('punkt')
EOF

Environment Variable & Configuration

The tool uses the following environment variable:

  • HF_TOKEN: To use the project, you need access to the HuggingFace 🤗 entity extraction model. Contact the administrators via [[email protected]] to request access. You must then create a read access token to use the model; find or create your read access token here. The backend supports a .env file for setting the environment variable. Create a .env file in the root directory of the backend project and set the variable as follows:

# .env file
HF_TOKEN=<YOUR_HF_TOKEN>

ATTENTION: The .env file should be kept secure and not shared with others as it contains sensitive information.
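If you prefer not to add a dependency, here is a minimal sketch of loading the variable from a .env file with the standard library (the python-dotenv package does this more robustly and is the usual choice):

```python
import os

def load_dotenv(path: str = ".env") -> None:
    """Minimal .env loader (sketch): skips comments and blank lines,
    and never overwrites variables already set in the environment."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_dotenv()
token = os.environ.get("HF_TOKEN")  # None if the variable was never set
```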

QuickStart Guide

Inference Pipeline

The inference pipeline extracts occupations and skills from a job description and matches them to the most similar entities in the ESCO taxonomy.

Usage

First, activate the virtual environment as explained here.

Then, start a Python interpreter in the root directory and run the following commands:

Load the EntityLinker class and create an instance of the class, then perform inference on any text with the following code:

from inference.linker import EntityLinker
pipeline = EntityLinker(k=5)
text = 'We are looking for a Head Chef who can plan menus.'
extracted = pipeline(text)
print(extracted)

After running the commands above, you should see the following output:

[
  {'type': 'Occupation', 'tokens': 'Head Chef', 'retrieved': ['head chef', 'industrial head chef', 'head pastry chef', 'chef', 'kitchen chef']},
  {'type': 'Skill', 'tokens': 'plan menus', 'retrieved': ['plan menus', 'plan patient menus', 'present menus', 'plan schedule', 'plan engineering activities']}
]
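The returned list can be reshaped for downstream use; for example, grouping the top-ranked ESCO match per span by entity type (using the sample output above):

```python
# `extracted` as returned above (copied from the sample output):
extracted = [
    {'type': 'Occupation', 'tokens': 'Head Chef', 'retrieved': ['head chef', 'industrial head chef', 'head pastry chef', 'chef', 'kitchen chef']},
    {'type': 'Skill', 'tokens': 'plan menus', 'retrieved': ['plan menus', 'plan patient menus', 'present menus', 'plan schedule', 'plan engineering activities']},
]

# Group the top-1 ESCO match per extracted span by entity type.
by_type: dict[str, list[tuple[str, str]]] = {}
for entity in extracted:
    by_type.setdefault(entity['type'], []).append((entity['tokens'], entity['retrieved'][0]))

print(by_type['Occupation'])  # [('Head Chef', 'head chef')]
```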

French version

You can use the French version of the Entity Linker using the following code:

from inference.linker import FrenchEntityLinker
pipeline = FrenchEntityLinker(entity_model = 'tabiya/camembert-large-job-ner', similarity_model = 'intfloat/multilingual-e5-base')

text = 'Nous recherchons un chef de cuisine capable de planifier les menus.'
extracted = pipeline(text)
print(extracted)

You should see the following output:

[
  {'type': 'Occupation', 'tokens': 'chef de cuisine', 'retrieved': ['chef de cuisine', 'chef de marque', 'chef mécanicien', 'chef cuisinier/cheffe cuisinière', 'chef de train']}, 
  {'type': 'Skill', 'tokens': 'planifier les menus', 'retrieved': ['planifier les menus', 'présenter des menus', 'établir les menus des patients', 'préparer des plannings', 'préparer des plats préparés']}
]

Running the evaluation tests

Load the Evaluator class and print the results:

from inference.evaluator import Evaluator

results = Evaluator(entity_type='Skill', entity_model='tabiya/roberta-base-job-ner', similarity_model='all-MiniLM-L6-v2', crf=False, evaluation_mode=True)
print(results.output)

This class inherits from the EntityLinker, with the main difference being the 'entity_type' flag.

If you want to run evaluations on custom datasets, you will need to modify the _load_dataset function, located in the evaluation.py file. Please refer to the original evaluation datasets as described here. If you have any trouble, please open an issue on GitHub.

Minimum Hardware

  • 4 GB CPU/GPU RAM

The code runs on GPU if available. Ensure your machine has CUDA installed if running on GPU.

Advanced Topics

This page gives further details about the classes and functions in the GitHub repository.

inference/linker.py

class EntityLinker

Creates a pipeline of an entity recognition transformer and a sentence transformer for embedding text.

Initialization Parameters

entity_model : str, default='tabiya/roberta-base-job-ner' Path to a pre-trained AutoModelForTokenClassification model or an AutoModelCrfForNer model. This model is used for entity recognition within the input text.

similarity_model : str, default='all-MiniLM-L6-v2' Path or name of a sentence transformer model used for embedding text. The sentence transformer is used to compute embeddings for the extracted entities and the reference sets. The model 'all-mpnet-base-v2' is available but not in cache, so it should be used with the parameter from_cache=False at least the first time.

crf : bool, default=False A flag to indicate whether to use an AutoModelCrfForNer model instead of a standard AutoModelForTokenClassification. CRF (Conditional Random Field) models are used when the task requires sequential predictions with dependencies between the outputs.

evaluation_mode : bool, default=False If set to True, the linker will return the cosine similarity scores between the embeddings. This mode is useful for evaluating the quality of the linkages.

k : int, default=32 Specifies the number of items to retrieve from the reference sets. This parameter limits the number of top matches to consider when linking entities.

from_cache : bool, default=True If set to True, the precomputed embeddings are loaded from cache to save time. If set to False, the embeddings are computed on-the-fly, which requires GPU access for efficiency and can be time-consuming.

output_format : str, default='occupation' Specifies the output format for occupations: either occupation, preferred_label, esco_code, uuid, or all to get all the columns. The uuid format is also available for skills.

Calling Parameters

text : str An arbitrary job vacancy-related string.

linking : bool, default=True Specify whether the model performs the entity linking to the taxonomy.

class FrenchEntityLinker

French version of the entity linker. To support it, the reference databases are rewritten using the French version of ESCO.

inference/evaluator.py

class Evaluator(EntityLinker)

Evaluator class that inherits the Entity Linker. It computes the queries, corpus, inverted corpus and relevant docs for the InformationRetrievalEvaluator, performs entity linking and computes the Information Retrieval Metrics.
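The shapes of those structures match what sentence-transformers' InformationRetrievalEvaluator expects: queries and corpus as id-to-text mappings, and relevant docs as a query-id-to-corpus-id-set mapping. A toy illustration (the ids and texts below are made up):

```python
# Extracted entity texts become queries; taxonomy entries form the corpus;
# gold links supply the relevant documents per query.
queries = {"q1": "plan menus"}
corpus = {"c1": "plan menus", "c2": "present menus", "c3": "plan schedule"}
relevant_docs = {"q1": {"c1"}}

# Sanity check: every relevant doc id must exist in the corpus.
assert all(doc in corpus for docs in relevant_docs.values() for doc in docs)
print(len(queries), len(corpus))
```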

Initialization Parameters

entity_type: str Occupation, Skill, or Qualification to determine the exact evaluation set to be used.

util/transformersCRF.py

class CRF(nn.Module)

Implemented from here.

A class that creates a linear Conditional Random Field model.

class AutoModelForCrfPretrainedConfig(PretrainedConfig)

Configuration class that inherits from PretrainedConfig HuggingFace class.

class AutoModelCrfForNer(PreTrainedModel)

A general class that inherits from PreTrainedModel HuggingFace class. The model_type is detected automatically.

model_type: str Possible options include BertCrfForNer, RobertaCrfForNer and DebertaCrfForNer.

class BERT_CRF_Config(PretrainedConfig)

Custom class used for configuring BERT for CRF.

class BertCrfForNer(PreTrainedModel)

BERT-based CRF model that inherits from PreTrainedModel HuggingFace class.

Initialization parameters are the same as for the HuggingFace PreTrainedModel class.

Forward Parameters

Same as the HuggingFace PreTrainedModel class, except for:

special_tokens_mask default: None. We use this option from HuggingFace as a small hack to implement the special_mask needed for CRF.

class ROBERTA_CRF_Config(PretrainedConfig)

Custom class used for configuring RoBERTa for CRF.

class RobertaCrfForNer(PreTrainedModel)

RoBERTa-based CRF model that inherits from PreTrainedModel HuggingFace class.

Initialization parameters are the same as for the HuggingFace PreTrainedModel class.

Forward Parameters

Same as the HuggingFace PreTrainedModel class, except for:

special_tokens_mask default: None. We use this option from HuggingFace as a small hack to implement the special_mask needed for CRF.

class DEBERTA_CRF_Config(PretrainedConfig)

Custom class used for configuring DeBERTa for CRF.

class DebertaCrfForNer(PreTrainedModel)

DeBERTa-based CRF model that inherits from the HuggingFace PreTrainedModel class.

Initialization parameters are the same as for the HuggingFace PreTrainedModel class.

Forward Parameters

Same as the HuggingFace PreTrainedModel class, except for:

special_tokens_mask default: None. We use this option from HuggingFace as a small hack to implement the special_mask needed for CRF.

util/utilfunctions.py

class Config

Configuration class for the training hyperparameters.

class CPU_Unpickler

An unpickler that loads serialized tensors onto the CPU, even when they were saved from a GPU.

Training

Train your entity extraction model using PyTorch.

First, activate the virtual environment as explained here.

Train an Entity Extraction Model

Configure the necessary hyperparameters in the config.json file. The defaults are:

{
    "model_name": "bert-base-cased",
    "crf": false,
    "dataset_path": "tabiya/job_ner_dataset",   
    "label_list": ["O", "B-Skill", "B-Qualification", "I-Domain", "I-Experience", "I-Qualification", "B-Occupation", "B-Domain", "I-Occupation", "I-Skill", "B-Experience"],
    "model_max_length": 128,
    "batch_size": 32,
    "learning_rate": 1e-4,
    "epochs": 4,
    "weight_decay": 0.01,
    "save": false,
    "output_path": "bert_job_ner"
}
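A short sanity check for the file can be sketched with the standard library (the actual training script reads these values via the Config class in util/utilfunctions.py; this snippet only validates the file's shape):

```python
import json

# Keys the defaults above define; a missing key likely means a malformed config.
REQUIRED_KEYS = {
    "model_name", "crf", "dataset_path", "label_list", "model_max_length",
    "batch_size", "learning_rate", "epochs", "weight_decay", "save", "output_path",
}

def validate_config(path: str = "config.json") -> dict:
    """Load config.json and verify all expected hyperparameter keys exist."""
    with open(path, encoding="utf-8") as fh:
        cfg = json.load(fh)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"config is missing keys: {sorted(missing)}")
    return cfg
```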

To train the model, run the following script in the train directory:

python train.py

The training script is based on the official HuggingFace token classification tutorial.

Train an Entity Similarity Model

Configure the necessary hyperparameters in the sbert_train function in the sbert_train.py file:

sbert_train(model_id='all-MiniLM-L6-v2', dataset_path='your/dataset/path', output_path='your/output/path')

To train the similarity model, run the following script in the train directory:

python sbert_train.py

The dataset should be formatted as a CSV file with two columns, such as 'title' and 'esco_label', where each row contains a pair of related textual data points to be used during the training process. Make sure there are no missing values in your dataset to ensure successful training of the model. Here's an example of how your CSV file might look:

| title | esco_label |
| --- | --- |
| Senior Conflict Manager | public institution director |
| etc | etc |
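The two requirements above (both columns present, no missing values) can be checked before training with a short standard-library sketch; the column names are parameters since your dataset may use different headers:

```python
import csv
import io

def check_pairs_csv(fh, col_a: str = "title", col_b: str = "esco_label") -> int:
    """Verify the two expected columns exist and contain no missing values.

    Returns the number of training pairs found.
    """
    reader = csv.DictReader(fh)
    if col_a not in reader.fieldnames or col_b not in reader.fieldnames:
        raise ValueError(f"expected columns {col_a!r} and {col_b!r}, got {reader.fieldnames}")
    count = 0
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        if not row[col_a] or not row[col_b]:
            raise ValueError(f"missing value on line {i}")
        count += 1
    return count

sample = "title,esco_label\nSenior Conflict Manager,public institution director\n"
print(check_pairs_csv(io.StringIO(sample)))  # 1
```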

More information can be found here.

FAQs

General Usage

1. What is the Tabiya Livelihoods Classifier? The Tabiya Livelihoods Classifier is a tool that leverages advanced transformer-based neural networks to extract and categorize key entities from job descriptions. It supports tasks like occupation and skill classification using frameworks like ESCO.

2. Who can benefit from using this tool? It is designed for HR professionals, recruiters, career advisors, labor market researchers, and developers working on job-matching technologies or workforce analytics.

3. What types of entities can the tool extract? The classifier identifies and categorizes five entity types: Occupation, Skill, Qualification, Experience, and Domain.

4. Is this tool compatible with any specific standards or frameworks? Yes, it retrieves ESCO-related entries for Occupations and Skills, aligning with widely used European job classification systems. With minimal work, other taxonomies, such as O*NET, could be integrated.


Technical Functionality

5. How does the Tabiya Livelihoods Classifier work? The process involves two main steps:

  1. Entity Extraction: Identifies relevant entities in job descriptions.

  2. Similarity Vector Search: Matches extracted entities to entries in pre-defined frameworks or datasets.

6. Does the tool use machine learning models? Yes, it utilizes transformer-based models, which represent the state-of-the-art in natural language processing.

7. Can I customize the classifier for specific industries or datasets? The tool supports customization, allowing users to adapt the similarity search or integrate custom datasets to suit specific domains and use cases.

8. What is the difference between entity extraction and similarity vector search? Entity extraction identifies relevant entities from text, such as a job title or skill. Similarity vector search then matches these entities to related entries in a knowledge base, like ESCO, for standardization.


Integration and Setup

9. How do I install and use the classifier? Detailed installation and setup instructions are available in the user guide.

10. Can the classifier be integrated into existing HR systems? Yes, it is designed to be easily integrated into workflows or systems through APIs or library functions.

11. Are there any prerequisites for using this tool? A working knowledge of Python is recommended for setup and integration. Familiarity with natural language processing concepts is beneficial but not mandatory.


Performance and Limitations

12. How accurate is the entity classification? The classifier achieves state-of-the-art entity recognition results on the dataset released by Green et al. However, as with any machine learning model, the Entity Linker is not perfect. If you encounter bugs or failure cases, please open an issue on GitHub!

13. Does the tool handle multilingual job descriptions? We are currently working on expanding the tool's capabilities to more languages. At present, the tool supports only English and French.

14. Are there limitations on the size of input text? The model uses the NLTK sentence tokenizer function to handle large texts, so theoretically, there is no limit to the input text size. In the current version, the BERT-based models used for entity extraction have a limit of 128 tokens (roughly 100 words). You can use the training script to retrain the model to fit your needs.
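The large-text handling described above amounts to splitting the text into sentences and packing them into chunks under the model's token budget. A rough sketch using a whitespace "tokenizer" as a stand-in for the model's subword tokenizer (so treat max_tokens as an approximation; the real pipeline produces the sentence list with NLTK):

```python
def chunk_sentences(sentences: list[str], max_tokens: int = 128) -> list[str]:
    """Greedily pack sentences into chunks under a rough token budget.

    A crude whitespace token count stands in for the model's subword
    tokenizer, so the budget is approximate.
    """
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```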


Support and Customization

15. Can I contribute to or extend the tool? Yes, developers are welcome to customize and extend the tool. Refer to the contributing guide in the documentation for guidelines.

16. Where can I find support or report issues? Support is available through the official repository or customer service channels. Issues can be reported on the GitHub issues page or via email.

17. Are updates and new features planned? Yes, the tool is actively maintained, with plans for additional features and improved integrations based on user feedback.

Datasets

Reference Sets

Occupations

  • Location: inference/files/occupations_augmented.csv

  • Source: ESCO dataset - v1.1.1

  • Description: ESCO (European Skills, Competences, Qualifications and Occupations) is the European multilingual classification of Skills, Competences, and Occupations. This dataset includes information relevant to the occupations.

  • License: Creative Commons Attribution 4.0 International; see DATA_LICENSE for details.

  • Modifications: The columns retained are alt_label, preferred_label, esco_code, and uuid. Each alternative label has been separated into individual rows.
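The modification described above (one row per alternative label) can be reproduced with a short script. The column names follow the description; the assumption that alternative labels are newline-separated inside the alt_label cell should be checked against your ESCO export:

```python
import csv
import io

def explode_alt_labels(fh):
    """Yield one row per alternative label, as in occupations_augmented.csv.

    ASSUMPTION: the raw export separates alternative labels with newlines
    inside the alt_label cell; adjust the delimiter for your export.
    """
    for row in csv.DictReader(fh):
        for alt in filter(None, row["alt_label"].split("\n")):
            yield {
                "alt_label": alt.strip(),
                "preferred_label": row["preferred_label"],
                "esco_code": row["esco_code"],
                "uuid": row["uuid"],
            }

raw = 'alt_label,preferred_label,esco_code,uuid\n"head cook\nchief cook",head chef,3434.1,abc-123\n'
rows = list(explode_alt_labels(io.StringIO(raw)))
print(len(rows))  # 2: one row per alternative label
```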

Skills

  • Location: inference/files/skills.csv

  • Source: ESCO dataset - v1.1.1

  • Description: ESCO (European Skills, Competences, Qualifications and Occupations) is the European multilingual classification of Skills, Competences and Occupations. This dataset includes information relevant to the skills.

  • License: Creative Commons Attribution 4.0 International; see Data License for details.

  • Modifications: The columns retained are preferred_label and uuid.

Qualifications

  • Location: inference/files/qualifications.csv

  • Source: Official European Union EQF comparison website

  • Description: This dataset contains EQF (European Qualifications Framework) relevant information extracted from the official EQF comparison website. It includes data strings, country information, and EQF levels. Non-English text was ignored.

  • License: Please refer to the original source for license information.

  • Modifications: Non-English text was removed, and the remaining information was formatted into a structured database.

For the French version of the tool, we use the French version of ESCO v1.1.1, as well as a translation of the qualifications produced with the Google Translation API.

Training Sets

Entity Extraction

  • Location: job_ner_dataset (HuggingFace)

  • Source: Green Benchmark corpus

  • Description: This dataset provides a comprehensive benchmark suite for Entity Recognition (ER) in job descriptions. Developed to fill the significant gap in resources for extracting key entities like skills from job descriptions, the dataset features 18.6k annotated entities across five categories: Skill, Qualification, Experience, Occupation, and Domain.

  • License: CC-BY-NC-4.0

  • Modifications: No modifications were made to the original dataset. It was only converted to HuggingFace format.

Entity Similarity

  • Location: TBD

  • Source:

  • Description:

    The hahu_test.csv file is the original file provided by Hahu Jobs with the following fields:

    • title: The title of the job position, indicating the specific role and/or position within the organization.

    • esco_label: The preferred or alternative label provided by ESCO, matching the corresponding ESCO code.

    • esco_code: The ESCO code associated with the job, facilitating standardized classification and comparison across different job listings.

  • License: CC-BY-NC-4.0

  • Modifications: Extracted Occupation title and relevant ESCO code and matched with preferred and alternative labels.

Evaluation Sets

Hahu Test

  • Location: inference/files/eval/redacted_hahu_test_with_id.csv

  • Source:

  • Description: This dataset consists of 542 entries chosen at random from the 11-class general classification system of the Ethiopian Hahu Jobs platform. 50 entries were selected from each class to create the final dataset.

  • License: Creative Commons Attribution 4.0 International; see Data License for details.

  • Modifications: No modifications were made to the selected entries.

House and Tech

  • Location:

    • inference/files/eval/house_test_annotations.csv

    • inference/files/eval/house_validation_annotations.csv

    • inference/files/eval/tech_test_annotations.csv

    • inference/files/eval/tech_validation_annotations.csv

  • Source: Provided by Decorte et al.

  • Description: The dataset includes the HOUSE and TECH extensions of the SkillSpan Dataset. In the original work by Decorte et al., the test and development entities of the SkillSpan Dataset were annotated into the ESCO model.

  • License: MIT. Please refer to the original source.

  • Modifications: The datasets were used as provided without further modifications.

Qualification Mapping

  • Location: inference/files/eval/qualification_mapping.csv

  • Source: Extended from the Qualifications

  • Description: This dataset maps the Green Benchmark Qualifications to the appropriate EQF levels. Two annotators tagged the qualifications, resulting in a Cohen's Kappa agreement of 0.45, indicating moderate agreement.
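Cohen's Kappa measures how much two annotators agree beyond what chance alone would produce; a value of 0.45 falls in the conventional "moderate agreement" band. A small self-contained computation (the annotation lists below are toy examples, not the real data):

```python
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n          # raw agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement from each annotator's marginal label frequencies
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy EQF-level annotations from two annotators:
ann1 = [4, 4, 6, 5, 4, 6]
ann2 = [4, 5, 6, 5, 4, 4]
print(round(cohens_kappa(ann1, ann2), 2))
```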

  • License: Creative Commons Attribution 4.0 International; see Data License for details.

  • Modifications: Extended the dataset to include EQF level mappings; the annotations were verified by two annotators.

Access and Usage

To use these datasets, ensure you comply with the original datasets' licenses and terms of use. Any modifications you make should be documented and appropriately attributed.

For datasets requiring access tokens, such as those from HuggingFace 🤗, please contact the maintainers.
