Improving Multi-Label Classification of Similar Languages by Semantics-Aware Word Embeddings

Paper Authors

The Quyen Ngo¹, Thi Anh Phuong Nguyen², My Linh Ha¹, Thi Minh Huyen Nguyen¹, Phuong Le-Hong¹

  • ¹ Vietnam National University, Hanoi, Vietnam
  • ² Institute of Information Technology, Vietnam Academy of Science and Technology, Hanoi, Vietnam

Abstract

“The VLP team participated in the DSL-ML shared task of the VarDial 2024 workshop which aims to distinguish texts in similar languages. This paper presents our approach to solving the problem and discusses our experimental and official results. We propose to integrate semantics-aware word embeddings which are learned from ConceptNet into a bidirectional long short-term memory network. This approach achieves good performance – our system is ranked in the top two or three of the best performing teams for the task.”

Who are the VLP team?

There is not much information to be found about them, but VLP is probably short for Vietnamese Language Processing. They are most likely associated with Vietnam National University, Hanoi.

What is DSL-ML?

DSL-ML stands for “Discriminating between Similar Languages with Multi-Label classification”. The DSL-ML shared task at the VarDial 2024 workshop challenged participants to create models that can accurately classify texts in similar languages. The task provided datasets for five macro-languages (BCMS, English, Spanish, Portuguese, and French, listed below) with multi-label annotations. Participants’ systems were evaluated based on their macro-averaged F1 scores across these datasets. VarDial 2024.

In the DSL-ML shared task at VarDial 2024, participants were provided with datasets for five language groups:

  • BCMS (Bosnian, Croatian, Montenegrin, Serbian).
  • EN (American and British English).
  • ES (Argentinian and Peninsular Spanish).
  • PT (Brazilian and European Portuguese).
  • FR (Belgian, Canadian, French, and Swiss French).

Each dataset contained instances with multi-label annotations, where each instance consisted of a text segment and its corresponding label(s). The data files were formatted with two tab-separated columns: the first column contained the label(s), and the second column contained the text. Multiple labels were separated by commas. GitHub - DSL-ML-2024.
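As a minimal sketch of how a data file in this format could be read (the file name and label strings below are placeholders, not taken from the shared task repository):

```python
def load_dsl_ml_file(path):
    """Read a DSL-ML data file: one instance per line, with the
    comma-separated label(s) in the first tab-separated column and
    the text in the second."""
    instances = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            labels, text = line.split("\t", 1)
            instances.append((labels.split(","), text))
    return instances

# Hypothetical usage:
# train = load_dsl_ml_file("pt_train.tsv")
# train[0] -> (["PT-BR", "PT-PT"], "some example text ...")
```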

How do semantics-aware word embeddings differ from normal word embeddings?

Semantics-aware word embeddings incorporate structured semantic knowledge from external resources, such as knowledge graphs, into the embedding space. This integration allows the embeddings to capture deeper relationships and nuances between words beyond their co-occurrence in text. In contrast, traditional word embeddings, like Word2Vec, rely solely on statistical information from large text corpora, which might miss certain semantic relationships.

What is ConceptNet?

ConceptNet is a multilingual knowledge graph that connects words and phrases with labeled edges, representing various semantic relationships between them. It serves as a valuable resource for enhancing word embeddings by providing structured semantic information, enabling models to understand context and relationships more effectively. ConceptNet Blog.
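As a small illustration, ConceptNet can be queried through its public REST API; the sketch below assumes the api.conceptnet.io endpoint and the edge/relation fields as I understand them, so treat the exact response structure as an assumption:

```python
import requests

# Query ConceptNet's public API for edges connected to the English word "language".
response = requests.get("https://api.conceptnet.io/c/en/language").json()

for edge in response.get("edges", [])[:5]:
    # Each edge links two concepts with a labeled relation and a weight.
    print(edge["start"]["label"], "--", edge["rel"]["label"], "->",
          edge["end"]["label"], f'(weight {edge["weight"]:.2f})')
```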


Brief difference between BiLSTM and vanilla LSTM

A vanilla Long Short-Term Memory (LSTM) network processes sequences in a single direction, maintaining information from the past to the present. In contrast, a Bidirectional LSTM (BiLSTM) processes sequences in both forward and backward directions, capturing context from both past and future states. This bidirectional approach allows BiLSTMs to understand context more comprehensively, which is especially beneficial in tasks where the meaning of a word depends on both preceding and succeeding words.

graph TD
    %% Labels for Sections
    V["**Vanilla LSTM**"]:::roundedLabel
    B["**Bidirectional LSTM**"]:::roundedLabel
    %% Vanilla LSTM (Unidirectional)
    V --> A[Input Sequence] -->|Forward Pass| LSTM1[LSTM Layer] --> H1[Hidden State] --> Y[Output]
    %% BiLSTM (Bidirectional)
    B --> A2[Input Sequence] -->|Forward Pass| LSTM2[LSTM Layer Forward] --> H2[Hidden State Forward]
    A2 -->|Backward Pass| LSTM3[LSTM Layer Backward] --> H3[Hidden State Backward]
    H2 & H3 --> Y2[Final Output]
    %% Styling
    classDef rounded fill:#000,stroke:#fff,stroke-width:2px,rx:10,ry:10;
    classDef roundedLabel fill:#000,stroke:#ffffff,stroke-width:2px,rx:10,ry:10,font-weight:bold;
    class A,A2,LSTM1,LSTM2,LSTM3,H1,H2,H3,Y,Y2 rounded;
    class V,B roundedLabel;
LSTM (Unidirectional) and BiLSTM (Bidirectional)
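To make the diagram concrete, here is a minimal PyTorch sketch (not from the paper) showing that a BiLSTM concatenates forward and backward hidden states, doubling the feature size of the output:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 40, 300)  # (batch, sequence length, embedding size)

# Vanilla (unidirectional) LSTM: reads the sequence left to right only.
uni = nn.LSTM(input_size=300, hidden_size=128, batch_first=True)
out_uni, _ = uni(x)
print(out_uni.shape)  # torch.Size([32, 40, 128])

# Bidirectional LSTM: forward and backward hidden states are concatenated.
bi = nn.LSTM(input_size=300, hidden_size=128, batch_first=True, bidirectional=True)
out_bi, _ = bi(x)
print(out_bi.shape)   # torch.Size([32, 40, 256])
```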

What is a macro-language?

A macro-language is like a fruit category, such as citrus fruits.

  • Just as oranges, lemons, limes, and grapefruits all fall under the “citrus” category but have distinct flavors, Mandarin, Cantonese, and Hakka are all part of the “Chinese” macro-language but are distinct languages.
  • Some citrus fruits are more similar and interchangeable (like oranges and tangerines), just as some dialects within a macro-language are mutually intelligible.
  • Others, like grapefruits and limes, are more distinct but still belong to the same broad category—just like Arabic varieties, which can be quite different yet are grouped under one macro-language.

In the DSL-ML task, classifying texts within macro-languages is like distinguishing between citrus fruits—it’s easy when the varieties are very different (like Spanish vs. Portuguese) but much harder when they are closely related (like Brazilian Portuguese vs. European Portuguese).

1. Introduction

The Problem at Hand

The problem at hand involves distinguishing between similar languages and language varieties through a computational lens, focusing on diatopic language variation. The task evaluates a system’s ability to assign multi-label categories to text instances across five macro-languages: BCMS, EN, ES, PT, and FR.

Analogy for the DSL-ML Shared Task

Imagine you’re a music streaming app trying to classify songs into specific genres and sub-genres.

The Challenge

  • Some genres are very distinct (Jazz vs. Heavy Metal)—just like English vs. Japanese.
  • Some genres are very similar (Classical vs. Baroque, or Deep House vs. Tech House)—just like Argentinian Spanish vs. Peninsular Spanish.
  • Some songs may belong to multiple sub-genres at the same time (a song could be both “Synthwave” and “80s Pop”)—just like a text may belong to both “Brazilian Portuguese” and “European Portuguese” depending on certain linguistic traits.

How This Relates to DSL-ML

  • Multi-label classification is needed because a single text might belong to multiple language varieties (e.g., a document could have both American and British English spelling).
  • Macro-languages like BCMS (Bosnian, Croatian, Montenegrin, Serbian) share a lot of similarities, making it difficult to separate them—similar to how music sub-genres blend into each other.
  • The evaluation metric (macro-F1 score) ensures fair assessment across all language varieties, just like a good genre classifier should recognize niche genres without being biased toward the most common ones.

VLP Team’s Approach

The team aimed to improve multi-label classification of similar languages by enhancing their model with semantics-aware word embeddings from ConceptNet, a structured knowledge graph that captures real-world meanings and relationships between words. By incorporating ConceptNet-based embeddings, the system improves its understanding of language nuances beyond just syntax, making it more effective at distinguishing closely related languages and dialects. This approach leverages both linguistic structure (BiLSTM) and real-world meaning (ConceptNet), leading to better classification performance in the DSL-ML shared task.


2. Methods

The team designed a system that adhered strictly to the closed-track constraints, avoiding pre-trained language models and prohibited external datasets, while enhancing their BiLSTM network with ConceptNet embeddings. This approach allowed them to leverage semantic knowledge without violating the task’s rules, aiming for a purely data-driven and interpretable method for language variety classification.

2.1. Bidirectional LSTM Model

They implemented a Stacked Bidirectional LSTM architecture, where the hidden states from both left-to-right and right-to-left passes at layer \(k\) are concatenated and fed into layer \((k + 1)\).

For embeddings, they experimented with two approaches:

  • Random Initialization – likely training both the tokenizer/vectorizer and LSTM from scratch.
  • Precomputed ConceptNet Numberbatch Embeddings – details provided in the next section.

After the final BiLSTM layer, a feedforward network with a softmax activation was used for multiway classification:

\[P(y_j \mid \vec{v}_j) = \text{softmax}(W \vec{v}_j + \vec{b})\]

where \(W\) and \(\vec{b}\) are learned parameters. Based on this, they proposed the following architecture:

graph LR
    E["Embedding Layer (w)"] --> B["Stacked BiLSTM (h)"] --> D1["Dense (d, ReLU)"] --> D2["Dense (Softmax)"]
    %% Styling
    classDef rounded fill:#000,stroke:#fff,stroke-width:2px,rx:10,ry:10;
    class E,B,D1,D2 rounded;
VLP Team's Proposed network architecture

Here, the hyperparameters \(w\), \(h\), and \(d\) denote the word embedding size, the recurrent hidden size, and the dense hidden size, respectively.
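A minimal PyTorch sketch of this kind of architecture; the class name, the default sizes, and the choice of pooling the last position are placeholder assumptions rather than details from the paper:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Embedding (w) -> stacked BiLSTM (h) -> Dense (d, ReLU) -> Dense (softmax)."""
    def __init__(self, vocab_size, num_classes, w=300, h=256, d=128,
                 num_layers=2, pretrained_embeddings=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, w)
        if pretrained_embeddings is not None:
            # e.g. a ConceptNet Numberbatch matrix of shape (vocab_size, w)
            self.embedding.weight.data.copy_(pretrained_embeddings)
        self.bilstm = nn.LSTM(w, h, num_layers=num_layers,
                              bidirectional=True, batch_first=True)
        self.dense = nn.Linear(2 * h, d)   # forward + backward states
        self.out = nn.Linear(d, num_classes)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)    # (batch, seq, w)
        states, _ = self.bilstm(emb)       # (batch, seq, 2h)
        v = states[:, -1, :]               # last position's concatenated states (a simple pooling choice)
        logits = self.out(torch.relu(self.dense(v)))
        return logits  # softmax is applied by the cross-entropy loss during training
```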

All models were trained using Adam with a learning rate of \(5 \times 10^{-5}\), batch size 32, and early stopping when accuracy stagnated for three epochs on the development set. The max sequence length was set to 40 tokens. Each experiment was repeated five times, and results were averaged.
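A sketch of that training setup in PyTorch, reusing the model class above; the function signature and the evaluation callback are my assumptions, not the authors' code:

```python
import torch

def train_model(model, train_loader, dev_loader, evaluate_accuracy,
                max_epochs=100, patience=3):
    """Train with Adam (lr 5e-5) and stop early when dev accuracy
    has not improved for `patience` consecutive epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_dev_acc, stagnant = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for token_ids, labels in train_loader:  # batches of 32, sequences cut to 40 tokens
            optimizer.zero_grad()
            loss = loss_fn(model(token_ids), labels)
            loss.backward()
            optimizer.step()
        dev_acc = evaluate_accuracy(model, dev_loader)
        if dev_acc > best_dev_acc:
            best_dev_acc, stagnant = dev_acc, 0
        else:
            stagnant += 1
            if stagnant >= patience:
                break
    return best_dev_acc
```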

During training, they optimized the cross-entropy loss to maximize correct predictions. While the official evaluation used per-class F1, weighted F1, and macro-averaged F1, they relied on accuracy for model tuning, despite its limitations on imbalanced datasets.
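The difference between the tuning metric (accuracy) and the official F1-based metrics can be illustrated with scikit-learn; the variety labels below are made up for the example:

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative gold and predicted variety labels (not real task data).
y_true = ["pt-BR", "pt-BR", "pt-BR", "pt-PT", "pt-PT", "pt-BR"]
y_pred = ["pt-BR", "pt-BR", "pt-BR", "pt-BR", "pt-PT", "pt-BR"]

print(accuracy_score(y_true, y_pred))                  # tuning metric
print(f1_score(y_true, y_pred, average=None))          # per-class F1
print(f1_score(y_true, y_pred, average="weighted"))    # weighted F1
print(f1_score(y_true, y_pred, average="macro"))       # official macro-averaged F1
```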

2.2. ConceptNet Numberbatch

ConceptNet is a freely available semantic network designed to help computers understand word meanings. It has been used to create word embeddings similar to Word2Vec, GloVe, and FastText but with unique advantages. ConceptNet embeddings are free, multilingual, and aligned across languages while aiming to avoid harmful biases.

graph TD
    %% Nodes
    A[ConceptNet]:::main -->|is a| B[Semantic Network]
    A -->|is a| C[Knowledge Graph]:::blue
    C -->|has| D[Commonsense Knowledge]:::gray
    D -->|part of| E[Artificial Intelligence]:::gray
    C -->|is used for| F[Natural Language Understanding]:::gray
    F -->|part of| G[Word Embedding]:::yellow
    G -->|part of| E
    %% Additional Relation
    A -->|motivated by goal| H[Let computer understand what people already know]:::green
    %% Styling
    classDef main fill:#6A6AF2,stroke:#000,stroke-width:2px,rx:10,ry:10;
    classDef blue fill:#457FD3,stroke:#000,stroke-width:2px,rx:10,ry:10;
    classDef gray fill:#6F6F9B,stroke:#000,stroke-width:2px,rx:10,ry:10;
    classDef yellow fill:#CFC573,stroke:#000,stroke-width:2px,rx:10,ry:10;
    classDef green fill:#6ADA8F,stroke:#000,stroke-width:2px,rx:10,ry:10;
    class A main;
    class B,C blue;
    class D,E,F gray;
    class G yellow;
    class H green;
An illustration of ConceptNet in graph

ConceptNet Numberbatch is a set of word embeddings trained on ConceptNet, incorporating semi-structured commonsense knowledge rather than relying solely on contextual word usage. This multilingual embedding space allows words from different languages to share a unified representation, making it well-suited for cross-lingual tasks.

These embeddings have shown strong performance in word similarity tasks and have been applied to natural language inference and dependency parsing. In this work, the team demonstrates their effectiveness in classifying similar languages.
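A sketch of how precomputed Numberbatch vectors could be loaded into an embedding matrix and passed as the pretrained_embeddings argument of the model sketch above; the file name, the 300-dimensional vectors, and the /c/&lt;lang&gt;/&lt;word&gt; line format reflect my understanding of the publicly distributed multilingual release, so treat them as assumptions:

```python
import gzip
import numpy as np

def load_numberbatch(path, vocab, dim=300):
    """Build an embedding matrix for `vocab` (word -> row index) from a
    ConceptNet Numberbatch text file (e.g. numberbatch-19.08.txt.gz).
    Words missing from Numberbatch keep a small random vector."""
    matrix = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype("float32")
    with gzip.open(path, "rt", encoding="utf-8") as f:
        next(f)  # header line: "<num_entries> <dim>"
        for line in f:
            parts = line.rstrip().split(" ")
            # Strip the /c/<lang>/ prefix; keeping it would distinguish languages
            # that share the same surface form (e.g. "chat" in English and French).
            word = parts[0].rsplit("/", 1)[-1]
            if word in vocab:
                matrix[vocab[word]] = np.asarray(parts[1:], dtype="float32")
    return matrix
```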

3. Results

Their models were evaluated on the development and test datasets for English, Spanish, French, and Portuguese. The LSTM-r model with randomly initialized word embeddings peaked at 61.17% accuracy on the development set but overfitted the training set. The LSTM-c model using ConceptNet embeddings performed better on the development set (66.57%) but underperformed on Portuguese due to a technical issue. The ConceptNet embeddings slightly improved performance on the English and French test sets.

4. Conclusion

Their model, leveraging ConceptNet embeddings, effectively distinguishes similar languages despite its simplicity. Currently, they use a multi-class classification approach but plan to explore multi-label classification in future work. They suggest integrating contrastive learning to improve representation learning by bringing similar instances closer in embedding space. Additionally, combining knowledge graph-based embeddings (e.g., WordNet, ConceptNet) with contrastive learning could enhance performance. While they tested XLM-R for this task, initial results were underwhelming compared to their method, warranting further investigation into the effectiveness of large language models for similar language classification.