The Limits of Cross-Lingual Transfer: Evaluating SignCLIP on LIS

Abstract. We present the first study of the multilingual transfer ability of SignCLIP for Italian Sign Language (LIS) across zero-shot, few-shot, and fine-tuning settings. We find that its pretraining induces negative zero-shot transfer. In contrast, few-shot results confirm robust sign embeddings. We find monolingual fine-tuning highly effective on small datasets, with Global Noise-Contrastive Estimation (GlobalNCE) and parameter-efficient ProLIP achieving the top results, both ahead of InfoNCE.

Introduction

Sign languages are the primary means of communication for millions of deaf individuals worldwide [1], [2]. Isolated Sign Language Recognition (ISLR) remains an open research area at the intersection of computer vision (CV) and natural language processing (NLP) [1], [3]. As in spoken language research, there is a large discrepancy in the efficacy of state-of-the-art solutions between high-resource and low-resource languages. Unlike the widely studied American (ASL), British (BSL), and Chinese (CSL) Sign Languages, Italian Sign Language (LIS) remains under-resourced, lacking the large-scale annotated corpora required to train deep neural networks that can recognise it effectively [2].

To overcome this limitation, recent research has pivoted toward transfer learning and few-shot recognition, leveraging models pre-trained on large multilingual datasets [4]. One of these models is SignCLIP, which utilises contrastive learning to project spoken language text and sign language videos into a shared embedding space. It is pre-trained on Spreadthesign, a dataset containing approximately 500,000 video clips in up to 44 different sign languages [1], including LIS. However, downstream evaluations for LIS recognition and text-video retrieval tasks were entirely omitted in its original benchmarks [1].

Consequently, the applicability of these multilingual priors to low-resource LIS datasets remains unexamined. To address this research gap, we present the first investigation for LIS that explores the performance of zero-shot, few-shot, and fine-tuning paradigms using SignCLIP as a foundation model, evaluating both cross-modal video-text retrieval and ISLR. For this evaluation, we use two datasets: A3LIS-147, introduced in [5], and SignIT [2]. These datasets enable complementary evaluations by contrasting a controlled, balanced multi-signer environment with domain-specific vocabulary against naturalistic, unbalanced, core-vocabulary signs, respectively.

Our main findings are as follows:

Related works

Italian sign language recognition

LIS ISLR research has largely targeted small-scale, controlled settings, utilising the A3LIS-147 dataset as the primary benchmark [6], [7]. Early approaches with Hidden Markov Models (HMMs) have been improved upon by more recent work reaching an accuracy of 80.4% with fully-supervised CNN models (Inception3D and SlowFast) [6].

The above work suffers from several structural limitations in the context of scalable SLR. The architectures employed cannot adapt to out-of-dictionary vocabulary without retraining. Furthermore, they are optimised for clean, artifact-free datasets and may degrade on the noisy, out-of-distribution data encountered in real-world deployment [8].

To address the latter, the SignIT dataset was recently introduced to benchmark LIS ISLR on real-world data. Baseline evaluations on SignIT demonstrate that current state-of-the-art approaches struggle to classify LIS signs at the gloss level, as opposed to the categorical level [2].¹

Zero-shot, few-shot and cross-lingual recognition

Early Zero-Shot SLR attempts struggled due to cross-lingual complexities between signs and natural language, as well as high variation in sign execution [10], [11], resulting in a pivot towards few-shot, visual retrieval paradigms [4].

Bilge et al. introduced Few-Shot Sign Language Recognition (FSSLR) via a meta-learning framework across sign languages, showing that sparse source examples can generalise to unseen target languages. They found that “synonym” subsets between languages failed to yield higher performance, suggesting signs are heavily diversified rather than net-iconic [4].

Similarly, Vandendriessche et al. (2025) embedded pose keypoints for distance-based visual retrieval, enabling one-shot ISLR that generalises to out-of-domain vocabularies without any retraining. Both frameworks operate entirely within the visual domain: they achieve high cross-lingual transferability but lack any inherent coupling to natural language text or semantic meaning [8].

In contrast, Cheng et al. utilise contrastive learning in CiCo to model retrieval as a cross-lingual problem, successfully aligning a single sign language video modality directly to a spoken language text space (e.g., ASL to English). CiCo trains a domain-agnostic sign encoder before the domain-aware retrieval stage [12].

SignCLIP, multilingual corpora, and multilingual sign language

SignCLIP aligns multilingual signs to a single text space in English (as a matter of efficiency). The work relies on the ‘Iconicity Hypothesis’, namely that universal motion primitives are semantically shared across sign languages, and adapts the distributional hypothesis to sign language. The model captures the core meaning of a sign as a ‘cluster centre’ in the embedding space, preserving the individual variance of different signers. However, mapping these clusters is made difficult by the Spreadthesign corpus, which is skewed to only one video per sign per language [1].

SignCLIP uses cross-lingual contrastive learning with prefixed language-identifying tokens, e.g., ‘<en> <ase> {word}’ for ASL. Ultimately, the authors note that the model’s zero-shot performance on out-of-domain data is deficient, and they posit that few-shot learning or fine-tuning is necessary to achieve noticeable performance [1].

The authors do not investigate the underlying architectural or semantic mechanisms that cause this failure, leaving the specific limitations of their cross-modal alignment unexamined.²

Datasets

A3LIS-147 characteristics

SignIT characteristics

Preprocessing

We follow the same pipeline used for the training of the frozen SignCLIP backbone [1], including:

Methodology

We investigate whether SignCLIP’s multilingual pretraining generalises to LIS, a language present in the pretraining corpus but excluded from the original evaluation. Our approach tests this in three phases: zero-shot evaluation, to assess the native LIS structure of the frozen multilingual prior; few-shot adaptation, to evaluate whether this structure supports recognition from a minimal number of examples using a frozen backbone; and a fine-tuning ablation, to determine the performance ceiling of lightweight fine-tuning whilst preserving cross-modal alignment.

Dual-dataset evaluation

A3LIS-147 and SignIT together evaluate model transfer across three axes:

Zero-shot evaluation

The zero-shot evaluation applies the frozen SignCLIP checkpoint directly to both datasets. Predictions are generated by computing the cosine similarity between the video embedding and the text embedding of the English gloss (prompted as <en> <lis> [gloss]).
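The retrieval step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SignCLIP implementation: the three-dimensional embeddings are toy values standing in for encoder outputs, and the helper names `cosine`, `build_prompt`, and `rank_glosses` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_prompt(gloss, lang_tag="<lis>"):
    """Format an English gloss with the language prefix tokens: '<en> <lis> gloss'."""
    return f"<en> {lang_tag} {gloss}"

def rank_glosses(video_emb, text_embs):
    """Return prompts sorted from most to least similar to the video embedding."""
    scored = [(cosine(video_emb, emb), prompt) for prompt, emb in text_embs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [prompt for _, prompt in scored]

# Toy embeddings (hypothetical values); real ones would come from the frozen encoders.
text_embs = {
    build_prompt("water"): [1.0, 0.0, 0.0],
    build_prompt("home"): [0.0, 1.0, 0.0],
    build_prompt("train"): [0.0, 0.0, 1.0],
}
video_emb = [0.1, 0.9, 0.2]
ranking = rank_glosses(video_emb, text_embs)
print(ranking[0])
```

The position of the correct prompt in `ranking` is the rank from which the retrieval metrics are derived.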

We report Recall@1, 5, 10, and Median Rank. To better investigate the “Iconicity Hypothesis” and transfer ability, we perform a per-class analysis stratified by Category, Median Rank, qualitative ASL/BSL similarity (iconicity proxy), and Spreadthesign presence.
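For reference, the retrieval metrics can be computed from the 1-based rank of the correct gloss for each query. The sketch below uses hypothetical ranks; how the reported per-category values are aggregated across classes is not specified here, so the plain median over queries is an assumption.

```python
from statistics import median

def recall_at_k(ranks, k):
    """Fraction of queries whose correct item appears within the top k (1-based ranks)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def median_rank(ranks):
    """Median position of the correct item across queries (lower is better)."""
    return median(ranks)

# Hypothetical ranks of the correct gloss for eight queries.
ranks = [1, 3, 7, 12, 40, 2, 60, 5]
for k in (1, 5, 10):
    print(f"R@{k} = {recall_at_k(ranks, k):.3f}")
print("MedR =", median_rank(ranks))
```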

Translation. We manually translated A3LIS-147 using Spreadthesign. Remaining out-of-vocabulary (OOV) terms were translated as accurately as possible. We also recreated the unavailable categories. Both are listed in Appendix D.

Few-shot evaluation

We evaluate few-shot ISLR to determine whether the solid results reported in the SignCLIP paper generalise to LIS.

Fine-tuning and loss function ablation on A3LIS-147

We initialise from the baseline checkpoint and fine-tune on A3LIS-147 using a 70/10/20 signer-stratified split. This ensures that our evaluation measures generalisation to unseen signers (see Appendix D for the exact partition). Each configuration is trained for 50 epochs and evaluated across zero-shot retrieval, linear probing, and prototypical retrieval. For all the details about the hyperparameters used, see Appendix B.
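A signer-disjoint split of this kind can be sketched as below. This is an illustration only, not the partition used in our experiments: the function name `split_by_signer`, the seed, and the ten toy signers are assumptions, and the ratios are applied to the number of signers, which only approximates a 70/10/20 split over clips.

```python
import random

def split_by_signer(samples, ratios=(0.7, 0.1, 0.2), seed=0):
    """Partition (signer_id, item) pairs so each signer lands in exactly one split."""
    signers = sorted({signer for signer, _ in samples})
    random.Random(seed).shuffle(signers)
    n_train = round(ratios[0] * len(signers))
    n_val = round(ratios[1] * len(signers))
    groups = {
        "train": set(signers[:n_train]),
        "val": set(signers[n_train:n_train + n_val]),
        "test": set(signers[n_train + n_val:]),
    }
    return {name: [s for s in samples if s[0] in ids] for name, ids in groups.items()}

# Toy data: ten hypothetical signers with three clips each.
samples = [(f"s{i}", f"clip{i}_{j}") for i in range(10) for j in range(3)]
splits = split_by_signer(samples)
print({name: len(part) for name, part in splits.items()})
```

Because no signer appears in more than one split, test accuracy reflects generalisation to unseen signers rather than memorisation of signer-specific style.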

The text Transformer 𝑓𝜃𝑡 and CNN backbone 𝑓𝜃CNN are frozen to preserve pre-trained semantic anchors. We unfreeze the visual adaptation parameters Θadapt={𝜃MLP,𝜃𝑣,𝜏}, denoting the video token MLP, video Transformer encoder, and logit-scale temperature, respectively.

Because contrastive models exhibit high sensitivity to objectives and batch scales on low-resource datasets, we conduct an ablation evaluating the following optimisation regimes:

SignIT fine-tuning

The single best-performing fine-tuning regime identified on A3LIS-147 is applied to SignIT (details in Appendix B). To address the dataset’s naturalistic acquisition and long-tailed distribution, we apply light spatial augmentation, to preserve semantic meaning, and heavier temporal augmentation (aug_sigma_temporal: 0.25, aug_sigma_spatial: 0.15, aug_sigma_noise: 0.002, aug_p_flip: 0.0, aug_strength_max: 3.5). SignIT’s richer macro-categories and its previous literature motivated additional experiments on category-level zero-shot and few-shot retrieval. For these experiments, we report recall, precision, and F1 alongside R@1 for better comparison with the original authors.

Experiments

Zero-shot complete-dataset evaluations

Baseline zero-shot evaluations in Table 1 and Table 2 show poor overall performance, in line with Jiang et al.’s findings for out-of-domain transfer [1]. However, there is stratification between categories. In SignIT, the ‘Food’ domain achieves the highest exact retrieval (10.96% R@1), while ‘Emotions’ demonstrates superior neighbourhood alignment (51.60% R@10). Similarly, A3LIS-147 exhibits a split between early recall (‘Common Life’, 7.22% R@1) and broader neighbourhood density (‘Public Institute’, MedR 42.5). This variance indicates that while overall cross-lingual transfer is weak, the model successfully transfers universal, cross-lingual iconic primitives from the pretraining distribution for specific semantic clusters. For gloss-level and further category details, see Appendix C.

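| Cat. | R@1 | R@5 | R@10 | MedR |
|---|---|---|---|---|
| Animals | 0.041 | 0.149 | 0.3108 | 23.6 |
| Colors | 0.0572 | 0.2321 | 0.4353 | 18.2 |
| Emotions | 0.04 | 0.3117 | 0.516 | 13.2 |
| Family | 0.0071 | 0.0155 | 0.0496 | 42.1 |
| Food | 0.1096 | 0.3614 | 0.4947 | 13.7 |
| Overall | 0.0506 | 0.1876 | 0.326 | 24.0 |

Table 1: SignIT zero-shot category results.

| Cat. | R@1 | R@5 | R@10 | MedR |
|---|---|---|---|---|
| Common Life | 0.0722 | 0.2167 | 0.2722 | 48.9 |
| Education | 0.0433 | 0.1233 | 0.17 | 61.0 |
| Highway | 0.025 | 0.125 | 0.25 | 50.1 |
| Hospital | 0.0263 | 0.0895 | 0.1684 | 46.0 |
| Public Institute | 0.0447 | 0.1342 | 0.2 | 42.5 |
| Railway Station | 0.0083 | 0.0333 | 0.0833 | 47.9 |
| Overall | 0.0356 | 0.1114 | 0.1732 | 49.2 |

Table 2: A3LIS zero-shot category results.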

Zero-shot median-rank stratification, iconicity, and OOV analysis

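| Tier | MedR Range | Portion | Cum. MedR |
|---|---|---|---|
| Great | 1–3 | 0.0537 | 1.6 |
| Good | 3.1–15 | 0.1678 | 7.2 |
| Fair | 15.1–40 | 0.2282 | 18.4 |
| Neutral | 40.1–74 | 0.3087 | 33.0 |
| Adverse | 74.1–148 | 0.2416 | 49.2 |

Table 3: A3LIS-147 zero-shot tier stratification.

| Tier | MedR Range | Portion | Cum. MedR |
|---|---|---|---|
| Great | 1–3 | 0.043 | 1.9 |
| Good | 3.1–10 | 0.226 | 5.6 |
| Fair | 10.1–25 | 0.366 | 12.3 |
| Neutral | 25.1–47 | 0.226 | 18.5 |
| Adverse | 47.1–93 | 0.140 | 24.0 |

Table 4: SignIT zero-shot tier stratification.

| LIS Sign in STS | R@1 | R@5 | R@10 | MedR |
|---|---|---|---|---|
| No | 0.0413 | 0.1109 | 0.1543 | 51.1 |
| Yes | 0.0337 | 0.1169 | 0.1888 | 48.4 |
| Yes, but different | 0.0286 | 0.0786 | 0.1357 | 47.9 |

Table 5: A3LIS signs present in pretraining.

| Iconicity Proxy (UK/US) | R@1 | R@5 | R@10 | MedR |
|---|---|---|---|---|
| Kind of | 0.0143 | 0.0571 | 0.1214 | 60.1 |
| No | 0.026 | 0.1135 | 0.1698 | 50.5 |
| Yes | 0.0667 | 0.1256 | 0.2 | 42.0 |

Table 6: A3LIS iconicity proxy results.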

SignCLIP’s cross-lingual alignment induces a structurally bimodal transfer effect. We argue that since the pre-trained text encoder operates in an English-centric semantic space, language prefix identifiers provide insufficient separation. Consequently, the objective forces visually disparate sign videos toward a quasi-singular text anchor. This semantic asymmetry creates an optimisation conflict that marginalises low-resource languages, resulting in negative transfer, evidenced by the adverse tiers in A3LIS-147 (24.16%) and SignIT (14.0%) in Table 3 and Table 4.
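The tier assignment behind Table 3 can be sketched as a simple bucketing of per-gloss median ranks. The upper bounds below are the A3LIS-147 ranges from Table 3; the gloss names and rank values are hypothetical, and `tier_of` and `tier_portions` are illustrative helper names.

```python
# Upper bounds of each A3LIS-147 tier, taken from the MedR ranges in Table 3.
A3LIS_TIERS = [("Great", 3.0), ("Good", 15.0), ("Fair", 40.0),
               ("Neutral", 74.0), ("Adverse", 148.0)]

def tier_of(med_rank, tiers=A3LIS_TIERS):
    """Return the first tier whose upper bound covers the given median rank."""
    for name, upper in tiers:
        if med_rank <= upper:
            return name
    raise ValueError(f"median rank {med_rank} exceeds the last tier bound")

def tier_portions(med_ranks, tiers=A3LIS_TIERS):
    """Share of glosses falling in each tier, given a gloss -> median rank mapping."""
    counts = {name: 0 for name, _ in tiers}
    for rank in med_ranks.values():
        counts[tier_of(rank, tiers)] += 1
    total = len(med_ranks)
    return {name: count / total for name, count in counts.items()}

# Hypothetical per-gloss median ranks, for illustration only.
toy = {"gloss_a": 1.5, "gloss_b": 12.0, "gloss_c": 38.0, "gloss_d": 90.0}
portions = tier_portions(toy)
print(portions)
```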

For iconic signs, the shared anchor is beneficial (achieving a MedR of 42.0); for non-iconic signs, the anchor provides a weak or adversarial signal, collapsing retrieval accuracy (MedR 60.1) in Table 6. Pre-training exposure does not overcome this issue: Table 5 shows that OOV LIS signs marginally outperform in-vocabulary signs at R@1 (4.13% vs. 3.37%).

We believe data scaling is unlikely to resolve these failures. Shared human articulatory constraints result in heavy overlap in the discriminative features between languages, a problem further complicated by high individual signer-variance (Figure 1 in Appendix). Thus, diversification within synonym classes [4] and cross-lingual “false friends” lead to gradient conflicts. Our findings suggest these factors limit zero-shot performance for any architecture imposing a single joint embedding space without language-gated alignment. These issues can be resolved by monolingual fine-tuning (Table 8), likely at the expense of multilingual understanding, but this remains unexamined.

Linear probing on the frozen backbone achieves 66.78% R@1 on A3LIS-147 in Table 7, confirming that the video encoder learns robust representations.
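Prototype retrieval on a frozen backbone, the other few-shot mode reported alongside linear probing, can be sketched as nearest-class-mean classification. This is a generic formulation under assumptions: the embeddings are toy two-dimensional values, and averaging L2-normalised support embeddings is one common choice rather than a confirmed detail of our pipeline.

```python
import math

def l2_normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def build_prototypes(support):
    """Average the normalised support embeddings of each class into one prototype."""
    prototypes = {}
    for label, embs in support.items():
        normed = [l2_normalise(e) for e in embs]
        mean = [sum(col) / len(normed) for col in zip(*normed)]
        prototypes[label] = l2_normalise(mean)
    return prototypes

def classify(query, prototypes):
    """Return the label whose prototype has the highest cosine similarity to the query."""
    q = l2_normalise(query)
    return max(prototypes, key=lambda lab: sum(a * b for a, b in zip(q, prototypes[lab])))

# Toy two-shot support set (hypothetical embeddings).
support = {"acqua": [[1.0, 0.0], [0.9, 0.1]], "casa": [[0.0, 1.0], [0.2, 0.8]]}
prototypes = build_prototypes(support)
prediction = classify([0.8, 0.3], prototypes)
print(prediction)
```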

A3LIS fine-tuning ablation

Table 7 shows that GlobalNCE yields the strongest fine-tuning performance on A3LIS-147, and its linear probe (80.20% R@1) matches the previous SOTA of 80.4% [6]. We attribute this to its global negative sampling across distributed batches, which provides the density of hard negatives required to stabilise contrastive gradients. ProLIP comes within 0.33 percentage points of GlobalNCE in zero-shot R@1 (75.84% vs. 76.17%) while adapting only the final MLP layer and logit scale, making it the preferred regime when compute or overfitting risk is the primary concern.

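| Method | Mode | R@1 | R@5 | R@10 | MedR |
|---|---|---|---|---|---|
| Baseline | Zero | 0.0369 | 0.1309 | 0.1946 | 40 |
| Baseline | Proto | 0.6477 | 0.9094 | 0.9698 | 1 |
| Baseline | LP | 0.6678 | 0.9329 | 0.9698 | 1 |
| GlobalNCE 16 | Zero | 0.7617 | 0.9262 | 0.9698 | 1 |
| GlobalNCE 16 | Proto | 0.7886 | 0.9430 | 0.9732 | 1 |
| GlobalNCE 16 | LP | 0.8020 | 0.9430 | 0.9732 | 1 |
| ProLIP 16 | Zero | 0.7584 | 0.9161 | 0.953 | 1 |
| ProLIP 16 | Proto | 0.7718 | 0.9396 | 0.9597 | 1 |
| ProLIP 16 | LP | 0.7785 | 0.9364 | 0.9564 | 1 |

Table 7: A3LIS fine-tuning ablation (condensed). See Appendix A.2 for the complete ablation over all optimisers.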
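To make the role of in-batch negatives concrete, the following sketch computes a standard symmetric InfoNCE loss over a video-text similarity matrix. It is the textbook formulation, not the GlobalNCE or ProLIP implementation used in our experiments; the temperature of 0.07 and the toy matrices are assumptions. Under the description above, gathering embeddings across workers would enlarge this matrix, so each positive pair is contrasted against more negatives.

```python
import math

def log_softmax_at(row, idx):
    """Log-softmax of row[idx], computed stably by subtracting the row maximum."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - log_z

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE for an N x N similarity matrix with matched pairs on the diagonal."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Video-to-text direction: each row competes over all text columns.
    v2t = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Text-to-video direction: each column competes over all video rows.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax_at(cols[j], j) for j in range(n)) / n
    return 0.5 * (v2t + t2v)

aligned = [[1.0, 0.0], [0.0, 1.0]]      # matched pairs are the most similar
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # matched pairs are the least similar
uniform = [[0.5, 0.5], [0.5, 0.5]]      # no signal: loss equals log(N)
print(info_nce(aligned), info_nce(misaligned), info_nce(uniform))
```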

SignIT few-shot and fine-tuning ablation

Table 8 shows that augmentation of SignIT improves generalisation. Our results trail the LLaVA-OneVision model (Acc 0.238, video+pose) of the SignIT authors [2]. However, we outperform all non-video baselines they evaluated, including pose-only LLaVA (Acc 0.121), establishing a competitive keypoint-only result.

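| Model | Mode | R@1 | R@5 | R@10 | MedR |
|---|---|---|---|---|---|
| Baseline | Zero | 0.0359 | 0.1692 | 0.2769 | 22.0 |
| Baseline | Proto | 0.0974 | 0.3077 | 0.4462 | 13.0 |
| Baseline | LP | 0.0923 | 0.3641 | 0.5333 | 10.0 |
| Fine-tune | Zero | 0.1385 | 0.4308 | 0.5538 | 7.0 |
| Fine-tune | Proto | 0.1487 | 0.4256 | 0.5692 | 8.0 |
| Fine-tune | LP | 0.1538 | 0.4308 | 0.5744 | 7.0 |
| Fine-tune + Aug | Zero | 0.1436 | 0.4308 | 0.6154 | 8.0 |
| Fine-tune + Aug | Proto | 0.1744 | 0.4103 | 0.6103 | 8.0 |
| Fine-tune + Aug | LP | 0.1744 | 0.4462 | 0.5897 | 8.0 |

Table 8: SignIT fine-tuning and few-shot ablation.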

SignIT macro-category retrieval

Zero-shot retrieval on categories achieves an F1-score (0.48) that is competitive with some fully supervised video baselines, such as I3D (0.34 F1) [2]. Because this relies on measuring the distance between visual embeddings and the textual embeddings of broad macro-categories, these results highlight an advantage of contrastive pretraining: the latent space is semantically organised, allowing the model to generalise to categorical distributions it never explicitly encountered during pretraining. Our strongest few-shot linear-probe configuration reaches 64.62% R@1, approaching the performance of SignIT’s best fully supervised MLP (0.726 Accuracy) [2].

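| Model | Mode | R@1 | Pr | Re | F1 |
|---|---|---|---|---|---|
| Baseline | Zero | 0.3744 | 0.54 | 0.34 | 0.30 |
| Baseline | Proto | 0.4103 | 0.3909 | 0.3921 | 0.3844 |
| Baseline | LP | 0.5846 | 0.61 | 0.52 | 0.55 |
| Fine-tune | Zero | 0.4872 | 0.48 | 0.55 | 0.48 |
| Fine-tune | Proto | 0.5641 | 0.5219 | 0.5371 | 0.5251 |
| Fine-tune | LP | 0.6462 | 0.64 | 0.59 | 0.61 |
| Fine-tune + Aug | Zero | 0.4974 | 0.49 | 0.52 | 0.48 |
| Fine-tune + Aug | Proto | 0.5949 | 0.5561 | 0.5708 | 0.5503 |
| Fine-tune + Aug | LP | 0.6103 | 0.68 | 0.57 | 0.59 |

Table 9: SignIT macro-category retrieval, with metrics chosen to match the original authors.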
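As a reference for how category-level precision, recall, and F1 relate to the predictions, the sketch below computes macro-averaged scores from top-1 labels. Macro averaging is an assumption here (a weighted average is equally plausible), and the labels are toy values.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy category predictions (hypothetical).
y_true = ["food", "food", "family", "animals"]
y_pred = ["food", "family", "family", "animals"]
precision, recall, f1 = macro_prf(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```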

Sign language identification

| Random Chance | R@1 | R@2 | MedR |
|---|---|---|---|
| 0.1250 | 0.3510 | 0.6523 | 2.0 / 8 |

False positives: lsf - 20, bsl - 688, ngt - 227, and lse - 32.

Table 10: Sign language identification results.

The sign language identification results in Table 10 complicate our earlier finding that in-vocabulary LIS signs do not outperform OOV signs. This simplified retrieval task suggests that SignCLIP does learn some language separation, as shown by the R@2 (65.23%). However, performance drops sharply at R@1 (35.10%), with substantial confusion between LIS, BSL, and NGT (Appendix A.3). It may be worth investigating whether this is due to higher inter-language iconicity.
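The R@1 and random-chance figures follow directly from the prediction counts reported in Appendix A.3, as the short check below illustrates. Only the counts come from the paper; the variable names are illustrative.

```python
# Top-1 language predictions for the A3LIS clips, as counted in Appendix A.3.
counts = {"lis": 523, "ase": 0, "dgs": 0, "lsf": 20,
          "bsl": 688, "ngt": 227, "lse": 32, "csl": 0}

total = sum(counts.values())
chance = 1 / len(counts)        # eight candidate language tags
r_at_1 = counts["lis"] / total  # share of clips whose top-ranked tag is <lis>

print(f"total clips = {total}")
print(f"random chance = {chance:.4f}")
print(f"R@1 = {r_at_1:.4f}")
```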

Conclusion

This work demonstrates that SignCLIP’s contrastive alignment induces a structurally bimodal transfer effect on LIS, beneficial for iconic vocabulary, adverse for non-iconic signs, indicating a geometric limitation of the shared embedding space paradigm rather than a data-scaling problem. Few-shot and fine-tuning strategies mitigate these limitations, confirming that the video encoder learns discriminative representations that zero-shot retrieval cannot exploit without fine-tuning in a monolingual context.

We see two promising directions for future research. Since pretraining exposure to LIS signs does not guarantee positive transfer, fine-tuning on the LIS-specific Spreadthesign subset could be adequate for OOD LIS. A more effective multilingual embedding space requires language-conditioned projections that both allow for iconicity transfer and decouple text anchors for non-iconic glosses across sign languages.

References

Appendix A Extended evaluation

A.1 Leave-one-signer-out baseline linear-probe

| Metric | Mean | Std. Dev. |
|---|---|---|
| R@1 | 0.7148 | 0.0631 |
| R@5 | 0.9403 | 0.0276 |
| R@10 | 0.9725 | 0.0176 |

Figure 1: Visualisation for Leave-one-signer-out evaluation on A3LIS on frozen baseline with linear-probe. Note: Median Rank (MedR) is excluded from the visualised profile as it achieved a stable 1.00 ± 0.00.

Signer variability presented in Figure 1 primarily degrades R@1, seen by its ±6.3% standard deviation. Broader retrieval remains robust. This variance underscores cross-signer generalisation as a persistent difficulty.
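The leave-one-signer-out protocol and its summary statistics can be sketched as follows. The signer identifiers and per-fold R@1 values are hypothetical, and the use of the sample standard deviation is an assumption, since the estimator is not stated above.

```python
from statistics import mean, stdev

def leave_one_signer_out(signers):
    """Yield (training signers, held-out signer) pairs, one fold per signer."""
    for held_out in signers:
        yield [s for s in signers if s != held_out], held_out

def summarise(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return mean(scores), stdev(scores)

signers = ["s1", "s2", "s3", "s4"]          # hypothetical signer ids
folds = list(leave_one_signer_out(signers))
toy_r1 = [0.70, 0.80, 0.65, 0.75]           # hypothetical R@1 for each held-out signer
avg, sd = summarise(toy_r1)
print(f"R@1 = {avg:.4f} +/- {sd:.4f} over {len(folds)} folds")
```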

A.2 Complete A3LIS fine-tuning ablation

In Section 4.5, we presented a condensed view of our A3LIS-147 fine-tuning ablation, highlighting the performance of the default SignCLIP objective (NCE) against our best-performing GlobalNCE regime. Table 11 presents the comprehensive results across all evaluated loss functions, batch sizes, and sampling strategies.

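| Method | Mode | R@1 | R@5 | R@10 | MedR |
|---|---|---|---|---|---|
| Baseline | Zero | 0.0369 | 0.1309 | 0.1946 | 40 |
| Baseline | Proto | 0.6477 | 0.9094 | 0.9698 | 1 |
| Baseline | LP | 0.6678 | 0.9329 | 0.9698 | 1 |
| InfoNCE 128 | Zero | 0.7248 | 0.906 | 0.9396 | 1 |
| InfoNCE 128 | Proto | 0.7584 | 0.9430 | 0.9765 | 1 |
| InfoNCE 128 | LP | 0.7617 | 0.9597 | 0.9799 | 1 |
| SupCon 32x4 | Zero | 0.5912 | 0.8591 | 0.9128 | 1 |
| SupCon 32x4 | Proto | 0.7013 | 0.9128 | 0.9664 | 1 |
| SupCon 32x4 | LP | 0.7785 | 0.9396 | 0.9765 | 1 |
| Cross-Entropy 16 | Zero | 0.0503 | 0.1611 | 0.245 | 33 |
| Cross-Entropy 16 | Proto | 0.772 | 0.946 | 0.987 | 1 |
| Cross-Entropy 16 | LP | 0.7651 | 0.9463 | 0.9799 | 1 |
| GlobalNCE 16 | Zero | 0.7617 | 0.9262 | 0.9698 | 1 |
| GlobalNCE 16 | Proto | 0.7886 | 0.9430 | 0.9732 | 1 |
| GlobalNCE 16 | LP | 0.802 | 0.943 | 0.9732 | 1 |
| ProLIP 16 | Zero | 0.7584 | 0.9161 | 0.953 | 1 |
| ProLIP 16 | Proto | 0.7718 | 0.9396 | 0.9597 | 1 |
| ProLIP 16 | LP | 0.7785 | 0.9364 | 0.9564 | 1 |
| DHN-NCE 64 | Zero | 0.7081 | 0.8926 | 0.9295 | 1 |
| DHN-NCE 64 | Proto | 0.7651 | 0.9497 | 0.9732 | 1 |
| DHN-NCE 64 | LP | 0.7617 | 0.9564 | 0.9765 | 1 |

Table 11: Complete A3LIS fine-tuning ablation. Baseline = frozen SignCLIP checkpoint; Zero = zero-shot retrieval; Proto = prototype retrieval; LP = linear-probe.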

A.3 Sign language identification scores

| Target language | Count | Proportion |
|---|---|---|
| <en> <lis> | 523 | 0.351 |
| <en> <ase> | 0 | 0 |
| <en> <dgs> | 0 | 0 |
| <en> <lsf> | 20 | 0.0134 |
| <en> <bsl> | 688 | 0.4618 |
| <en> <ngt> | 227 | 0.1523 |
| <en> <lse> | 32 | 0.0215 |
| <en> <csl> | 0 | 0 |

Table 12: Sign language identification languages and guesses for A3LIS.

Appendix B Fine-tuning configurations

B.1 SignIT with augmentation fine-tuning hyperparameters

| Parameter | Value |
|---|---|
| Base Checkpoint | signclip_v1_1 |
| Model Architecture | MMFusionSeparate |
| Video Encoder | MMBertForEncoder (12 layers, dim: 609) |
| Text Encoder | BertModel (bert-base-cased) |
| Loss Function | GlobalNCE |
| Optimiser | Adam (𝛽1=0.9, 𝛽2=0.98) |
| Base Learning Rate | 5.0e-05 |
| LR Scheduler | Polynomial Decay (122 warmup updates) |
| Weight Decay | 0.02 |
| Gradient Clipping | 2.0 (Max Norm) |
| Max Epochs | 50 |
| Batch Size | 16 |
| Precision | FP16 Mixed Precision |
| Max Sequence Length | Video: 256 frames / Text: 64 tokens |
| Pose Components | reduced_face |
| Data Augmentation | Temporal (𝜎=0.25), Spatial (𝜎=0.15), Noise (𝜎=0.002) |

B.2 A3LIS and no augmentation fine-tuning hyperparameters

Note that for ProLIP, two additional hyperparameters are set: prolip_lambda: 0.5 and prolip_lambda_mode: inv_n.

| Parameter | Value |
|---|---|
| Base Checkpoint | signclip_v1_1 |
| Model Architecture | MMFusionSeparate |
| Video Encoder | MMBertForEncoder (12 layers, dim: 609) |
| Text Encoder | BertModel (bert-base-cased) |
| Loss Function | (depends on experiment) |
| Video SupCon Weight | 0.5 |
| Optimiser | Adam (𝛽1=0.9, 𝛽2=0.98) |
| Base Learning Rate | 5.0e-05 |
| LR Scheduler | Polynomial Decay (122 warmup updates) |
| Weight Decay | 0.01 |
| Gradient Clipping | 2.0 (Max Norm) |
| Max Epochs | 50 |
| Batch Size | 16 |
| Precision | FP16 Mixed Precision |
| Max Sequence Length | Video: 256 frames / Text: 64 tokens |
| Pose Components | reduced_face |
| Data Augmentation | Temporal Augmentation Enabled |

Appendix C Zero-shot stratification

C.1 SignIT glosses by median rank

1. Great (1-3): bear, bread, color, watermelon.

2. Good (3.1-10): anger, brown, cake, chocolate, cow, fear, fuchsia, giraffe, grey, joy, light colors, orange, pizza, relatives, rooster, salt, sheep, snail, tiger, vegetable, wine.

3. Fair (10.1-25): apple, banana, bird, blue, butterfly, candy, cat, dark colors, disgust, donkey, family, fish, frog, fruit, grandfather, green, horse, light blue, lion, meat, monkey, parents, pasta, pear, pig, pineapple, pink, purple, rabbit, rice, spider, turtle, yellow, zebra.

4. Neutral / Random (25.1-47): aunt, black, brother-in-law, bull, cousin, crocodile, dad, daughter-in-law, dog, elephant, goat, goose, grandmother, milk, parrot, red, sadness, sky blue, uncle, water, wolf.

5. Adverse (47.1-93): boyfriend, brother, hen, husband, mom, mouse, nephew, sister, snake, son, son-in-law, white, wife.

C.2 A3LIS-147 glosses by median rank

1. Great (1-3): caldo, data, falconara, freddo, giudizio, iniezione, scadenza, senigallia.

2. Good (3.1-15): abitare, affitto, ancona, aperto, avviso, consegnare, dirigente, dolore, emergenza, jesi, macerata, modello, modulo, multa, notte, pomeriggio, presente, pubblica, ritirare_il_numero, sciopero, sostegno, traffico, tratta, vacanze, verde.

3. Fair (15.1-40): acqua, allegare, ambulanza, annullato, arrivo, ascoli, banca, binario, cambio, commissione, compilare, costo, cura, domenica, esame, fermo, giallo, giovedì, giorno, infermiere, infezione, istituto, marche, mattina, medico, operazione, partenza, promosso, provincia, ritardo, s.benedetto, tassa, torino, università.

4. Neutral / Random (40.1-74): abbonamento, allergia, amministrazione, andata, andata_e_ritorno, assente, assistente_alla_comunicazione, bidello, biglietto, bocciato, casa, casello, chiuso, cibo, civitanova, comune, diploma, disinfettare, fano, venerdì, giorni, ieri, laurea, litro, lunedì, martedì, mercoledì, mesi, obliterare, ospedale, pesaro-urbino, posta, rallentamenti, regione, ricevuta, ritorno, roma, rosso, segretario, sera, sindaco, stazione, strada, treno.

5. Adverse (74.1-148): asilo_nido, assessore, assistente, autostrada, domani, elementari, ente_pubblico, entro, flebo, impiegato, interprete, lingua_dei_segni, malattia, mangiare, marca_da_bollo, medie, nota, oggi, obliteratrice, orari, preside, professore, pronto_soccorso, registro, sabato, sala_d’attesa, scuola, scuola_materna, sil, superiori, sportello, studente, tecnico, telefono, ufficio_informazioni, voto.

C.3 SignIT median-rank category proportions

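| Category | Great | Good | Fair | Neutral | Adverse |
|---|---|---|---|---|---|
| animals | 1 | 6 | 14 | 8 | 3 |
| colors | 1 | 5 | 7 | 3 | 1 |
| emotions | 0 | 3 | 1 | 1 | 0 |
| family | 0 | 1 | 3 | 7 | 9 |
| food | 2 | 6 | 9 | 2 | 0 |

Figure 2: SignIT category distribution across median-rank buckets.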

C.4 A3LIS-147 median-rank category proportions

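| Category | Great | Good | Fair | Neutral | Adverse |
|---|---|---|---|---|---|
| common life | 2 | 5 | 2 | 5 | 4 |
| education | 2 | 3 | 6 | 6 | 13 |
| highway | 0 | 3 | 0 | 3 | 2 |
| hospital | 1 | 3 | 7 | 4 | 4 |
| public inst. | 3 | 8 | 8 | 11 | 8 |
| railway station | 0 | 3 | 11 | 17 | 5 |

Figure 3: A3LIS-147 category distribution across median-rank buckets.
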
Appendix D A3LIS-147 details and splits

The following table provides the full mapping used for our A3LIS-147 analysis, including category classification, presence in the SpreadTheSign (STS) corpus, and our qualitative iconicity proxy (visual similarity to English-speaking sign languages).

| Italian | English | Category | In STS? | Iconicity Proxy |
|---|---|---|---|---|
| abbonamento | subscription | railway station | yes but different | no |
| abitare | live | common life | yes | no |
| acqua | water | common life | yes | no |
| affitto | rent | common life | yes | no |
| allegare | attach | education | no | yes |
| allergia | allergy | hospital | yes | no |
| ambulanza | ambulance | hospital | yes | no |
| amministrazione | administration | public institute | yes | yes |
| ancona | ancona | public institute | no | no |
| andata | one way | railway station | no | no |
| andata_e_ritorno | round trip | railway station | no | no |
| annullato | cancelled | railway station | yes | yes |
| aperto | open | common life | yes | yes |
| arrivo | arrival | railway station | yes | no |
| ascoli | ascoli | public institute | no | no |
| asilo_nido | day nursery | education | yes but different | no |
| assente | absent | education | yes | no |
| assessore | assessor | public institute | no | no |
| assistente | assistant | public institute | yes | no |
| assistente_alla_comunicazione | communication assistant | public institute | no | no |
| autostrada | motorway | highway | yes | kind of |
| avviso | notice | education | yes | yes |
| banca | bank | public institute | yes | no |
| bidello | janitor | education | no | no |
| biglietto | ticket | railway station | yes | yes |
| binario | platform | railway station | yes but different | no |
| bocciato | failed | education | yes but different | no |
| caldo | hot | common life | yes but different | no |
| cambio | change | railway station | no | no |
| casa | home | common life | yes | no |
| casello | toll gate | highway | yes | yes |
| chiuso | closed | common life | yes but different | yes |
| cibo | food | common life | yes | yes |
| civitanova | civitanova | public institute | no | no |
| commissione | commission | education | yes but different | no |
| compilare | compile | public institute | yes | no |
| comune | municipality | public institute | yes | no |
| consegnare | deliver | common life | yes | yes |
| costo | cost | common life | yes | kind of |
| cura | care | hospital | yes | yes |
| data | date | public institute | yes | no |
| diploma | diploma | education | yes | yes |
| dirigente | executive | public institute | yes | yes |
| disinfettare | disinfect | hospital | no | no |
| dolore | pain | hospital | yes but different | no |
| domani | tomorrow | railway station | yes but different | kind of |
| domenica | sunday | railway station | yes | no |
| elementari | elementary school | education | no | no |
| emergenza | emergency | hospital | yes | no |
| ente_pubblico | public body | public institute | no | no |
| entro | within | education | no | no |
| esame | exam | education | yes | no |
| falconara | falconara | public institute | no | no |
| fano | fano | public institute | no | no |
| fermo | still | railway station | no | no |
| flebo | intravenous drip | hospital | no | no |
| freddo | cold | common life | yes | yes |
| giallo | yellow | hospital | yes | no |
| giorni | days | railway station | yes | no |
| giorno | day | railway station | yes | no |
| giovedì | thursday | railway station | yes but different | no |
| giudizio | judgement | education | no | yes |
| ieri | yesterday | railway station | yes | yes |
| impiegato | employee | public institute | yes but different | no |
| infermiere | nurse | hospital | yes but different | kind of |
| infezione | infection | hospital | no | no |
| iniezione | injection | hospital | no | yes |
| interprete | interpreter | public institute | yes | no |
| inviare_sms | messaging | common life | no | no |
| istituto | institute | education | yes | no |
| jesi | jesi | public institute | no | no |
| laurea | graduation | education | yes | no |
| lingua_dei_segni | sign language | common life | yes but different | no |
| litro | litre | common life | yes | yes |
| lunedì | monday | railway station | yes | no |
| macerata | macerata | public institute | no | no |
| malattia | illness | hospital | yes | no |
| mangiare | eat | common life | yes | yes |
| marca_da_bollo | revenue stamp | public institute | no | no |
| marche | marche | public institute | no | no |
| martedì | tuesday | railway station | yes | no |
| mattina | morning | railway station | yes | kind of |
| medico | doctor | hospital | yes | yes |
| medie | middle school | education | no | no |
| mercoledì | wednesday | railway station | yes | no |
| mesi | months | railway station | yes | no |
| modello | model | public institute | yes | no |
| modulo | form | public institute | yes | yes |
| multa | fine | highway | yes | yes |
| nota | note | education | yes | kind of |
| notte | night | railway station | yes | yes |
| obliterare | stamp | railway station | no | no |
| obliteratrice | stamping machine | railway station | no | no |
| oggi | today | railway station | yes | no |
| operazione | operation | hospital | no | no |
| orari | times | railway station | no | no |
| ospedale | hospital | hospital | yes | no |
| partenza | departure | railway station | yes | no |
| pesaro-urbino | pesaro-urbino | public institute | no | no |
| pomeriggio | afternoon | railway station | yes | yes |
| posta | mail | public institute | yes | kind of |
| presente | present | education | yes | no |
| preside | headmaster | education | yes | no |
| professore | professor | education | yes | no |
| promosso | promoted | education | no | yes |
| pronto_soccorso | first aid | hospital | yes | yes |
| provincia | province | public institute | yes but different | no |
| pubblica | public | public institute | yes | yes |
| rallentamenti | slowdowns | highway | no | yes |
| regione | region | public institute | yes | kind of |
| registro | log book | education | yes | yes |
| ricevuta | receipt | public institute | no | no |
| ritardo | delay | railway station | no | no |
| ritirare_il_numero | take the number | public institute | no | no |
| ritorno | return | railway station | no | no |
| roma | rome | public institute | yes | no |
| rosso | red | hospital | yes | kind of |
| s.benedetto | s.benedetto | public institute | no | no |
| sabato | saturday | railway station | yes | no |
| sala_d’attesa | waiting room | hospital | yes | no |
| scadenza | expiration | education | yes | no |
| sciopero | strike | railway station | yes | yes |
| scontrino | receipt | public institute | yes | kind of |
| scuola | school | education | yes | no |
| scuola_materna | nursery school | education | yes | no |
| segretario | secretary | education | yes | no |
| senigallia | senigallia | public institute | no | no |
| sera | evening | railway station | yes | kind of |
| sil | silence sign | common life | no | no |
| sindaco | mayor | public institute | yes | no |
| sostegno | aid | education | yes | kind of |
| sportello | reception window | public institute | yes | yes |
| stazione | station | railway station | yes | no |
| strada | street | highway | yes | yes |
| studente | student | education | yes | no |
| superiori | high school | education | yes | yes |
| tassa | fee | public institute | yes | kind of |
| tecnico | technician | highway | yes | yes |
| telefono | telephone | common life | yes | yes |
| torino | turin | public institute | no | no |
| traffico | traffic | highway | yes | no |
| tratta | section | highway | no | yes |
| treno | train | railway station | yes | kind of |
| ufficio_informazioni | information office | public institute | no | no |
| università | university | education | yes | no |
| vacanze | vacation | common life | yes | yes |
| venerdì | friday | railway station | yes | no |
| verde | green | hospital | yes | no |
| voto | voting | education | yes | yes |

Table 13: A3LIS-147 dataset vocabulary, STS presence, and iconicity proxy.

  1. SignIT is not the only new LIS dataset; others include MultiMedaLIS, which explores multimodal inputs, and TGLIS-227, which collects continuous data from RAI television newscasts [7], [9]. LIS is also present in multilingual datasets, e.g., Spreadthesign.
  2. SignCLIP has also been used for text alignment in continuous sign language processing, showing notable improvements with language-specific ISLR fine-tuning [13]. Hence, our work may inform future work on the alignment of continuous LIS corpora, e.g., TGLIS-227.
  3. Our own visual analysis of the dataset reveals that some background text and images corresponding to the signs’ meanings remain unblurred (e.g., a sad face drawing for an emotion sign). We believe this could favour models that use raw-video or frame data, including the LLaVA-OneVision model employed by the authors.