Proposing an evaluation for Arabic morphological reasoning and derivation generation

#27
by Rorro2024 - opened

Hello Everyone,

I am proposing a new evaluation for LLMs that tests their inherent understanding of the Arabic language. The model under evaluation would not be connected to the web and would rely only on its weights and reasoning to perform advanced Arabic-related tasks. Passing the evaluation would then indicate that the model comprehends the Arabic language.

The evaluation I am proposing is a two-way evaluation:

  1. Given Arabic words, alone or within sentences, the model is asked to analyze the words via lemmatization and produce all possible roots.
  2. Given Arabic words or roots, the model is asked to produce other words that are derived from the given roots or from the roots of the given words.
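
To make the two directions concrete, here is a minimal scoring sketch in Python. It is only an illustration under stated assumptions: the function names (`normalize`, `set_scores`), the choice of set-based precision/recall/F1, and the tiny gold lists are hypothetical and not part of any existing benchmark. A closed gold list would also penalize valid derivations that happen to be missing from it, so a real evaluation would likely need a broader lexicon or human review.

```python
def normalize(text: str) -> str:
    """Keep letters only, so spaces, hyphens and diacritics do not affect matching."""
    return "".join(ch for ch in text if ch.isalpha())


def set_scores(predicted, gold):
    """Compare the model's answers with the gold answers as sets (assumed metric)."""
    pred = {normalize(p) for p in predicted} - {""}
    ref = {normalize(g) for g in gold} - {""}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Direction 1: word -> roots. The word "كاتب" (writer) has the root ك ت ب.
root_result = set_scores(predicted=["ك ت ب"], gold=["كتب"])

# Direction 2: root -> derived words. Hypothetical, incomplete gold list for ك ت ب.
gold_derivations = ["كتاب", "كاتب", "مكتبة", "مكتوب"]
model_derivations = ["كتاب", "مكتبة", "مدرسة"]  # "مدرسة" belongs to the root د ر س
derivation_result = set_scores(model_derivations, gold_derivations)

print(root_result)
print(derivation_result)
```

Treating both directions as set comparisons keeps the scoring identical for roots and derivations; whether partial credit or morphological pattern checks are needed is an open design question.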

This evaluation would require a corresponding dataset. The dataset should not be known a priori; otherwise, all models could be trained on the known dataset, and the evaluation would become a competition on training rather than on the model itself. While training is important, in my opinion, it is not a direct measure of the generalizability of the model unless the evaluation dataset is new and not part of the training data.

If people with expertise in LLM evaluations want to collaborate on this type of evaluation, I am more than happy to work with them. I can work on preparing a dataset, and I am eager to learn about LLM evaluations.

Thank you
Ramzi

Open Arabic LLM Leaderboard org
edited 3 days ago

Hey @Rorro2024, I would love to work on such an interesting exercise myself. Unfortunately, I have very limited bandwidth lately. Perhaps we can start on it but take it a bit slow?
On the other hand, I know @basma-b and her team are very active in Arabic evals research, so I am looping her in as well in case she's interested.

Thank you @alielfilali01 for your reply and interest. Same here (regarding the limited bandwidth). I agree that we could start and progress slowly. I can possibly start collecting some words and their roots, although I am not yet 100% sure how to design and format the dataset as I collect and prepare it for testing. Feedback from people with expertise in LLM evals for Arabic would be very interesting and helpful. I guess we will eventually learn by doing.
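
As a starting point for the format question, one option would be one JSON object per line (JSON Lines), with a record per test item. The sketch below is only a suggestion: the `MorphItem` class and all of its field names (`item_id`, `task`, `prompt`, `context`, `gold`) are hypothetical and would need to be agreed on with people who run the evaluations.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class MorphItem:
    """One test item; every field name here is an assumption, not a standard."""
    item_id: str
    task: str                # "root_extraction" or "derivation_generation"
    prompt: str              # the word or root shown to the model
    context: str = ""        # optional sentence containing the word
    gold: list = field(default_factory=list)  # accepted roots or derived words


def to_jsonl(items):
    # ensure_ascii=False keeps the Arabic text readable in the file
    return "\n".join(json.dumps(asdict(i), ensure_ascii=False) for i in items)


def from_jsonl(text):
    return [MorphItem(**json.loads(line)) for line in text.splitlines() if line.strip()]


items = [
    MorphItem("r-0001", "root_extraction", "مكتبة", gold=["كتب"]),
    MorphItem("d-0001", "derivation_generation", "درس", gold=["مدرسة", "مدرس", "دراسة"]),
]

serialized = to_jsonl(items)
restored = from_jsonl(serialized)
print(serialized)
```

Keeping both task directions in one schema, distinguished by the `task` field, would let the same file feed both parts of the evaluation.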
