---
title: MeaningBERT
emoji: π¦
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  MeaningBERT is an automatic and trainable metric for assessing meaning
  preservation between sentences.
  See the project's README at
  https://github.com/GRAAL-Research/MeaningBERT/tree/main for more information.
---
# Here is MeaningBERT
MeaningBERT is an automatic and trainable metric for assessing meaning preservation between sentences. It was proposed
in our
article [MeaningBERT: assessing meaning preservation between sentences](https://www.frontiersin.org/articles/10.3389/frai.2023.1223924/full).
Its goal is to produce meaning preservation scores that correlate highly with human judgments and pass sanity checks.
For more details, refer to our publicly available article.
> This public version of our model uses the single best model (whereas our article reports performance averaged over
> 10 models), trained for a longer period (1,000 epochs instead of 250). We later observed that the longer training
> further reduces the dev loss and improves performance.
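As a quick illustration, the metric can be loaded through the Hugging Face `evaluate` library. The sketch below is an assumption about this Space's interface: the module identifier (`davebulaval/meaningbert`), the argument names (`documents`, `simplifications`), and the output key are not confirmed here, so check the project's README if they differ.

```python
# Minimal usage sketch (assumed interface, verify against the project's README):
# ratings are on a 0-100 scale, where 100 means full meaning preservation.
import evaluate

meaning_bert = evaluate.load("davebulaval/meaningbert")

documents = ["He wanted to make them pay.", "This sandwich looks delicious."]
simplifications = ["He wanted to make them pay.", "An apple a day keeps the doctor away."]

result = meaning_bert.compute(documents=documents, simplifications=simplifications)
print(result)  # e.g. {"scores": [99.6, 3.1]} -- illustrative values only
```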
## Sanity Check

Correlation to human judgment is one way to evaluate the quality of a meaning preservation metric.
However, it is inherently subjective, since it uses human judgment as a gold standard, and expensive, since it requires
a large dataset annotated by several humans. As an alternative, we designed two automated tests: evaluating meaning
preservation between identical sentences (which should be 100% preserving) and between unrelated sentences (which
should be 0% preserving).
In these tests, the meaning preservation target value is not subjective and does not require human annotation to
measure. They represent a trivial and minimal threshold a good automatic meaning preservation metric should be able to
achieve. Namely, a metric should be minimally able to return a perfect score (i.e., 100%) if two identical sentences are
compared and return a null score (i.e., 0%) if two sentences are completely unrelated.
### Identical sentences

The first test evaluates meaning preservation between identical sentences. To analyze the metrics' capabilities to pass
this test, we count the number of times a metric rating was greater than or equal to a threshold value X ∈ [95, 99] and
divide it by the number of sentences to create a ratio of the number of times the metric gives the expected rating. To
account for computer floating-point inaccuracy, we round the ratings to the nearest integer and do not use a threshold
value of 100%.
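A minimal sketch of this ratio computation, assuming ratings are already available as floats on a 0-100 scale (the function name below is illustrative, not part of the released code):

```python
# Hypothetical sketch of the identical-sentences sanity check described above:
# ratings are MeaningBERT scores for (sentence, same sentence) pairs, 0-100 scale.
def identical_sentences_ratio(ratings: list[float], threshold: int = 95) -> float:
    """Fraction of ratings that round to at least `threshold`, with threshold in [95, 99]."""
    passed = sum(1 for rating in ratings if round(rating) >= threshold)
    return passed / len(ratings)

# Example: 3 of 4 rounded ratings reach the 95 threshold -> ratio of 0.75.
print(identical_sentences_ratio([99.7, 96.2, 94.4, 100.0], threshold=95))
```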
### Unrelated sentences

Our second test evaluates meaning preservation between a source sentence and an unrelated sentence generated by a large
language model. The idea is to verify that the metric finds a meaning preservation rating of 0 when given a completely
irrelevant sentence mainly composed of irrelevant words (also known as word soup). Since this test's expected rating is
0, we check that the metric rating is lower than or equal to a threshold value X ∈ [1, 5].
Again, to account for computer floating-point inaccuracy, we round the ratings to the nearest integer and do not use
a threshold value of 0%.
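The same kind of sketch for the unrelated-sentences check, again with an illustrative function name and assumed 0-100 ratings:

```python
# Hypothetical sketch of the unrelated-sentences sanity check: ratings are
# MeaningBERT scores for (source sentence, unrelated "word soup" sentence) pairs.
def unrelated_sentences_ratio(ratings: list[float], threshold: int = 5) -> float:
    """Fraction of ratings that round to at most `threshold`, with threshold in [1, 5]."""
    passed = sum(1 for rating in ratings if round(rating) <= threshold)
    return passed / len(ratings)

# Example: 2 of 3 rounded ratings fall at or below the 5 threshold -> ratio of about 0.67.
print(unrelated_sentences_ratio([2.3, 4.9, 12.8], threshold=5))
```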
## Cite

Use the following citation to cite MeaningBERT:

```
@ARTICLE{10.3389/frai.2023.1223924,
AUTHOR={Beauchemin, David and Saggion, Horacio and Khoury, Richard},
TITLE={MeaningBERT: assessing meaning preservation between sentences},
JOURNAL={Frontiers in Artificial Intelligence},
VOLUME={6},
YEAR={2023},
URL={https://www.frontiersin.org/articles/10.3389/frai.2023.1223924},
DOI={10.3389/frai.2023.1223924},
ISSN={2624-8212},
}
```
## License

MeaningBERT is MIT licensed, as found in
the [LICENSE file](https://github.com/GRAAL-Research/risc/blob/main/LICENSE).