Initial commit

Files changed:

- README.md (+89 -17)
- requirements.txt (+2 -1)
- test_of_time_accuracy.py (+81 -32)
- tests.py (+81 -17)

README.md
CHANGED

Previous version (removed): the unfilled `evaluate` module-card template, with an empty `datasets` list and placeholder instructions in each section (e.g. "*Fill out the following subsections*", "*Give general statement of how to use the metric*", "*Cite the source where this metric was introduced.*"). It is replaced by the completed card below.

---
title: Test of Time Accuracy
datasets:
- baharef/ToT
- aauss/ToT_separate_instructions
tags:
- evaluate
- metric
- temporal reasoning
description: "Accuracy metric for the Test of Time benchmark by Fatemi et al. (2025)."
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---

# Metric Card for Test of Time Accuracy

## Metric Description

This metric is designed for the **Test of Time (ToT)** benchmark (Fatemi et al., 2025). It measures the accuracy of model predictions against reference answers. The metric expects model outputs to be formatted as a JSON object (e.g., `{"answer": "...", "explanation": "..."}`).

It performs the following steps:

1. Extracts the first valid JSON object from the model's prediction string (see the sketch below).
2. Processes the JSON based on the specified `subset`:
   - **semantic**: Extracts the value of the "answer" field.
   - **arithmetic**: Removes the "explanation" field and keeps the remaining dictionary (containing the answer).
3. Compares the processed prediction with the reference to calculate accuracy; the compared values are dictionaries for the arithmetic subset and bare answer values for the semantic subset.
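
Step 1 relies on `json.JSONDecoder.raw_decode`, as implemented in `_extract_first_json_object` in `test_of_time_accuracy.py`. A minimal standalone sketch of that step (the function name here is for illustration only):

```python
import json


def extract_first_json_object(s: str) -> dict | None:
    """Return the first JSON object found anywhere in the string, or None."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(s):
        try:
            obj, next_idx = decoder.raw_decode(s, idx)
            idx = next_idx  # skip past any non-dict JSON value (e.g. a bare string)
            if isinstance(obj, dict):
                return obj
        except ValueError:
            idx += 1  # no JSON value starts at this position; advance one character
    return None


# Leading prose before the JSON object is simply skipped:
print(extract_first_json_object('The answer is: {"answer": "352 BC", "explanation": "..."}'))
# {'answer': '352 BC', 'explanation': '...'}
```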

## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate

metric = evaluate.load("aauss/test_of_time_accuracy")

predictions = [
    '{"explanation": "Some explanation...", "unordered_list": ["London"]}',
    ' "Response without opening curly brackets...", "answer": "2005-04-07"}',
]

references = [
    '{"unordered_list": ["London"]}',
    "{'answer': '2005-04-07'}",
]

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
    )
)
# {'accuracy': 0.5}

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
        return_average=False,
    )
)
# {'accuracy': [True, False]}

predictions = [
    '{"explanation": "Some explanation leading to a wrong answer...", "answer": 1}',
    '{"explanation": "Some explanation ...", "answer": "1985"}',
]

references = ["0", "1985"]

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="semantic",
    )
)
# {'accuracy': 0.5}
```

### Inputs

- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string that contains a JSON object (e.g., generated by an LLM).
- **references** (`list` of `str`): List of reference answers.
- **subset** (`str`): The subset of the benchmark being evaluated. Must be one of:
  - `"arithmetic"`: Used for arithmetic tasks where the answer may need structure preservation (the explanation field is ignored).
  - `"semantic"`: Used for semantic tasks where only the "answer" value is compared.
- **return_average** (`bool`, optional): If `True`, returns the average accuracy. If `False`, returns a list of boolean scores (correct/incorrect) for each sample. Defaults to `True`.

### Output Values

The metric returns a dictionary with a single key:

- **accuracy** (`float` or `list` of `bool`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of booleans indicating correctness per sample if `return_average=False`.

When `return_average=True`, the accuracy can take any value between 0.0 and 1.0, inclusive; higher scores are better.

#### Values from Popular Papers

Check out the original [paper](https://openreview.net/pdf?id=44CoQe6VCq) for reference performance numbers.

## Limitations and Bias

- The metric relies on `json.JSONDecoder` to find the first JSON object in the prediction string. If the model output is malformed or does not contain a valid JSON object, extraction fails (returning `None`) and the sample is scored as incorrect (see the example below).
- It strictly expects the extracted JSON to follow the answer format described in the task, with an optional `explanation` field, for the logic to work as intended.
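
Because comparison is plain Python equality, type mismatches are also scored as incorrect (for example, an extracted integer answer against a string reference). A small illustration of both failure modes, assuming `metric` has been loaded as in the example above:

```python
results = metric.compute(
    predictions=[
        "The answer is 1985.",                     # no JSON object can be extracted
        '{"explanation": "...", "answer": 1985}',  # integer answer vs. string reference
    ],
    references=["1985", "1985"],
    subset="semantic",
    return_average=False,
)
print(results)
# {'accuracy': [False, False]}
```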

## Citation

The paper does not describe the evaluation procedure in detail, but model answers were presumably parsed in a similar way to allow for a more robust evaluation.

```bibtex
@InProceedings{huggingface:module,
  title = {Test of Time Accuracy},
  author = {Auss Abbood},
  year = {2025}
}
```

requirements.txt
CHANGED

git+https://github.com/huggingface/evaluate@main
pytest

(Adds `pytest`, used by the new tests.py below; the `evaluate` dependency is unchanged.)

test_of_time_accuracy.py
CHANGED

Removed: leftover scaffolding from the `evaluate` module template (the placeholder module docstring, empty citation and description fields, the unused BAD_WORDS_URL example constant, the template feature definition, and the template `_compute` body).

New version (the docstring lines between the two hunks are unchanged and not shown):

@@ -11,30 +11,31 @@

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Accuracy metric for the Test of Time benchmark by Fatemi et al. (2025)."""

import ast
import json
from typing import Literal

import datasets
import evaluate

_CITATION = """\
@InProceedings{huggingface:module,
  title = {Test of Time Accuracy},
  author = {Auss Abbood},
  year = {2025}
}
"""

_DESCRIPTION = """\
The Test of Time (ToT) benchmark expects models to format their answers as a JSON object with an explanation field and an answer field that follows a predefined format. The metric extracts JSON objects from the model's output, retains only the first one, drops the explanation field, and compares the remainder with the reference answer.
"""


# TODO: Add description of the arguments of the module here
_KWARGS_DESCRIPTION = """
Compares the extracted answer from the model's output with the reference answer.
Args:
    predictions: list of predictions to score. Each prediction
        should be a string containing a JSON object.

@@ -53,43 +54,91 @@ Examples:

    {'accuracy': 1.0}
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class TestofTimeAccuracy(evaluate.Metric):
    """Accuracy metric for the Test of Time benchmark by Fatemi et al. (2025)."""

    # Keep pytest from collecting this class as a test case (its name starts with "Test").
    __test__ = False

    def _info(self):
        # TODO: Specifies the evaluate.EvaluationModuleInfo object
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            # This defines the format of each prediction and reference
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                    "references": datasets.Value("string"),
                    "subset": datasets.Value("string"),
                    "return_average": datasets.Value("bool"),
                }
            ),
            # Homepage of the module for documentation
            # homepage="http://module.homepage",
            # Additional links to the codebase or references
            # codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
            # reference_urls=["http://path.to.reference.url/new_module"],
        )

    @staticmethod
    def _extract_first_json_object(s: str) -> dict | None:
        # Scan the string and return the first decodable JSON object (dict), if any.
        decoder = json.JSONDecoder()
        idx, end = 0, len(s)
        while idx < end:
            try:
                obj, next_idx = decoder.raw_decode(s, idx)
                idx = next_idx
                if isinstance(obj, dict):
                    return obj
            except ValueError:
                idx += 1
        return None

    @staticmethod
    def _pop_explanation(d):
        if isinstance(d, dict):
            d.pop("explanation", None)
        return d

    @staticmethod
    def _get_answer(d):
        if isinstance(d, dict):
            return d.get("answer", None)
        return d

    @staticmethod
    def _parse_label(s):
        """Parses a string that could be a JSON object or a Python dict."""
        try:
            return json.loads(s)
        except json.JSONDecodeError:
            try:
                # Safe: only parses literals, does not execute code
                return ast.literal_eval(s)
            except (ValueError, SyntaxError):
                return None

    def _compute(
        self,
        predictions,
        references,
        subset: Literal["arithmetic", "semantic"],
        return_average: bool = True,
    ):
        """Returns the scores"""
        predictions = [self._extract_first_json_object(p) for p in predictions]
        if subset == "semantic":
            predictions = [self._get_answer(p) for p in predictions]
        elif subset == "arithmetic":
            predictions = [self._pop_explanation(p) for p in predictions]
            references = [self._parse_label(r) for r in references]
        else:
            raise ValueError(f"Invalid subset: {subset}")
        accuracy = [i == j for i, j in zip(predictions, references)]
        if return_average:
            return {"accuracy": sum(accuracy) / len(accuracy)}
        return {"accuracy": accuracy}

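The reference parser accepts either JSON or Python-literal syntax, which is why references like `"{'answer': '2005-04-07'}"` (single quotes) still work for the arithmetic subset. A short illustration that calls the helper directly (for demonstration only; `_parse_label` is an internal static method of the class above):

```python
from test_of_time_accuracy import TestofTimeAccuracy

# Valid JSON parses via json.loads.
print(TestofTimeAccuracy._parse_label('{"answer": "352 BC"}'))
# {'answer': '352 BC'}

# Single-quoted, Python-style dicts fall back to ast.literal_eval.
print(TestofTimeAccuracy._parse_label("{'answer': '2005-04-07'}"))
# {'answer': '2005-04-07'}

# Anything else returns None.
print(TestofTimeAccuracy._parse_label("not a dict"))
# None
```
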
tests.py
CHANGED

Removed: the previous 17-line tests.py, replaced in full by the test suite below.

import pytest

from test_of_time_accuracy import TestofTimeAccuracy

# The third prediction has no opening brace, so no JSON object can be extracted and it scores as incorrect.
arithmetic_test_cases = {
    "predictions": [
        'JSON = {"explanation": "The war began in 360 BC. Since BC years count backwards, adding 8 years to 360 BC means subtracting 8 from 360, resulting in 352 BC.", "answer": "352 BC"}',
        '```json\n{\n "explanation": "The dates provided are March 2012, September 2011, June 2017, September 2019, and June 2015. These correspond to visits to Miami, Sydney, Tokyo, London, and Nairobi respectively. The latest date among these is September 2019, which is associated with London. Therefore, London is the last city visited.",\n "unordered_list": ["London"]\n}\n```',
        ' "To find the date of the second most important game, we need to subtract 7 days from the date of the most important game. We can do this by counting back 7 days from April 14, 2005. April 14 - 7 days = April 7, 2005", "answer": "2005-04-07"}',
    ],
    "references": [
        '{"answer": "352 BC"}',
        '{"unordered_list": ["London"]}',
        "{'answer': '2005-04-07'}",
    ],
    "result": {"accuracy": 2 / 3},
    "per_item_accuracy": [True, True, False],
}

# The first two predictions have no opening brace, so extraction fails and they score as incorrect.
semantic_test_cases = {
    "predictions": [
        ' "First, we need to find the third occurrence of E33 being the R53 of E22. We can see that it happened from 1959 to 1962, then from 1967 to 1968, and then from 1982 to 1984. The third occurrence happened from 1982 to 1984. We can then compute the duration by subtracting the start time from the end time.", "answer": 2}',
        ' "To find the duration, we need to find the start and end time when E97 was the R71 of E67. From the given facts, we can see that E97 was the R71 of E67 from 1961 to 1961, and also from 1964 to 1964. We need to find the first occurrence, which is from 1961 to 1961.", "answer": 1}',
        '{"explanation": "To find when E92 stopped being the R88 of E11, we need to look at the temporal facts where E92 was the R88 of E11 and find the end time. We see that E92 was the R88 of E11 from 1982 to 1985, and there is no other fact that indicates E92 stopped being the R88 of E11 before 1985. However, we also see that E92 was the R17 of E42 from 1986 to 1992, and E92 was the R88 of E42 from 1977 to 1979, but this is irrelevant to the question. Therefore, E92 stopped being the R88 of E11 in 1985.", "answer": "1985"}',
    ],
    "references": ["2", "0", "1985"],
    "result": {"accuracy": 1 / 3},
    "per_item_accuracy": [False, False, True],
}


def test_arithmetic_accuracy():
    metric = TestofTimeAccuracy()
    results = metric.compute(
        predictions=arithmetic_test_cases["predictions"],
        references=arithmetic_test_cases["references"],
        subset="arithmetic",
    )
    assert results == arithmetic_test_cases["result"]


def test_semantic_accuracy():
    metric = TestofTimeAccuracy()
    results = metric.compute(
        predictions=semantic_test_cases["predictions"],
        references=semantic_test_cases["references"],
        subset="semantic",
    )
    assert results == semantic_test_cases["result"]


def test_per_item_arithmetic_accuracy():
    metric = TestofTimeAccuracy()
    results = metric.compute(
        predictions=arithmetic_test_cases["predictions"],
        references=arithmetic_test_cases["references"],
        subset="arithmetic",
        return_average=False,
    )
    assert results["accuracy"] == arithmetic_test_cases["per_item_accuracy"]


def test_per_item_semantic_accuracy():
    metric = TestofTimeAccuracy()
    results = metric.compute(
        predictions=semantic_test_cases["predictions"],
        references=semantic_test_cases["references"],
        subset="semantic",
        return_average=False,
    )
    assert results["accuracy"] == semantic_test_cases["per_item_accuracy"]


def test_invalid_subset():
    metric = TestofTimeAccuracy()
    with pytest.raises(ValueError):
        metric.compute(
            predictions=arithmetic_test_cases["predictions"],
            references=arithmetic_test_cases["references"],
            subset="invalid",
        )