metametric provides a Python decorator (@metametric) for automatically deriving a metric given an arbitrary dataclass D. Practically speaking, this instantiates a new Metric object based on the dataclass definition and on the arguments passed to the decorator, and assigns this object to a new metric class attribute (or latent_metric if the class has latent variables). To compute the derived metric for a pair of objects p and r of type D, one then need only call D.metric.score(p,r).

The decorator takes two parameters, normalizer and constraint, which we detail below. We also provide an example of its use.


Normalizers

The normalizer parameter specifies how the raw score computed by the metric should be normalized. As illustrated in the paper associated with this package, these normalizers can be understood in terms of the overlap (\(\Sigma\)) between a predicted set \(P\) and a reference set \(R\), where \(P, R \subseteq X\) for some set of discrete elements \(X\):

\[\Sigma_\delta(P,R) = \lvert P \cap R \rvert\]

where \(\delta\) is the similarity function used to compare elements of \(X\). (The expression above corresponds to the simplest case, in which \(\delta\) is an exact-match indicator.)

Currently, the following choices are supported:

  • none (default): No normalization is applied; the raw metric score is returned as-is.
  • precision: Standard precision, i.e., the overlap normalized by the size of the predicted set, \(P\). Formally, \(\mathrm{P}(P,R) = \frac{\lvert P \cap R \rvert}{\lvert P \rvert} = \frac{\Sigma_\delta(P,R)}{\Sigma_\delta(P,P)}\).
  • recall: Standard recall, i.e., the overlap normalized by the size of the reference set, \(R\). Formally, \(\mathrm{R}(P,R) = \frac{\lvert P \cap R \rvert}{\lvert R \rvert} = \frac{\Sigma_\delta(P,R)}{\Sigma_\delta(R,R)}\).
  • jaccard: The Jaccard similarity or intersection-over-union, i.e., the overlap of \(P\) and \(R\) normalized by the size of their union. Formally, \(\mathrm{J}(P,R) = \frac{\lvert P \cap R \rvert}{\lvert P \cup R \rvert} = \frac{\Sigma_\delta(P,R)}{\Sigma_\delta(P,P) + \Sigma_\delta(R,R) - \Sigma_\delta(P,R)}\).
  • dice: The dice score, more commonly known as the \(\rm F_1\) score, i.e., \(\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\).
  • f{beta}: \(\rm F_\beta\) score, or generalized \(\rm F\) score, where \(\beta\) is a positive real number that indicates the relative weighting of precision vs. recall: \((1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}\). Note that \(\beta = 1\) recovers the dice score. Any positive float may be used for {beta}, e.g., f0.5, f2, etc. (A short worked example follows this list.)
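
To make the relationships among these normalizers concrete, consider a small hypothetical case in which the predicted set has 2 elements, the reference set has 3, and 2 elements match exactly, so the raw overlap is 2. Then:

\[\mathrm{P} = \frac{2}{2} = 1, \qquad \mathrm{R} = \frac{2}{3}, \qquad \mathrm{J} = \frac{2}{2 + 3 - 2} = \frac{2}{3}, \qquad \mathrm{F}_1 = \frac{2 \cdot 1 \cdot \frac{2}{3}}{1 + \frac{2}{3}} = 0.8\]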


Constraints

The constraint parameter specifies restrictions on the matching (i.e., the alignment) between predicted and reference objects of the dataclass's type. The following choices are supported, and each can be written in one of two ways (a brief illustrative sketch follows the list):

  • One-to-One (<-> or 1:1; default): this specifies a partial bijection constraint: each predicted object can be aligned to at most one reference object, and vice-versa. The overwhelming majority of metrics impose this constraint, and so it is the default option.
  • One-to-Many (-> or 1:*): this specifies a (non-bijective) partial function from predicted objects to reference objects: each predicted object can be aligned to at most one reference object, but the same reference object can potentially be aligned to multiple predicted ones.
  • Many-to-One (<- or *:1): this specifies a (non-bijective) partial function from reference objects to predicted objects: each reference object can be aligned to at most one predicted object, but the same predicted object can potentially be aligned to multiple reference ones. (N.B.: while we provide support for this constraint, we aren't aware of actual metrics that impose it.)
  • No Constraints (~ or *:*): this specifies a generic relation: each predicted object can be aligned to multiple reference objects, and vice-versa.
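
To illustrate the syntax, here is a minimal sketch of selecting a non-default constraint. The Span dataclass below is purely hypothetical and is shown only to illustrate how the constraint argument is passed; either spelling of the constraint string may be used:

@metametric(normalizer="none", constraint="->")  # equivalently, constraint="1:*"
@dataclass(eq=True, frozen=True)
class Span:  # hypothetical dataclass, for illustration only
    left: int
    right: int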

Example: Event Trigger F1

Here, we show an example of how to use the decorator to automatically derive a metric for a dataclass — specifically, \(\rm F_1\) (dice score), commonly used for event extraction.

An event trigger is just a word or phrase (i.e., a mention) in a passage of text that evokes an event, like "kick" or "bombing", and that is associated with some event type. First, we'll define a dataclass for mentions:

from dataclasses import dataclass
# (the @metametric decorator itself is imported from the metametric package)

@metametric(normalizer="none", constraint="<->")
@dataclass(eq=True, frozen=True)
class Mention:
    left: int  # left character offset of the mention (inclusive)
    right: int  # right character offset of the mention (inclusive)

The dataclass just has two attributes — a left index and a right index — indicating the character offsets of the start and end of the mention within the passage of text. (We assume here that both are inclusive, though they need not be.) Note that above the dataclass decorator, we have also added the metametric decorator, using the default values for the normalizer and constraint parameters (we could just as well have written @metametric(), but have spelled out the defaults explicitly for clarity). As discussed above, this sets a new metric attribute on the Mention dataclass. In this case, it's just about the simplest metric you could have — an indicator function (or Kronecker delta) that returns 1 if two mentions have the same left and right offsets, and 0 otherwise. Let's try it out:

m1 = Mention(1, 2)
m2 = Mention(1, 2)
m3 = Mention(1, 3)

Mention.metric.score(m1, m2)  # returns 1.0, since m1 == m2
Mention.metric.score(m1, m3)  # returns 0.0, since m1 != m3

You might wonder why one would go to all this trouble for such simple functionality. The value of the @metametric decorator becomes more apparent when working with more complex dataclasses, where some fields may themselves be dataclasses. The Trigger dataclass, which is just an event-denoting Mention paired with its type, is an example of this:

@metametric(normalizer="none", constraint="<->")
@dataclass(eq=True, frozen=True)
class Trigger:
    mention: Mention  # the span of text that evokes the event
    type: str  # the event type

The decorator call is the same as above, but the derived metric for Trigger will recursively evaluate the mention field using the metric already derived for the Mention dataclass:

# m1, m2, m3 are as defined above
t1 = Trigger(m1, "foo")
t2 = Trigger(m2, "foo")
t3 = Trigger(m3, "foo")

Trigger.metric.score(t1, t2)  # returns 1.0, since m1 == m2 and t1.type == t2.type
Trigger.metric.score(t1, t3)  # returns 0.0, since m1 != m3 (though t1.type == t3.type)

Setting aside the problem of argument extraction, let's imagine that the output for our trigger extraction task is just a collection of Triggers. We can define a final dataclass for storing these outputs:

from typing import Collection

@metametric(normalizer="f1", constraint="<->")
@dataclass
class TriggerExtractionOutput:
    triggers: Collection[Trigger]  # all event triggers extracted from a passage

This gives us our trigger \(\rm F_1\) score. We can now compute it as follows:

# the predicted event triggers (supposing our system predicts t1 and t2 only)
# triggers t1, t2, and t3 are as defined above
predicted_triggers = [t1, t2]
predictions = TriggerExtractionOutput(predicted_triggers)

# the reference event triggers (t1, t2, and t3)
reference_triggers = [t1, t2, t3]
references = TriggerExtractionOutput(reference_triggers)

# compute F1 score for predictions against the references
TriggerExtractionOutput.metric.score(predictions, references)  # returns 0.8
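
Both predicted triggers exactly match a reference trigger, so precision is \(2/2 = 1\) and recall is \(2/3\), giving \(\mathrm{F}_1 = \frac{2 \cdot 1 \cdot 2/3}{1 + 2/3} = 0.8\), the value returned above.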