Decorator¶
metametric provides a Python decorator (@metametric) for automatically deriving a metric given an arbitrary dataclass D. Practically speaking, this instantiates a new Metric object based on the dataclass definition and on the arguments passed to the decorator, and assigns this object to a new metric class attribute (or latent_metric if the class has latent variables). To compute the derived metric for a pair of objects p and r of type D, one then need only call D.metric.score(p, r).
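For example, here is a minimal sketch of the pattern (the Span dataclass and its fields are purely illustrative and not part of the package; a fuller walkthrough appears in the example below):

from dataclasses import dataclass
# (the @metametric decorator itself is imported from the metametric package)

@metametric()  # equivalent to the defaults, normalizer="none" and constraint="<->"
@dataclass(eq=True, frozen=True)
class Span:
    start: int
    end: int

p = Span(0, 3)
r = Span(0, 3)
Span.metric.score(p, r)  # should return 1.0, since p == r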
The decorator takes two parameters, normalizer and constraint, which we detail below. We also provide an example of its use.
Normalizer¶
The normalizer parameter specifies how the raw score computed by the metric should be normalized. As illustrated in the paper associated with this package, these normalizers can be understood in terms of the overlap (\(\Sigma\)) between a predicted set \(P\) and a reference set \(R\), where \(P, R \subseteq X\) for some set of discrete elements \(X\):

\[
\Sigma_\delta(P, R) = \max_{M} \sum_{(p,\, r) \in M} \delta(p, r),
\]

where the maximum ranges over alignments \(M \subseteq P \times R\) permitted by the metric's constraint (see below), and \(\delta\) is the similarity function used for elements of \(X\).
Currently, the following choices are supported:
- none (default): No normalization is applied to the raw metric score.
- precision: Standard precision, i.e., the overlap normalized by the size of the predicted set \(P\). Formally, \(\mathrm{P}(P,R) = \frac{\lvert P \cap R \rvert}{\lvert P \rvert} = \frac{\Sigma_\delta(P,R)}{\Sigma_\delta(P,P)}\).
- recall: Standard recall, i.e., the overlap normalized by the size of the reference set \(R\). Formally, \(\mathrm{R}(P,R) = \frac{\lvert P \cap R \rvert}{\lvert R \rvert} = \frac{\Sigma_\delta(P,R)}{\Sigma_\delta(R,R)}\).
- jaccard: The Jaccard similarity or intersection-over-union, i.e., the overlap of \(P\) and \(R\) normalized by the size of their union. Formally, \(\mathrm{J}(P,R) = \frac{\lvert P \cap R \rvert}{\lvert P \cup R \rvert} = \frac{\Sigma_\delta(P,R)}{\Sigma_\delta(P,P) + \Sigma_\delta(R,R) - \Sigma_\delta(P,R)}\).
- dice: The Dice score, more commonly known as the \(\rm F_1\) score, i.e., \(\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\).
- f{beta}: The \(\rm F_\beta\) score, or generalized \(\rm F\) score, where \(\beta\) is a positive real number that indicates the relative weighting of precision vs. recall: \((1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}\). Note that \(\beta = 1\) recovers the dice score. Any positive float may be used for {beta}, e.g., f0.5, f2, etc.
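To make these concrete, here is a small worked example of our own (assuming \(\delta\) is the Kronecker delta and a one-to-one matching, so that \(\Sigma_\delta(P,R) = \lvert P \cap R \rvert\)): let \(P = \{a, b\}\) and \(R = \{a, b, c\}\). Then \(\Sigma_\delta(P,R) = 2\), \(\Sigma_\delta(P,P) = 2\), and \(\Sigma_\delta(R,R) = 3\), so precision is \(2/2 = 1\), recall is \(2/3\), the Jaccard similarity is \(\frac{2}{2 + 3 - 2} = 2/3\), and the dice (\(\rm F_1\)) score is \(\frac{2 \cdot 1 \cdot 2/3}{1 + 2/3} = 0.8\).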
Constraint¶
The constraint parameter specifies restrictions on the matching (i.e., the alignment) between predicted and reference objects of the dataclass's type. The following choices are supported; each choice can be written in one of two ways:
- One-to-One (<-> or 1:1; default): specifies a partial bijection constraint: each predicted object can be aligned to at most one reference object, and vice versa. The overwhelming majority of metrics impose this constraint, and so it is the default option.
- One-to-Many (-> or 1:*): specifies a (non-bijective) partial function from predicted objects to reference objects: each predicted object can be aligned to at most one reference object, but the same reference object can potentially be aligned to multiple predicted ones.
- Many-to-One (<- or *:1): specifies a (non-bijective) partial function from reference objects to predicted objects: each reference object can be aligned to at most one predicted object, but the same predicted object can potentially be aligned to multiple reference ones. (N.B.: while we provide support for this constraint, we aren't aware of actual metrics that impose it.)
- No Constraints (~ or *:*): specifies a generic relation: each predicted object can be aligned to multiple reference objects, and vice versa.
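To see how these constraints differ, consider a small illustration of our own (assuming the overlap is computed by summing \(\delta\) over the aligned pairs): suppose the prediction contains two objects \(p_1\) and \(p_2\), both identical to the single reference object \(r\). Under the one-to-one constraint, only one of \(p_1, p_2\) can be aligned to \(r\), so the overlap is 1; under the one-to-many constraint (->), both can be aligned to \(r\), giving an overlap of 2; under the many-to-one constraint (<-), \(r\) can be aligned to at most one of them, so the overlap is again 1; and with no constraints, the overlap is 2.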
Example: Event Trigger F1¶
Here, we show an example of how to use the decorator to automatically derive a metric for a dataclass: specifically, trigger \(\rm F_1\) (the dice score), commonly used for event extraction.
An event trigger is just a word or phrase (i.e., a mention) in a passage of text that evokes an event, like "kick" or "bombing", and that is associated with some event type. First, we'll define a dataclass for mentions:
from dataclasses import dataclass
# (the @metametric decorator itself is assumed to be imported from the metametric package)

@metametric(normalizer="none", constraint="<->")
@dataclass(eq=True, frozen=True)
class Mention:
    left: int   # left character offset of the mention (inclusive)
    right: int  # right character offset of the mention (inclusive)
The dataclass has just two attributes, a left index and a right index, indicating the character offsets of the start and end of the mention within the passage of text. (We assume here they are both inclusive, though they need not be.) Note that above the dataclass decorator, we have also added the metametric decorator, using the default values for the normalizer and constraint parameters (we could just as well have written @metametric(), but have written out the defaults explicitly for clarity). As discussed above, this sets a new metric attribute on the Mention dataclass. In this case, it's just about the simplest metric you could have: an indicator function (or Kronecker delta) that returns 1 if two mentions have the same left and right offsets, and 0 otherwise. Let's try it out:
m1 = Mention(1, 2)
m2 = Mention(1, 2)
m3 = Mention(1, 3)

Mention.metric.score(m1, m2)  # returns 1.0, since m1 == m2
Mention.metric.score(m1, m3)  # returns 0.0, since m1 != m3
You might wonder why one would go to all this trouble for such simple functionality. The value of the @metametric decorator becomes more apparent when working with more complex dataclasses, where some fields may themselves be dataclasses. The Trigger dataclass, which is just an event-denoting Mention paired with its type, is an example of this:
@metametric(normalizer="none", constraint="<->")
@dataclass
class Trigger:
    mention: Mention  # the event-denoting mention
    type: str         # the event type
The decorator is the same as above, but the automatically derived metric for Triggers will recursively evaluate the mention field using the automatically derived metric for the Mention dataclass:
# m1, m2, m3 are as defined above
t1 = Trigger(m1, "foo")
t2 = Trigger(m2, "foo")
t3 = Trigger(m3, "foo")

Trigger.metric.score(t1, t2)  # returns 1.0, since m1 == m2 and t1.type == t2.type
Trigger.metric.score(t1, t3)  # returns 0.0, since m1 != m3 (though t1.type == t3.type)
Setting aside the problem of argument extraction, let's imagine that the output for our trigger extraction task is just a collection of Triggers. We can define a final dataclass for storing these outputs:
from typing import Collection

@metametric(normalizer="f1", constraint="<->")
@dataclass
class TriggerExtractionOutput:
    triggers: Collection[Trigger]
This gives us our trigger \(\rm F_1\) score. We can now compute it as follows:
# the predicted event triggers (supposing our system predicts t1 and t2 only)
# triggers t1, t2, and t3 are as defined above
predicted_triggers = [t1, t2]
predictions = TriggerExtractionOutput(predicted_triggers)
# the reference event triggers (t1, t2, and t3)
reference_triggers = [t1, t2, t3]
references = TriggerExtractionOutput(reference_triggers)
# compute F1 score for predictions against the references
TriggerExtractionOutput.metric.score(predictions, references) # returns 0.8
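If precision or recall is wanted alongside \(\rm F_1\), one option is to decorate an otherwise identical dataclass with a different normalizer. Below is a sketch under that assumption (the class name TriggerExtractionOutputPrecision is ours, not part of the package):

@metametric(normalizer="precision", constraint="<->")
@dataclass
class TriggerExtractionOutputPrecision:
    triggers: Collection[Trigger]

# for the same predicted and reference triggers as above, precision should be
# Sigma(P,R) / Sigma(P,P) = 2/2 = 1.0
TriggerExtractionOutputPrecision.metric.score(
    TriggerExtractionOutputPrecision(predicted_triggers),
    TriggerExtractionOutputPrecision(reference_triggers),
)

Swapping in normalizer="recall" would analogously yield \(2/3\) here.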