Measures
Warning
cohort-extractor is now deprecated. All new projects should use ehrQL to extract data from an OpenSAFELY database.
The Measures framework allows you to extract multiple datasets covering different time periods, and calculates a set of measures for each period.
Measures are expressed as a quotient (i.e., a numerator divided by a denominator) which in practice could be used to calculate proportions, ratios, means, counts, and so on.
For example, we may wish to calculate, for each month in 2020 and each STP (administrative health regions in England), the proportion of patients who were admitted to hospital at least once and the proportion of patients who died.
Without Measures, this would require creating a set of variables (or a study definition) for each month of interest containing: the region; whether or not they were admitted to hospital; whether or not they died, and we would then need to use these datasets to calculate the desired proportion and aggregate the results.
Measures makes this process simple.
Defining measuresπ
Generating measures is a three step process:
- Define a study definition that includes a
measures
variable, which should be a list of calls to theMeasure()
function. - Extract the data by running
generate_cohort
using the--index-date-range
option to cover the range of time periods we want to calculate the measure for. - Calculate the measures by running
generate_measures
which takes the files extracted in step 2 and produces files likemeasure_<measure_id>.csv.gz
.
Define a study definitionπ
The study_definition.py
script for the example above is:
from cohortextractor import StudyDefinition, Measure, patients
study = StudyDefinition(
# Configure the expectations framework
default_expectations={
"date": {"earliest": "2020-01-01", "latest": "today"},
"rate": "exponential_increase",
"incidence": 0.2,
},
index_date="2020-01-01",
population=patients.registered_as_of("index_date"),
stp=patients.registered_practice_as_of(
"index_date",
returning="stp_code",
return_expectations={
"category": {"ratios": {"stp1": 0.1, "stp2": 0.2, "stp3": 0.7}},
"incidence": 1,
},
),
sex=patients.sex(
return_expectations={
"rate": "universal",
"category": {"ratios": {"M": 0.49, "F": 0.51}},
}
),
admitted=patients.admitted_to_hospital(
returning="binary_flag",
between=["index_date", "last_day_of_month(index_date)"],
return_expectations={"incidence": 0.1},
),
died=patients.died_from_any_cause(
between=["index_date", "last_day_of_month(index_date)"],
returning="binary_flag",
return_expectations={"incidence": 0.05},
),
)
measures = [
Measure(
id="hosp_admission_by_stp",
numerator="admitted",
denominator="population",
group_by="stp",
),
Measure(
id="death_by_stp",
numerator="died",
denominator="population",
group_by="stp",
small_number_suppression=True,
),
]
This differs from a normal study definition due to the addition of the measures
object, which is a list of calls to the Measure()
function, for each measure.
The index_date
π
Each month/week start date covered by your study period is passed to the study definition as the index_date
in turn, thus allowing the cohort and variables to vary each month/week as required. Some variables can retain fixed or no date ranges, e.g. a person's ethnicity would not be expected to vary over time.
The Measure()
functionπ
id
is just a string used to identify the measure output file.numerator
anddenominator
are the columns in the dataset that define the measure. They must be numeric or boolean (encoded as 0 or 1).group_by
column can be of any type. To calculate the measure across the entire population, you can setgroup_by="population"
. If not specified,group_by
will default toNone
and the measure will be calculated at the individual patient level.small_number_suppression
is an optional boolean that defaults toFalse
. If set toTrue
, small numbers (greater than zero, less than or equal to five) in numerators and denominators will be replaced with blanks. If a numerator or denominator has been suppressed then the corresponding value will also be blank. In cases where numerators or denominators are suppressed but the total suppressed in a column is not greater than five, additional values will be suppressed to bring the total above five.
You can calculate measures for more than one group in a single measure by specifying multiple variables as follows:
measures = [
Measure(
id="hosp_admission_by_stp_and_sex",
numerator="admitted",
denominator="population",
group_by=["stp", "sex"],
),
Measure(
id="death_by_stp_and_sex",
numerator="died",
denominator=" population",
group_by=["stp", "sex"],
),
]
Extract the dataπ
To run multiple study definitions over a series of dates, use the --index-date-range
option of the generate_cohort
command in cohortextractor
.
Rather than generating a single output CSV file this generates multiple output files across a range of dates, modifying the study's index date each time.
Note
Although the dates defined by the --index-date-range
will replace the index_date
that is defined in the study definition itself, index_date
must still be defined.
Running over multiple index dates can be resource intensive and thus slow to run. It is therefore recommended to: use months rather than weeks if appropriate; use a shorter period for testing if applicable; minimise the population size (e.g. apply all necessary exclusions to your population, e.g. age limits, in the study definition rather than filtering in a later step) and use a compressed output format (e.g. csv.gz
as below).
The range is specified as:
--index-date-range "YYYY-MM-DD to YYYY-MM-DD by (week|month)"
It also accepts today
in place of a date.
Output files are named like: output/input_YYYY-MM-DD.csv.gz
There is also a --skip-existing
option which will cause the cohortextractor to skip the extraction step for any dates where the corresponding file already exists.
This makes it easier to incrementally extract data for new months/weeks without having to re-extract everything.
Example:
cohortextractor:latest generate_cohort --index-date-range "2020-01-01 to 2020-12-01 by month" --output-format=csv.gz
...which will output:
output/input_2020-01-01.csv.gz
output/input_2020-02-01.csv.gz
...
output/input_2020-12-01.csv.gz
Calculate the measuresπ
This is done using the generate_measures
command:
cohortextractor:latest generate_measures
For each defined measure, and for each file extracted in step 2, this generates an output file with the measure calculated for that month or week.
output/measure_hosp_admission_by_stp_2020-01-01.csv.gz
output/measure_death_by_stp_2020-01-01.csv.gz
output/measure_hosp_admission_by_stp_2020-02-01.csv.gz
output/measure_death_by_stp_2020-02-01.csv.gz
...
output/measure_hosp_admission_by_stp_2020-12-01.csv.gz
output/measure_death_by_stp_2020-12-01.csv.gz
Finally, for each measure, it combines all the output into a single file with an additional date
column indicating the date associated with each row.
output/measure_hosp_admission_by_stp.csv.gz
output/measure_death_by_stp.csv.gz
This command also respects the --skip-existing
flag.
This will prevent it from recalculating the measure for any months or weeks which have already been calculated.
However the final step, which combines output across time periods, will always be run.
This command accepts an --output-dir
argument, which defines the directory in which the output measure files will be created.
If the --output-dir
argument is configured, generate_measures
expects the input cohort files to also be in the output directory. To ensure this, include the same --output-dir
argument in the previous generate_cohort
step.
Note
As we saw, you can extract the data in a compressed output format, such as csv.gz
, feather
, dta
, or dta.gz
.
For example:
$ cohortextractor:latest generate_cohort --output-format=csv.gz
If you do, then the generate_measures
command will automatically work with this compressed output format;
you need not pass --output-format
to generate_measures
, in other words.
However, the generate_measures
command will always output CSV files.
Putting it all together in a pipelineπ
To generate the final outputs measure_hosp_admission_by_stp.csv.gz
and measure_death_by_stp.csv.gz
in a project pipeline, you would use the following actions:
generate_study_population:
run: cohortextractor:latest generate_cohort --study-definition study_definition --index-date-range "2020-01-01 to 2020-12-01 by month" --skip-existing --output-dir=output/measures --output-format=csv.gz
outputs:
highly_sensitive:
cohort: output/measures/input_*.csv.gz
generate_measures:
run: cohortextractor:latest generate_measures --study-definition study_definition --skip-existing --output-dir=output/measures --output-format csv.gz
needs: [generate_study_population]
outputs:
moderately_sensitive:
measure_csv: output/measures/measure_*.csv.gz