Quickstart
Notebook Code: Notebook Prose:
This package is meant to get you from chromatogram to quantified peaks as rapidly as possible. Below is a brief example of how to go from a raw, off-the-machine data set to a list of compounds and their absolute concentrations.
Loading and viewing chromatograms
Text files containing chromatograms with time and signal information can be intelligently read into a pandas DataFrame using hplc.io.load_chromatogram()
.
[1]:
# Load the chromatogram as a dataframe
from hplc.io import load_chromatogram
df = load_chromatogram('data/sample.txt',
cols=['R.Time (min)', 'Intensity'])
df.head()
[1]:
R.Time (min) | Intensity | |
---|---|---|
0 | 0.00000 | 0 |
1 | 0.00833 | 0 |
2 | 0.01667 | 0 |
3 | 0.02500 | 0 |
4 | 0.03333 | -1 |
By providing the column names as a dictionary, you can rename the (often annoying) default column names to something easier to work with, such as “time” and “signal” as
[2]:
# Load chromatogram and rename the columns
df = load_chromatogram('data/sample.txt', cols={'R.Time (min)':'time',
'Intensity': 'signal'})
df.head()
[2]:
time | signal | |
---|---|---|
0 | 0.00000 | 0 |
1 | 0.00833 | 0 |
2 | 0.01667 | 0 |
3 | 0.02500 | 0 |
4 | 0.03333 | -1 |
This dataframe can now be loaded passed to the Chromatogram
class, which has a variety of methods for quantification, cropping, and visualization and more.
[3]:
# Instantiate the Chromatogram class with the loaded chromatogram.
from hplc.quant import Chromatogram
chrom = Chromatogram(df)
# Show the chromatogram
chrom.show()
[3]:
[<Figure size 640x480 with 1 Axes>, <Axes: xlabel='time', ylabel='signal'>]
The crop
method allows you to crop the chromatogram in place to restrict the signal to a specific time range.
[4]:
# Crop the chromatogram in place between 8 and 21 min.
chrom.crop([10, 20])
chrom.show()
[4]:
[<Figure size 640x480 with 1 Axes>, <Axes: xlabel='time', ylabel='signal'>]
Note that the crop function operates in place and modifies the loaded as data within the Chromatogram
object.
Detecting and Fitting Peaks
The real meat of the package comes in the deconvolution of signal into discrete peaks and measurement of their properties. This typically involves the automated estimation and subtraction of the baseline, detection of peaks, and fitting of skew-normal distributions to reconstitute the signal. Luckily for you, all of this is done in a single method call Chromatogram.fit_peaks()
[5]:
# Automatically detect and fit the peaks
peaks = chrom.fit_peaks(buffer=0)
peaks
Performing baseline correction: 100%|██████████| 299/299 [00:00<00:00, 2748.97it/s]
Deconvolving mixture: 100%|██████████| 2/2 [00:13<00:00, 6.72s/it]
[5]:
retention_time | scale | skew | amplitude | area | signal_maximum | peak_id | |
---|---|---|---|---|---|---|---|
0 | 10.90 | 0.158768 | 0.691959 | 23380.386567 | 2.805646e+06 | 66064.360632 | 1 |
0 | 13.17 | 0.594721 | 3.905474 | 43163.899447 | 5.179668e+06 | 50341.309150 | 2 |
0 | 14.45 | 0.349615 | -2.995742 | 34698.949105 | 4.163874e+06 | 65264.917575 | 3 |
0 | 15.53 | 0.313999 | 1.621137 | 15061.414974 | 1.807370e+06 | 26771.463046 | 4 |
0 | 16.52 | 0.347276 | 1.990206 | 10936.994323 | 1.312439e+06 | 18651.421484 | 5 |
0 | 17.29 | 0.348001 | 1.703716 | 12525.283275 | 1.503034e+06 | 20381.895605 | 6 |
To see how well the deconvolution worked, you can once again call the show
method to see the composite compound chromatograms.
[6]:
# View the result of the fitting.
chrom.show()
[6]:
[<Figure size 640x480 with 1 Axes>,
<Axes: xlabel='time', ylabel='signal (baseline corrected)'>]
You can also call assess_fit()
to see how well the chromatogram is described by the inferred mixture. This is done through computing a reconstruction score
This is computed for regions with peaks (termed “peak windows”) and regions of background (termed “interpeak windows”) if they are present.
[7]:
# Print out the assessment statistics.
scores = chrom.assess_fit()
scores.head()
-------------------Chromatogram Reconstruction Report Card----------------------
Reconstruction of Peaks
=======================
A+, Success: Peak Window 1 (t: 10.558 - 11.758) R-Score = 0.9984
A+, Success: Peak Window 2 (t: 12.117 - 19.817) R-Score = 0.9952
Signal Reconstruction of Interpeak Windows
==========================================
C-, Needs Review: Interpeak Window 1 (t: 10.000 - 10.550) R-Score = 0.3902 & Fano Ratio = 0.0024
Interpeak window 1 is not well reconstructed by mixture, but has a small Fano factor
compared to peak region(s). This is likely acceptable, but visually check this region.
C-, Needs Review: Interpeak Window 2 (t: 11.767 - 12.108) R-Score = 10^-3 & Fano Ratio = 0.0012
Interpeak window 2 is not well reconstructed by mixture, but has a small Fano factor
compared to peak region(s). This is likely acceptable, but visually check this region.
C-, Needs Review: Interpeak Window 3 (t: 19.825 - 20.000) R-Score = 10^-1 & Fano Ratio = 0.0000
Interpeak window 3 is not well reconstructed by mixture, but has a small Fano factor
compared to peak region(s). This is likely acceptable, but visually check this region.
--------------------------------------------------------------------------------
[7]:
window_id | time_start | time_end | signal_area | inferred_area | signal_variance | signal_mean | signal_fano_factor | reconstruction_score | window_type | applied_tolerance | status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 10.00000 | 10.55000 | 9.112041e+03 | 3.555093e+03 | 8.564681e+03 | 135.985685 | 62.982224 | 0.390153 | interpeak | 0.01 | needs review |
1 | 2 | 11.76667 | 12.10833 | 5.779000e+03 | 1.155479e+00 | 4.298102e+03 | 137.571429 | 31.242694 | 0.000200 | interpeak | 0.01 | needs review |
2 | 3 | 19.82500 | 20.00000 | 1.177637e+01 | 1.000001e+00 | 2.048615e-01 | 0.489835 | 0.418226 | 0.084916 | interpeak | 0.01 | needs review |
0 | 1 | 10.55833 | 11.75833 | 2.810059e+06 | 2.805647e+06 | 5.279468e+08 | 19379.710827 | 27242.244966 | 0.998430 | peak | 0.01 | valid |
1 | 2 | 12.11667 | 19.81667 | 1.403344e+07 | 1.396639e+07 | 3.854511e+08 | 15171.281774 | 25406.628617 | 0.995222 | peak | 0.01 | valid |
Quantifying Peaks
If you know the parameters of the linear calibration curve, which relates peak area to a known concentration, you can use the map_compounds
method which will map user provided compound names to peaks.
[8]:
# Define the two peaks of interest and their calibration curves
calibration = {'compound A': {'retention_time': 15.5,
'slope': 10547.6,
'intercept': -205.6,
'unit': 'µM'},
'compound B': {'retention_time': 17.2,
'slope': 26401.2,
'intercept': 54.2,
'unit': 'nM'}}
quant_peaks = chrom.map_peaks(calibration)
quant_peaks
[8]:
retention_time | scale | skew | amplitude | area | signal_maximum | peak_id | compound | concentration | unit | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 15.53 | 0.313999 | 1.621137 | 15061.414974 | 1.807370e+06 | 26771.463046 | 4 | compound A | 171.373146 | µM |
0 | 17.29 | 0.348001 | 1.703716 | 12525.283275 | 1.503034e+06 | 20381.895605 | 6 | compound B | 56.928465 | nM |
Successfully mapping compounds to peak ID’s will also be reflected in the show
method.
[9]:
# Show the chromatogram with mapped compounds
chrom.show()
[9]:
[<Figure size 640x480 with 1 Axes>,
<Axes: xlabel='time', ylabel='signal (baseline corrected)'>]
Deconvolving Overlapping and Subtle Peaks
Often, compounds will have similar retention times leading to heavily overlapping peaks. If one of the compounds dominates the signal, the other compounds may only show up as a “shoulder” on the predominant peak. In this case, automatic peak detection will fail and will fit only a single peak, as in the following example.
[10]:
# Load, fit, and display a chromatogram with heavily overlapping peaks
df = load_chromatogram('data/example_overlap.txt', cols=['time', 'signal'])
chrom = Chromatogram(df)
peaks = chrom.fit_peaks()
chrom.show()
score = chrom.assess_fit()
Performing baseline correction: 100%|██████████| 249/249 [00:00<00:00, 1014.14it/s]
Deconvolving mixture: 100%|██████████| 1/1 [00:00<00:00, 106.46it/s]
-------------------Chromatogram Reconstruction Report Card----------------------
Reconstruction of Peaks
=======================
F, Failed: Peak Window 1 (t: 2.230 - 20.430) R-Score = 0.9788
Peak mixture poorly reconstructs signal. You many need to adjust parameter bounds
or add manual peak positions (if you have a shouldered pair, for example). If
you have a very noisy signal, you may need to increase the reconstruction
tolerance `rtol`.
Signal Reconstruction of Interpeak Windows
==========================================
A+, Success: Interpeak Window 1 (t: 0.000 - 2.220) R-Score = 0.9952 & Fano Ratio = 10^-5
C-, Needs Review: Interpeak Window 2 (t: 20.440 - 24.990) R-Score = 1.6283 & Fano Ratio = 10^-5
Interpeak window 2 is not well reconstructed by mixture, but has a small Fano factor
compared to peak region(s). This is likely acceptable, but visually check this region.
--------------------------------------------------------------------------------
However, if you know there is a second peak, and you know its approximate retention time, you can manually add an approximate peak position, forcing the algorithm to estimate the peak convolution including more than one peak.
[11]:
# Enforce a manual peak position at around 11 time units
peaks = chrom.fit_peaks(known_peaks=[12])
chrom.show()
score = chrom.assess_fit()
Deconvolving mixture: 100%|██████████| 1/1 [00:01<00:00, 1.07s/it]
-------------------Chromatogram Reconstruction Report Card----------------------
Reconstruction of Peaks
=======================
A+, Success: Peak Window 1 (t: 2.230 - 20.430) R-Score = 1.0000
Signal Reconstruction of Interpeak Windows
==========================================
C-, Needs Review: Interpeak Window 1 (t: 0.000 - 2.220) R-Score = 1.0442 & Fano Ratio = 10^-5
Interpeak window 1 is not well reconstructed by mixture, but has a small Fano factor
compared to peak region(s). This is likely acceptable, but visually check this region.
A+, Success: Interpeak Window 2 (t: 20.440 - 24.990) R-Score = 1.0077 & Fano Ratio = 10^-5
--------------------------------------------------------------------------------
Note that even though we provided a guess of ≈ 12 time units for the position of the second peak, the algorithm did not force the peak to be exactly there. If enforced_locations
are provided, these are used as initial guesses when performing the fitting.
This approach can also be used if there is a shallow, isolated peak that is not automatically detected.
[12]:
# Load and fit a sample with a shallow peak
df = load_chromatogram('data/example_shallow.csv', cols=['time', 'signal'])
chrom = Chromatogram(df)
peaks = chrom.fit_peaks(prominence=0.5) # Prominence is to exclude shallow peak
chrom.show()
score = chrom.assess_fit()
Performing baseline correction: 100%|██████████| 249/249 [00:00<00:00, 441.84it/s]
Deconvolving mixture: 100%|██████████| 1/1 [00:00<00:00, 7.30it/s]
-------------------Chromatogram Reconstruction Report Card----------------------
Reconstruction of Peaks
=======================
A+, Success: Peak Window 1 (t: 3.530 - 16.460) R-Score = 1.0000
Signal Reconstruction of Interpeak Windows
==========================================
A+, Success: Interpeak Window 1 (t: 0.000 - 3.520) R-Score = 1.0000 & Fano Ratio = 0
F, Failed: Interpeak Window 2 (t: 16.470 - 79.990) R-Score = 10^-2 & Fano Ratio = 0.0382
Interpeak window 2 is not well reconstructed by mixture and has an appreciable Fano
factor compared to peak region(s). This suggests you have missed a peak in this
region. Consider adding manual peak positioning by passing `known_peaks`
to `fit_peaks()`.
--------------------------------------------------------------------------------
As the second peak is broad, you can provide an estimate of the width of the peak at the half maximum value to provide a better initial guess.
[13]:
# Add the location of the second peak
peaks = chrom.fit_peaks(known_peaks={50 : {'width': 5}}, prominence=0.5)
chrom.show()
score = chrom.assess_fit()
Deconvolving mixture: 100%|██████████| 2/2 [00:00<00:00, 13.19it/s]
-------------------Chromatogram Reconstruction Report Card----------------------
Reconstruction of Peaks
=======================
A+, Success: Peak Window 1 (t: 3.530 - 16.460) R-Score = 1.0000
F, Failed: Peak Window 2 (t: 45.000 - 54.990) R-Score = 1.1050
Peak mixture poorly reconstructs signal. You many need to adjust parameter bounds
or add manual peak positions (if you have a shouldered pair, for example). If
you have a very noisy signal, you may need to increase the reconstruction
tolerance `rtol`.
Signal Reconstruction of Interpeak Windows
==========================================
A+, Success: Interpeak Window 1 (t: 0.000 - 3.520) R-Score = 1.0000 & Fano Ratio = 0
F, Failed: Interpeak Window 2 (t: 16.470 - 44.990) R-Score = 1.0151 & Fano Ratio = 0.0160
Interpeak window 2 is not well reconstructed by mixture and has an appreciable Fano
factor compared to peak region(s). This suggests you have missed a peak in this
region. Consider adding manual peak positioning by passing `known_peaks`
to `fit_peaks()`.
A+, Success: Interpeak Window 3 (t: 55.000 - 79.990) R-Score = 1.0119 & Fano Ratio = 0.0158
--------------------------------------------------------------------------------
Constraining Peaks With Known Parameters
If you have a chromatogram with two completely overlapping peaks, it can be very, very difficult to deconvolve the mixture. However, if you happen to know the parameters of one of the constituents (say, from characterization of an isolated aqueous mixture of that compound), you can apply more stringent bounds to that particular peak. Say for example we have a mixture of two compounds with retention times of 10
and 10.6
and you know that the first peak has an amplitude of 100
units.
If you were to only supply the locations of the known peaks, you would underestimate the contribution from the first peak.
[14]:
# Load a chromatogram that is very heavily overlapping
df = load_chromatogram('data/bounding_example.csv', cols=['time', 'signal'])
chrom = Chromatogram(df)
# Fit the peaks providing only the known retention times
peaks = chrom.fit_peaks(known_peaks = [10, 10.6])
inferred_amplitude = peaks[peaks['peak_id']==1]['amplitude'].values[0]
known_amplitude = 100
# Print a summary statement demonstrating the underestimation
print(f'Inferred amplitude for peak 1 is {inferred_amplitude:0.3f}. Known value is {known_amplitude:0.3f}')
chrom.show()
Performing baseline correction: 100%|██████████| 249/249 [00:00<00:00, 1954.58it/s]
Deconvolving mixture: 100%|██████████| 2/2 [00:00<00:00, 15.32it/s]
Inferred amplitude for peak 1 is 94.971. Known value is 100.000
[14]:
[<Figure size 640x480 with 1 Axes>,
<Axes: xlabel='time', ylabel='signal (baseline corrected)'>]
You can constrain the parameter bounds for the amplitude of peak 1 more narrowly than the other peak by passing a dictionary to the known_peaks
parameter of fit_peaks()
.
[15]:
# Define more stringent bounds as a dictionary with the retention time as the key
# and the parameter bounds as a value.
bounds = {10: {'amplitude':[99, 101]}, # Known parameters for peak 1
10.6 : {} # Allow free inference for second peak
}
peaks = chrom.fit_peaks(known_peaks=bounds)
inferred_amplitude = peaks[peaks['peak_id']==1]['amplitude'].values[0]
# Print a summary statement demonstrating the improvement.
print(f'Inferred amplitude for peak 1 is {inferred_amplitude:0.3f}. Known value is {known_amplitude:0.3f}')
chrom.show()
Deconvolving mixture: 100%|██████████| 2/2 [00:00<00:00, 2.19it/s]
Inferred amplitude for peak 1 is 99.000. Known value is 100.000
[15]:
[<Figure size 640x480 with 1 Axes>,
<Axes: xlabel='time', ylabel='signal (baseline corrected)'>]
Here, we’ve only bounded the amplitude
parameter, but you can also provide bounds for location
, scale
, and skew
.
© Griffin Chure, 2024. This notebook and the code within are released under a Creative-Commons CC-BY 4.0 and GPLv3 license, respectively.