hplc.quant
- class hplc.quant.Chromatogram(file: str | DataFrame, time_window: bool = None, cols: Dict[str, str] = {'signal': 'signal', 'time': 'time'})
Bases:
object
Base class for the processing and quantification of an HPLC chromatogram.
- df
A Pandas DataFrame containing the chromatogram, minimally with columns of time and signal intensity.
- Type:
pandas.core.frame.DataFrame
- window_props
A dictionary of each peak window, labeled as increasing integers in linear order. Each key has its own dictionary with the following keys:
- Type:
dict
- peaks
A Pandas DataFrame containing the inferred properties of each peak including the retention time, scale, skew, amplitude, and total area under the peak across the entire chromatogram.
- Type:
pandas.core.frame.DataFrame
- unmixed_chromatograms
A matrix where each row corresponds to a time point and each column corresponds to the value of the probability density for each individual peak. This is used primarily for plotting in the show method.
- Type:
numpy.ndarray
- quantified_peaks
A Pandas DataFrame with peak areas converted to
- Type:
pandas.core.frame.DataFrame
- scores
A Pandas DataFrame containing the reconstruction scores and Fano factor ratios for each peak and interpeak region. This is generated only after assess_fit() is called.
- Type:
pandas.core.frame.DataFrame
- assess_fit(rtol: float = 0.01, fano_tol: float = 0.01, verbose: bool = True) DataFrame
Assesses whether the computed reconstruction score is adequate, given a tolerance.
- Parameters:
rtol (float) – The tolerance for a reconstruction to be valid. This is the tolerated deviation from a score of 1 which indicates a perfectly reconstructed chromatogram.
fano_tol (float) – The tolerance away from zero for evaluating the Fano factor of interpeak windows. See the Notes below.
verbose (bool) – If True, a summary of the fit will be printed to screen indicating problematic regions if detected.
- Returns:
score_df – A DataFrame reporting the scoring statistic for each window as well as for the entire chromatogram. A window value of 0 corresponds to the entire chromatogram. A column accepted holds a boolean indicating whether the reconstruction is within tolerance (True) or not (False).
- Return type:
pandas.core.frame.DataFrame
Notes
The reconstruction score is defined as
\[R = \frac{\text{area of inferred mixture in window} + 1}{\text{area of observed signal in window} + 1}\]
where \(t\) is the total time of the region, \(A\) is the inferred peak amplitude, \(\alpha\) is the inferred skew parameter, \(r_t\) is the inferred peak retention time, \(\sigma\) is the inferred scale parameter and \(S_i\) is the observed signal intensity at time point \(i\). Note that the signal and reconstruction are cast to be positive to compute the score.
A reconstruction score of \(R = 1\) indicates a perfect reconstruction of the chromatogram. For practical purposes, a chromatogram is deemed to be adequately reconstructed if \(R\) is within a tolerance \(\epsilon\) of 1 such that
\[\left| R - 1 \right| \leq \epsilon \Rightarrow \text{Valid Reconstruction}\]
Interpeak regions may have a poor reconstruction score due to noise or short durations. To determine if this poor reconstruction score is due to a missed peak, the signal Fano factor of the region is computed as
\[F = \frac{\sigma^2_{S}}{\langle S \rangle}.\]
This is compared with the average Fano factor of \(N\) peak windows such that the Fano factor ratio is
\[\frac{F}{\langle F_{peak} \rangle} = \frac{\sigma^2_{S} / \langle S \rangle}{\frac{1}{N} \sum\limits_{i}^N \frac{\sigma_{S,i}^2}{\langle S_i \rangle}}.\]
If the Fano factor ratio is below a tolerance fano_tol, then that window is deemed to be noisy and peak-free.
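As a rough illustration of the quantities defined in these notes (a sketch, not the library's implementation), the following computes a reconstruction score and a Fano factor ratio from plain Python lists. The function names are hypothetical, and approximating each "area" by a sum of absolute values over evenly spaced time points is an assumption made here for brevity.

```python
from statistics import fmean, pvariance


def reconstruction_score(inferred, observed):
    """R = (area of inferred mixture + 1) / (area of observed signal + 1)."""
    # Both traces are cast to be positive before summing (assumed area proxy).
    inferred_area = sum(abs(v) for v in inferred)
    observed_area = sum(abs(v) for v in observed)
    return (inferred_area + 1) / (observed_area + 1)


def fano_factor(signal):
    """F = variance of the signal divided by its mean."""
    return pvariance(signal) / fmean(signal)


def fano_ratio(window_signal, peak_signals):
    """Fano factor of one window relative to the mean over N peak windows."""
    mean_peak_fano = fmean([fano_factor(s) for s in peak_signals])
    return fano_factor(window_signal) / mean_peak_fano


# A toy peak window: observed signal and its inferred reconstruction.
observed = [0.0, 1.0, 4.0, 1.0, 0.0]
inferred = [0.0, 1.1, 3.9, 1.0, 0.0]
score = reconstruction_score(inferred, observed)
accepted = abs(score - 1) <= 0.01  # same role as rtol = 0.01

# A toy interpeak window that is nearly flat (noise only).
interpeak = [1.0, 1.1, 0.9, 1.0, 1.0]
ratio = fano_ratio(interpeak, [observed])
peak_free = ratio < 0.01  # same role as fano_tol = 0.01
```

In this toy case the nearly flat interpeak window has a Fano factor far smaller than that of the peak window, so it would be treated as noisy and peak-free.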
- correct_baseline(window: float = 5, return_df: bool = False, verbose: bool = True, precision: int = 9) DataFrame | None
Performs Sensitive Nonlinear Iterative Peak (SNIP) clipping to estimate and subtract the background in the chromatogram.
- Parameters:
window (float) – The approximate size of signal objects in the chromatogram in dimensions of time. This is related to the number of iterations undertaken by the SNIP algorithm. This must be greater than 10 times the time step, or else a ValueError will be raised.
return_df (bool) – If True, then chromatograms (before and after background correction) are returned.
verbose (bool) – If True, progress will be printed to screen as a progress bar.
precision (int) – The number of decimals to round the subtracted signal to. Default is 9.
- Returns:
corrected_df – If return_df = True, then the original and the corrected chromatogram are returned.
- Return type:
pandas.core.frame.DataFrame
Notes
This implements the SNIP algorithm as presented and summarized in Morháč and Matoušek (2008). The implementation here also rounds the subtracted signal (to 9 decimal places by default) to avoid small values very near zero.
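The core idea of iterative peak clipping can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the method used by correct_baseline(): it omits any signal transformation applied before clipping, it takes the number of iterations directly rather than deriving it from window, and the name snip_baseline is hypothetical.

```python
def snip_baseline(signal, n_iter):
    """Estimate a baseline by iterative peak clipping (simplified sketch)."""
    baseline = list(signal)
    n = len(baseline)
    for p in range(1, n_iter + 1):
        previous = baseline[:]  # clip against the previous iteration's values
        for i in range(p, n - p):
            # Mean of the two points p steps to either side of point i.
            neighbor_mean = 0.5 * (previous[i - p] + previous[i + p])
            # Keep whichever is lower: the point itself or the neighbor mean.
            baseline[i] = min(previous[i], neighbor_mean)
    return baseline


# A flat background of 1.0 with a single triangular peak on top of it.
signal = [1.0, 1.0, 1.0, 2.0, 4.0, 2.0, 1.0, 1.0, 1.0]
baseline = snip_baseline(signal, n_iter=3)

# Subtract the baseline and round, mirroring the precision argument.
corrected = [round(s - b, 9) for s, b in zip(signal, baseline)]
```

Each pass widens the clipping distance by one point, so features narrower than the largest distance are progressively clipped away while the slowly varying background is preserved.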
- crop(time_window: list[float] | None = None, return_df: bool = False) None | DataFrame
Restricts the time dimension of the DataFrame in place.
- Parameters:
time_window (list [start, end], optional) – The retention time window of the chromatogram to consider for analysis. If None, the entire time range of the chromatogram will be considered.
return_df (bool) – If True, the cropped DataFrame is returned.
- Returns:
cropped_df – If return_df = True, then the cropped dataframe is returned.
- Return type:
pandas DataFrame
- deconvolve_peaks(verbose: bool = True, known_peaks: list = [], param_bounds: Dict[str, list] = {}, integration_window: list = [], max_iter: float = 1000000, optimizer_kwargs: Dict = {}) Dict
Note
In most cases, this function should not be called directly. Instead, it should be called through fit_peaks().
For each peak window, estimate the parameters of the skew-normal distributions which make up the peak(s) in the window. See “Notes” for information on default parameter bounds.
- Parameters:
verbose (bool) – If True, a progress bar will be printed during the inference.
param_bounds (dict) –
Modifications to the default parameter bounds (see Notes below) as a dictionary for each parameter. A dict entry should be of the form parameter: [lower, upper]. Modifications have the following effects:
Modifications to amplitude bounds are multiplicative of the observed magnitude at the peak position.
Modifications to location are values that are subtracted or added from the peak position for lower and upper bounds, respectively.
Modifications to scale replace the default values.
Modifications to skew replace the default values.
integration_window (list) – The time window over which the integrated peak areas should be computed. If empty, the area will be integrated over the entire duration of the cropped chromatogram.
max_iter (int) – The maximum number of iterations the optimization protocol should take before erroring out. Default value is 10^6.
optimizer_kwargs (dict) – Keyword arguments to be passed to scipy.optimize.curve_fit.
- Returns:
peak_props – A dictionary containing properties of the peak fitting procedure.
- Return type:
dict
Notes
The parameter boundaries are set automatically to prevent run-away estimation into non-realistic regimes that can seriously slow down the inference. The default parameter boundaries for each peak are as follows.
amplitude: The lower and upper peak amplitude boundaries correspond to one-hundredth and one-hundred-times the value of the peak at the peak location in the chromatogram.
location: The lower and upper location bounds correspond to the minimum and maximum time values of the chromatogram.
scale: The lower and upper bounds of the peak standard deviation default to the chromatogram time-step and one-half of the chromatogram duration, respectively.
skew: The skew parameter by default is allowed to take any value between (-inf, inf).
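The default bounds and the effect of param_bounds modifications described above can be sketched as follows. This is an illustrative reading of the text, not the library's code: the helper names default_bounds and apply_param_bounds are hypothetical, and evenly spaced time points are assumed when computing the time step.

```python
def default_bounds(times, signal, peak_index):
    """Default [lower, upper] bounds for one peak, following the notes above."""
    peak_height = signal[peak_index]
    time_step = times[1] - times[0]  # assumes evenly spaced time points
    duration = times[-1] - times[0]
    return {
        "amplitude": [0.01 * peak_height, 100 * peak_height],
        "location": [min(times), max(times)],
        "scale": [time_step, duration / 2],
        "skew": [float("-inf"), float("inf")],
    }


def apply_param_bounds(bounds, param_bounds, peak_time, peak_height):
    """Apply user modifications of the form parameter: [lower, upper]."""
    updated = dict(bounds)
    for name, (lower, upper) in param_bounds.items():
        if name == "amplitude":
            # Multiplicative of the observed magnitude at the peak position.
            updated[name] = [lower * peak_height, upper * peak_height]
        elif name == "location":
            # Subtracted from / added to the peak position.
            updated[name] = [peak_time - lower, peak_time + upper]
        else:
            # scale and skew replace the default values outright.
            updated[name] = [lower, upper]
    return updated


times = [0.0, 0.5, 1.0, 1.5, 2.0]
signal = [0.0, 2.0, 10.0, 2.0, 0.0]
bounds = default_bounds(times, signal, peak_index=2)
modified = apply_param_bounds(
    bounds,
    {"location": [0.1, 0.1], "skew": [-1, 1]},
    peak_time=times[2],
    peak_height=signal[2],
)
```

Here the location bound is tightened to within 0.1 time units of the peak and the skew is restricted to [-1, 1], while the amplitude and scale bounds keep their defaults.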
- fit_peaks(known_peaks: list = [], tolerance: float = 0.5, prominence: float = 0.01, rel_height: float = 1, approx_peak_width: float = 5, buffer: int = 0, param_bounds: Dict[str, list] = {}, integration_window: list[float] = [], verbose: bool = True, return_peaks: bool = True, correct_baseline: bool = True, max_iter: int = 1000000, precision: int = 9, peak_kwargs: Dict = {}, optimizer_kwargs: Dict = {}) DataFrame
Detects and fits peaks present in the chromatogram.
- Parameters:
known_peaks (list or dict) – The approximate locations of peaks whose position is known. If provided as a list, only the locations will be used as initial guesses. If provided as a dictionary, locations and parameter bounds will be set.
tolerance (float, optional) – If an enforced peak location is within tolerance of an automatically identified peak, the automatically identified peak will be preferred. This parameter is in units of time. Default is one-half time unit.
prominence (float, [0, 1]) – The prominence threshold for identifying peaks. Prominence is the relative height of the normalized signal relative to the local background. Default is 1%. If locations is provided, this is not used.
rel_height (float, [0, 1]) – The relative height of the peak where the baseline is determined. This is used to split into windows and is not used for peak detection. Default is 100%.
approx_peak_width (float, optional) – The approximate width of the signal you want to quantify. This is used as the filtering window for automatic baseline correction. If correct_baseline==False, this has no effect. If less than 10 times the time step, an error will be raised and you will be instructed to either i) increase the approximate peak width or ii) set correct_baseline=False.
buffer (positive int) – The padding of peak windows in units of number of time steps. Default is 100 points on each side of the identified peak window. Must have a value of at least 10.
verbose (bool) – If True, a progress bar will be printed during the inference.
param_bounds (dict, optional) – Parameter boundary modifications to be used to constrain fitting of all peaks. See the docstring of deconvolve_peaks() for more information.
integration_window (list) – The time window over which the integrated peak areas should be computed. If empty, the area will be integrated over the entire duration of the cropped chromatogram.
correct_baseline (bool, optional) – If True, the baseline of the chromatogram will be automatically corrected using the SNIP algorithm. See correct_baseline() for more information.
return_peaks (bool, optional) – If True, a dataframe containing the peaks will be returned. Default is True.
max_iter (int) – The maximum number of iterations the optimization protocol should take before erroring out. Default value is 10^6.
precision (int) – The number of decimals to round the reconstructed signal to. Default is 9.
peak_kwargs (dict) – Additional arguments to be passed to scipy.signal.find_peaks.
optimizer_kwargs (dict) – Additional arguments to be passed to scipy.optimize.curve_fit.
- Returns:
peak_df – A dataframe containing information for each detected peak. This is only returned if return_peaks == True. The peaks are always stored as an attribute peak_df.
- Return type:
pandas.core.frame.DataFrame
Notes
This function infers the parameters defining skew-normal distributions for each peak in the chromatogram. The fitted distribution has the form
\[I = 2S_\text{max} \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)e^{-\frac{(t - r_t)^2}{2\sigma^2}}\left[1 + \text{erf}\frac{\alpha(t - r_t)}{\sqrt{2\sigma^2}}\right]\]
where \(S_\text{max}\) is the maximum signal of the peak, \(t\) is the time, \(r_t\) is the retention time, \(\sigma\) is the scale parameter, and \(\alpha\) is the skew parameter.
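For readers who want to evaluate the fitted form numerically, the expression for \(I\) can be transcribed term by term using only the standard library. This is a sketch for illustration: the function name skew_normal and its argument names are hypothetical and are not part of the documented API.

```python
import math


def skew_normal(t, amplitude, retention_time, scale, skew):
    """Evaluate I(t) exactly as written in the equation for the fitted form."""
    delta = t - retention_time
    # Gaussian factor: (1 / sqrt(2 pi sigma^2)) * exp(-(t - r_t)^2 / (2 sigma^2))
    gaussian = math.exp(-delta**2 / (2 * scale**2)) / math.sqrt(2 * math.pi * scale**2)
    # Skew factor: 1 + erf(alpha (t - r_t) / sqrt(2 sigma^2))
    skew_term = 1 + math.erf(skew * delta / math.sqrt(2 * scale**2))
    return 2 * amplitude * gaussian * skew_term


# With zero skew the curve is symmetric about the retention time.
left = skew_normal(9.0, amplitude=1.0, retention_time=10.0, scale=1.0, skew=0.0)
right = skew_normal(11.0, amplitude=1.0, retention_time=10.0, scale=1.0, skew=0.0)

# A positive skew shifts weight toward later times (a tailing peak).
tail_left = skew_normal(9.0, amplitude=1.0, retention_time=10.0, scale=1.0, skew=3.0)
tail_right = skew_normal(11.0, amplitude=1.0, retention_time=10.0, scale=1.0, skew=3.0)
```

Evaluating the function over a grid of time points and summing one curve per peak gives a reconstruction of the kind compared against the observed signal in assess_fit().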
- map_peaks(params: Dict[str, Dict[str, float]], loc_tolerance: float = 0.5, include_unmapped: bool = False) DataFrame
Maps user-provided mappings to arbitrarily labeled peaks. If a linear calibration curve is also provided, the concentration will be computed.
- Parameters:
params (dict) – A dictionary mapping each peak to a slope and intercept used for converting peak areas to units of concentration. Each peak should have a key that is the compound name (e.g. “glucose”). Each key should have another dict as its value, with retention_time, slope, and intercept as keys. If only retention_time is given, concentration will not be computed. The key retention_time will be used to map the compound to the peak_id. If unit is provided, it will be added as a column.
loc_tolerance (float) – The tolerance for mapping the compounds to the retention time. The default is 0.5 time units.
include_unmapped (bool) – If True, unmapped compounds will remain in the returned peak dataframe, but will be populated with NaN. Default is False.
- Returns:
peaks – A modified peak table with the compound name and concentration added as columns.
- Return type:
pandas.core.frame.DataFrame
Notes
As of v0.1.0, this function can only accommodate linear calibration functions.
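The shape of the params dictionary and the mapping logic can be sketched with plain dictionaries. This is an illustration only, not the library's implementation: the helper map_compounds is hypothetical, peaks are represented as dicts rather than DataFrame rows, and the calibration is assumed to have the linear form area = slope * concentration + intercept.

```python
def map_compounds(peaks, params, loc_tolerance=0.5):
    """Attach compound names (and concentrations, if calibrated) to peaks."""
    mapped = []
    for peak in peaks:
        for compound, cal in params.items():
            # Match a compound to a peak by retention time within tolerance.
            if abs(peak["retention_time"] - cal["retention_time"]) <= loc_tolerance:
                row = dict(peak, compound=compound)
                if "slope" in cal and "intercept" in cal:
                    # Assumed calibration: area = slope * concentration + intercept
                    row["concentration"] = (peak["area"] - cal["intercept"]) / cal["slope"]
                mapped.append(row)
                break
    return mapped


peaks = [
    {"peak_id": 1, "retention_time": 10.2, "area": 250.0},
    {"peak_id": 2, "retention_time": 15.9, "area": 80.0},
    {"peak_id": 3, "retention_time": 22.0, "area": 40.0},
]
params = {
    # Full calibration: a concentration is computed.
    "glucose": {"retention_time": 10.0, "slope": 50.0, "intercept": 0.0},
    # Only a retention time: the peak is named but not quantified.
    "lactate": {"retention_time": 16.0},
}
mapped = map_compounds(peaks, params, loc_tolerance=0.5)
```

The third peak matches no compound and is dropped from the result, which corresponds to the default include_unmapped=False behavior described above.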
- show(time_range: list[float] = []) list[Figure, Axes]
Displays the chromatogram with mapped peaks if available.
- Parameters:
time_range (List) – Adjust the limits to show a restricted time range. Should be provided as two floats in the range of [lower, upper]. Note that this does not affect the chromatogram directly as in crop.
- Returns:
fig (matplotlib.figure.Figure) – The matplotlib figure object.
ax (matplotlib.axes._axes.Axes) – The matplotlib axis object.