hplc.quant

class hplc.quant.Chromatogram(file, time_window=None, cols={'signal': 'signal', 'time': 'time'})

Bases: object

Base class for the processing and quantification of an HPLC chromatogram.

df

A Pandas DataFrame containing the chromatogram, minimally with columns of time and signal intensity.

Type:

pandas.core.frame.DataFrame

window_props

A dictionary of each peak window, labeled as increasing integers in linear order. Each key has its own dictionary with the following keys:

Type:

dict

peaks

A Pandas DataFrame containing the inferred properties of each peak including the retention time, scale, skew, amplitude, and total area under the peak across the entire chromatogram.

Type:

pandas.core.frame.DataFrame

unmixed_chromatograms

A matrix where each row corresponds to a time point and each column corresponds to the value of the probability density for each individual peak. This is used primarily for plotting in the show method.

Type:

numpy.ndarray

quantified_peaks

A Pandas DataFrame with peak areas converted to concentrations. See map_peaks().

Type:

pandas.core.frame.DataFrame

scores

A Pandas DataFrame containing the reconstruction scores and Fano factor ratios for each peak and interpeak region. This is generated only after assess_fit() is called.

Type:

pandas.core.frame.DataFrame

assess_fit(rtol=0.01, fano_tol=0.01, verbose=True)

Assesses whether the computed reconstruction score is adequate, given a tolerance.

Parameters:
  • rtol (float) – The tolerance for a reconstruction to be valid. This is the tolerated deviation from a score of 1 which indicates a perfectly reconstructed chromatogram.

  • fano_tol (float) – The tolerance away from zero for evaluating the Fano factor of interpeak windows. See the Notes below.

  • verbose (bool) – If True, a summary of the fit will be printed to screen indicating problematic regions if detected.

Returns:

score_df – A DataFrame reporting the scoring statistic for each window as well as for the entire chromatogram. A window value of 0 corresponds to the entire chromatogram. A boolean column named accepted indicates whether the reconstruction is within tolerance (True) or not (False).

Return type:

pandas.core.frame.DataFrame

Notes

The reconstruction score is defined as

\[R = \frac{\text{area of inferred mixture in window} + 1}{\text{area of observed signal in window} + 1}\]

where the inferred mixture is evaluated over the total time \(t\) of the region from the inferred peak amplitude \(A\), the inferred skew parameter \(\alpha\), the inferred peak retention time \(r_t\), and the inferred scale parameter \(\sigma\), and \(S_i\) is the observed signal intensity at time point \(i\). Note that the signal and the reconstruction are both cast to be positive to compute the score.

A reconstruction score of \(R = 1\) indicates a perfect reconstruction of the chromatogram. For practical purposes, a chromatogram is deemed to be adequately reconstructed if \(R\) is within a tolerance \(\epsilon\) of 1 such that

\[\left| R - 1 \right| \leq \epsilon \Rightarrow \text{Valid Reconstruction}\]
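The score and its tolerance check can be illustrated with a minimal sketch. It assumes that areas are approximated by summing absolute values on a uniform time grid; the function names are hypothetical and are not part of hplc.quant.

```python
def reconstruction_score(inferred, observed):
    """Return R = (inferred area + 1) / (observed area + 1) for one window."""
    # Assumption: areas are plain sums of absolute values, which mirrors the
    # note that signal and reconstruction are cast to be positive.
    inferred_area = sum(abs(v) for v in inferred)
    observed_area = sum(abs(v) for v in observed)
    return (inferred_area + 1.0) / (observed_area + 1.0)


def is_valid_reconstruction(score, rtol=0.01):
    """Apply the criterion |R - 1| <= rtol."""
    return abs(score - 1.0) <= rtol


# Illustrative window: the reconstruction slightly underestimates the peak.
observed = [0.0, 2.0, 10.0, 2.0, 0.0]
inferred = [0.0, 2.0, 9.7, 2.0, 0.0]
score = reconstruction_score(inferred, observed)
accepted = is_valid_reconstruction(score, rtol=0.01)
```

The added 1 in the numerator and denominator keeps the ratio finite when a window contains almost no signal.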

Interpeak regions may have a poor reconstruction score due to noise or short durations. To determine if this poor reconstruction score is due to a missed peak, the signal Fano factor of the region is computed as

\[F = \frac{\sigma^2_{S}}{\langle S \rangle}.\]

This is compared with the average Fano factor of \(N\) peak windows such that the Fano factor ratio is

\[\frac{F}{\langle F_{peak} \rangle} = \frac{\sigma^2_{S} / \langle S \rangle}{\frac{1}{N} \sum\limits_{i}^N \frac{\sigma_{S,i}^2}{\langle S_i \rangle}}.\]

If the Fano factor ratio is below a tolerance fano_tol, then that window is deemed to be noisy and peak-free.
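A minimal sketch of the Fano factor ratio follows. It assumes the population variance is used and that every window has a nonzero mean signal; both helper names are hypothetical and not part of hplc.quant.

```python
from statistics import fmean, pvariance


def fano_factor(signal):
    """Variance-to-mean ratio of the signal in one window."""
    # Assumption: population variance; the mean must be nonzero.
    return pvariance(signal) / fmean(signal)


def fano_ratio(interpeak_signal, peak_window_signals):
    """Fano factor of an interpeak window relative to the mean over peak windows."""
    mean_peak_fano = fmean([fano_factor(s) for s in peak_window_signals])
    return fano_factor(interpeak_signal) / mean_peak_fano


# Illustrative data: a nearly flat interpeak region and two peak windows.
interpeak = [1.0, 1.01, 0.99, 1.0]
peak_windows = [[1.0, 5.0, 20.0, 5.0, 1.0], [1.0, 3.0, 10.0, 3.0, 1.0]]
ratio = fano_ratio(interpeak, peak_windows)
peak_free = ratio < 0.01  # compare against fano_tol
```

A flat, noisy region has a small variance relative to its mean, so its ratio against peak-containing windows falls well below the tolerance.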

correct_baseline(window=5, return_df=False, verbose=True, precision=9)

Performs Statistics-sensitive Nonlinear Iterative Peak-clipping (SNIP) to estimate and subtract the background in the chromatogram.

Parameters:
  • window (int) – The approximate size of signal objects in the chromatogram in dimensions of time. This is related to the number of iterations undertaken by the SNIP algorithm.

  • return_df (bool) – If True, the chromatograms before and after background correction are returned.

  • verbose (bool) – If True, progress will be printed to screen as a progress bar.

  • precision (int) – The number of decimals to round the subtracted signal to. Default is 9.

Returns:

corrected_df – If return_df = True, then the original and the corrected chromatogram are returned.

Return type:

pandas.core.frame.DataFrame

Notes

This implements the SNIP algorithm as presented and summarized in Morháč and Matoušek 2008. The implementation also rounds the subtracted signal to precision decimal places (9 by default) to avoid small values very near zero.
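For orientation, a generic SNIP pass can be sketched in pure Python. This is not the library's implementation: it assumes a non-negative signal, the usual log-log-square-root (LLS) compression, an increasing clipping window, and an iteration count passed directly as the hypothetical parameter n_iter rather than derived from window.

```python
import math


def snip_background(signal, n_iter):
    """Estimate a smooth background by iterative peak clipping."""
    # LLS transform compresses the dynamic range (assumes signal >= 0).
    v = [math.log(math.log(math.sqrt(s + 1.0) + 1.0) + 1.0) for s in signal]
    n = len(v)
    for p in range(1, n_iter + 1):
        prev = v[:]
        for i in range(p, n - p):
            # Clip each point to the mean of its neighbours p steps away.
            v[i] = min(prev[i], 0.5 * (prev[i - p] + prev[i + p]))
    # Invert the LLS transform to return to signal units.
    return [(math.exp(math.exp(x) - 1.0) - 1.0) ** 2 - 1.0 for x in v]


def subtract_background(signal, n_iter, precision=9):
    """Subtract the SNIP background and round to avoid values very near zero."""
    background = snip_background(signal, n_iter)
    return [round(s - b, precision) for s, b in zip(signal, background)]


# Illustrative data: a Gaussian peak sitting on a constant offset of 5.
signal = [5.0 + 10.0 * math.exp(-((i - 25) ** 2) / 8.0) for i in range(51)]
corrected = subtract_background(signal, n_iter=10)
```

The number of iterations should span roughly the half-width of the widest feature to be preserved, which is why the window argument is described in terms of the size of signal objects.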

crop(time_window=None, return_df=False)

Restricts the time dimension of the DataFrame in place.

Parameters:
  • time_window (list [start, end], optional) – The retention time window of the chromatogram to consider for analysis. If None, the entire time range of the chromatogram will be considered.

  • return_df (bool) – If True, the cropped DataFrame is returned.

Returns:

cropped_df – If return_df = True, then the cropped dataframe is returned.

Return type:

pandas.core.frame.DataFrame
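The cropping logic amounts to a filter on the time column. The sketch below works on a plain list of (time, signal) pairs instead of a DataFrame, returns a new list rather than modifying in place, and assumes both bounds are inclusive; crop_rows is a hypothetical name.

```python
def crop_rows(rows, time_window=None):
    """Keep only the (time, signal) pairs inside [start, end]."""
    if time_window is None:
        # No window given: the entire time range is kept.
        return list(rows)
    start, end = time_window
    if start >= end:
        raise ValueError("time_window must be given as [start, end] with start < end")
    # Assumption: both ends of the window are inclusive.
    return [(t, s) for t, s in rows if start <= t <= end]


rows = [(0.0, 1.0), (0.5, 3.0), (1.0, 9.0), (1.5, 3.0), (2.0, 1.0)]
cropped = crop_rows(rows, time_window=[0.5, 1.5])
```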

deconvolve_peaks(verbose=True, known_peaks=[], param_bounds={}, integration_window=[], max_iter=1000000, optimizer_kwargs={})

Note

In most cases, this function should not be called directly. Instead, it should be called through fit_peaks().

For each peak window, estimates the parameters of the skew-normal distributions that make up the peak(s) in the window. See “Notes” for information on default parameter bounds.

Parameters:
  • verbose (bool) – If True, a progress bar will be printed during the inference.

  • param_bounds (dict) –

    Modifications to the default parameter bounds (see Notes below) as a dictionary for each parameter. A dict entry should be of the form parameter: [lower, upper]. Modifications have the following effects:

    • Modifications to amplitude bounds are multiplicative of the observed magnitude at the peak position.

    • Modifications to location are values that are subtracted or added from the peak position for lower and upper bounds, respectively.

    • Modifications to scale replace the default values.

    • Modifications to skew replace the default values.

  • integration_window (list) – The time window over which the integrated peak areas should be computed. If empty, the area will be integrated over the entire duration of the cropped chromatogram.

  • max_iter (int) – The maximum number of iterations the optimization protocol should take before erroring out. Default value is 10^6.

  • optimizer_kwargs (dict) – Keyword arguments to be passed to scipy.optimize.curve_fit.

Returns:

peak_props – A dictionary containing properties of the peak fitting procedure.

Return type:

dict

Notes

The parameter boundaries are set automatically to prevent run-away estimation into non-realistic regimes that can seriously slow down the inference. The default parameter boundaries for each peak are as follows.

  • amplitude: The lower and upper peak amplitude bounds correspond to one-tenth and ten times the value of the signal at the peak location in the chromatogram.

  • location: The lower and upper location bounds correspond to the minimum and maximum time values of the chromatogram.

  • scale: The lower and upper bounds of the peak standard deviation default to the chromatogram time step and one-half of the chromatogram duration, respectively.

  • skew: The skew parameter is unbounded by default and may take any value in (-inf, inf).
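The default bounds and the effect of param_bounds modifications can be summarized in a short sketch. It assumes a uniform time step and works on plain lists; default_bounds and apply_modifications are hypothetical helpers, not functions of hplc.quant.

```python
import math


def default_bounds(times, signal, peak_index):
    """Default [lower, upper] bounds for one peak, as listed above."""
    height = signal[peak_index]
    time_step = times[1] - times[0]  # assumes uniformly sampled times
    duration = times[-1] - times[0]
    return {
        "amplitude": [0.1 * height, 10.0 * height],
        "location": [min(times), max(times)],
        "scale": [time_step, duration / 2.0],
        "skew": [-math.inf, math.inf],
    }


def apply_modifications(bounds, param_bounds, times, signal, peak_index):
    """Apply user modifications following the rules for each parameter."""
    out = {name: list(pair) for name, pair in bounds.items()}
    height = signal[peak_index]
    location = times[peak_index]
    for name, (lower, upper) in param_bounds.items():
        if name == "amplitude":
            # Multiplicative of the observed magnitude at the peak position.
            out[name] = [lower * height, upper * height]
        elif name == "location":
            # Subtracted from / added to the peak position.
            out[name] = [location - lower, location + upper]
        elif name in ("scale", "skew"):
            # Replace the defaults outright.
            out[name] = [lower, upper]
        else:
            raise KeyError(f"unknown parameter: {name}")
    return out


times = [0.0, 0.5, 1.0, 1.5, 2.0]
signal = [0.0, 1.0, 4.0, 1.0, 0.0]
bounds = default_bounds(times, signal, peak_index=2)
modified = apply_modifications(
    bounds,
    {"amplitude": [0.5, 2.0], "location": [0.25, 0.25], "skew": [-1.0, 1.0]},
    times, signal, peak_index=2,
)
```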

fit_peaks(known_peaks=[], tolerance=0.5, prominence=0.01, rel_height=1, approx_peak_width=5, buffer=0, param_bounds={}, integration_window=[], verbose=True, return_peaks=True, correct_baseline=True, max_iter=1000000, precision=9, peak_kwargs={}, optimizer_kwargs={})

Detects and fits peaks present in the chromatogram.

Parameters:
  • known_peaks (list or dict) – The approximate locations of peaks whose position is known. If provided as a list, only the locations will be used as initial guesses. If provided as a dictionary, locations and parameter bounds will be set.

  • tolerance (float, optional) – If an enforced peak location is within tolerance of an automatically identified peak, the automatically identified peak will be preferred. This parameter is in units of time. Default is one-half time unit.

  • prominence (float, [0, 1]) – The prominence threshold for identifying peaks. Prominence is the height of the normalized signal relative to the local background. Default is 1%. If locations is provided, this is not used.

  • rel_height (float, [0, 1]) – The relative height of the peak where the baseline is determined. This is used to split into windows and is not used for peak detection. Default is 100%.

  • approx_peak_width (float, optional) – The approximate width of the signal you want to quantify. This is used as the filtering window for automatic baseline correction. If correct_baseline == False, this has no effect.

  • buffer (positive int) – The padding of peak windows, in units of number of time steps, applied on each side of the identified peak window. Default is 0, as given in the signature.

  • verbose (bool) – If True, a progress bar will be printed during the inference.

  • param_bounds (dict, optional) – Parameter boundary modifications to be used to constrain fitting of all peaks. See docstring of deconvolve_peaks() for more information.

  • integration_window (list) – The time window over which the integrated peak areas should be computed. If empty, the area will be integrated over the entire duration of the cropped chromatogram.

  • correct_baseline (bool, optional) – If True, the baseline of the chromatogram will be automatically corrected using the SNIP algorithm. See correct_baseline() for more information.

  • return_peaks (bool, optional) – If True, a dataframe containing the peaks will be returned. Default is True.

  • max_iter (int) – The maximum number of iterations the optimization protocol should take before erroring out. Default value is 10^6.

  • precision (int) – The number of decimals to round the reconstructed signal to. Default is 9.

  • peak_kwargs (dict) – Additional arguments to be passed to scipy.signal.find_peaks.

  • optimizer_kwargs (dict) – Additional arguments to be passed to scipy.optimize.curve_fit.

Returns:

peak_df – A dataframe containing information for each detected peak. This is only returned if return_peaks == True. The peaks are always stored as the attribute peaks.

Return type:

pandas.core.frame.DataFrame

Notes

This function infers the parameters defining skew-normal distributions for each peak in the chromatogram. The fitted distribution has the form

\[I = 2S_\text{max} \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)e^{-\frac{(t - r_t)^2}{2\sigma^2}}\left[1 + \text{erf}\frac{\alpha(t - r_t)}{\sqrt{2\sigma^2}}\right]\]

where \(S_\text{max}\) is the maximum signal of the peak, \(t\) is the time, \(r_t\) is the retention time, \(\sigma\) is the scale parameter, and \(\alpha\) is the skew parameter.
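The expression can be transcribed term by term using only the standard library. The function name skew_normal and the example parameter values are hypothetical; the sketch evaluates the formula exactly as written above and makes no claim about how the library computes it internally.

```python
import math


def skew_normal(t, s_max, r_t, sigma, alpha):
    """Evaluate I(t) for one peak, term by term from the expression above."""
    gaussian = math.exp(-((t - r_t) ** 2) / (2.0 * sigma ** 2))
    normalization = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    skew_term = 1.0 + math.erf(alpha * (t - r_t) / math.sqrt(2.0 * sigma ** 2))
    return 2.0 * s_max * normalization * gaussian * skew_term


# With alpha = 0 the profile is symmetric about the retention time.
symmetric_left = skew_normal(9.0, s_max=1.0, r_t=10.0, sigma=1.0, alpha=0.0)
symmetric_right = skew_normal(11.0, s_max=1.0, r_t=10.0, sigma=1.0, alpha=0.0)

# With alpha > 0 the peak tails toward later times.
skewed_left = skew_normal(9.0, s_max=1.0, r_t=10.0, sigma=1.0, alpha=3.0)
skewed_right = skew_normal(11.0, s_max=1.0, r_t=10.0, sigma=1.0, alpha=3.0)
```

Summing such profiles over all peaks in a window gives the inferred mixture that is compared with the observed signal when the fit is scored.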

map_peaks(params, loc_tolerance=0.5, include_unmapped=False)

Maps user-provided mappings to arbitrarily labeled peaks. If a linear calibration curve is also provided, the concentration will be computed.

Parameters:
  • params (dict) – A dictionary mapping each peak to a slope and intercept used for converting peak areas to units of concentration. Each key should be the compound name (e.g. “glucose”), and each value should be another dict with retention_time, slope, and intercept as keys. If only retention_time is given, the concentration will not be computed. The key retention_time will be used to map the compound to the peak_id. If unit is provided, it will be added as a column.

  • loc_tolerance (float) – The tolerance for mapping the compounds to the retention time. The default is 0.5 time units.

  • include_unmapped (bool) – If True, unmapped compounds will remain in the returned peak dataframe, but will be populated with NaN. Default is False.

Returns:

peaks – A modified peak table with the compound name and concentration added as columns.

Return type:

pandas.core.frame.DataFrame

Notes

As of v0.1.0, this function can only accommodate linear calibration functions.
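The mapping and the linear calibration can be illustrated with plain dictionaries. This sketch assumes the calibration curve was fit as area = slope × concentration + intercept, and that the closest peak within loc_tolerance is chosen; map_peaks_sketch, the compound “lactose”, and all numeric values are hypothetical.

```python
def map_peaks_sketch(peaks, params, loc_tolerance=0.5):
    """Attach compound names (and concentrations, if calibrated) to peaks."""
    mapped = []
    for compound, props in params.items():
        target = props["retention_time"]
        # Candidate peaks whose retention time lies within the tolerance.
        candidates = [p for p in peaks
                      if abs(p["retention_time"] - target) <= loc_tolerance]
        if not candidates:
            continue  # compound could not be mapped to any peak
        peak = min(candidates, key=lambda p: abs(p["retention_time"] - target))
        row = dict(peak, compound=compound)
        if "slope" in props and "intercept" in props:
            # Assumption: area = slope * concentration + intercept.
            row["concentration"] = (peak["area"] - props["intercept"]) / props["slope"]
        mapped.append(row)
    return mapped


peaks = [
    {"peak_id": 1, "retention_time": 10.1, "area": 21.0},
    {"peak_id": 2, "retention_time": 15.0, "area": 5.0},
]
params = {
    "glucose": {"retention_time": 10.0, "slope": 2.0, "intercept": 1.0},
    "lactose": {"retention_time": 14.8},  # no calibration: name only
}
mapped = map_peaks_sketch(peaks, params)
```

If the calibration was instead fit with concentration as the dependent variable, the conversion would be slope × area + intercept, so the convention should be checked against how the standards were fit.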

show(time_range=[])

Displays the chromatogram with mapped peaks if available.

Parameters:

time_range (List) – Adjust the limits to show a restricted time range. Should be provided as two floats in the range of [lower, upper]. Note that this does not affect the chromatogram directly as in crop.

Returns:

  • fig (matplotlib.figure.Figure) – The matplotlib figure object.

  • ax (matplotlib.axes._axes.Axes) – The matplotlib axis object.