straditize.binary module

A module to read in and digitize the pollen diagram

Disclaimer

Copyright (C) 2018-2019 Philipp S. Sommer

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Classes

BarDataReader(*args, **kwargs)

A DataReader for digitizing bar pollen diagrams

DataReader(image[, ax, extent, plot, …])

A class to read in and digitize the data files of the pollen diagram

LineDataReader(image[, ax, extent, plot, …])

A data reader for digitizing line diagrams

RoundedBarDataReader(*args, **kwargs)

A bar data reader that can be used for rounded bars

Functions

groupby_arr(arr)

Groupby a boolean array

only_parent(func)

Call the given func only from the parent reader

class straditize.binary.BarDataReader(*args, **kwargs)[source]

Bases: straditize.binary.DataReader

A DataReader for digitizing bar pollen diagrams

Compared to the base DataReader class, this reader implements a different strategy for digitizing and finding the samples. When digitizing the full diagram, we try to find the distinct bars using the get_bars() method. These bars might have to be split manually if they are not easy to distinguish. One key element for distinguishing two adjacent bars is the specified tolerance.

The base class works for rectangular bars. If you require rounded bars, use the RoundedBarDataReader.

Parameters

tolerance (int) – If x0 is the value in a pixel row y and x1 the value in the next pixel row y+1, then the two pixel rows are considered as belonging to different bars if abs(x1 - x0) > tolerance (see the get_bars() method and the tolerance attribute)
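The tolerance rule can be illustrated with a minimal, library-independent sketch. It is not the implementation of get_bars(); the helper name split_rows_into_bars and the half-open [start, end) row intervals are assumptions made for this example only.

```python
def split_rows_into_bars(values, tolerance=2):
    """Group consecutive pixel rows into bars.

    ``values`` holds one digitized x-value per pixel row. A new bar starts
    whenever abs(x1 - x0) > tolerance for two adjacent rows. The result is
    a list of half-open [start, end) row intervals (hypothetical layout).
    """
    if not values:
        return []
    bars = []
    start = 0
    for y in range(1, len(values)):
        if abs(values[y] - values[y - 1]) > tolerance:
            bars.append([start, y])
            start = y
    bars.append([start, len(values)])
    return bars


rows = [10, 10, 11, 25, 25, 24, 3, 3]
print(split_rows_into_bars(rows, tolerance=2))
```

A larger tolerance merges bars whose values differ only slightly, which is why bars of similar height may need to be split manually afterwards.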

Methods

create_grouper(ds, columns, *args, **kwargs)

Create the grouper that plots the results

digitize([do_split, inplace])

Reimplemented to ignore the rows between the bars

find_potential_samples(col[, min_len, …])

Find the bars in the column

from_dataset(ds, *args, **kwargs)

Create a new DataReader from a xarray.Dataset

get_bars(arr[, do_split])

Find the distinct bars in an array

shift_vertical(pixels)

Shift the columns vertically.

to_dataset([ds])

All the necessary data as a xarray.Dataset

Attributes

min_fract

The minimum fraction of overlap for two bars to be considered as the same sample (see unique_bars())

nc_meta

A mapping from variable name to meta information

samples_at_boundaries

There should not be samples at the boundaries because the first sample is in the middle of the first bar

tolerance

Tolerance to distinguish bars.

create_grouper(ds, columns, *args, **kwargs)[source]

Create the grouper that plots the results

Parameters
  • ds (xarray.Dataset) – The dataset with the data

  • columns (list of int) – The numbers of the columns for which the grouper should be created

  • fig (matplotlib.figure.Figure) – The matplotlib figure to plot on

  • x0 (float) – The left boundary of the larger Bbox of the stratigraphic diagram

  • y0 (int) – The upper boundary of the larger Bbox of the stratigraphic diagram

  • width (float) – The width of the final axes between 0 and 1

  • height (float) – The height of the final axis between 0 and 1

  • ax0 (matplotlib.axes.Axes) – The larger matplotlib axes whose bounding box shall be used.

  • transformed (bool) – If True, y-axes and x-axes have been translated (see the px2data_x() and px2data_y() methods)

  • colnames (list of str) – The column names to use in the plot

  • **kwargs – any other keyword argument that is passed to the psy_strat.stratplot.StratGroup.from_dataset() method

Returns

The grouper that visualizes the given columns in the fig

Return type

psy_strat.stratplot.StratGroup

digitize(do_split=False, inplace=True)[source]

Reimplemented to ignore the rows between the bars

Parameters
  • do_split (bool) – If True and a bar is 1.7 times longer than the mean, it is split into two.

  • inplace (bool) – If True (default), the full_df attribute is updated. Otherwise a DataFrame is returned

find_potential_samples(col, min_len=None, max_len=None, filter_func=None)[source]

Find the bars in the column

This method gets the bars in the given col and returns the distinct indices

Parameters
  • col (int) – The column for which to find the extrema

  • min_len (int) – The minimum length of one extremum. If the width of the interval where we found an extremum is smaller than that, the extremum is ignored. If None, this parameter does not have an effect (i.e. min_len=1).

  • max_len (int) – The maximum length of one extremum. If the width of the interval where we found an extremum is greater than that, the extremum is ignored. If None, this parameter does not have an effect.

  • filter_func (function) – A function to filter the extrema. It must accept one argument which is a list of integers representing the indices of the extremum in a

Returns

  • list of list of int of shape (N, 2) – The list of N extremum locations. Each tuple in this list represents an interval a where one extremum might be located

  • list of list of int – The excluded extremum locations that are ignored because we could not find a change of sign in the slope.

See also

find_samples()

classmethod from_dataset(ds, *args, **kwargs)[source]

Create a new DataReader from a xarray.Dataset

Parameters
Returns

The reader recreated from ds

Return type

DataReader

get_bars(arr, do_split=False)[source]

Find the distinct bars in an array

Parameters
  • arr (np.ndarray) – The array to find the bars in

  • do_split (bool) – If True and a bar is 1.7 times longer than the mean, it is split into two.

Returns

  • list of list of ints – The list of the distinct positions of the bars

  • list of floats – The heights for each of the bars

  • list of list of ints – The indices of bars that are longer than 1.7 times the mean of the other bars and should be split. If do_split is True, they have already been split
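The 1.7 criterion can be sketched as follows. The function names and the midpoint split are assumptions for illustration; the documentation above does not state where a long bar is actually cut.

```python
def find_long_bars(bars, factor=1.7):
    """Return the indices of bars longer than ``factor`` times the mean
    length of the *other* bars. ``bars`` are [start, end) row intervals."""
    lengths = [end - start for start, end in bars]
    long_ones = []
    for i, length in enumerate(lengths):
        others = lengths[:i] + lengths[i + 1:]
        if others and length > factor * sum(others) / len(others):
            long_ones.append(i)
    return long_ones


def split_in_half(bar):
    """Split one bar at its midpoint (an assumed strategy)."""
    start, end = bar
    middle = (start + end) // 2
    return [[start, middle], [middle, end]]


bars = [[0, 5], [5, 10], [10, 20], [20, 25]]
print(find_long_bars(bars))
print(split_in_half(bars[2]))
```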

max_len = None
min_fract = 0.9

The minimum fraction of overlap for two bars to be considered as the same sample (see unique_bars())
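As a rough illustration of an overlap test, the sketch below measures the overlap of two bars relative to the shorter bar. That normalization is an assumption; the way unique_bars() computes the fraction is not described here, and both helper names are hypothetical.

```python
def overlap_fraction(bar_a, bar_b):
    """Overlap of two [start, end) intervals relative to the shorter one."""
    overlap = min(bar_a[1], bar_b[1]) - max(bar_a[0], bar_b[0])
    shorter = min(bar_a[1] - bar_a[0], bar_b[1] - bar_b[0])
    if shorter <= 0:
        return 0.0
    return max(overlap, 0) / shorter


def same_sample(bar_a, bar_b, min_fract=0.9):
    """True if the two bars overlap by at least ``min_fract``."""
    return overlap_fraction(bar_a, bar_b) >= min_fract


print(same_sample([10, 20], [11, 20]))
print(same_sample([10, 20], [15, 25]))
```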

min_len = None
nc_meta = {'bars{reader}_bars': {'dims': ('bars{reader}_bar', 'limit'), 'long_name': 'Boundaries of bars', 'units': 'px'}, 'bars{reader}_full_data_orig': {'dims': ('ydata', 'bars{reader}_column'), 'long_name': 'Full digitized data ignoring bars', 'units': 'px'}, 'bars{reader}_max_len': {'dims': (), 'long_name': 'Maximum length of a bar'}, 'bars{reader}_min_fract': {'dims': (), 'long_name': 'Minimum fraction for overlap estimation'}, 'bars{reader}_min_len': {'dims': (), 'long_name': 'Minimum length of a bar'}, 'bars{reader}_nbars': {'dims': 'bars{reader}_column', 'long_name': 'number of bars per column'}, 'bars{reader}_nsplit': {'dims': 'bars{reader}_column', 'long_name': 'number of the splitted bars'}, 'bars{reader}_splitted': {'dims': ('bar_split', 'limit'), 'long_name': 'Boundaries of bars to split', 'units': 'px'}, 'bars{reader}_tolerance': {'dims': (), 'long_name': 'bar distinguishing tolerance'}, 'binary': {'dims': ('reader', 'ydata', 'xdata'), 'long_name': 'Binary images for data readers'}, 'col_map': {'dims': 'column', 'long_name': 'Mapping from column to reader', 'units': 'reader_index'}, 'column_ends': {'dims': 'column', 'long_name': 'Ends of the columns', 'units': 'px'}, 'column_starts': {'dims': 'column', 'long_name': 'Start of the columns', 'units': 'px'}, 'exag_col_map': {'dims': 'column', 'long_name': 'Mapping from column to exaggerated reader', 'units': 'reader_index'}, 'full_data': {'dims': ('ydata', 'column'), 'long_name': 'Full digitized data', 'units': 'px'}, 'hline': {'long_name': 'Horizontal line location', 'units': 'px'}, 'is_exaggerated': {'dims': 'reader', 'long_name': 'Exaggeration factor'}, 'occurences': {'comments': 'The locations where the only an occurence of a taxa is highlighted without value', 'dims': ('occurence', 'xy'), 'long_name': 'taxa occurences'}, 'reader': {'dims': 'reader', 'long_name': 'index of the reader'}, 'reader_cls': {'dims': 'reader', 'long_name': 'The name of the class constructor'}, 'reader_image': {'dims': 
('reader', 'ydata', 'xdata', 'rgba'), 'long_name': 'RGBA images for data readers', 'units': 'color'}, 'reader_mod': {'dims': 'reader', 'long_name': 'The module of the reader class'}, 'rough_locs': {'dims': ('sample', 'column', 'limit'), 'long_name': 'Rough locations for samples'}, 'sample': {'long_name': 'Sample location', 'units': 'px'}, 'samples': {'dims': ('sample', 'column'), 'long_name': 'Sample data', 'units': 'px'}, 'shifted': {'dims': 'column', 'long_name': 'Vertical shift per column', 'units': 'px'}, 'vline': {'long_name': 'Vertical line location', 'units': 'px'}, 'xaxis_translation': {'dims': ('reader', 'px_data', 'limit'), 'long_name': 'Pixel to data mapping for x-axis'}}
samples_at_boundaries = False

There should not be samples at the boundaries because the first sample is in the middle of the first bar

shift_vertical(pixels)[source]

Shift the columns vertically.

Parameters

pixels (list of floats) – The y-value for each column for which to shift the values. Note that these values have to be greater than or equal to 0

to_dataset(ds=None)[source]

All the necessary data as a xarray.Dataset

Parameters

ds (xarray.Dataset) – The dataset in which to insert the data. If None, a new one will be created

Returns

Either the given ds or a new xarray.Dataset instance

Return type

xarray.Dataset

tolerance = 2

Tolerance to distinguish bars. If x0 is the value in a pixel row y and x1 the value in the next pixel row y+1, then the two pixel rows are considered as belonging to different bars if abs(x1 - x0) > tolerance

class straditize.binary.DataReader(image, ax=None, extent=None, plot=True, children=[], parent=None, magni=None, plot_background=False, binary=None)[source]

Bases: straditize.label_selection.LabelSelection

A class to read in and digitize the data files of the pollen diagram

The source image is stored in the image attribute, the binary array of it is stored in the binary attribute. A labeled version, created by the skimage.morphology.label() function, is stored in the labels attribute and can be regenerated using the reset_labels() method.

Subclasses of this class should reimplement the digitize() method that digitizes the diagram, and the find_potential_samples() method.

There is always one parent reader stored in the parent attribute. This is the reader that is accessible through the straditize.straditizer.Straditizer.data_reader attribute and holds the references to other readers in its children attribute

Parameters
  • image (PIL.Image.Image) – The image of the diagram

  • ax (matplotlib.axes.Axes) – The matplotlib axes to plot on

  • extent (list) – List of four numbers specifying the extent of the image in its source. This extent will be used for the call of matplotlib.pyplot.imshow()

  • children (list of DataReader) – Child readers for other columns in case the newly created instance is the parent reader

  • parent (DataReader) – The parent reader.

  • magni (straditize.magnifier.Magnifier) – The magnifier for the given ax

  • plot_background (bool) – If True (and plot is True), a white, opaque area is plotted below the plot_im

  • binary (None) – The binary version of the given image. If not provided, the to_binary_pil() method is used with the given image

Methods

add_samples(samples[, rough_locs])

Add samples to the found ones

close()

color_labels([categorize])

The labels of the colored array

create_exaggerations_reader(factor[, cls])

Create a new exaggerations reader for this reader

create_grouper(ds, columns, fig, x0, y0, …)

Create the grouper that plots the results

create_variable(ds, vname, data, **kwargs)

Insert the data into a variable in an xr.Dataset

digitize([use_sum, inplace])

Digitize the binary image to create the full dataframe

digitize_exaggerated([fraction, absolute, …])

Merge the exaggerated values into the original digitized result

disable_label_selection(*args, **kwargs)

Disable the label selection

draw_figure()

Draw the matplotlib fig and the magni figure

end_column_selection()

End the column selection and remove the artists

estimated_column_starts([threshold])

The estimated column starts as numpy.ndarray.

find_potential_samples(col[, min_len, …])

Find potential samples in an array

find_samples([min_fract, pixel_tol])

Find the samples in the diagram

found_extrema_per_row()

Calculate how many columns have a potential sample in each pixel row

from_dataset(ds, *args, **kwargs)

Create a new DataReader from a xarray.Dataset

get_bbox_for_cols(columns, x0, y0, width, height)

Get the boundary boxes for the columns of this reader in the results plot

get_binary_for_col(col)

Get the binary array for a specific column

get_cross_column_features([min_px])

Get features that are contained in two or more columns

get_disconnected_parts([fromlast, from0, …])

Identify parts in the binary data that are not connected

get_labeled_array()

Create a connectivity-based labeled array of the binary data

get_occurences()

Extract the positions of the occurences from the selection

get_parts_at_column_ends([npixels])

Identify parts in the binary data that touch the next column

get_reader_for_col(col)

Get the reader for a specific column

get_surrounding_slopes(indices, arr)

image_array()

The RGBA values of the colored image

is_obstacle(indices, arr)

Check whether the found extremum is only an obstacle of the picture

mark_as_exaggerations(mask)

Mask the given array as exaggerated

merge_close_samples(locs[, rough_locs, …])

merged_binaries()

Get the binary data from all children and merge them into one array

merged_labels()

Get the labeled binary data from all children merged into one array

new_child_for_cols(columns, cls[, plot])

Create a new child reader for specific columns

plot_background([ax])

Plot a white layer below the plot_im

plot_color_image([ax])

Plot the colored image on a matplotlib axes

plot_full_df([ax])

Plot the lines for the digitized diagram

plot_image([ax])

Plot the binary data image on a matplotlib axes

plot_other_potential_samples([tol, …])

Plot potential samples that are not yet in the samples

plot_potential_samples([excluded, ax, plot_kws])

Plot the ranges for potential samples

plot_results(df[, ax, fig, transformed])

Plot the reconstructed diagram

plot_sample_hlines([ax])

Plot one horizontal line per sample in the sample_locs

plot_samples([ax])

Plot the diagram as lines reconstructed from the samples

px2data_x(coord)

Transform the pixel coordinates into data coordinates

recognize_hlines([fraction, min_lw, max_lw, …])

Recognize horizontal lines in the plot and subtract them

recognize_vlines([fraction, min_lw, max_lw, …])

Recognize vertical lines in the plot and subtract them

recognize_xaxes([fraction, min_lw, max_lw, …])

Recognize (and potentially remove) x-axes at bottom and top

recognize_yaxes([fraction, min_lw, max_lw, …])

Find (and potentially remove) y-axes in the image

remove_in_children(arr, amask)

Update the child reader images after having removed binary data

remove_plots()

Remove all plotted artists by this reader

reset_column_starts()

Reset the column starts, full_df, shifted

reset_image(image[, binary])

Reset the image for this straditizer

reset_labels()

Reset the labels array

reset_samples()

Reset the samples

resize_axes(grouper, bounds)

Resize the axes based on column boundaries

set_as_parent()

Set this instance as the parent reader

set_hline_locs_from_selection([selection])

Save the locations of horizontal lines

set_vline_locs_from_selection([selection])

Save the locations of vertical lines

shift_vertical(pixels[, draw])

Shift the columns vertically.

show_cross_column_features([min_px, remove])

Highlight and maybe remove cross column features

show_disconnected_parts([fromlast, from0, …])

Highlight or remove disconnected parts

show_parts_at_column_ends([npixels, remove])

Highlight or remove features that touch the column ends

show_small_parts([n, remove])

Highlight and potentially remove small features in the image

start_column_selection([use_all])

Enable the user to select columns

to_binary_pil(image[, threshold])

Convert an image to a binary

to_dataset([ds])

All the necessary data as a xarray.Dataset

to_grey_pil(image[, threshold])

Convert an image to a greyscale image

unique_bars([min_fract, asdict])

Estimate the unique bars

update_image(arr, amask)

Update the image after having removed binary data

update_rgba_image(arr, mask)

Update the RGBA image from the given 3D-array

Attributes

all_column_bounds

The boundaries for the data columns

all_column_ends

1D numpy array with the ends for all columns (including child readers)

all_column_starts

1D numpy array with the starts for all columns (including child readers)

ax

The matplotlib axes where the plot_im is plotted on

background

White rectangle that represents the background of the binary image.

binary

A 2D numpy array representing the binary version of the image

children

Child readers for specific columns.

column_bounds

The boundaries for the data columns

column_ends

1D numpy array with the ends for each column of this reader

column_starts

1D numpy array with the starts for each column of this reader

columns

The indices of the columns that are handled by this reader

exaggerated_reader

The reader that represents the exaggerations

extent

The extent of the plot_im

fig

The matplotlib figure of the ax

full_df

The full pandas.DataFrame of the digitized image

hline_locs

list of floats. The indexes of horizontal lines

image

PIL.Image.Image of the diagram part with mode RGBA

is_exaggerated

Exaggeration factor that is not 0 if this reader represents exaggeration plots

iter_all_readers

Iterate through the parent reader and its children

label_arrs

Built-in mutable sequence.

labels

A connectivity-based labeled version of the binary data

magni

the straditize.magnifier.Magnifier for the ax

magni_background

White rectangle that represents the background of the binary image in the magnifier.

magni_plot_im

magnified plot_im

min_fract

The minimum fraction of overlap for two bars to be considered as the same sample (see unique_bars())

nc_meta

A mapping from variable name to meta information

non_exaggerated_reader

The reader that represents the exaggerations

num_labels

The maximum label in the labels array

occurences

A set of tuples marking the position of an occurence

occurences_dict

A mapping from column number to a numpy array with the indices of

occurences_value

The value that is given to the occurences in the measurements

parent

Parent reader for this instance.

plot_im

the matplotlib image artist

rough_locs

The pandas.DataFrame with rough locations for the samples.

sample_locs

The pandas.DataFrame with locations and values of the

samples_at_boundaries

a boolean flag that shall indicate if we assume that the first and last

shifted

The number of pixels the columns have been shifted

strat_plot_identifier

str(object=’’) -> str

vline_locs

list of floats. The indexes of vertical lines

xaxis_px

The x indices in column pixel coordinates that are used for x-axes

add_samples(samples, rough_locs=None)[source]

Add samples to the found ones

Parameters
  • samples (series, 1d-array or DataFrame) –

    The samples. If it is a series, we assume that the index represents the y-value of the sample and the value the x-position (see xcolumns). In case of a 1d-array, we assume that the data represents the y-values of the samples. In case of a DataFrame, we assume that the columns correspond to columns in the full_df attribute and are True where we have a sample.

    Note that the y-values must be in image coordinates (see extent attribute).

  • rough_locs (DataFrame) – The rough locations of the new samples (see the rough_locs attribute)

See also

samples(), rough_locs(), find_samples(), sample_locs()

property all_column_bounds

The boundaries for the data columns

property all_column_ends

1D numpy array with the ends for all columns (including child readers)

See also

all_column_starts

The starts for all columns

all_column_bounds

The (start, end)-tuple for all of the columns

column_ends

The ends for this specific reader

property all_column_starts

1D numpy array with the starts for all columns (including child readers)

See also

all_column_ends

The ends for all columns

all_column_bounds

The (start, end)-tuple for all of the columns

column_starts

The starts for this specific reader

ax = None

The matplotlib axes where the plot_im is plotted on

background = None

White rectangle that represents the background of the binary image. This is only plotted by the parent reader

binary = None

A 2D numpy array representing the binary version of the image

children = []

Child readers for specific columns. Is not empty if and only if the parent attribute is this instance

close()[source]
color_labels(categorize=1)[source]

The labels of the colored array

property column_bounds

The boundaries for the data columns

property column_ends

1D numpy array with the ends for each column of this reader

See also

column_starts

The starts for each column

column_bounds

The (start, end)-tuple for each of the columns

all_column_ends

The ends for all columns, including child readers

property column_starts

1D numpy array with the starts for each column of this reader

See also

column_ends

The ends for each column

column_bounds

The (start, end)-tuple for each of the columns

all_column_starts

The starts for all columns, including child readers

property columns

The indices of the columns that are handled by this reader

create_exaggerations_reader(factor, cls=None)[source]

Create a new exaggerations reader for this reader

Parameters
Returns

The new exaggerated reader

Return type

instance of cls

create_grouper(ds, columns, fig, x0, y0, width, height, ax0=None, transformed=True, colnames=None, **kwargs)[source]

Create the grouper that plots the results

Parameters
  • ds (xarray.Dataset) – The dataset with the data

  • columns (list of int) – The numbers of the columns for which the grouper should be created

  • fig (matplotlib.figure.Figure) – The matplotlib figure to plot on

  • x0 (float) – The left boundary of the larger Bbox of the stratigraphic diagram

  • y0 (int) – The upper boundary of the larger Bbox of the stratigraphic diagram

  • width (float) – The width of the final axes between 0 and 1

  • height (float) – The height of the final axis between 0 and 1

  • ax0 (matplotlib.axes.Axes) – The larger matplotlib axes whose bounding box shall be used.

  • transformed (bool) – If True, y-axes and x-axes have been translated (see the px2data_x() and px2data_y() methods)

  • colnames (list of str) – The column names to use in the plot

  • **kwargs – any other keyword argument that is passed to the psy_strat.stratplot.StratGroup.from_dataset() method

Returns

The grouper that visualizes the given columns in the fig

Return type

psy_strat.stratplot.StratGroup

create_variable(ds, vname, data, **kwargs)[source]

Insert the data into a variable in an xr.Dataset

Parameters
  • ds (xarray.Dataset) – The destination dataset

  • vname (str) – The name of the variable in the nc_meta mapping. This name might include {reader} which will then be replaced by the number of the reader in the iter_all_readers attribute

  • data (np.ndarray) – The numpy array to store in the variable specified by vname

  • **kwargs – A mapping from dimension to slicer that should be used to slice the dataset

Returns

The resolved vname that has been used in the dataset

Return type

str
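The {reader} placeholder can be pictured as ordinary string formatting. The sketch below assumes that the name is resolved with str.format and that the reader number is its position in iter_all_readers; resolve_vname is a hypothetical helper, not part of straditize.

```python
def resolve_vname(vname, reader_index):
    """Fill the optional ``{reader}`` placeholder of a variable name."""
    return vname.format(reader=reader_index)


# Names taken from the nc_meta mapping shown for BarDataReader
print(resolve_vname('bars{reader}_tolerance', 1))
print(resolve_vname('full_data', 1))
```

Names without a placeholder pass through unchanged, so the same call works for shared variables such as full_data.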

digitize(use_sum=False, inplace=True)[source]

Digitize the binary image to create the full dataframe

Parameters
  • use_sum (bool) – If True, the sum of cells that are not background are used for each column, otherwise the value of the cell is used that has the maximal distance to the column start for each row

  • inplace (bool) – If True (default), the full_df attribute is updated. Otherwise a DataFrame is returned

Returns

The digitization result if inplace is False, otherwise None

Return type

None or pandas.DataFrame
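The two digitization modes can be compared with a small sketch on plain lists. It assumes that each row lists the cells of one column starting at the column start, with 1 for data and 0 for background, and that the distance is counted as index + 1; digitize_column is a hypothetical name and not the library routine.

```python
def digitize_column(binary_rows, use_sum=False):
    """Return one value per pixel row for a single column.

    use_sum=True  -> number of non-background cells in the row
    use_sum=False -> distance of the farthest non-background cell from
                     the column start (counted as index + 1, an assumption)
    """
    result = []
    for row in binary_rows:
        if use_sum:
            result.append(sum(1 for cell in row if cell))
        else:
            filled = [x for x, cell in enumerate(row) if cell]
            result.append(filled[-1] + 1 if filled else 0)
    return result


rows = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],  # contains a gap
    [0, 0, 0, 0, 0],
]
print(digitize_column(rows, use_sum=True))
print(digitize_column(rows, use_sum=False))
```

The two modes only differ for rows with gaps, such as the second row above.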

digitize_exaggerated(fraction=0.05, absolute=8, inplace=True, return_mask=False)[source]

Merge the exaggerated values into the original digitized result

Parameters
  • fraction (float between 0 and 1) – The fraction under which the exaggerated data should be used. Set this to 0 to ignore it.

  • absolute (int) – The absolute value under which the exaggerated data should be used. Set this to 0 to ignore it.

  • inplace (bool) – If True (default), the full_df attribute is updated. Otherwise a DataFrame is returned

  • return_mask (bool) – If True, a boolean 2D array is returned indicating where the exaggerations have been used

Returns

  • pandas.DataFrame or None – If inplace is False, the digitized result. Otherwise, if return_mask is True, the mask where the exaggerated results have been used. Otherwise None

  • pandas.DataFrame, optionally – If inplace is False and return_mask is True, a pandas.DataFrame containing the boolean mask where the exaggerated results have been used. Otherwise, this is skipped

disable_label_selection(*args, **kwargs)[source]

Disable the label selection

This will disconnect the pick_event and remove the selection images

Parameters

remove (bool) – Whether to remove the selection image from the plot. If None, the _remove attribute is used

See also

enable_label_selection(), remove_selected_labels()

draw_figure()[source]

Draw the matplotlib fig and the magni figure

end_column_selection()[source]

End the column selection and remove the artists

estimated_column_starts(threshold=None)[source]

The estimated column starts as numpy.ndarray.

We assume that a new column starts at a pixel column $i$ if

  1. the previous pixel column $i-1$ did not contain any data ($D(i-1) = 0$)

  2. the amount of data points doubled compared to $i-1$ ($D(i) \geq 2\cdot D(i-1)$)

  3. the amount of data points steadily increases within the next few columns to a value twice as large as the previous column ($D(i+n) \geq 2\cdot D(i-1)$ with $n>0$ and $D(i+j) \geq D(i)$ for all $0 < j \leq n$)

Each potential column start must also be covered by the given threshold.

Parameters

threshold (float between 0 and 1) – The fraction that has to be covered to assume a valid column start. By default, 0.1 (i.e. 10 percent)

Returns

The starts for each column

Return type

np.ndarray
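A simplified sketch of the first two rules is given below. It treats the rules as alternatives, requires the candidate column itself to contain data, and leaves out rule 3 and the threshold check; these are assumptions of the example, and estimate_column_starts is a hypothetical helper.

```python
def estimate_column_starts(counts):
    """Estimate column starts from the number of data points per pixel
    column, D(i) = counts[i], using only rules 1 and 2."""
    starts = []
    for i, d in enumerate(counts):
        if d == 0:
            continue  # a column cannot start where there is no data
        prev = counts[i - 1] if i > 0 else 0
        if prev == 0:            # rule 1: D(i-1) = 0
            starts.append(i)
        elif d >= 2 * prev:      # rule 2: D(i) >= 2 * D(i-1)
            starts.append(i)
    return starts


counts = [0, 5, 5, 0, 0, 3, 3, 8, 8]
print(estimate_column_starts(counts))
```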

property exaggerated_reader

The reader that represents the exaggerations

property extent

The extent of the plot_im

property fig

The matplotlib figure of the ax

find_potential_samples(col, min_len=None, max_len=None, filter_func=None)[source]

Find potential samples in an array

This method finds extrema in an array and returns the indices where the extremum might be. The algorithm filters out obstacles by first going over the array and checking that there is a change of sign in the slope at the found extremum. If there is none, the extremum is ignored and flattened out.

Parameters
  • col (int) – The column for which to find the extrema

  • min_len (int) – The minimum length of one extremum. If the width of the interval where we found an extremum is smaller than that, the extremum is ignored. If None, this parameter does not have an effect (i.e. min_len=1).

  • max_len (int) – The maximum length of one extremum. If the width of the interval where we found an extremum is greater than that, the extremum is ignored. If None, this parameter does not have an effect.

  • filter_func (function) – A function to filter the extrema. It must accept one argument which is a list of integers representing the indices of the extremum in a

Returns

  • list of list of int of shape (N, 2) – The list of N extremum locations. Each tuple in this list represents an interval a where one extremum might be located

  • list of list of int – The excluded extremum locations that are ignored because we could not find a change of sign in the slope.

See also

find_samples()
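The idea of keeping only extrema with a change of sign in the slope can be sketched on a plain list of values. This is an illustration under assumptions, not the library algorithm: intervals are half-open [start, end) runs of equal values, runs touching the array boundaries are skipped, and both function names are hypothetical.

```python
def find_extrema_intervals(values):
    """Return (included, excluded) lists of [start, end) intervals.

    A run of equal values is included if the slope changes sign across
    it. Flat steps of more than one row without a sign change are
    reported as excluded.
    """
    included, excluded = [], []
    n = len(values)
    start = 0
    while start < n:
        end = start + 1
        while end < n and values[end] == values[start]:
            end += 1
        if start > 0 and end < n:
            left = values[start] - values[start - 1]   # slope into the run
            right = values[end] - values[end - 1]      # slope out of the run
            if left * right < 0:
                included.append([start, end])
            elif end - start > 1:
                excluded.append([start, end])
        start = end
    return included, excluded


def filter_by_length(intervals, min_len=None, max_len=None):
    """Drop intervals that are shorter than min_len or longer than max_len."""
    return [
        [s, e] for s, e in intervals
        if (min_len is None or e - s >= min_len)
        and (max_len is None or e - s <= max_len)
    ]


included, excluded = find_extrema_intervals([1, 3, 5, 5, 2, 2, 1, 4])
print(included, excluded)
print(filter_by_length(included, min_len=2))
```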

find_samples(min_fract=None, pixel_tol=5, *args, **kwargs)[source]

Find the samples in the diagram

This function finds the samples using the find_potential_samples() function. It combines the found extrema from all columns and estimates the exact location using an interpolation of the slope

Parameters
  • min_fract (float) – The minimum fraction between 0 and 1 that two bars have to overlap such that they are considered as representing the same sample. If None, the min_fract attribute is used

  • min_len (int) – The minimum length of one extremum. If the width of the interval where we found an extremum is smaller than that, the extremum is ignored. If None, this parameter does not have an effect (i.e. min_len=1).

  • max_len (int) – The maximum length of one extremum. If the width of the interval where we found an extremum is greater than that, the extremum is ignored. If None, this parameter does not have an effect.

  • filter_func (function) – A function to filter the extrema. It must accept one argument which is a list of integers representing the indices of the extremum in a

Returns

  • pandas.DataFrame – The x- and y-locations of the samples. The index is the y-location, the columns are the columns in the full_df.

  • pandas.DataFrame – The rough locations of the samples. The index is the y-location of the columns, the values are lists of the potential sample locations.

found_extrema_per_row()[source]

Calculate how many columns have a potential sample in each pixel row

Returns

A series with one entry per pixel row. The values are the number of columns in the diagram that have a potential sample noted in the rough_locs

Return type

pandas.Series
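The counting can be sketched without pandas. The layout used here, a dict that maps each column number to a list of [start, end) row intervals, is an assumption standing in for the rough_locs DataFrame, and the sketch returns a plain list instead of a pandas.Series.

```python
def count_extrema_per_row(rough_locs, n_rows):
    """Count, for every pixel row, the columns with a potential sample."""
    counts = [0] * n_rows
    for intervals in rough_locs.values():
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end))
        for row in covered:  # each column counts at most once per row
            counts[row] += 1
    return counts


rough_locs = {0: [[1, 3]], 1: [[2, 4]], 2: [[2, 3], [5, 6]]}
print(count_extrema_per_row(rough_locs, n_rows=6))
```

Rows with a high count are the most likely sample locations, because several columns agree on them.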

classmethod from_dataset(ds, *args, **kwargs)[source]

Create a new DataReader from a xarray.Dataset

Parameters
Returns

The reader recreated from ds

Return type

DataReader

property full_df

The full pandas.DataFrame of the digitized image

get_bbox_for_cols(columns, x0, y0, width, height)[source]

Get the boundary boxes for the columns of this reader in the results plot

This method is used by the plot_results() method to get the Bbox for a psy_strat.stratplot.StratGroup grouper

Parameters
  • columns (list of int) – The column numbers to use

  • x0 (float) – The left boundary of the larger Bbox of the stratigraphic diagram

  • y0 (int) – The upper boundary of the larger Bbox of the stratigraphic diagram

  • width (float) – The width of the final axes between 0 and 1

  • height (float) – The height of the final axis between 0 and 1

Returns

The boundary box for the given columns in the matplotlib figure

Return type

matplotlib.transforms.Bbox

See also

plot_results()

get_binary_for_col(col)[source]

Get the binary array for a specific column

get_cross_column_features(min_px=50)[source]

Get features that are contained in two or more columns

Parameters

min_px (int) – The number of pixels that have to be contained in each column

Returns

The 2D boolean mask with the same shape as the binary array that is True if a data pixel is considered to belong to a cross column feature

Return type

np.ndarray of dtype bool

get_disconnected_parts(fromlast=5, from0=10, cross_column=False)[source]

Identify parts in the binary data that are not connected

Parameters
  • fromlast (int) – A pixel x1 > x0 is considered disconnected if x1 - x0 >= fromlast. If this is 0, it is ignored and only from0 is considered.

  • from0 (int) – A pixel is considered disconnected if it is more than from0 pixels away from the column start. If this is 0, it is ignored and only fromlast is considered.

  • cross_column (bool) – If False, disconnected features are only marked in the column where the disconnection has been detected. Otherwise the entire feature is marked

Returns

The 2D boolean mask with the same shape as the binary array that is True where a data pixel is considered to be disconnected

Return type

np.ndarray of dtype bool

get_labeled_array()[source]

Create a connectivity-based labeled array of the binary data

get_occurences()[source]

Extract the positions of the occurrences from the selection

get_parts_at_column_ends(npixels=2)[source]

Identify parts in the binary data that touch the next column

Parameters

npixels (int) – If a data pixel is less than npixels away from the column end, it is considered to be at the column end and marked

Returns

A boolean mask with the same shape as the binary data that is True where a pixel is considered to be at the column end

Return type

np.ndarray of dtype bool
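
The npixels rule can be illustrated with plain NumPy. The following is a minimal sketch, not the library's implementation: the helper name mask_near_column_ends, its column_starts and column_ends arguments (with exclusive end pixels), and measuring the distance to the last pixel of each column are all assumptions made for illustration.

```python
import numpy as np


def mask_near_column_ends(binary, column_starts, column_ends, npixels=2):
    """Mark data pixels lying less than `npixels` from their column end.

    Hypothetical helper. `column_ends` holds the exclusive end pixel of
    each column and the distance is measured to the last pixel of the
    column; both are assumptions of this sketch.
    """
    binary = np.asarray(binary)
    mask = np.zeros(binary.shape, dtype=bool)
    for start, end in zip(column_starts, column_ends):
        # only the last `npixels` pixel columns can be "at the column end"
        first = max(start, end - npixels)
        mask[:, first:end] = binary[:, first:end] > 0
    return mask


# two columns of four pixels each (made-up data)
binary = np.array([
    [1, 1, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0],
])
mask = mask_near_column_ends(binary, column_starts=[0, 4],
                             column_ends=[4, 8], npixels=2)
```

Only the bar in the first row of the second column reaches the last two pixels of its column, so only those two cells are marked.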

get_reader_for_col(col)[source]

Get the reader for a specific column

Parameters

col (int) – The column of interest

Returns

Either the reader or None if no reader could be found

Return type

DataReader or None

get_surrounding_slopes(indices, arr)[source]
hline_locs = None

list of floats. The indexes of horizontal lines

image = None

PIL.Image.Image of the diagram part with mode RGBA

image_array()[source]

The RGBA values of the colored image

is_exaggerated = 0

Exaggeration factor that is not 0 if this reader represents exaggeration plots

is_obstacle(indices, arr)[source]

Check whether the found extremum is only an obstacle of the picture

property iter_all_readers

Iterate through the parent reader and its children

label_arrs = ['binary', 'labels', 'image_array']
labels = None

A connectivity-based labeled version of the binary data

magni = None

the straditize.magnifier.Magnifier for the ax

magni_background = None

White rectangle that represents the background of the binary image in the magnifier. This is only plotted by the parent reader

magni_color_plot_im = None
magni_plot_im = None

magnified plot_im

mark_as_exaggerations(mask)[source]

Mark the given array as exaggerated

Parameters

mask (2D np.ndarray of dtype bool) – A mask with the same shape as the binary array that is True if a cell should be interpreted as the visualization of an exaggeration

merge_close_samples(locs, rough_locs=None, pixel_tol=5)[source]
merged_binaries()[source]

Get the binary data from all children and merge them into one array

Returns

The binary image with the same shape as the binary data

Return type

np.ndarray of dtype int

merged_labels()[source]

Get the labeled binary data from all children merged into one array

Returns

The labeled binary image with the same shape as the label data

Return type

np.ndarray of dtype int

min_fract = 0.9

The minimum fraction of overlap for two bars to be considered as the same sample (see unique_bars())

nc_meta = {'binary': {'dims': ('reader', 'ydata', 'xdata'), 'long_name': 'Binary images for data readers'}, 'col_map': {'dims': 'column', 'long_name': 'Mapping from column to reader', 'units': 'reader_index'}, 'column_ends': {'dims': 'column', 'long_name': 'Ends of the columns', 'units': 'px'}, 'column_starts': {'dims': 'column', 'long_name': 'Start of the columns', 'units': 'px'}, 'exag_col_map': {'dims': 'column', 'long_name': 'Mapping from column to exaggerated reader', 'units': 'reader_index'}, 'full_data': {'dims': ('ydata', 'column'), 'long_name': 'Full digitized data', 'units': 'px'}, 'hline': {'long_name': 'Horizontal line location', 'units': 'px'}, 'is_exaggerated': {'dims': 'reader', 'long_name': 'Exaggeration factor'}, 'occurences': {'comments': 'The locations where the only an occurence of a taxa is highlighted without value', 'dims': ('occurence', 'xy'), 'long_name': 'taxa occurences'}, 'reader': {'dims': 'reader', 'long_name': 'index of the reader'}, 'reader_cls': {'dims': 'reader', 'long_name': 'The name of the class constructor'}, 'reader_image': {'dims': ('reader', 'ydata', 'xdata', 'rgba'), 'long_name': 'RGBA images for data readers', 'units': 'color'}, 'reader_mod': {'dims': 'reader', 'long_name': 'The module of the reader class'}, 'rough_locs': {'dims': ('sample', 'column', 'limit'), 'long_name': 'Rough locations for samples'}, 'sample': {'long_name': 'Sample location', 'units': 'px'}, 'samples': {'dims': ('sample', 'column'), 'long_name': 'Sample data', 'units': 'px'}, 'shifted': {'dims': 'column', 'long_name': 'Vertical shift per column', 'units': 'px'}, 'vline': {'long_name': 'Vertical line location', 'units': 'px'}, 'xaxis_translation': {'dims': ('reader', 'px_data', 'limit'), 'long_name': 'Pixel to data mapping for x-axis'}}

A mapping from variable name to meta information

new_child_for_cols(columns, cls, plot=True)[source]

Create a new child reader for specific columns

Parameters
  • columns (list of int) – The columns for the new reader

  • cls (type) – The DataReader subclass

  • plot (bool) – Plot the binary image

Returns

The new reader for the specified columns

Return type

instance of cls

property non_exaggerated_reader

The reader that represents the exaggerations

property num_labels

The maximum label in the labels array

property occurences

A set of tuples marking the position of an occurrence

An occurrence, motivated by pollen diagrams, just highlights the existence at a certain point without giving the exact value. In pollen diagrams, these are usually taxa that were found but have a percentage of less than 0.5 %.

This set of tuples (x, y) contains the coordinates of the occurrences. The first value in each tuple is the y-value, the second the x-value.

See also

occurences_dict

A mapping from column number to occurrences

property occurences_dict

A mapping from column number to a numpy array with the indices of an occurrence

occurences_value = -9999

The value that is given to the occurrences in the measurements
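
Because occurences_value is a sentinel rather than a measurement, it usually has to be separated from real values before any statistics are computed. The following sketch shows one way to do that in plain Python; the column values are made up and the constant simply mirrors the attribute above.

```python
OCCURENCES_VALUE = -9999  # mirrors the occurences_value attribute

# hypothetical digitized values for one column, in data units
column = [12.5, OCCURENCES_VALUE, 0.0, 3.2, OCCURENCES_VALUE]

# real measurements, with the sentinel filtered out
measured = [v for v in column if v != OCCURENCES_VALUE]

# positions where only the presence of the taxon was noted
occurrence_rows = [i for i, v in enumerate(column) if v == OCCURENCES_VALUE]
```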

parent = None

Parent reader for this instance. Might be the instance itself

plot_background(ax=None, **kwargs)[source]

Plot a white layer below the plot_im

Parameters
plot_color_image(ax=None, **kwargs)[source]

Plot the colored image on a matplotlib axes

Parameters
plot_full_df(ax=None, *args, **kwargs)[source]

Plot the lines for the digitized diagram

Parameters
plot_im = None

the matplotlib image artist

plot_image(ax=None, **kwargs)[source]

Plot the binary data image on a matplotlib axes

Parameters
  • ax (matplotlib.axes.Axes) – The matplotlib axes to plot on. If not given, the ax attribute is used and (if this is None, too) a new figure is created

  • **kwargs – Any other keyword that is given to the matplotlib.pyplot.imshow() function

plot_other_potential_samples(tol=1, already_found=None, *args, **kwargs)[source]

Plot potential samples that are not yet in the samples attribute

Parameters
  • tol (int) – The pixel tolerance for a sample. If the distance between a potential sample and all already existing samples is greater than tol, the potential sample will be plotted

  • already_found (np.ndarray) – The pixel rows of samples that have already been found. If not specified, the index of the sample_locs is used

  • excluded (bool) – If True, plot the excluded samples instead of the included samples (see the return values in find_potential_samples())

  • ax (matplotlib.axes.Axes) – The matplotlib axes to plot on

  • plot_kws (dict) – Any other keyword argument that is passed to the matplotlib.pyplot.plot() function. By default, this is equal to {'marker': '+'}

  • min_len (int) – The minimum length of one extremum. If the width of the interval where we found an extremum is smaller than that, the extremum is ignored. If None, this parameter does not have an effect (i.e. min_len=1).

  • max_len (int) – The maximum length of one extremum. If the width of the interval where we found an extremum is greater than that, the extremum is ignored. If None, this parameter does not have an effect.

  • filter_func (function) – A function to filter the extrema. It must accept one argument which is a list of integers representing the indices of the extremum in a

plot_potential_samples(excluded=False, ax=None, plot_kws={}, *args, **kwargs)[source]

Plot the ranges for potential samples

This method plots the rough locations of potential samples (see find_potential_samples())

Parameters
  • excluded (bool) – If True, plot the excluded samples instead of the included samples (see the return values in find_potential_samples())

  • ax (matplotlib.axes.Axes) – The matplotlib axes to plot on

  • plot_kws (dict) – Any other keyword argument that is passed to the matplotlib.pyplot.plot() function. By default, this is equal to {'marker': '+'}

  • min_len (int) – The minimum length of one extremum. If the width of the interval where we found an extremum is smaller than that, the extremum is ignored. If None, this parameter does not have an effect (i.e. min_len=1).

  • max_len (int) – The maximum length of one extremum. If the width of the interval where we found an extremum is greater than that, the extremum is ignored. If None, this parameter does not have an effect.

  • filter_func (function) – A function to filter the extrema. It must accept one argument which is a list of integers representing the indices of the extremum in a

plot_results(df, ax=None, fig=None, transformed=True)[source]

Plot the reconstructed diagram

This method plots the reconstructed diagram using the psy-strat module.

Parameters
Returns

  • psyplot.project.Project – The newly created psyplot project with the plotters

  • list of psy_strat.stratplot.StratGroup instances – The groupers for the different columns

plot_sample_hlines(ax=None, **kwargs)[source]

Plot one horizontal line per sample in the sample_locs

Parameters
plot_samples(ax=None, *args, **kwargs)[source]

Plot the diagram as lines reconstructed from the samples

Parameters
px2data_x(coord)[source]

Transform the pixel coordinates into data coordinates

Parameters

coord (1D np.ndarray) – The coordinate values in pixels

Returns

The numpy array starting from 0 with transformed coordinates

Return type

np.ndarray

Notes

Since the x-axes for stratigraphic plots are usually interrupted, the return values here are relative and therefore always start from 0
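
A relative pixel-to-data translation of this kind can be sketched as a linear scaling. The example below is an illustration under stated assumptions, not the method's actual code: the function name px_to_data is hypothetical, and it assumes that the translation is defined by exactly two reference points, given as pixel positions (xaxis_px) and the data values they represent (xaxis_data).

```python
def px_to_data(coords, xaxis_px, xaxis_data):
    """Linearly map pixel offsets to data units, starting from 0.

    Hypothetical helper. The two-point form of `xaxis_px` and
    `xaxis_data` is an assumption of this sketch.
    """
    (px0, px1), (d0, d1) = xaxis_px, xaxis_data
    scale = (d1 - d0) / (px1 - px0)  # data units per pixel
    # the result is relative: a pixel offset of 0 always maps to 0
    return [c * scale for c in coords]


# 100 pixels correspond to 40 data units (for instance percent)
values = px_to_data([0, 25, 100], xaxis_px=(10, 110), xaxis_data=(0, 40))
```

Only the scale between the two reference points enters the result, which is why the returned values start from 0 regardless of where the axis sits in the image.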

recognize_hlines(fraction=0.3, min_lw=1, max_lw=None, remove=False, **kwargs)[source]

Recognize horizontal lines in the plot and subtract them

This method removes horizontal lines in the data diagram, i.e. rows whose non-background cells cover at least the specified fraction of the row.

Parameters
  • fraction (float) – The fraction (between 0 and 1) that has to be covered to recognize a horizontal line

  • min_lw (int) – The minimum line width for a line

  • max_lw (int) – The maximum line width for a line or None if it should be ignored

  • remove (bool) – If True, they will be removed immediately, otherwise they are displayed using the enable_label_selection() method and can be removed through the remove_selected_labels() method

Other Parameters

``**kwargs`` – Additional keywords are passed to the enable_label_selection() method in case remove is False

Notes

This method has to be called before the digitize() method!
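
The row-coverage rule described for recognize_hlines() can be sketched with NumPy. This is a simplified illustration, not the library's implementation: the helper name find_hline_rows is hypothetical, and treating each run of consecutive covered rows as one line whose width is compared with min_lw and max_lw is an assumption.

```python
import numpy as np


def find_hline_rows(binary, fraction=0.3, min_lw=1, max_lw=None):
    """Return the pixel rows that belong to horizontal lines.

    Hypothetical helper that only illustrates the coverage rule.
    """
    binary = np.asarray(binary)
    # fraction of non-background cells per pixel row
    covered = (binary > 0).mean(axis=1) >= fraction
    rows = np.flatnonzero(covered)
    if rows.size == 0:
        return rows
    # split into runs of consecutive rows; each run is one candidate line
    runs = np.split(rows, np.flatnonzero(np.diff(rows) > 1) + 1)
    keep = [run for run in runs
            if len(run) >= min_lw and (max_lw is None or len(run) <= max_lw)]
    return np.concatenate(keep) if keep else rows[:0]


binary = np.zeros((6, 10), dtype=int)
binary[1, :] = 1       # a thin line across the full width
binary[3:5, 2:9] = 1   # a two-pixel-wide feature covering 70 % of the row
binary[5, 0] = 1       # a single data pixel, not a line

thin_only = find_hline_rows(binary, fraction=0.3, max_lw=1)
all_lines = find_hline_rows(binary, fraction=0.3)
```

Restricting max_lw keeps wide features such as bars from being mistaken for lines, which is the main reason to set it.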

recognize_vlines(fraction=0.3, min_lw=1, max_lw=None, remove=False, **kwargs)[source]

Recognize vertical lines in the plot and subtract them

This method removes vertical lines in the data diagram, i.e. columns whose non-background cells cover at least the specified fraction of the column.

Parameters
  • fraction (float) – The fraction (between 0 and 1) that has to be covered to recognize a vertical line

  • min_lw (int) – The minimum line width for a line

  • max_lw (int) – The maximum line width for a line or None if it should be ignored

  • remove (bool) – If True, they will be removed immediately, otherwise they are displayed using the enable_label_selection() method and can be removed through the remove_selected_labels() method

Other Parameters

``**kwargs`` – Additional keywords are passed to the enable_label_selection() method in case remove is False

Notes

This method should be called before the column starts are set

recognize_xaxes(fraction=0.3, min_lw=1, max_lw=None, remove=False, **kwargs)[source]

Recognize (and potentially remove) x-axes at bottom and top

Parameters
  • fraction (float) – The fraction (between 0 and 1) that has to be covered to recognize an x-axis

  • min_lw (int) – The minimum line width of an axis

  • max_lw (int) – The maximum line width of an axis. If not specified, it will be ignored

  • remove (bool) – If True, they will be removed immediately, otherwise they are displayed using the enable_label_selection() method and can be removed through the remove_selected_labels() method

recognize_yaxes(fraction=0.3, min_lw=0, max_lw=None, remove=False)[source]

Find (and potentially remove) y-axes in the image

Parameters
  • fraction (float) – The fraction (between 0 and 1) that has to be covered to recognize a y-axis

  • min_lw (int) – The minimum line width of an axis

  • max_lw (int) – The maximum line width of an axis. If not specified, the median of the axes widths is taken

  • remove (bool) – If True, they will be removed immediately, otherwise they are displayed using the enable_label_selection() method and can be removed through the remove_selected_labels() method

remove_in_children(arr, amask)[source]

Update the child reader images after having removed binary data

Calls the update_image() and update_rgba_image() methods for all children

remove_plots()[source]

Remove all plotted artists by this reader

reset_column_starts()[source]

Reset the column starts, full_df, shifted and occurences

reset_image(image, binary=False)[source]

Reset the image for this straditizer

Parameters
  • image (PIL.Image.Image) – The new image

  • binary (bool) – If True, then the image is considered as the binary image and the image attribute is not touched

reset_labels()[source]

Reset the labels array

reset_samples()[source]

Reset the samples

resize_axes(grouper, bounds)[source]

Resize the axes based on column boundaries

This method sets the x-limits for the different columns to the given bounds and resizes the axes

Parameters
  • grouper (psy_strat.stratplot.StratGroup) – The grouper that manages the plot

  • bounds (np.ndarray of shape (N, 2)) – The boundaries for the columns handled by the grouper

property rough_locs

The pandas.DataFrame with rough locations for the samples. It has one row per sample in the sample_locs dataframe and ncols * 2 columns, where ncols is the number of columns in the sample_locs.

If the potential sample sample_locs.iloc[i, col] ranges from j to k (see the find_potential_samples() method), the cell at rough_locs.iloc[i, col * 2] specifies the first y-pixel (j) and rough_locs.iloc[i, col * 2 + 1] the last y-pixel (+1), i.e. k, where this sample might be located
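
The interleaved column layout can be illustrated with a small lookup. The sketch below uses a plain list of lists instead of a pandas.DataFrame so that it stays self-contained; the helper name rough_range and the numbers are made up for illustration.

```python
def rough_range(rough_locs, i, col):
    """Return the (first, last + 1) pixel rows for sample i in column col.

    `rough_locs` is any row-major 2D table (here a list of lists) laid
    out as described above, with two entries per diagram column.
    """
    row = rough_locs[i]
    return row[col * 2], row[col * 2 + 1]


# two samples, two diagram columns -> four entries per row (made-up values)
rough_locs = [
    [10, 14, 11, 13],
    [40, 46, 41, 45],
]
j, k = rough_range(rough_locs, i=1, col=1)

# the upper limit is exclusive, so it can be passed to range() directly
candidate_rows = list(range(j, k))
```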

property sample_locs

The pandas.DataFrame with locations and values of the samples

samples_at_boundaries = True

A boolean flag indicating whether the first and last rows are assumed to be samples if they contain non-zero values

set_as_parent()[source]

Set this instance as the parent reader

set_hline_locs_from_selection(selection=None)[source]

Save the locations of horizontal lines

This method takes every pixel row where at least 30% is selected and saves it in the hline_locs attribute. The digitize method will interpolate at these indices.

set_vline_locs_from_selection(selection=None)[source]

Save the locations of vertical lines

This method takes every pixel column where at least 30% is selected and saves it in the vline_locs attribute.

shift_vertical(pixels, draw=True)[source]

Shift the columns vertically.

Parameters
  • pixels (list of floats) – The y-value for each column for which to shift the values. Note that these values have to be greater than or equal to 0

  • draw (bool) – If True, the ax is drawn at the end

shifted = None

The number of pixels the columns have been shifted

show_cross_column_features(min_px=50, remove=False, **kwargs)[source]

Highlight and maybe remove cross column features

Parameters
  • min_px (int) – The number of pixels that have to be contained in each column

  • remove (bool) – If True, remove the data in the binary array, etc. If False, the enable_label_selection() method is invoked and the user can select the features to remove

  • select_all (bool) – If True and remove is False, all labels in arr will be selected and the given selection is ignored

  • selection (np.ndarray of dtype bool) – A boolean mask with the same shape as arr that is True where a pixel should be selected. If remove is True, only this mask will be used.

  • img (matplotlib image) – The image for the selection. If not provided, a new image is created

  • set_picker (bool) – If True, connect the matplotlib pick_event to the pick_label() method

show_disconnected_parts(fromlast=5, from0=10, remove=False, **kwargs)[source]

Highlight or remove disconnected parts

Parameters
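  • fromlast (int) – A pixel x1 > x0 is considered disconnected if x1 - x0 >= fromlast. If this is 0, it is ignored and only from0 is considered.

  • from0 (int) – A pixel is considered as disconnected if it is more than from0 pixels away from the column start. If this is 0, it is ignored and only fromlast is considered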

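  • remove (bool) – If True, remove the data in the binary array, etc. If False, the enable_label_selection() method is invoked and the user can select the features to remove

  • select_all (bool) – If True and remove is False, all labels in arr will be selected and the given selection is ignored

  • selection (np.ndarray of dtype bool) – A boolean mask with the same shape as arr that is True where a pixel should be selected. If remove is True, only this mask will be used.

  • img (matplotlib image) – The image for the selection. If not provided, a new image is created

  • set_picker (bool) – If True, connect the matplotlib pick_event to the pick_label() method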

show_parts_at_column_ends(npixels=2, remove=False, **kwargs)[source]

Highlight or remove features that touch the column ends

Parameters
  • npixels (int) – If a data pixel is less than npixels away from the column end, it is considered to be at the column end and marked

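  • remove (bool) – If True, remove the data in the binary array, etc. If False, the enable_label_selection() method is invoked and the user can select the features to remove

  • select_all (bool) – If True and remove is False, all labels in arr will be selected and the given selection is ignored

  • selection (np.ndarray of dtype bool) – A boolean mask with the same shape as arr that is True where a pixel should be selected. If remove is True, only this mask will be used.

  • img (matplotlib image) – The image for the selection. If not provided, a new image is created

  • set_picker (bool) – If True, connect the matplotlib pick_event to the pick_label() method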

show_small_parts(n=10, remove=False, **kwargs)[source]

Highlight and potentially remove small features in the image

Parameters
  • n (int) – The maximal size of a feature to be considered as small

  • remove (bool) – If True, remove the data in the binary array, etc. If False, the enable_label_selection() method is invoked and the user can select the features to remove

  • select_all (bool) – If True and remove is False, all labels in arr will be selected and the given selection is ignored

  • selection (np.ndarray of dtype bool) – A boolean mask with the same shape as arr that is True where a pixel should be selected. If remove is True, only this mask will be used.

  • img (matplotlib image) – The image for the selection. If not provided, a new image is created

  • set_picker (bool) – If True, connect the matplotlib pick_event to the pick_label() method

See also

skimage.morphology.remove_small_objects()

start_column_selection(use_all=False)[source]

Enable the user to select columns

Parameters

use_all (bool) – If True, all columns can be selected. Otherwise only the columns in the columns attribute can be selected

strat_plot_identifier = 'percentages'
static to_binary_pil(image, threshold=690)[source]

Convert an image to a binary

Parameters
  • image (PIL.Image.Image) – The RGBA image file

  • threshold (float) – If the multiplied RGB values in a cell are above the threshold, the cell is regarded as background and will be set to 0

Returns

The binary image of integer type

Return type

np.ndarray of ndim 2
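
The thresholding step can be sketched on a raw RGBA array. This is an illustration only, not the method's actual code. It assumes that the threshold is compared against the sum of the R, G and B channels (the default of 690 equals 3 × 230), which is an interpretation that the description above does not confirm; the helper name to_binary and the use of a NumPy array instead of a PIL.Image.Image are also assumptions.

```python
import numpy as np


def to_binary(rgba, threshold=690):
    """Binarize an RGBA array of shape (ny, nx, 4).

    Assumption of this sketch: the R, G and B channels are summed
    before they are compared with the threshold.
    """
    rgba = np.asarray(rgba, dtype=int)
    brightness = rgba[..., :3].sum(axis=-1)
    # bright cells are background (0), everything else is data (1)
    return np.where(brightness > threshold, 0, 1)


pixels = np.array([[
    [255, 255, 255, 255],   # white: background
    [0, 0, 0, 255],         # black: data
    [240, 240, 240, 255],   # light grey, sum 720 is above 690: background
    [200, 200, 200, 255],   # mid grey, sum 600: data
]])
binary = to_binary(pixels)
```

Lowering the threshold treats more light-coloured cells as background, which helps with scans that have a greyish paper tone.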

to_dataset(ds=None)[source]

All the necessary data as an xarray.Dataset

Parameters

ds (xarray.Dataset) – The dataset in which to insert the data. If None, a new one will be created

Returns

Either the given ds or a new xarray.Dataset instance

Return type

xarray.Dataset

static to_grey_pil(image, threshold=690)[source]

Convert an image to a greyscale image

Parameters
  • image (PIL.Image.Image) – The RGBA image file

  • threshold (float) – If the multiplied RGB values in a cell are above the threshold, the cell is regarded as background and will be set to 0

Returns

The greyscale image of integer type

Return type

np.ndarray of ndim 2

unique_bars(min_fract=None, asdict=True, *args, **kwargs)[source]

Estimate the unique bars

This method puts the overlapping bars of the different columns together

Parameters
  • min_fract (float) – The minimum fraction between 0 and 1 that two bars have to overlap such that they are considered as representing the same sample. If None, the min_fract attribute is used

  • asdict (bool) – If True, dictionaries are returned

Returns

A list of the bar locations. If asdict is True (default), each item in the returned list is a dictionary whose keys are the column indices and whose values are the indices for the corresponding column. Otherwise, a list of _Bar objects is returned

Return type

list
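
The overlap criterion behind min_fract can be sketched for two bars given as pixel-row intervals. This is a minimal illustration, not the library's implementation: the helper names are hypothetical, and relating the overlap to the length of the shorter bar is an assumption.

```python
def overlap_fraction(bar_a, bar_b):
    """Fraction of the shorter bar that is covered by the other bar.

    Bars are (start, stop) pixel-row intervals with exclusive stop. Using
    the shorter bar as the denominator is an assumption of this sketch.
    """
    (a0, a1), (b0, b1) = bar_a, bar_b
    overlap = max(0, min(a1, b1) - max(a0, b0))
    shortest = min(a1 - a0, b1 - b0)
    return overlap / shortest if shortest > 0 else 0.0


def same_sample(bar_a, bar_b, min_fract=0.9):
    """Decide whether two bars from different columns are one sample."""
    return overlap_fraction(bar_a, bar_b) >= min_fract


shifted_by_one = same_sample((10, 20), (11, 21))   # 9 of 10 rows overlap
shifted_by_five = same_sample((10, 20), (15, 25))  # 5 of 10 rows overlap
```

With the default of 0.9, bars that are shifted by a single pixel row still count as the same sample, while bars shifted by half their length do not.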

update_image(arr, amask)[source]

Update the image after having removed binary data

This method is in the remove_callbacks mapping and is called after a pixel has been removed from the binary data. It mainly just calls the reset_labels() method and updates the plot

update_rgba_image(arr, mask)[source]

Update the RGBA image from the given 3D-array

This method is in the remove_callbacks mapping and is called after a pixel has been removed from the binary data. It updates the image attribute

Parameters
  • arr (3D np.ndarray of dtype float) – The image array

  • mask (boolean mask of the same shape as arr) – The mask of features that shall be set to 0 in arr

vline_locs = None

list or floats. The indexes of vertical lines

xaxis_data = None
property xaxis_px

The x indices in column pixel coordinates that are used for x-axes translations

class straditize.binary.LineDataReader(image, ax=None, extent=None, plot=True, children=[], parent=None, magni=None, plot_background=False, binary=None)[source]

Bases: straditize.binary.DataReader

A data reader for digitizing line diagrams

This class does not behave significantly differently from the base DataReader class, but might be improved with more specific features in the future

Parameters
  • image (PIL.Image.Image) – The image of the diagram

  • ax (matplotlib.axes.Axes) – The matplotlib axes to plot on

  • extent (list) – List of four numbers specifying the extent of the image in its source. This extent will be used for the call of matplotlib.pyplot.imshow()

  • children (list of DataReader) – Child readers for other columns in case the newly created instance is the parent reader

  • parent (DataReader) – The parent reader.

  • magni (straditize.magnifier.Magnifier) – The magnifier for the given ax

  • plot_background (bool) – If True (and plot is True), a white, opaque area is plotted below the plot_im

  • binary (None) – The binary version of the given image. If not provided, the to_binary_pil() method is used with the given image

Attributes

strat_plot_identifier

strat_plot_identifier = 'default'
class straditize.binary.RoundedBarDataReader(*args, **kwargs)[source]

Bases: straditize.binary.BarDataReader

A bar data reader that can be used for rounded bars

Parameters

tolerance (int) – If x0 is the value in a pixel row y and x1 the value in the next pixel row y+1, then the two pixel rows are considered as belonging to different bars if abs(x1 - x0) > tolerance (see the get_bars() method and the tolerance attribute)

Attributes

tolerance

tolerance = 10
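
The tolerance rule from the parameter description can be sketched as follows. This is a simplified illustration rather than the get_bars() implementation: the helper name split_into_bars is hypothetical and background rows between bars are not handled.

```python
def split_into_bars(values, tolerance=10):
    """Group consecutive pixel rows into bars.

    `values` holds one x-value per pixel row. A new bar starts whenever
    the value changes by more than `tolerance` between neighbouring rows.
    """
    bars = []
    for row, value in enumerate(values):
        if bars and abs(value - values[row - 1]) <= tolerance:
            bars[-1].append(row)  # same bar as the previous row
        else:
            bars.append([row])    # jump larger than tolerance: new bar
    return bars


# three rows around 50 px, then a jump to about 80 px
bars = split_into_bars([50, 52, 49, 80, 83], tolerance=10)
```

A larger tolerance merges neighbouring bars of similar width more readily, which is useful for rounded bars whose width changes gradually from row to row.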
straditize.binary.groupby_arr(arr)[source]

Groupby a boolean array

Parameters

arr (np.ndarray of ndim 1 of dtype bool) – An array that can be converted to a numeric array

Returns

  • keys (np.ndarray) – The keys in the array

  • starts (np.ndarray) – The index of the first element that corresponds to the key in keys
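
The idea of grouping a boolean array into runs can be sketched with the standard library. This is not the implementation of groupby_arr(): the helper name group_bool is hypothetical and it returns plain lists instead of NumPy arrays.

```python
from itertools import groupby


def group_bool(arr):
    """Return the key and the start index of every run in `arr`."""
    keys, starts = [], []
    pos = 0
    for key, run in groupby(arr):
        keys.append(bool(key))
        starts.append(pos)
        pos += sum(1 for _ in run)  # advance by the length of the run
    return keys, starts


keys, starts = group_bool([False, False, True, True, True, False])
```

Each entry in keys describes one run of identical values, and the matching entry in starts gives the index where that run begins.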

straditize.binary.only_parent(func)[source]

Call the given func only from the parent reader