straditize.binary module¶
A module to read in and digitize the pollen diagram
Disclaimer
Copyright (C) 2018-2019 Philipp S. Sommer
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
Classes

A DataReader for digitizing bar pollen diagrams 

A class to read in and digitize the data files of the pollen diagram 

A data reader for digitizing line diagrams 

A bar data reader that can be used for rounded bars 
Functions

Groupby a boolean array 

Call the given func only from the parent reader 

class
straditize.binary.
BarDataReader
(*args, **kwargs)[source]¶ Bases:
straditize.binary.DataReader
A DataReader for digitizing bar pollen diagrams
Compared to the base DataReader class, this reader implements a different strategy for digitizing and finding the samples. When digitizing the full diagram, we try to find the distinct bars using the get_bars() method. These bars might have to be split manually if they are not easy to distinguish. One key element for distinguishing two adjacent bars is the specified tolerance.
The base class works for rectangular bars. If you require rounded bars, use the RoundedBarDataReader.
 Parameters
tolerance (int) – If x0 is the value in a pixel row y and x1 the value in the next pixel row y+1, then the two pixel rows are considered as belonging to different bars if abs(x1 - x0) > tolerance (see the get_bars() method and the tolerance attribute)
Methods
create_grouper
(ds, columns, *args, **kwargs)Create the grouper that plots the results
digitize
([do_split, inplace])Reimplemented to ignore the rows between the bars
find_potential_samples
(col[, min_len, …])Find the bars in the column
from_dataset
(ds, *args, **kwargs)Create a new DataReader from a xarray.Dataset
get_bars
(arr[, do_split])Find the distinct bars in an array
shift_vertical
(pixels)Shift the columns vertically.
to_dataset
([ds])All the necessary data as a
xarray.Dataset
Attributes
The minimum fraction of overlap for two bars to be considered as the
dict() -> new empty dictionary
There should not be samples at the boundaries because the first
Tolerance to distinguish bars.

create_grouper
(ds, columns, *args, **kwargs)[source]¶ Create the grouper that plots the results
 Parameters
ds (xarray.Dataset) – The dataset with the data
columns (list of int) – The numbers of the columns for which the grouper should be created
fig (matplotlib.figure.Figure) – The matplotlib figure to plot on
x0 (float) – The left boundary of the larger Bbox of the stratigraphic diagram
y0 (int) – The upper boundary of the larger Bbox of the stratigraphic diagram
width (float) – The width of the final axes between 0 and 1
height (float) – The height of the final axis between 0 and 1
ax0 (matplotlib.axes.Axes) – The larger matplotlib axes whose bounding box shall be used.
transformed (bool) – If True, y-axes and x-axes have been translated (see the px2data_x() and px2data_y() methods)
colnames (list of str) – The column names to use in the plot
**kwargs – any other keyword argument that is passed to the
psy_strat.stratplot.StratGroup.from_dataset()
method
 Returns
The grouper that visualizes the given columns in the fig
 Return type

find_potential_samples
(col, min_len=None, max_len=None, filter_func=None)[source]¶ Find the bars in the column
This method gets the bars in the given col and returns the distinct indices
 Parameters
col (int) – The column for which to find the extrema
min_len (int) – The minimum length of one extremum. If the width of the interval where we found an extremum is smaller than that, the extremum is ignored. If None, this parameter does not have an effect (i.e. min_len=1).
max_len (int) – The maximum length of one extremum. If the width of the interval where we found an extremum is greater than that, the extremum is ignored. If None, this parameter does not have an effect.
filter_func (function) – A function to filter the extrema. It must accept one argument which is a list of integers representing the indices of the extremum in a
 Returns
list of list of int of shape (N, 2) – The list of N extremum locations. Each tuple in this list represents an interval where one extremum might be located
list of list of int – The excluded extremum locations that are ignored because we could not find a change of sign in the slope.
See also
find_samples()

classmethod
from_dataset
(ds, *args, **kwargs)[source]¶ Create a new DataReader from a xarray.Dataset
 Parameters
ds (xarray.Dataset) – The dataset that has been stored with the to_dataset() method
*args, **kwargs – Any other arguments passed to the DataReader constructor
 Returns
The reader recreated from ds
 Return type

get_bars
(arr, do_split=False)[source]¶ Find the distinct bars in an array
 Parameters
arr (np.ndarray) – The array to find the bars in
do_split (bool) – If True and a bar is 1.7 times longer than the mean, it is split into two.
 Returns
list of list of ints – The list of the distinct positions of the bars
list of floats – The heights for each of the bars
list of list of ints – The indices of bars that are longer than 1.7 times the mean of the other bars and should be split. If do_split is True, they have been split already
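The 1.7 rule can be illustrated with a short sketch. It is not the straditize implementation: the helper names flag_long_bars and split_in_half are hypothetical, bars are assumed to be [start, end) row intervals, and cutting a long bar at its midpoint is an assumption, since the actual split position used by get_bars() is not documented here.

```python
def flag_long_bars(bars, factor=1.7):
    """Return the indices of bars longer than factor times the mean of the others.

    bars is a list of [start, end) row intervals (an assumed convention).
    """
    lengths = [end - start for start, end in bars]
    flagged = []
    for i, length in enumerate(lengths):
        others = lengths[:i] + lengths[i + 1:]
        if others and length > factor * (sum(others) / len(others)):
            flagged.append(i)
    return flagged


def split_in_half(bar):
    """Split one [start, end) interval at its midpoint (assumed strategy)."""
    start, end = bar
    middle = (start + end) // 2
    return [[start, middle], [middle, end]]


bars = [[0, 5], [5, 10], [10, 21], [21, 26]]
too_long = flag_long_bars(bars)            # the third bar spans 11 rows
halves = split_in_half(bars[too_long[0]])  # two bars of 5 and 6 rows
```

A flagged bar most likely represents two samples with nearly identical values, which is why such bars may need a manual check after splitting.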

max_len
= None¶

min_fract
= 0.9¶ The minimum fraction of overlap for two bars to be considered as the same sample (see unique_bars())

min_len
= None¶

nc_meta
= {'bars{reader}_bars': {'dims': ('bars{reader}_bar', 'limit'), 'long_name': 'Boundaries of bars', 'units': 'px'}, 'bars{reader}_full_data_orig': {'dims': ('ydata', 'bars{reader}_column'), 'long_name': 'Full digitized data ignoring bars', 'units': 'px'}, 'bars{reader}_max_len': {'dims': (), 'long_name': 'Maximum length of a bar'}, 'bars{reader}_min_fract': {'dims': (), 'long_name': 'Minimum fraction for overlap estimation'}, 'bars{reader}_min_len': {'dims': (), 'long_name': 'Minimum length of a bar'}, 'bars{reader}_nbars': {'dims': 'bars{reader}_column', 'long_name': 'number of bars per column'}, 'bars{reader}_nsplit': {'dims': 'bars{reader}_column', 'long_name': 'number of the splitted bars'}, 'bars{reader}_splitted': {'dims': ('bar_split', 'limit'), 'long_name': 'Boundaries of bars to split', 'units': 'px'}, 'bars{reader}_tolerance': {'dims': (), 'long_name': 'bar distinguishing tolerance'}, 'binary': {'dims': ('reader', 'ydata', 'xdata'), 'long_name': 'Binary images for data readers'}, 'col_map': {'dims': 'column', 'long_name': 'Mapping from column to reader', 'units': 'reader_index'}, 'column_ends': {'dims': 'column', 'long_name': 'Ends of the columns', 'units': 'px'}, 'column_starts': {'dims': 'column', 'long_name': 'Start of the columns', 'units': 'px'}, 'exag_col_map': {'dims': 'column', 'long_name': 'Mapping from column to exaggerated reader', 'units': 'reader_index'}, 'full_data': {'dims': ('ydata', 'column'), 'long_name': 'Full digitized data', 'units': 'px'}, 'hline': {'long_name': 'Horizontal line location', 'units': 'px'}, 'is_exaggerated': {'dims': 'reader', 'long_name': 'Exaggeration factor'}, 'occurences': {'comments': 'The locations where the only an occurence of a taxa is highlighted without value', 'dims': ('occurence', 'xy'), 'long_name': 'taxa occurences'}, 'reader': {'dims': 'reader', 'long_name': 'index of the reader'}, 'reader_cls': {'dims': 'reader', 'long_name': 'The name of the class constructor'}, 'reader_image': {'dims': ('reader', 
'ydata', 'xdata', 'rgba'), 'long_name': 'RGBA images for data readers', 'units': 'color'}, 'reader_mod': {'dims': 'reader', 'long_name': 'The module of the reader class'}, 'rough_locs': {'dims': ('sample', 'column', 'limit'), 'long_name': 'Rough locations for samples'}, 'sample': {'long_name': 'Sample location', 'units': 'px'}, 'samples': {'dims': ('sample', 'column'), 'long_name': 'Sample data', 'units': 'px'}, 'shifted': {'dims': 'column', 'long_name': 'Vertical shift per column', 'units': 'px'}, 'vline': {'long_name': 'Vertical line location', 'units': 'px'}, 'xaxis_translation': {'dims': ('reader', 'px_data', 'limit'), 'long_name': 'Pixel to data mapping for xaxis'}}¶

samples_at_boundaries
= False¶ There should not be samples at the boundaries because the first sample is in the middle of the first bar

shift_vertical
(pixels)[source]¶ Shift the columns vertically.
 Parameters
pixels (list of floats) – The y-value for each column for which to shift the values. Note that these values have to be greater than or equal to 0

to_dataset
(ds=None)[source]¶ All the necessary data as a
xarray.Dataset
 Parameters
ds (xarray.Dataset) – The dataset in which to insert the data. If None, a new one will be created
 Returns
Either the given ds or a new xarray.Dataset instance
 Return type

tolerance
= 2¶ Tolerance to distinguish bars. If x0 is the value in a pixel row y and x1 the value in the next pixel row y+1, then the two pixel rows are considered as belonging to different bars if abs(x1 - x0) > tolerance
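As an illustration of the tolerance rule, the following minimal sketch groups consecutive pixel rows into bars. It is not the straditize implementation: the helper name split_rows_into_bars and the [start, end) interval convention are assumptions made for this example.

```python
def split_rows_into_bars(values, tolerance=2):
    """Group consecutive pixel rows into bars.

    values holds one digitized x-value per pixel row. A new bar starts
    whenever two consecutive rows differ by more than tolerance.
    Returns a list of [start, end) row intervals (assumed convention).
    """
    if not values:
        return []
    bars = []
    start = 0
    for y in range(1, len(values)):
        if abs(values[y] - values[y - 1]) > tolerance:
            bars.append([start, y])
            start = y
    bars.append([start, len(values)])
    return bars


rows = [10, 11, 10, 25, 26, 25, 3, 3]
bars = split_rows_into_bars(rows, tolerance=2)
# The small jitter of one pixel stays inside a bar, the large jumps
# between rows 2/3 and rows 5/6 separate three bars.
```

A larger tolerance merges bars of similar height, a smaller one splits bars whose edges are slightly ragged in the scanned image.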

class
straditize.binary.
DataReader
(image, ax=None, extent=None, plot=True, children=[], parent=None, magni=None, plot_background=False, binary=None)[source]¶ Bases:
straditize.label_selection.LabelSelection
A class to read in and digitize the data files of the pollen diagram
The source image is stored in the image attribute, the binary array of it is stored in the binary attribute. A labeled version created by the skimage.morphology.label() function is stored in the labels attribute and can be regenerated using the reset_labels() method.
Subclasses of this class should reimplement the digitize() method that digitizes the diagram, and the find_potential_samples() method.
There is always one parent reader stored in the parent attribute. This is then the reader that is accessible through the straditize.straditizer.Straditizer.data_reader attribute and holds the references to other readers in its children attribute.
 Parameters
image (PIL.Image.Image) – The image of the diagram
ax (matplotlib.axes.Axes) – The matplotlib axes to plot on
extent (list) – List of four numbers specifying the extent of the image in its source. This extent will be used for the call of matplotlib.pyplot.imshow()
children (list of DataReader) – Child readers for other columns in case the newly created instance is the parent reader
parent (DataReader) – The parent reader.
magni (straditize.magnifier.Magnifier) – The magnifier for the given ax
plot_background (bool) – If True (and plot is True), a white, opaque area is plotted below the plot_im
binary (None) – The binary version of the given image. If not provided, the to_binary_pil() method is used with the given image
Methods
add_samples
(samples[, rough_locs])Add samples to the found ones
close
()
color_labels
([categorize])The labels of the colored array
create_exaggerations_reader
(factor[, cls])Create a new exaggerations reader for this reader
create_grouper
(ds, columns, fig, x0, y0, …)Create the grouper that plots the results
create_variable
(ds, vname, data, **kwargs)Insert the data into a variable in an
xr.Dataset
digitize
([use_sum, inplace])Digitize the binary image to create the full dataframe
digitize_exaggerated
([fraction, absolute, …])Merge the exaggerated values into the original digitized result
disable_label_selection
(*args, **kwargs)Disable the label selection
Draw the matplotlib fig and the magni figure
End the column selection and remove the artists
estimated_column_starts
([threshold])The estimated column starts as numpy.ndarray.
find_potential_samples
(col[, min_len, …])Find potential samples in an array
find_samples
([min_fract, pixel_tol])Find the samples in the diagram
Calculate how many columns have a potential sample in each pixel row
from_dataset
(ds, *args, **kwargs)Create a new DataReader from a xarray.Dataset
get_bbox_for_cols
(columns, x0, y0, width, height)Get the boundary boxes for the columns of this reader in the results
get_binary_for_col
(col)Get the binary array for a specific column
get_cross_column_features
([min_px])Get features that are contained in two or more columns
get_disconnected_parts
([fromlast, from0, …])Identify parts in the binary data that are not connected
Create a connectivity-based labeled array of the binary data
Extract the positions of the occurrences from the selection
get_parts_at_column_ends
([npixels])Identify parts in the binary data that touch the next column
get_reader_for_col
(col)Get the reader for a specific column
get_surrounding_slopes
(indices, arr)The RGBA values of the colored image
is_obstacle
(indices, arr)Check whether the found extrema is only an obstacle of the picture
mark_as_exaggerations
(mask)Mask the given array as exaggerated
merge_close_samples
(locs[, rough_locs, …])Get the binary data from all children and merge them into one array
Get the labeled binary data from all children merged into one array
new_child_for_cols
(columns, cls[, plot])Create a new child reader for specific columns
plot_background
([ax])Plot a white layer below the
plot_im
plot_color_image
([ax])Plot the colored image on a matplotlib axes
plot_full_df
([ax])Plot the lines for the digitized diagram
plot_image
([ax])Plot the binary data image on a matplotlib axes
plot_other_potential_samples
([tol, …])Plot potential samples that are not yet in the
samples
plot_potential_samples
([excluded, ax, plot_kws])Plot the ranges for potential samples
plot_results
(df[, ax, fig, transformed])Plot the reconstructed diagram
plot_sample_hlines
([ax])Plot one horizontal line per sample in the
sample_locs
plot_samples
([ax])Plot the diagram as lines reconstructed from the samples
px2data_x
(coord)Transform the pixel coordinates into data coordinates
recognize_hlines
([fraction, min_lw, max_lw, …])Recognize horizontal lines in the plot and subtract them
recognize_vlines
([fraction, min_lw, max_lw, …])Recognize vertical lines in the plot and subtract them
recognize_xaxes
([fraction, min_lw, max_lw, …])Recognize (and potentially remove) x-axes at bottom and top
recognize_yaxes
([fraction, min_lw, max_lw, …])Find (and potentially remove) y-axes in the image
remove_in_children
(arr, amask)Update the child reader images after having removed binary data
Remove all plotted artists by this reader
Reset the column starts, full_df, shifted
reset_image
(image[, binary])Reset the image for this straditizer
Reset the labels array
Reset the samples
resize_axes
(grouper, bounds)Resize the axes based on column boundaries
Set this instance as the parent reader
set_hline_locs_from_selection
([selection])Save the locations of horizontal lines
set_vline_locs_from_selection
([selection])Save the locations of vertical lines
shift_vertical
(pixels[, draw])Shift the columns vertically.
show_cross_column_features
([min_px, remove])Highlight and maybe remove cross column features
show_disconnected_parts
([fromlast, from0, …])Highlight or remove disconnected parts
show_parts_at_column_ends
([npixels, remove])Highlight or remove features that touch the column ends
show_small_parts
([n, remove])Highlight and potentially remove small features in the image
start_column_selection
([use_all])Enable the user to select columns
to_binary_pil
(image[, threshold])Convert an image to a binary
to_dataset
([ds])All the necessary data as a
xarray.Dataset
to_grey_pil
(image[, threshold])Convert an image to a greyscale image
unique_bars
([min_fract, asdict])Estimate the unique bars
update_image
(arr, amask)Update the image after having removed binary data
update_rgba_image
(arr, mask)Update the RGBA image from the given 3D-array
Attributes
The boundaries for the data columns
1D numpy array with the ends for all columns (including child reader)
1D numpy array with the starts for all columns (including child reader)
The matplotlib axes where the plot_im is plotted on
White rectangle that represents the background of the binary image.
A 2D numpy array representing the binary version of the
image
Child readers for specific columns.
The boundaries for the data columns
1D numpy array with the ends for each column of this reader
1D numpy array with the starts for each column of this reader
The indices of the columns that are handled by this reader
The reader that represents the exaggerations
The extent of the
plot_im
The matplotlib figure of the
ax
The full pandas.DataFrame of the digitized image
list of floats. The indexes of horizontal lines
PIL.Image.Image of the diagram part with mode RGBA
Exaggeration factor that is not 0 if this reader represents exaggeration
Iterate through the parent reader and its children
Built-in mutable sequence.
A connectivity-based labeled version of the binary data
the straditize.magnifier.Magnifier for the ax
White rectangle that represents the background of the binary image in the magnifier.
magnified
plot_im
The minimum fraction of overlap for two bars to be considered as the
A mapping from variable name to meta information
The reader that represents the exaggerations
The maximum label in the labels array
A set of tuples marking the position of an occurrence
A mapping from column number to an numpy array with the indices of
The value that is given to the occurrences in the measurements
Parent reader for this instance.
the matplotlib image artist
The pandas.DataFrame with rough locations for the samples.
The pandas.DataFrame with locations and values of the
a boolean flag that shall indicate if we assume that the first and last
The number of pixels the columns have been shifted
str(object='') -> str
list of floats. The indexes of vertical lines
The x indices in column pixel coordinates that are used for x-axes

add_samples
(samples, rough_locs=None)[source]¶ Add samples to the found ones
 Parameters
samples (series, 1d-array or DataFrame) –
The samples. If it is a series, we assume that the index represents the y-value of the sample and the value the x-position (see xcolumns). In case of a 1d-array, we assume that the data represents the y-values of the samples. In case of a DataFrame, we assume that the columns correspond to columns in the full_df attribute and are True where we have a sample.
Note that the y-values must be in image coordinates (see the extent attribute).
rough_locs (DataFrame) – The rough locations of the new samples (see the rough_locs attribute)
See also
samples(), rough_locs(), find_samples(), sample_locs()

property
all_column_bounds
¶ The boundaries for the data columns

property
all_column_ends
¶ 1D numpy array with the ends for all columns (including child reader)
See also
all_column_starts
The starts for all columns
all_column_bounds
The (start, end)-tuple for all of the columns
column_ends
The ends for this specific reader
reader

property
all_column_starts
¶ 1D numpy array with the starts for all columns (including child reader)
See also
all_column_ends
The ends for all columns
all_column_bounds
The (start, end)-tuple for all of the columns
column_starts
The starts for this specific reader
reader

background
= None¶ White rectangle that represents the background of the binary image. This is only plotted by the parent reader

children
= []¶ Child readers for specific columns. Is not empty if and only if the
parent
attribute is this instance

property
column_bounds
¶ The boundaries for the data columns

property
column_ends
¶ 1D numpy array with the ends for each column of this reader
See also
column_starts
The starts for each column
column_bounds
The (start, end)-tuple for each of the columns
all_column_ends
The ends for all columns, including child
reader

property
column_starts
¶ 1D numpy array with the starts for each column of this reader
See also
column_ends
The ends for each column
column_bounds
The (start, end)-tuple for each of the columns
all_column_starts
The starts for all columns, including child
reader

property
columns
¶ The indices of the columns that are handled by this reader

create_exaggerations_reader
(factor, cls=None)[source]¶ Create a new exaggerations reader for this reader
 Parameters
factor (float) – The exaggeration factor
cls (type) – The
DataReader
subclass
 Returns
The new exaggerated reader
 Return type
instance of cls

create_grouper
(ds, columns, fig, x0, y0, width, height, ax0=None, transformed=True, colnames=None, **kwargs)[source]¶ Create the grouper that plots the results
 Parameters
ds (xarray.Dataset) – The dataset with the data
columns (list of int) – The numbers of the columns for which the grouper should be created
fig (matplotlib.figure.Figure) – The matplotlib figure to plot on
x0 (float) – The left boundary of the larger Bbox of the stratigraphic diagram
y0 (int) – The upper boundary of the larger Bbox of the stratigraphic diagram
width (float) – The width of the final axes between 0 and 1
height (float) – The height of the final axis between 0 and 1
ax0 (matplotlib.axes.Axes) – The larger matplotlib axes whose bounding box shall be used.
transformed (bool) – If True, y-axes and x-axes have been translated (see the px2data_x() and px2data_y() methods)
colnames (list of str) – The column names to use in the plot
**kwargs – any other keyword argument that is passed to the
psy_strat.stratplot.StratGroup.from_dataset()
method
 Returns
The grouper that visualizes the given columns in the fig
 Return type

create_variable
(ds, vname, data, **kwargs)[source]¶ Insert the data into a variable in an
xr.Dataset
 Parameters
ds (xarray.Dataset) – The destination dataset
vname (str) – The name of the variable in the nc_meta mapping. This name might include {reader} which will then be replaced by the number of the reader in the iter_all_readers attribute
data (np.ndarray) – The numpy array to store in the variable specified by vname
**kwargs – A mapping from dimension to slicer that should be used to slice the dataset
 Returns
The resolved vname that has been used in the dataset
 Return type
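How the {reader} placeholder in a variable name resolves can be shown with plain string formatting. This is only a sketch: the helper resolve_vname is hypothetical, and whether create_variable() uses str.format internally is an assumption. The variable names are taken from the nc_meta mappings of this module.

```python
def resolve_vname(vname, reader_index):
    """Fill the optional {reader} placeholder of a variable name.

    Names without a placeholder are returned unchanged, because
    str.format ignores unused keyword arguments.
    """
    return vname.format(reader=reader_index)


bar_name = resolve_vname('bars{reader}_tolerance', 1)
plain_name = resolve_vname('full_data', 1)
print(bar_name)    # bars1_tolerance
print(plain_name)  # full_data
```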

digitize
(use_sum=False, inplace=True)[source]¶ Digitize the binary image to create the full dataframe
 Parameters
 Returns
The digitization result if inplace is False, otherwise None
 Return type
None or pandas.DataFrame

digitize_exaggerated
(fraction=0.05, absolute=8, inplace=True, return_mask=False)[source]¶ Merge the exaggerated values into the original digitized result
 Parameters
fraction (float between 0 and 1) – The fraction under which the exaggerated data should be used. Set this to 0 to ignore it.
absolute (int) – The absolute value under which the exaggerated data should be used. Set this to 0 to ignore it.
inplace (bool) – If True (default), the full_df attribute is updated. Otherwise a DataFrame is returned
return_mask (bool) – If True, a boolean 2D array is returned indicating where the exaggerations have been used
 Returns
pandas.DataFrame or None – If inplace is False, the digitized result. Otherwise, if return_mask is True, the mask where the exaggerated results have been used. Otherwise None
pandas.DataFrame, optionally – If inplace is False and return_mask is True, a pandas.DataFrame containing the boolean mask where the exaggerated results have been used. Otherwise, this is skipped

disable_label_selection
(*args, **kwargs)[source]¶ Disable the label selection
This will disconnect the pick_event and remove the selection images
 Parameters
remove (bool) – Whether to remove the selection image from the plot. If None, the
_remove
attribute is used
See also
enable_label_selection()
,remove_selected_labels()

estimated_column_starts
(threshold=None)[source]¶ The estimated column starts as numpy.ndarray.
We assume a new column at pixel column $i$ if
1. the previous pixel column $i-1$ did not contain any data ($D(i-1) = 0$)
2. the amount of data points doubled compared to $i-1$ ($D(i) \geq 2\cdot D(i-1)$)
3. the amount of data points steadily increases within the next few columns to a value twice as large as the previous column ($D(i+n) \geq 2\cdot D(i-1)$ with $n>0$ and $D(i+j) \geq D(i)$ for all $0 < j \leq n$)
Each potential column start must also be covered by a given threshold.
 Parameters
threshold (float between 0 and 1) – The fraction that has to be covered to assume a valid column start. By default, 0.1 (i.e. 10 percent)
 Returns
The starts for each column
 Return type
np.ndarray
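The first two criteria of the description can be sketched on a list of per-column pixel counts. This is a simplified illustration, not the library code: the helper estimate_column_starts is hypothetical, the criteria are treated as alternatives (an assumption), and both the third criterion (steady increase) and the threshold check are left out.

```python
def estimate_column_starts(counts):
    """Flag pixel columns that look like the start of a data column.

    counts[i] is D(i), the number of data pixels in pixel column i.
    Only two criteria are applied here:
    the previous pixel column is empty, or D(i) >= 2 * D(i-1).
    """
    starts = []
    for i, d in enumerate(counts):
        if d == 0:
            continue  # an empty pixel column cannot start a data column
        previous = counts[i - 1] if i > 0 else 0
        if previous == 0 or d >= 2 * previous:
            starts.append(i)
    return starts


# Two data columns: the second one follows a thin residue of one pixel.
counts = [0, 0, 3, 3, 3, 1, 6, 6, 0]
starts = estimate_column_starts(counts)
print(starts)  # [2, 6]
```

Applied literally, the doubling criterion also fires inside a column whose pixel count keeps growing, so a real estimate needs the additional checks described above.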

property
exaggerated_reader
¶ The reader that represents the exaggerations

find_potential_samples
(col, min_len=None, max_len=None, filter_func=None)[source]¶ Find potential samples in an array
This method finds extrema in an array and returns the indices where the extremum might be. The algorithm thereby filters out obstacles by first going over the array, making sure, that there is a change of sign in the slope in the found extremum, and if not, ignores it and flattens it out.
 Parameters
col (int) – The column for which to find the extrema
min_len (int) – The minimum length of one extremum. If the width of the interval where we found an extremum is smaller than that, the extremum is ignored. If None, this parameter does not have an effect (i.e. min_len=1).
max_len (int) – The maximum length of one extremum. If the width of the interval where we found an extremum is greater than that, the extremum is ignored. If None, this parameter does not have an effect.
filter_func (function) – A function to filter the extrema. It must accept one argument which is a list of integers representing the indices of the extremum in a
 Returns
list of list of int of shape (N, 2) – The list of N extremum locations. Each tuple in this list represents an interval where one extremum might be located
list of list of int – The excluded extremum locations that are ignored because we could not find a change of sign in the slope.
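The idea of an extremum as an interval with a change of sign in the slope can be sketched for a single column of values. The helper find_extremum_intervals is hypothetical and much simpler than the method documented here: it does not flatten obstacles, does not return the excluded locations, and uses [start, end) intervals as an assumed convention.

```python
def find_extremum_intervals(values, min_len=None, max_len=None):
    """Return [start, end) intervals of plateaus that are local extrema.

    A plateau counts as an extremum if the slope changes its sign from
    the value before the plateau to the value after it.
    """
    intervals = []
    n = len(values)
    i = 1
    while i < n - 1:
        # Extend over the plateau of equal values that starts at i.
        j = i
        while j + 1 < n and values[j + 1] == values[i]:
            j += 1
        if j + 1 < n:
            slope_in = values[i] - values[i - 1]
            slope_out = values[j + 1] - values[j]
            length = j + 1 - i
            if (slope_in * slope_out < 0
                    and (min_len is None or length >= min_len)
                    and (max_len is None or length <= max_len)):
                intervals.append([i, j + 1])
        i = j + 1
    return intervals


column = [1, 3, 3, 2, 2, 5]
intervals = find_extremum_intervals(column)
# One maximum at rows 1-2 and one minimum at rows 3-4.
```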
See also

find_samples
(min_fract=None, pixel_tol=5, *args, **kwargs)[source]¶ Find the samples in the diagram
This function finds the samples using the find_potential_samples() function. It combines the found extrema from all columns and estimates the exact location using an interpolation of the slope
 Parameters
min_fract (float) – The minimum fraction between 0 and 1 that two bars have to overlap such that they are considered as representing the same sample. If None, the min_fract attribute is used
min_len (int) – The minimum length of one extremum. If the width of the interval where we found an extremum is smaller than that, the extremum is ignored. If None, this parameter does not have an effect (i.e. min_len=1).
max_len (int) – The maximum length of one extremum. If the width of the interval where we found an extremum is greater than that, the extremum is ignored. If None, this parameter does not have an effect.
filter_func (function) – A function to filter the extrema. It must accept one argument which is a list of integers representing the indices of the extremum in a
 Returns
pandas.DataFrame – The x- and y-locations of the samples. The index is the y-location, the columns are the columns in the full_df.
pandas.DataFrame – The rough locations of the samples. The index is the y-location of the columns, the values are lists of the potential sample locations.
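To make the role of min_fract concrete, the following sketch compares two intervals from different columns. It is an illustration under assumptions: the helpers overlap_fraction and same_sample are hypothetical, intervals are [start, end) row ranges, and normalizing the overlap by the shorter interval is a choice made for this example, not necessarily what find_samples() or unique_bars() do.

```python
def overlap_fraction(bar_a, bar_b):
    """Fraction of the shorter interval that is covered by the other one."""
    overlap = min(bar_a[1], bar_b[1]) - max(bar_a[0], bar_b[0])
    if overlap <= 0:
        return 0.0
    shorter = min(bar_a[1] - bar_a[0], bar_b[1] - bar_b[0])
    return overlap / shorter


def same_sample(bar_a, bar_b, min_fract=0.9):
    """Treat two intervals as one sample if they overlap enough."""
    return overlap_fraction(bar_a, bar_b) >= min_fract


aligned = same_sample((10, 20), (11, 20))  # nearly identical rows
shifted = same_sample((10, 20), (15, 25))  # only half of the rows shared
```

With the default of 0.9, only intervals that cover almost the same pixel rows are merged into one sample.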

found_extrema_per_row
()[source]¶ Calculate how many columns have a potential sample in each pixel row
 Returns
A series with one entry per pixel row. The values are the number of columns in the diagram that have a potential sample noted in the
rough_locs
 Return type

classmethod
from_dataset
(ds, *args, **kwargs)[source]¶ Create a new DataReader from a xarray.Dataset
 Parameters
ds (xarray.Dataset) – The dataset that has been stored with the to_dataset() method
*args, **kwargs – Any other arguments passed to the DataReader constructor
 Returns
The reader recreated from ds
 Return type

property
full_df
¶ The full
pandas.DataFrame
of the digitized image

get_bbox_for_cols
(columns, x0, y0, width, height)[source]¶ Get the boundary boxes for the columns of this reader in the results plot
This method is used by the plot_results() method to get the Bbox for a psy_strat.stratplot.StratGroup grouper
 Parameters
columns (list of int) – The column numbers to use
x0 (float) – The left boundary of the larger Bbox of the stratigraphic diagram
y0 (int) – The upper boundary of the larger Bbox of the stratigraphic diagram
width (float) – The width of the final axes between 0 and 1
height (float) – The height of the final axis between 0 and 1
 Returns
The boundary box for the given columns in the matplotlib figure
 Return type
See also

get_cross_column_features
(min_px=50)[source]¶ Get features that are contained in two or more columns

get_disconnected_parts
(fromlast=5, from0=10, cross_column=False)[source]¶ Identify parts in the binary data that are not connected
 Parameters
fromlast (int) – A pixel x1 > x0 is considered as disconnected if x1 - x0 >= fromlast. If this is 0, it is ignored and only from0 is considered.
from0 (int) – A pixel is considered as disconnected if it is more than from0 pixels away from the column start. If this is 0, it is ignored and only fromlast is considered
cross_column (bool) – If False, disconnected features are only marked in the column where the disconnection has been detected. Otherwise the entire feature is marked
 Returns
The 2D boolean mask with the same shape as the binary array that is True if a data pixel is considered to be disconnected
 Return type
np.ndarray of dtype bool
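The interplay of fromlast and from0 can be sketched for a single pixel row of one column. The helper disconnected_in_row is hypothetical and rests on several assumptions: x0 is taken to be the previous data pixel in the row, both criteria must hold when both are non-zero, and pixels that directly follow a disconnected pixel inherit its status. The cross_column option is not covered.

```python
def disconnected_in_row(row, fromlast=5, from0=10):
    """Mark data pixels of one row that are detached from the column start.

    row holds truthy values for data pixels, index 0 is the column start.
    """
    mask = [False] * len(row)
    last = None  # index of the previous data pixel in this row
    for x, filled in enumerate(row):
        if not filled:
            continue
        if last is not None and x - last == 1:
            # Directly attached to the previous data pixel: same verdict.
            mask[x] = mask[last]
        else:
            gap_ok = fromlast == 0 or (last is not None
                                       and x - last >= fromlast)
            start_ok = from0 == 0 or x > from0
            mask[x] = bool((fromlast > 0 or from0 > 0)
                           and gap_ok and start_ok)
        last = x
    return mask


# Three data pixels at the column start, a gap, then two detached pixels.
row = [1, 1, 1] + [0] * 9 + [1, 1]
mask = disconnected_in_row(row)
detached = [x for x, flag in enumerate(mask) if flag]
```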

get_parts_at_column_ends
(npixels=2)[source]¶ Identify parts in the binary data that touch the next column
 Parameters
npixels (int) – If a data pixel is less than npixels away from the column end, it is considered to be at the column end and marked
 Returns
A boolean mask with the same shape as the binary data that is True where a pixel is considered to be at the column end
 Return type
np.ndarray of dtype bool

get_reader_for_col
(col)[source]¶ Get the reader for a specific column
 Parameters
col (int) – The column of interest
 Returns
Either the reader or None if no reader could be found
 Return type
DataReader or None

image
= None¶ PIL.Image.Image of the diagram part with mode RGBA

is_exaggerated
= 0¶ Exaggeration factor that is not 0 if this reader represents exaggeration plots

is_obstacle
(indices, arr)[source]¶ Check whether the found extrema is only an obstacle of the picture

label_arrs
= ['binary', 'labels', 'image_array']¶

magni
= None¶ the straditize.magnifier.Magnifier for the ax

magni_background
= None¶ White rectangle that represents the background of the binary image in the magnifier. This is only plotted by the parent reader

magni_color_plot_im
= None¶

mark_as_exaggerations
(mask)[source]¶ Mask the given array as exaggerated
 Parameters
mask (2D np.ndarray of dtype bool) – A mask with the same shape as the
binary
array that is True if a cell should be interpreted as the visualization of an exaggeration

merged_binaries
()[source]¶ Get the binary data from all children and merge them into one array
 Returns
The binary image with the same shape as the binary data
 Return type
np.ndarray of dtype int

merged_labels
()[source]¶ Get the labeled binary data from all children merged into one array
 Returns
The labeled binary image with the same shape as the label data
 Return type
np.ndarray of dtype int

min_fract
= 0.9¶ The minimum fraction of overlap for two bars to be considered as the same sample (see unique_bars())

nc_meta
= {'binary': {'dims': ('reader', 'ydata', 'xdata'), 'long_name': 'Binary images for data readers'}, 'col_map': {'dims': 'column', 'long_name': 'Mapping from column to reader', 'units': 'reader_index'}, 'column_ends': {'dims': 'column', 'long_name': 'Ends of the columns', 'units': 'px'}, 'column_starts': {'dims': 'column', 'long_name': 'Start of the columns', 'units': 'px'}, 'exag_col_map': {'dims': 'column', 'long_name': 'Mapping from column to exaggerated reader', 'units': 'reader_index'}, 'full_data': {'dims': ('ydata', 'column'), 'long_name': 'Full digitized data', 'units': 'px'}, 'hline': {'long_name': 'Horizontal line location', 'units': 'px'}, 'is_exaggerated': {'dims': 'reader', 'long_name': 'Exaggeration factor'}, 'occurences': {'comments': 'The locations where the only an occurence of a taxa is highlighted without value', 'dims': ('occurence', 'xy'), 'long_name': 'taxa occurences'}, 'reader': {'dims': 'reader', 'long_name': 'index of the reader'}, 'reader_cls': {'dims': 'reader', 'long_name': 'The name of the class constructor'}, 'reader_image': {'dims': ('reader', 'ydata', 'xdata', 'rgba'), 'long_name': 'RGBA images for data readers', 'units': 'color'}, 'reader_mod': {'dims': 'reader', 'long_name': 'The module of the reader class'}, 'rough_locs': {'dims': ('sample', 'column', 'limit'), 'long_name': 'Rough locations for samples'}, 'sample': {'long_name': 'Sample location', 'units': 'px'}, 'samples': {'dims': ('sample', 'column'), 'long_name': 'Sample data', 'units': 'px'}, 'shifted': {'dims': 'column', 'long_name': 'Vertical shift per column', 'units': 'px'}, 'vline': {'long_name': 'Vertical line location', 'units': 'px'}, 'xaxis_translation': {'dims': ('reader', 'px_data', 'limit'), 'long_name': 'Pixel to data mapping for xaxis'}}¶ A mapping from variable name to meta information

new_child_for_cols
(columns, cls, plot=True)[source]¶ Create a new child reader for specific columns
 Parameters
columns (list of int) – The columns for the new reader
cls (type) – The
DataReader
subclass
plot (bool) – Plot the binary image
 Returns
The new reader for the specified columns
 Return type
instance of cls

property
non_exaggerated_reader
¶ The reader that represents the non-exaggerated data

property
occurences
¶ A set of tuples marking the position of an occurence
An occurence, motivated by pollen diagrams, just highlights the existence at a certain point without giving the exact value. In pollen diagrams, these are usually taxa that were found but have a percentage of less than 0.5 %.
This set of tuples (x, y) contains the coordinates of the occurences. The first value in each tuple is the y-value, the second the x-value.
See also
occurences_dict
A mapping from column number to occurences

property
occurences_dict
¶ A mapping from column number to a numpy array with the indices of an occurence

occurences_value
= 9999¶ The value that is given to the occurences in the measurements

parent
= None¶ Parent reader for this instance. Might be the instance itself

plot_background
(ax=None, **kwargs)[source]¶ Plot a white layer below the
plot_im
 Parameters
ax (matplotlib.axes.Axes) – The matplotlib axes to plot on. If not given, the
ax
attribute is used
**kwargs – Any other keyword that is given to the
matplotlib.pyplot.imshow()
function

plot_color_image
(ax=None, **kwargs)[source]¶ Plot the colored
image
on a matplotlib axes
 Parameters
ax (matplotlib.axes.Axes) – The matplotlib axes to plot on. If not given, the
ax
attribute is used
**kwargs – Any other keyword that is given to the
matplotlib.pyplot.imshow()
function

plot_full_df
(ax=None, *args, **kwargs)[source]¶ Plot the lines for the digitized diagram
 Parameters
ax (matplotlib.axes.Axes) – The matplotlib axes to plot on
*args,**kwargs – Any other argument and keyword argument that is passed to the
matplotlib.pyplot.plot()
function

plot_im
= None¶ the matplotlib image artist

plot_image
(ax=None, **kwargs)[source]¶ Plot the
binary
data image on a matplotlib axes
 Parameters
ax (matplotlib.axes.Axes) – The matplotlib axes to plot on. If not given, the
ax
attribute is used and (if this is None, too) a new figure is created
**kwargs – Any other keyword that is given to the
matplotlib.pyplot.imshow()
function

plot_other_potential_samples
(tol=1, already_found=None, *args, **kwargs)[source]¶ Plot potential samples that are not yet in the
samples
attribute
 Parameters
tol (int) – The pixel tolerance for a sample. If the distance between a potential sample and all already existing samples is greater than tol, the potential sample will be plotted
already_found (np.ndarray) – The pixel rows of samples that have already been found. If not specified, the index of the
sample_locs
is used
excluded (bool) – If True, plot the excluded samples instead of the included samples (see the return values in
find_potential_samples()
)
ax (matplotlib.axes.Axes) – The matplotlib axes to plot on
plot_kws (dict) – Any other keyword argument that is passed to the
matplotlib.pyplot.plot()
function. By default, this is equal to {'marker': '+'}
min_len (int) – The minimum length of one extremum. If the width of the interval where we found an extremum is smaller than that, the extremum is ignored. If None, this parameter does not have an effect (i.e.
min_len=1
).
max_len (int) – The maximum length of one extremum. If the width of the interval where we found an extremum is greater than that, the extremum is ignored. If None, this parameter does not have an effect.
filter_func (function) – A function to filter the extrema. It must accept one argument which is a list of integers representing the indices of the extremum in a

plot_potential_samples
(excluded=False, ax=None, plot_kws={}, *args, **kwargs)[source]¶ Plot the ranges for potential samples
This method plots the rough locations of potential samples (see
find_potential_samples()
).
 Parameters
excluded (bool) – If True, plot the excluded samples instead of the included samples (see the return values in
find_potential_samples()
)
ax (matplotlib.axes.Axes) – The matplotlib axes to plot on
plot_kws (dict) – Any other keyword argument that is passed to the
matplotlib.pyplot.plot()
function. By default, this is equal to {'marker': '+'}
min_len (int) – The minimum length of one extremum. If the width of the interval where we found an extremum is smaller than that, the extremum is ignored. If None, this parameter does not have an effect (i.e.
min_len=1
).
max_len (int) – The maximum length of one extremum. If the width of the interval where we found an extremum is greater than that, the extremum is ignored. If None, this parameter does not have an effect.
filter_func (function) – A function to filter the extrema. It must accept one argument which is a list of integers representing the indices of the extremum in a

plot_results
(df, ax=None, fig=None, transformed=True)[source]¶ Plot the reconstructed diagram
This method plots the reconstructed diagram using the psy-strat module.
 Parameters
df (pandas.DataFrame) – The data to plot. E.g. the
sample_locs
or the
straditize.straditizer.Straditizer.final_df
data
ax (matplotlib.axes.Axes) – The axes to plot on. If None, a new one is created inside the given fig
fig (matplotlib.figure.Figure) – The matplotlib figure to plot on. If not given, the current figure (see
matplotlib.pyplot.gcf()
) is used
transformed (bool) – If True, y-axes and x-axes have been translated (see the
px2data_x()
and
px2data_y()
methods)
 Returns
psyplot.project.Project – The newly created psyplot project with the plotters
list of
psy_strat.stratplot.StratGroup
instances – The groupers for the different columns

plot_sample_hlines
(ax=None, **kwargs)[source]¶ Plot one horizontal line per sample in the
sample_locs
 Parameters
ax (matplotlib.axes.Axes) – The matplotlib axes to plot on
*args,**kwargs – Any other keyword argument that is passed to the
matplotlib.pyplot.hlines()
function

plot_samples
(ax=None, *args, **kwargs)[source]¶ Plot the diagram as lines reconstructed from the samples
 Parameters
ax (matplotlib.axes.Axes) – The matplotlib axes to plot on
*args,**kwargs – Any other argument and keyword argument that is passed to the
matplotlib.pyplot.plot()
function

px2data_x
(coord)[source]¶ Transform the pixel coordinates into data coordinates
 Parameters
coord (1D np.ndarray) – The coordinate values in pixels
 Returns
The numpy array starting from 0 with transformed coordinates
 Return type
np.ndarray
Notes
Since the x-axes for stratigraphic plots are usually interrupted, the return values here are relative and therefore always start from 0

recognize_hlines
(fraction=0.3, min_lw=1, max_lw=None, remove=False, **kwargs)[source]¶ Recognize horizontal lines in the plot and subtract them
This method removes horizontal lines in the data diagram, i.e. rows whose non-background cells cover at least the specified fraction of the row.
 Parameters
fraction (float) – The fraction (between 0 and 1) that has to be covered to recognize a horizontal line
min_lw (int) – The minimum line width for a line
max_lw (int) – The maximum line width for a line or None if it should be ignored
remove (bool) – If True, they will be removed immediately, otherwise they are displayed using the
enable_label_selection()
method and can be removed through the
remove_selected_labels()
method
 Other Parameters
**kwargs – Additional keywords are passed to the
enable_label_selection()
method in case remove is False
Notes
This method has to be called before the
digitize()
method!
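The row criterion described above can be sketched with plain NumPy. This is only an illustration under simplifying assumptions: the binary image is a 2D array in which 0 marks the background, and the line-width limits (min_lw, max_lw) as well as the interactive selection are ignored. The function name find_hline_rows is hypothetical and not part of straditize.

```python
import numpy as np


def find_hline_rows(binary, fraction=0.3):
    """Return the indices of rows whose non-background share is >= fraction.

    Hypothetical helper: ``binary`` is a 2D array where 0 is background.
    """
    binary = np.asarray(binary)
    # Share of non-background cells per pixel row
    covered = (binary > 0).mean(axis=1)
    return np.where(covered >= fraction)[0]


binary = np.zeros((5, 10), dtype=int)
binary[2, :] = 1    # a line that spans the full width
binary[4, :2] = 1   # a short feature covering 20 % of the row
rows = find_hline_rows(binary)
print(rows)
```

In this toy image only the full-width row passes the default fraction of 0.3; the short feature in the last row is left untouched.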

recognize_vlines
(fraction=0.3, min_lw=1, max_lw=None, remove=False, **kwargs)[source]¶ Recognize vertical lines in the plot and subtract them
This method removes vertical lines in the data diagram, i.e. columns whose non-background cells cover at least the specified fraction of the column.
 Parameters
fraction (float) – The fraction (between 0 and 1) that has to be covered to recognize a vertical line
min_lw (int) – The minimum line width for a line
max_lw (int) – The maximum line width for a line or None if it should be ignored
remove (bool) – If True, they will be removed immediately, otherwise they are displayed using the
enable_label_selection()
method and can be removed through the
remove_selected_labels()
method
 Other Parameters
**kwargs – Additional keywords are passed to the
enable_label_selection()
method in case remove is False
Notes
This method should be called before the column starts are set

recognize_xaxes
(fraction=0.3, min_lw=1, max_lw=None, remove=False, **kwargs)[source]¶ Recognize (and potentially remove) x-axes at bottom and top
 Parameters
fraction (float) – The fraction (between 0 and 1) that has to be covered to recognize an x-axis
min_lw (int) – The minimum line width of an axis
max_lw (int) – The maximum line width of an axis. If not specified, it will be ignored
remove (bool) – If True, they will be removed immediately, otherwise they are displayed using the
enable_label_selection()
method and can be removed through the
remove_selected_labels()
method

recognize_yaxes
(fraction=0.3, min_lw=0, max_lw=None, remove=False)[source]¶ Find (and potentially remove) y-axes in the image
 Parameters
fraction (float) – The fraction (between 0 and 1) that has to be covered to recognize a y-axis
min_lw (int) – The minimum line width of an axis
max_lw (int) – The maximum line width of an axis. If not specified, the median of the axes widths is taken
remove (bool) – If True, they will be removed immediately, otherwise they are displayed using the
enable_label_selection()
method and can be removed through the
remove_selected_labels()
method

remove_in_children
(arr, amask)[source]¶ Update the child reader images after having removed binary data
Calls the
update_image()
and
update_rgba_image()
methods for all
children

reset_column_starts
()[source]¶ Reset the column starts,
full_df
,
shifted
and
occurences

reset_image
(image, binary=False)[source]¶ Reset the image for this straditizer
 Parameters
image (PIL.Image.Image) – The new image
binary (bool) – If True, then the image is considered as the binary image and the
image
attribute is not touched

resize_axes
(grouper, bounds)[source]¶ Resize the axes based on column boundaries
This method sets the x-limits for the different columns to the given bounds and resizes the axes
 Parameters
grouper (psy_strat.stratplot.StratGroup) – The grouper that manages the plot
bounds (np.ndarray of shape (N, 2)) – The boundaries for the columns handled by the grouper

property
rough_locs
¶ The pandas.DataFrame with rough locations for the samples. It has one row per sample in the sample_locs dataframe and ncols * 2 columns, where ncols is the number of columns in sample_locs.
If the potential sample at sample_locs.iloc[i, col] ranges from j to k (see the find_potential_samples() method), the cell at rough_locs.iloc[i, col * 2] specifies the first y-pixel (j) and rough_locs.iloc[i, col * 2 + 1] the last y-pixel (+1), i.e. k, where this sample might be located
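The column layout described above (two entries per data column) can be illustrated with a small sketch. It uses a plain NumPy array instead of a pandas.DataFrame to stay dependency-light; the helpers set_range and get_range and the placeholder value -1 for "no range stored" are assumptions made for this example only.

```python
import numpy as np

ncols = 3      # number of data columns in the diagram
nsamples = 2   # number of samples (rows)

# One row per sample and two entries per column:
# [first y-pixel, last y-pixel + 1] for column 0, then column 1, ...
rough = np.full((nsamples, ncols * 2), -1)


def set_range(rough, i, col, j, k):
    """Store the rough range j..k of sample i in column col."""
    rough[i, col * 2] = j
    rough[i, col * 2 + 1] = k


def get_range(rough, i, col):
    """Return the (first, last + 1) y-pixels of sample i in column col."""
    return int(rough[i, col * 2]), int(rough[i, col * 2 + 1])


# Sample 0 in column 1 may be located anywhere in the pixel rows 40..44
set_range(rough, 0, 1, 40, 45)
print(get_range(rough, 0, 1))
```

The indexing mirrors the iloc[i, col * 2] and iloc[i, col * 2 + 1] access pattern of the real dataframe.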

property
sample_locs
¶ The
pandas.DataFrame
with locations and values of the samples

samples_at_boundaries
= True¶ a boolean flag that shall indicate if we assume that the first and last rows shall be a sample if they contain nonzero values

set_hline_locs_from_selection
(selection=None)[source]¶ Save the locations of horizontal lines
This method takes every pixel row in the
hline_locs
attribute where at least 30% is selected. The digitize method will interpolate at these indices.

set_vline_locs_from_selection
(selection=None)[source]¶ Save the locations of vertical lines
This method takes every pixel column in the
vline_locs
attribute where at least 30% is selected.

shifted
= None¶ The number of pixels the columns have been shifted

show_cross_column_features
(min_px=50, remove=False, **kwargs)[source]¶ Highlight and maybe remove cross column features
 Parameters
min_px (int) – The number of pixels that have to be contained in each column
remove (bool) – If True, remove the data in the
binary
array, etc. If False, the
enable_label_selection()
method is invoked and the user can select the features to remove
select_all (bool) – If True and remove is False, all labels in arr will be selected and the given selection is ignored
selection (np.ndarray of dtype bool) – A boolean mask with the same shape as arr that is True where a pixel should be selected. If remove is True, only this mask will be used.
img (matplotlib image) – The image for the selection. If not provided, a new image is created
set_picker (bool) – If True, connect the matplotlib pick_event to the
pick_label()
method

show_disconnected_parts
(fromlast=5, from0=10, remove=False, **kwargs)[source]¶ Highlight or remove disconnected parts
 Parameters
%(DataReader.get_disconnected_parts.parameters.fromlastfrom0)s –
%(DataReader._show_parts2remove.parameters.no_arr)s –

show_parts_at_column_ends
(npixels=2, remove=False, **kwargs)[source]¶ Highlight or remove features that touch the column ends
 Parameters
%(DataReader.get_parts_at_column_ends.parameters)s –
%(DataReader._show_parts2remove.parameters.no_arr)s –

show_small_parts
(n=10, remove=False, **kwargs)[source]¶ Highlight and potentially remove small features in the image
 Parameters
n (int) – The maximal size of a feature to be considered as small
remove (bool) – If True, remove the data in the
binary
array, etc. If False, the
enable_label_selection()
method is invoked and the user can select the features to remove
select_all (bool) – If True and remove is False, all labels in arr will be selected and the given selection is ignored
selection (np.ndarray of dtype bool) – A boolean mask with the same shape as arr that is True where a pixel should be selected. If remove is True, only this mask will be used.
img (matplotlib image) – The image for the selection. If not provided, a new image is created
set_picker (bool) – If True, connect the matplotlib pick_event to the
pick_label()
method
See also
skimage.morphology.remove_small_objects()

strat_plot_identifier
= 'percentages'¶

static
to_binary_pil
(image, threshold=690)[source]¶ Convert an image to a binary
 Parameters
image (PIL.Image.Image) – The RGBA image file
threshold (float) – If the multiplied RGB values in a cell are above the threshold, the cell is regarded as background and will be set to 0
 Returns
The binary image of integer type
 Return type
np.ndarray of ndim 2

to_dataset
(ds=None)[source]¶ All the necessary data as a
xarray.Dataset
 Parameters
ds (xarray.Dataset) – The dataset in which to insert the data. If None, a new one will be created
 Returns
Either the given ds or a new
xarray.Dataset
instance
 Return type
xarray.Dataset

static
to_grey_pil
(image, threshold=690)[source]¶ Convert an image to a greyscale image
 Parameters
image (PIL.Image.Image) – The RGBA image file
threshold (float) – If the multiplied RGB values in a cell are above the threshold, the cell is regarded as background and will be set to 0
 Returns
The greyscale image of integer type
 Return type
np.ndarray of ndim 2

unique_bars
(min_fract=None, asdict=True, *args, **kwargs)[source]¶ Estimate the unique bars
This method puts the overlapping bars of the different columns together
 Parameters
 Returns
A list of the bar locations. If asdict is True (default), each item in the returned list is a dictionary whose keys are the column indices and whose values are the indices for the corresponding column. Otherwise, a list of
_Bar
objects is returned
 Return type
list

update_image
(arr, amask)[source]¶ Update the image after having removed binary data
This method is in the
remove_callbacks
mapping and is called after a pixel has been removed from the
binary
data. It mainly just calls the
reset_labels()
method and updates the plot

update_rgba_image
(arr, mask)[source]¶ Update the RGBA image from the given 3D array
This method is in the
remove_callbacks
mapping and is called after a pixel has been removed from the
binary
data. It updates the
image
attribute
 Parameters
arr (3D np.ndarray of dtype float) – The image array
mask (boolean mask of the same shape as arr) – The mask of features that shall be set to 0 in arr

xaxis_data
= None¶

property
xaxis_px
¶ The x indices in column pixel coordinates that are used for xaxes translations

class
straditize.binary.
LineDataReader
(image, ax=None, extent=None, plot=True, children=[], parent=None, magni=None, plot_background=False, binary=None)[source]¶ Bases:
straditize.binary.DataReader
A data reader for digitizing line diagrams
This class does not have a significantly different behaviour than the base
DataReader
class, but might be improved with more specific features in the future
 Parameters
image (PIL.Image.Image) – The image of the diagram
ax (matplotlib.axes.Axes) – The matplotlib axes to plot on
extent (list) – List of four numbers specifying the extent of the image in its source. This extent will be used for the call of
matplotlib.pyplot.imshow()
children (list of
DataReader
) – Child readers for other columns in case the newly created instance is the parent reader
parent (
DataReader
) – The parent reader.
magni (straditize.magnifier.Magnifier) – The magnifier for the given ax
plot_background (bool) – If True (and plot is True), a white, opaque area is plotted below the
plot_im
binary (None) – The binary version of the given image. If not provided, the
to_binary_pil()
method is used with the given image
Attributes
str(object=’’) > str

strat_plot_identifier
= 'default'¶

class
straditize.binary.
RoundedBarDataReader
(*args, **kwargs)[source]¶ Bases:
straditize.binary.BarDataReader
A bar data reader that can be used for rounded bars
 Parameters
tolerance (int) – If x0 is the value in a pixel row y and x1 the value in the next pixel row y+1, then the two pixel rows are considered as belonging to different bars if
abs(x1 - x0) > tolerance
(see the
get_bars()
method and the
tolerance
attribute)
Attributes
int([x]) > integer

tolerance
= 10¶
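The tolerance rule from the parameter description can be sketched as follows. This is a simplified illustration, not the get_bars() implementation: it assumes a 1D sequence with one digitized value per pixel row and starts a new bar whenever abs(x1 - x0) > tolerance between consecutive rows. The function name split_rows_into_bars is hypothetical.

```python
import numpy as np


def split_rows_into_bars(values, tolerance=10):
    """Group consecutive pixel rows into bars (hypothetical helper).

    A new bar starts at row y+1 whenever the value x1 in that row differs
    from the value x0 in row y by more than ``tolerance``. Returns a list
    of (start, end) row ranges with an exclusive end.
    """
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return []
    # Indices of the first row of every new bar
    breaks = np.where(np.abs(np.diff(values)) > tolerance)[0] + 1
    starts = np.concatenate([[0], breaks])
    ends = np.concatenate([breaks, [values.size]])
    return [(int(s), int(e)) for s, e in zip(starts, ends)]


# Three bars with slightly noisy widths
bars = split_rows_into_bars([50, 52, 51, 80, 79, 20, 22], tolerance=10)
print(bars)
```

A larger tolerance makes the split less sensitive to gradual changes in width, which is presumably why a reader for rounded bars would use a more generous value than one for strictly rectangular bars.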

straditize.binary.
groupby_arr
(arr)[source]¶ Groupby a boolean array
 Parameters
arr (np.ndarray of ndim 1 of dtype bool) – An array that can be converted to a numeric array
 Returns
keys (np.ndarray) – The keys in the array
starts (np.ndarray) – The index of the first element that corresponds to the key in keys