What should I pay attention to when reviewing a Pull Request?

What should I pay attention to when reviewing a Pull Request?

Short introduction to Design by Contract

·

7 min read

Well, think about how you code. For sure you don't read the implementation of the libraries that you pip-installed. Instead, you usually look for the appropriate methods that may solve your task, read their documentation to understand if they indeed solve your needs and eventually use them.

Then the first question your PR review must answer is:

Are features in this PR well documented, so I can understand and use them without reading the implementation?

This article covers the basics of Desing by Contract with an example. We don't even get to see the actual implementation. We'll leave it for another article.

Here's a typical example of a Pull Request

Consider a small wind turbine at the rooftop that every minute collects the following data in CSV format:

time,wind_speed[m/s],wind_direction
2023-10-05T15:00:00Z,3.22,north
2023-10-05T15:01:00Z,1.00,north-east
2023-10-05T15:02:00Z,2.38,north-east
2023-10-05T15:03:00Z,6.00,east
2023-10-05T15:04:00Z,4.61,north-east
2023-10-05T15:05:00Z,2.11,north
2023-10-05T15:06:00Z,5.16,north
2023-10-05T15:07:00Z,2.66,north
2023-10-05T15:08:00Z,3.50,north-east
2023-10-05T15:09:00Z,4.33,north-east
...

The Pull Request implements a function resample to aggregates measurements every 5 minutes.

def resample(dataframe, frequency):
    """Resample measurements of a wind turbine."""
    # Some lines of code that group rows by groups of 5 minutes.
    # For every group, it computes the mean wind_speed and the most
    # appropriate wind direction representative.

    # The code uses pandas, which is great for data science, but
    # usually requires endless hours involving reading the 
    # documentation of many pandas methods and line-by-line 
    # debugging to understand what the teammate implemented.

The code has everything we ask for. The implementation works, it has a docstring, and the PR includes unit tests and integration tests.

Does this mean automatic approval?

Well, not really. Whoever wants to use this function does not know what it actually does nor how to use it. The user has to either understand the implementation to understand if it solves the tasks or use it blindly and hope everything just works fine (I can assure you that it won't).

This function is a clockmaking bomb ready to explode in production, causing undefined behaviour and lots of headaches to find out where the error is.

Let's address issues one by one.

Proper naming is crucial

This function cannot be used with any dataframe. The implementation expects a dataframe with time, wind speed and wind direction measurements. Let's make it clear.

def resample_wind_turbine_measurements(dataframe, frequency):
    ...

Add type hints

Be strict with type hints. Even if you don't use mypy (you should!) type hints help communicating your intent.

import pandas as pd

def resample_wind_turbine_measurements(
    df: pd.DataFrame,
    frequency: str,
) -> pd.DataFrame:
    ...

Now type hints communicate that the function returns a new dataframe, thus input dataframe is not modified. This is precious information to the caller!

Describe function arguments

Still, there is no clue about the contents of df nor allowed values of frequency.

We'll use the pandas docstring guide. Feel free to use any convention you want as long as you communicate effectively.

import pandas as pd

def resample_wind_turbine_measurements(
    df: pd.DataFrame,
    frequency: str,
) -> pd.DataFrame:
    """
    Resample measurements of a wind turbine.

    Parameters
    ----------
    df : pd.DataFrame
        Pandas DataFrame with wind turbine measurements. ``df`` index must be
        a ``pd.DatetimeIndex`` with timezone-aware (preferably UTC) timestamps.
        ``df`` columns must include wind_speed (float) and wind_direction
        (string). The dataframe may contain ``nan`` values.
    frequency : str
        String frequency in pandas format. Allowed values
        https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html

    Returns
    -------
    pd.DataFrame
        A dataframe having measurements with the desired frequency. ``df``
        index is a ``pd.DatetimeIndex`` with the same timezone as input's.
        Columns are wind_speed and wind_direction. Extra columns are dropped.
        The DataFrame may contain ``nan`` values. Values are rounded to three 
        decimals. If ``df`` is empty, the function returns an empty dataframe 
        as well.
    """

This already looks much better. The caller now does not need to read the implementation to understand the structure of df . The caller also understands the structure of the returned dataframe.

Notice the link to all possible values accepted in frequency. The caller will instantly love you.

Describe the implementation of the function

Still, it is not clear what this function actually does. Does resample mean downsample, upsample or both? downsampling usually aggregates data, whereas upsampling usually interpolates data (which is always risky, so the caller must know if this may happen).

Also, the docstring indicates how measurements are grouped. It is really important to communicate this convention to the caller.

def downsample_wind_turbine_measurements(  # Update method name
    df: pd.DataFrame,
    frequency: str,
) -> pd.DataFrame:
    """
    Resample measurements of a wind turbine.

    This method only implements downsampling. This means that the value of
    ``frequency`` must be greater than the frequency of the measurements of
    ``df``. Measurements in ``df`` may have non-homogeneous frequency.

    Measurements are grouped by frequency. Group time intervals are left-closed
    and right-opened.

    Parameters
    ----------
    <no changes here>
    """

Add doctests

Doctests are really helpful for users to understand how to use the function and that serve as tests too.

def downsample_wind_turbine_measurements(  # Update method name
    df: pd.DataFrame,
    frequency: str,
) -> pd.DataFrame:
    """
    <no changes here>

    Returns
    -------
    <no changes here>

    Examples
    --------
    >>> measurements = pd.DataFrame(
    ...     [
    ...         [3.22, 'north'],
    ...         [1.00, 'north-east'],
    ...         [2.38, 'north-east'],
    ...         [6.00, 'east'],
    ...         [4.61, 'north-east'],
    ...         [2.11, 'north'],
    ...         [5.16, 'north'],
    ...         [2.66, 'north'],
    ...         [3.50, 'north-east'],
    ...         [4.33, 'north-east'],
    ...     ],
    ...     columns=['wind_speed', 'wind_direction'],
    ...     index=Index(
    ...         [
    ...             datetime(2023, 10, 5, 15, minute, tzinfo=timezone.utc),
    ...             for minute in range(10)
    ...         ],
    ...         name='time',
    ...     ),
    ... )
    >>> resampled = downsample_wind_turbine_measurements(measurements, '5min')
    >>> print(resampled.to_csv())
    time,wind_speed,wind_direction
    2023-10-05 15:00:00+00:00,3.442,'north-east'
    2023-10-05 15:05:00+00:00,3.552,'north'
    <BLANKLINE>
    """

What about the actual implementation?

We still haven't reviewed the implementation! We'll leave this for another article.

What we've done so far is defining a contract between the caller and the implementer. The caller is responsible for making sure that input values comply with the requested conditions. The implementer is responsible for delivering what it promises under the requested conditions.

The caller can understand and use this method without the implementation. And more importantly, the caller understands when not to use this method.

The implementer now has a clear contract too, which helps defining with unittests.

Finally (and this is important), the implementor is not responsible of any unappropriate use of this function. If the caller breaks the contract (if input arguments don't comply with the conditions) the function can behave unexpectedly.

You can learn more about Desing by Contract.

Wrap up

With this tips you'll help your teammates incredibly. It shows you care about them. Teammates will appreciate the effort you put in documentation and they will be able to implement their tasks quicker and happier, specially with tight deadlines.

Here's the entire function definition.

def downsample_wind_turbine_measurements(  # Update method name
    df: pd.DataFrame,
    frequency: str,
) -> pd.DataFrame:
    """
    Resample measurements of a wind turbine.

    This method only implements downsampling. This means that the value of
    ``frequency`` must be greater than the frequency of the measurements of
    ``df``. Measurements in ``df`` may have non-homogeneous frequency.

    Measurements are grouped by frequency. Group time intervals are left-closed
    and right-opened.

    Parameters
    ----------
    df : pd.DataFrame
        Pandas DataFrame with wind turbine measurements. ``df`` index must be
        a ``pd.DatetimeIndex`` with timezone-aware (preferably UTC) timestamps.
        ``df`` columns must include wind_speed (float) and wind_direction
        (string). The dataframe may contain ``nan`` values.
    frequency : str
        String frequency in pandas format. Allowed values
        https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html

    Returns
    -------
    pd.DataFrame
        A dataframe having measurements with the desired frequency. ``df`` 
        index is a ``pd.DatetimeIndex`` with the same timezone as input's.
        Columns are wind_speed and wind_direction. Extra colums are dropped.
        The DataFrame may contain ``nan`` values. If ``df`` is empty, the function returns an empty dataframe 
        as well.

    Examples
    --------
    >>> measurements = pd.DataFrame(
    ...     [
    ...         [3.22, 'north'],
    ...         [1.00, 'north-east'],
    ...         [2.38, 'north-east'],
    ...         [6.00, 'east'],
    ...         [4.61, 'north-east'],
    ...         [2.11, 'north'],
    ...         [5.16, 'north'],
    ...         [2.66, 'north'],
    ...         [3.50, 'north-east'],
    ...         [4.33, 'north-east'],
    ...     ],
    ...     columns=['wind_speed', 'wind_direction'],
    ...     index=Index(
    ...         [
    ...             datetime(2023, 10, 5, 15, minute, tzinfo=timezone.utc),
    ...             for minute in range(10)
    ...         ],
    ...         name='time',
    ...     ),
    ... )
    >>> resampled = downsample_wind_turbine_measurements(measurements, '5min')
    >>> print(resampled.to_csv())
    time,wind_speed,wind_direction
    2023-10-05 15:00:00+00:00,3.442,'north-east'
    2023-10-05 15:05:00+00:00,3.552,'north'
    <BLANKLINE>
    """
    # Implementation here