Chapter 1 On Modelling

1.1 Characteristics of Mathematical Models

1.1.1 General

Formal fishery stock assessments are generally based upon mathematical models of population production processes and of the population dynamics of the stock being fished. The positive production processes include the recruitment dynamics (which add numbers) and individual growth (which adds biomass), while the negative production processes include natural mortality and fishing mortality (which includes selectivity), both of which reduce numbers and hence biomass. The population dynamics include details such as the time-steps used in the modelled dynamics, whether biomass or numbers are modelled (at age or at size, or both), details of spatial structuring, and other specifics of the case in hand. Such a plethora of potential details means there is a vast array of potential mathematical models of biological populations and processes. Nevertheless, it is still possible to make some general statements regarding such models.

All models constitute an abstraction or simulation by the modeller of what is currently known about the process or phenomenon being modelled. Mathematical models are only a subset of the class of all models, and models may take many forms, ranging from a physical representation of whatever is being modelled (think of ball and stick models of DNA, as produced by Watson and Crick, 1953), through diagrammatic models (such as a geographical map), to the more abstract mathematical representations being discussed here. We can impose some sort of conceptual order on this diversity by focussing on different properties of the models and on some of the constraints imposed by decisions made by the modellers.

1.1.2 Model Design or Selection

As abstractions, models are never perfect copies of what is known about the modelled subject, so there must be some degree of selection of what the modeller considers to be a system’s essential properties or components. This notion of “essential properties or components” assumes that not all parts of a system are equally important. For example, in a model of the human blood circulation system a superficial vein somewhere in the skin would not be as important as the renal artery. If that assumption is accepted, then a fundamental idea behind modelling is to select the properties to be included so that the behaviour of the model may be expected to closely approximate the observable behaviour of the modelled system. This selection of what are considered to be the important properties of a system permits, or even forces, the modeller to emphasize particular aspects of the system being modelled. A road map shows roads greatly magnified relative to true geographical scale because that is the point of the map. A topological map emphasizes different things, so the purpose for which a model is to be used is also important when determining what structure to use.

The selection of what aspects of a system to include in a model is what determines whether a model will be generally applicable to a class of systems, or is so specialized that it is attempting to simulate the detailed behaviour of a particular system (for system one might read a fished stock or population). However, by selecting particular parts of a natural system the model is also being constrained in what it can describe. The assumption is that, despite not being complete, the model will provide an adequate description of the process of interest and that those aspects not included will not unexpectedly distort its representation of the whole (Haddon, 1980).

Of course, in order to make an abstraction one first needs to understand the whole, but unfortunately, in the real world, there is a great deal that remains unknown or misunderstood. Hence it is quite possible that a model becomes what is known as “misspecified”. This is where the model’s dynamics or behaviour fail to capture the full dynamics of the system being studied. In some of the examples illustrated later in this book we will see the average predicted biomass trajectory for a stock fail to account for what appear to be oscillations in stock size that exhibit an approximately 10-year cycle (see, for example, the surplus production model fitted to the dataspm data-set in the Bootstrap Confidence Intervals section of the chapter on Surplus Production Models). In such a case there is an influence (or influences) acting with an unknown mechanism on the stock size in what appears to be a patterned or repeatable manner. Assuming the pattern is meaningful, then because the mechanism behind it is not included in the model structure, the model obviously cannot account for its influence. This is classical misspecification, although not all misspecifications are so clear or have such clear patterns.

Model design, or model selection, is complex because the decisions made when putting a model together will depend on what is already known and the use to which the model is to be put.

1.1.3 Constraints due to the Model Type

A model can be physical, verbal, graphical, or mathematical; however, the particular form chosen for a model imposes limits on what it can describe. For example, a verbal description of a dynamic population process would be a challenge for anyone, as invariably there is a limit to how well one can capture or express the dynamic properties of a population using words. Words appear to be better suited to the description of static objects. This limitation is not necessarily due to any lack of language skills on the part of the speaker. Rather, it is because spoken languages (at least those of which I am aware) do not seem well designed for describing dynamic processes, especially where more than one variable or aspect of a system is changing through time or relative to other variables. Happily, we can consider mathematics to be an alternative language that provides excellent ways of describing dynamic systems. But even with mathematics as the basis of our descriptions there are many decisions that need to be made.

1.1.4 Mathematical Models

There are many types of mathematical models. They can be characterized as descriptive, explanatory, realistic, idealistic, general, or particular; they can also be deterministic, stochastic, continuous, or discrete. Sometimes they can be combinations of some or all of these things. With all these possibilities, there is a great potential for confusion over exactly what role mathematical models can play in scientific investigations. To gain a better understanding of the potential limitations of particular models, we will attempt to explain the meaning of some of these terms.

Mathematical population models are termed dynamic because they can represent the present state of a population/fishery in terms of its past state or states, with the potential to describe future states. For example, the Schaefer model (Schaefer, 1957) of stock biomass dynamics (of which we will be hearing more) can be partly represented as:

\[\begin{equation} B_{t+1} = B_t + rB_t \left(1 - \frac{B_t}{K} \right) - C_t \tag{1.1} \end{equation}\]

where the variable \(C_t\) is the catch taken during time \(t\), and \(B_t\) is the stock biomass at the start of time \(t\) (\(B_t\) is also an output of the model). The model parameters are \(r\), representing the population growth rate of biomass (or numbers, depending on what interpretation is given to \(B_t\), perhaps = \(N_t\)), and \(K\), the maximum biomass (or numbers) that the system can attain (these parameters come from the logistic model from early mathematical ecology; see the Simple Population Models chapter). By examining this relatively simple model one can see that expected biomass levels at one time (\(t+1\)) are directly related to catches and the earlier biomass (time = \(t\); the values are serially correlated). The influence of the earlier biomass on population growth is controlled by the combination of the two parameters \(r\) and \(K\). By accounting for the serial correlations between variables from time period to time period, such dynamic state models differ markedly from traditional statistical analyses. Serial correlation means that if we were to sample a population each year then, strictly, the samples would not be independent, which is a requirement of more classical statistical analyses. For example, in a closed population the number of two-year-old fish in one year cannot be greater than the number of one-year-old fish the year before; they are not independent.
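
The recursive nature of Equ(1.1) is easily demonstrated in R. The following is a minimal sketch only; the function name, the parameter values, and the catch series are all invented for illustration, not taken from any real fishery.

```r
# A minimal sketch of Equ(1.1); schaefer(), the parameter values, and
# the catch series are all hypothetical, invented for illustration.
schaefer <- function(r, K, Binit, Ct) {
  Bt <- numeric(length(Ct) + 1)
  Bt[1] <- Binit                 # biomass at the start of the first year
  for (t in 1:length(Ct)) {      # each B[t+1] is built from B[t]
    Bt[t+1] <- Bt[t] + r * Bt[t] * (1 - Bt[t]/K) - Ct[t]
  }
  return(Bt)
}
catches <- c(150, 250, 400, 500, 400, 300, 250, 200)  # hypothetical Ct
round(schaefer(r=0.25, K=10000, Binit=8000, Ct=catches), 1)
```

Each biomass value depends directly upon its predecessor, which is precisely the serial correlation referred to above.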

1.1.5 Parameters and Variables

At the most primitive level, mathematical models are made up of variables and parameters. A model’s variables must represent something definable or measurable in nature (at least in principle). Parameters modify the impact or contribution of a variable to the model’s outputs, or are concerned with the relationships between the variables within the model. Parameters are the things that dictate quantitatively how the variables interact. They differ from a model’s variables because the parameters are the things estimated when a model is fitted to observed data. In Equ(1.1), \(B_t\) and \(C_t\) are the variables and \(r\) and \(K\) are the parameters. There can be overlap as, for example, one might estimate the very first value in the \(B_t\) series, perhaps \(B_{init}\), and hence the series would be made up of one parameter with the rest a direct function of the \(B_{init}\), \(r\), and \(K\) parameters and the \(C_t\) variable.

In any model, such as Equ(1.1), we must either estimate or provide constant values for the parameters. With the variables, either one provides observed values for them (e.g., a time series of catches, \(C_t\)) or they are an output from the model (with the exception of \(B_{init}\) as described above). Thus, in Equ(1.1), given a time series of observed catches plus estimates of parameter values for \(B_{init}\), \(r\), and \(K\), a time series of biomass values, \(B_t\), is implied by the model as an output. As long as one is aware of the possibilities for confusion that can arise over the terms observe, estimate, variable, parameter, and model output, one can be clearer about exactly what one is doing while modelling a particular phenomenon. The relation between theory and model structure is not necessarily simple. Background knowledge and theory may drive the selection of a model’s structure. The relationships proposed between a set of variables may constitute a novel hypothesis or theory about the organization of nature, or simply a summary of what is currently known, ready to be modified as more is learnt.

A different facet of model misspecification derives from the fact that the parameters controlling the dynamics of a population are often assumed to be constant through time, which should generally be acknowledged to be an approximation. If the population growth rate \(r\) or carrying capacity \(K\) varied randomly through time, and yet were assumed to be constant, this would be an example of what is known as process error. Such process error would add to the observable variation in samples from the population, even if they could be collected without error (without measurement error). If the parameters varied in response to some factor external to the population (an environmental factor or biological factor such as a predator or competitor) then such non-random responses have the potential to lead to an improved understanding of the natural world. So, an important aspect of the decisions made when constructing a model is to be explicit about the assumptions being made about the chosen structure.

1.2 Mathematical Model Properties

1.2.1 Deterministic vs Stochastic

We can define a model parameter as a quantitative property (of the system being modelled) that is assumed either to remain constant over the period for which data are available, or to be modulated by environmental variation. Roughly speaking, models in which the parameters remain constant on the timescale of the model’s application are referred to as deterministic. Because of its constant parameters, a deterministic model will always give the same outputs for the same inputs. Because the relationships between the model variables are fixed (constant parameters), the output from a given input is “determined” by the structure of the model. One should not be confused by situations where parameters in a deterministic model are altered sequentially by taking one of an indexed set of predetermined values (e.g., a recruitment index or catchability index may alter and be changed on a yearly basis). In such a case, although the estimated parameters are expected to change through time, they are doing so in a repeatable, deterministic fashion (constant over a longer timescale), and the major property that a given input will always give the same output still holds.

Deterministic models contrast with stochastic models in which at least one of the parameters varies in a random or unpredictable fashion over the time period covered by the model. Thus, given a set of input values, the associated output values will be uncertain. The parameters that vary will take on a random value from a predetermined probability distribution (either from one of the classical probability density functions (PDFs) or from a custom distribution). Thus, for example, when simulating a fish stock, each year the recruitment level may attain a mean value plus or minus a random amount determined by the nature of a random variate, Equ(1.2).

\[\begin{equation} R_{y} = \bar{R} e^{N(0,\sigma^{2}_{R})-\sigma^{2}_{R}/2} \tag{1.2} \end{equation}\]

where \(R_y\) is the recruitment in year \(y\), \(\bar{R}\) is the average recruitment across years (which may itself be a function of stock size), and \(N \left(0, \sigma^{2}_{R} \right)\) is the notation used for a random variable whose values are described, in this example, by a Normal distribution with mean = zero (i.e., taking both positive and negative values) and variance \(\sigma^{2}_{R}\). Including the Normal random variate within an exponential term designates Log-Normal variation, and \(-\sigma^{2}_{R}/2\) is a bias correction term for Log-Normal errors within recruitment time series (Haltuch et al., 2008).
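
Generating such deviates in R is straightforward. The following sketch uses arbitrary illustrative values for \(\bar{R}\) and \(\sigma_R\) to show the effect of the bias correction term.

```r
# A sketch of Equ(1.2): bias-corrected Log-Normal recruitment deviates;
# Rbar and sigR are arbitrary values chosen purely for illustration.
set.seed(1234)                  # make the example repeatable
Rbar <- 1000                    # average recruitment
sigR <- 0.6                     # st. dev. of the Normal on the log-scale
devs <- rnorm(1000, mean=0, sd=sigR)     # N(0, sigR^2)
Ry <- Rbar * exp(devs - (sigR^2)/2)      # Equ(1.2)
mean(Ry)                 # close to Rbar, thanks to the bias correction
mean(Rbar * exp(devs))   # omit the correction and the mean is inflated
```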

A simulation model differs from a model whose parameters are estimated by fitting it to data. The objectives of these two types of model are also different: the former might be used to explore the implications of different management scenarios, while the latter might be used to estimate the current state of depletion of a stock.

Given a set of input data (assumed to be complete and accurate; watch out for those assumptions), a deterministic model expresses all of its possible responses. However, stochastic models form the basis of so-called Monte Carlo simulations where the model is run repeatedly with the same input data, but for each run new random values are produced for the stochastic parameters, as with Equ(1.2). For each run a different output is produced, and these are tabulated or graphed to see what range of outcomes could be expected from such a system. Even if the variation intrinsic to a model is normally distributed, it does not imply that a particular output can be expected to be normally distributed about some mean value. If there are nonlinear aspects in the model, skew and other changes may arise.
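
As a minimal sketch of such a Monte Carlo analysis, the following projects the Schaefer dynamics of Equ(1.1) with Log-Normal process error applied to the production term; every value used is invented for illustration.

```r
# A sketch of a Monte Carlo projection: Equ(1.1) with Log-Normal process
# error on the production term; all values are illustrative only.
set.seed(5678)
nrep <- 1000;  nyr <- 20
r <- 0.25;  K <- 10000;  B0 <- 4000;  Ct <- 300  # constant annual catch
sigma <- 0.15
finalB <- numeric(nrep)
for (i in 1:nrep) {              # each replicate redraws its errors
  Bt <- B0
  for (t in 1:nyr) {
    eps <- exp(rnorm(1, 0, sigma) - (sigma^2)/2)  # as in Equ(1.2)
    Bt <- max(Bt + r * Bt * (1 - Bt/K) * eps - Ct, 1)
  }
  finalB[i] <- Bt
}
quantile(finalB, c(0.05, 0.5, 0.95))   # spread of plausible outcomes
hist(finalB, breaks=30, main="", xlab="Biomass after 20 years")
```

Note that even though the deviates are Normal on the log-scale, the nonlinear dynamics mean the distribution of final biomasses need not be symmetric.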

Future population projections, risk assessments, and determining the impact of uncertainty in one’s data all require the use of Monte Carlo modelling. Simulation testing of model structures is a very powerful tool. Details of running such projections are given in the chapters On Uncertainty and Surplus Production Models.

1.2.2 Continuous vs Discrete Models

Early fishery modellers used continuous differential equations to design their models, so the time steps in the models were all infinitesimal (Beverton and Holt, 1957). At that time computers were still very much in their infancy and analytical solutions were the culture of the day. Early fishery models were thus formed using differential calculus, and parts of their structures were determined more by what could be solved analytically than because they reflected nature in a particularly accurate manner. At the same time, the application of these models reflected or assumed equilibrium conditions. Fortunately, we can now simulate a population using easily available computers and software, and we can use more realistic, or more detailed, formulations. While it may not be possible to solve such models analytically (i.e., to deduce that if the model formulation has a given structure then its solution must take a particular form), they can usually be solved numerically (informed and improving trial and error). Although both approaches are still used, one big change in fisheries science has been a move away from continuous differential equations toward difference equations, which attempt to model a system as it changes through discrete intervals (ranging from very short to yearly time steps).
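
The contrast is easily illustrated with the logistic model (see the Simple Population Models chapter). The following sketch, using invented parameter values, compares the analytical solution of the continuous logistic with the numerical stepping of its discrete-time counterpart.

```r
# A sketch contrasting the continuous logistic (analytical solution)
# with its discrete difference-equation counterpart; the parameter
# values are invented for illustration.
r <- 0.5;  K <- 1000;  N0 <- 50;  yrs <- 0:25
# continuous: the closed-form analytical solution
Ncont <- K / (1 + ((K - N0)/N0) * exp(-r * yrs))
# discrete: step the difference equation one year at a time
Ndisc <- numeric(length(yrs));  Ndisc[1] <- N0
for (t in 1:(length(yrs) - 1)) {
  Ndisc[t+1] <- Ndisc[t] + r * Ndisc[t] * (1 - Ndisc[t]/K)
}
plot(yrs, Ncont, type="l", xlab="Years", ylab="Numbers")
lines(yrs, Ndisc, lty=2)      # similar, but not identical, trajectories
```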

There are other aspects of model building that can limit what behaviours can be captured or described by a model. The actual structure or form of a model imposes limits. For example, if a mathematical modeller uses difference equations to describe a system, the resolution of events cannot be finer than the time intervals with which the model is structured. This obvious effect occurs in many places. For example, in models that include a seasonal component the resolution is quite clearly limited depending on whether the available data are for weeks, months, or some other interval. Thus, in the Static Models chapter we fit a seasonal growth curve using data collected at mostly weekly intervals; obviously, if the data had been collected yearly, then describing seasonal growth would be impossible.

1.2.3 Descriptive vs Explanatory

Whether a model is discrete or continuous, and deterministic or stochastic, is a matter of model structure and clearly influences what can be modelled. The purpose for which a model is to be used is also important. For a model to be descriptive it only needs to mimic the empirical behaviour of the observed data. A fine fit to individual growth data, for example, may usually be obtained by using polynomial equations:

\[\begin{equation} y = a + bx + cx^2 + dx^3 + \cdots + mx^n \tag{1.3} \end{equation}\]

in which no attempt is made to interpret the various parameters used (usually one would never use a polynomial greater than order six, with order two or three being much more common). Such descriptive models can be regarded as black boxes, which provide a deterministic output for a given input. It is not necessary to know the workings of such models; one could even use a simple look-up table that produced a particular output value from a given input value by literally looking up the output from a cross-tabulation of values. Such black box models would be descriptive and nothing else. Even though empirical descriptive models may make assumptions, if a particular case fails to meet those assumptions this does not mean the model need be rejected completely, merely that one must be cautious concerning which systems to apply it to. Such purely descriptive models need not have elements of realism about them except for the variables being described, although it is common that their parameters can be given interpretations (such as the maximum size achievable). But again, what matters is how well such models describe the available data, not whether the values attributed to their parameters make biological sense. In the Model Parameter Estimation chapter we will be examining an array of three growth curves, including the famous von Bertalanffy curve. That section will enable a deeper discussion of the use of such descriptive models.
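
As a brief sketch of Equ(1.3) in use, the following fits a cubic to invented length-at-age data; the fitted coefficients are given no biological interpretation, which is precisely what makes the model purely descriptive.

```r
# A sketch of Equ(1.3) as a purely descriptive model: a cubic fitted to
# invented length-at-age data; no parameter is given an interpretation.
set.seed(42)
age <- 1:15
len <- 180 * (1 - exp(-0.3 * age)) + rnorm(15, 0, 6)   # fake data
poly3 <- lm(len ~ poly(age, 3))     # third-order polynomial, Equ(1.3)
plot(age, len, xlab="Age (years)", ylab="Length (cm)")
lines(age, fitted(poly3))           # a fine fit, but a black box
```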

Explanatory models also provide a description of the empirical observations of interest, but in addition they attempt to provide some justification or explanation, a mechanism, for why the particular observations noted occurred instead of a different set. With explanatory models it is necessary to take into account the assumptions and parameters, as well as the variables that make up the model. By attempting to make the parameters and variables, and how the variables interact, reflect nature, explanatory models attempt to simulate real events in nature. A model is explanatory if it contains theoretical constructs (assumptions, variables, or parameters), which purport to relate to the processes of nature and not only to how nature behaves.

1.2.4 Testing Explanatory Models

Explanatory models are, at least partly, hypotheses or theories about the mechanisms and structure of nature and how it operates. They should thus be testable against observations from nature. But how do we test explanatory models? Can fitting a model to data provide a test of the model? If the expected values for the observed data, predicted by a model, account for a large proportion of the variability within the observed data, then our confidence that the model adequately describes the observations can be great. But the initial model fitting does not constitute a direct test of the structure of the model. A good fit to a model does not test whether the model explains observed data; it only tests how well the model describes and is consistent with the data (Haddon, 1980). The distinction between explanation and description is very important. A purely descriptive or empirical model could provide just as good a fit to the data, which hopefully makes it clear that we need further, independent observations against which to really test the model’s structure. What requires testing is not only whether a model can fit a set of observed data (i.e., not only the quality of fit), but also whether the model assumptions are valid and whether the interactions between model variables, as encoded in one’s model, closely reflect nature.

Comparing the now fitted model with new observations does constitute a test of sorts. Ideally, given particular inputs, the model would provide a predicted observation along with confidence intervals around the expected result. An observation would be said to be inconsistent with the model if the model predicted that its value was highly unlikely given the inputs. But with this test, if there is a refutation, there is no indication of what aspect of the model was at fault. This is because it is not a test of the model’s structure but merely a test of whether the particular parameter values are adequate (given the model structure) to predict future outcomes! We do not know whether the fitting procedure was limited because the data available did not express the full potential for variation inherent in the population under study. Was it the assumptions or the particular manner in which the modeller has made the variables interact that was at fault? Was the model too simple, meaning were important interactions or variables left out of the structure? We cannot tell without independent tests of the assumptions or of the importance of particular variables.

If novel observations are in accord with the model, then one has gained little. In practice, it is likely that the new data would then be included with the original and the parameters re-estimated. But the same could be said about a purely empirical model. What are needed are independent tests that the structure chosen does not leave out important sources of variation; to test this requires more than a simple comparison of expected outputs with real observations.

While we can be content with the quality of fit between our observed data and those predicted from a model, we can never be sure that the model we settle on is the best possible. It is certainly the case that some models can appear less acceptable because alternative models may fit the data more effectively.

However, any discussion over which curve or model best represents a set of data depends not only upon the quality of fit, but also upon other information concerning the form of the relationship between the variables. An empirical model with a parameter for every data point could fit a data-set exactly but would not provide any useful information. Clearly, in such cases, criteria other than just quality of numerical fit must be used to determine which model should be preferred. In the Static Models chapter, we consider methods for Objective Model Selection, which attempt to assess whether increasing the number of parameters in a model is statistically justifiable. Any explanatory model must be biologically plausible. It might be possible to ascribe meaning even to the parameters of a completely arbitrary model structure. However, such interpretations would be ad hoc and only superficially plausible. There would be no expectation that the model would do more than describe a particular set of data. An explanatory model should be applicable to a new data set, although perhaps with a new set of particular parameters to suit the new circumstances.
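
As a foretaste of those Objective Model Selection methods, the following sketch compares polynomial orders of Equ(1.3) using Akaike's Information Criterion (AIC), which penalizes additional parameters; the length-at-age data are again invented for illustration.

```r
# A sketch of objective model selection: comparing polynomial orders of
# Equ(1.3) via AIC; the data are invented purely for illustration.
set.seed(42)
age <- 1:15
len <- 180 * (1 - exp(-0.3 * age)) + rnorm(15, 0, 6)
fits <- lapply(1:6, function(k) lm(len ~ poly(age, k)))
round(sapply(fits, AIC), 2)   # smallest AIC preferred; extra
                              # parameters must earn their keep
```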

Precision may not be possible even in a realistic model because of intrinsic uncertainty either in our estimates of the fitted variables (observation error) or in the system’s responses, perhaps to environmental variation (process error in the model’s parameters). In other words, it may not be possible to go beyond certain limits with the precision of our predicted system outcomes (the quality of fit may have intrinsic limits).

1.2.5 Realism vs Generality

Related to the problem of whether or not we should work with explanatory models is the problem of realism within models. Purely descriptive models need have nothing realistic about them. But it is an assumption that if one is developing an explanatory model, then at least parts of it have to be realistic. For example, in populations where ages or sizes can be distinguished, age- or size-structured models would be considered more realistic than a model that lumped all age or size categories into one. But a model can be a combination of realistic and empirical components.

For a model to be general, it would have a very broad domain of applicability, that is, it could be applied validly in many circumstances. There have been many instances in the development of fisheries science where a number of models describing a particular process (e.g., individual growth) have been subsumed into a more general mathematical model of which they are special cases (see Static Models chapter). Usually this involves increasing the number of parameters involved, but nevertheless, these new models are clearly more mathematically general. It is difficult to draw conclusions over whether such more general equations/models are less realistic. That would be a matter of whether the extra parameters can be realistically interpreted or whether they are simply ad hoc solutions to combining disparate equations into one that is more mathematically general. With more complex phenomena, such as age-structured models, general models do not normally give as accurate predictions as more specialized models tuned to a particular situation. It is because of this that modellers often consider mathematically general models to be less realistic when dealing with particular circumstances (Maynard-Smith, 1974).

1.2.6 When is a Model a Theory

All models may be considered to have theoretical components, even supposedly empirical models. It becomes a matter of perception more than model structure. With simple models, for example, the underlying assumptions can begin to take on the weight of hypothetical assertions. Thus, if one were using the logistic equation to describe the growth of a population, it imports the assumption that density-dependent compensation of the population growth rate is linearly related to population density. In other words, the negative impact on population growth of increases in population size is linearly related to population size (see the Simple Population Models chapter). This can be regarded either as a domain assumption (that is, the model can only apply validly to situations where density-dependent effects are linearly related to population density) or as a theory (nonlinear density-dependent effects are unimportant in the system being modelled). It is clearly a matter of perception or modelling objective as to which of these two possibilities obtains. This is a good reason why one should be explicit concerning the interpretation of one’s model’s assumptions.
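
That linearity assumption is easily visualized. In the following sketch (illustrative values only), the per-capita growth rate implied by the logistic, \(r(1 - N/K)\), declines as a straight line with increasing population size.

```r
# A sketch of the logistic's domain assumption: per-capita population
# growth declines linearly with density; values are illustrative only.
r <- 0.8;  K <- 1000
N <- seq(10, 1000, 10)
percap <- r * (1 - N/K)         # (1/N)dN/dt under the logistic
plot(N, percap, type="l", xlab="Population size N",
     ylab="Per-capita growth rate")  # a straight line, by assumption
```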

If one were to restrict oneself purely to empirical relationships, the only way in which one’s models could improve would be to increase the amount of variance in the observations accounted for by the model. There would be no valid expectation that an empirical model would provide insights into the future behaviour of a system. An advantage of explanatory/theoretical models is that it should be possible to test the assumptions, the relationships between variables, and the error structures, independently from the quality of fit to observed outcomes.

It should, therefore, be possible to present evidence in support of a model that goes beyond the quality of fit. Those models whose proposed structure is not supported in this way may as well be empirical.

1.3 Concluding Remarks

Writing and talking about models, their use, and their construction is sometimes valuable in providing a reminder of the framework within which we work. A theoretical understanding of the strengths and weaknesses of mathematical models will always have value if you are to become a modeller. However, often the best way to understand models and their properties is to actually use them, exploring their behaviour by manipulating their parameters and examining how they operate in practice. Hopefully you will find that using R as a programming language makes such explorations relatively simple to implement.

The material to follow includes very general methods and others that are more specific. An objective of the book is to encourage you, and perhaps provide you with a beginning, to develop your own analytical functions, perhaps by modifying some from this book. You might do that so that your own analyses become quicker, easier, and to some extent automated, leaving you more time to think about and interpret your findings. The less time you need to spend mechanically conducting analyses, the more time there is for thinking about your scientific problems and exploring further than you might if you were using other, less programmatic analyses. A major advantage of using R to implement your modelling is that any work you do should become much more easily repeatable, and thus, presumably, more defensible. Of course, the range of subjects covered here only scratches the surface of what is available, but it tries to explore some of the fundamental methods, such as maximum likelihood estimation. Remember there are an enormous number of R packages available, and these may assist you in implementing your own models, be they statistical or dynamic.