From 0e2dddf5e90ae32e8521ec4733ae98e97809282f Mon Sep 17 00:00:00 2001
From: Jerome Kelleher <jk@well.ox.ac.uk>
Date: Thu, 28 Jul 2022 20:31:23 +0100
Subject: [PATCH] Split VCF docs into separate examples section

Closes  #2328

Apply suggestions from code review

Co-authored-by: Ben Jeffery <ben.jeffery@gmail.com>
---
 docs/_toc.yml            |   1 +
 docs/build.sh            |   2 +-
 docs/export.md           | 399 +++++++++++++++++++++++++++++++++++++++
 python/CHANGELOG.rst     |   4 +
 python/tests/test_vcf.py |  17 ++
 python/tskit/trees.py    | 176 ++++++-----------
 6 files changed, 475 insertions(+), 124 deletions(-)
 create mode 100644 docs/export.md

diff --git a/docs/_toc.yml b/docs/_toc.yml
index 22d12bec27..120fa1d542 100644
--- a/docs/_toc.yml
+++ b/docs/_toc.yml
@@ -16,6 +16,7 @@ parts:
   - file: stats
   - file: topological-analysis
   - file: ibd
+  - file: export
 - caption: Interfaces
   chapters:
   - file: python-api
diff --git a/docs/build.sh b/docs/build.sh
index 558f64fd42..d454c14173 100755
--- a/docs/build.sh
+++ b/docs/build.sh
@@ -4,7 +4,7 @@
 # saved reports, which makes it difficult to debug the reasons for 
 # build failures in CI. This is a simple wrapper to handle that.
 
-REPORTDIR=${BUILDDIR}/html/reports
+REPORTDIR=_build/html/reports
 
 jupyter-book build -Wn --keep-going .
 RETVAL=$?
diff --git a/docs/export.md b/docs/export.md
new file mode 100644
index 0000000000..5616dee055
--- /dev/null
+++ b/docs/export.md
@@ -0,0 +1,399 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.12
+    jupytext_version: 1.9.1
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+---
+
+```{currentmodule} tskit
+```
+
+```{code-cell}
+:tags: [hide-input]
+
+from IPython.display import display
+```
+
+(sec_export)=
+
+# Data export
+
+(sec_export_vcf)=
+## Variant Call Format
+
+Tskit supports exporting data to the standard
+[Variant Call Format](http://samtools.github.io/hts-specs/VCFv4.3.pdf)
+via the `tskit vcf` {ref}`command line interface<sec_cli>` command
+and the {meth}`TreeSequence.write_vcf` method in the {ref}`sec_python_api`.
+Conversion is quite efficient, with tskit producing VCF data at several
+hundred megabytes per second (for large files), which is usually as fast as
+it can be written to storage or consumed by programs in a pipeline.
+
+::::{tip}
+If we have a tree sequence file the
+{ref}`command line interface<sec_cli>` is often the most
+convenient way to convert to VCF:
+
+:::{code-block} bash
+$ tskit vcf example.trees > example.vcf
+:::
+
+See the {ref}`sec_export_vcf_compression` section for information
+on how to compress the VCF output.
+::::
+
+For tree sequences produced by recent versions of programs such as
+``msprime``, ``SLiM``, ``fwdpy11`` or ``tsinfer``, VCF output will
+"do the right thing" and no further arguments are needed.
+For example, here we simulate 3 diploid individuals
+with mutations using ``msprime``, and convert to VCF.
+
+```{code-cell}
+import sys
+import msprime
+ts = msprime.sim_ancestry(
+    samples=3, ploidy=2, sequence_length=10, random_seed=2)
+ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
+ts.write_vcf(sys.stdout)
+```
+
+In the output VCF we have 3 diploid samples
+(see the {ref}`sec_export_vcf_terminology` section)
+corresponding to samples specified in the ancestry simulation
+with IDs ``tsk_0``, ``tsk_1`` and ``tsk_2``
+(see the {ref}`sec_export_vcf_individual_names`
+section for how to override these default labels).
+We then have a line for every row
+in the {ref}`site table<sec_site_table_definition>`, and
+the data is derived directly from the {meth}`TreeSequence.variants`
+method; e.g.
+
+```{code-cell}
+for var in ts.variants():
+    print(var.site.position, var.site.id, var.alleles, var.genotypes, sep="\t")
+```
+
+We can see the ``POS`` value is equal to the site's position
+(see the {ref}`sec_export_vcf_modifying_coordinates` for information
+on how we deal with continuous coordinates), the ``ID`` value
+is the site's ID, and the ``REF`` and ``ALT`` values
+are derived from the variant's ``alleles``.
+
+The ``GT`` values for the three diploid individuals are derived from the
+variant's genotypes (see the {ref}`sec_export_vcf_terminology` section).
+For this simulation, the diploid individuals correspond to
+adjacent sample nodes in order, and we can see there is a direct
+correspondence between the phased ``GT`` values and variant's genotypes.
+See the {ref}`sec_export_vcf_constructing_gt` section for
+more information on how this done in general and for options
+to control the VCF sample and ``GT`` values.
+
+::::{important}
+In these examples we write the VCF data to ``sys.stdout`` so that we can see
+the output. Usually, however, you'd write to a file:
+
+:::{code-block}
+with open("output.vcf", "w") as vcf_file:
+    ts.write_vcf(vcf_file)
+:::
+
+:::{seealso}
+See the {ref}`sec_export_vcf_compression` section for information
+on how to compress the output or convert to BCF.
+:::
+
+::::
+
+(sec_export_vcf_terminology)=
+
+### Terminology
+
+There are some mismatches between the terminology for tskit and VCF.
+In VCF a "sample" is a multiploid individual, but in tskit a sample
+refers to a single **node** (monoploid genome), and an individual
+consists of one or more nodes (e.g., two nodes for a diploid).
+Similarly, in VCF a "genotype" refers to the observed allelic state
+for a sample **individual** at a particular site,
+whereas in tskit a genotype is the observed allelic state
+for a **node** (see {attr}`.Variant.genotypes`).
+
+:::{seealso}
+See the {ref}`sec_glossary` for more details on tskit's data model
+ and terminology.
+:::
+
+(sec_export_vcf_compression)=
+
+### Compressed output
+
+The simplest way to compress the VCF output is to use the
+`tskit vcf` {ref}`command line interface<sec_cli>`
+and pipe the output to `bgzip`:
+
+:::{code-block} bash
+$ tskit vcf example.trees | bgzip -c > example.vcf.gz
+:::
+A general way to convert VCF data to various formats is to pipe the text
+produced by ``tskit`` into ``bcftools`` using the command
+line interface:
+
+:::{code-block} bash
+$ tskit vcf example.trees | bcftools view -O b > example.bcf
+:::
+
+If you need more control over the form of the output (or want to work
+directly in Python), the following recipe has the same effect:
+
+:::{code-block}
+
+import os
+import subprocess
+
+read_fd, write_fd = os.pipe()
+write_pipe = os.fdopen(write_fd, "w")
+with open("output.bcf", "w") as bcf_file:
+    proc = subprocess.Popen(
+        ["bcftools", "view", "-O", "b"], stdin=read_fd, stdout=bcf_file
+    )
+    ts.write_vcf(write_pipe)
+    write_pipe.close()
+    os.close(read_fd)
+    proc.wait()
+    if proc.returncode != 0:
+        raise RuntimeError("bcftools failed with status:", proc.returncode)
+:::
+
+
+The VCF output can also be compressed using the {mod}`gzip` Python module:
+
+:::{code-block}
+
+import gzip
+
+with gzip.open("output.vcf.gz", "wt") as f:
+    ts.write_vcf(f)
+:::
+
+However, this gzipped VCF won't be fully compatible with downstream tools
+such as tabix, which usually require the VCF to use the specialised bgzip format.
+
+(sec_export_vcf_masking_output)=
+
+### Masking output
+
+The {meth}`TreeSequence.write_vcf` method provides the
+``site_mask`` and ``sample_mask`` arguments to
+omit or mark parts of the output as missing.
+
+```{code-cell}
+ts = msprime.sim_ancestry(
+    samples=3, ploidy=2, sequence_length=10, random_seed=2)
+ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
+ts.tables.sites
+```
+
+The ``sample_mask`` argument provides a general way to mask out
+parts of the output, which can be helpful when simulating missing
+data. In this (contrived) example, we create a sample mask function
+that marks one genotype missing in each variant in a regular
+pattern:
+
+:::{code-block}
+
+def sample_mask(variant):
+    sample_mask = np.zeros(ts.num_samples, dtype=bool)
+    sample_mask[variant.site.id % ts.num_samples] = 1
+    return sample_mask
+
+
+ts.write_vcf(sys.stdout, sample_mask=sample_mask)
+:::
+
+(sec_export_vcf_constructing_gt)=
+
+### Constructing GT values
+
+The core elements of the tskit
+{ref}`data model<sec_data_model>`
+are {ref}`nodes<sec_node_table_definition>`,
+{ref}`edges<sec_node_table_definition>`,
+{ref}`sites<sec_site_table_definition>` and
+{ref}`mutations<sec_mutation_table_definition>`.
+These four tables allow us to completely describe the
+genetic ancestry of a set of sampled monoploid
+genomes and their genetic variation.
+The {ref}`individual table<sec_individual_table_definition>`
+defines a set of individual *organisms*, and it can
+be used to define the inheritance relationships between
+then (the pedigree). An individual may be associated
+with one or more nodes, and these nodes may or
+may not be samples (see the {ref}`sec_glossary`
+for clarification of these terms).
+Thus, there is some complexity in how the per-individual GT values
+are generated, which we explain in this section.
+
+#### Without individuals
+
+We start with an example in which there are no individuals
+defined (which was the default in msprime before version 1.0):
+
+```{code-cell}
+import tskit
+tables = tskit.Tree.generate_balanced(4, span=10).tree_sequence.dump_tables()
+tables.sites.add_row(3, ancestral_state="A")
+tables.mutations.add_row(site=0, node=0, derived_state="T")
+ts = tables.tree_sequence()
+display(ts.draw_svg())
+display(ts)
+ts.write_vcf(sys.stdout)
+```
+
+Here we define a tree sequence consisting of a single tree, which
+has a variant site at position 3 and a mutation over node 0.
+There is no information about individuals in this tree sequence,
+and so we assume that each of the nodes corresponds to a single
+haploid individual.
+
+Users of msprime simulations would often be interested in producing
+VCFs for diploid organisms. Because of the assumptions made
+by these simulations, this means arbitrarily combining the sample
+nodes into pairs. This is what the ``ploidy`` option does:
+
+```{code-cell}
+ts.write_vcf(sys.stdout, ploidy=2)
+```
+
+Thus, the ``GT`` values for the (synthetic) diploid individual ``tsk_0``
+are generated by combining nodes 0 and 1, and ``tsk_1``
+by combining nodes 2 and 3.
+
+:::{important}
+Software packages modelling multiploid individuals are encouraged to
+use the individual table to make their assumptions explicit. Recent
+versions of simulators and inference methods should all do this,
+and so the ``ploidy`` argument is really only intended to support
+legacy code. It is therefore an error to supply a value for ``ploidy``
+when individual information is present in a tree sequence.
+:::
+
+#### With individuals
+
+Extending the example in the previous section, we add some individual data
+defining a pair of diploid sibs and their parents.
+
+:::{note}
+We set the nodes for (e.g.) individual 2 to [1, 3] here to illustrate
+that nodes for a given individual are not necessarily contiguous.
+:::
+
+```{code-cell}
+tables.individuals.add_row(parents=[-1, -1])
+tables.individuals.add_row(parents=[-1, -1])
+tables.individuals.add_row(parents=[0, 1])
+tables.individuals.add_row(parents=[0, 1])
+node_individual = tables.nodes.individual
+node_individual[[1, 3]] = 2
+node_individual[[0, 2]] = 3
+tables.nodes.individual = node_individual
+display(tables.individuals)
+display(tables.nodes)
+ts = tables.tree_sequence()
+ts.write_vcf(sys.stdout)
+```
+
+In this model we have four individuals defined, but only
+individuals 2 and 3 are associated with nodes (more specifically,
+**sample** nodes). Thus, we output **two** VCF sample individuals
+composed of the linked nodes.
+
+:::{note}
+Note that the labels are ``tsk_0`` and ``tsk_1`` even though
+the individual IDs are 2 and 3. See the
+{ref}`sec_export_vcf_individual_names` section for how to change the
+these default labels.
+:::
+
+(sec_export_vcf_individual_names)=
+
+### Individual names
+
+By default the VCF samples are given the labels ``tsk_0``, ``tsk_1``,
+..., ``tsk_{N - 1}``, where ``N`` is the number of individuals to
+be output (see the {ref}`sec_export_vcf_constructing_gt` section).
+
+We can change this default labelling using the ``individual_names``
+argument::
+
+```{code-cell}
+import sys
+import msprime
+ts = msprime.sim_ancestry(
+    samples=3, ploidy=2, sequence_length=10, random_seed=2)
+ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
+ts.write_vcf(sys.stdout, individual_names=["A", "B", "C"])
+```
+
+#### Exporting to plink
+
+The default VCF sample IDs produced by ``tskit`` do not work well
+with plink because it parses the individual
+IDs based on a particular format, and does not allow ``0`` as a valid
+identifier. We get an error like this:
+
+```
+Error: Sample ID ends with "_0", which induces an invalid IID of '0`.
+```
+
+This can be fixed by using the ``individual_names`` argument
+to set the names to anything where the first name doesn't end with ``_0``.
+An example implementation for diploid individuals is:
+
+:::{code-block}
+n_dip_indv = int(ts.num_samples / 2)
+indv_names = [f"tsk_{i}indv" for i in range(n_dip_indv)]
+with open("output.vcf", "w") as vcf_file:
+    ts.write_vcf(vcf_file, individual_names=indv_names)
+:::
+
+Adding a second ``_`` (eg: ``tsk_0_indv``) is not recommended as
+``plink`` uses ``_`` as the default separator for separating family
+id and individual id, and two underscores will throw an error.
+
+(sec_export_vcf_modifying_coordinates)=
+
+### Modifying coordinates
+
+Tskit supports continuous genome coordinates, but VCF only supports discrete
+genome positions. Thus, when we convert a tree sequence that has sites
+at continuous positions we must discretise the coordinates in some way.
+
+The ``position_transform`` argument provides a way to flexibly translate
+the genomic location of sites in tskit to the appropriate value in VCF.
+There are two fundamental differences in the way that tskit and VCF define
+genomic coordinates. The first is that tskit uses floating point values
+to encode positions, whereas VCF uses integers. Thus, if the tree sequence
+contains positions at non-integral locations there is an information loss
+incurred by translating to VCF. By default, we round the site positions
+to the nearest integer, such that there may be several sites with the
+same integer position in the output. The second difference between VCF
+and tskit is that VCF is defined to be a 1-based coordinate system, whereas
+tskit uses 0-based. However, how coordinates are transformed depends
+on the VCF parser, and so we do **not** account for this change in
+coordinate system by default.
+
+:::{note}
+The msprime 0.x legacy API simulates using continuous coordinates. It may
+be simpler to update your code to use the msprime 1.0 API (which uses
+discrete coordinates by default) than to work out how to transform
+coordinates in a way that is suitable for your application.
+:::
+
+:::{todo}
+Provide some examples of using position transform.
+:::
diff --git a/python/CHANGELOG.rst b/python/CHANGELOG.rst
index 2ca6421f05..8a828a9bdf 100644
--- a/python/CHANGELOG.rst
+++ b/python/CHANGELOG.rst
@@ -11,6 +11,10 @@
 - Various memory leaks fixed (:user:`jeromekelleher`, :pr:`2424`, :issue:`2423`,
   :issue:`2427`).
 
+- Fix bugs in VCF output when there isn't a 1-1 mapping between individuals
+  and sample nodes (:user:`jeromekelleher`, :pr:`2442`, :issue:`2257`,
+  :issue:`2446`, :issue:`2448`).
+
 **Performance improvements**
 
 - TreeSequence.site position search performance greatly improved, with much lower
diff --git a/python/tests/test_vcf.py b/python/tests/test_vcf.py
index 0034b99b4b..147173cc34 100644
--- a/python/tests/test_vcf.py
+++ b/python/tests/test_vcf.py
@@ -379,6 +379,12 @@ def test_empty_individuals(self):
         with pytest.raises(ValueError, match="List of sample individuals empty"):
             ts.as_vcf(individuals=[])
 
+    def test_duplicate_individuals(self):
+        ts = msprime.sim_ancestry(3, random_seed=2)
+        ts = tsutil.insert_branch_sites(ts)
+        with pytest.raises(tskit.LibraryError, match="TSK_ERR_DUPLICATE_SAMPLE"):
+            ts.as_vcf(individuals=[0, 0])
+
     def test_mixed_sample_non_sample_individuals(self):
         ts = msprime.sim_ancestry(3, random_seed=2)
         tables = ts.dump_tables()
@@ -830,6 +836,17 @@ def test_no_individuals_ploidy_3(self):
         expected = textwrap.dedent(s)
         assert drop_header(ts.as_vcf(ploidy=3)) == expected
 
+    def test_no_individuals_ploidy_3_names(self):
+        ts = drop_individuals(self.ts())
+        s = """\
+        #CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA
+        1\t0\t0\t0\t1\t.\tPASS\t.\tGT\t1|0|0
+        1\t2\t1\t0\t1\t.\tPASS\t.\tGT\t0|1|1
+        1\t4\t2\t0\t1\t.\tPASS\t.\tGT\t0|1|0
+        1\t6\t3\t0\t1\t.\tPASS\t.\tGT\t0|0|1"""
+        expected = textwrap.dedent(s)
+        assert drop_header(ts.as_vcf(ploidy=3, individual_names="A")) == expected
+
     def test_defaults(self):
         ts = self.ts()
         s = """\
diff --git a/python/tskit/trees.py b/python/tskit/trees.py
index 7a2cc01002..69a52c9cb2 100644
--- a/python/tskit/trees.py
+++ b/python/tskit/trees.py
@@ -5972,17 +5972,40 @@ def write_vcf(
         isolated_as_missing=None,
     ):
         """
-        Writes a VCF formatted file to the specified file-like object.
-        If there is individual information present in the tree sequence
-        (see :ref:`sec_individual_table_definition`), the values for
-        sample nodes associated with these individuals are combined
-        into phased multiploid individuals and output.
-
-        If there is no individual data present in the tree sequence, synthetic
-        individuals are created by combining adjacent samples, and the number
-        of samples combined is equal to the specified ploidy value (1 by
-        default). For example, if we have a ploidy of 2 and a sample of size 6,
-        then we will have 3 diploid samples in the output, consisting of the
+        Convert the genetic variation data in this tree sequence to Variant
+        Call Format and write to the specified file-like object.
+
+        .. seealso: See the :ref:`sec_export_vcf` section for examples
+            and explanations of how we map VCF to the tskit data model.
+
+        Multiploid samples in the output VCF are generated either using
+        individual information in the data model (see
+        :ref:`sec_individual_table_definition`), or by combining genotypes for
+        adjacent sample nodes using the ``ploidy`` argument. See the
+        :ref:`sec_export_vcf_constructing_gt` section for more details
+        and examples.
+
+        If individuals that are associated with sample nodes are defined in the
+        data model (see :ref:`sec_individual_table_definition`), the genotypes
+        for each of the individual's samples are combined into a phased
+        multiploid values at each site. By default, all individuals associated
+        with sample nodes are included in increasing order of individual ID.
+
+        Subsets or permutations of the sample individuals may be specified
+        using the ``individuals`` argument. It is an error to specify any
+        individuals that are not associated with any nodes, or whose
+        nodes are not all samples.
+
+        Mixed-sample individuals (e.g., those associated with one node
+        that is a sample and another that is not) in the data model will
+        result in an error by default. However, such individuals can be
+        excluded using the ``individuals`` argument.
+
+        If there are no individuals in the tree sequence,
+        synthetic individuals are created by combining adjacent samples, and
+        the number of samples combined is equal to the ``ploidy`` value (1 by
+        default). For example, if we have a ``ploidy`` of 2 and 6 sample nodes,
+        then we will have 3 diploid samples in the VCF, consisting of the
         combined genotypes for samples [0, 1], [2, 3] and [4, 5]. If we had
         genotypes 011110 at a particular variant, then we would output the
         diploid genotypes 0|1, 1|1 and 1|0 in VCF.
@@ -6000,31 +6023,6 @@ def write_vcf(
             N is the number of individuals we output. These numbers
             are **not** necessarily the individual IDs.
 
-        .. note::
-
-            Warning to ``plink`` users:
-
-            As the default first individual name is ``tsk_0``, ``plink`` will
-            throw this error when loading the VCF:
-
-            ``Error: Sample ID ends with "_0", which induces an invalid IID of '0'.``
-
-            This can be fixed by using the ``individual_names`` argument
-            to set the names to anything where the first name doesn't end with ``_0``.
-            An example implementation for diploid individuals is:
-
-            .. code-block:: python
-
-                n_dip_indv = int(ts.num_samples / 2)
-                indv_names = [f"tsk_{str(i)}indv" for i in range(n_dip_indv)]
-                with open("output.vcf", "w") as vcf_file:
-                    ts.write_vcf(vcf_file, individual_names=indv_names)
-
-            Adding a second ``_`` (eg: ``tsk_0_indv``) is not recommended as
-            ``plink`` uses ``_`` as the default separator for separating family
-            id and individual id, and two ``_`` will throw an error.
-
-
         The REF value in the output VCF is the ancestral allele for a site
         and ALT values are the remaining alleles. It is important to note,
         therefore, that for real data this means that the REF value for a given
@@ -6032,96 +6030,19 @@ def write_vcf(
         check that the alleles result in a valid VCF---for example, it is possible
         to use the tab character as an allele, leading to a broken VCF.
 
-        The ID value in the output VCF file is the integer ID of the corresponding
-        :ref:`site <sec_site_table_definition>` (``site.id``). Subsequently,
+        The ID value in the output VCF file is the integer ID of the
+        corresponding :ref:`site <sec_site_table_definition>` (``site.id``).
         These ID values can be utilized to match the contents of the VCF file
         to the sites in the tree sequence object.
 
-        The ``position_transform`` argument provides a way to flexibly translate
-        the genomic location of sites in tskit to the appropriate value in VCF.
-        There are two fundamental differences in the way that tskit and VCF define
-        genomic coordinates. The first is that tskit uses floating point values
-        to encode positions, whereas VCF uses integers. Thus, if the tree sequence
-        contains positions at non-integral locations there is an information loss
-        incurred by translating to VCF. By default, we round the site positions
-        to the nearest integer, such that there may be several sites with the
-        same integer position in the output. The second difference between VCF
-        and tskit is that VCF is defined to be a 1-based coordinate system, whereas
-        tskit uses 0-based. However, how coordinates are transformed depends
-        on the VCF parser, and so we do **not** account for this change in
-        coordinate system by default.
-
         .. note::
-           Older code often uses the ``ploidy=2`` argument, because previous
+           Older code often uses the ``ploidy=2`` argument, because old
            versions of msprime did not output individual data. Specifying
            individuals in the tree sequence is more robust, and since tree
            sequences now  typically contain individuals (e.g., as produced by
            ``msprime.sim_ancestry( )``), this is not necessary, and the
-           ``ploidy`` argument can safely be removed from old code in most cases.
-
-
-        Example usage:
-
-        .. code-block:: python
-
-            with open("output.vcf", "w") as vcf_file:
-                tree_sequence.write_vcf(vcf_file)
-
-        The VCF output can also be compressed using the :mod:`gzip` module, if you wish:
-
-        .. code-block:: python
-
-            import gzip
-
-            with gzip.open("output.vcf.gz", "wt") as f:
-                ts.write_vcf(f)
-
-        However, this gzipped VCF may not be fully compatible with downstream tools
-        such as tabix, which may require the VCF use the specialised bgzip format.
-        A general way to convert VCF data to various formats is to pipe the text
-        produced by ``tskit`` into ``bcftools``, as done here:
-
-        .. code-block:: python
-
-            import os
-            import subprocess
-
-            read_fd, write_fd = os.pipe()
-            write_pipe = os.fdopen(write_fd, "w")
-            with open("output.bcf", "w") as bcf_file:
-                proc = subprocess.Popen(
-                    ["bcftools", "view", "-O", "b"], stdin=read_fd, stdout=bcf_file
-                )
-                ts.write_vcf(write_pipe)
-                write_pipe.close()
-                os.close(read_fd)
-                proc.wait()
-                if proc.returncode != 0:
-                    raise RuntimeError("bcftools failed with status:", proc.returncode)
-
-        This can also be achieved on the command line use the ``tskit vcf`` command,
-        e.g.:
-
-        .. code-block:: bash
-
-            $ tskit vcf example.trees | bcftools view -O b > example.bcf
-
-
-        The ``sample_mask`` argument provides a general way to mask out
-        parts of the output, which can be helpful when simulating missing
-        data. In this (contrived) example, we create a sample mask function
-        that marks one genotype missing in each variant in a regular
-        pattern:
-
-        .. code-block:: python
-
-            def sample_mask(variant):
-                sample_mask = np.zeros(ts.num_samples, dtype=bool)
-                sample_mask[variant.site.id % ts.num_samples] = 1
-                return sample_mask
-
-
-            ts.write_vcf(sys.stdout, sample_mask=sample_mask)
+           ``ploidy`` argument can safely be removed as part of the process
+           of updating from the msprime 0.x legacy API.
 
         :param io.IOBase output: The file-like object to write the VCF output.
         :param int ploidy: The ploidy of the individuals to be written to
@@ -6129,7 +6050,10 @@ def sample_mask(variant):
             used if there is individual data in the tree sequence.
         :param str contig_id: The value of the CHROM column in the output VCF.
         :param list(int) individuals: A list containing the individual IDs to
-            write out to VCF. Defaults to all individuals in the tree sequence.
+            corresponding to the VCF samples. Defaults to all individuals
+            associated with sample nodes in the tree sequence.
+            See the {ref}`sec_export_vcf_constructing_gt` section for more
+            details and examples.
         :param list(str) individual_names: A list of string names to identify
             individual columns in the VCF. In VCF nomenclature, these are the
             sample IDs. If specified, this must be a list of strings of
@@ -6137,8 +6061,9 @@ def sample_mask(variant):
             we do not check the form of these strings in any way, so that is
             is possible to output malformed VCF (for example, by embedding a
             tab character within on of the names). The default is to output
-            ``tsk_j`` for the jth individual. Cannot be used if ploidy is
-            specified.
+            ``tsk_j`` for the jth individual.
+            See the :ref:`sec_export_vcf_individual_names` for examples
+            and more information.
         :param position_transform: A callable that transforms the
             site position values into integer valued coordinates suitable for
             VCF. The function takes a single positional parameter x and must
@@ -6148,10 +6073,14 @@ def sample_mask(variant):
             pre 0.2.0 legacy behaviour of rounding values to the nearest integer
             (starting from 1) and avoiding the output of identical positions
             by incrementing is used.
+            See the :ref:`sec_export_vcf_modifying_coordinates` for examples
+            and more information.
         :param site_mask: A numpy boolean array (or something convertable to
             a numpy boolean array) with num_sites elements, used to mask out
             sites in the output. If  ``site_mask[j]`` is True, then this
             site (i.e., the line in the VCF file) will be omitted.
+            See the :ref:`sec_export_vcf_masking_output` for examples
+            and more information.
         :param sample_mask: A numpy boolean array (or something convertable to
             a numpy boolean array) with num_samples elements, or a callable
             that returns such an array, such that if
@@ -6160,8 +6089,9 @@ def sample_mask(variant):
             callable, it must take a single argument and return a boolean
             numpy array. This function will be called for each (unmasked) site
             with the corresponding :class:`.Variant` object, allowing
-            for dynamic masks to be generated. See above for example
-            usage.
+            for dynamic masks to be generated.
+            See the :ref:`sec_export_vcf_masking_output` for examples
+            and more information.
         :param bool isolated_as_missing: If True, the genotype value assigned to
             missing samples (i.e., isolated samples without mutations) is "."
             If False, missing samples will be assigned the ancestral allele.