From a7cbc7ed4827af1a494e418d8954f7e30b765679 Mon Sep 17 00:00:00 2001
From: Jeff Bezanson
Date: Sun, 14 Jul 2019 16:34:52 -0400
Subject: [PATCH 01/21] draft of multithreading blog post

---
 blog/_posts/2019-07-14-multithreading.md | 464 +++++++++++++++++++++++
 1 file changed, 464 insertions(+)
 create mode 100644 blog/_posts/2019-07-14-multithreading.md

diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md
new file mode 100644
index 0000000000..ba6552c369
--- /dev/null
+++ b/blog/_posts/2019-07-14-multithreading.md
@@ -0,0 +1,464 @@
+---
+layout: post
+title: Announcing composable multi-threaded parallelism in Julia
+author: Jeff Bezanson, Jameson Nash
+---

Software performance depends more and more on exploiting multiple processor cores.
The [free lunch][] is still over.
Well, we here in the Julia developer community have something of a reputation for
caring about performance, so we've known for years that we would need a good
story for multi-threaded, multi-core execution.
Today we are happy to announce a major new chapter in that story.
We are releasing an entirely new threading interface for Julia programs:
fully general task parallelism, inspired by parallel programming systems
like [Cilk][] and [Go][].

In this paradigm, any piece of a program can be marked for execution in parallel,
and a "task" will be started to run that code automatically on an available thread.
A dynamic scheduler handles all the decisions and details for you.
Here's an example of parallel code you can now write in Julia:

```
function fib(n::Int)
    if n < 2
        return n
    end
    t = @par fib(n - 2)
    return fib(n - 1) + fetch(t)
end
```

This, of course, is the classic highly-inefficient tree recursive implementation of
the Fibonacci sequence --- but running on any number of processor cores!
The line `t = @par fib(n - 2)` starts a task to compute `fib(n - 2)`, which runs in
parallel with the following line computing `fib(n - 1)`.
`fetch(t)` waits for task `t` to complete and gets its return value.

This model of parallelism has many wonderful properties.
I think of it as somewhat analogous to garbage collection: with GC, you
can freely allocate objects without worrying about how it works or when and how they
are freed.
With task parallelism, you freely spawn tasks without worrying about where they run.

The model is portable and free from low-level details.
You don't need to explicitly start and stop threads, and you don't even need to know how
many processors or threads there are (though you can find out if you want).

The model is nestable and composable: you can start parallel tasks that call library
functions that themselves start parallel tasks, and everything works.
Your CPUs will not be over-subscribed with threads.
This property is crucial for a high-level language where a lot of work is done by library
functions.
You need to be free to write whatever code you need --- including parallel code ---
without worrying about how the libraries it calls are implemented.

This is, in fact, the reason we are excited about this announcement: from this point on,
multi-core parallelism is unleashed over the entire Julia package ecosystem.

## Some history

One of the most surprising aspects of this new feature is just how long it has been in
the works.
From the very beginning --- prior even to the 0.1 release --- Julia has had the `Task`
type, providing symmetric coroutines and event-based I/O.
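For instance, here is a sketch of the kind of single-threaded concurrency `Task` has
always made possible, using the long-standing `@async` form of task creation
(illustrative only):

```
t = @async begin      # create and schedule a Task (concurrent, not parallel)
    sleep(1)          # yields, so other tasks can run while this one waits
    "io result"
end
println("other work runs here while the task waits")
println(fetch(t))     # block until the task finishes, then take its value
```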
+So we have always had a unit of *concurrency* in the language, it just wasn't *parallel* +(simultaneous streams of execution) yet. +We knew we needed parallelism though, so in 2014 (roughly the version 0.3 timeframe) we +set about the long process of making all of our code thread-safe. +Yichao Yu put in some particularly impressive work on the garbage collector and signal +handling. +Kiran Pamnany (of Intel) put some basic infrastructure in place for starting and +running multiple threads. + +Within about two years, we were ready to release the `@threads` macro in version 0.5, +which provides simple parallel loops. +Even though that wasn't the final design we wanted, it did two important jobs: +it let Julia programmers start taking advantage of multiple cores, and provided +test cases to shake out thread-related bugs in our runtime. +`@threads` had some huge limitations, however. +`@threads` loops could not be nested: all the functions you call from within such a loop +must not themselves use `@threads`. +It was also incompatible with our `Task` and I/O system: you couldn't do any I/O or +switch among `Task`s inside a threaded loop. + +So the next logical step was to merge the `Task` and threading systems, and "simply" +(cue laughter) allow `Task`s to run simultaneously on a pool of threads. +We had many early discussions with Arch Robison (then also of Intel) and concluded +that this was the best model for our language. +After version 0.5 (around 2016) Kiran started experimenting with a new parallel +task scheduler [PARTR][] based on the idea of depth-first scheduling. +He sold all of us on it with some nice animated slides, and it also didn't hurt that +he was willing to do some of the work. +The plan was to first develop PARTR as a standalone C library so it could be tested +and benchmarked on its own, and then integrate it with the Julia runtime. + +After Kiran completed the standalone version of PARTR, we embarked on a series of +work sessions including Anton Malakhov (also of Intel) to figure out how to do +the integration. +The Julia runtime brings many extra features, such as garbage collection and +event-based I/O, so this was not entirely straightforward. +Somewhat disappointingly, though not unusually for a complex software project, +it took much longer than expected --- nearly two years --- to get the new +system working reliably. +A later section of this post will explain some of the internals and difficulties +involved for the curious. +But first, let's take it for a spin. + +## How to use it + +To use Julia with multiple threads, set the `JULIA_NUM_THREADS` environment +variable: + +``` +$ JULIA_NUM_THREADS=4 ./julia +``` + +The `Threads` submodule of `Base` houses most of the thread-specific functionality, +such as querying the number of threads and the ID of the current thread: + +``` +julia> Threads.nthreads() +4 + +julia> Threads.threadid() +1 +``` + +`@threads` loops still work, except now I/O is no problem: + +``` +julia> Threads.@threads for i = 1:10 + println("i = $i on thread $(Threads.threadid())") + end +i = 1 on thread 1 +i = 7 on thread 3 +i = 2 on thread 1 +i = 8 on thread 3 +i = 3 on thread 1 +i = 9 on thread 4 +i = 10 on thread 4 +i = 4 on thread 2 +i = 5 on thread 2 +i = 6 on thread 2 +``` + +Without further ado, let's try some nested parallelism. +A perennial favorite example is mergesort, which divides its input in half +and recursively sorts each half. +The halves can be sorted independently, yielding a natural opportunity +for parallelism. 
+Here is the code: + +``` +# sort the elements of `v` in place, from indices `lo` to `hi` inclusive +function psort!(v, lo::Int=1, hi::Int=length(v)) + if lo >= hi # 1 or 0 elements; nothing to do + return v + end + if hi - lo < 100000 # below some cutoff, run in serial + sort!(view(v, lo:hi), alg = MergeSort) + return v + end + + mid = (lo+hi)>>>1 # find the midpoint + + half = @par psort!(v, lo, mid) # task to sort the lower half; will run + psort!(v, mid+1, hi) # in parallel with the current call sorting + # the upper half + wait(half) # wait for the lower half to finish + + temp = v[lo:mid] # workspace for merging + + i, k, j = 1, lo, mid+1 # merge the two sorted sub-arrays + @inbounds while k < j <= hi + if v[j] < temp[i] + v[k] = v[j] + j += 1 + else + v[k] = temp[i] + i += 1 + end + k += 1 + end + @inbounds while k < j + v[k] = temp[i] + k += 1 + i += 1 + end + + return v +end +``` + +This is just a standard mergesort implementation, similar to the one in Julia's +`Base` library, with only the tiny addition of the `@par` construct on one +of the recursive calls. +`wait` simply waits for the specified task to finish. +The code works by modifying its input, so we don't need the task's return value. +Indicating that a return value is not needed is the only difference with the +`fetch` call used in the earlier `fib` example. +Note that we explicitly request `MergeSort` when calling Julia's standard `sort!`, +to make sure we're comparing apples to apples --- `sort!` actually uses +quicksort by default for sorting numbers, which tends to be faster for random data. +Let's time the code under `JULIA_NUM_THREADS=2`: + +``` +julia> a = rand(20000000); + +julia> b = copy(a); @time sort!(b, alg = MergeSort); + 2.589243 seconds (11 allocations: 76.294 MiB, 0.17% gc time) + +julia> b = copy(a); @time sort!(b, alg = MergeSort); + 2.582697 seconds (11 allocations: 76.294 MiB, 2.25% gc time) + +julia> b = copy(a); @time psort!(b); + 1.770902 seconds (3.78 k allocations: 686.935 MiB, 4.25% gc time) + +julia> b = copy(a); @time psort!(b); + 1.741141 seconds (3.78 k allocations: 686.935 MiB, 4.16% gc time) +``` + +While the run times are bit variable, we see a definite speedup from using +two threads. +The laptop I ran this on has four hyperthreads, and I find it especially amazing +that the performance of this code continues to scale if we add a third thread: + +``` +julia> b = copy(a); @time psort!(b); + 1.511860 seconds (3.77 k allocations: 686.935 MiB, 6.45% gc time) +``` + +I don't know about you, but thinking about this two-way decomposition +algorithm running on three threads makes my head hurt a little! + +Notice that this speedup occurs despite the parallel code allocating +*drastically* more memory than the standard routine. +The allocations come from two sources: `Task` objects, and the `temp` +arrays allocated on each call. +The reference sorting routine re-uses a single temporary buffer among +all recursive calls. +Re-using the temporary array is more difficult with parallelism, but +still possible --- more on that a little later. + +## Moving to a parallel world + +All of this will be released as part of Julia version 1.3. +During the 1.3 series the new thread runtime is considered to be in beta testing. +An "official" version will appear in a later release, to give us time to settle +on an API we can commit to for the long term. + +To aid compatibility, code will continue to run within a single thread by default. 
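For example, a quick check of this default (a sketch; the IDs shown assume a session
started with several threads):

```
julia> Threads.nthreads()
4

julia> Threads.threadid()
1

julia> fetch(@async Threads.threadid())   # an `@async` task stays on the launching thread
1
```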
When tasks are launched using existing primitives (`schedule`, `@async`), they
will run only within the thread that launches them.
Similarly, a `Condition` object (used to signal tasks when events occur) can only
be used by the thread that created it.
Attempts to wait for or notify conditions from other threads will raise errors.
Separate thread-safe condition variables have been added, and are available as
`Threads.Condition`.
This needs to be a separate type because thread-safe use of condition variables
requires acquiring a lock.
In Julia, the lock is bundled with the condition, so `lock` can simply be called
on the condition itself:

```
cond = Threads.Condition()   # the lock is bundled inside

# later, in the waiting task:
lock(cond)
while !ready                 # `ready` is whatever state you are waiting for
    wait(cond)               # releases the lock while waiting
end
unlock(cond)
```

As in previous versions, the standard lock to use to protect critical sections
is `ReentrantLock`, which is now thread-safe (it was previously only used for
synchronizing tasks).
`Threads.SpinLock` is also available, to be used in rare circumstances where
(1) only threads and not tasks need to be synchronized, and (2) you expect to
hold the lock for a short time.
`Semaphore` and `Event` are also available, completing the standard set of
synchronization primitives.

Julia code naturally tends to be purely functional (no side effects or mutation),
or only uses local mutation, so migrating to full thread-safety will hopefully
be easy in many cases.
But if your code uses shared state and you'd like to make it thread-safe, there
is some work to do.
So far we have used two kinds of approaches to this in Julia's standard library:
synchronization (locks), and thread- or task-local state.
Locks work well for shared resources not accessed frequently, or for resources
that cannot be duplicated for each thread.

But for high-performance code we recommend thread-local state.
Our `psort!` routine above can be improved in this way.
Here is a recipe.
First, we modify the function to accept pre-allocated buffers, using a default
argument value to allocate space automatically when the caller doesn't provide it:

```
function psort!(v, lo::Int=1, hi::Int=length(v),
                temps=[similar(v, cld(length(v), 2)) for i = 1:Threads.nthreads()])
```

The maximum size of the temporary array our mergesort needs is half the input
array, using ceiling division (`cld`) to handle odd lengths.
We simply need to allocate one array per thread.
Next, we modify the recursive calls to reuse the space:

```
    half = @par psort!(v, lo, mid, temps)
    psort!(v, mid+1, hi, temps)
```

Finally, use the array reserved for the current thread, instead of allocating a new one:

```
    temp = temps[Threads.threadid()]
    copyto!(temp, 1, v, lo, mid-lo+1)   # copy the lower half into this thread's buffer
```

## Note on random numbers

Julia's default global random number generator (`rand()`) is a particularly
challenging case for thread-safety.
We have split it into separate random streams for each thread, allowing
code with `rand()` to be freely parallelized and get independent random
numbers on each thread.

However, seeding (`Random.seed!(n)`) is trickier.
Seeding all of the per-thread streams would require some kind of synchronization
among threads, which would unacceptably slow down random number generation.
It also makes a very limited amount of sense: threading introduces its own
nondeterminism, so you cannot get much predictability by seeding one thread's
state from another.
Therefore we decided to have the `seed!(n)` call affect only the current thread's
state.
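For example, here is a sketch of what you can rely on: re-seeding makes results
reproducible on the thread that called `seed!`, without touching any other
thread's stream:

```
julia> using Random

julia> Random.seed!(1234); x = rand(3);

julia> Random.seed!(1234); rand(3) == x   # reproducible on this thread
true
```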
That way, multiple independent code sequences that seed and then use random
numbers can at least individually work as expected.
For more elaborate seeding requirements, we recommend allocating and passing
your own RNG objects (e.g. `Random.MersenneTwister()`).


## Under the hood

As with garbage collection, the simple interface (`@par`) belies great
complexity underneath.
Here I will try to summarize some of the main difficulties and design
decisions we faced.

### Allocating and switching task stacks

Each `Task` requires its own execution stack, distinct from the usual process or
thread stacks provided by Unix operating systems.
Windows has fibers, which correspond closely to tasks, and several library
implementations of similar abstractions exist for Unix-family systems.

There are many possible approaches to this, with different tradeoffs.
As we often do, we tried to pick a method that would maximize throughput
and reliability.
We have a shared pool of stacks allocated by `mmap` (`VirtualAlloc` on
Windows), defaulting to 4MiB each (2MiB on 32-bit systems).
This can use quite a bit of virtual memory, so don't be alarmed if `top`
shows your shiny new multi-threaded Julia code using 100GiB of address
space.
2^64 (ok, in practice more like 2^48) is a big number, and integers are,
fortunately, free.
These are larger stacks than task systems in lower-level languages would
probably provide, but we feel it makes good use of the CPU and OS kernel's
highly refined memory management capabilities, while greatly reducing the
possibility of stack overflow.

A thread can switch to running a given task simply (in principle) by switching
its stack pointer to refer to the new task's stack and jumping to the next
instruction.
As soon as a task is done running, we can immediately release its stack back
to the pool, avoiding excessive GC pressure.

In practice, we have an alternate implementation of stack switching that trades
time for memory by copying only live stack data.
And of course, each implementation has code for multiple platforms and
architectures, often requiring assembly language.
Stack switching is a rich topic that could very well fill a blog post on its own.

### I/O

We use libuv for cross-platform event-based I/O.
It is designed to be able to function within a multithreaded program, but is not
explicitly a multithreaded I/O library and so doesn't support concurrent use from
multiple threads out of the box.
We decided to protect access to libuv structures with a lock, and then allow any thread
(one at a time) to run the event loop.
When another thread needs the event loop thread to wake up, it issues an async signal.
This can happen for multiple reasons, including another thread scheduling new work,
or another thread needing to run garbage collection.

### Task migration

In general, a task might start running on one thread, block for a while, and then
restart on another.
This changes fundamental assumptions about when thread-local values can change.
Internally, Julia code uses thread-local variables *constantly*, for example
every time you allocate memory.
We have yet to begin all the changes needed to support migration, so for now
a task must always run on the thread it started running on (though, of course,
it can start running on any thread).
To support this, we have a concept of "sticky" tasks that must run on a given
thread, and per-thread queues for running tasks associated with each thread.
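To illustrate, here is a sketch of what stickiness means in practice, using the
`@par` macro from above (this demonstrates current behavior, not a long-term
guarantee):

```
julia> t = @par begin
           id = Threads.threadid()    # the thread this task started on
           yield()                    # pass through the scheduler again
           id == Threads.threadid()   # still the same thread afterwards
       end;

julia> fetch(t)
true
```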
+ +### Sleeping idle threads + +When there aren't enough tasks to keep all threads busy, some need to block to avoid +using 100% of all CPUs all the time. +This is a tricky synchronization problem, since some threads might be scheduling new work +while other threads are deciding to block. + +### Where does the scheduler run? + +When a task blocks, we need to call the scheduler to pick another task to run. +What stack does that code use to run? +It's possible to have a dedicated scheduler task, but we felt there would be +less overhead if we allowed the scheduler code to run in the context of the +recently-blocked task. +That works fine, but it means a task can exist in a strange intermediate state +where it is considered not to be running, and yet is in fact running the +scheduler. +In particular, we need to make sure no other thread sees that task and thinks +"oh, there's a task I can run", causing it to scribble on the scheduler's +stack. + +### Classic bugs + +While trying to get this new functionality working, we encountered several +maddeningly difficult bugs. +My hands-down favorite was a mysterious hang on Windows that was fixed +by literally [flipping a single bit][]. + + +## Looking forward + +While we are excited about this milestone, a lot of work remains. +Here are some of the points we hope to focus on to further develop +our threading capabilities: + +* Performance work on task switch and I/O latency. +* Adding parallelism to the standard library. Many common operations like sorting and + array broadcasting can now use multiple threads internally. +* Consider allowing task migration. +* Improved debugging tools. +* Explore API extensions like cancel points. +* Provide alternate schedulers. + + +## Acknowledgements + +We would like to gratefully acknowledge funding support from Intel and relational.ai +that made it possible to develop these new capabilities. + +We are also grateful to the several people who patiently tried this functionality +while it was in development and filed bug reports or pull requests, and spurred us +to keep going! + + +[free lunch]: http://www.gotw.ca/publications/concurrency-ddj.htm +[Cilk]: http://cilk.mit.edu/ +[Go]: https://tour.golang.org/concurrency/1 +[PARTR]: https://github.com/kpamnany/partr +[flipping a single bit]: https://github.com/JuliaLang/libuv/commit/26dbe5672c33fc885462c509fe2a9b36f35866fd From a59d9e5c95de4f33afbded30cc221fd2809ac420 Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Mon, 15 Jul 2019 18:01:58 -0400 Subject: [PATCH 02/21] some edits --- blog/_posts/2019-07-14-multithreading.md | 54 +++++++++++++++++------- 1 file changed, 38 insertions(+), 16 deletions(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index ba6552c369..ff397f2b0d 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -11,8 +11,10 @@ caring about performance, so we've known for years that we would need a good story for multi-threaded, multi-core execution. Today we are happy to announce a major new chapter in that story. We are releasing an entirely new threading interface for Julia programs: -fully general task parallelism, inspired by parallel programming systems +general task parallelism, inspired by parallel programming systems like [Cilk][] and [Go][]. +Task parallelism is now available on the master branch, and a beta version will be +released as part of the upcoming Julia version 1.3. 
In this paradigm, any piece of a program can be marked for execution in parallel, and a "task" will be started to run that code automatically on an available thread. @@ -37,9 +39,9 @@ parallel with the following line computing `fib(n - 1)`. This model of parallelism has many wonderful properties. I think of it as somewhat analogous to garbage collection: with GC, you -can freely allocate objects without worrying about how it works or when and how they -are freed. -With task parallelism, you freely spawn tasks without worrying about where they run. +freely allocate objects without worrying about when and how they are freed. +With task parallelism, you freely spawn tasks --- potentially millions of them --- without +worrying about where they run. The model is portable and free from low-level details. You don't need to explicitly start and stop threads, and you don't even need to know how @@ -114,6 +116,11 @@ variable: $ JULIA_NUM_THREADS=4 ./julia ``` +The [Juno IDE][] automatically sets the number of threads based on the number of +available processor cores, and also provides a graphical interface for changing +the number of threads, so setting the variable manually is not necessary +in that environment. + The `Threads` submodule of `Base` houses most of the thread-specific functionality, such as querying the number of threads and the ID of the current thread: @@ -241,12 +248,14 @@ all recursive calls. Re-using the temporary array is more difficult with parallelism, but still possible --- more on that a little later. -## Moving to a parallel world +## How to move to a parallel world -All of this will be released as part of Julia version 1.3. During the 1.3 series the new thread runtime is considered to be in beta testing. An "official" version will appear in a later release, to give us time to settle on an API we can commit to for the long term. +Here's what you need to know if you want to upgrade your code over this period. + +### Task scheduling and synchronization To aid compatibility, code will continue to run within a single thread by default. When tasks are launched using existing primitives (`schedule`, `@async`), they @@ -278,6 +287,8 @@ hold the lock for a short time. `Semaphore` and `Event` are also available, completing the standard set of synchronization primitives. +### Thread-local state + Julia code naturally tends to be purely functional (no side effects or mutation), or only uses local mutation, so migrating to full thread-safety will hopefully be easy in many cases. @@ -315,7 +326,7 @@ Finally, use the array reserved for the current thread, instead of allocating a copyto!(temp, 1, v, lo, m-lo+1) ``` -## Note on random numbers +### Seeding the default random number generator Julia's default global random number generator (`rand()`) is a particularly challenging case for thread-safety. @@ -359,21 +370,31 @@ windows), defaulting to 4MiB each (2MiB on 32-bit systems). This can use quite a bit of virtual memory, so don't be alarmed if `top` shows your shiny new multi-threaded Julia code using 100GiB of address space. -264 (ok, in practice more like 248) is a big number, and integers are, -fortunately, free. +The vast majority of this space will not consume real resources, and is only +there in case a task needs to execute a deep call chain (which will hopefully +not persist for long). 
These are larger stacks than task systems in lower-level languages would probably provide, but we feel it makes good use of the CPU and OS kernel's highly refined memory management capabilities, while greatly reducing the possibility of stack overflow. +The default stack size is a build-time option, set in `src/options.h`. +The `Task` constructor also has an undocumented second argument allowing +you to specify a stack size per-task. +Using it is not recommended, since it is hard to predict how much stack +space will be needed, for instance by the compiler or called libraries. + A thread can switch to running a given task simply (in principle) by switching its stack pointer to refer to the new task's stack and jumping to the next instruction. As soon as a task is done running, we can immediately release its stack back to the pool, avoiding excessive GC pressure. -In practice, we have an alternate implementation of stack switching that trades -time for memory by copying only live stack data. +We also have an alternate implementation of stack switching (controlled by the +`ALWAYS_COPY_STACKS` variable in `options.h`) that trades time for memory by +copying live stack data when a task switch occurs. +We fall back to this implementation if stacks are consuming too much address +space (some platforms impose a limit, which we exceeded in early testing). And of course, each implementation has code for multiple platforms and architectures, often requiring assembly language. Stack switching is a rich topic that could very well fill a blog post on its own. @@ -405,10 +426,10 @@ thread, and per-thread queues for running tasks associated with each thread. ### Sleeping idle threads -When there aren't enough tasks to keep all threads busy, some need to block to avoid +When there aren't enough tasks to keep all threads busy, some need to sleep to avoid using 100% of all CPUs all the time. This is a tricky synchronization problem, since some threads might be scheduling new work -while other threads are deciding to block. +while other threads are deciding to sleep. ### Where does the scheduler run? @@ -428,7 +449,7 @@ stack. While trying to get this new functionality working, we encountered several maddeningly difficult bugs. -My hands-down favorite was a mysterious hang on Windows that was fixed +The clear favorite was a mysterious hang on Windows that was fixed by literally [flipping a single bit][]. @@ -440,7 +461,7 @@ our threading capabilities: * Performance work on task switch and I/O latency. * Adding parallelism to the standard library. Many common operations like sorting and - array broadcasting can now use multiple threads internally. + array broadcasting could now use multiple threads internally. * Consider allowing task migration. * Improved debugging tools. * Explore API extensions like cancel points. @@ -449,7 +470,7 @@ our threading capabilities: ## Acknowledgements -We would like to gratefully acknowledge funding support from Intel and relational.ai +We would like to gratefully acknowledge funding support from Intel and relationalAI that made it possible to develop these new capabilities. We are also grateful to the several people who patiently tried this functionality @@ -461,4 +482,5 @@ to keep going! 
[Cilk]: http://cilk.mit.edu/ [Go]: https://tour.golang.org/concurrency/1 [PARTR]: https://github.com/kpamnany/partr +[Juno IDE]: https://junolab.org/ [flipping a single bit]: https://github.com/JuliaLang/libuv/commit/26dbe5672c33fc885462c509fe2a9b36f35866fd From 1f63a1240480f85e908136631af88b7d35ee0ee0 Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Tue, 16 Jul 2019 14:52:59 -0400 Subject: [PATCH 03/21] some edits, part 2 --- blog/_posts/2019-07-14-multithreading.md | 62 +++++++++++++++++++----- 1 file changed, 51 insertions(+), 11 deletions(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index ff397f2b0d..b68f33dc90 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -38,7 +38,7 @@ parallel with the following line computing `fib(n - 1)`. `fetch(t)` waits for task `t` to complete and gets its return value. This model of parallelism has many wonderful properties. -I think of it as somewhat analogous to garbage collection: with GC, you +We see it as somewhat analogous to garbage collection: with GC, you freely allocate objects without worrying about when and how they are freed. With task parallelism, you freely spawn tasks --- potentially millions of them --- without worrying about where they run. @@ -63,7 +63,7 @@ multi-core parallelism is unleashed over the entire Julia package ecosystem. One of the most surprising aspects of this new feature is just how long it has been in the works. From the very beginning --- prior even to the 0.1 release --- Julia has had the `Task` -type, providing symmetric coroutines and event-based I/O. +type providing symmetric coroutines, which we've used for event-based I/O. So we have always had a unit of *concurrency* in the language, it just wasn't *parallel* (simultaneous streams of execution) yet. We knew we needed parallelism though, so in 2014 (roughly the version 0.3 timeframe) we @@ -89,13 +89,13 @@ So the next logical step was to merge the `Task` and threading systems, and "sim We had many early discussions with Arch Robison (then also of Intel) and concluded that this was the best model for our language. After version 0.5 (around 2016) Kiran started experimenting with a new parallel -task scheduler [PARTR][] based on the idea of depth-first scheduling. +task scheduler [partr][] based on the idea of depth-first scheduling. He sold all of us on it with some nice animated slides, and it also didn't hurt that he was willing to do some of the work. -The plan was to first develop PARTR as a standalone C library so it could be tested +The plan was to first develop partr as a standalone C library so it could be tested and benchmarked on its own, and then integrate it with the Julia runtime. -After Kiran completed the standalone version of PARTR, we embarked on a series of +After Kiran completed the standalone version of partr, we embarked on a series of work sessions including Anton Malakhov (also of Intel) to figure out how to do the integration. The Julia runtime brings many extra features, such as garbage collection and @@ -228,7 +228,7 @@ julia> b = copy(a); @time psort!(b); While the run times are bit variable, we see a definite speedup from using two threads. 
-The laptop I ran this on has four hyperthreads, and I find it especially amazing +The laptop we ran this on has four hyperthreads, and it is especially amazing that the performance of this code continues to scale if we add a third thread: ``` @@ -236,8 +236,21 @@ julia> b = copy(a); @time psort!(b); 1.511860 seconds (3.77 k allocations: 686.935 MiB, 6.45% gc time) ``` -I don't know about you, but thinking about this two-way decomposition -algorithm running on three threads makes my head hurt a little! +Thinking about this two-way decomposition algorithm running on three threads +can make your head hurt a little! +In our view, this helps underscore how "automatic" this interface makes +parallelism feel. + +Let's try a different machine with more CPU cores: + +``` +$ for n in 1 2 4 8 16; do JULIA_NUM_THREADS=$n ./julia psort.jl; done + 2.958881 seconds (3.58 k allocations: 686.932 MiB, 4.71% gc time) + 1.868720 seconds (3.77 k allocations: 686.935 MiB, 7.03% gc time) + 1.222777 seconds (3.78 k allocations: 686.935 MiB, 9.14% gc time) + 0.958517 seconds (3.79 k allocations: 686.935 MiB, 18.21% gc time) + 0.836891 seconds (3.78 k allocations: 686.935 MiB, 21.10% gc time) +``` Notice that this speedup occurs despite the parallel code allocating *drastically* more memory than the standard routine. @@ -326,6 +339,21 @@ Finally, use the array reserved for the current thread, instead of allocating a copyto!(temp, 1, v, lo, m-lo+1) ``` +After these minor modifications, let's check performance on our large +machine: + +``` +$ for n in 1 2 4 8 16; do JULIA_NUM_THREADS=$n ./julia psort.jl; done + 2.723312 seconds (3.07 k allocations: 152.852 MiB, 0.14% gc time) + 1.711112 seconds (3.28 k allocations: 229.149 MiB, 0.59% gc time) + 0.971327 seconds (3.28 k allocations: 381.737 MiB, 1.60% gc time) + 0.782790 seconds (3.28 k allocations: 686.913 MiB, 8.63% gc time) + 0.722063 seconds (3.33 k allocations: 1.267 GiB, 21.43% gc time) +``` + +Definitely faster, but we do seem to have some work to do on the +scalability of the runtime system. + ### Seeding the default random number generator Julia's default global random number generator (`rand()`) is a particularly @@ -352,7 +380,7 @@ your own RNG objects (e.g. `Rand.MersenneTwister()`). As with garbage collection, the simple interface (`@par`) belies great complexity underneath. -Here I will try to summarize some of the main difficulties and design +Here we will try to summarize some of the main difficulties and design decisions we faced. ### Allocating and switching task stacks @@ -452,6 +480,14 @@ maddeningly difficult bugs. The clear favorite was a mysterious hang on Windows that was fixed by literally [flipping a single bit][]. +Another good one was a [missing exception handling personality][]. +In hindsight that could have been straightforward, but was confounded by two +factors: first, the failure mode caused the kernel to stop our process in a way +that we were not able to intercept in a debugger, and second, the failure was +triggered by a seemingly-unrelated change. +All Julia stack frames have an exception handling personality set, so the problem +could only appear in the runtime system outside any Julia frame --- a narrow window, +since of course we are usually executing Julia code. ## Looking forward @@ -464,8 +500,9 @@ our threading capabilities: array broadcasting could now use multiple threads internally. * Consider allowing task migration. * Improved debugging tools. -* Explore API extensions like cancel points. 
+* Explore API extensions, e.g. cancel points. * Provide alternate schedulers. +* Explore integration with the [TAPIR][] parallel IR (some early work [here][]). ## Acknowledgements @@ -481,6 +518,9 @@ to keep going! [free lunch]: http://www.gotw.ca/publications/concurrency-ddj.htm [Cilk]: http://cilk.mit.edu/ [Go]: https://tour.golang.org/concurrency/1 -[PARTR]: https://github.com/kpamnany/partr +[partr]: https://github.com/kpamnany/partr [Juno IDE]: https://junolab.org/ [flipping a single bit]: https://github.com/JuliaLang/libuv/commit/26dbe5672c33fc885462c509fe2a9b36f35866fd +[missing exception handling personality]: https://github.com/JuliaLang/julia/pull/32570 +[TAPIR]: http://cilk.mit.edu/tapir/ +[here]: https://github.com/JuliaLang/julia/pull/31086 From bf7ae0bb836025158ae1003076ae836fadba53fd Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 12:52:17 -0400 Subject: [PATCH 04/21] Update blog/_posts/2019-07-14-multithreading.md Co-Authored-By: Jameson Nash --- blog/_posts/2019-07-14-multithreading.md | 1 + 1 file changed, 1 insertion(+) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index b68f33dc90..807f06a4c1 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -54,6 +54,7 @@ This property is crucial for a high-level language where a lot of work is done b functions. You need to be free to write whatever code you need --- including parallel code --- without worrying about how the libraries it calls are implemented. +(*in the future we plan to extend this to C libraries such as BLAS, currently it only applies to Julia code) This is, in fact, the reason we are excited about this announcement: from this point on, multi-core parallelism is unleashed over the entire Julia package ecosystem. From d84b7fc9d4875e74006ca2c26695383a105060fa Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 12:52:36 -0400 Subject: [PATCH 05/21] Update blog/_posts/2019-07-14-multithreading.md Co-Authored-By: Jameson Nash --- blog/_posts/2019-07-14-multithreading.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index 807f06a4c1..551459ac33 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -5,7 +5,7 @@ author: Jeff Bezanson, Jameson Nash --- Software performance depends more and more on exploiting multiple processor cores. -The [free lunch][] is still over. +The [free lunch][] from Moore's Law is still over. Well, we here in the Julia developer community have something of a reputation for caring about performance, so we've known for years that we would need a good story for multi-threaded, multi-core execution. From ae16cf8f289ad14cce2d9526578cc24a2b233820 Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 12:57:47 -0400 Subject: [PATCH 06/21] Update blog/_posts/2019-07-14-multithreading.md Co-Authored-By: Jameson Nash --- blog/_posts/2019-07-14-multithreading.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index 551459ac33..4f84a65de5 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -65,7 +65,7 @@ One of the most surprising aspects of this new feature is just how long it has b the works. 
From the very beginning --- prior even to the 0.1 release --- Julia has had the `Task` type providing symmetric coroutines, which we've used for event-based I/O. -So we have always had a unit of *concurrency* in the language, it just wasn't *parallel* +So we have always had a unit of *concurrency* (independent streams of execution) in the language, it just wasn't *parallel* (simultaneous) (simultaneous streams of execution) yet. We knew we needed parallelism though, so in 2014 (roughly the version 0.3 timeframe) we set about the long process of making all of our code thread-safe. From ea85488782640cf6d747f81a38a55c2e54839fbb Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 12:58:45 -0400 Subject: [PATCH 07/21] Update blog/_posts/2019-07-14-multithreading.md Co-Authored-By: Jameson Nash --- blog/_posts/2019-07-14-multithreading.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index 4f84a65de5..131a215d61 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -69,7 +69,7 @@ So we have always had a unit of *concurrency* (independent streams of execution) (simultaneous streams of execution) yet. We knew we needed parallelism though, so in 2014 (roughly the version 0.3 timeframe) we set about the long process of making all of our code thread-safe. -Yichao Yu put in some particularly impressive work on the garbage collector and signal +Yichao Yu put in some particularly impressive work on the garbage collector and thread-local-storage performance. handling. Kiran Pamnany (of Intel) put some basic infrastructure in place for starting and running multiple threads. From af04294a189d9fefa8218d16216a49aeba02edf8 Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 13:00:55 -0400 Subject: [PATCH 08/21] Update blog/_posts/2019-07-14-multithreading.md Co-Authored-By: Jameson Nash --- blog/_posts/2019-07-14-multithreading.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index 131a215d61..a5df99c273 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -156,7 +156,7 @@ A perennial favorite example is mergesort, which divides its input in half and recursively sorts each half. The halves can be sorted independently, yielding a natural opportunity for parallelism. -Here is the code: +Here is that code: ``` # sort the elements of `v` in place, from indices `lo` to `hi` inclusive From a1ebca11ff35f8e3ba321ff1230c1c37084dec2c Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 13:01:49 -0400 Subject: [PATCH 09/21] Update blog/_posts/2019-07-14-multithreading.md Co-Authored-By: Jameson Nash --- blog/_posts/2019-07-14-multithreading.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index a5df99c273..49dff304f0 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -71,7 +71,7 @@ We knew we needed parallelism though, so in 2014 (roughly the version 0.3 timefr set about the long process of making all of our code thread-safe. Yichao Yu put in some particularly impressive work on the garbage collector and thread-local-storage performance. handling. 
-Kiran Pamnany (of Intel) put some basic infrastructure in place for starting and +Kiran Pamnany (of Intel) designed some basic infrastructure for scheduling multiple threads and managing atomic datastructures. running multiple threads. Within about two years, we were ready to release the `@threads` macro in version 0.5, From a19258dbfaaa82710211d8c12f9f2a594d0eeb89 Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 13:08:19 -0400 Subject: [PATCH 10/21] Update blog/_posts/2019-07-14-multithreading.md Co-Authored-By: Jameson Nash --- blog/_posts/2019-07-14-multithreading.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index 49dff304f0..abeeab5a29 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -74,7 +74,7 @@ handling. Kiran Pamnany (of Intel) designed some basic infrastructure for scheduling multiple threads and managing atomic datastructures. running multiple threads. -Within about two years, we were ready to release the `@threads` macro in version 0.5, +In version 0.5 about two years later, we released the `@threads for` macro with "experimental" status which could handle simple parallel loops running on all cores. which provides simple parallel loops. Even though that wasn't the final design we wanted, it did two important jobs: it let Julia programmers start taking advantage of multiple cores, and provided From c6f2163e9fa4996f55d263c62c0b765bdb5101ef Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 13:09:06 -0400 Subject: [PATCH 11/21] Update blog/_posts/2019-07-14-multithreading.md Co-Authored-By: Jameson Nash --- blog/_posts/2019-07-14-multithreading.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index abeeab5a29..cc90ce3c8a 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -133,7 +133,7 @@ julia> Threads.threadid() 1 ``` -`@threads` loops still work, except now I/O is no problem: +Existing `@threads for` uses will still work, and now I/O is fully supported: ``` julia> Threads.@threads for i = 1:10 From 920e8f3bc8c062398dbb4a3a85f5899e05be6d30 Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 15:39:23 -0400 Subject: [PATCH 12/21] Apply suggestions from code review Co-Authored-By: Stefan Karpinski --- blog/_posts/2019-07-14-multithreading.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index cc90ce3c8a..2124765e2f 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -12,7 +12,7 @@ story for multi-threaded, multi-core execution. Today we are happy to announce a major new chapter in that story. We are releasing an entirely new threading interface for Julia programs: general task parallelism, inspired by parallel programming systems -like [Cilk][] and [Go][]. +like [Cilk][], [Intel Threading Building Blocks][] and [Go][]. Task parallelism is now available on the master branch, and a beta version will be released as part of the upcoming Julia version 1.3. @@ -518,6 +518,7 @@ to keep going! 
[free lunch]: http://www.gotw.ca/publications/concurrency-ddj.htm [Cilk]: http://cilk.mit.edu/ +[Intel Threading Building Blocks]: https://software.intel.com/en-us/intel-tbb/ [Go]: https://tour.golang.org/concurrency/1 [partr]: https://github.com/kpamnany/partr [Juno IDE]: https://junolab.org/ From c3f095854d0a2bc59d97cbd0dd6d9eabe4c6849b Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 15:46:19 -0400 Subject: [PATCH 13/21] even more edits --- blog/_posts/2019-07-14-multithreading.md | 85 +++++++++++++++--------- 1 file changed, 53 insertions(+), 32 deletions(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index 2124765e2f..d806d66b43 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -7,14 +7,16 @@ author: Jeff Bezanson, Jameson Nash Software performance depends more and more on exploiting multiple processor cores. The [free lunch][] from Moore's Law is still over. Well, we here in the Julia developer community have something of a reputation for -caring about performance, so we've known for years that we would need a good -story for multi-threaded, multi-core execution. +caring about performance. +We have already built a lot of functionality for multi-process, distributed +programming and GPUs, but we've known for years that we would also need a good +story for fast, composable multi-threading. Today we are happy to announce a major new chapter in that story. -We are releasing an entirely new threading interface for Julia programs: +We are releasing a preview of an entirely new threading interface for Julia programs: general task parallelism, inspired by parallel programming systems -like [Cilk][], [Intel Threading Building Blocks][] and [Go][]. +like [Cilk][], [Intel Threading Building Blocks][] (TBB) and [Go][]. Task parallelism is now available on the master branch, and a beta version will be -released as part of the upcoming Julia version 1.3. +included as part of the upcoming Julia version 1.3. In this paradigm, any piece of a program can be marked for execution in parallel, and a "task" will be started to run that code automatically on an available thread. @@ -22,25 +24,27 @@ A dynamic scheduler handles all the decisions and details for you. Here's an example of parallel code you can now write in Julia: ``` +using Base.Threads + function fib(n::Int) if n < 2 return n end - t = @par fib(n - 2) + t = @spawn fib(n - 2) return fib(n - 1) + fetch(t) end ``` This, of course, is the classic highly-inefficient tree recursive implementation of -the Fibonacci sequence --- but running on any number of processor cores! -The line `t = @par fib(n - 2)` starts a task to compute `fib(n - 2)`, which runs in +the Fibonacci sequence---but running on any number of processor cores! +The line `t = @spawn fib(n - 2)` starts a task to compute `fib(n - 2)`, which runs in parallel with the following line computing `fib(n - 1)`. `fetch(t)` waits for task `t` to complete and gets its return value. This model of parallelism has many wonderful properties. We see it as somewhat analogous to garbage collection: with GC, you freely allocate objects without worrying about when and how they are freed. -With task parallelism, you freely spawn tasks --- potentially millions of them --- without +With task parallelism, you freely spawn tasks---potentially millions of them---without worrying about where they run. The model is portable and free from low-level details. 
@@ -52,22 +56,23 @@ functions that themselves start parallel tasks, and everything works. Your CPUs will not be over-subscribed with threads. This property is crucial for a high-level language where a lot of work is done by library functions. -You need to be free to write whatever code you need --- including parallel code --- -without worrying about how the libraries it calls are implemented. -(*in the future we plan to extend this to C libraries such as BLAS, currently it only applies to Julia code) +You need to be free to write whatever code you need---including parallel code--- +without worrying about how the libraries it calls are implemented +(currently only for Julia code, but in the future we plan to extend this to native libraries +such as BLAS). -This is, in fact, the reason we are excited about this announcement: from this point on, +This is, in fact, the main reason we are excited about this announcement: from this point on, multi-core parallelism is unleashed over the entire Julia package ecosystem. ## Some history One of the most surprising aspects of this new feature is just how long it has been in the works. -From the very beginning --- prior even to the 0.1 release --- Julia has had the `Task` +From the very beginning---prior even to the 0.1 release---Julia has had the `Task` type providing symmetric coroutines, which we've used for event-based I/O. -So we have always had a unit of *concurrency* (independent streams of execution) in the language, it just wasn't *parallel* (simultaneous) -(simultaneous streams of execution) yet. -We knew we needed parallelism though, so in 2014 (roughly the version 0.3 timeframe) we +So we have always had a unit of *concurrency* (independent streams of execution) in the language, it just +wasn't *parallel* (simultaneous streams of execution) yet. +We knew we needed thread-based parallelism though, so in 2014 (roughly the version 0.3 timeframe) we set about the long process of making all of our code thread-safe. Yichao Yu put in some particularly impressive work on the garbage collector and thread-local-storage performance. handling. @@ -80,8 +85,8 @@ Even though that wasn't the final design we wanted, it did two important jobs: it let Julia programmers start taking advantage of multiple cores, and provided test cases to shake out thread-related bugs in our runtime. `@threads` had some huge limitations, however. -`@threads` loops could not be nested: all the functions you call from within such a loop -must not themselves use `@threads`. +`@threads` loops could not be nested: if the functions they called used `@threads` +recursively, those inner loops would only occupy the CPU that called them. It was also incompatible with our `Task` and I/O system: you couldn't do any I/O or switch among `Task`s inside a threaded loop. @@ -89,20 +94,20 @@ So the next logical step was to merge the `Task` and threading systems, and "sim (cue laughter) allow `Task`s to run simultaneously on a pool of threads. We had many early discussions with Arch Robison (then also of Intel) and concluded that this was the best model for our language. -After version 0.5 (around 2016) Kiran started experimenting with a new parallel +After version 0.5 (around 2016), Kiran started experimenting with a new parallel task scheduler [partr][] based on the idea of depth-first scheduling. He sold all of us on it with some nice animated slides, and it also didn't hurt that he was willing to do some of the work. 
The plan was to first develop partr as a standalone C library so it could be tested -and benchmarked on its own, and then integrate it with the Julia runtime. +and benchmarked on its own and then integrate it with the Julia runtime. -After Kiran completed the standalone version of partr, we embarked on a series of -work sessions including Anton Malakhov (also of Intel) to figure out how to do -the integration. +After Kiran completed the standalone version of partr, a few of us (the authors of +this post, as well as Keno Fischer and Intel's Anton Malakhov) embarked on a series of +face-to-face work sessions to figure out how to do the integration. The Julia runtime brings many extra features, such as garbage collection and event-based I/O, so this was not entirely straightforward. Somewhat disappointingly, though not unusually for a complex software project, -it took much longer than expected --- nearly two years --- to get the new +it took much longer than expected---nearly two years---to get the new system working reliably. A later section of this post will explain some of the internals and difficulties involved for the curious. @@ -159,6 +164,8 @@ for parallelism. Here is that code: ``` +using Base.Threads + # sort the elements of `v` in place, from indices `lo` to `hi` inclusive function psort!(v, lo::Int=1, hi::Int=length(v)) if lo >= hi # 1 or 0 elements; nothing to do @@ -171,7 +178,7 @@ function psort!(v, lo::Int=1, hi::Int=length(v)) mid = (lo+hi)>>>1 # find the midpoint - half = @par psort!(v, lo, mid) # task to sort the lower half; will run + half = @spawn psort!(v, lo, mid) # task to sort the lower half; will run psort!(v, mid+1, hi) # in parallel with the current call sorting # the upper half wait(half) # wait for the lower half to finish @@ -200,14 +207,22 @@ end ``` This is just a standard mergesort implementation, similar to the one in Julia's -`Base` library, with only the tiny addition of the `@par` construct on one +`Base` library, with only the tiny addition of the `@spawn` construct on one of the recursive calls. +Julia's `Distributed` standard library has also exported a `@spawn` macro for +quite a while, but we plan to discontinue it in favor of the new threaded +meaning (though it will still be available in 1.x versions, for backwards +compatibility). +This way of expressing parallelism is much more useful in shared memory, +and "spawn" is a pretty standard term in task parallel APIs (used in Cilk +as well as [TBB][], for example). + `wait` simply waits for the specified task to finish. The code works by modifying its input, so we don't need the task's return value. Indicating that a return value is not needed is the only difference with the `fetch` call used in the earlier `fib` example. Note that we explicitly request `MergeSort` when calling Julia's standard `sort!`, -to make sure we're comparing apples to apples --- `sort!` actually uses +to make sure we're comparing apples to apples---`sort!` actually uses quicksort by default for sorting numbers, which tends to be faster for random data. Let's time the code under `JULIA_NUM_THREADS=2`: @@ -260,7 +275,7 @@ arrays allocated on each call. The reference sorting routine re-uses a single temporary buffer among all recursive calls. Re-using the temporary array is more difficult with parallelism, but -still possible --- more on that a little later. +still possible---more on that a little later. ## How to move to a parallel world @@ -329,7 +344,7 @@ We simply need to allocate one array per thread. 
Next, we modify the recursive calls to reuse the space: ``` - half = @par psort!(v, lo, mid, temps) + half = @spawn psort!(v, lo, mid, temps) psort!(v, mid+1, hi, temps) ``` @@ -379,7 +394,7 @@ your own RNG objects (e.g. `Rand.MersenneTwister()`). ## Under the hood -As with garbage collection, the simple interface (`@par`) belies great +As with garbage collection, the simple interface (`@spawn`) belies great complexity underneath. Here we will try to summarize some of the main difficulties and design decisions we faced. @@ -487,7 +502,7 @@ factors: first, the failure mode caused the kernel to stop our process in a way that we were not able to intercept in a debugger, and second, the failure was triggered by a seemingly-unrelated change. All Julia stack frames have an exception handling personality set, so the problem -could only appear in the runtime system outside any Julia frame --- a narrow window, +could only appear in the runtime system outside any Julia frame---a narrow window, since of course we are usually executing Julia code. ## Looking forward @@ -500,8 +515,13 @@ our threading capabilities: * Adding parallelism to the standard library. Many common operations like sorting and array broadcasting could now use multiple threads internally. * Consider allowing task migration. +* Provide more atomic operations at the Julia level. +* Using multiple threads in the compiler. +* More performant parallel loops and reductions, with more scheduling options. +* Allow adding more threads at run time. * Improved debugging tools. * Explore API extensions, e.g. cancel points. +* Thread-safe data structures. * Provide alternate schedulers. * Explore integration with the [TAPIR][] parallel IR (some early work [here][]). @@ -526,3 +546,4 @@ to keep going! [missing exception handling personality]: https://github.com/JuliaLang/julia/pull/32570 [TAPIR]: http://cilk.mit.edu/tapir/ [here]: https://github.com/JuliaLang/julia/pull/31086 +[TBB]: https://software.intel.com/en-us/node/506304 From 78d0840826ed663769d2c19a84c2ad769af8cdf9 Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 15:48:37 -0400 Subject: [PATCH 14/21] Update blog/_posts/2019-07-14-multithreading.md Co-Authored-By: Kristoffer Carlsson --- blog/_posts/2019-07-14-multithreading.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index d806d66b43..02a7b86856 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -62,7 +62,7 @@ without worrying about how the libraries it calls are implemented such as BLAS). This is, in fact, the main reason we are excited about this announcement: from this point on, -multi-core parallelism is unleashed over the entire Julia package ecosystem. +the possibility of adding multi-core parallelism is unleashed over the entire Julia package ecosystem. 
## Some history From 8862f0e34f2765969b7f9f3a8f5fce3c1b5c15e8 Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 17:19:27 -0400 Subject: [PATCH 15/21] Update blog/_posts/2019-07-14-multithreading.md Co-Authored-By: Tim Holy --- blog/_posts/2019-07-14-multithreading.md | 1 - 1 file changed, 1 deletion(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index 02a7b86856..2e6a9254ac 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -75,7 +75,6 @@ wasn't *parallel* (simultaneous streams of execution) yet. We knew we needed thread-based parallelism though, so in 2014 (roughly the version 0.3 timeframe) we set about the long process of making all of our code thread-safe. Yichao Yu put in some particularly impressive work on the garbage collector and thread-local-storage performance. -handling. Kiran Pamnany (of Intel) designed some basic infrastructure for scheduling multiple threads and managing atomic datastructures. running multiple threads. From 12501769c43b8c3d04da24485329c9712b05398b Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Sat, 20 Jul 2019 17:21:24 -0400 Subject: [PATCH 16/21] Apply suggestions from code review Co-Authored-By: Tim Holy Co-Authored-By: Jameson Nash --- blog/_posts/2019-07-14-multithreading.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index 2e6a9254ac..b0e1a5463f 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -76,14 +76,12 @@ We knew we needed thread-based parallelism though, so in 2014 (roughly the versi set about the long process of making all of our code thread-safe. Yichao Yu put in some particularly impressive work on the garbage collector and thread-local-storage performance. Kiran Pamnany (of Intel) designed some basic infrastructure for scheduling multiple threads and managing atomic datastructures. -running multiple threads. In version 0.5 about two years later, we released the `@threads for` macro with "experimental" status which could handle simple parallel loops running on all cores. -which provides simple parallel loops. Even though that wasn't the final design we wanted, it did two important jobs: it let Julia programmers start taking advantage of multiple cores, and provided test cases to shake out thread-related bugs in our runtime. -`@threads` had some huge limitations, however. +That initial `@threads` had some huge limitations, however: `@threads` loops could not be nested: if the functions they called used `@threads` recursively, those inner loops would only occupy the CPU that called them. 
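To picture that limitation, consider the shape below (our sketch, with a hypothetical `scale_block!` helper; it is not code from the post). Under the old scheduler the inner `@threads` loop gained no parallelism beyond the single CPU running its outer iteration; making exactly this kind of nesting compose is one of the goals of the new runtime.

```
using Base.Threads

# a library-style helper with its own threaded loop
function scale_block!(block)
    @threads for j = 1:length(block)
        block[j] *= 2.0
    end
end

data = [rand(1000) for _ in 1:8]
@threads for i = 1:length(data)
    scale_block!(data[i])   # an inner @threads loop nested in an outer one
end
```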
It was also incompatible with our `Task` and I/O system: you couldn't do any I/O or
switch among `Task`s inside a threaded loop.
@@ -228,13 +226,13 @@ Let's time the code under `JULIA_NUM_THREADS=2`:

```
julia> a = rand(20000000);

-julia> b = copy(a); @time sort!(b, alg = MergeSort);
+julia> b = copy(a); @time sort!(b, alg = MergeSort);   # single-threaded
  2.589243 seconds (11 allocations: 76.294 MiB, 0.17% gc time)

julia> b = copy(a); @time sort!(b, alg = MergeSort);
  2.582697 seconds (11 allocations: 76.294 MiB, 2.25% gc time)

-julia> b = copy(a); @time psort!(b);
+julia> b = copy(a); @time psort!(b);    # two threads
  1.770902 seconds (3.78 k allocations: 686.935 MiB, 4.25% gc time)

julia> b = copy(a); @time psort!(b);
@@ -244,7 +242,7 @@ julia> b = copy(a); @time psort!(b);

While the run times are a bit variable, we see a definite speedup from using two
threads.
The laptop we ran this on has four hyperthreads, and it is especially amazing
-that the performance of this code continues to scale if we add a third thread:
+that the performance of this code continues to improve if we add a third thread:

From e8e4724560f3fd557f2d7b2acacabe201badee64 Mon Sep 17 00:00:00 2001
From: "Viral B. Shah"
Date: Mon, 22 Jul 2019 11:22:03 -0400
Subject: [PATCH 17/21] Update blog/_posts/2019-07-14-multithreading.md

Co-Authored-By: Kristoffer Carlsson
---
 blog/_posts/2019-07-14-multithreading.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md
index b0e1a5463f..93d545dec0 100644
--- a/blog/_posts/2019-07-14-multithreading.md
+++ b/blog/_posts/2019-07-14-multithreading.md
@@ -62,7 +62,7 @@ without worrying about how the libraries it calls are implemented
such as BLAS).

This is, in fact, the main reason we are excited about this announcement: from this point on,
-the possibility of adding multi-core parallelism is unleashed over the entire Julia package ecosystem.
+the capability of adding multi-core parallelism is unleashed over the entire Julia package ecosystem.

From e8554e576b3a5de92e0c5e2a96601976c332dece Mon Sep 17 00:00:00 2001
From: Jeff Bezanson
Date: Mon, 22 Jul 2019 11:31:12 -0400
Subject: [PATCH 18/21] more edits

---
 blog/_posts/2019-07-14-multithreading.md | 101 ++++++++++++-----------
 1 file changed, 53 insertions(+), 48 deletions(-)

diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md
index b0e1a5463f..0e812f2d86 100644
--- a/blog/_posts/2019-07-14-multithreading.md
+++ b/blog/_posts/2019-07-14-multithreading.md
@@ -75,9 +75,11 @@ wasn't *parallel* (simultaneous streams of execution) yet.
We knew we needed thread-based parallelism though, so in 2014 (roughly the version 0.3 timeframe) we
set about the long process of making all of our code thread-safe.
Yichao Yu put in some particularly impressive work on the garbage collector and thread-local-storage performance.
-Kiran Pamnany (of Intel) designed some basic infrastructure for scheduling multiple threads and managing atomic datastructures.
+Kiran Pamnany (of Intel) designed some basic infrastructure for scheduling multiple threads and managing
+atomic datastructures.

-In version 0.5 about two years later, we released the `@threads for` macro with "experimental" status which could handle simple parallel loops running on all cores.
+In version 0.5 about two years later, we released the `@threads for` macro with "experimental" status
+which could handle simple parallel loops running on all cores.
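As a reminder of what that covered, here is a small sketch of the sort of loop `@threads for` was, and still is, used for (`threaded_sum` is our illustration, not a function from the post): each thread accumulates into its own slot of a pre-sized array.

```
using Base.Threads

function threaded_sum(xs)
    partials = zeros(eltype(xs), nthreads())
    @threads for i = 1:length(xs)
        # iterations are divided statically among threads, so each
        # thread only ever writes to its own slot of `partials`
        partials[threadid()] += xs[i]
    end
    return sum(partials)
end

threaded_sum(rand(1_000_000))
```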
Even though that wasn't the final design we wanted, it did two important jobs:
it let Julia programmers start taking advantage of multiple cores, and provided
test cases to shake out thread-related bugs in our runtime.
@@ -298,20 +300,25 @@ on the condition itself:

```
lock(cond::Threads.Condition)
-while !ready
-    wait(cond)
+try
+    while !ready
+        wait(cond)
+    end
+finally
+    unlock(cond)
end
-unlock(cond)
```

As in previous versions, the standard lock to use to protect critical sections
is `ReentrantLock`, which is now thread-safe (it was previously only used for
synchronizing tasks).
-`Threads.SpinLock` is also available, to be used in rare circumstances where
-(1) only threads and not tasks need to be synchronized, and (2) you expect to
-hold the lock for a short time.
-`Semaphore` and `Event` are also available, completing the standard set of
-synchronization primitives.
+There are some other types of locks (`Threads.SpinLock` and `Threads.Mutex`) defined
+mostly for internal purposes.
+These are used in rare circumstances where (1) only threads and not tasks will be
+synchronized, and (2) you know the lock will only be held for a short time.
+
+The `Threads` module also provides `Semaphore` and `Event` types, which have their
+standard definitions.

### Thread-local state

Julia code naturally tends to be purely functional (no side effects or mutation),
or only uses local mutation, so migrating to full thread-safety will hopefully
@@ -328,16 +335,14 @@ that cannot be duplicated for each thread.
But for high-performance code we recommend thread-local state.
Our `psort!` routine above can be improved in this way.
Here is a recipe.

-First, we modify the function to accept pre-allocated buffers, using a default
+First, we modify the function signature to accept pre-allocated buffers, using a default
argument value to allocate space automatically when the caller doesn't provide it:

```
-function psort!(v, lo::Int=1, hi::Int=length(v), temps = [similar(v,cld(length(v),2)) for i = 1:Threads.nthreads()])
+function psort!(v, lo::Int=1, hi::Int=length(v), temps=[similar(v, 0) for i = 1:Threads.nthreads()])
```

-The maximum size of temporary array our mergesort needs is half the array, using
-ceiling division (`cld`) to handle odd lengths.
-We simply need to allocate one array per thread.
+We simply need to allocate one initially-empty array per thread.

Next, we modify the recursive calls to reuse the space:

```
@@ -345,10 +350,12 @@ Next, we modify the recursive calls to reuse the space:
    psort!(v, mid+1, hi, temps)
```

-Finally, use the array reserved for the current thread, instead of allocating a new one:
+Finally, use the array reserved for the current thread, instead of allocating a new one,
+and resize it as needed:

```
    temp = temps[Threads.threadid()]
+    length(temp) < m-lo+1 && resize!(temp, m-lo+1)
    copyto!(temp, 1, v, lo, m-lo+1)
```

@@ -367,27 +374,20 @@ $ for n in 1 2 4 8 16; do JULIA_NUM_THREADS=$n ./julia psort.jl; done

Definitely faster, but we do seem to have some work to do on the
scalability of the runtime system.

-### Seeding the default random number generator
+### Random number generation

-Julia's default global random number generator (`rand()`) is a particularly
-challenging case for thread-safety.
-We have split it into separate random streams for each thread, allowing
-code with `rand()` to be freely parallelized and get independent random
-numbers on each thread.
-
-However, seeding (`Random.seed!(n)`) is trickier.
-Seeding all of the per-thread streams would require some kind of synchronization
-among threads, which would unacceptably slow down random number generation.
-It also makes a very limited amount of sense: threading introduces its own
-nondeterminism, so you cannot get much predictability by seeding one thread's
-state from another.
-Therefore we decided to have the `seed!(n)` call affect only the current thread's
-state.
-That way, multiple independent code sequences that seed and then use random
-numbers can at least individually work as expected.
-For more elaborate seeding requirements, we recommend allocating and passing
-your own RNG objects (e.g. `Rand.MersenneTwister()`).
+The approach we've taken with Julia's default global random number generator (`rand()` and friends)
+is to make it thread-specific.
+On first use, each thread will create an independent instance of the default RNG type
+(currently `MersenneTwister`) seeded from system entropy.
+All operations that affect the random number state (`rand`, `srand`, `randn`, etc.) will then operate
+on only the current thread's RNG state.
+This way, multiple independent code sequences that seed and then use random numbers will individually
+work as expected.
+If you need all threads to use a known initial seed, you will need to set it up explicitly.
+For that kind of more precise control, or better performance, we recommend allocating and passing your
+own RNG objects (e.g. `Random.MersenneTwister()`).

## Under the hood

@@ -425,19 +425,22 @@ you to specify a stack size per-task.
Using it is not recommended, since it is hard to predict how much stack space
will be needed, for instance by the compiler or called libraries.

-A thread can switch to running a given task simply (in principle) by switching
-its stack pointer to refer to the new task's stack and jumping to the next
-instruction.
+A thread can switch to running a given task just by adjusting its registers to
+appear to “return from” the previous task switch.
+We allocate a new stack out of a local pool just before we start running it.
As soon as a task is done running, we can immediately release its stack back
to the pool, avoiding excessive GC pressure.

We also have an alternate implementation of stack switching (controlled by the
`ALWAYS_COPY_STACKS` variable in `options.h`) that trades time for memory by
copying live stack data when a task switch occurs.
+This may not be compatible with foreign code that uses `cfunction`,
+so it is not the default.
+
We fall back to this implementation if stacks are consuming too much address
-space (some platforms impose a limit, which we exceeded in early testing).
+space (some platforms—notably Linux and 32-bit machines—impose a fairly low limit).
And of course, each implementation has code for multiple platforms and
-architectures, often requiring assembly language.
+architectures, sometimes optimized further with inline assembly.

Stack switching is a rich topic that could very well fill a blog post on its own.

### I/O

We use libuv for cross-platform event-based I/O.
It is designed to be able to function within a multithreaded program, but is not
explicitly a multithreaded I/O library and so doesn't support concurrent use from multiple
threads out of the box.
-We decided to protect access to libuv structures with a lock, and then allow any thread
+For now, we protect access to libuv structures with a single global lock, and then allow any thread
(one at a time) to run the event loop.
When another thread needs the event loop thread to wake up, it issues an async signal.
This can happen for multiple reasons, including another thread scheduling new work, -or another thread needing to run garbage collection. +another thread starting to run garbage collection, or another thread that wants to take +the IO lock to do IO. -### Task migration +### Task migration across system threads In general, a task might start running on one thread, block for a while, and then restart on another. @@ -482,9 +486,8 @@ recently-blocked task. That works fine, but it means a task can exist in a strange intermediate state where it is considered not to be running, and yet is in fact running the scheduler. -In particular, we need to make sure no other thread sees that task and thinks -"oh, there's a task I can run", causing it to scribble on the scheduler's -stack. +In particular, it means we might pull a task out of the scheduler queue just +to realize that we don't need to switch away at all. ### Classic bugs @@ -518,14 +521,14 @@ our threading capabilities: * Allow adding more threads at run time. * Improved debugging tools. * Explore API extensions, e.g. cancel points. -* Thread-safe data structures. +* Standard library of thread-safe data structures. * Provide alternate schedulers. * Explore integration with the [TAPIR][] parallel IR (some early work [here][]). ## Acknowledgements -We would like to gratefully acknowledge funding support from Intel and relationalAI +We would like to gratefully acknowledge funding support from [Intel][] and [relationalAI][] that made it possible to develop these new capabilities. We are also grateful to the several people who patiently tried this functionality @@ -544,3 +547,5 @@ to keep going! [TAPIR]: http://cilk.mit.edu/tapir/ [here]: https://github.com/JuliaLang/julia/pull/31086 [TBB]: https://software.intel.com/en-us/node/506304 +[Intel]: https://www.intel.com/ +[relationalAI]: http://relational.ai/ From 315b149665215c71b6446c90477265bd01bda1bb Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Mon, 22 Jul 2019 17:39:23 -0400 Subject: [PATCH 19/21] more edits --- blog/_posts/2019-07-14-multithreading.md | 48 ++++++++++++++---------- 1 file changed, 28 insertions(+), 20 deletions(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index 56835f010d..5cce847ba8 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -1,7 +1,7 @@ --- layout: post title: Announcing composable multi-threaded parallelism in Julia -author: Jeff Bezanson, Jameson Nash +author: Jeff Bezanson (Julia Computing), Jameson Nash (Julia Computing), Kiran Pamnany (Intel) --- Software performance depends more and more on exploiting multiple processor cores. @@ -24,7 +24,7 @@ A dynamic scheduler handles all the decisions and details for you. Here's an example of parallel code you can now write in Julia: ``` -using Base.Threads +import Base.Threads.@spawn function fib(n::Int) if n < 2 @@ -75,7 +75,7 @@ wasn't *parallel* (simultaneous streams of execution) yet. We knew we needed thread-based parallelism though, so in 2014 (roughly the version 0.3 timeframe) we set about the long process of making all of our code thread-safe. Yichao Yu put in some particularly impressive work on the garbage collector and thread-local-storage performance. -Kiran Pamnany (of Intel) designed some basic infrastructure for scheduling multiple threads and managing +Kiran designed some basic infrastructure for scheduling multiple threads and managing atomic datastructures. 
In version 0.5 about two years later, we released the `@threads for` macro with "experimental" status @@ -91,7 +91,7 @@ switch among `Task`s inside a threaded loop. So the next logical step was to merge the `Task` and threading systems, and "simply" (cue laughter) allow `Task`s to run simultaneously on a pool of threads. -We had many early discussions with Arch Robison (then also of Intel) and concluded +We had many early discussions with Arch Robison (then of Intel) and concluded that this was the best model for our language. After version 0.5 (around 2016), Kiran started experimenting with a new parallel task scheduler [partr][] based on the idea of depth-first scheduling. @@ -163,7 +163,7 @@ for parallelism. Here is that code: ``` -using Base.Threads +import Base.Threads.@spawn # sort the elements of `v` in place, from indices `lo` to `hi` inclusive function psort!(v, lo::Int=1, hi::Int=length(v)) @@ -256,17 +256,22 @@ can make your head hurt a little! In our view, this helps underscore how "automatic" this interface makes parallelism feel. -Let's try a different machine with more CPU cores: +Let's try a different machine with slightly lower single thread performance, +but more CPU cores: ``` $ for n in 1 2 4 8 16; do JULIA_NUM_THREADS=$n ./julia psort.jl; done - 2.958881 seconds (3.58 k allocations: 686.932 MiB, 4.71% gc time) - 1.868720 seconds (3.77 k allocations: 686.935 MiB, 7.03% gc time) - 1.222777 seconds (3.78 k allocations: 686.935 MiB, 9.14% gc time) - 0.958517 seconds (3.79 k allocations: 686.935 MiB, 18.21% gc time) - 0.836891 seconds (3.78 k allocations: 686.935 MiB, 21.10% gc time) + 2.949212 seconds (3.58 k allocations: 686.932 MiB, 4.70% gc time) + 1.861985 seconds (3.77 k allocations: 686.935 MiB, 9.32% gc time) + 1.112285 seconds (3.78 k allocations: 686.935 MiB, 4.45% gc time) + 0.787816 seconds (3.80 k allocations: 686.935 MiB, 2.08% gc time) + 0.655762 seconds (3.79 k allocations: 686.935 MiB, 4.62% gc time) ``` +The `psort.jl` script simply defines the `psort!` function, calls it once +to avoid measuring compilation overhead, and then runs the same commands +we used above. + Notice that this speedup occurs despite the parallel code allocating *drastically* more memory than the standard routine. The allocations come from two sources: `Task` objects, and the `temp` @@ -364,16 +369,13 @@ machine: ``` $ for n in 1 2 4 8 16; do JULIA_NUM_THREADS=$n ./julia psort.jl; done - 2.723312 seconds (3.07 k allocations: 152.852 MiB, 0.14% gc time) - 1.711112 seconds (3.28 k allocations: 229.149 MiB, 0.59% gc time) - 0.971327 seconds (3.28 k allocations: 381.737 MiB, 1.60% gc time) - 0.782790 seconds (3.28 k allocations: 686.913 MiB, 8.63% gc time) - 0.722063 seconds (3.33 k allocations: 1.267 GiB, 21.43% gc time) + 2.813555 seconds (3.08 k allocations: 153.448 MiB, 1.44% gc time) + 1.731088 seconds (3.28 k allocations: 192.195 MiB, 0.37% gc time) + 1.028344 seconds (3.30 k allocations: 221.997 MiB, 0.37% gc time) + 0.750888 seconds (3.31 k allocations: 267.298 MiB, 0.54% gc time) + 0.620054 seconds (3.38 k allocations: 298.295 MiB, 0.77% gc time) ``` -Definitely faster, but we do seem to have some work to do on the -scalability of the runtime system. - ### Random number generation The approach we've taken with Julia's default global random number generator (`rand()` and friends) @@ -508,6 +510,8 @@ since of course we are usually executing Julia code. ## Looking forward While we are excited about this milestone, a lot of work remains. 
+This alpha release introduces the `@spawn` construct, but is not intended to finalize +its design. Here are some of the points we hope to focus on to further develop our threading capabilities: @@ -533,7 +537,9 @@ that made it possible to develop these new capabilities. We are also grateful to the several people who patiently tried this functionality while it was in development and filed bug reports or pull requests, and spurred us -to keep going! +to keep going. +If you encounter any issues using threads in Julia, please let us know on [GitHub][] or +our [Discourse][] forum! [free lunch]: http://www.gotw.ca/publications/concurrency-ddj.htm @@ -549,3 +555,5 @@ to keep going! [TBB]: https://software.intel.com/en-us/node/506304 [Intel]: https://www.intel.com/ [relationalAI]: http://relational.ai/ +[GitHub]: https://github.com/JuliaLang/julia/issues +[Discourse]: https://discourse.julialang.org/ From c7642e14fb8373ec2b69706bf468a5d708cca27f Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Tue, 23 Jul 2019 09:22:37 -0400 Subject: [PATCH 20/21] more edits --- blog/_posts/2019-07-14-multithreading.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-14-multithreading.md index 5cce847ba8..6468c50e65 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-14-multithreading.md @@ -8,15 +8,15 @@ Software performance depends more and more on exploiting multiple processor core The [free lunch][] from Moore's Law is still over. Well, we here in the Julia developer community have something of a reputation for caring about performance. -We have already built a lot of functionality for multi-process, distributed +In pursuit of it, we have already built a lot of functionality for multi-process, distributed programming and GPUs, but we've known for years that we would also need a good -story for fast, composable multi-threading. +story for composable multi-threading. Today we are happy to announce a major new chapter in that story. We are releasing a preview of an entirely new threading interface for Julia programs: general task parallelism, inspired by parallel programming systems like [Cilk][], [Intel Threading Building Blocks][] (TBB) and [Go][]. -Task parallelism is now available on the master branch, and a beta version will be -included as part of the upcoming Julia version 1.3. +Task parallelism is now available in the v1.3.0-alpha release, an early preview +of Julia version 1.3.0 likely to be released in a couple months. In this paradigm, any piece of a program can be marked for execution in parallel, and a "task" will be started to run that code automatically on an available thread. @@ -73,9 +73,9 @@ type providing symmetric coroutines, which we've used for event-based I/O. So we have always had a unit of *concurrency* (independent streams of execution) in the language, it just wasn't *parallel* (simultaneous streams of execution) yet. We knew we needed thread-based parallelism though, so in 2014 (roughly the version 0.3 timeframe) we -set about the long process of making all of our code thread-safe. +set about the long process of making our code thread-safe. Yichao Yu put in some particularly impressive work on the garbage collector and thread-local-storage performance. 
-Kiran designed some basic infrastructure for scheduling multiple threads and managing +One of the authors (Kiran) designed some basic infrastructure for scheduling multiple threads and managing atomic datastructures. In version 0.5 about two years later, we released the `@threads for` macro with "experimental" status @@ -244,7 +244,7 @@ julia> b = copy(a); @time psort!(b); While the run times are bit variable, we see a definite speedup from using two threads. The laptop we ran this on has four hyperthreads, and it is especially amazing -that the performance of this code continues to improve if we add a third thread: +that performance continues to improve if we add a third thread: ``` julia> b = copy(a); @time psort!(b); @@ -328,7 +328,7 @@ standard definitions. ### Thread-local state Julia code naturally tends to be purely functional (no side effects or mutation), -or only uses local mutation, so migrating to full thread-safety will hopefully +or to use only local mutation, so migrating to full thread-safety will hopefully be easy in many cases. But if your code uses shared state and you'd like to make it thread-safe, there is some work to do. @@ -382,7 +382,7 @@ The approach we've taken with Julia's default global random number generator (`r is to make it thread-specific. On first use, each thread will create an independent instance of the default RNG type (currently `MersenneTwister`) seeded from system entropy. -All operations that affect the random number state (`rand`, `srand`, `randn`, etc.) will then operate +All operations that affect the random number state (`rand`, `randn`, `seed!`, etc.) will then operate on only the current thread's RNG state. This way, multiple independent code sequences that seed and then use random numbers will individually work as expected. @@ -409,7 +409,7 @@ There are many possible approaches to this, with different tradeoffs. As we often do, we tried to pick a method that would maximize throughput and reliability. We have a shared pool of stacks allocated by `mmap` (`VirtualAlloc` on -windows), defaulting to 4MiB each (2MiB on 32-bit systems). +Windows), defaulting to 4MiB each (2MiB on 32-bit systems). This can use quite a bit of virtual memory, so don't be alarmed if `top` shows your shiny new multi-threaded Julia code using 100GiB of address space. @@ -428,7 +428,7 @@ Using it is not recommended, since it is hard to predict how much stack space will be needed, for instance by the compiler or called libraries. A thread can switch to running a given task just by adjusting its registers to -appear to “return from” the previous task switch. +appear to "return from" the previous switch away from that task. We allocate a new stack out of a local pool just before we start running it. As soon as a task is done running, we can immediately release its stack back to the pool, avoiding excessive GC pressure. @@ -488,7 +488,7 @@ recently-blocked task. That works fine, but it means a task can exist in a strange intermediate state where it is considered not to be running, and yet is in fact running the scheduler. -In particular, it means we might pull a task out of the scheduler queue just +One implication of that is that we might pull a task out of the scheduler queue just to realize that we don't need to switch away at all. 
### Classic bugs From 86b71c90f3606c7423f8161f4907832db03be74e Mon Sep 17 00:00:00 2001 From: Jeff Bezanson Date: Tue, 23 Jul 2019 11:25:01 -0400 Subject: [PATCH 21/21] add downloads link --- ...9-07-14-multithreading.md => 2019-07-23-multithreading.md} | 4 ++++ 1 file changed, 4 insertions(+) rename blog/_posts/{2019-07-14-multithreading.md => 2019-07-23-multithreading.md} (99%) diff --git a/blog/_posts/2019-07-14-multithreading.md b/blog/_posts/2019-07-23-multithreading.md similarity index 99% rename from blog/_posts/2019-07-14-multithreading.md rename to blog/_posts/2019-07-23-multithreading.md index 6468c50e65..cf2fbd1de2 100644 --- a/blog/_posts/2019-07-14-multithreading.md +++ b/blog/_posts/2019-07-23-multithreading.md @@ -17,6 +17,8 @@ general task parallelism, inspired by parallel programming systems like [Cilk][], [Intel Threading Building Blocks][] (TBB) and [Go][]. Task parallelism is now available in the v1.3.0-alpha release, an early preview of Julia version 1.3.0 likely to be released in a couple months. +You can find binaries with this feature on the [downloads page][], or build +the [master branch][] from source. In this paradigm, any piece of a program can be marked for execution in parallel, and a "task" will be started to run that code automatically on an available thread. @@ -557,3 +559,5 @@ our [Discourse][] forum! [relationalAI]: http://relational.ai/ [GitHub]: https://github.com/JuliaLang/julia/issues [Discourse]: https://discourse.julialang.org/ +[downloads page]: https://julialang.org/downloads/ +[master branch]: https://github.com/JuliaLang/julia