
Streams
=======

Processes use streams for all of their I/O (input/output) operations.
Streams are an abstraction created by the operating system. A **stream**
represents a sequence of bytes. The bytes can represent any kind of data,
for example, text, images, video, audio. Processes use streams to move
data into and out of hardware I/O devices (like the keyboard), files,
or even other processes.

Streams are one-directional. A process can either read from or write to
a given stream, but not both. If a process can read from a stream we say
it is an "input stream". If a process can write to a stream, we say it is
an "output stream".

In introductory programming courses, streams are mostly associated with
files. A program reads or writes a stream of data from a file on a storage
device. But we will see that streams are much more versatile. We will show
how programs can read or write streams of data from other programs. In other
words, we will see that streams can be used to implement Inter-process
Communication.

* <https://en.wikipedia.org/wiki/Stream_(computing)>
* <https://en.wikipedia.org/wiki/Standard_streams>

The operating system creates the file and stream abstractions and then makes
them available to programming languages for writing programs. Here are references
to a few operating systems textbooks that explain how operating systems create
these abstractions.

<ul>
<li><a href="https://pages.cs.wisc.edu/~remzi/OSTEP/file-devices.pdf">I/O Devices</a> from <a href="https://pages.cs.wisc.edu/~remzi/OSTEP/#book-chapters">Operating Systems: Three Easy Pieces</a></li>
<li><a href="https://pages.cs.wisc.edu/~remzi/OSTEP/file-intro.pdf">Files and Directories</a> from <a href="https://pages.cs.wisc.edu/~remzi/OSTEP/#book-chapters">Operating Systems: Three Easy Pieces</a></li>
<li><a href="https://www.greenteapress.com/thinkos/thinkos.pdf#page=35">Chapter 4, Files and file systems</a> from <a href="https://www.greenteapress.com/thinkos/">Think OS</a></li>
<li><a href="https://ia902302.us.archive.org/27/items/osm-rev1.2/osm-rev1.2.pdf#page=360">POSIX File API</a> from <a href="https://open.umn.edu/opentextbooks/textbooks/operating-systems-and-middleware-supporting-controlled-interaction">Operating Systems and Middleware</a></li>
</ul>


In this document we will emphasize how streams are used to build up complex
command-lines. In another document we will look at how we can use the Java
programming language to write code that uses streams.



## Standard I/O Streams

When a process is created by the operating system, the process is always
supplied with three open streams. These three streams are called the
"standard streams". They are

* standard input  (stdin)
* standard output (stdout)
* standard error  (stderr)

We can visualize a process as an object with three "connections" where
data (bytes) can either flow into the process or flow out from the process.

```text
                       process
                 +-----------------+
                 |                 |
        >------->> stdin    stdout >>------->
                 |                 |
                 |          stderr >>------->
                 |                 |
                 +-----------------+
```

A console application will usually have its stdin stream connected to the
computer's keyboard and its stdout and stderr streams connected to the
console window.

```text
                       process
                 +-----------------+
                 |                 |
   keyboard >--->> stdin    stdout >>-----+---> console window
                 |                 |      |
                 |          stderr >>-----+
                 |                 |
                 +-----------------+
```

It is important to realize that the above picture is independent of the
programming language used to write the program which is running in the
process. Every process looks like this. It is up to each programming
language to allow programs, written in that language, to make use of
this setup provided by the operating system.

* <https://en.wikipedia.org/wiki/Standard_streams>

Every **operating system** has its own way of giving each process access
to the internal data structures the operating system uses to keep track
of what each standard stream is "connected" to.

The Linux operating system gives every process three **file descriptors**,

```text
    #define  STDIN_FILENO   0
    #define  STDOUT_FILENO  1
    #define  STDERR_FILENO  2
```

Linux provides the `read()` and `write()` system calls to let a process
read from and write to these file descriptors.

The Windows operating system gives every process three **handles**. We
retrieve the handles using the `GetStdHandle()` function with one of
these input parameters.

```text
     STD_INPUT_HANDLE, STD_OUTPUT_HANDLE, STD_ERROR_HANDLE
```

Windows provides the `ReadFile()` and `WriteFile()` API functions to let
a process read from and write to these handles.


Every **programming language** must have a way of representing the three
standard streams and every language must provide a way to read from the
standard input stream and a way to write to the standard output and
standard error streams.

For example, here is how the three standard I/O streams are represented
by some common programming languages.

```text
    Java uses Stream objects.
      java.io.InputStream  System.in
      java.io.PrintStream  System.out
      java.io.PrintStream  System.err
    These are static fields in the java.lang.System class.

    Standard C uses pointers to FILE objects.
      FILE* stdin;
      FILE* stdout;
      FILE* stderr;
    These are defined in the stdio.h header file.

    Python uses text File objects.
      sys.stdin
      sys.stdout
      sys.stderr
    These are in the sys module.

    C++ uses stream objects.
      istream std::cin;
      ostream std::cout;
      ostream std::cerr;
    These are defined in the <iostream> header.

    .NET uses TextReader and TextWriter objects.
      System.IO.TextReader  Console.In
      System.IO.TextWriter  Console.Out
      System.IO.TextWriter  Console.Error
    These are static properties of the System.Console class.
```
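To make the Python entry in this list concrete, here is a minimal sketch of a program that reads lines from standard input, writes its results to standard output, and reports a summary on standard error. The function name `shout` and its parameters are hypothetical, chosen only for this illustration. The parameters default to the three `sys` stream objects, which lets the same code be tried with in-memory streams.

```python
import io
import sys


def shout(instream=None, outstream=None, errstream=None):
    """Copy lines from instream to outstream in upper case.

    A one-line summary goes to errstream so that it never mixes
    with the real output. Returns the number of lines copied.
    """
    # Look the streams up at call time so that any redirection set up
    # by the parent process (or by a test) is respected.
    instream = sys.stdin if instream is None else instream
    outstream = sys.stdout if outstream is None else outstream
    errstream = sys.stderr if errstream is None else errstream

    count = 0
    for line in instream:
        outstream.write(line.upper())
        count += 1
    errstream.write(f"{count} line(s) processed\n")
    return count


if __name__ == "__main__":
    # In-memory streams stand in for the keyboard and the console here.
    out, err = io.StringIO(), io.StringIO()
    shout(io.StringIO("hello\nworld\n"), out, err)
    print(out.getvalue(), end="")
```

Calling `shout()` with no arguments uses the real standard streams, whatever they happen to be connected to.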

Most programming languages define their basic I/O functions to automatically
work with the standard input and output streams. For example, in almost every
programming language, the basic `print` function writes to the standard output
stream. The `print` function itself is ultimately implemented using the
`write()` system call in Linux or the `WriteFile()` function in Windows.

The C language provides functions like `getchar()`, `scanf()`, and `fscanf()`
to read from `stdin` and it provides `printf()` and `fprintf()` to write to
`stdout` and `stderr`. On a Windows computer, the C language's `printf()`
function will be implemented using Windows's `WriteFile()` function with
the `STD_OUTPUT_HANDLE` handle. On a Linux computer, the C language's
`printf()` function will be implemented using Linux's `write()` system call
with the `STDOUT_FILENO` file descriptor.
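The layering described above can be sketched in Python, whose `os.write()` function is a thin wrapper around the operating system's low-level write operation. The helper `write_line` and the two constants are hypothetical names for this illustration; the constant values are the standard descriptor numbers 1 and 2.

```python
import os

STDOUT_FILENO = 1  # standard output's file descriptor number
STDERR_FILENO = 2  # standard error's file descriptor number


def write_line(fd, text):
    """Write text plus a newline directly to a file descriptor.

    Returns the number of bytes written. This bypasses print() and
    the buffering done by sys.stdout.
    """
    data = (text + "\n").encode("utf-8")
    written = 0
    # os.write() may write fewer bytes than requested, so loop.
    while written < len(data):
        written += os.write(fd, data[written:])
    return written


if __name__ == "__main__":
    write_line(STDOUT_FILENO, "normal output")
    write_line(STDERR_FILENO, "error output")
```

A `print` function is, in essence, a more convenient version of `write_line` that formats its arguments and buffers the bytes before handing them to the operating system.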


* <https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/System.html#field-summary>
* <https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/io/FileDescriptor.html>
* <https://man7.org/linux/man-pages/man3/stdio.3.html>
* <https://en.cppreference.com/w/c/header/stdio.html>
* <https://cplusplus.com/reference/cstdio/>
* <https://docs.python.org/3/library/sys.html#sys.stdin>
* <https://en.cppreference.com/w/cpp/header/iostream.html>
* <https://cplusplus.com/reference/iostream/>
* <https://learn.microsoft.com/en-us/dotnet/api/system.console>


Every operating system provides a way for processes to **open** new
streams. For example, in the following picture, a process, while it
was running, opened three new streams, two input streams and one output
stream. All three streams are connected to files.

```text
                        process
                +-----------------------+
                |                       |
  keyboard >--->> stdin          stdout >>-----+---> console window
                |                       |      |
                |                stderr >>-----+
                |                       |
                | in1     in2     out   |
                +-/|\-----/|\-----\ /---+
                   |       |       |
    input1.txt >---+       |       +----------> output.txt
                           |
            input2.txt >---+
```

This process can now read data from any of its three input streams and
it can write data to any of its three output streams. For example, it
might copy data from the two input files into the output file.

After the process has read all the data it needs from the file
`input1.txt`, the process can **close** the stream.

```text
                        process
                +-----------------------+
                |                       |
  keyboard >--->> stdin          stdout >>-----+---> console window
                |                       |      |
                |                stderr >>-----+
                |                       |
                |         in2     out   |
                +---------/|\-----\ /---+
                           |       |
                           |       +----------> output.txt
                           |
            input2.txt >---+
```

As long as a process is running, it can continue to open and close input
and output streams. Opening and closing streams to files is what most
introductory programming textbooks cover in their chapters on file I/O.
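As a small illustration of opening and closing file streams, here is a sketch that copies two input files into one output file. The function name `concatenate` and the file names are hypothetical. Unlike the picture above, this sketch opens the two input streams one after the other rather than holding both open at once.

```python
import os
import tempfile


def concatenate(in1_path, in2_path, out_path):
    """Copy in1 then in2 into out; return the number of bytes written."""
    total = 0
    # Each open() call creates a new stream; leaving a `with` block
    # closes that stream.
    with open(out_path, "wb") as out:
        for path in (in1_path, in2_path):
            with open(path, "rb") as src:
                data = src.read()
                out.write(data)
                total += len(data)
            # src is closed here, before the next input file is opened
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        in1 = os.path.join(folder, "input1.txt")
        in2 = os.path.join(folder, "input2.txt")
        out = os.path.join(folder, "output.txt")
        for path, text in ((in1, b"first\n"), (in2, b"second\n")):
            with open(path, "wb") as f:
                f.write(text)
        print(concatenate(in1, in2, out), "bytes copied")
```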


## I/O Redirection

*Every* process is created by the operating system at the request of some
other process, the parent process. When the parent process asks the operating
system to create a child process, the parent must tell the operating system
how to "connect" the child's three standard streams. The parent telling the
operating system how to connect the child's three standard streams is usually
referred to as **I/O redirection**.

At a shell command prompt, if we type a command like this,

```text
    > foo > result.txt
```

then the shell program (`cmd.exe` on Windows, or `bash` on Linux) is the parent
process. The above command tells the shell process to ask the operating system
to create a child process from the `foo` program. But in addition to asking the
operating system to create the child process, the shell process also instructs
the operating system to redirect the child process's standard output to the
file `result.txt`. So when the `foo` process runs, it looks like this.

```text
                       foo
                +-----------------+
                |                 |
   keyboard >-->> stdin    stdout >>----> result.txt
                |                 |
                |          stderr >>----> console window
                |                 |
                +-----------------+
```

Stdin and stderr have their default connections, and stdout is redirected
to the file `result.txt`.

If we type a command like this,

```text
    > foo > result.txt < data.txt
```

then the shell process asks the operating system to create a child process
from the `foo` program and it also asks the operating system to redirect the
child process's standard output to the file `result.txt` and redirect the
child process's standard input to the file `data.txt`. So when the `foo`
process runs, it looks like this.

```text
                        foo
                +-----------------+
                |                 |
   data.txt >-->> stdin    stdout >>----> result.txt
                |                 |
                |          stderr >>----> console window
                |                 |
                +-----------------+
```
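A parent process written in Python can request the same redirections that the shell does. The sketch below uses the standard `subprocess` module. The child program (a one-line Python script that upper-cases its input) and the function name `run_redirected` are stand-ins invented for this example, since `foo` is not a real program.

```python
import subprocess
import sys

# A stand-in for the `foo` program: copy stdin to stdout in upper case.
CHILD_CODE = "import sys; sys.stdout.write(sys.stdin.read().upper())"


def run_redirected(data_path, result_path):
    """Run the child with stdin from data_path and stdout to result_path."""
    with open(data_path, "rb") as fin, open(result_path, "wb") as fout:
        completed = subprocess.run(
            [sys.executable, "-c", CHILD_CODE],
            stdin=fin,    # plays the role of `< data.txt`
            stdout=fout,  # plays the role of `> result.txt`
        )                 # stderr is not mentioned, so it is inherited
    return completed.returncode
```

Note that the child's code only mentions `sys.stdin` and `sys.stdout`. The file names appear only in the parent.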

It is very important to realize that the `foo` process does *not* need to know
that its standard streams have been redirected. The `foo` process writes to its
standard output stream in exactly the same way whether that stream is connected
to the console window (the default connection) or to some file. If standard
output is connected to a file, then `foo` is doing file I/O without even knowing
it. When the `foo` process calls the `print` function, it is "printing" on a file
(which does not, literally, make sense). The function name "print" is a holdover
from many years ago when a computer's output was always printed on paper. The
name "print" for the default output function is misleading, since the modern
`print` function has nothing to do with printing on paper. But, as we have
mentioned many times, once a name is chosen for something in a programming
language (in this case, an I/O function), the name is never changed, no matter
how outdated it becomes.

The order in which we place redirections in the command-line does not matter.
The following two commands are equivalent.

```text
    > foo > result.txt < data.txt
    > foo < data.txt > result.txt
```

When we use the input redirection operator, if the specified input file
does not exist, then we get an error message and the command-line fails.

When we use the output redirection operator, if the specified output file
does not exist, then the operating system creates an empty file for us with
that name. However, be careful. If the specified output file does exist, then
it is emptied of all its contents, and the command-line is given the empty
file, so we lose any data that was in the specified output file.

There is a very useful alternative to the `>` output redirection operator.
The `>>` append output redirection operator will, like `>`, create the
specified output file if it does not exist, but instead of emptying an
existing output file, this operator writes new data at the end of the
previous data in the output file. One important use of this operator is
for one file to accumulate results from several command-lines.
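The same truncate-versus-append choice shows up when a program opens a file itself. As a rough analogy (not the shell's actual implementation), Python's `"w"` open mode behaves like `>` and its `"a"` mode behaves like `>>`. The function name `save_result` is hypothetical.

```python
def save_result(path, text, append=False):
    """Write text to path, truncating (like `>`) or appending (like `>>`)."""
    # Both modes create the file if it does not exist. Mode "w" empties
    # an existing file first; mode "a" keeps the old data and adds to it.
    mode = "a" if append else "w"
    with open(path, mode, encoding="utf-8") as f:
        f.write(text)
```

Calling `save_result` twice with `append=True` leaves both results in the file, while calling it with the default `append=False` discards whatever was there before.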

We can have the shell process redirect the standard error stream of
a process. The following command-line,

```text
    > bar < data.txt 2> errors.txt
```

tells the shell process to ask the operating system to create a child process
from the `bar` program, redirect the child's standard input stream to the
file `data.txt`, and redirect the child's standard error stream to the file
`errors.txt`. The child's standard output stream will be connected to the
console window. When the `bar` process runs, it looks like this.

```text
                        bar
                +-----------------+
                |                 |
   data.txt >-->> stdin    stdout >>----> console window
                |                 |
                |          stderr >>----> errors.txt
                |                 |
                +-----------------+
```

The order of the redirections in the command-line does not matter. The
following two commands are equivalent.

```text
    > bar < data.txt 2> errors.txt
    > bar 2> errors.txt < data.txt
```

In fact, the following command-lines are all equivalent.

```text
    > bar < data.txt > output.txt 2> errors.txt
    > bar > output.txt 2> errors.txt < data.txt
    > bar > output.txt < data.txt 2> errors.txt
```

What if we want to redirect both the standard output and standard error
streams to a single file? The following command-line does not work, because
it asks for the file to be opened twice, once for each stream.

```text
    > bar > allOutput.txt 2> allOutput.txt
```

The Linux `bash` shell allows us to use the `&>` redirection operator.

```text
    $ bar &> allOutput.txt
```

This creates the following picture.

```text
                        bar
                +-----------------+
                |                 |
   keyboard >-->> stdin    stdout >>----+----> allOutput.txt
                |                 |     |
                |          stderr >>----+
                |                 |
                +-----------------+
```

With the Windows `cmd` shell, we need to use this slightly more complex
command (which also works with `bash`).

```text
   > bar > allOutput.txt 2>&1
```

This command says to redirect the standard output stream to the file
`allOutput.txt` and then, in addition, redirect the standard error
stream to the same place as the standard output stream. Here the order
does matter. Redirections are processed from left to right, so the `2>&1`
operator must come after the `> allOutput.txt` redirection.
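A parent process written in Python can ask for the same arrangement. In the `subprocess` module, passing `stderr=subprocess.STDOUT` plays the role of `2>&1`. The child program below is a made-up stand-in for `bar` that writes one line to each of its output streams, and `run_merged` is a hypothetical name.

```python
import subprocess
import sys

# A stand-in for the `bar` program: one line on stdout, one on stderr.
CHILD_CODE = (
    "import sys; "
    "sys.stdout.write('to stdout\\n'); sys.stdout.flush(); "
    "sys.stderr.write('to stderr\\n'); sys.stderr.flush()"
)


def run_merged(all_output_path):
    """Run the child with stdout and stderr both going to one file."""
    with open(all_output_path, "wb") as fout:
        completed = subprocess.run(
            [sys.executable, "-c", CHILD_CODE],
            stdout=fout,               # plays the role of `> allOutput.txt`
            stderr=subprocess.STDOUT,  # plays the role of `2>&1`
        )
    return completed.returncode
```

The file is opened only once, and both of the child's output streams are connected to that single open file.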

Where do the numbers 1 and 2 in the I/O redirection operators come from?
They are from the Unix operating system's implementation of file I/O. In
Unix (and in Linux) every open file is given a non-negative integer number
called a **file descriptor**. The file descriptor numbers are used by all
the Unix (and Linux) file I/O functions. When a process is created, its
standard input, output, and error streams are given the file descriptors
0, 1, and 2, respectively.

```text
                       process
                 +-----------------+
                 |                 |
        >------->> 0             1 >>------->
                 |                 |
                 |               2 >>------->
                 |                 |
                 +-----------------+
```
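Because the standard streams are just descriptor numbers, redirecting one of them amounts to making that number refer to a different open file. The sketch below shows the idea using `os.dup()` and `os.dup2()`, which wrap the operating system calls of the same names. The helper names `redirect_fd` and `restore_fd` are hypothetical, and a real shell performs this step in the child process before the program starts, with details that differ from this sketch.

```python
import os


def redirect_fd(fd, path):
    """Make descriptor fd refer to the file at path.

    Returns a saved duplicate of the old target so it can be restored.
    """
    saved = os.dup(fd)  # remember where fd used to point
    target = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.dup2(target, fd)  # fd now refers to the file
    os.close(target)     # the extra descriptor is no longer needed
    return saved


def restore_fd(fd, saved):
    """Undo redirect_fd() by pointing fd back at its saved target."""
    os.dup2(saved, fd)
    os.close(saved)
```

For example, after `saved = redirect_fd(1, "result.txt")`, a call to `os.write(1, b"hello")` lands in the file, and `restore_fd(1, saved)` reconnects descriptor 1 to wherever it pointed before.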

The `bash` and `cmd` shells use these file descriptor numbers as part of
their I/O redirection operators. This is an example of a "leaky abstraction".
The shell program is supposed to let us manipulate processes and files
without knowing about the details of how the underlying operating system handles
processes and files. The Windows operating system does not even use file
descriptors, but it still exposes them in the syntax of the `cmd` shell (in
order to be consistent with Unix shells such as `bash`). A leaky abstraction is
when a lower level implementation detail appears in the interface of a higher
level abstraction.

* <https://en.wikipedia.org/wiki/File_descriptor>
* <https://man7.org/linux/man-pages/man2/open.2.html>
* <https://en.wikipedia.org/wiki/Leaky_abstraction>
* <https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/>


Do not confuse I/O redirection with the idea of opening a new stream to
a file. The above `foo` process, that has its stdin redirected to the file
`data.txt`, and its stdout redirected to the file `result.txt`, can still
open new streams connected to other files.

```text
                          foo
                +-----------------------+
                |                       |
  data.txt >--->> stdin          stdout >>-----> result.txt
                |                       |
                |                stderr >>------> console window
                |                       |
                |     in         out    |
                +-----/|\--------\ /----+
                       |          |
                       |          |
         input.txt >---+          +----------> output.txt
```

Opening (and closing) new file streams does not change the fact that this
process has had its standard input and output streams redirected.


* <https://en.wikipedia.org/wiki/Redirection_(computing)>
* <https://www.linfo.org/redirection.html>
* <https://linuxcommand.org/lc3_lts0070.php>
* <https://www.gnu.org/software/bash/manual/html_node/Redirections.html>
* <https://man7.org/linux/man-pages/man1/bash.1.html#REDIRECTION>
* <https://ss64.com/nt/syntax-redirection.html>
* <https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-xp/bb490982(v=technet.10)>


## Shared streams

At a shell command prompt, if we type this command-line,

```text
    > foo
```

then we are asking the shell process to create and run a `foo` process. The
shell process (cmd.exe on Windows, or bash on Linux) is the parent process
and `foo` is its child process. The shell process causes the `foo` process
to have its standard streams connected in the following, usual, way.

```text
                        foo
                 +-----------------+
                 |                 |
   keyboard >--->> stdin    stdout >>----+----> console window
                 |                 |     |
                 |          stderr >>----+
                 |                 |
                 +-----------------+
```

But this picture is incomplete. It does not show the relationship between
the `foo` process and the shell process, its parent process. The shell
process is itself a command-line program, so it uses the keyboard for its
input and the console window for its output.

Here is how the two processes are related to each other. The two processes
"share" the input stream for the keyboard and they share the output
stream to the console window.

```text
                               shell
                         +-----------------+
                         |                 |
                  +----->> stdin    stdout >>-----+
                  |      |                 |      |
                  |      |          stderr >>-----+
                  |      |                 |      |
                  |      +-----------------+      |
    keyboard >----+                               +----> console window
                  |            foo                |
                  |      +-----------------+      |
                  |      |                 |      |
                  +----->> stdin    stdout >>-----+
                         |                 |      |
                         |          stderr >>-----+
                         |                 |
                         +-----------------+
```

If, at a shell command prompt, we type this command-line,

```text
    > foo > result.txt
```

then the shell process is the parent process and the `foo` process is the
child process. The child has its standard output stream redirected to a
file, but it uses the default input stream (and default error stream),
which it shares with the shell process. The two processes and their
streams will look like this.

```text
                               shell
                         +-----------------+
                         |                 |
                  +----->> stdin    stdout >>-----+----> console window
                  |      |                 |      |
                  |      |          stderr >>-----+
                  |      |                 |      |
                  |      +-----------------+      |
    keyboard >----+                               |
                  |            foo                |
                  |      +-----------------+      |
                  |      |                 |      |
                  +----->> stdin    stdout >>----------> result.txt
                         |                 |      |
                         |          stderr >>-----+
                         |                 |
                         +-----------------+
```

If, at a shell command prompt, we type this command-line,

```text
    > foo 2> errors.txt
```

then we get the following picture. The `foo` process shares its standard
input and output streams with the shell process.

```text
                               shell
                         +-----------------+
                         |                 |
                  +----->> stdin    stdout >>-----+----> console window
                  |      |                 |      |
                  |      |          stderr >>-----+
                  |      |                 |      |
                  |      +-----------------+      |
    keyboard >----+                               |
                  |            foo                |
                  |      +-----------------+      |
                  |      |                 |      |
                  +----->> stdin    stdout >>-----+
                         |                 |
                         |          stderr >>----------> errors.txt
                         |                 |
                         +-----------------+
```

When two processes share a stream, it is usually the case that one of the
two processes is idle while the other process uses the shared stream (the
idle process will often be waiting for the other process to terminate). If
two processes are simultaneously using a shared stream, the results can be
confusing and unpredictable.

If two processes simultaneously use an output stream, then their outputs will
be, more or less, randomly intermingled in the stream's final destination.
This can lead to unusable results.

If two processes simultaneously use an input stream, as in the following
picture, then it is **not** the case that every input byte flows into each
process. Each input byte can only be consumed by *one* of the two processes.
Which process gets a particular byte of input depends on the ordering of
when each process calls its `read()` function on the input stream. This is
almost never a desirable situation. Processes almost never simultaneously
use a shared input stream. Shared input streams are very common, but the
two processes almost always have a way to synchronize their use of the
stream so that they are never reading from it simultaneously. The most
common way for two processes to share an input stream is for the parent
process to wait for the child process to terminate. Then the parent
process can resume reading from the input stream.

```text
                        parent
                  +-----------------+
                  |                 |
           +----->> stdin    stdout >>-------->
           |      |                 |
           |      |          stderr >>----->
           |      |                 |
           |      +-----------------+
      >----+
           |
           |               child
           |         +-----------------+
           |         |                 |
           +-------->> stdin    stdout >>------>
                     |                 |
                     |          stderr >>---->
                     |                 |
                     +-----------------+
```
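The "parent waits for the child" pattern can be sketched with Python's `subprocess` module. When no stream arguments are given, the child inherits (shares) the parent's three standard streams, and `subprocess.run()` does not return until the child has terminated. The child program and the name `run_child_and_wait` are made up for this illustration.

```python
import subprocess
import sys

# A stand-in child program that writes one line to its standard output.
CHILD_CODE = "print('child: writing to the shared stdout')"


def run_child_and_wait():
    """Start a child that shares our standard streams, then wait for it."""
    # Flush our own buffered output first so that it cannot appear
    # after the child's output on the shared stream.
    sys.stdout.flush()
    # No stdin/stdout/stderr arguments: the child inherits all three.
    completed = subprocess.run([sys.executable, "-c", CHILD_CODE])
    # We only get here after the child has terminated, so it is now
    # safe for the parent to use the shared streams again.
    return completed.returncode


if __name__ == "__main__":
    print("parent: starting child")
    code = run_child_and_wait()
    print("parent: child exited with code", code)
```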


## Pipes

So far, we have seen that streams can connect a process to either a file
or an I/O device (like the keyboard or a console window).

It would be useful if the output stream of one process could be connected
to the input stream of another process, something like this.

```text
                 foo                            bar
          +-----------------+            +-----------------+
          |                 |            |                 |
    >---->> stdin    stdout >>---------->> stdin    stdout >>----->
          |                 |            |                 |
          |          stderr >>--->       |          stderr >>---->
          |                 |            |                 |
          +-----------------+            +-----------------+
```

This picture is supposed to represent the idea that the `foo` process can
send information to the `bar` process by `foo` printing to its standard
output stream and `bar` reading from its standard input stream.

The above picture is not possible. The operating system does not allow the
output stream of one process to be connected directly to the input stream of
another process. But the idea is very useful, so the operating system provides
an object, called a **pipe**, that can be placed between two processes, and
can allow the output from one process to be used as input to another process.

Consider the following command-line.

```text
    > foo | bar
```

The `|` character is (in the context of a command-line) called the **pipe
symbol**. This command-line asks the shell process to create two child
processes, one from the `foo` program and the other from the `bar` program.
In addition, the shell process will ask the operating system to create a
**pipe** object and have the standard output stream of the `foo` process
redirected to the input of the pipe, and have the standard input stream
of the `bar` process redirected to the output of the pipe. This creates a
picture that looks like the following. Notice that `foo` shares the keyboard
with the shell, and `bar` shares the console window with the shell. Also
notice that the error stream from `foo` is combined with the output and
error streams from both the shell and `bar`.

```text
                                      shell
                               +-----------------+
                               |                 |
                +------------->> stdin    stdout >>--------------------+
                |              |                 |                     |
                |              |          stderr >>--------------------+
                |              |                 |                     |
                |              +-----------------+                     |
                |                                                      |
   keyboard >---+                                                      +---> console window
                |                                                      |
                |          foo                           bar           |
                |   +----------------+            +----------------+   |
                |   |                |    pipe    |                |   |
                +-->> stdin   stdout >>--======-->> stdin   stdout >>--+
                    |                |            |                |   |
                    |         stderr >>--+        |         stderr >>--+
                    |                |   |        |                |   |
                    +----------------+   |        +----------------+   |
                                         |                             |
                                         |                             |
                                         +-----------------------------+
```

The shell process will wait for *both* child processes to terminate before
the shell will resume using the shared keyboard and console window.
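Here is a sketch of how a parent process written in Python could build the same two-process pipeline with the `subprocess` module. The two child programs are made-up stand-ins for `foo` and `bar`, and `run_pipeline` is a hypothetical name. One difference from the shell: the output of `bar` is captured through a second pipe so that the function can return it, instead of being left connected to the console window.

```python
import subprocess
import sys

# Stand-ins for the `foo` and `bar` programs.
FOO_CODE = "print('alpha'); print('beta')"
BAR_CODE = "import sys; sys.stdout.write(sys.stdin.read().upper())"


def run_pipeline():
    """Run the equivalent of `foo | bar` and return bar's output as text."""
    # foo's stdout is connected to the write end of a new pipe.
    foo = subprocess.Popen(
        [sys.executable, "-c", FOO_CODE],
        stdout=subprocess.PIPE,
    )
    # bar's stdin is connected to the read end of that same pipe.
    bar = subprocess.Popen(
        [sys.executable, "-c", BAR_CODE],
        stdin=foo.stdout,
        stdout=subprocess.PIPE,
    )
    # The parent closes its own copy of the pipe's read end.
    foo.stdout.close()
    output, _ = bar.communicate()  # wait for bar and collect its output
    foo.wait()                     # wait for foo as well
    return output.decode()


if __name__ == "__main__":
    print(run_pipeline(), end="")
```

Like the shell, the parent waits for both children before it continues.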

If we type a command-line like this,

```text
    > foo < data.txt | bar > result.txt
```

then the shell process will ask the operating system to create two child
processes, one from the `foo` program and the other from the `bar` program.
In addition, the shell process will ask the operating system to create a
**pipe** object and have `stdout` of the `foo` process redirected to the
input of the pipe, and have `stdin` of the `bar` process redirected to the
output of the pipe. Finally, the shell process will ask the operating system
to redirect the `foo` process's standard input to the file `data.txt` and
redirect the `bar` process's standard output to the file `result.txt`. While
this command is executing, it looks like the following picture (this picture
doesn't show the parent shell process and its streams).

```text
                      foo                           bar
               +----------------+            +----------------+
               |                |    pipe    |                |
  data.txt >-->> stdin   stdout >>--======-->> stdin   stdout >>-----> result.txt
               |                |            |                |
               |         stderr >>--+        |         stderr >>---+-> console window
               |                |   |        |                |    |
               +----------------+   |        +----------------+    |
                                    |                              |
                                    +------------------------------+
```

In the above command, the two processes, `foo` and `bar`, are running
simultaneously (in parallel) with each other. The pipe object acts as
a "buffer" between the two processes. Whenever the `foo` process writes
something to its output stream, that something gets put in the pipe
"buffer". Then when the `bar` process wants to read some input data,
it reads whatever is currently in the pipe "buffer".

If the `foo` process writes data into the pipe buffer faster than the `bar`
process can read data out of the pipe buffer, then data accumulates in
the buffer. If the `foo` process writes data so fast that it fills up the
buffer, then the operating system makes the `foo` process "block" and wait
for the `bar` process to read some data from the pipe buffer. When the `bar`
process reads some data from the buffer, then the operating system "unblocks"
the `foo` process so that it can resume writing data into the buffer.

If the `bar` process reads data out of the pipe buffer faster than the `foo`
process can write data into the buffer, then the `bar` process will often
find the pipe empty when `bar` wants to read some data. In that case, the
operating system "blocks" the `bar` process and makes it wait until some
data shows up in the pipe. When the `foo` process writes some data to the
pipe, then the operating system "unblocks" the `bar` process so that it
can resume reading data from the pipe buffer. (You should compare this to
what happens when a process tries to `pop()` an empty stack data structure.)

When `foo` terminates, it may be that data still remains in the pipe. In
that case `bar` will continue to run until it has emptied the pipe. When
`bar` reads the last byte of data from the pipe buffer, then the operating
system tells `bar` that it has reached the end-of-file on its input stream.

It is possible for the `bar` process to terminate before the `foo` process
does. In that case, it is not a good idea to let the `foo` process fill up
the pipe buffer and then block (forever). If the `bar` process terminates
and the `foo` process then writes data into the pipe buffer, the operating
system sends an I/O exception to the `foo` process. Any data left in the
pipe buffer is considered lost.
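
The "reader goes away first" case can be imitated inside a single Java
program. The following is a minimal sketch, not an operating system pipe: it
uses Java's in-process `PipedInputStream` and `PipedOutputStream` classes as
a stand-in for the pipe, and the class name `BrokenPipeDemo` and its method
are made up for this illustration. The read end is closed first (playing the
role of `bar` terminating), and then the write end (playing `foo`) tries to
write.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class: an in-process analogy for a pipe whose reader is gone.
class BrokenPipeDemo {

    // Returns true if writing to the pipe fails after the read end is closed.
    static boolean writerSeesException() {
        PipedOutputStream writeEnd = new PipedOutputStream();
        try {
            PipedInputStream readEnd = new PipedInputStream(writeEnd);
            readEnd.close();          // the "bar" side goes away first
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            writeEnd.write('x');      // the "foo" side tries to keep writing
            return false;             // the write was (unexpectedly) accepted
        } catch (IOException e) {
            return true;              // the writer is told the pipe is broken
        }
    }

    public static void main(String[] args) {
        System.out.println("writer saw an I/O exception: " + writerSeesException());
    }
}
```

The writer does not block forever. It gets an `IOException` on its next
write, which is the same outcome described above for the `foo` process.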

This coordination that we just described, between the two processes on the
ends of a pipe, is referred to in computer science as "bounded buffer
synchronization" or the "producer-consumer problem".

* <https://en.wikipedia.org/wiki/Bounded-buffer_problem>
* <https://pages.cs.wisc.edu/~remzi/OSTEP/threads-cv.pdf#page=6>
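
The blocking behavior described above can be observed in a small Java
program. This is only a sketch under some assumptions: two threads stand in
for the `foo` and `bar` processes, Java's in-process `PipedInputStream` and
`PipedOutputStream` stand in for the operating system's pipe, and the names
`BoundedPipeDemo` and `transfer` are made up for this example. The pipe is
given a deliberately small buffer so that the producer must block many times
while it waits for the consumer.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class: a producer and a consumer sharing a small pipe buffer.
class BoundedPipeDemo {

    // Sends `total` bytes through a pipe that buffers at most `bufferSize`
    // bytes, and returns how many bytes the consumer read before end-of-file.
    static int transfer(int total, int bufferSize) {
        try {
            PipedOutputStream writeEnd = new PipedOutputStream();
            PipedInputStream readEnd = new PipedInputStream(writeEnd, bufferSize);

            // Producer (the "foo" role): write() blocks while the buffer is full.
            Thread producer = new Thread(() -> {
                try (writeEnd) {              // closing the write end signals end-of-file
                    for (int i = 0; i < total; i++) {
                        writeEnd.write(i & 0xFF);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            producer.start();

            // Consumer (the "bar" role): read() blocks while the buffer is empty
            // and returns -1 once the buffer is drained and the writer has closed.
            int received = 0;
            try (readEnd) {
                while (readEnd.read() != -1) {
                    received++;
                }
            }
            producer.join();
            return received;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes received: " + transfer(1000, 16));
    }
}
```

Even though the buffer holds only 16 bytes, no data is lost. The producer
simply waits whenever the buffer is full, and the consumer waits whenever it
is empty.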


In the Linux `bash` shell there is another version of the pipe operator,
the `|&` operator. If we type this command-line,

```text
    $ foo |& bar
```

then the `bash` process will ask the operating system to create `foo` and
`bar` child processes, then `bash` will ask the operating system to create
a `pipe` object and have `stdout` *and* `stderr` of the `foo` process
redirected to the input of the pipe, and have `stdin` of the `bar` process
redirected to the output of the pipe. While this command is executing, it
looks like the following picture (this picture doesn't show the `bash`
shell process and its streams).

```text
                      foo                             bar
               +----------------+              +----------------+
               |                |      pipe    |                |
  keyboard >-->> stdin   stdout >>--+-======-->> stdin   stdout >>---+-> console window
               |                |   |          |                |    |
               |         stderr >>--+          |         stderr >>---+
               |                |              |                |
               +----------------+              +----------------+
```

This would be useful if the `bar` process needs to know about and handle
errors from the `foo` process. The Windows `cmd` shell does not have this
version of the pipe operator but it can be implemented with this slightly
more complex command-line (which works on Linux too).

```text
    > foo 2>&1 | bar
```
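
A Java program that starts a child process can ask for the same merging of
`stderr` into `stdout` that `2>&1` requests from the shell. The sketch below
uses the `redirectErrorStream()` method of `ProcessBuilder`. The class name
`MergeStderrDemo` is made up for this example, and the choice of child
command is an assumption: it runs `java -version`, which writes its version
text to standard error.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the Java analog of "foo 2>&1".
class MergeStderrDemo {

    // Runs "java -version" and returns the text that arrives on the
    // child's standard output stream.
    static String run(boolean mergeStderr) {
        // Path of the running JVM, falling back to "java" found on the PATH.
        String java = ProcessHandle.current().info().command().orElse("java");
        ProcessBuilder builder = new ProcessBuilder(java, "-version");
        builder.redirectErrorStream(mergeStderr);   // true acts like 2>&1
        try {
            Process child = builder.start();
            child.getOutputStream().close();        // the child needs no input
            byte[] bytes = child.getInputStream().readAllBytes();
            child.waitFor();
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(run(true));
    }
}
```

With `mergeStderr` set to `true`, the parent sees the version text on the
child's `stdout` stream, just as `bar` would see the error output of `foo`.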


Here is another way to think about the shell's pipeline operator. The shell
process could run the two programs, `foo` and `bar`, sequentially, one after
the other. In other words, the shell process could interpret this command,

```text
    > foo < data.txt | bar > result.txt
```

as the following three commands.

```text
    > foo < data.txt > temp
    > bar < temp > result.txt
    > del temp
```

These three commands would have a picture that looks like this.

```text
                        foo
                +-----------------+
                |                 |
   data.txt >-->> stdin    stdout >>----> temp
                |                 |
                |          stderr >>----> console window
                |                 |
                +-----------------+

                        bar
                +-----------------+
                |                 |
       temp >-->> stdin    stdout >>----> result.txt
                |                 |
                |          stderr >>----> console window
                |                 |
                +-----------------+
```

First the `foo` process runs with its output stored in a temporary file
called `temp`. Then the `bar` process runs with its input coming from
the `temp` file. Then the `temp` file is deleted.

Notice that this sequential interpretation of the pipeline command might
be considerably slower than the parallel interpretation. And since the
sequential interpretation needs to store all the intermediate data in
a temp file, the sequential interpretation may require far more storage
space than the parallel interpretation.


One final remark. Do not confuse the shell's pipe operator, the `|`
character, with the operating system's `pipe` object. The operating
system's `pipe` object is an object provided by the OS to efficiently
implement one kind of interprocess communication. The shell's pipe
operator is a way for the shell's user to request that two processes
communicate. The shell may or may not implement its pipe operator using
an OS `pipe` object (see the last few paragraphs).

Here is the documentation for the Linux and Windows operating system
functions that create `pipe` objects.

* <https://man7.org/linux/man-pages/man2/pipe.2.html>
* <https://learn.microsoft.com/en-us/windows/win32/api/namedpipeapi/nf-namedpipeapi-createpipe>

Here are references to the `bash` and `cmd.exe` pipe operators.

* <https://en.wikipedia.org/wiki/Pipeline_(Unix)>
* <https://en.wikipedia.org/wiki/Vertical_bar#Pipe>
* <https://www.gnu.org/software/bash/manual/html_node/Pipelines.html>
* <https://linuxcommand.org/lc3_lts0070.php#:~:text=Pipelines>
* <https://ss64.com/nt/syntax-redirection.html#:~:text=Pipes%20and%20CMD.exe>
* <https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-xp/bb490982(v=technet.10)>
* <https://en.wikipedia.org/wiki/Foobar>



## Filters and Pipelines

All the example code mentioned in this section is in the subfolder called
`filter_programs` from the following zip file.

* <http://cs.pnw.edu/~rlkraft/cs33600/for-class/streams_and_processes.zip>


Pipes are extremely useful. Their usefulness comes from combining them with
a kind of program called a filter. When pipes and filters are combined
together, we call these systems **data pipelines**.

* <https://en.wikipedia.org/wiki/Pipeline_(software)>
* <https://aws.amazon.com/what-is/data-pipeline/>
* <https://www.ibm.com/think/topics/data-pipeline>
* <https://dataengineering.wiki/Concepts/Data+Pipeline>

A **filter** is a program that reads data from its standard input stream, does
some kind of operation on the data, and then writes the converted data to its
standard output stream.

* <https://en.wikipedia.org/wiki/Filter_(software)>
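
To make the definition concrete, here is a minimal sketch of a filter
written in Java. It is not one of the programs from the `filter_programs`
folder. The class name `UpperCaseFilter` and its methods are made up for
this illustration, and it only converts the ASCII letters `a` through `z`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical filter: reads bytes, converts a-z to A-Z, writes the result.
class UpperCaseFilter {

    // The filter itself: it only knows about an input and an output stream.
    static void filter(InputStream in, OutputStream out) throws IOException {
        int b;
        while ((b = in.read()) != -1) {       // -1 means end-of-file
            if (b >= 'a' && b <= 'z') {
                b = b - 'a' + 'A';
            }
            out.write(b);
        }
        out.flush();
    }

    // Convenience wrapper that runs the filter on an in-memory string.
    static String apply(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            filter(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        filter(System.in, System.out);        // standard input to standard output
    }
}
```

Notice that the `filter()` method never mentions a file, the keyboard, or a
pipe. That is what lets the shell connect the program to any of them.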

Data pipelines are usually implemented on a very large scale, processing
gigabytes of data. But pipelines can also be useful at a small scale, while
working with files on your personal computer. The Windows and Linux operating
systems both come with many filter programs installed. Filter programs can be
used, for example, to sort, search, format, or convert files.

* <https://linuxcommand.org/lc3_lts0070.php#:~:text=Filters>

To get a feel for working with pipes and filters, it helps to experiment with
actual filter programs. In this section we will work with a collection of
simple filter programs, written in Java and C, contained in the folder
`filter_programs`.

In the `filter_programs` folder there are Java programs that act as filters.
They are all short programs that do simple manipulations of their input
characters. Look at the source code to these programs. Compile them and then
run them using command-lines like the following.

```text
    > java Reverse < Readme.txt > result.txt
    > java Double < Readme.txt | java Reverse
    > java Double | java ToUpperCase | java Reverse
    > java ShiftN 2 | java ToUpperCase | java Reverse
    > java Twiddle < Readme.txt | java ToUpperCase | java Double | java RemoveVowels > result2.txt
    > java Find pipe < Readme.txt | java CountLines
    > java OneWordPerLine < Readme.txt | java Find pipe | java CountLines
```

Run a couple of the programs by themselves, without any I/O redirection or
pipes, to see how they manipulate input data (from the keyboard) to produce
output data (in the console window).

```text
    > java ToUpperCase
    > java Double
    > java Reverse
    > java MakeOneLine
```

Notice that you need to tap the `Enter` key to send input from the keyboard
to the process. Sometimes you see immediate output. Sometimes, for example
with `CountLines.java` or `LongestLine.java`, there is no output until the
input is terminated (end-of-file). You denote the end of your input to the
process by typing `Control-Z` on Windows or `Control-D` on Linux. **Do not**
use `Control-C`. That terminates the process (instead of terminating just
the process's input) and causes the process's output to be lost. Sometimes
(for example with `MakeOneLine.java`) the results in the console window do
not make it obvious how the program works as a filter.
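
Here is a sketch of why a line-counting filter cannot print anything until
it sees end-of-file. It is not the `CountLines.java` program from the
`filter_programs` folder. The class name `LineCounter` and its methods are
made up for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical filter: its only output is a total, so it must wait for end-of-file.
class LineCounter {

    // Reads lines until end-of-file and returns how many lines were read.
    static long countLines(Reader in) {
        try {
            BufferedReader reader = new BufferedReader(in);
            long count = 0;
            while (reader.readLine() != null) {   // null means end-of-file
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Nothing is printed until the loop in countLines() has seen end-of-file.
        System.out.println(countLines(new InputStreamReader(System.in)));
    }
}
```

The single `println()` call comes after the read loop, so the count can only
appear once the input stream has ended.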



## Command-line Syntax

We have seen that command-lines can be made up of, among other things,
program names, command-line arguments, file names, I/O redirection
operators (the `<` and `>` characters), and pipes (the `|` character).
In this section we will look at the syntax of building complex
command-lines that combine all of these elements along with a few
new elements.

We need to be careful when we use the phrase "command-line argument".
Here is why.

Consider the following "command-line". It uses the Java program `Find.java`
from the `filter_programs` directory. How many "command-line arguments" are
there? The answer is, of course, "It depends!".

```text
    > java Find pipe < Readme.txt > temp.txt
```

On the one hand, we can say that there are "no command-line arguments"
because this is just an input string that the shell process reads from its
standard input stream. The shell process parses this string and then builds
a command-line to give to the operating system. The command-line for the
operating system asks the OS to create a `java` process with two command-line
arguments, "Find" and "pipe". The rest of the input string is used by the
*shell process* to decide to ask the OS to redirect the standard input and
output streams for the `java` process. From the point of view of the `java`
process we can say that there are "two command-line arguments". But there is
still a third point of view. The `java` process implements the Java Virtual
Machine (JVM) and the `Find.class` file is an executable file from the point
of view of the JVM. The JVM (virtually) executes a `Find` process. The `main()`
method of the `Find` process is passed "one command-line argument", the
string "pipe".

So the answer to the question, "How many command-line arguments are there?"
is none, from the point of view of the shell process, two from the point of
view of the `java` process, and one from the point of view of the (virtual)
`Find` process.

Another way to say this is that the tokens "Find" and "pipe" are definitely
command-line arguments, and the "java", "<", "Readme.txt", ">", and
"temp.txt" tokens are not command-line arguments; they are tokens used by
the shell process.
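
You can check the last point of view with a tiny Java program that reports
what its `main()` method receives. This is a sketch. The class name
`ShowArgs` and its `describe()` method are made up for this illustration.

```java
// Hypothetical program: reports the command-line arguments passed to main().
class ShowArgs {

    // Builds a one-line description of the argument array.
    static String describe(String[] args) {
        return args.length + " argument(s): " + String.join(", ", args);
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

If you compile this and run it with a command-line of the same shape as the
one above,

```text
    > java ShowArgs pipe < Readme.txt > temp.txt
```

then the file `temp.txt` will contain the line `1 argument(s): pipe`. The
redirection tokens never reach the `main()` method because the shell process
consumes them.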


Here are some references for the CMD shell syntax.

* <https://ss64.com/nt/syntax-redirection.html>
* <https://ss64.com/nt/syntax-conditional.html>
* <https://ss64.com/nt/syntax-esc.html>
* <https://ss64.com/nt/syntax.html>
* <https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/cmd>
* <https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/command-line-syntax-key>
* <https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-xp/bb490954(v=technet.10)>
* <https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-xp/bb490982(v=technet.10)>

Here are some references for the Bash shell syntax.

* <https://man7.org/linux/man-pages/man1/bash.1.html#SHELL_GRAMMAR>
* <https://www.gnu.org/software/bash/manual/bash.html#Shell-Syntax>
* <https://catonmat.net/bash-redirections-cheat-sheet>


Here are some review problems that ask you to use the material discussed
in this document.

### Problem 1
Explain what each of the following possible command-lines means.
In each problem, you need to associate an appropriate meaning to the
symbols `a`, `b` and `c`. Each symbol can represent either a program,
a file, or a command-line argument.

For example "a is the name of a program, b and c are the names of files",
or "a and b are the names of programs and c is the name of a file",
or "a is the name of a program, b and c are arguments to the program".
Also give a specific example of a runnable command-line with the given
format using Windows command-line programs like `dir`, `more`, `sort`,
`find`, `echo`, etc.

* <https://ss64.com/nt/>

```text
    > a > b < c
    > a < b > c
    > a | b > c
    > a < b | c
    > a   b   c
    > a   b > c
    > a   b | c
    > a & b < c
    > a < b   c
    > a < b & c
    > a & b | c
    > a &(b | c)
    >(a & b)| c
    >(a & b)> c
    > a & b   c
    > a & b & c
```


### Problem 2
Draw a picture illustrating the processes, streams, pipes, and files
in each of the following command-lines.

(a)
```text
    > b < a | c > d
```

(b)
```text
    > a < b | c 2> d | e > f 2> d
```


### Problem 3
Draw a picture that illustrates all the processes, pipes, files, and
(possibly shared) streams in the following situation. A process p1 opens
the file a.txt for input and then it opens the file b.txt for output. Then
process p1 creates a pipe. Then p1 creates a child process p2 with p2
inheriting a.txt, the pipe's input, and p1's stderr as p2's stdin, stdout
and stderr streams. Then p1 creates another child process p3 with p3
inheriting the pipe's output, b.txt, and p1's stderr as p3's stdin, stdout
and stderr streams. Then p1 closes its stream to a.txt and the pipe's output.


### Problem 4
For the Windows cmd.exe shell, the `dir` command is a builtin command
so the cmd.exe process does all the work for the directory listing (there is
no `dir.exe` program). On the other hand, the `sort` and `find` commands are
not builtin (so there are `sort.exe` and `find.exe` programs).

* <https://ss64.com/nt/syntax-internal.html>

For each of the following cmd.exe command-lines, draw a picture of all the
relevant processes that shows the difference between a pipeline with a
builtin command and a pipeline with non-builtin commands.

(a)
```text
    > dir | find "oops"
```

(b)
```text
    > sort /? | find "oops"
```


### Problem 5
What problem is there with each of the following two command lines?
Hint: Try to draw a picture of all the associated processes, streams,
pipes, and files.

```text
    > a | b < c
    > a > b | c
```



## Creating a pipe

For CS 33600, this section is optional.

Using Java to create a pipe is an interesting topic and is useful for
solving a number of practical problems (in fact, we will use it later in
the course when we implement an HTTP application server), but we need to
move on to other topics.

In this section we will show how a Java process can create a pipeline of two
other processes (the other two processes need *not* be Java processes). We
will approach this in two steps. In the first step, we will show how a Java
process can start a child process and feed data into the child and draw data
from the child. In the second step, we will show how a Java process can start
a pipeline of two child processes and feed data into the beginning of the
pipeline and draw data from the end of the pipeline.

For the first step, we want a Java process to create the following picture.

```text
                      Java process
                +-----------------------+
                |                       |
   keyboard >-->> stdin          stdout >>-----+---> console window
                |                       |      |
                |                stderr >>-----+
                |                       |      |
                |    out          in    |      |
                +----\|/---------/|\----+      |
                      |           |            |
                      |           |            |
               +------+           +------+     |
               |          child          |     |
               |    +---------------+    |     |
               |    |               |    |     |
               +--->> stdin  stdout >>---+     |
                    |               |          |
                    |        stderr >>---------+
                    |               |
                    +---------------+
```

The Java process should create a new input stream, a new output stream,
and a child process, and then redirect the child's standard input to the
new output stream, and redirect the child's standard output to the new
input stream.
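
Here is one possible way to build the first picture, sketched with Java's
`ProcessBuilder` class. It is not necessarily how the course's example code
does it. The class name `FeedChildDemo` and its `feed()` method are made up
for this illustration, and the example assumes that a `sort` program is on
the `PATH`. By default, `ProcessBuilder` connects the child's `stdin` and
`stdout` to new streams in the parent, which are the `out` and `in` streams
in the picture.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical demo class: a parent feeds data to a child and reads its results.
class FeedChildDemo {

    // Starts `command` as a child process, writes `input` to the child's
    // stdin, and returns everything the child writes to its stdout.
    static String feed(List<String> command, String input) {
        try {
            ProcessBuilder builder = new ProcessBuilder(command);
            // The child's stderr shares the parent's stderr, as in the picture.
            builder.redirectError(ProcessBuilder.Redirect.INHERIT);
            Process child = builder.start();

            // "out" in the picture: the parent's output is the child's stdin.
            try (OutputStream out = child.getOutputStream()) {
                out.write(input.getBytes(StandardCharsets.UTF_8));
            }   // closing the stream gives the child end-of-file on its stdin

            // "in" in the picture: the parent's input is the child's stdout.
            byte[] bytes = child.getInputStream().readAllBytes();
            child.waitFor();
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: a program named "sort" is installed and on the PATH.
        System.out.print(feed(List.of("sort"), "pear\napple\nfig\n"));
    }
}
```

This sketch writes all of the input before it reads any output. That is safe
for a small amount of data, or for a child like `sort` that reads all of its
input before writing. For a child that streams large amounts of output while
it is still reading, the parent should do its writing and reading in
separate threads, otherwise both processes can block on full pipe buffers.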

For the second step, we want a Java process to create the following picture.

```text
                      Java process
                +-----------------------+
                |                       |
   keyboard >-->> stdin          stdout >>---------+---> console window
                |                       |          |
                |                stderr >>---------+
                |                       |          |
                |    out          in    |          |
                +----\|/---------/|\----+          +------------+
                      |           |                             |
                      |           |                             |
    +-----------------+           +------------------------+    |
    |                                                      |    |
    |        child1                        child2          |    |
    |   +----------------+            +----------------+   |    |
    |   |                |    pipe    |                |   |    |
    +-->> stdin   stdout >>--======-->> stdin   stdout >>--+    |
        |                |            |                |        |
        |         stderr >>--+        |         stderr >>-------+
        |                |   |        |                |        |
        +----------------+   |        +----------------+        |
                             |                                  |
                             |                                  |
                             +----------------------------------+
```

The Java process should create a new input stream, a new output stream,
two child processes, and a pipe, and then redirect the first child's standard
input to the new output stream, redirect the second child's standard output to
the new input stream, and connect the two child processes with the pipe.
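
For the second picture, Java 9 and later provide the static method
`ProcessBuilder.startPipeline()`, which starts a list of processes and
connects each one's `stdout` to the next one's `stdin`. The following is a
sketch of one possible approach, not necessarily the one used in the
course's example code. The class name `PipelineDemo` and its `run()` method
are made up for this illustration, and the commands `sort` and `uniq` are an
assumption that fits a Linux system. On Windows, substitute two filter
programs that are installed there.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: a parent feeds and drains a pipeline of children.
class PipelineDemo {

    // Starts the commands as a pipeline, writes `input` to the first
    // child's stdin, and returns what the last child writes to its stdout.
    static String run(List<List<String>> commands, String input) {
        try {
            List<ProcessBuilder> builders = new ArrayList<>();
            for (List<String> command : commands) {
                // Every child's stderr shares the parent's stderr, as in the picture.
                builders.add(new ProcessBuilder(command)
                        .redirectError(ProcessBuilder.Redirect.INHERIT));
            }
            List<Process> children = ProcessBuilder.startPipeline(builders);
            Process first = children.get(0);
            Process last = children.get(children.size() - 1);

            // "out" in the picture feeds the beginning of the pipeline.
            try (OutputStream out = first.getOutputStream()) {
                out.write(input.getBytes(StandardCharsets.UTF_8));
            }   // closing the stream gives the first child end-of-file

            // "in" in the picture draws data from the end of the pipeline.
            byte[] bytes = last.getInputStream().readAllBytes();
            for (Process child : children) {
                child.waitFor();
            }
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: "sort" and "uniq" are installed, as on most Linux systems.
        System.out.print(run(List.of(List.of("sort"), List.of("uniq")),
                             "pear\napple\npear\nfig\n"));
    }
}
```

The parent never touches the pipe between the two children. It only holds
the two streams at the ends of the pipeline, which matches the picture.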

