Execute a computation

This section explains how to manually perform the steps that would normally be performed by a framework bridge to execute a computation. nGraph graphs are targeted toward automatic construction; it is far easier for a processor (a CPU, GPU, or purpose-built silicon) to execute a computation than it is for a human to map out how that computation happens. Unfortunately, things that make by-hand graph construction simpler tend to make automatic construction more difficult, and vice versa.

Nevertheless, it can be helpful to break down what is happening during graph construction. The documentation that follows explains two approaches a framework can use to compile with nGraph operations: using complete shape information, and using partially known shapes.

The nGraph Intermediate Representation uses a strong, dynamic type system, including static shapes. This means that at compilation, every tensor (or, equivalently, every node output) in the graph is assigned complete shape information; that is, one and only one shape. The static process by which this assignment takes place is called shape propagation.

In the first scenario, the model description walk-through is based on the abc.cpp code in the /doc/examples/abc directory, and it deconstructs the steps that must happen (either programmatically or manually) in order to successfully execute a computation given complete shape information.

Scenario One: Using Complete Shapes

A step-by-step example of how a framework might execute with complete shape information is provided here. For a step-by-step example using dynamic shapes, see Scenario Two: Known Partial Shape.

Define the computation

To a framework, a computation is simply a transformation of inputs to outputs. While a bridge can programmatically construct the graph from a framework’s representation of the computation, graph construction can be somewhat more tedious when done manually. For anyone interested in specific nodes (vertices) or edges of a computation that reveal “what is happening where”, it can be helpful to think of a computation as a zoomed-out and stateless data-flow graph where all of the nodes are well-defined tensor operations and all of the edges denote use of an output from one operation as an input for another operation.

Most of the public portion of the nGraph API is in the ngraph namespace, so we will omit the namespace. Unless otherwise noted, namespaces other than std refer to namespaces nested within ngraph; for example, op::v1::Add is assumed to refer to ngraph::op::v1::Add. A computation’s graph is constructed from ops; each is an instance of a subclass of op::Op, which, in turn, is a subclass of Node. Not all graphs are computations, but all graphs are composed entirely of instances of Node; computation graphs contain only op::Op nodes.
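For instance, the snippets below assume the same include and using-directive that appear in the full listing at the end of this section:

    #include <ngraph/ngraph.hpp>

    using namespace ngraph; // so op::v1::Add refers to ngraph::op::v1::Add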

We mostly use shared pointers for nodes, i.e. std::shared_ptr<Node>, so that they will be automatically deallocated when they are no longer needed. More detail on shared pointers is given in the glossary.

Every node has zero or more inputs, zero or more outputs, and zero or more attributes.

The specifics of the types permitted for each core op can be found in our List of Core ops docs. For the purpose of defining a computation, nodes should be thought of as essentially immutable; that is, when constructing a node, we need to supply all of its inputs. The process therefore starts with ops that have no inputs, since every other op needs the outputs of previously constructed nodes as its inputs.
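Besides op::v0::Parameter (described next), another zero-input op is op::v0::Constant, which bakes a fixed value into the graph. A minimal sketch (the factory signature shown here may vary slightly between nGraph versions):

    // A constant 2x3 tensor of 32-bit floats; like Parameter, it has no inputs.
    auto k = op::v0::Constant::create(element::f32, Shape{2, 3},
                                      std::vector<float>{1, 2, 3, 4, 5, 6});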

op::v0::Parameter specifies the tensors that will be passed to the computation. Parameters receive their values from outside of the graph, so they have no inputs. They have attributes for the element type and the shape of the tensor that will be passed to them.

    // Build the graph
    Shape s{2, 3};
    auto a = std::make_shared<op::v0::Parameter>(element::f32, s);
    auto b = std::make_shared<op::v0::Parameter>(element::f32, s);
    auto c = std::make_shared<op::v0::Parameter>(element::f32, s);

The above code makes three parameter nodes, each describing a tensor of 32-bit floats with shape (2, 3) and a row-major element layout.

To create a graph for (a + b) * c, first make an op::v1::Add node with inputs from a and b, and an op::v1::Multiply node from the add node and c:

    auto t0 = std::make_shared<op::v1::Add>(a, b);
    auto t1 = std::make_shared<op::v1::Multiply>(t0, c);

When the op::v1::Add op is constructed, it will check that the element types and shapes of its inputs match; to support multiple frameworks, nGraph does not do automatic type conversion or broadcasting. In this case, they match, and the unique output of t0 will have element type 32-bit float and shape (2, 3). Similarly, op::v1::Multiply checks that its inputs match and sets the element type and shape of its unique output.
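As a hedged illustration of that validation, constructing a node from inputs whose element types do not match fails at construction time (the exact exception type depends on the nGraph version):

    auto d = std::make_shared<op::v0::Parameter>(element::i32, s);
    try
    {
        auto bad = std::make_shared<op::v1::Add>(a, d); // f32 vs. i32: rejected
    }
    catch (const std::exception& e)
    {
        std::cerr << e.what() << std::endl;
    }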

Once the graph is built, we need to package it in a Function:

    auto f = std::make_shared<Function>(OutputVector{t1},
                                        ParameterVector{a, b, c});

The first argument to the constructor specifies the nodes that the function will return; in this case, the product. An OutputVector is a vector of references to outputs of Node. The second argument specifies the parameters of the function, in the order they are to be passed to the compiled function. A ParameterVector is a vector of shared pointers to op::v0::Parameter.
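A Function is not limited to a single result. As a sketch, both the sum and the product could be returned by listing them in the OutputVector:

    auto g = std::make_shared<Function>(OutputVector{t0, t1},
                                        ParameterVector{a, b, c});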

Important

The parameter vector must include every parameter used in the computation of the results.

Specify the backend upon which to run the computation

For a framework bridge, a backend is the environment that can perform the computations; it can be a CPU, GPU, or purpose-built silicon. A backend can compile a Function, allocate and deallocate tensors, and invoke computations.

Factory-like managers for classes of backends create backend instances; the backend then compiles the Function and provides tensor storage. A backend is somewhat analogous to a multi-threaded process.

There are two backends for the CPU: the optimized "CPU" backend, which uses the DNNL, and the "INTERPRETER" backend, which runs reference versions of kernels that favor implementation clarity over speed. The "INTERPRETER" backend can be slow, and is primarily intended for testing. See the documentation on runtime options for various backends for additional details.
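For example, to create the reference backend instead (assuming the "INTERPRETER" backend was built into your nGraph installation):

    auto ref_backend = runtime::Backend::create("INTERPRETER");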

To continue with our original example and select the "CPU" backend:

    // Create the backend
    auto backend = runtime::Backend::create("CPU");

Compile the computation

Compilation produces an object that acts as a factory for a CallFrame, which is a function and its associated state that can run in a single thread at a time. A CallFrame may be reused, but any particular CallFrame must only be running in one thread at any time. If more than one thread needs to execute the function at the same time, create multiple CallFrame objects from the ExternalFunction.
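A hedged sketch of that advice, taking the simplest approach of compiling once per thread (whether a single compiled object may be shared between threads is backend-specific):

    // One compiled executable per thread that needs to run the function.
    auto exec_for_thread_1 = backend->compile(f);
    auto exec_for_thread_2 = backend->compile(f);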

Allocate backend storage for the inputs and outputs

At the graph level, functions are stateless. They do have internal state related to execution, but there is no user-visible state. Variables must be passed as arguments. If the function updates variables, it must return the updated variables.

To invoke a function, tensors must be provided for every input and every output. At this time, a tensor used as an input cannot also be used as an output. If variables are being updated, you should use a double-buffering approach where you switch between odd/even generations of variables on each update.
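A minimal sketch of the double-buffering idea, assuming a single variable tensor of shape s that the function both reads and updates (the names here are illustrative and not part of the example program):

    // Two generations of the variable; alternate their roles each iteration.
    auto t_state_even = backend->create_tensor(element::f32, s);
    auto t_state_odd = backend->create_tensor(element::f32, s);
    bool even = true;
    for (int step = 0; step < 4; ++step)
    {
        auto& t_state_in = even ? t_state_even : t_state_odd;
        auto& t_state_out = even ? t_state_odd : t_state_even;
        // exec->call({t_state_out /*, other results */},
        //            {t_state_in /*, other arguments */});
        even = !even;
    }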

Backends are responsible for managing storage. If the storage is off-CPU, caches are used to minimize copying between device and CPU. We can allocate storage for the three parameters and the return value.

    // Allocate tensors for arguments a, b, c
    auto t_a = backend->create_tensor(element::f32, s);
    auto t_b = backend->create_tensor(element::f32, s);
    auto t_c = backend->create_tensor(element::f32, s);
    // Allocate tensor for the result
    auto t_result = backend->create_tensor(element::f32, s);

Each tensor is a shared pointer to a runtime::Tensor, the interface backends implement for tensor access. When there are no more references to a tensor, it will be freed when convenient for the backend. See the Backend APIs documentation for details on how to work with runtime::Tensor.

Initialize the inputs

Next we need to copy some data into the tensors.

    // Initialize tensors
    float v_a[2][3] = {{1, 2, 3}, {4, 5, 6}};
    float v_b[2][3] = {{7, 8, 9}, {10, 11, 12}};
    float v_c[2][3] = {{1, 0, -1}, {-1, 1, 2}};

    t_a->write(&v_a, sizeof(v_a));
    t_b->write(&v_b, sizeof(v_b));
    t_c->write(&v_c, sizeof(v_c));

The runtime::Tensor interface has write and read methods for copying data to/from the tensor.
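For example, to sanity-check a write you can read the bytes straight back (a sketch; not part of the original program):

    float check[2][3];
    t_a->read(&check, sizeof(check)); // should now hold {{1, 2, 3}, {4, 5, 6}}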

Invoke the computation

To invoke the function, we compile it into an executable and then pass the argument and result tensors to its call method:

    // Invoke the function
    auto exec = backend->compile(f);
    exec->call({t_result}, {t_a, t_b, t_c});

Access the outputs

We can use the read method to access the result:

    // Get the result
    float r[2][3];
    t_result->read(&r, sizeof(r));

    std::cout << "[" << std::endl;
    for (size_t i = 0; i < s[0]; ++i)
    {
        std::cout << " [";
        for (size_t j = 0; j < s[1]; ++j)
        {
            std::cout << r[i][j] << ' ';
        }
        std::cout << ']' << std::endl;
    }
    std::cout << ']' << std::endl;

    return 0;

Compiling with Complete Shape Information

“The (a + b) * c example for executing a computation on nGraph”
//*****************************************************************************
// Copyright 2017-2020 Intel Corporation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//*****************************************************************************

#include <iostream>

#include <ngraph/ngraph.hpp>

using namespace ngraph;

int main()
{
    // Build the graph
    Shape s{2, 3};
    auto a = std::make_shared<op::v0::Parameter>(element::f32, s);
    auto b = std::make_shared<op::v0::Parameter>(element::f32, s);
    auto c = std::make_shared<op::v0::Parameter>(element::f32, s);

    auto t0 = std::make_shared<op::v1::Add>(a, b);
    auto t1 = std::make_shared<op::v1::Multiply>(t0, c);

    // Make the function
    auto f = std::make_shared<Function>(OutputVector{t1},
                                        ParameterVector{a, b, c});

    // Create the backend
    auto backend = runtime::Backend::create("CPU");

    // Allocate tensors for arguments a, b, c
    auto t_a = backend->create_tensor(element::f32, s);
    auto t_b = backend->create_tensor(element::f32, s);
    auto t_c = backend->create_tensor(element::f32, s);
    // Allocate tensor for the result
    auto t_result = backend->create_tensor(element::f32, s);

    // Initialize tensors
    float v_a[2][3] = {{1, 2, 3}, {4, 5, 6}};
    float v_b[2][3] = {{7, 8, 9}, {10, 11, 12}};
    float v_c[2][3] = {{1, 0, -1}, {-1, 1, 2}};

    t_a->write(&v_a, sizeof(v_a));
    t_b->write(&v_b, sizeof(v_b));
    t_c->write(&v_c, sizeof(v_c));

    // Invoke the function
    auto exec = backend->compile(f);
    exec->call({t_result}, {t_a, t_b, t_c});

    // Get the result
    float r[2][3];
    t_result->read(&r, sizeof(r));

    std::cout << "[" << std::endl;
    for (size_t i = 0; i < s[0]; ++i)
    {
        std::cout << " [";
        for (size_t j = 0; j < s[1]; ++j)
        {
            std::cout << r[i][j] << ' ';
        }
        std::cout << ']' << std::endl;
    }
    std::cout << ']' << std::endl;

    return 0;
}

Scenario Two: Known Partial Shape

The second scenario involves the use of dynamic tensors. A dynamic tensor is a tensor whose shape can change from one “iteration” to the next. When a dynamic tensor is created, a framework bridge might supply only partial shape information: it might be all of the tensor dimensions, some of the tensor dimensions, or none of the tensor dimensions; furthermore, the rank of the tensor may be left unspecified. The “actual” shape of the tensor is not specified until some function writes a value to it, and the actual shape can change when the value of the tensor is overwritten. It is the backend’s responsibility to set the actual shape. The model description for the second scenario is based on the partial_shape.cpp code in the /doc/examples/dynamic_tensor directory, and it deconstructs the steps that must happen (either programmatically or manually) in order to successfully retrieve shape data.
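As a sketch, nGraph's PartialShape can express these different levels of shape knowledge (exact helper names may differ slightly between versions):

    PartialShape unknown_rank = PartialShape::dynamic();  // rank and dims unknown
    PartialShape known_rank = PartialShape::dynamic(2);   // rank 2, both dims unknown
    PartialShape partial{2, Dimension::dynamic()};        // (2,?)
    Shape complete{2, 3};                                 // fully static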

Create and compile a graph where the provided shape information for x is (2,?):

    auto x_shape_info = PartialShape{2, Dimension::dynamic()};
    auto x = make_shared<op::v0::Parameter>(element::i32, x_shape_info);
    auto a = x + x;
    auto f = make_shared<Function>(OutputVector{a}, ParameterVector{x});
    auto be = runtime::Backend::create("CPU", true);
    auto ex = be->compile(f);

Create a dynamic tensor

Create a dynamic tensor of shape (2,?), then run the computation three times with inputs of different widths:

    auto t_out = be->create_dynamic_tensor(element::i32, x_shape_info);
    execute(be, ex, t_out, 3);
    execute(be, ex, t_out, 11);
    execute(be, ex, t_out, 20);

At this point, t_out->get_shape() would throw an exception, while t_out->get_partial_shape() would return "(2,?)".
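A sketch of guarding against that exception: check whether the tensor's shape has become static before asking for it.

    if (t_out->get_partial_shape().is_static())
    {
        Shape s_out = t_out->get_shape();
    }
    else
    {
        cout << t_out->get_partial_shape() << endl;
    }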

Initialize an input of shape (2, n)

    auto t_in = be->create_tensor(element::i32, Shape{2, n});
    {
        vector<int32_t> t_val(2 * n);
        iota(t_val.begin(), t_val.end(), 0);
        t_in->write(&t_val[0], t_val.size() * sizeof(t_val[0]));
    }

After the call below completes for n = 3, t_out->get_shape() would return Shape{2,3}, while t_out->get_partial_shape() would still return "(2,?)".

Get the result

    ex->call({t_out}, {t_in});

    auto s = t_out->get_shape();
    vector<int32_t> r(s[0] * s[1]);
    t_out->read(&r[0], r.size() * sizeof(r[0]));
    cout << "[" << endl;
    for (size_t i = 0; i < s[0]; ++i)
    {
        cout << " [";
        for (size_t j = 0; j < s[1]; ++j)
        {
            cout << r[i * s[1] + j] << ' ';
        }
        cout << ']' << endl;
    }
    cout << ']' << endl;
}

After the final call (with n = 20), t_out->get_shape() would return Shape{2,20}, while t_out->get_partial_shape() would still return "(2,?)".

Compiling with Known Partial Shape

“Full code for compiling with dynamic tensors and partial shape”
//*****************************************************************************
// Copyright 2017-2020 Intel Corporation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//*****************************************************************************

#include <iostream>
#include <numeric>
#include <vector>

#include <ngraph/ngraph.hpp>

using namespace std;
using namespace ngraph;

void execute(shared_ptr<runtime::Backend> be,
             shared_ptr<runtime::Executable> ex,
             shared_ptr<runtime::Tensor> t_out,
             uint32_t n);

int main()
{
    // Create and compile a graph where the provided info of shape of x is
    // (2,?)
    auto x_shape_info = PartialShape{2, Dimension::dynamic()};
    auto x = make_shared<op::v0::Parameter>(element::i32, x_shape_info);
    auto a = x + x;
    auto f = make_shared<Function>(OutputVector{a}, ParameterVector{x});
    auto be = runtime::Backend::create("CPU", true);
    auto ex = be->compile(f);

    // Create a dynamic tensor of shape (2,?)
    auto t_out = be->create_dynamic_tensor(element::i32, x_shape_info);
    execute(be, ex, t_out, 3);
    execute(be, ex, t_out, 11);
    execute(be, ex, t_out, 20);

    return 0;
}

void execute(shared_ptr<runtime::Backend> be,
             shared_ptr<runtime::Executable> ex,
             shared_ptr<runtime::Tensor> t_out,
             uint32_t n)
{
    // Initialize input of shape (2, n)
    auto t_in = be->create_tensor(element::i32, Shape{2, n});
    {
        vector<int32_t> t_val(2 * n);
        iota(t_val.begin(), t_val.end(), 0);
        t_in->write(&t_val[0], t_val.size() * sizeof(t_val[0]));
    }
    // Get the result
    ex->call({t_out}, {t_in});

    auto s = t_out->get_shape();
    vector<int32_t> r(s[0] * s[1]);
    t_out->read(&r[0], r.size() * sizeof(r[0]));
    cout << "[" << endl;
    for (size_t i = 0; i < s[0]; ++i)
    {
        cout << " [";
        for (size_t j = 0; j < s[1]; ++j)
        {
            cout << r[i * s[1] + j] << ' ';
        }
        cout << ']' << endl;
    }
    cout << ']' << endl;
}