The number of Alpha groups is the same as the number of reduce shards you set with the `--reduce_shards` flag. For example, if your cluster has 3 Alphas with 3 replicas per group, then there is 1 group and `--reduce_shards` should be set to 1. If your cluster has 6 Alphas with 3 replicas per group, then there are 2 groups and `--reduce_shards` should be set to 2.
The `--map_shards` option must be set to at least what's set for `--reduce_shards`. A higher number helps the Bulk Loader evenly distribute predicates between the reduce shards.
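As a sketch, a load for the 6-Alpha, 2-group cluster described above might look like the following (the file names and Zero address are placeholders, not part of the original example):

```sh
# Hypothetical input files; adjust paths and the Zero address to your setup.
# --reduce_shards=2 matches the two Alpha groups; --map_shards=4 is set
# higher to help balance predicates across the two reduce shards.
dgraph bulk -f goldendata.rdf.gz -s goldendata.schema \
  --map_shards=4 --reduce_shards=2 \
  --zero=localhost:5080
```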
The bulk load output is written to the `out` directory by default. Here's the bulk load output from the preceding example:
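A sketch of the resulting layout (abridged; the actual directories contain additional files):

```sh
$ tree ./out   # hypothetical, abridged listing
./out
├── 0
│   └── p
└── 1
    └── p
```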
Because `--reduce_shards` was set to 2, two sets of `p` directories are generated:

- the `./out/0` folder
- the `./out/1` folder

Once the output is created, the files must be copied to all the servers that will run Dgraph Alpha nodes. Each Dgraph Alpha must have access to its own copy of the `p` directory output:

- Alpha nodes of the first group (`Alpha1`, `Alpha2`, `Alpha3`) should have a copy of `./out/0/p`.
- Alpha nodes of the second group (`Alpha4`, `Alpha5`, `Alpha6`) should have a copy of `./out/1/p`, and so on.

Other Bulk Loader options:

- `--schema`, `-s`: set the location of the schema file.
- `--graphql_schema`, `-g` (optional): set the location of the GraphQL schema file.
- The `--badger` superflag's `compression` option: configure the compression of data on disk. By default, the Snappy compression format is used, but you can also use Zstandard compression, or choose no compression to minimize CPU usage. To learn more, see Data Compression on Disk.
- `--new_uids` (default: false): assign new UIDs instead of using the existing UIDs in data files. This is useful to avoid overriding the data in a DB already in operation.
- `-f`, `--files`: location of `*.rdf(.gz)` or `*.json(.gz)` files to load. Multiple files can be loaded from a given path. If the path is a directory, then all files ending in `.rdf`, `.rdf.gz`, `.json`, and `.json.gz` are loaded.
- `--format` (optional): specify the file format (`rdf` or `json`) instead of getting it from filenames. This is useful if you need to define a strict format manually.
- `--store_xids`: generate an `xid` edge for each node. This stores the XIDs (the identifiers/blank nodes) in an attribute named `xid` in the entity itself.
- `--xidmap` (default: disabled; requires a path): store the xid-to-UID mapping in a directory. Dgraph saves all identifiers used in the load for later use in other data import operations. The mapping is saved in the path you provide, and you must indicate that same path in the next load. Using this flag is recommended if you have full control over your identifiers (blank nodes), because each identifier is mapped to a specific UID.
- The `--vault` superflag (and its options): specify the Vault server address, role ID, secret ID, and the field that contains the encryption key required to decrypt the encrypted export.
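A minimal sketch combining several of these options (all paths and addresses are placeholders, and the `--badger` superflag value is an assumption that may vary by Dgraph version):

```sh
# Load JSON data with an explicit format, keep an xid-to-UID mapping for
# later loads, and request Zstandard compression for the on-disk data.
dgraph bulk -f data.json.gz -s data.schema --format=json \
  --store_xids --xidmap=/tmp/xidmap \
  --badger compression=zstd:1 \
  --zero=localhost:5080
```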
To access Amazon S3, set the following environment variables:

| Environment Variable | Description |
| --- | --- |
| `AWS_ACCESS_KEY_ID` or `AWS_ACCESS_KEY` | AWS access key with permissions to write to the destination bucket. |
| `AWS_SECRET_ACCESS_KEY` or `AWS_SECRET_KEY` | AWS secret key with permissions to write to the destination bucket. |
To access MinIO, set the following environment variables:

| Environment Variable | Description |
| --- | --- |
| `MINIO_ACCESS_KEY` | MinIO access key with permissions to write to the destination bucket. |
| `MINIO_SECRET_KEY` | MinIO secret key with permissions to write to the destination bucket. |
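If your input files live in S3 or MinIO, a run might look like the following sketch. The bucket name and the assumption that your Dgraph version accepts `s3://` input paths for the Bulk Loader are both hypothetical; check your version's documentation:

```sh
export AWS_ACCESS_KEY_ID=<your-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret>
# Note the triple slash in the s3 path (no region-specific host).
dgraph bulk -f s3:///my-bucket/data.rdf.gz -s s3:///my-bucket/data.schema \
  --zero=localhost:5080
```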
When starting the cluster with the `--replicas` flag set to a value greater than 1, if the `p` directory has been created by the Bulk Loader, then start only the first Alpha replica and wait for it to take a snapshot before starting the remaining replicas (based on the `--replicas` flag value set in the Zero nodes). The Alpha node (the one started in step 2) then logs messages reporting the snapshot.

To distribute the `p` directory (created by the Bulk Loader) among all the Alpha nodes, you can follow these steps: copy (for example, with `rsync`) the `p` directory to the other servers (the servers you are using to start the other Alpha nodes).

The `snapshot at index` value must be the same within the same Alpha group, and the `ReadTs` value must be the same within and among all the Alpha groups.
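A sketch of the copy step, with hypothetical hostnames and paths:

```sh
# Copy the group-0 output to the other two replicas of the first group.
rsync -avz ./out/0/p alpha2:/data/dgraph/
rsync -avz ./out/0/p alpha3:/data/dgraph/
```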
Using the `--force-namespace` flag, you can load all the data into a specific namespace. In that case, the namespace information in the data and schema files is ignored.
For example, to force the bulk data loading into namespace 123:
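A sketch of such a run (the file names and Zero address are placeholders):

```sh
dgraph bulk -f data.rdf.gz -s data.schema --force-namespace 123 \
  --zero=localhost:5080
```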
After the load completes, copy the generated `p` directory to a new Alpha server.
Here's an example of running the Bulk Loader with a key used to write encrypted data:
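A sketch of such an invocation, with placeholder paths and addresses:

```sh
# The key file is used to encrypt the output p directory on disk.
dgraph bulk --encryption key-file=/path/to/enc_key_file \
  -f data.json.gz -s data.schema \
  --zero=localhost:5080
```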
Alternatively, the `vault_*` options can be used to decrypt the encrypted export.
The `--encryption key-file=value` option was previously used to encrypt the output `p` directory. This same option is also used to decrypt the encrypted export data and schema files.
Another option, `--encrypted`, indicates whether the input `rdf`/`json` data and schema files are encrypted. With this switch, we support the use case of migrating data from unencrypted exports to encrypted imports.
So, with the preceding two options, there are four cases:
- `--encrypted=true` and no `encryption key-file=value`: error, because if the input is encrypted, a key file must be provided.
- `--encrypted=true` and `encryption key-file=path-to-key`: the input is encrypted and the output `p` dir is encrypted as well.
- `--encrypted=false` and no `encryption key-file=value`: the input isn't encrypted and the output `p` dir is also not encrypted.
- `--encrypted=false` and `encryption key-file=path-to-key`: the input isn't encrypted but the output is encrypted (this is the migration use case mentioned previously).
Alternatively, the `vault_*` options can be used instead of the `--encryption key-file=value` option to achieve the same effect, except that the keys reside in a Vault server.
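A sketch of the Vault variant follows. The exact option names inside the `--vault` superflag (server address, role/secret ID files, and the field holding the encryption key) are assumptions that may vary by Dgraph version:

```sh
# Hypothetical Vault configuration; verify option names for your version.
dgraph bulk -f data.rdf.gz -s data.schema --zero=localhost:5080 \
  --vault "addr=http://localhost:8200;role-id-file=./role_id;secret-id-file=./secret_id;enc-field=enc_key"
```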
You can also use the Bulk Loader to turn off encryption. This generates a new unencrypted `p` directory that's used by the Alpha process. In this case, you need to pass the `--encryption key-file`, `--encrypted`, and `--encrypted_out` flags, as in the sketch below.
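A sketch of such a run, with placeholder paths and addresses:

```sh
# Decrypt an encrypted export and write an unencrypted p directory.
dgraph bulk -f data.rdf.gz -s data.schema \
  --encryption key-file=/path/to/enc_key_file \
  --encrypted=true --encrypted_out=false \
  --zero=localhost:5080
```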
We pass `--encrypted=true` because the exported data was taken from an encrypted Dgraph cluster, and we also specify `--encrypted_out=false` to indicate that we want the `p` directory (generated by the Bulk Loader process) to be unencrypted.
For details about each flag, run `dgraph bulk --help`. In particular, you should tune the flags so that the Bulk Loader doesn't use more memory than is available as RAM. If it starts swapping, it becomes incredibly slow.
In the map phase, tweaking the following flags can reduce memory usage:
- The `--num_go_routines` flag controls the number of worker threads. Lowering it reduces memory consumption.
- The `--mapoutput_mb` flag controls the size of the map output files. Lowering it reduces memory consumption.
For bigger datasets and machines with many cores, gzip decoding can be a bottleneck during the map phase. Performance improvements can be obtained by first splitting the input into many `.rdf.gz` files (e.g. 256MB each). This has a negligible impact on memory usage.
The reduce phase is less memory-heavy than the map phase, although it can still use a lot. Some flags may be increased to improve performance, but only if you have large amounts of RAM:

- The `--reduce_shards` flag controls the number of resultant Dgraph Alpha instances. Increasing this increases memory consumption, but in exchange allows for higher CPU utilization.
- The `--map_shards` flag controls the number of separate map output shards. Increasing this increases memory consumption, but balances the resultant Dgraph Alpha instances more evenly.
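A sketch of a memory-constrained run that lowers the map-phase knobs described above (the values are illustrative, not recommendations, and the file names are placeholders):

```sh
# Fewer worker threads and smaller map output files reduce peak memory
# at the cost of a slower map phase.
dgraph bulk -f data.rdf.gz -s data.schema --zero=localhost:5080 \
  --num_go_routines=2 --mapoutput_mb=32 \
  --map_shards=2 --reduce_shards=1
```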