COPY FROM

Copy data from files into a table.

Synopsis

COPY table_ident FROM uri [ WITH ( option = value [, ...] ) ]

where option can be one of:

  • bulk_size integer
  • concurrency integer
  • shared boolean
  • num_readers integer
  • compression string

Description

COPY FROM copies data from a URI to the specified table.

The nodes in the cluster will attempt to access the resources available under the URI and import the data.

Each file must contain one JSON-formatted row per line.
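As an illustration of the expected file layout, the sketch below assumes a hypothetical table named quotes with two columns. The table, its columns, and the file contents are not part of this reference and serve only as an example. The matching import file is shown as comments, one JSON object per line.

```sql
-- Hypothetical target table (name and columns are assumptions).
CREATE TABLE quotes (
  id INTEGER PRIMARY KEY,
  quote STRING
);

-- A matching import file would hold one JSON object per line, e.g.:
--   {"id": 1, "quote": "Don't panic"}
--   {"id": 2, "quote": "Time is an illusion."}
```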

For examples see: Importing data.

Supported URI Schemes

file

This is the default scheme, used when the URI does not specify one.

The file path given must be absolute and accessible by the Crate process.

By default, each node will attempt to read the files specified. If the URI points to a shared folder, the shared option must be set to true to avoid duplicate imports.
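A minimal sketch of both cases, assuming a hypothetical table quotes and hypothetical paths (/tmp/import_data and /mnt/shared/import_data are placeholders, not paths defined by this reference):

```sql
-- No scheme given, so the file scheme is assumed.
-- Each node reads the file from its own local file system.
COPY quotes FROM '/tmp/import_data/quotes.json';

-- The directory is assumed to be a shared mount visible to all nodes,
-- so shared = true keeps the same file from being imported more than once.
COPY quotes FROM 'file:///mnt/shared/import_data/quotes.json' WITH (shared = true);
```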

s3

This scheme can be used to access buckets on Amazon S3:

s3://[<accesskey>:<secretkey>@]<bucketname>/<path>

If the access key and secret key are omitted, an attempt is made to load the credentials from environment variables or Java system properties:

  • Environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
  • Java system properties: aws.accessKeyId and aws.secretKey

Using an s3 URI sets the shared option to true implicitly.
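The two ways of supplying credentials might look as follows. This is a sketch only: the table quotes, the bucket my-bucket, and the object path are hypothetical, and the key placeholders are left unfilled.

```sql
-- Credentials embedded in the URI (placeholders, not real keys).
COPY quotes FROM 's3://<accesskey>:<secretkey>@my-bucket/import/quotes.json';

-- No credentials in the URI: they are looked up from the environment
-- variables or Java system properties.
COPY quotes FROM 's3://my-bucket/import/quotes.json';
```

Neither statement sets shared explicitly, because the s3 scheme implies it.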

Parameters

table_ident: The name (optionally schema-qualified) of an existing table into which the data is imported.
uri: An expression that evaluates to a URI as defined in RFC 2396. The supported schemes are listed above. The last part of the path may also contain * wildcards to match multiple files.
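To illustrate the two parameters, the sketch below uses a wildcard in the last path segment and a schema-qualified table name. The schema my_schema, the table quotes, and the directory are assumptions made for this example.

```sql
-- Import every file in the directory whose name ends in .json.
COPY quotes FROM '/tmp/import_data/*.json';

-- The same import into a schema-qualified table.
COPY my_schema.quotes FROM '/tmp/import_data/*.json';
```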

WITH Clause

The optional WITH clause can specify options for the COPY FROM statement.

[ WITH ( option = value [, ...] ) ]

Options

bulk_size

Crate processes the lines it reads from the path in batches (bulks). This option specifies the size of each bulk. The default is 10000.

Must be set to a number greater than 0.

Warning

A bulk_size that is too high will cause a very high load on the cluster and some lines that should be inserted might even be dropped.

concurrency

The number of parallel bulk actions that should be executed. Default is 4. Must be set to a number greater than 0.

Warning

As with a high bulk_size, a high concurrency setting will cause a very high load on the cluster.
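Both options can be given in the same WITH clause. The values below are arbitrary and chosen only to show the syntax, not as tuning advice; the table quotes and the path are hypothetical.

```sql
-- Smaller bulks and fewer parallel bulk actions than the defaults
-- (10000 and 4), to reduce the load the import places on the cluster.
COPY quotes FROM '/tmp/import_data/*' WITH (bulk_size = 1000, concurrency = 2);
```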

shared

This option should be set if the URI points to shared storage. It prevents multiple nodes/readers from importing the same file.

The default value depends on the URI scheme used.

num_readers

The number of nodes that will read the resources specified in the URI. Defaults to the number of nodes available in the cluster. If the option is set to a number greater than the number of available nodes, each node is still used only once for the import.

Must be an integer that is greater than 0.
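A sketch of limiting the number of reading nodes, assuming a hypothetical table quotes and a hypothetical shared mount at /mnt/shared/import_data:

```sql
-- At most two nodes read from the shared directory.
-- shared = true keeps them from importing the same file twice.
COPY quotes FROM '/mnt/shared/import_data/*' WITH (shared = true, num_readers = 2);
```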

compression

The default value is null. Set it to gzip to read gzip-compressed files.
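For example, a gzipped import file might be read as follows. The table quotes and the file name are hypothetical, and the value is assumed to be passed as a string literal since the option is of type string.

```sql
-- Read a gzip-compressed file containing one JSON row per line.
COPY quotes FROM '/tmp/import_data/quotes.json.gz' WITH (compression = 'gzip');
```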