In my role at MongoDB, I spend a lot of time thinking about how to transform relational data models into efficient MongoDB document structures. My workflow is to import relational data as-is, and then use the aggregation framework on the data to create various document structures. Given the ubiquity of CSV as an efficient way to organize and share data, in this post, I’m going to show how to quickly get CSV files into MongoDB.
The simplest way to import data is to use mongoimport. The tool supports importing several file types with options to manage field names, and field data types. There are way too many mongoimport options to discuss here. I’ll focus on the specific options I use to work with CSV data.
The first thing I work on is defining the correct field names. This can be done in one of two ways:
- Include a headerline with field names as the first line in the CSV itself
- Create a separate file with header names
Your choice of method really comes down to personal preference. Although, if dealing with a large CSV file, defining header names in a separate files may be the best approach. I generally don’t work with files that large, so I generally define headers in the CSV files itself.
In addition, you can also set the MongoDB datatype for each imported header/field. By default, mongoimport will make a best effort to define datatypes based on file content. But this doesn’t always produce the exact results we are looking for.
So let’s say I want to import historical stock price data. To get the desired field names and data types, I simply add the following as the first line in the CSV file:
symbol.string(), date.date(2006-01-02), open.decimal(), high.decimal(), low.decimal(), close.decimal, volume.int32()
Assuming my CSV file name is stocks.csv, and that I want to import data into the stocks database within the equities collection, I would issue the following:
mongoimport -d stocks -c equities --type csv --headerline --columnsHaveTypes --file stocks.csv
Notice that I use several options of note here:
|–type csv||specifies import file type as a csv file|
|–headerline||indicates that headers are the top line of the csv file|
|–columnsHaveTypes||headers have defined datatypes|
|–file||specifies the import file name|
When I want to import more than just a few files, I use a for loop to import multiple files in sequence:
for i in `ls *csv`; do mongoimport -d armchair --type csv --headerline --file $i; done