Plugging USDA Nutrition Data into MongoDB

by Valeri Karpov@code_barbarianMarch 28, 2014

View the Code on GithubView the Code for this Post on Github

As much as I love geeking out about basketball stats, I want to put a MongoDB data set out there that's a bit more app-friendly: the USDA SR25 nutrient database. You can download this data set from my S3 bucket here, and plug it into your MongoDB instance using mongorestore. I'm very meticulous about nutrition and have, at times, kept a food journal, but sites like FitDay and DailyBurn have far too much spam and are far too poorly designed to be a viable option. With this data set, I plan on putting together an open source web-based food journal in the near future. However, I encourage you to use this data set to build your own apps.

Data Set Structure

The data set contains one collection, 'nutrition'. The documents in this collection contain merged data from the SR25 database's very relational FOOD_DES, NUTR_DEF, NUT_DATA, and WEIGHT files. In more comprehensible terms, the documents contain a description of a food item, a list of nutrients with measurements per 100g, and a list of common serving sizes for that food. Here's what the top level document for grass-fed ground bison looks like in RoboMongo, a simple MongoDB GUI:

The top level document is fairly simple: the description is a human-readable description of the food, the manufacturer is the company that manufactures the product, and survey is whether or not the data set has values for the 65 nutrients used for some government survey. However, the real magic happens in the nutrients and weights subdocuments. Lets see what happens when we open up nutrients:

You'll see that there are an incredible amount of nutrients. The nutrients data is in an array, where each subdocument in the array has a tagname, which is a common scientific abbreviation for the nutrient, a human-readable description, and an amountPer100G with corresponding units. In the above example, you'll see that 100 grams of cooked grass-fed ground bison contains about 25.45 g of protein.

(Note: the original data set includes some more detailed data, including standard deviations and sample sizes for the nutrient measurements, but that's outside the scope of what I want to do with this data set. If you want that data, feel free to read through the government data set's documentation and fork my converter on Github.)

Finally, the weights subdocument is another array which contains sub-documents that describe common serving sizes for the food item and their mass in grams. In the grass-fed ground bison example, the weights list contains a single serving size, 3 oz, which approximately 85 grams:

Exploring the Data Set

First things first: since the nutrients for each food are in an array, its not immediately obvious what nutrients this data set has. Thankfully, MongoDB's distinct command makes this very easy:

There are a lot of different nutrients in this data set. In fact, there are 145:

So how are we going to find nutrient data for a food that we're interested in? Suppose we're looking to find how many carbs are in raw kale. Pretty easy to do because MongoDB's shell supports JavaScript regular expressions, so lets just find documents where the description includes 'kale':

Of course, this doesn't include the carbohydrate content, so lets add a $elemMatch to the projection to limit output to the carbohydrates in raw kale:

Running Aggregations to Test Nutritional Claims

My favorite burger joint in Chelsea, brgr, claims that grass-fed beef has as much omega-3 as salmon. Lets see if this advertising claim holds up to scrutiny:

Right now, this is a bit tricky. Since I imported the data from the USDA as-is, total omega-3 fatty acids is not tracked as a single nutrient. The amounts for individual omega-3 fatty acids, such as EPA and DHA, are recorded separately. However, the different types of omega-3 fatty acids all have n-3 in the description, so it should be pretty easy to identify which nutrients we need to sum up to get total omega-3 fatty acids. Of course, when you need to significantly transform your data, its time to bust out the MongoDB aggregation framework.

The first aggregation we're going to do is find the salmon item that has the least amount of total omega-3 fatty acids per 100 grams. To do that, we first need to transform the documents to include the total amount of omega-3's, rather than the individual omega-3 fats like EPA and DHA. With the $group pipeline state and the $sum operator, this is pretty simple. Keep in mind that the nutrient descriptions for omega-3 fatty acids are always in grams in this data set, so we don't have to worry about unit conversions.

You can get a text version of the above aggregation on Github. To verify brgr's claim, lets run the same aggregation for grass-fed ground beef, but reversing the sort order.

Looks like brgr's claim doesn't quite hold up to a cursory glance. I'd be curious to see what the basis for their claim is, specifically if they assume a smaller serving size for salmon than for grass-fed beef.

Conclusion

Phew, that was a lot of information to cram into one post. The data set, as provided by the USDA, is a bit complex and could really benefit from some simplification. Thankfully, MongoDB 2.6 is coming out soon, and, with it, the $out aggregation operator. The $out operator will enable you to pipe output from the aggregation framework to a separate collection, so I'll hopefully be able to add total omega-3 fatty acids as a nutrient, among other things. Once again, feel free to download the data set here (or check out the converter repo on Github) and use it to build some awesome nutritional apps.

Found a typo or error? Open up a pull request! This post is available as markdown on Github