There’s big data and then there’s really big data — like, BIG DATA, yo.
NASA’s JPL deals in all-caps.
Brian Wood, VP Marketing
NASA on managing the big data deluge from space
When it comes to big data, NASA ranks among the top 10 users. The space agency collects hundreds of terabytes of data every hour, and one planned telescope array alone is expected to deliver 700 terabytes a day (“the equivalent of all the data flowing on the Internet every two days,” according to JPL). That creates extreme challenges in storing, processing and accessing data. Here’s how the agency manages it all and stays abreast of the deluge.
Eric De Jong at NASA’s Jet Propulsion Laboratory is the principal investigator for NASA’s Solar System Visualization project. He and his team convert NASA mission science into visualization products for researchers. He sums up that enormous challenge in a JPL post this way:
“Scientists use big data for everything from predicting weather on Earth to monitoring ice caps on Mars to searching for distant galaxies. We are the keepers of the data, and the users are the astronomers and scientists who need images, mosaics, maps and movies to find patterns and verify theories.”
That post covers the basics of how JPL deals with big data, from cloud computing and open-source code to automated visualization and data-access tools. For far more detail, see the presentation, complete with videos, from JPL’s May 2013 symposium on big data visualization.
It’s a good tutorial on how to make big data work that many government agencies and corporations can easily learn from–and it’s free. Take the time to review their work and you’re bound to find some answers to perplexing but common big data problems.
You’ll also learn valuable tips if you’re just getting started with big data.
“Look for the tasty low-hanging Big Data fruit,” said Tom Soderstrom, IT chief technology officer at JPL, in an article on MeriTalk. “These are problems that have significant business impact if solved. An end user, who is facilitated by the data scientist, articulates these business problems. They are short enough that they can be prototyped and demonstrated within a three-month period and with a low budget. Learn on easy problems. Then you will know where to make the next round of investments and what ROI you could expect.”
From time to time I’ll point you to other examples of successful big data projects to help you along your way. If you know of a successful project, please share it with me. The way forward for us all is to collectively share information on what works and what doesn’t in big data.
Managing the Deluge of ‘Big Data’ From Space
For NASA and its dozens of missions, data pour in every day like rushing rivers. Spacecraft monitor everything from our home planet to faraway galaxies, beaming back images and information to Earth. All those digital records need to be stored, indexed and processed so that spacecraft engineers, scientists and people across the globe can use the data to understand Earth and the universe beyond.
At NASA’s Jet Propulsion Laboratory in Pasadena, Calif., mission planners and software engineers are coming up with new strategies for managing the ever-increasing flow of such large and complex data streams, referred to in the information technology community as “big data.”
How big is big data? For NASA missions, hundreds of terabytes are gathered every hour. Just one terabyte is equivalent to the information printed on 50,000 trees’ worth of paper.
“Scientists use big data for everything from predicting weather on Earth to monitoring ice caps on Mars to searching for distant galaxies,” said Eric De Jong of JPL, principal investigator for NASA’s Solar System Visualization project, which converts NASA mission science into visualization products that researchers can use. “We are the keepers of the data, and the users are the astronomers and scientists who need images, mosaics, maps and movies to find patterns and verify theories.”
Building Castles of Data
De Jong explains that there are three aspects to wrangling data from space missions: storage, processing and access. The first task, to store or archive the data, is naturally more challenging for larger volumes of data. The Square Kilometer Array (SKA), a planned array of thousands of telescopes in South Africa and Australia, illustrates this problem. Led by the SKA Organization based in England and scheduled to begin construction in 2016, the array will scan the skies for radio waves coming from the earliest galaxies known.
JPL is involved with archiving the array’s torrents of images: 700 terabytes of data are expected to rush in every day. That’s equivalent to all the data flowing on the Internet every two days. Rather than build more hardware, engineers are busy developing creative software tools to better store the information, such as “cloud computing” techniques and automated programs for extracting data.
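As a rough back-of-the-envelope check on that figure, the short Python sketch below converts 700 terabytes per day into a sustained rate. It assumes decimal units (1 terabyte = 10^12 bytes) and a perfectly even flow across 24 hours. Neither assumption comes from JPL, and the function name is purely illustrative.

```python
# Back-of-the-envelope conversion of a daily data volume into a sustained rate.
# Assumptions (not from JPL): decimal units, 1 TB = 10**12 bytes, and data
# arriving evenly across a 24-hour day.

BYTES_PER_TERABYTE = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_gb_per_s(terabytes_per_day: float) -> float:
    """Return the average rate in gigabytes per second."""
    bytes_per_second = terabytes_per_day * BYTES_PER_TERABYTE / SECONDS_PER_DAY
    return bytes_per_second / 10**9


ska_rate = sustained_rate_gb_per_s(700)
print(f"{ska_rate:.1f} GB/s, or about {ska_rate * 8:.0f} gigabits per second")
```

Under those assumptions the array would deliver roughly 8 gigabytes every second, around the clock, which helps explain why engineers are turning to smarter software rather than simply buying more hardware.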
“We don’t need to reinvent the wheel,” said Chris Mattmann, a principal investigator for JPL’s big-data initiative. “We can modify open-source computer codes to create faster, cheaper solutions.” Software that is shared and free for all to build upon is called open source or open code. JPL has been increasingly bringing open-source software into its fold, creating improved data processing tools for space missions. The JPL tools then go back out into the world for others to use for different applications.
“It’s a win-win solution for everybody,” said Mattmann.
In Living Color
Archiving isn’t the only challenge in working with big data. De Jong and his team develop new ways to visualize the information. Each image from one of the cameras on NASA’s Mars Reconnaissance Orbiter, for example, contains 120 megapixels. His team creates movies from data sets like these, in addition to computer graphics and animations that enable scientists and the public to get up close with the Red Planet.
“Data are not just getting bigger but more complex,” said De Jong. “We are constantly working on ways to automate the process of creating visualization products, so that scientists and engineers can easily use the data.”
Data Served Up to Go
Another big job in the field of big data is making it easy for users to grab what they need from the data archives.
“If you have a giant bookcase of books, you still have to know how to find the book you’re looking for,” said Steve Groom, manager of NASA’s Infrared Processing and Analysis Center at the California Institute of Technology, Pasadena. The center archives data for public use from a number of NASA astronomy missions, including the Spitzer Space Telescope, the Wide-field Infrared Survey Explorer (WISE) and the U.S. portion of the European Space Agency’s Planck mission.
Sometimes users want to access all the data at once to look for global patterns, a benefit of big data archives. “Astronomers can also browse all the ‘books’ in our library simultaneously, something that can’t be done on their own computers,” said Groom.
“No human can sort through that much data,” said Andrea Donnellan of JPL, who is charged with a similarly mountainous task for the NASA-funded QuakeSim project, which brings together massive data sets — space- and Earth-based — to study earthquake processes.
QuakeSim’s images and plots allow researchers to understand how earthquakes occur and develop long-term preventative strategies. The data sets include GPS data for hundreds of locations in California, where thousands of measurements are taken, resulting in millions of data points. Donnellan and her team develop software tools to help users sift through the flood of data.
Ultimately, the tide of big data will continue to swell, and NASA will develop new strategies to manage the flow. As new tools evolve, so will our ability to make sense of our universe and the world.
NASA’s Jet Propulsion Lab Tackles Big Data
NASA’s Jet Propulsion Laboratory, like many large organizations, is taking on the Big Data problem: the task of analyzing enormous data sets to find actionable information.
In JPL’s case, the job involves collecting and mining data from 22 spacecraft and 10 instruments, including the Mars Science Laboratory’s Curiosity rover and the Kepler space telescope. Tom Soderstrom, IT chief technology officer at JPL, joked that his biggest Big Data challenge is more down to Earth: dealing effectively with his email inbox. But kidding aside, JPL now confronts Big Data as a key problem and a key opportunity.
“If we define the Big Data era as beginning where our current systems are no longer effective, we have already entered this epoch,” Soderstrom explained.
The Problem Defined
Soderstrom defines a Big Data problem as having one or more of the following “V”s:
“We are already overflowing with radar data and earth science data,” Soderstrom said. “Once we get optical communications into space, the problem will increase by orders of magnitude.”
JPL finds its data arriving at a faster and faster rate, an issue for both spacecraft and ground-based systems. Soderstrom said JPL faces the question of how best to deal with an ever-increasing data speed in real time.
JPL’s engineering data once consisted of structured data. Today, JPL needs to combine structured SQL data with unstructured NoSQL data. Soderstrom said reconciling the two can be very time consuming.
Here, the issue is data discovery and manipulation. Soderstrom notes that data is becoming more difficult to detect, extract and combine.
An organization must be able to build business cases for Big Data if it is to determine which problem to take on first.
“It is challenging to come up with crisp business value and return on investment (ROI) for a Big Data problem,” Soderstrom said. “However, this is the key to prioritizing and solving Big Data problems.”
The payoff is potentially huge. Soderstrom said Big Data can yield scientific advances without the need to invest in big-ticket items.
“If we could effectively combine data from various sources — such as oceans data with ozone data with hurricane data — we could detect new science without needing to build new instruments or launch new spacecraft,” Soderstrom said.
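To make the idea of combining sources concrete, here is a minimal Python sketch that joins structured rows from a SQL table with loosely structured JSON documents, the kind of reconciliation Soderstrom describes as time consuming. Everything in it is an assumption made for illustration: the table, the field names, the `extract_sst` helper and the values are invented, and nothing here reflects JPL’s actual data or tools.

```python
import json
import sqlite3

# All table names, field names and values below are invented for illustration;
# they are not JPL data or JPL's schema.

# Structured side: a relational table of hurricane observations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hurricane_obs (obs_date TEXT, region TEXT, wind_kt INTEGER)")
conn.executemany(
    "INSERT INTO hurricane_obs VALUES (?, ?, ?)",
    [
        ("2013-08-01", "gulf", 85),
        ("2013-08-02", "gulf", 110),
        ("2013-08-02", "atlantic", 70),
    ],
)

# Unstructured side: JSON documents whose fields vary from record to record.
ocean_docs = [
    '{"date": "2013-08-01", "region": "gulf", "sea_surface_temp_c": 29.5}',
    '{"date": "2013-08-02", "region": "gulf", "sst": {"celsius": 30.1}}',
    '{"date": "2013-08-02", "region": "atlantic", "notes": "sensor offline"}',
]


def extract_sst(doc: dict):
    """Pull a sea-surface temperature out of either document layout, if present."""
    if "sea_surface_temp_c" in doc:
        return doc["sea_surface_temp_c"]
    return doc.get("sst", {}).get("celsius")


# Index the documents by (date, region) so they can be matched to SQL rows.
sst_by_key = {}
for raw in ocean_docs:
    doc = json.loads(raw)
    sst_by_key[(doc["date"], doc["region"])] = extract_sst(doc)

# Join: attach the temperature (or None when missing) to each structured row.
combined = [
    {"date": d, "region": r, "wind_kt": w, "sst_c": sst_by_key.get((d, r))}
    for d, r, w in conn.execute(
        "SELECT obs_date, region, wind_kt FROM hurricane_obs ORDER BY obs_date, region"
    )
]
conn.close()

for row in combined:
    print(row)
```

Even in this toy case, most of the code handles the irregular documents rather than the join itself: two different layouts for the same measurement, plus a record with no measurement at all. That imbalance is a small-scale version of the reconciliation cost described above.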
Solving the Problem
Outdated IT systems represent one aspect of JPL’s Big Data challenge. System upgrades and the use of cloud computing will help address that issue, Soderstrom said. But new systems aren’t the only issue. JPL also needs to cultivate people with the skills to manage and analyze the data.
“Training our current workforce and augmenting with new personnel skilled in the new Big Data IT systems can solve this,” Soderstrom added.
One sought-after Big Data specialist is the data scientist. This role combines a range of skills in fields including statistics, programming, machine learning/data mining, structured and unstructured data, Big Data tools and modeling. A data scientist should also possess domain knowledge — science or engineering, for example — and the ability to provide a data narrative.
“Simply put, the data scientist teaches the data to tell an interesting story we didn’t already know,” Soderstrom said.
Data scientists won’t be expected to become experts in every Big Data field, but will need a high level of proficiency across the board, he noted. A data scientist who is 80 percent good in many disciplines is better than one who is 100 percent good in any single discipline.
Other qualities include a penchant for exploring data and finding patterns.
“Because much of the exploration will be demonstrated via rapid prototyping, the data scientist will need to use visualization to help tell the story,” Soderstrom said.
Data scientists work together in teams, which could include student contributors who supplement the workforce. The team approach is characteristic of JPL’s testing of emerging technologies.
“We do this in a highly collaborative fashion by establishing working groups and testing the interesting technologies in actual useful prototypes,” Soderstrom said.
“This is part of our journey to redefine IT from the traditional Information Technology definition into Innovating Together.”
Advice for Agencies
Soderstrom suggested that JPL and other federal entities contend with similar Big Data challenges. He said agencies will need to upgrade IT tools and their staffers’ skills, noting that strategic recruiting will play a role.
As for getting started, Soderstrom recommended hiring or appointing a data scientist. That person can come from outside the agency or within it, he said, noting that the latter option will prove easier and less expensive.
Soderstrom also advised agencies to go for some quick wins and avoid analysis paralysis.
“Look for the tasty low-hanging Big Data fruit,” Soderstrom said. “These are problems that have significant business impact if solved. An end user, who is facilitated by the data scientist, articulates these business problems. They are short enough that they can be prototyped and demonstrated within a three-month period and with a low budget.”
Soderstrom advocates a learn-by-doing approach that helps organizations set the stage for tackling additional Big Data projects. The ability to learn from a Big Data experiment is the key success metric, he said.
“Learn on easy problems,” Soderstrom said. “Then you will know where to make the next round of investments and what ROI you could expect.”