Architectural Background
Hadoop is a lot of things, and one of those is a distributed, abstracted file system. It's called HDFS (the Hadoop Distributed File System), and it has its uses.
HDFS isn't a file system in the interacts-with-OS sense. It's more of a file system on top of file systems: the underlying (normal) file systems each run on one computer, while HDFS spans several computers. Within HDFS, files are divided into blocks; blocks are scattered across multiple machines, usually stored on more than one for redundancy.
There's one NameNode (computer) that knows where everything is, and several core nodes (Amazon's term; Hadoop calls them DataNodes) that hold and serve data. You can log in to any of these nodes and run ordinary filesystem commands like ls and df, but those reflect only the local filesystem, which knows nothing about files in HDFS. The distributed file system is a layer above; to query it, you have to go through hadoop. It's a whole 'nother file manager, with its own hierarchy of what's where.
Why? The main purpose is to stream one file faster. Several machines can read and process one file at the same time, because parts of the file are scattered across machines. Also, HDFS can replicate files across multiple machines. This gives redundancy in storage and also in access: if one machine is busy, a reader can fetch the same block from another. In the end, we use it at Outpace because it can store files that are too big to fit in one place.
Negatives? HDFS files are write-once or append-only. This sounds great (they're immutable, right?) until I need to make a small change, and copy-on-modify means copying hundreds of gigabytes. We don't have the space for that!
How much space do we have?
In our case (using Amazon EMR), all the core nodes are the same, and they all use their local drives (instance stores) to keep HDFS files. In this case, the available space is
number of core nodes * space per node / replication factor.
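To make the formula concrete, here is a minimal Python sketch. The function name hdfs_data_capacity and the cluster figures (4 nodes, 100 GiB each, replication 2) are made up for illustration; they are not our cluster's numbers.

```python
GIB = 1024 ** 3


def hdfs_data_capacity(core_nodes: int, bytes_per_node: int, replication: int) -> int:
    """Bytes of distinct data that fit: nodes * space per node / replication."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return core_nodes * bytes_per_node // replication


# Hypothetical cluster: 4 core nodes, 100 GiB of instance store each, replication 2.
capacity = hdfs_data_capacity(4, 100 * GIB, 2)
print(capacity / GIB)  # 200.0
```

Doubling the replication factor halves the result, which is why the factor matters as much as the node count.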
I can find the number of core nodes and the space on each one, along with the total disk space that HDFS finds available, by logging in to the NameNode (master node, in Amazon terms) and running
hadoop dfsadmin -report
Here, one uses hadoop as a top-level command, then dfsadmin as a subcommand, and then -report to tell dfsadmin what to do. This seems to be typical of dealing with hadoop. (Newer Hadoop releases deprecate this spelling in favor of hdfs dfsadmin -report.)
The report prints a summary for the whole cluster, and then details for each node. The summary looks like:
Configured Capacity: 757888122880 (705.84 GB)
Present Capacity: 704301940736 (655.93 GB)
DFS Remaining: 363997749248 (339.00 GB)
DFS Used: 340304191488 (316.93 GB)
DFS Used%: 48.32%
It's evident from 48% Used that I'm going to have problems when I make a copy of my one giant data table. When HDFS is close to full, errors happen.
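If you'd rather have those numbers in a script than read them by eye, the summary is easy to pull apart. A minimal Python sketch, assuming the report keeps the "Name: bytes (human-readable)" layout shown above; parse_dfs_summary is a name I made up, and the sample text is copied from the summary above.

```python
import re

SAMPLE_SUMMARY = """\
Configured Capacity: 757888122880 (705.84 GB)
Present Capacity: 704301940736 (655.93 GB)
DFS Remaining: 363997749248 (339.00 GB)
DFS Used: 340304191488 (316.93 GB)
DFS Used%: 48.32%
"""


def parse_dfs_summary(text: str) -> dict:
    """Map each 'Name: <bytes> (...)' line to its integer byte count."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^:]+):\s*(\d+)\s*\(", line)
        if match:  # percentage lines have no '(' and are skipped
            fields[match.group(1).strip()] = int(match.group(2))
    return fields


summary = parse_dfs_summary(SAMPLE_SUMMARY)
used_pct = 100 * summary["DFS Used"] / summary["Present Capacity"]
print(f"{used_pct:.2f}%")  # 48.32%
```

In this sample, DFS Used plus DFS Remaining adds up exactly to Present Capacity, which is a handy sanity check on the parse.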
Here's the trick though: the DFS Remaining number does not reflect how much data I can store. It does not take into account the replication factor. Find that out by running
hadoop fsck /
This prints, among other things, the default replication factor and the average block replication actually in effect. (The factor can be overridden for a particular file, for example with hadoop fs -setrep.) Divide your remaining space by your default replication factor to see how much new data you can store. Then round down generously, since errors start as HDFS nears full. Block size is less of a concern than it might seem: files are stored in blocks, but a final partial block occupies only its actual size on disk.
The hadoop fs subcommand supports many typical unix filesystem commands, except they have a dash in front of them. For instance, if you're wondering where your space is going,
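As a worked example of that division, here is a small Python sketch. The replication factor of 3 and the 10% safety margin are assumptions for illustration (use whatever fsck reports for your cluster), and usable_bytes is a hypothetical helper name; the DFS Remaining figure is the one from the report above.

```python
def usable_bytes(dfs_remaining: int, replication: int, headroom: float = 0.10) -> int:
    """Estimate how much new data fits: remaining / replication, less a safety margin."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return int(dfs_remaining // replication * (1 - headroom))


DFS_REMAINING = 363997749248  # from the dfsadmin report above
REPLICATION = 3               # assumed; read the real value from fsck

estimate = usable_bytes(DFS_REMAINING, REPLICATION)
print(f"{estimate / 1024 ** 3:.1f} GB")  # 101.7 GB
```

So under these assumptions, the 339 GB that the report calls "remaining" is closer to 100 GB of room for new data.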
hadoop fs -du /
will show you the top-level directories inside HDFS and their accumulated sizes. You can then drill down repeatedly into the large directories (with hadoop fs -du <dir>) to find the big fat files that are eating your disk space.
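The drilling down can be made less tedious by sorting the output. A rough Python sketch, assuming each useful line of hadoop fs -du output starts with a byte count and ends with a path (the exact columns vary between Hadoop versions, and paths containing spaces would break this). The function name largest_entries and the sample listing are made up for illustration, not output from a real cluster.

```python
def largest_entries(du_output: str, top: int = 3) -> list:
    """Return the `top` biggest (size_in_bytes, path) pairs from du-style text."""
    entries = []
    for line in du_output.splitlines():
        parts = line.split()
        # Skip blank lines and anything that does not start with a byte count.
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        entries.append((int(parts[0]), parts[-1]))
    return sorted(entries, reverse=True)[:top]


# Invented listing in the assumed "size ... path" layout.
MADE_UP_LISTING = """\
1024  /tmp
340000000000  /user
52428800  /var
"""

for size, path in largest_entries(MADE_UP_LISTING, top=2):
    print(size, path)
```

Feed it the text captured from hadoop fs -du on whichever directory came out on top, and repeat until you reach the files.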
As with any abstraction, try to make friends with the concepts inside HDFS before doing anything interesting with it. Nodes, blocks, replication factors ... there's more to worry about than with a typical filesystem. Great power, great responsibility, and all that.