Disclosure Statements This presentation has been prepared by NantHealth, Inc. (the “Company”) for informational purposes only and not for any other purpose. Nothing contained in this presentation is, or should be construed as, a recommendation, promise or representation by the presenter or the Company or any director, employee, agent, or adviser of the Company. This presentation does not purport to be all-inclusive or to contain all of the information you may desire. Information provided in this presentation speaks only as of the date hereof. The Company assumes no obligation to update any information or statement after the date of this presentation as a result of new information, subsequent events or any other circumstances. These materials and related materials and discussions may contain forward-looking statements that are based on the Company’s current expectations, and projections and forecasts about future events and trends that the Company believes may affect its business, financial condition, operating results and growth prospects. 
Forward-looking statements are subject to substantial risks, uncertainties and other factors, including but not limited to (1) the structural change in the market for healthcare in the United States, including uncertainty in the healthcare regulatory framework and regulatory developments in the United States and foreign countries; (2) the evolving treatment paradigm for cancer, including physicians’ use of molecular information and targeted oncology therapeutics and the market size for molecular information products; (3) physicians’ need for precision medicine products and any perceived advantage of our solutions over those of our competitors, including the ability of our comprehensive platform to help physicians treat their patients’ cancers; (4) our ability to generate revenue from sales of products enabled by our molecular and biometric information platforms to physicians in clinical settings; (5) our ability to increase the commercial success of our sequencing and molecular analysis solution; (6) our plans or ability to obtain reimbursement for our sequencing and molecular analysis solution, including expectations as to our ability or the amount of time it will take to achieve successful reimbursement from third-party payors, such as commercial insurance companies and health maintenance organizations, and government insurance programs, such as Medicare and Medicaid; (7) our ability to effectively manage our growth, including the rate and degree of market acceptance of our solutions; and (8) our ability to offer new and innovative products and services, attract new partners and clients, estimate the size of our target market, and maintain and enhance our reputation and brand recognition. The Company undertakes no obligation to update any forward-looking statements, whether as a result of new information, future events or otherwise, except as required by applicable law. 
No representation or warranty, express or implied, is given as to the completeness or accuracy of the information or opinions contained in this document and neither the Company nor any of its directors, members, officers, employees, agents or advisers accepts any liability for any direct, indirect or consequential loss or damage arising from reliance on such information or opinions. Past performance should not be taken as an indication or guarantee of future performance, and no representation or warranty, express or implied, is made regarding future performance. We own or have rights to trademarks and service marks that we use in connection with the operation of our business, including NantHealth, Inc. and our logo, as well as other protected brands. Solely for convenience, our trademarks and service marks referred to in this presentation are listed without the (sm) and (TM) symbols, but we will assert, to the fullest extent under applicable law, our rights or the rights of the applicable licensors to these trademarks, service marks and trade names. Additionally, we do not intend for our use or display of other companies’ trade names, trademarks, or service marks to imply a relationship with, or endorsement or sponsorship of us by, these other companies. We have indicated with (TM) symbols where these third party trademarks are referred to in this presentation. This presentation includes certain financial measures not based on accounting principles generally accepted in the United States, or non-GAAP measures. These non-GAAP measures are in addition to, not a substitute for or superior to, measures of financial performance prepared in accordance with GAAP. Confidential Copyright © Do not distribute
Activity 4: Data Governance Maturity Model

Levels: Level 1 Informal; Level 2 Developing; Level 3 Adopted and Implemented; Level 4 Managed and Repeatable; Level 5 Integrated and Optimized.

Data Governance
Level 1: Attention to Data Governance is informal and incomplete. There is no formal governance process.
Level 2: Data Governance Program is forming with a framework for purpose, principles, structures and roles.
Level 3: Data Governance structures, roles and processes are implemented and fully operational.
Level 4: Data Governance structures, roles and processes are managed and empowered to resolve data issues.
Level 5: Data Governance Program functions with proven effectiveness.

Culture
Level 1: Limited awareness about the value of dependable data.
Level 2: General awareness of the data issues and needs for business decisions.
Level 3: There is active participation and acceptance of the principles, structures and roles required to implement a formal Data Governance Program.
Level 4: Data is viewed as a critical, shared asset. There is widespread support, participation and endorsement of the Data Governance Program.
Level 5: Data governance structures and participants are integral to the organization and critical across all functions.

Data Quality
Level 1: Limited awareness that data quality problems affect decision-making. Data cleanup is ad hoc.
Level 2: General awareness of data quality importance. Data quality procedures are being developed.
Level 3: Data issues are captured proactively through standard data validation methods. Data assets are identified and valuated.
Level 4: Expectations for data quality are actively monitored and remediation is automated.
Level 5: Data quality efforts are regular, coordinated and audited. Data are validated prior to entry into the source system wherever possible.

Communication
Level 1: Information regarding data is limited through informal documentation or verbal means.
Level 2: Written policies, procedures, data standards and data dictionaries may exist, but communication and knowledge of them is limited.
Level 3: Data standards and policies are communicated through written policies, procedures and data dictionaries.
Level 4: Data standards and policies are completely documented, widely communicated and enforced.
Level 5: All employees are trained and knowledgeable about data policies and standards and where to find this information.

Roles & Responsibilities
Level 1: Roles and responsibilities for data management are informal and loosely defined.
Level 2: Roles and responsibilities for data management are forming. Focus is on areas where data issues are apparent.
Level 3: Expectations of data ownership and valuation of data are clearly defined.
Level 4: Roles and responsibilities for data governance are well established and the lines of accountability are clearly understood.
Level 5: Roles and responsibilities are well-defined and a chain of command exists for questions regarding data and processes.
Starbucks Corporate Mission Statement Our Starbucks Mission Statement Our mission: to inspire and nurture the human spirit – one person, one cup and one neighborhood at a time. Here are the principles of how we live that every day: Our Coffee It has always been, and will always be, about quality. We’re passionate about ethically sourcing the finest coffee beans, roasting them with great care, and improving the lives of people who grow them. We care deeply about all of this; our work is never done. Our Partners We’re called partners, because it’s not just a job, it’s our passion. Together, we embrace diversity to create a place where each of us can be ourselves. We always treat each other with respect and dignity. And we hold each other to that standard. Our Customers When we are fully engaged, we connect with, laugh with, and uplift the lives of our customers – even if just for a few moments. Sure, it starts with the promise of a perfectly made beverage, but our work goes far beyond that. It’s really about human connection. Our Stores When our customers feel this sense of belonging, our stores become a haven, a break from the worries outside, a place where you can meet with friends. It’s about enjoyment at the speed of life – sometimes slow and savored, sometimes faster. Always full of humanity. Our Neighborhood Every store is part of a community, and we take our responsibility to be good neighbors seriously. We want to be invited in wherever we do business. We can be a force for positive action – bringing together our partners, customers, and the community to contribute every day. Now we see that our responsibility – and our potential for good – is even larger. The world is looking to Starbucks to set the new standard, yet again. We will lead. Our Shareholders We know that as we deliver in each of these areas, we enjoy the kind of success that rewards our shareholders. 
We are fully accountable to get each of these elements right so that Starbucks – and everyone it touches – can endure and thrive. © 2014 Cengage Learning. All rights reserved. May not be copied, scanned, or duplicated, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website for classroom use.
Springer, May 2015. Charu C. Aggarwal. Comprehensive textbook on data mining (see our secret site). The emergence of data science as a discipline requires the development of a book that goes beyond the focus of books on fundamental data mining problems. More emphasis needs to be placed on advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. This comprehensive book explores the different aspects of data mining, from the fundamentals to the complex data types and their applications. Therefore, this book may be used for both introductory and advanced data mining courses. The chapters fall into one of three categories: 1. Problem chapters cover the four main data mining problems: clustering, classification, association pattern mining, and outlier analysis. 2. Domain chapters discuss the specific methods used for different domains of data such as text data, time-series data, sequence data, graph data, and spatial data. 3. Application chapters study applications: stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation. About the Author: Charu Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996. He has worked extensively in the field of data mining, with particular interests in data streams, privacy, uncertain data and social network analysis. He has published 14 books (3 authored and 11 edited), over 250 papers in refereed venues, and has applied for or been granted over 80 patents. His h-index is 70. Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM.
He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research. He has received two best paper awards and an EDBT Test-of-Time Award (2014). He has served as the general or program co-chair of the IEEE Big Data Conference (2014), the ICDM Conference (2015), the ACM CIKM Conference (2015), and the KDD Conference (2016). He also co-chaired the data mining track at the WWW Conference 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery from Data, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the IEEE Transactions on Big Data, and an associate editor of the Knowledge and Information Systems Journal. He is editor-in-chief of the ACM SIGKDD Explorations. He is a fellow of SIAM (2015), the ACM (2013) and the IEEE (2010) for "contributions to knowledge discovery and data mining techniques." Mohammed Zaki’s Data Mining book (see our secret site). Bipartite Communities, Matthew P. Yancey, April 15, 2015. A recent trend in data mining is finding communities in a graph. A community is a vertex set such that the number of edges inside it is greater than expected (cliques in social networks, families of proteins in protein-protein interaction networks, groups of similar products in recommendation systems, ...). An up-to-the-moment survey on community detection: S. Fortunato, “Community Detection in Graphs,“ arXiv 0906.0612v2. In graph clustering, one looks for a quantitative definition of community. No definition is universally accepted.
Intuitively, a community has more edges “inside” it than edges linking it to the outside. Some communities are defined only algorithmically (as the final product of an algorithm, without a precise a priori definition). Let subgraph C have nc vertices and let G have n vertices. The internal [external] degree of v∈C, kvint [kvext], is the number of edges connecting v to other vertices of C [to the rest of the graph]. If kvext=0, the vertex has neighbors only within C. If kvint=0, instead, the vertex is disjoint from C and would be better assigned to a different cluster. The internal degree kintC of C is the sum of the internal degrees of its vertices. The external degree kextC of C is the sum of the external degrees of its vertices. The total degree kC is the sum of the degrees of the vertices of C. The intra-cluster density δint(C) = (# internal edges of C) / (# possible internal edges) = #int_edges_C / (nc(nc−1)/2). The inter-cluster density δext(C) = (# inter-cluster edges of C) / (nc(n−nc)). Finding the best tradeoff between a large δint(C) and a small δext(C) is, implicitly or explicitly, the goal of most clustering algorithms.
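The two densities above can be computed directly from an edge list. The sketch below (the example graph is hypothetical) evaluates δint(C) and δext(C) for a candidate community:

```python
# Sketch: intra- and inter-cluster edge densities for a candidate community C,
# following the definitions above. The example graph is made up.
def densities(edges, C, n):
    """edges: set of frozenset vertex pairs; C: set of community vertices;
    n: total number of vertices in G."""
    nc = len(C)
    internal = sum(1 for e in edges if all(v in C for v in e))
    external = sum(1 for e in edges if sum(v in C for v in e) == 1)
    delta_int = internal / (nc * (nc - 1) / 2)   # fraction of possible internal edges
    delta_ext = external / (nc * (n - nc))       # fraction of possible boundary edges
    return delta_int, delta_ext

# Triangle {0,1,2} plus one edge out to vertex 3:
edges = {frozenset(p) for p in [(0, 1), (1, 2), (0, 2), (2, 3)]}
d_int, d_ext = densities(edges, {0, 1, 2}, 4)
print(d_int, d_ext)  # 1.0 and 1/3: dense inside, sparse toward the outside
```

A good community scores near the top on δint and near the bottom on δext, exactly the tradeoff named in the text.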
Existing Strategies & Performance

Our core business is our trading and investing customer franchise. Building on the strengths of this franchise, our growth strategy is focused on four areas: retail brokerage, corporate services and market making, wealth management, and banking.

• Our retail brokerage business is our foundation. We believe a focus on these key factors will position us for future growth in this business: growing our sales force with a focus on long-term investing, optimizing our marketing spend, continuing to develop innovative products and services and minimizing account attrition.

• Our corporate services and market making businesses enhance our strategy by allowing us to realize additional economic benefit from our retail brokerage business. Our corporate services business is a leading provider of software and services for managing equity compensation plans and is an important source of new retail brokerage accounts. Our market making business allows us to increase the economic benefit of the order flow from the retail brokerage business as well as generate additional revenues through external order flow.

• We also plan to expand our wealth management offerings. Our vision is to provide wealth management services that are enabled by innovative technology and supported by guidance from professionals when needed.

• Our retail brokerage business generates a significant amount of customer cash, and we plan to continue to utilize our bank to optimize the value of these customer deposits.
Redundant Array of Independent Disks (RAID): data organization on multiple disks.

RAID0: Multiple disks striped for higher data rate; no redundancy.
RAID1: Mirrored disks (each data disk has a mirror disk).
RAID2: Error-correcting code.
RAID3: Bit- or byte-level striping with a dedicated parity/checksum disk; with data disks A, B, C, D and parity disk P, A⊕B⊕C⊕D⊕P = 0, so a lost disk can be rebuilt (e.g., B = A⊕C⊕D⊕P).
RAID4: Parity/checksum applied to sectors, not bits or bytes, on a dedicated parity disk (plus a spare disk).
RAID5: Parity/checksum distributed across several disks.
RAID6: Parity and a second check distributed across several disks.

Fig. 19.5 RAID levels 0-6, with a simplified view of data organization. Computer Architecture, Memory System Design
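The parity identity in the figure (A⊕B⊕C⊕D⊕P = 0) is what lets the parity-based RAID levels survive a single disk loss. A minimal sketch, with made-up stripe contents:

```python
# Sketch: RAID-3/4/5 style parity. The parity block is the byte-wise XOR of
# the data blocks, so any one lost block can be rebuilt from the survivors
# plus parity (the slide's identity B = A xor C xor D xor P).
from functools import reduce

def parity(blocks):
    """XOR corresponding bytes of each block to form the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def rebuild(surviving, parity_block):
    """Recover a lost data block from the surviving blocks plus parity."""
    return parity(surviving + [parity_block])

a, b, c, d = b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xf0\xf0"
p = parity([a, b, c, d])
assert rebuild([a, c, d], p) == b  # the lost block "B" is recovered
```

The same XOR recovers any single missing block; RAID6 adds a second, independent check so two losses can be tolerated.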
8.4. Metadata Disk Failure: The FsImage and EditLog are central data structures. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use. The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported. 8.5. Snapshots: Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots but will in a future release. 9. Data Organization 9.1. Data Blocks: HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode. 9.2. Staging: A client request to create a file does not reach the NameNode immediately. 
In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost. The above approach has been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client side buffering, the network speed and the congestion in the network impact throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.
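The staging behaviour described above can be sketched roughly as follows; the class and method names are illustrative, not the real HDFS client API:

```python
# Sketch of client-side staging: application writes accumulate in a local
# buffer and are flushed to a DataNode one full block at a time; whatever
# remains is flushed on close. Names here are illustrative, not Hadoop's API.
BLOCK_SIZE = 64 * 1024 * 1024  # the typical 64 MB block size from the text

class StagedWriter:
    def __init__(self):
        self.buffer = bytearray()       # stand-in for the temporary local file
        self.flushed_blocks = []        # stand-in for blocks sent to DataNodes

    def write(self, data):
        self.buffer.extend(data)
        while len(self.buffer) >= BLOCK_SIZE:
            # In real HDFS: contact the NameNode for a block id and a DataNode,
            # then flush the block to that DataNode.
            self.flushed_blocks.append(bytes(self.buffer[:BLOCK_SIZE]))
            del self.buffer[:BLOCK_SIZE]

    def close(self):
        if self.buffer:  # remaining un-flushed data goes out on close
            self.flushed_blocks.append(bytes(self.buffer))
            self.buffer.clear()
        # In real HDFS: tell the NameNode the file is closed (commit).
```

Writing a 70 MB stream through this writer yields one full 64 MB block plus a 6 MB remainder on close, mirroring the flush-on-block / flush-on-close behaviour in the text.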
The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.
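A rough, sequential model of the pipeline, with each DataNode's repository modelled as a plain Python list (illustrative only):

```python
# Sketch of replication pipelining: each 4 KB portion is stored by a DataNode
# and forwarded to the next node in the list, so the block streams through
# the whole chain portion by portion. DataNode stores are plain lists here.
PORTION = 4 * 1024  # the 4 KB portion size from the text

def pipeline_write(block, datanode_stores):
    """Stream `block` through the chain of DataNode stores, 4 KB at a time."""
    for i in range(0, len(block), PORTION):
        portion = block[i:i + PORTION]
        for store in datanode_stores:   # store locally, then forward onward
            store.append(portion)

replicas = [[], [], []]                 # replication factor of three
pipeline_write(b"y" * 10000, replicas)
assert all(b"".join(r) == b"y" * 10000 for r in replicas)
```

In the real system the per-portion forwarding happens concurrently, which is what lets a DataNode receive from its predecessor while sending to its successor.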
Iterative Computation is Difficult
• The system is not optimized for iteration: every iteration re-launches tasks and re-reads the data from disk, so each pass pays a startup penalty and a disk penalty.
[Figure: three successive iterations, each showing the data flowing from disk through CPU 1, CPU 2 and CPU 3, with a startup penalty and a disk penalty incurred at every iteration.]
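The penalty can be shown in miniature: the naive loop below re-reads its input from disk on every iteration, while the cached version loads once and iterates in memory (the file contents are made up):

```python
# Sketch: per-iteration disk reloads vs. loading once and reusing the
# in-memory copy. Both compute the same result; the naive version pays the
# disk penalty on every pass.
import os
import tempfile

with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(b"1 2 3 4")
    path = f.name

def load_from_disk():
    # Stand-in for a real per-iteration input stage (the disk penalty).
    with open(path) as fh:
        return [int(x) for x in fh.read().split()]

# Naive: reload from disk on every iteration.
total_naive = sum(sum(load_from_disk()) for _ in range(3))

# Cached: load once, then iterate over the in-memory copy.
data = load_from_disk()
total_cached = sum(sum(data) for _ in range(3))

assert total_naive == total_cached == 30  # same answer, one disk read vs. three
os.remove(path)
```

At real data sizes the reloads (plus per-iteration job startup) dominate, which is the point the slide's diagram makes.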
Curse of the Slow Job
[Figure: successive iterations in which CPU 1, CPU 2 and CPU 3 each process their data partitions and then wait at a barrier; the slowest job in each iteration holds all the other CPUs at the barrier before the next iteration can begin.]
http://www.www2011india.com/proceeding/proceedings/p607.pdf
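A back-of-the-envelope model of the effect, with made-up per-worker task times:

```python
# Sketch: with a barrier at the end of each iteration, iteration time is the
# time of the slowest worker, so a single straggler slows every CPU.
def iteration_time(worker_times):
    return max(worker_times)  # everyone waits at the barrier for the slowest

fast = [1.0, 1.0, 1.0]              # balanced workers
with_straggler = [1.0, 1.0, 5.0]    # one slow job

total_fast = sum(iteration_time(fast) for _ in range(3))
total_slow = sum(iteration_time(with_straggler) for _ in range(3))
print(total_fast, total_slow)  # 3.0 vs 15.0: one straggler multiplies total time
```

The aggregate work barely changes, but wall-clock time is set entirely by the slow job, which is the "curse" named in the title.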
HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware. Its differences from other distributed file systems are few but significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is part of Apache Hadoop Core http://hadoop.apache.org/core/ 2.1. Hardware Failure: Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are many components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. 2.2. Streaming Data Access: Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications targeted for HDFS. POSIX semantics in a few key areas have been traded to increase data throughput rates. 2.3. Large Data Sets: Applications on HDFS have large data sets, typically gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It provides high aggregate data bandwidth and scales to hundreds of nodes in a single cluster. It supports ~10 million files in a single instance. 2.4. Simple Coherency Model: HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. 
This assumption simplifies data coherency issues and enables high throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future (preserving write-once-read-many semantics at the file level). 2.5. “Moving Computation is Cheaper than Moving Data”: A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located. 2.6. Portability Across Heterogeneous Hardware and Software Platforms: HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications. 3. NameNode and DataNodes: HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. 
The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. The NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode. 4. The File System Namespace: HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features. The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode. 5. Data Replication: HDFS is designed to reliably store very large files across machines in a large cluster. 
It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
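The Heartbeat/Blockreport bookkeeping might be sketched as follows; the class and field names are illustrative, not Hadoop's actual implementation:

```python
# Sketch: the NameNode marks a DataNode live when a Heartbeat arrives and
# records the block list carried by its Blockreport. Names are illustrative.
import time

class NameNode:
    def __init__(self, heartbeat_timeout=10.0):
        self.last_heartbeat = {}   # DataNode id -> timestamp of last Heartbeat
        self.block_map = {}        # DataNode id -> set of block ids it reported
        self.timeout = heartbeat_timeout

    def heartbeat(self, datanode_id, now=None):
        self.last_heartbeat[datanode_id] = time.time() if now is None else now

    def blockreport(self, datanode_id, block_ids):
        self.block_map[datanode_id] = set(block_ids)

    def is_alive(self, datanode_id, now=None):
        now = time.time() if now is None else now
        seen = self.last_heartbeat.get(datanode_id)
        return seen is not None and now - seen < self.timeout

nn = NameNode()
nn.heartbeat("dn1", now=100.0)
nn.blockreport("dn1", ["blk_1", "blk_2"])
assert nn.is_alive("dn1", now=105.0)       # within the heartbeat timeout
assert not nn.is_alive("dn1", now=120.0)   # heartbeat missed: presumed dead
```

When a node stops heartbeating, its reported blocks fall below their replication factor and the real NameNode schedules re-replication, per the text above.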
OUR BELIEFS
• The success of our students is always our first priority.
• We must perform our jobs admirably, giving our best service and support every day, for everyone.
• Teamwork is founded upon people bringing different gifts and perspectives.
• We provide educational opportunities for those who might otherwise not have them.
• We believe in providing employees with a safe and fulfilling work environment, as well as an opportunity to grow and learn.
• Our progress must be validated by setting goals and measuring our achievements.
• We must make decisions that are best for the institution as a whole.
• Building and maintaining trusting relationships with each other is essential.
• Competence and innovation are essential means of sustaining our values in a competitive marketplace.
• We make a positive difference in the lives of our students, our employees, and our communities.
• We believe in the principles of integrity, opportunity and fairness.
• We must prepare our students to be successful in a global environment.
• Our work matters.
One Parallel Iteration

Distributed Memory
• Odd Processors:
    sendRecv(pr data, pr-1 data); mergeHigh(pr data, pr-1 data)
    if (r <= P-2) { sendRecv(pr data, pr+1 data); mergeLow(pr data, pr+1 data) }
• Even Processors:
    sendRecv(pr data, pr+1 data); mergeLow(pr data, pr+1 data)
    if (r >= 1) { sendRecv(pr data, pr-1 data); mergeHigh(pr data, pr-1 data) }

Shared Memory
• Odd Processors:
    mergeLow(pr data, pr-1 data); Barrier
    if (r <= P-2) mergeHigh(pr data, pr+1 data)
    Barrier
• Even Processors:
    mergeHigh(pr data, pr+1 data); Barrier
    if (r >= 1) mergeLow(pr data, pr-1 data)
    Barrier

Notation: r = processor rank, P = number of processors, pr data is the block of data belonging to processor r.
Note: P/2 iterations are necessary to complete the sort.
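A single-process Python sketch of the same odd-even block sort, with the message passing replaced by direct list access (mergeLow keeps the lower half of a pairwise merge, mergeHigh the upper half, following the slide's notation):

```python
# Sketch: sequential stand-in for the parallel odd-even block sort above.
# Each "processor" holds one locally sorted block; alternating even/odd
# phases pairwise-merge neighbouring blocks until the whole array is sorted.
def odd_even_block_sort(blocks):
    P = len(blocks)
    blocks = [sorted(b) for b in blocks]   # each processor sorts locally first
    for _ in range((P + 1) // 2):          # the slide's P/2 iterations
        for phase in (0, 1):  # even phase pairs (0,1),(2,3)...; odd pairs (1,2),(3,4)...
            for r in range(phase, P - 1, 2):
                merged = sorted(blocks[r] + blocks[r + 1])
                blocks[r] = merged[:len(blocks[r])]        # mergeLow
                blocks[r + 1] = merged[len(blocks[r]):]    # mergeHigh
    return blocks

print(odd_even_block_sort([[9, 3], [7, 1], [8, 2], [6, 4]]))
# [[1, 2], [3, 4], [6, 7], [8, 9]]
```

Each full iteration runs one even phase and one odd phase, matching the barrier-separated steps in the shared-memory pseudocode; in the distributed version the `sorted(blocks[r] + blocks[r+1])` step is what sendRecv plus mergeLow/mergeHigh accomplish across two processors.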