In addition to the three science and engineering perspectives on big data described above, all proposals must also include a description of how the project will build capacity.

Capacity-Building (Required). Capacity-building activities are critical to the growth and health of this emerging area of research and education. There are three broad types of capacity-building activities:
1. appropriate models, policies and technologies to support responsible and sustainable big data stewardship;
2. training and communication strategies, targeted to the various research communities and/or the public; and
3. sustainable, cost-effective infrastructure for data storage, access and shared services.

To develop a coherent set of stewardship, outreach and education activities in big data discovery, each research proposal must focus on at least one capacity-building activity. Examples include, but are not limited to:
• novel, effective frameworks of roles and responsibilities for big data stakeholders (i.e., researchers, collaborators, research communities/institutions, funding agencies);
• efficient and effective data management models, considering structure and formatting of data, terminology standards, metadata and provenance, persistent identifiers, and data quality;
• development of accurate cost models and structures;
• establishing appropriate cyberinfrastructure models, prototypes and facilities for long-term sustainable data;
• policies and processes for evaluating data value and balancing cost with value in an environment of limited resources;
• policies and procedures to ensure appropriate access and use of data resources;
• economic sustainability models;
• community standards, provenance tracking, privacy, and security;
• communication strategies for public outreach and engagement;
• education and workforce development; and
• broadening participation in big data activities.
It is expected that at least one PI from each funded project will attend a BIGDATA PI meeting in year two of the initiative to present project research findings and capacity-building or community outreach activities. Requested budgets should include funds for travel to this event. An overarching goal is to leverage all the BIGDATA investments to build a successful science and engineering community that is well trained in dealing with and analyzing big data from various sources.

Finally, a project may choose to focus its science and engineering big data project in an area of national priority, but this is optional.

National Priority Domain Area Option. In addition to the research areas described above, to fully exploit the value of the investments made in large-scale data collection, BIGDATA would also like to support research in particular domain areas, especially areas of national priority, including health IT, emergency response and preparedness, clean energy, cyberlearning, the materials genome, national security, and advanced manufacturing. Research projects may focus on the science and engineering of big data in one or more of these domain areas while simultaneously engaging in the foundational research necessary to make general advances in "big data."

B. Sponsoring Agency Mission-Specific Research Goals

NATIONAL SCIENCE FOUNDATION. NSF intends to support excellent research in the three areas mentioned above in this solicitation. It is important to note that this solicitation represents the start of a multi-year, multi-agency initiative, which at NSF is part of the Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21). Innovative information technologies are transforming the fabric of society, and data is the new currency for science, education, government and commerce.
High performance computing (HPC) has played a central role in establishing the importance of simulation and modeling as the third pillar of science (theory and experiment being the first two), and the growing importance of data is creating the fourth pillar. Science and engineering researchers are pushing beyond the current boundaries of knowledge by addressing increasingly complex questions, which often require sophisticated integration of massive amounts of highly heterogeneous data from theoretical, experimental, observational, and simulation and modeling research programs. These efforts, which rely heavily on teams of researchers, observing and sensor platforms and other data collection efforts, computing facilities, software, advanced networking, analytics, visualization, and models, lead to critical breakthroughs in all areas of
Big Data Characteristics
• Volume – The quantity of generated data is central in this context: it is the size of the data that determines its value and potential, and whether it can be considered Big Data at all. The name "Big Data" itself refers to size.
• Variety – The category, or type, of the data is also essential for analysts to know. Understanding the variety of data helps those who closely analyze it use it effectively.
• Velocity – The speed at which data is generated and processed to meet the demands and challenges of growth and development.
• Variability – The inconsistency the data can show at times, which hampers the process of handling and managing it effectively.
• Veracity – The quality of captured data can vary greatly; the accuracy of any analysis depends on the veracity of the source data.
• Complexity – Data management can become very complex, especially when large volumes of data come from multiple sources. These data must be linked, connected and correlated in order to grasp the information they are meant to convey.
Activity 4: Data Governance Maturity Model

Levels: Level 1 – Informal; Level 2 – Developing; Level 3 – Adopted and Implemented; Level 4 – Managed and Repeatable; Level 5 – Integrated and Optimized.

Data Governance
• Level 1: Attention to Data Governance is informal and incomplete. There is no formal governance process.
• Level 2: A Data Governance Program is forming, with a framework for purpose, principles, structures and roles.
• Level 3: Data Governance structures, roles and processes are implemented and fully operational.
• Level 4: Data Governance structures, roles and processes are managed and empowered to resolve data issues.
• Level 5: The Data Governance Program functions with proven effectiveness.

Culture
• Level 1: Limited awareness about the value of dependable data.
• Level 2: General awareness of the data issues and needs for business decisions.
• Level 3: There is active participation in and acceptance of the principles, structures and roles required to implement a formal Data Governance Program.
• Level 4: Data is viewed as a critical, shared asset. There is widespread support, participation and endorsement of the Data Governance Program.
• Level 5: Data governance structures and participants are integral to the organization and critical across all functions.

Data Quality
• Level 1: Limited awareness that data quality problems affect decision-making. Data cleanup is ad hoc.
• Level 2: General awareness of data quality importance. Data quality procedures are being developed.
• Level 3: Data issues are captured proactively through standard data validation methods. Data assets are identified and valuated.
• Level 4: Expectations for data quality are actively monitored and remediation is automated. Data quality efforts are regular, coordinated and audited.
• Level 5: Data are validated prior to entry into the source system wherever possible.

Communication
• Level 1: Information regarding data is limited to informal documentation or verbal means.
• Level 2: Written policies, procedures, data standards and data dictionaries may exist, but communication and knowledge of them is limited.
• Level 3: Data standards and policies are communicated through written policies, procedures and data dictionaries.
• Level 4: Data standards and policies are completely documented, widely communicated and enforced.
• Level 5: All employees are trained and knowledgeable about data policies and standards and where to find this information.

Roles & Responsibilities
• Level 1: Roles and responsibilities for data management are informal and loosely defined.
• Level 2: Roles and responsibilities for data management are forming. Focus is on areas where data issues are apparent.
• Level 3: Expectations of data ownership and valuation of data are clearly defined.
• Level 4: Roles and responsibilities for data governance are well established and the lines of accountability are clearly understood.
• Level 5: Roles and responsibilities are well-defined and a chain of command exists for questions regarding data and processes.
Charu C. Aggarwal, comprehensive textbook on data mining, Springer, May 2015 (see our secret site). The emergence of data science as a discipline requires the development of a book that goes beyond the focus of books on fundamental data mining problems. More emphasis needs to be placed on advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. This comprehensive book explores the different aspects of data mining, from the fundamentals to the complex data types and their applications; it may therefore be used for both introductory and advanced data mining courses. The chapters fall into one of three categories:
1. Problem chapters: data mining has four main problems, which correspond to clustering, classification, association pattern mining, and outlier analysis.
2. Domain chapters discuss the specific methods used for different domains of data, such as text data, time-series data, sequence data, graph data, and spatial data.
3. Application chapters study applications: stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation.

About the Author: Charu Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996. He has worked extensively in the field of data mining, with particular interests in data streams, privacy, uncertain data and social network analysis. He has published 14 books (3 authored and 11 edited) and over 250 papers in refereed venues, and has applied for or been granted over 80 patents. His h-index is 70. Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM.
He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and an IBM Research Division Award (2008) for his scientific contributions to data stream research. He has received two best paper awards and an EDBT Test-of-Time Award (2014). He has served as the general or program co-chair of the IEEE Big Data Conference (2014), the ICDM Conference (2015), the ACM CIKM Conference (2015), and the KDD Conference (2016), and co-chaired the data mining track at the WWW Conference 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery from Data, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the IEEE Transactions on Big Data, and an associate editor of the Knowledge and Information Systems Journal. He is editor-in-chief of ACM SIGKDD Explorations. He is a fellow of SIAM (2015), the ACM (2013) and the IEEE (2010) for "contributions to knowledge discovery and data mining techniques."

Mohammed Zaki's Data Mining book (see our secret site)

Bipartite Communities, Matthew P. Yancey, April 15, 2015

A recent trend in data mining is finding communities in a graph. A community is a vertex set such that the number of edges inside it is greater than expected (cliques in social networks, families of proteins in protein-protein interaction networks, groups of similar products in recommendation systems, ...). An up-to-the-moment survey on community detection: S. Fortunato, "Community Detection in Graphs," arXiv:0906.0612v2. In graph clustering, one looks for a quantitative definition of community; no definition is universally accepted.
Intuitively, a community has more edges "inside" than links to the outside. Communities may also be algorithmically defined (the final product of an algorithm, without a precise a priori definition). Let subgraph C have n_c vertices and the graph G have n vertices. The internal [external] degree of v ∈ C, k_v^int [k_v^ext], is the number of edges connecting v to other vertices of C [to the rest of the graph]. If k_v^ext = 0, the vertex has neighbors only in C. If instead k_v^int = 0, the vertex is disjoint from C and would be better assigned to a different cluster. The internal degree k_C^int of C is the sum of the internal degrees of its vertices; the external degree k_C^ext of C is the sum of the external degrees of its vertices; the total degree k_C is the sum of the degrees of the vertices of C. The intra-cluster density δ_int(C) = (# internal edges of C) / (n_c(n_c − 1)/2) is the number of internal edges divided by the number of possible internal edges. The inter-cluster density is δ_ext(C) = (# inter-cluster edges of C) / (n_c(n − n_c)). Finding the best tradeoff between large δ_int(C) and small δ_ext(C) is, implicitly or explicitly, the goal of most clustering algorithms.
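A minimal sketch of these density computations in plain Python (the adjacency-set graph representation is an assumption for illustration, not from the text):

```python
def cluster_densities(adj, C):
    """Intra- and inter-cluster density of a candidate community C.

    adj: dict mapping each vertex to the set of its neighbors (undirected graph).
    C:   iterable of vertices forming the candidate community.
    Returns (delta_int, delta_ext) as defined above.
    """
    C = set(C)
    n, nc = len(adj), len(C)
    # Summing k_v^int over v in C counts each internal edge twice.
    k_int = sum(1 for v in C for u in adj[v] if u in C)
    k_ext = sum(1 for v in C for u in adj[v] if u not in C)
    internal_edges = k_int // 2
    delta_int = internal_edges / (nc * (nc - 1) / 2)
    delta_ext = k_ext / (nc * (n - nc))
    return delta_int, delta_ext

# Triangle {0, 1, 2} with a single edge out to vertex 3:
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
di, de = cluster_densities(adj, {0, 1, 2})  # di = 1.0, de = 1/3
```

For the triangle, all 3 of the 3 possible internal edges exist (δ_int = 1) and only 1 of the 3·1 possible outgoing edges exists (δ_ext = 1/3), so the triangle is a good community by this criterion.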
Big Data and the Cloud
• Big Data describes large and complex data that cannot be managed by traditional data management tools.
• From petabytes to exabytes to zettabytes of data.
• Tools are needed for capture, storage, search, sharing, analysis, and visualization of big data.
• Examples include web logs, RFID and surveillance data, sensor networks, social network data (graphs), text and multimedia, data pertaining to astronomy, atmospheric science, genomics, biogeochemical and biological fields, and video archives.
• Big Data technologies: Hadoop/MapReduce platform, Hive platform, Twitter Storm platform, Google App Engine, Amazon EC2 cloud, offerings from Oracle and IBM for big data management; others: Cassandra, Mahout, Pig Latin, ...
• Cloud computing is emerging as a critical tool for big data management.
• It is critical to maintain security and privacy for big data.
Environmental data mining, analysis, management, and integration into GIS and groundwater modeling

PROTECT: Puerto Rico Testsite for Exploring Contamination Threats
C. Butscher, C. Yegen, R. Ghasemizadeh, C. Irizarry, J. Howard, I. Padilla, D. Kaeli, A. Alshawabkeh

A centralized repository for effective data sharing between all program projects to support environmental data analysis.

Scope:
• Unified system for data entry, cleaning and management across the PROTECT program (environmental & biomedical data)
• Allows multi-layered data management inquiries
• Provides all data via the Internet
• Supports contamination transport modeling and GIS analyses

Environmental data:
• Historical data: groundwater data from approx. 1000 wells and springs, including water levels and contaminant concentrations
• Newly collected field data: 10 wells, 2 springs and 120 water taps. Two samples per year are analyzed for water quality data (e.g., pH, dissolved oxygen, common ions), CVOCs and phthalates
• NPL Superfund sites (21) and RCRA corrective action sites (49)
• Continuous records of meteorological, surface water and groundwater data

From data mining to analysis:
• Data mining: field collection; literature research and databases; lab analysis
• Data dictionary: data entry in a unified format
• Data repository: data cleaning, data upload, data storage
• Web interface: queries, download/export
• Analysis: GIS, groundwater modeling, statistics

Live data in the online repository (status): 865 wells, 34 springs, 124 environmental samples.

The PROTECT online data repository collects environmental data from the field campaigns and lab analyses of the program, as well as historical data provided in reports and databases by different agencies. Collected data are shared across the PROTECT program via the Internet and are a basic resource for statistical analysis, GIS analysis and groundwater modeling. The PROTECT online data repository also collects biomedical data (e.g., urine, blood, pregnancy tissues, demographic data), which is not shown here.

Application: Groundwater models and GIS analyses rely on effective access to collected and cleaned environmental data.

Data structure: Tables and relations of environmental data within the PROTECT online data repository.

Poster presented at the retreat of the PROTECT program in Dorado (PR), Feb. 24-25, 2012. This program is supported by Award Number P42ES017198 from the National Institute of Environmental Health Sciences. www.neu.edu/protect
Identification and Visualization of Magnetic Conjunctions for Multi-Altitude Cusp Studies (SM53A-0401)
W. Keith(1), M. Goldstein(1), T. Stubbs(1), D. Winningham(2), A. Fazakerley(3), H. Reme(4), and A. Balogh(5)
(1) NASA Goddard Space Flight Center, Code 692, Greenbelt, MD 20771, USA
(2) Southwest Research Institute, P. O. Drawer 28510, San Antonio, TX 78228, USA
(3) Mullard Space Science Lab, Holmbury St Mary, Dorking, Surrey, RH5 6NT, UK
(4) CESR BP 4346, 9 Ave Colonel Roche, Cedex, Toulouse, 31029, France
(5) Imperial College Space and Atmospheric Physics group, The Blackett Laboratory, London, UK

Abstract: Multi-satellite studies of the magnetosphere often require that the missions be on similar magnetic field lines and/or in the same magnetospheric region. An automated process has been developed to locate such conjunctions and plot the results for inspection. New visualization tools such as SDDAS/Orbit, ViSBARD, and OVT can then be used to further study the data. These techniques are being utilized to study the cusp at low and mid altitudes with the DMSP and Cluster missions. The comparison of particle spectrograms from different altitudes in the cusp is important to our understanding of magnetospheric entry processes. The magnetospheric cusps act as conduits through which shocked solar wind plasma can penetrate to low altitudes. Although this plasma entry persists under all magnetospheric conditions, it is a very dynamic and complex process, strongly affected by external solar wind conditions. Low-altitude measurements detect a smaller cusp that is crossed relatively quickly, while at mid altitudes the extent of the cusp is larger and the traversals much slower, blurring the distinction between temporal and spatial features and complicating conjunction studies with low-altitude data. Orbit and particle data from both missions have been searched over a thirteen-month period for near-simultaneous cusp region crossings.
This technique has found a total of 25 good-quality conjunctions, one of which will be shown, making use of the visualization tools mentioned above. These conjunctions show similar complex structures that may lead to a greater understanding of particle entry in the cusp.

OVT: The Orbit Visualization Tool is primarily useful for plotting spacecraft orbits in various coordinate systems and for magnetic field line tracing using one of several field models with user-supplied solar wind parameters. The OVT development team is led by Kristof Stasiewicz at IRFU in Uppsala, Sweden (http://ovt.irfu.se). Despite some quirkiness in using the software, it is pretty much the only game in town if you want a 3D visualization of Tsyganenko 2001 field lines displayed along with satellite positions (Figure 3).

ViSBARD: The aptly named Visual System for Browsing, Analysis, and Retrieval of Data is under active development at NASA Goddard Space Flight Center, led by Aaron Roberts. ViSBARD provides 2D and 3D displays of scalar and vector data at the spacecraft location (http://nssdcftp.gsfc.nasa.gov/selected_software/visbard/). While the "retrieval" portion of the system is still in the very early stages and the loading of data can at times be tedious, the software will read CDF and ASCII data (via Resource Description Files) and robustly display them in a variety of useful ways in GSE coordinates (Figures 5-6).

Orbit: Orbit is one of many applications within the Southwest Data Display and Analysis System. SDDAS development is led by David Winningham at Southwest Research Institute (http://www.sddas.org/). Like ViSBARD, Orbit can plot a variety of vector and scalar data along an orbital path. Because it does little bookkeeping with regard to units and coordinates, it is very versatile; however, the user must be careful that the resulting display makes sense.
Orbit has been used in this study primarily to visualize the magnetic ground tracks of the various missions to give a common context to the various altitudes (Figure 8).

Identification: Due to the precession of the Cluster orbit throughout the year, cusp crossings take place primarily at high altitudes (dayside apogee) in February-April and at mid altitudes (dayside perigee) in August-October. To date, January through October 2001 and the fall season of 2002 have been searched in this study. Because of the much longer orbital period of Cluster, it was the initial driver in identifying cusp crossings. An automated process searched for times when the low-altitude mapped AACGM position of Cluster was within a pre-determined "cusp box". These crossing times were checked and refined by browsing the particle data of the CIS and PEACE experiments. Output generated from these crossing times (see Figure 1) gives an overview of the Cluster particle data (spectra), as well as an indication of possible magnetic close approaches with the various DMSP satellites (lines). These times were then searched by an automated process for times when the low-altitude mapped magnetic position of the Cluster centroid (center-of-mass position) was within 5 degrees (~550 km) of the magnetic footprint of one of the DMSP spacecraft (F12, F13, F14, or F15). Each of these closest approaches was then plotted +/- 2 minutes and ranked according to the quality of the cusp in the particle data (Figure 2). The best-quality DMSP passes combined with the corresponding Cluster passes could then be compared in detail to determine the quality of the conjunction. This has resulted in 165 total Cluster cusp crossings over which the DMSP data has been searched, and has led to the identification of 299 DMSP magnetic close-approaches, of which 80 have been rated as "good" by visual inspection of the particle signatures.
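The footprint close-approach test described above can be sketched in a few lines of Python. This is a simplified stand-in for the actual automated search, not the authors' code; the track layout and function names are illustrative assumptions.

```python
import math

def angular_sep_deg(lat1, lon1, lat2, lon2):
    """Great-circle separation between two magnetic footprints, in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_sep = (math.sin(p1) * math.sin(p2) +
               math.cos(p1) * math.cos(p2) * math.cos(dlon))
    # Clamp against floating-point drift before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_sep))))

def close_approaches(cluster_track, dmsp_track, threshold_deg=5.0):
    """Flag times when a DMSP footprint is within threshold_deg (~550 km at
    5 degrees) of the mapped Cluster centroid footprint.

    Each track is a time-aligned list of (time, mag_lat, mag_lon) tuples.
    """
    hits = []
    for (t, clat, clon), (_, dlat, dlon) in zip(cluster_track, dmsp_track):
        if angular_sep_deg(clat, clon, dlat, dlon) <= threshold_deg:
            hits.append(t)
    return hits
```

Each flagged time would then be inspected in the particle data, as the poster describes, to rank the quality of the conjunction.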
Forty-six of the Cluster cusp passes over the periods covered so far contained at least one viable DMSP crossing, and many contained several. In some cases the times when the spacecraft were sampling the cusp and the time of closest approach did not coincide, but there was good coincidence on 25 of the 46 days. A representative example of the 25 is presented here, from 14 August 2001, when the Cluster trajectory was a roughly perpendicular mid-altitude crossing (Figure 3).

[Table 1, a feature comparison of OVT, ViSBARD, and SDDAS/Orbit covering supported operating systems (Windows, Linux, Macintosh, UNIX source) and field models (Tsyganenko 87/89/96/01; magnetopause, bow shock and electric potential driven by IMF, pressure and velocity, not time-dependent), appears here in the original poster layout.]

Of particular interest in this study is the ability to plot magnetic footprints at the Earth's surface (Figure 4) and the ability to show an intersecting field line over its entire length.

Visualization: Once a conjunction has been identified, the problem of understanding and showing how these data fit together becomes largely a problem of data visualization. Three different software tools have been employed for various tasks: OVT for field-line and footprint mapping, ViSBARD for plotting multiple data sets at the spacecraft location, and SDDAS/Orbit for showing flux along ground tracks. In our example pass, Clusters 1, 2, and 4 pass poleward through the cusp at about the same time, with Cluster-3 lagging behind about 30 minutes. In Figure 3 (from OVT, edited to add labels), the field lines passing through the Cluster spacecraft are shown at the time of closest approach with DMSP F13. Note that the DMSP field line remains very close to the three Cluster lines throughout its length, indicating a good magnetic conjunction. Figure 8 (from Orbit, labels added) shows the footprint tracings with ion energy flux magnitude coloration.
F14 and F15 see the cusp at about the same place at opposite ends of the time range (until the F14 data drop out just poleward of the location of the Cluster centroid). F13 samples the cusp along approximately constant invariant latitude at the beginning of the period, passing very close to the mapped Cluster location. Ion and electron density data for Cluster are shown in Figure 7 (from ViSBARD, labels added) and indicate a poleward drift of the cusp during this interval (between Cluster-3 and the others), which is consistent with the northward turning of the IMF during this time. Although a lot of data can be shown together in this way (e.g., Figure 5, where electron and ion temperature (glyphs), electron and ion density (arrow colors), ion velocity (arrow 1) and magnetic field (arrow 2) are shown), it can be difficult to assimilate so much data all at once. The relative altitudes also make it difficult to view similar data from the two missions (as in Figures 6 and 7) in the same frame. The summaries of the three visualization packages presented at right reflect their particular application in this conjunction study and are not intended to be comprehensive reviews. The opinions expressed are those of the primary author.

Conclusions: This search method has resulted in the identification of 25 good conjunctions within the magnetospheric cusps between the Cluster and DMSP missions. As the example shown here indicates, there is much that can be learned with such conjunctions about the nature of plasma entry in the cusps that would not be possible with the individual measurements. The visualization tools used in this work each bring unique abilities to the problem; however, none are yet well enough developed to do all that is required (Table 1). The best result comes from knowing when to use each one and exploit its strengths, while avoiding the pitfalls that each also brings.
[Further rows of Table 1 (display options, coordinate systems, data formats such as CDF and ASCII via RDF, IDFS, orbit element files, online retrieval via CDAWeb, and image output formats such as pnm/bmp/tif, png/gif/jpg/tif and PS/animated gif) appear here in the original poster layout.]

ViSBARD can be "tricked" into displaying other coordinates, although the Earth position becomes inaccurate. In Figure 7, some of the data from Figure 5 (density) are shown in SM coordinates. Some of the cusp displacement at Cluster-3 can now be seen to be due to diurnal motion of the magnetic pole.

This is an example of when the user must be careful: when first plotting ground tracks, the non-rotating Earth map gave the wrong impression regarding geographic location, so the user-designed grid map shown was used to accurately give context to the magnetic position data. Figure 9 shows the Cluster PEACE electron density in GSE (top) and SM (bottom), analogous to Figures 5 and 7. Orbit can only show one scalar or vector per satellite (for up to four satellites), which may be a relief to those still trying to sort out Figure 5.

The line-by-line data entry for the field model parameters is a bit clunky; however, the data is actually stored in simple ASCII files. A data dump from a convenient source (in this case time-lagged ACE data from IDFS) into the appropriate format yields an acceptable solution over certain time ranges, although an automated process would be far superior. Parameters for the magnetopause, bow shock, and polar cap potential work in the same way and can therefore be as dynamic as the input values they are given.
Orbits, calculated from compact orbit element files, may be plotted in GSE, GSM, SMC (SM), GEO or GEI, while magnetic footprints use GEO or SMC. A disadvantage for some is the lack of a version compiled for Macintosh OS X; while it is presumably possible to compile the Unix source for this platform, it has not, to my knowledge, been done. The Windows version, on the other hand, self-installs easily and is currently the version of choice for usability.

Pre-compiled binaries for Windows, Mac and Linux make ViSBARD simple to install and update. Future work that will make this software much better includes Tsyganenko field lines and the ability to read and display in different coordinate systems.

The large repository of data in the IDFS format gives Orbit a wide variety of inputs, as well as the ability to plot data after it undergoes mathematical computations via files called SCFs (as in Figure 8, where the ion flux has been integrated on the fly). It cannot read data of other formats, however, and there is currently no built-in ability to perform coordinate transforms on the data. Orbit is available as part of a suite of applications within SDDAS, with binaries available for Solaris, HP-UX, Irix, Linux, Mac OS X, and Windows. The Windows version requires UNIX emulation (such as Cygwin) and an X server, but is still usable.
An Example Data Set and Decision Tree

 #   Outlook  Company  Sailboat  Sail?
 1   sunny    big      small     yes
 2   sunny    med      small     yes
 3   sunny    med      big       yes
 4   sunny    no       small     yes
 5   sunny    big      big       yes
 6   rainy    no       small     no
 7   rainy    med      small     yes
 8   rainy    big      big       yes
 9   rainy    no       big       no
10   rainy    med      big       no

Decision tree:
• outlook = sunny → yes
• outlook = rainy → test company:
  – company = no → no
  – company = big → yes
  – company = med → test sailboat: small → yes; big → no
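The tree above can be written out directly as a classifier and checked against all ten rows of the table. A plain-Python sketch:

```python
def sail(outlook, company, sailboat):
    """Classify a day using the decision tree: outlook first, then company,
    then sailboat size for the ambiguous rainy/med branch."""
    if outlook == "sunny":
        return "yes"
    # outlook == "rainy"
    if company == "no":
        return "no"
    if company == "big":
        return "yes"
    # company == "med": decide on sailboat size
    return "yes" if sailboat == "small" else "no"

DATA = [  # (outlook, company, sailboat, sail?)
    ("sunny", "big", "small", "yes"), ("sunny", "med", "small", "yes"),
    ("sunny", "med", "big", "yes"), ("sunny", "no", "small", "yes"),
    ("sunny", "big", "big", "yes"), ("rainy", "no", "small", "no"),
    ("rainy", "med", "small", "yes"), ("rainy", "big", "big", "yes"),
    ("rainy", "no", "big", "no"), ("rainy", "med", "big", "no"),
]
assert all(sail(o, c, s) == label for o, c, s, label in DATA)
```

The final assertion confirms the tree reproduces every row of the training table exactly.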
Keeping Up With... Big Data Library administration and management should examine what types of big data sets their library could be gathering and analyzing using big data tools. Does your library have an opportunity to measure something new, some massive data set which previously was out of your reach because of software and hardware constraints? From the side of big data curation, could your library, as part of storing your faculty’s scholarly research and making it accessible, also store and mount your faculty’s raw research data for others to use? […] Or you could be the thought leader on big data curation at your institution by providing guidance to storing and making accessible big data sets. Now is the opportunity for your library to understand the issues and opportunities big data offers to researchers, administration, and the librarians at your institution.* http://www.ala.org/acrl/publications/keeping_up_with/big_data XI International Conference on University Libraries November 6-8, 2013
Recommendations
• Introduce big-data concepts as an integral part of the undergraduate curriculum. For example, in CS: a simple word count of big data in CS1, the map-reduce algorithm in CS2, cloud storage and BigTable in database systems, Hadoop in distributed systems, and the full range of big-data analytics in elective courses such as Machine Learning and Data Mining.
• Use compelling examples based on real-world datasets.
• Train the educators: big-data professional development for the academic core is critical.
• Expose the administrators to the use of big data applications/tools in all possible areas, such as institutional analysis; data collected at various educational institutions is a gold mine for macro-level analytics ("What is the trend?" "Are they learning?").
• Train the counselors who advise high school students, and college entry-level counselors.
• Include the community colleges and four-year colleges.
• Investment from major industries is needed (mentoring, educator days, etc.).

Symposium on Big Data Science and Engineering, 10/19/2012
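As an illustration of the CS1/CS2 word-count example mentioned above, here is a toy word count in map/reduce style. This is a pure-Python stand-in for the Hadoop version, and the phase names are illustrative, not a real framework API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Shuffle/sort by key, then reduce: sum the counts for each word."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

docs = ["big data is big", "data science"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
counts = reduce_phase(pairs)
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'science': 1}
```

In a real MapReduce job the map and reduce phases run in parallel across many machines, with the framework performing the shuffle/sort; the sorted-then-grouped step here plays that role on a single machine.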
Big Data creates both challenges and opportunities for education
• Challenges for education: Education for Big Data – educate many data scientists and engineers quickly and affordably
• Opportunities for education: Big Data for Education – leverage Big Data technology to scale up and improve education
• Big Data and education are mutually beneficial – integration!
  – Education supplies the workforce for developing innovative Big Data technology and applications
  – Big Data supplies the technology for scaling up and improving the quality of education
Redundant Array of Independent Disks (RAID): Data organization on multiple disks

• RAID 0: Multiple disks for higher data rate; no redundancy (data disks 0-2).
• RAID 1: Mirrored disks (each data disk paired with a mirror disk).
• RAID 2: Error-correcting code.
• RAID 3: Bit- or byte-level striping with a parity/checksum disk (data disks A-D plus parity disk P and a spare). The parity satisfies A ⊕ B ⊕ C ⊕ D ⊕ P = 0, so a lost block can be rebuilt, e.g. B = A ⊕ C ⊕ D ⊕ P.
• RAID 4: Parity/checksum applied to sectors, not bits or bytes.
• RAID 5: Parity/checksum distributed across several disks.
• RAID 6: Parity and a second check distributed across several disks.

Fig. 19.5: RAID levels 0-6, with a simplified view of data organization. (Computer Architecture, Memory System Design)
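The XOR parity relation behind RAID 3-5 can be demonstrated in a few lines of Python. This is a sketch of the idea with blocks modeled as byte strings, not a disk-array implementation:

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    return reduce(lambda x, y: bytes(a ^ b for a, b in zip(x, y)), blocks)

# Four data blocks in one stripe:
A, B, C, D = b"\x01\x02", b"\xaa\x55", b"\x10\x20", b"\x0f\x0f"
P = xor_blocks([A, B, C, D])            # parity written when the stripe is stored
B_rebuilt = xor_blocks([A, C, D, P])    # rebuild B after losing its disk
assert B_rebuilt == B                   # A ^ B ^ C ^ D ^ P == 0 guarantees recovery
```

Because x ⊕ x = 0, XOR-ing the surviving blocks with the parity cancels everything except the missing block, which is exactly the identity B = A ⊕ C ⊕ D ⊕ P from the slide.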