Questions and Answers
Which factor is most important when choosing between programming languages, statistical solutions, and visualization tools for big data analytics?
- Your background in programming and statistical knowledge. (correct)
- The popularity of the tool in the industry.
- The availability of online tutorials and documentation.
- The cost of the software licenses.
In the context of Python, what is the primary role of the NumPy library?
- Facilitating web-based interactive notebooks.
- Enabling integration with other programming languages like Java and C++.
- Providing high-level data visualization capabilities.
- Serving as the fundamental package for numerical computation and array manipulation. (correct)
What advantage does Cython provide in the context of data science and Python programming?
- Improved tools for statistical analysis and modeling.
- Better integration with cloud-based services and platforms.
- The ability to use Python syntax with improved performance by compiling to, and incorporating, C code. (correct)
- Enhanced support for web application development.
Why is R considered a versatile language for statistics and data science?
What is the primary function of SAS Text Miner?
How does MS SQL Server facilitate the integration of relational and non-relational data management?
Why might data scientists opt to use lower-level statistics or programming languages despite the availability of high-level platforms?
Which consideration differentiates statistical solutions like SPSS, STATA, and Minitab?
What is a notable capability of Tableau regarding R integration?
What makes D3 (Data-Driven Documents) a suitable choice for web visualization?
Flashcards
Big Data
Big data is a large pool of data that can be captured, communicated, aggregated, stored, and analyzed.
Big data analytics review
Systematic review of programming languages, statistical tools, analytical solutions and visualization applications available in the big data analytics area.
Big Data Cloud Platforms
Frameworks, either open-source or owned by companies, for handling big data challenges.
Python
Ndarray
Matplotlib
R Language
SAS software
SAS Model Manager
MS SQL Server
Study Notes
- Big data is a large pool of captured, communicated, aggregated, stored, and analyzed data.
- Big data offers data scientists the opportunity to innovate and practice algorithms for analyzing unstructured data.
- Data scientists require specific knowledge and powerful languages/tools to fully appreciate and carry out tasks using big data.
- This paper systematically reviews programming languages, statistical tools, analytical solutions, and visualization applications for big data analytics.
Big Data Analytics
- To work in big data analytics, data scientists should choose tools that let them carry out analysis tasks effectively.
- Analytical tools for data are categorized into: programming languages, statistical solutions, and visualization tools.
- Choosing a tool depends on one's background in programming and statistics.
- The R language requires a strong background in scientific programming and statistics.
- Visualization tools allow the user to interact with data without prior programming or statistical knowledge.
Big Data Cloud Platforms
- Frameworks for handling big data challenges include open-source (Apache Hadoop, SciDB) and proprietary (Google, IBM, Amazon, Microsoft).
- Platforms are built on the cloud based on the features of these frameworks.
- Platforms include Google App Engine, Microsoft Azure, and Amazon EC2, each with its own way of handling big data problems.
- Big data challenges include data storage, analytics, database issues, and machine learning implementation.
Programming Languages
- Numerous languages cater to various tasks in big data analytics; they are divided into high-level and low-level languages based on analytical usage.
- High-level analytical languages vary in learning time and expertise required.
Python
- Python is a favored data analysis language, used by data scientists for research.
- Python's high-level nature and scientific libraries make it suitable for developing analytical algorithms.
- Its scientific ecosystem is useful for applications in both industry and academia.
- NumPy ("Numerical Python") serves as the base data structure and fundamental package.
- In Python's scientific ecosystem, input data is typically represented as a NumPy array.
- NumPy provides:
  - An efficient, fast multidimensional array object (ndarray).
  - Functions for array manipulation.
  - Tools for reading/writing array-based data sets to disk.
  - Fourier transform, linear algebra operations, and random number generation.
  - Tools to integrate code from other languages (C, C++, Fortran).
- pandas supports data scientists and speeds up tasks on structured data through prebuilt functions and rich data structures.
- Matplotlib is popular for data visualization and drawing expressive plots, especially 2D plots.
- IPython is an interactive computing/development environment that maximizes productivity.
- SciPy is a collection of packages containing efficient algorithms for:
  - Linear algebra.
  - Sparse matrix representation.
  - Special functions.
  - Basic statistical functions.
- SciPy includes bindings for Fortran-based numerical packages like LAPACK.
- Cython lets data scientists keep Python syntax and high-level operations while improving performance by compiling to C and combining C code with Python.
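The NumPy capabilities listed above can be illustrated with a short sketch. It assumes NumPy is installed; the sample values and variable names are made up for illustration and do not come from the reviewed paper.

```python
import io

import numpy as np

# A 2-D ndarray: the multidimensional array object at the core of NumPy.
data = np.array([[1.0, 2.0], [3.0, 4.0]])

# Array manipulation: flatten and transpose.
flat = data.reshape(-1)
transposed = data.T

# Reading/writing array data: round-trip through an in-memory buffer
# (np.save and np.load accept a file path in the same way).
buffer = io.BytesIO()
np.save(buffer, data)
buffer.seek(0)
restored = np.load(buffer)

# Linear algebra: solve data @ x = b for x.
b = np.array([5.0, 11.0])
x = np.linalg.solve(data, b)

# Fourier transform of a constant signal: all energy lands in bin 0.
spectrum = np.fft.fft(np.ones(4))

# Random number generation with a seeded generator for reproducibility.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)
```

SciPy builds on the same array object, so arrays like `data` can be passed directly to its linear algebra, sparse matrix, and statistics routines.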
R Language
- R is a versatile open-source programming language for statistics and data science.
- Statistical computation in R can be done using functional-based syntax or program-based code.
- R features:
  - Short, simple syntax.
  - Various formats for loading and storing data (locally or over the internet).
  - In-memory execution of tasks using a consistent syntax.
  - Extensive collection of tools (functions/packages) for data analysis.
  - Easy methods of representing statistical results graphically, with graph storage.
  - Ability to automate analyses, create new functions, and extend existing features.
  - Saving of data and command history between sessions.
- GUI options for R include RStudio, ESS, R Commander, JGR (Java GUI for R), and StatET.
- R is available for Windows, Macintosh, and Linux.
- R provides tools for statistical analysis, machine learning, time-series analysis, classification, clustering, and graphical techniques, and is extensible.
- Various built-in/extended functions in R support statistical, machine learning, and visualization tasks, such as:
  - Data extraction.
  - Data cleaning.
  - Data loading.
  - Data transformation.
  - Statistical analysis.
  - Data visualization.
  - Predictive modeling.
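The tasks listed above form a typical analysis workflow. The following is a rough sketch of most of those stages, written in Python with only the standard library rather than in R, to keep these notes to one example language. The CSV data and all function names are hypothetical, and the visualization step is omitted.

```python
import csv
import io
import statistics

# Hypothetical raw input: a small CSV with one missing value.
RAW_CSV = """hours,score
1,52
2,55
3,
4,61
5,64
"""


def extract(text):
    """Data extraction/loading: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Data cleaning: drop rows that have any empty field."""
    return [r for r in rows if all(v not in (None, "") for v in r.values())]


def transform(rows):
    """Data transformation: convert string fields to (hours, score) floats."""
    return [(float(r["hours"]), float(r["score"])) for r in rows]


rows = transform(clean(extract(RAW_CSV)))
hours = [h for h, _ in rows]
scores = [s for _, s in rows]

# Statistical analysis: basic descriptive statistics.
mean_hours = statistics.mean(hours)
mean_score = statistics.mean(scores)

# Predictive modeling: ordinary least-squares line fitted by hand.
sxy = sum((h - mean_hours) * (s - mean_score) for h, s in rows)
sxx = sum((h - mean_hours) ** 2 for h in hours)
slope = sxy / sxx
intercept = mean_score - slope * mean_hours


def predict(h):
    """Predict a score for a given number of hours from the fitted line."""
    return intercept + slope * h
```

In R the same stages would usually be handled by built-in or package functions; the sketch only shows how the stages relate to one another.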
SAS (Software)
- SAS software, with its language, is a common solution for accessing, transforming, and reporting data via a flexible, extensible, web-based interface.
- The SAS Analytics Platform consists of analytical applications that form the application framework.
- SAS Text Miner is an add-on to SAS Enterprise Miner that provides tools for text mining, including its predictive aspects.
- SAS Forecast Server's automation and scalability support future-oriented decisions by generating large quantities of high-quality forecasts automatically, which makes forecasting more efficient across a range of planning challenges.
- SAS Model Manager organizes the steps for building collections of analytical models, from creation through management, monitoring, and administration, and gives decision makers web-based tools and support for lifecycle management and governance.
MS SQL Server
- MS SQL Server is a solution for traditional relational databases and has graphical tools for entity-relationship diagrams (ERDs).
- MS SQL Server now covers business intelligence (BI) concepts as distributed features.
- MS SQL Server has an integrated Analysis Services 2012 component, related to the Tabular Model, the Multidimensional Model, and the Microsoft BI stack.
- There are three main BI services in MS SQL Server 2012:
  - SQL Server Integration Services (SSIS) for data collection.
  - SQL Server Analysis Services (SSAS) for data analysis.
  - SQL Server Reporting Services (SSRS) for data viewing (visualization).
- Microsoft built a Sqoop-based connector, the SQL Server-Hadoop Connector, to provide an efficient connection between SQL Server and Hadoop.
- MS SQL Server 2012 provides tools for data integration, visualization, a Business Intelligence suite, and efficient connection to Apache Hadoop and Hive.
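The three BI services above divide the work into collecting, analyzing, and viewing data. The sketch below imitates that division with Python's built-in `sqlite3` module as a stand-in relational store. It does not use SQL Server, SSIS, SSAS, or SSRS, and the `sales` table and the function names are hypothetical.

```python
import sqlite3


def collect(conn, records):
    """Integration stage: load source records into a relational table."""
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()


def analyze(conn):
    """Analysis stage: aggregate total sales per region."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


def view(summary):
    """Reporting stage: render the aggregates as plain-text lines."""
    return [f"{region}: {total:.2f}" for region, total in summary]


conn = sqlite3.connect(":memory:")
collect(conn, [("east", 100.0), ("west", 50.0), ("east", 25.5)])
summary = analyze(conn)
report = view(summary)
conn.close()
```

Keeping the three stages as separate functions mirrors how the services are separate products: each stage can be replaced without touching the others.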
Statistical Solutions
- Statistical solutions include SPSS, STATA, Minitab, and Statistica.
- STATA is strong, but not user-friendly for non-statisticians.
SPSS (Statistical Package for the Social Sciences)
- SPSS enables data analysts to base decisions on predictive results that indicate what is likely to happen next.
- The software is now known as "IBM SPSS Statistics"; the current version includes all features and add-ons.
Visualization Tools
- Visualization tools include Python, R, SAS, Julia, MATLAB, Tableau, QlikView, Spotfire, Cognos, D3, Protovis, etc.
- Tableau is for general visualization, and D3 is for web visualization.
- Tableau produces polished visualizations: simple plots that make comparison easy can be combined in one frame, with filtering and customization giving the user full control.
- R integration, a feature introduced in Tableau 8.1, offers the following capabilities:
  - User access to R's wide range of data analysis libraries.
  - Tableau users can include R commands in any of Tableau's four function options and use the results to build graphs in Tableau.
  - R users can use Tableau to explore data by applying it to their R code.
- D3 (Data-Driven Documents) is designed for web visualization; it enables DOM manipulation and comes with developer tools.
Analysis and Comparison of Analytical Tools
- STATA is the best statistical software, though it requires good knowledge of statistical concepts. Most new users will prefer SPSS to other statistical software.
- For professional visualization, Tableau is the best for graphs, charts, maps, etc.
- Benchmarks comparing D3 with other tools confirm that D3 performs better: it is twice as fast as Protovis and more than three times as fast as Flash.
Analytical Tools Usage
- A KDnuggets poll (2014) found that R, SAS, Python, and SQL are the four main languages for Analytics, Data Mining, and Data Science.
Python vs. R
- Python and R are similar programming languages, with some differences.
Conclusion
- R language is a common programming language for data scientists.
- SPSS is a good statistical tool for non-statisticians.
- Tableau Public is a strong visualization tool for presenting data and analyzing it graphically.
- D3 will be the best choice for web visualization purposes.