Dataiku Core Designer_ Interface & Data Exploration Notes.pdf
Dataiku Core Designer Interface & Data Exploration

Projects

A project is your central workspace in Dataiku, containing datasets, recipes, models, discussions, and dashboards related to a specific task. Projects are created from the homepage and can be organized into folders. Within a project, you can check the overall status, view recent activity, see contributors, and manage a to-do list. Key commands include duplicating, exporting, and deleting a project (depending on access level).

Flow

The Flow is a visual representation of how data, recipes, and models interact in an analytical pipeline. It traces the journey of data from the initial source to the final output, showing dependencies between components.

Automatic Layout: Dataiku dynamically positions items in an optimized layout that cannot be manually adjusted.

Improving Flow Readability

For complex Flows, you can enhance readability using:

- Flow Zones: Organize the Flow into multiple zones for clarity. Click + Zone to add zones.
- Tags: Add tags to Flow items and filter views based on attributes like creator, purpose, or status.
- Filters: Use the View menu to filter the Flow by zones, tags, connections, recipe engines, etc.

Building the Flow

You can build the entire Flow using the Build all option or build individual items by right-clicking on them and selecting the Build option.

Dependency Awareness: Changes to datasets or recipes can trigger dynamic rebuilding of dependent items upstream or downstream in the Flow.

Datasets

A dataset in Dataiku is any piece of data in a tabular format. Examples include an uploaded Excel spreadsheet, an SQL table, a folder of data files on a Hadoop cluster, or a CSV file in the cloud (e.g., Amazon S3). Datasets are represented in the Flow with a blue square and an icon matching the source type. Interactions with datasets are uniform, regardless of the source.
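As a rough illustration of what "uniform regardless of the source" means, the sketch below runs the same exploration function against two different storage backends. It is not Dataiku's API: `TabularSource`, `CsvSource`, `SqlSource`, and `distinct_values` are hypothetical names invented for this example, and an in-memory SQLite table stands in for a real SQL connection.

```python
import csv
import io
import sqlite3
from typing import Iterator, Protocol


class TabularSource(Protocol):
    """Hypothetical interface: anything that can yield rows as dicts."""

    def rows(self) -> Iterator[dict]: ...


class CsvSource:
    """Rows come from CSV text (stands in for an uploaded file)."""

    def __init__(self, text: str) -> None:
        self._text = text

    def rows(self) -> Iterator[dict]:
        yield from csv.DictReader(io.StringIO(self._text))


class SqlSource:
    """Rows come from a SQL table (in-memory SQLite stands in for a database)."""

    def __init__(self, conn: sqlite3.Connection, table: str) -> None:
        self._conn = conn
        self._table = table

    def rows(self) -> Iterator[dict]:
        cursor = self._conn.execute(f"SELECT * FROM {self._table}")
        names = [col[0] for col in cursor.description]
        for record in cursor:
            yield dict(zip(names, record))


def distinct_values(source: TabularSource, column: str) -> set:
    """Same exploration logic, whatever the storage behind `source`."""
    return {str(row[column]) for row in source.rows()}


csv_source = CsvSource("country,amount\nFR,10\nUS,20\nFR,5\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [("FR", 10), ("US", 20), ("FR", 5)]
)
sql_source = SqlSource(conn, "orders")

print(distinct_values(csv_source, "country"))
print(distinct_values(sql_source, "country"))
```

Both calls return the same set of countries, because `distinct_values` only depends on the shared `rows()` interface and never on where the data lives.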
You can read, write, visualize, and manipulate datasets using tabs like Explore, Charts, Statistics, Data Quality, Metrics, History, and Settings. Dataiku decouples data processing logic from the underlying storage infrastructure, enabling the same methods of interaction across different dataset types. Connections to external datasets do not require Dataiku to ingest the entire dataset. Instead, Dataiku stores the connection details and accesses the data from the original source, transferring only a sample as needed.

Data Connections

Dataiku enables data manipulation on externally stored datasets through connections to sources like SQL databases, cloud storage, and NoSQL sources. You can import new datasets into the Flow by uploading files or using previously established connections. The user interface for exploring and preparing data remains consistent across all dataset types because the processing logic is decoupled from the underlying storage infrastructure.

Managing Connections

Instance administrators can configure connections via:

- Dataiku Cloud: Managed through the Connections menu in the Launchpad.
- Self-managed installations: Managed through the Applications > Administration > Connections menu.

Admins control credentials, security settings, and usage parameters, and can add new connections. Many additional connection types are available through plugins. This setup divides responsibility between those managing data connections and those working with the data, streamlining workflows.

Explore Your Data

Dataset Characteristics

Columns are key elements in Dataiku datasets, typically representing the features of the data. Each column has two main attributes: storage type and meaning.

Storage Type

The storage type defines how Dataiku stores a column's data and dictates how data transformations can be applied. Common storage types include String, Integer, Float, Boolean, and Date.
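To see why the storage type dictates which transformations make sense, here is a small plain-Python sketch (not Dataiku code; the values are made up for illustration). The same column sorts differently depending on whether it is held as strings or as integers.

```python
raw_values = ["9", "10", "100", "25"]  # hypothetical column as read from a text file

# Held as String: sorting compares character by character.
as_strings = sorted(raw_values)

# Held as Integer: the same values sort numerically, and arithmetic is possible.
as_integers = sorted(int(v) for v in raw_values)
total = sum(as_integers)

print(as_strings)   # ['10', '100', '25', '9']
print(as_integers)  # [9, 10, 25, 100]
print(total)        # 144
```

The string ordering is valid but rarely what an analyst wants for numeric data, which is one reason to check a column's storage type before sorting or aggregating.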
It’s important not to alter the storage type for datasets imported from connections like SQL tables.

Meaning

The column meaning is a semantic label, such as country, email, or temperature, inferred by Dataiku from the column values. You can adjust meanings to match your data better or create custom meanings. Meanings help with:

- Auto-detecting possible transformations
- Measuring data quality
- Simplifying value searches

Dataset Schema

A schema lists a dataset’s column names and storage types. You can view or edit the schema in the Schema tab in the right panel, either when uploading a dataset or later in your workflow.

Sampling

Sampling Benefits

When exploring large datasets, Dataiku displays only a sample to reduce the computational load and provide immediate visual feedback. Sampling is used during:

- Visualization (Charts)
- Data prep (Prepare recipe)
- Statistical analyses (Statistics)

This allows for quick actions such as sorting, filtering, displaying column distributions, applying conditional formatting, and viewing summary statistics on a smaller, more manageable dataset sample.

Sampling Methods

By default, Dataiku samples the first 10,000 rows of a dataset, but this method may introduce bias. You can adjust the sampling method by choosing options like random, stratified, or class rebalancing using the Sampling settings panel. While more representative samples provide better insights, they may require more processing time.

Analyze Window

Column Analysis

In the Explore tab, you can investigate any column's values by using the Analyze window, accessible via the context menu of a column header. This tool provides insights into the data sample's quality and distribution.

Analyze Window Overview

- Navigation: Use arrows at the top left to switch between adjacent columns and view metrics for other columns.
- Sample Management: By default, analysis is based on the dataset sample, but you can opt to calculate statistics for the entire dataset.
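The bias of a "first rows" sample, and the gap between sample statistics and whole-dataset statistics, can be shown with a short plain-Python sketch. This is not how Dataiku implements sampling; the data is a hypothetical column of 100,000 values that happens to be stored in sorted order, and the fixed seed is only there to make the run reproducible.

```python
import random
from statistics import mean

# Hypothetical dataset: 100,000 rows stored in sorted order (e.g. by id or date).
full_column = list(range(100_000))

SAMPLE_SIZE = 10_000

# "First rows" sampling: cheap, but only sees the start of the data.
head_sample = full_column[:SAMPLE_SIZE]

# Random sampling: has to consider the whole dataset, but covers its full range.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
random_sample = rng.sample(full_column, SAMPLE_SIZE)

print("full mean  :", mean(full_column))
print("head mean  :", mean(head_sample))
print("random mean:", mean(random_sample))
```

On ordered data like this, the head sample's mean is far from the full-dataset mean, while the random sample's mean lands close to it. That is the trade-off the notes describe: a more representative sample costs more to draw.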
Tab Descriptions

The Analyze window displays different tabs based on the column type:

- Categorical: Shows a bar chart of the most frequent observations.
- Numerical: Displays a histogram, boxplot, summary statistics, frequent values, and outliers.
- Values Clustering: Groups similar values to standardize text fields with unwanted variations.

Both tab types include a Summary section that reviews data quality, showing valid, invalid, empty, and unique values.

Charts

The Charts tab in Dataiku provides a drag-and-drop interface for visualizing data, offering various chart types such as bar charts, line graphs, pivot tables, and scatter plots.

Building a Chart

To create a chart, select a chart type and drag variables from the Data panel to the desired axis.

Customizing a Chart

You can configure chart variables by:

- Selecting the appropriate aggregation
- Adjusting the sorting of values
- Modifying value and label formatting

Additional features include zooming on time series, changing date intervals, and exploring multiple series side-by-side.

Filtering the Data

You can filter data by selecting specific groups or combining less-prevalent categories into an "Others" bucket. Filters can also be applied directly from a chart tooltip.

Sampling and Execution Engine

The Sampling & Engine panel allows configuration of the sampling method and execution engine. By default, charts use the same sample as the Explore tab, but you can adjust this for better performance, especially when using data sources that support SQL queries.
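The "Others" bucket mentioned under Filtering the Data can be pictured with a minimal plain-Python sketch. This is an illustration of the idea, not Dataiku's implementation; `bucket_others`, its `keep` parameter, and the sample `channels` data are all hypothetical.

```python
from collections import Counter


def bucket_others(values, keep=3, label="Others"):
    """Keep the `keep` most frequent categories and merge the rest under `label`."""
    counts = Counter(values)
    top = {value for value, _ in counts.most_common(keep)}
    bucketed = Counter()
    for value, n in counts.items():
        bucketed[value if value in top else label] += n
    return bucketed


# Hypothetical categorical column with a long tail of rare values.
channels = (
    ["web"] * 50 + ["store"] * 30 + ["phone"] * 12 + ["fax"] * 5 + ["mail"] * 3
)

result = bucket_others(channels, keep=3)
print(dict(result))  # {'web': 50, 'store': 30, 'phone': 12, 'Others': 8}
```

Merging the rare categories keeps a bar chart readable without dropping rows: the total count (100 here) is unchanged.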