
A Beginner’s Guide to Tidyverse – The Most Powerful Collection of R Packages f ...

Source: 分析大师 | 2019-05-13 | Published by: 经管之家

Data scientists spend close to 70% (if not more) of their time cleaning, massaging and preparing data. That’s no secret: multiple surveys have reported figures in that range. I can attest to it as well; it is simply the most time-consuming part of a data science project.

Unfortunately, it is also among the least interesting things we do as data scientists. There is no getting around it, though. It is an inevitable part of our role. We simply cannot build powerful and accurate models without ensuring our data is well prepared.

So how can we make this phase of our job interesting?

Welcome to the wonderful world of Tidyverse! It is the most powerful collection of R packages for preparing, wrangling and visualizing data. Tidyverse has completely changed the way I work with messy data; it has actually made data cleaning and massaging fun!

(Image source: tidyverse.org)

If you’re a data scientist and have not yet come across Tidyverse, this article will blow your mind. I will show you the top R packages bundled within Tidyverse that make data preparation an enjoyable experience. We’ll also look at code snippets for each package to help you get started.

You can also check out my pick of the top eight useful R packages you should incorporate into your data science work.

Tidyverse is a collection of essential R packages for data science. The packages under the tidyverse umbrella help us work with and explore data. There are a whole host of things you can do with your data, such as subsetting, transforming, visualizing, etc.

Tidyverse was created by the great Hadley Wickham and his team with the aim of providing all these utilities to clean and work with data.

Let’s now look at some versatile Tidyverse libraries that the majority of data scientists use to manage and streamline their data workflows.

We’ll be working on the food demand forecasting challenge in this article.

Ready to explore the tidyverse? Go ahead and install it directly from within RStudio.
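Installing and loading the tidyverse usually comes down to two calls. The snippet below is a minimal sketch rather than the article’s original listing; it assumes an internet connection and a configured CRAN mirror for the first run.

```r
# Install the tidyverse once; skipped if it is already available.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Attach the core packages (dplyr, tidyr, stringr, forcats, readr, tibble, ...).
library(tidyverse)
```

After `library(tidyverse)` the core packages are attached, so functions such as `filter()` or `read_csv()` can be called without loading each package separately.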
For faster computation, I have taken a random 10% sample from the challenge’s train file. You can take the entire dataset if you want (and if your machine can support it!).

Let’s begin!

dplyr is one of my all-time favorite packages. It is simply the most useful package in R for data manipulation. One of the greatest advantages of this package is that you can use the pipe operator “%>%” to chain different functions together. From filtering to grouping the data, this package does it all.

Here is the list of functions dplyr offers:

Let’s look at an example to understand how to use these different functions in R.

Open up the food forecasting dataset we downloaded earlier. We have 2 other files apart from the training set. We can join them with our train file to add more features. Let’s use dplyr and merge all the files. Again, I’m just using 10% of the overall data to make the computation faster.

Note: The merged output contains a lot of NAs. This is because we randomly chose samples from each of the three files and then merged them. If you use the whole dataset, you will not observe this amount of missing values.

Next, let’s use three dplyr functions together to summarise the data. Here, we’ll keep only the rows where the ‘center_type’ variable is ‘TYPE_A’ and calculate the mean of the ‘num_orders’ variable for that center type:

Here, %>% is called the piping operator. It passes the result of one function on as the first argument of the next, which comes in handy when we want to chain two or more functions together.

Go ahead and try out the other functions. Trust me, they will completely change the way you do data preparation.

The tidyr package complements dplyr perfectly. It boosts the power of dplyr for data manipulation and pre-processing. Below is the list of functions tidyr offers:

Let’s see a quick example of how to use tidyr.
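As a rough illustration of a tidyr call, the sketch below uses `unite()` to merge two binary flag columns into one. The data frame is made up, and the column names `emailer_for_promotion`, `homepage_featured` and `email_home` are assumptions for illustration, not necessarily what the article’s own code used.

```r
library(tidyr)

# Toy data standing in for a few rows of the train file (values are invented;
# the two flag column names are assumptions).
orders <- data.frame(
  id                    = 1:4,
  emailer_for_promotion = c(0L, 1L, 0L, 1L),
  homepage_featured     = c(0L, 0L, 1L, 1L)
)

# unite() pastes the selected columns together with `sep` between them
# and, by default, drops the original columns.
united <- unite(orders, "email_home",
                emailer_for_promotion, homepage_featured, sep = "_")

print(united)
```

The new `email_home` column holds strings such as "1_0", so the four possible flag combinations become the levels of a single variable.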
We’ll unite two binary variables and create only one column for both:

Here’s another example of how tidyr works:

We easily converted the factor variables into a table that can be swiftly interpreted without much pre-processing.

Dealing with string variables is a tricky challenge. They can often trip up our final analysis because we skipped over those variables initially, thinking they won’t affect our model. That’s a mistake.

stringr is my go-to package in R for such situations. It plays a big role in processing raw data into a cleaner and more easily understandable format. stringr contains a variety of functions that make working with string data really easy.

Some basic functions that you can perform with the stringr package are:

There are many more functions inside the stringr package. Let’s look at a couple of functions:

Combine two strings:

The forcats package is dedicated to dealing with categorical variables or factors. Anyone who has worked with categorical data knows what a nightmare they can be. forcats feels like a godsend.

It is quite frustrating when a factor appears in a place where we least expect it. If we’re using the tibble format, we don’t need to worry about this issue, because tibbles never convert strings to factors automatically. The aim of forcats is to fill in the missing pieces so we can access the power of factors with minimum effort.

Use the following example to experiment with factors in your data:

(Image source: effiasoft.com)

We have plenty of ways to read data in R. So why use the readr package? The readr package solves the problem of parsing a flat file into a tibble. This provides an improvement over the standard file importing methods and significantly improves the reading speed.

You can easily read a .CSV file in the following way:

Use this function and you’ll see the difference in the time R takes to read in huge data files.

We work with dataframes in R. It’s one of the first things we le
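To tie the readr and tibble points together, here is a minimal sketch of reading a CSV with `read_csv()` and confirming that the result is a tibble. The file is a tiny made-up one written to a temporary path so the snippet is self-contained; `center_id` is an assumed column name, while `center_type` and `num_orders` follow the variables mentioned earlier.

```r
library(readr)

# Write a small invented CSV to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("center_id,center_type,num_orders",
             "10,TYPE_A,120",
             "11,TYPE_B,80"), path)

# read_csv() parses the flat file into a tibble; declaring col_types
# up front avoids type guessing and the column-specification message.
centers <- read_csv(
  path,
  col_types = cols(
    center_id   = col_integer(),
    center_type = col_character(),
    num_orders  = col_integer()
  )
)

print(class(centers))
```

Note that `center_type` is read as a character column rather than a factor, which is the tibble behaviour described in the forcats discussion.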
This article has been optimized for display. To view the original, follow the link below:
Original article: https://www.analyticsvidhya.com/blog/2019/05/beginner-guide-tidyverse-most-powerful-collection-r-packages-data-science/
