This report presents my findings after investigating the recent experience shoppers had in store when using our new VR shopping experience. The report is available as an individual PDF on request. Several questions were asked of me, and I can now share my findings.
1. Men used VR more often and spent more time looking at the products
2. Users were most likely to be aged 26-35; the groups just below and above that range also enjoyed VR, but usage tailed off after the age of 50.
3. The professions that enjoyed VR most were those coded 4, 0, 7, and 1.
4. The city type that enjoyed VR the most was type C, with more of its residents using it and looking at products for longer. Type B followed, with type A at the bottom.
5. The majority of people trying out VR don’t have kids.
6. There are specific products that people with kids prefer to look at, but the list is too long to include here; just ask me for it.
7. People between the ages of 51 and 55 appear to spend the most time on average looking at each product in VR.
The main target group for our improvements is:
Males aged 26-35, living in cities categorised as “C”, who have lived there for over one year but less than two, do not have children, and whose profession falls into category 4.
You also asked me to look into how we could tell which customers have children without them telling us. From my findings, three things stand out:
1. There were certain products that users with kids would spend longer looking at. The list is too long to include here, but you can ask me for it at any time.
2. There are products that only people with children looked at. Again, just ask me for this list.
3. People were most likely to have children if they were between the ages of 51 and 55, and the likelihood of having children decreases with each younger age group, down to almost zero for teenagers.
Now, as some of you may know, there were a few tricky questions I couldn’t answer due to a lack of data. If I am to produce this report again, I would need the following from you:
1. The time and date when users were trying out the VR headset, to pull in other factors that could explain why certain groups were more likely to try VR than others. For example, the elderly are more likely to be in the store midday on a work day than younger working people.
2. The data revealing which profession each number represents, to judge more confidently whether users were working in the shop or surrounding shops, among other factors.
3. What each of the City Types represents: whether each covers multiple cities or just one, and whether the type was assigned based on density, location, or something else. Knowing the kinds of connections a city has, and other variables that may not have been considered when assigning the type, would help build a better customer profile.
I hope this information is useful to you, and I look forward to the next time we work together.
Shopping in VR
Christopher Davidson
2022-09-25
First of all, I would like to thank the people who gave me this opportunity to prove my skills as a data analyst and showcase them. This is a Markdown document that shows the processes I went through while programming in R. I later exported the dataframes I produced in R into PowerBI to visualise the data in an accessible way, and then created a report for the fictional stakeholders giving the conclusions of what is covered here. This is the process I went through for my analysis.
Intro
I was asked to do a case study into users shopping in VR. I was given a CSV containing the data from the time these users spent in VR. The CSV file does not contain timestamps of when people used VR, but it does record how long users spent looking at certain products, which was useful for gauging how interested they were in a product. I am working in a fictional team consisting of the Product Owner, a UX designer, and another data analyst, which means I have to present my findings in a way that everyone understands.
I used R to prepare and analyse the data, then PowerBI to visualise it later. I thought it would be easier this way.
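The functions used throughout come from a handful of tidyverse-adjacent packages. As a sketch of the setup (the original setup chunk isn’t shown in this document, and the CSV file name below is a placeholder):

```r
# Assumed setup chunk (not shown in the original document).
library(dplyr)   # glimpse(), group_by(), summarise(), mutate(), n_distinct()
library(tidyr)   # drop_na()
library(skimr)   # skim_without_charts()
library(janitor) # clean_names(), get_dupes()
library(readr)   # read_csv()

# "CustomerData.csv" is a placeholder; the original file name isn't given.
CustomerData <- read_csv("CustomerData.csv")
```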
Ask
In this setup, I was already given the questions. They are as follows:
Who is the main target group? Which segments do you identify?
What kind of data would I want to improve my analysis and back up the insights I mention, and why?
The team wants to develop new features that are personalised for each target market. Which target market should they focus on first?
Some users recorded whether they had children and others did not. The team wants to increase sales of children’s products, and wants to know which characteristics indicate that a user has children.
Process
The data collected was already very clean when I first looked at it through dplyr’s glimpse and skimr’s skim_without_charts. Nothing stood out as unusual.
glimpse(CustomerData)
## Rows: 537,577
## Columns: 12
## $ CustomerID <dbl> 1000001, 1000001, 1000001, 1000001, 1000002, 1000003, 10…
## $ ItemID <chr> "P00069042", "P00248942", "P00087842", "P00085442", "P00…
## $ Sex <chr> "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M", "…
## $ Age <chr> "0-17", "0-17", "0-17", "0-17", "55+", "26-35", "46-50",…
## $ Profession <dbl> 10, 10, 10, 10, 16, 15, 7, 7, 7, 20, 20, 20, 20, 20, 9, …
## $ CityType <chr> "A", "A", "A", "A", "C", "A", "B", "B", "B", "A", "A", "…
## $ YearsInCity <chr> "2", "2", "2", "2", "4+", "3", "2", "2", "2", "1", "1", …
## $ HaveChildren <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TR…
## $ ItemCategory1 <dbl> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5, 8, 8, 1, 5, 4, 2, 5, …
## $ ItemCategory2 <dbl> NA, 6, NA, 14, NA, 2, 8, 15, 16, NA, 11, NA, NA, 2, 8, 5…
## $ ItemCategory3 <dbl> NA, 14, NA, NA, NA, NA, 17, NA, NA, NA, NA, NA, NA, 5, 1…
## $ Amount <dbl> 8370, 15200, 1422, 1057, 7969, 15227, 19215, 15854, 1568…
skim_without_charts(CustomerData)
Name | CustomerData |
Number of rows | 537577 |
Number of columns | 12 |
_______________________ | |
Column type frequency: | |
character | 5 |
logical | 1 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
ItemID | 0 | 1 | 8 | 9 | 0 | 3623 | 0 |
Sex | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Age | 0 | 1 | 3 | 5 | 0 | 7 | 0 |
CityType | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
YearsInCity | 0 | 1 | 1 | 2 | 0 | 5 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
HaveChildren | 20170 | 0.96 | 0.41 | FAL: 304366, TRU: 213041 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
CustomerID | 0 | 1.00 | 1002991.85 | 1714.39 | 1000001 | 1001495 | 1003031 | 1004417 | 1006040 |
Profession | 0 | 1.00 | 8.08 | 6.52 | 0 | 2 | 7 | 14 | 20 |
ItemCategory1 | 0 | 1.00 | 5.30 | 3.75 | 1 | 1 | 5 | 8 | 18 |
ItemCategory2 | 166986 | 0.69 | 9.84 | 5.09 | 2 | 5 | 9 | 15 | 18 |
ItemCategory3 | 373299 | 0.31 | 12.67 | 4.12 | 3 | 9 | 14 | 16 | 18 |
Amount | 0 | 1.00 | 9333.86 | 4981.02 | 185 | 5866 | 8062 | 12073 | 23961 |
After checking that there wasn’t anything unusual, I decided to run two cleaning functions just in case.
clean_names(CustomerData) #Ensures all the column names are in a format R handles easily; note it returns a cleaned copy, so the result would normally be assigned back
get_dupes(CustomerData) #Reports any fully duplicated rows; it lists them rather than removing them (distinct() would do the removal)
## No variable names specified - using all columns.
## No duplicate combinations found of: CustomerID, ItemID, Sex, Age, Profession, CityType, YearsInCity, HaveChildren, ItemCategory1, ... and 3 other variables
Analyse
Afterwards it was time to analyse, and this took some time. First, I wanted to look at the data I had available, with each header telling me what was contained in that column.
glimpse(CustomerData)
## Rows: 537,577
## Columns: 12
## $ CustomerID <dbl> 1000001, 1000001, 1000001, 1000001, 1000002, 1000003, 10…
## $ ItemID <chr> "P00069042", "P00248942", "P00087842", "P00085442", "P00…
## $ Sex <chr> "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M", "…
## $ Age <chr> "0-17", "0-17", "0-17", "0-17", "55+", "26-35", "46-50",…
## $ Profession <dbl> 10, 10, 10, 10, 16, 15, 7, 7, 7, 20, 20, 20, 20, 20, 9, …
## $ CityType <chr> "A", "A", "A", "A", "C", "A", "B", "B", "B", "A", "A", "…
## $ YearsInCity <chr> "2", "2", "2", "2", "4+", "3", "2", "2", "2", "1", "1", …
## $ HaveChildren <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TR…
## $ ItemCategory1 <dbl> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5, 8, 8, 1, 5, 4, 2, 5, …
## $ ItemCategory2 <dbl> NA, 6, NA, 14, NA, 2, 8, 15, 16, NA, 11, NA, NA, 2, 8, 5…
## $ ItemCategory3 <dbl> NA, 14, NA, NA, NA, NA, 17, NA, NA, NA, NA, NA, NA, 5, 1…
## $ Amount <dbl> 8370, 15200, 1422, 1057, 7969, 15227, 19215, 15854, 1568…
I figured out what I wanted to do: look at the different types of users and what they were up to. That meant separate dataframes for Age, Sex, Profession, children, City, and the amount of time lived in that city, to see whether each made a difference. I knew I would later want to go deeper into analysing users with children, but for now I looked at the basics.
"Age" <- `CustomerData` %>% #Have a look to see which age group is using it most
group_by(`Age`) %>%
summarise(AgeCount = n_distinct(CustomerID))
print(Age)
## # A tibble: 7 × 2
## Age AgeCount
## <chr> <int>
## 1 0-17 218
## 2 18-25 1069
## 3 26-35 2053
## 4 36-45 1167
## 5 46-50 531
## 6 51-55 481
## 7 55+ 372
At this point I realised how low these numbers were, so I wanted to make sure the data I was looking at was correct.
`CustomerData` %>% #Only 5891 customers out of 537,577 rows. I'll double-check these numbers with the (slower) Sheets file I have open. Sheets says the same when I use =COUNTUNIQUE()
summarise(CustCount = n_distinct(CustomerID))
There were only 5,891 customers across the 537,577 rows. I double-checked using the =COUNTUNIQUE() function in Sheets and got the same number. The data was correct; just under 6,000 people had tested this product.
From the data, it was also clear that it was mostly 26-35 year olds that wanted to use VR. You can see this more clearly in the report I created in PowerBI showing all the visuals.
"AgeTime" <- `CustomerData` %>% #The older people get, the longer they look at the article
group_by(`Age`) %>%
summarise(AvgTime = mean(Amount))
print(AgeTime)
## # A tibble: 7 × 2
## Age AvgTime
## <chr> <dbl>
## 1 0-17 9020.
## 2 18-25 9235.
## 3 26-35 9315.
## 4 36-45 9401.
## 5 46-50 9285.
## 6 51-55 9621.
## 7 55+ 9454.
This was interesting. The older people got, the more time they would spend looking at individual products.
"Sex" <- `CustomerData` %>% #Over double the amount of people using this VR experience are male and men spend longer looking at products. Around 70% are men and 30% are women.
group_by(`Sex`) %>%
summarise(SexCount = n_distinct(CustomerID),AvgTime = mean(Amount)) %>%
mutate(CountPerc = SexCount / sum(SexCount)*100) %>%
mutate(TimePerc = AvgTime / sum(AvgTime)*100)
print(Sex)
## # A tibble: 2 × 5
## Sex SexCount AvgTime CountPerc TimePerc
## <chr> <int> <dbl> <dbl> <dbl>
## 1 F 1666 8810. 28.3 48.1
## 2 M 4225 9505. 71.7 51.9
I added percentages here comparing each group against the column total, showing that around 70% of VR users were men and just under 30% were women. Men even spent slightly longer on average looking at products than women did.
"Prof" <-`CustomerData` %>%
group_by(`Profession`) %>%
summarise(Count = n_distinct(CustomerID), AvgTime = (mean(Amount)))%>%
mutate(CountPerc = (Count / sum(Count)*100)) %>%
mutate(TimePerc = (AvgTime / sum(AvgTime)*100)) %>%
arrange(desc(CountPerc))
print(Prof)
## # A tibble: 21 × 5
## Profession Count AvgTime CountPerc TimePerc
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 4 740 9279. 12.6 4.74
## 2 0 688 9187. 11.7 4.70
## 3 7 669 9502. 11.4 4.86
## 4 1 517 9018. 8.78 4.61
## 5 17 491 9906. 8.33 5.06
## 6 12 376 9883. 6.38 5.05
## 7 14 294 9569. 4.99 4.89
## 8 20 273 8881. 4.63 4.54
## 9 2 256 9026. 4.35 4.61
## 10 16 235 9457. 3.99 4.84
## # … with 11 more rows
This was a large table, but generally I could see that people in certain professions were more interested in VR. However, the dataset was small and this was a test run in stores, so other contributing factors could be at play. These could include shift workers not being able to make it on the day the store was showcasing VR, or the other way around. Perhaps many of the users were co-workers from the store and surrounding stores who used the VR on their breaks. If I were to look into this again, I would want to know what professions these codes represent and see whether the data coincides with professional interests. For example, 3D designers could be interested in VR because they want to follow the technology’s progression and perhaps buy a headset for work, or game creators might simply be curious about the technology.
"City" <- `CustomerData` %>%
group_by(`CityType`) %>%
summarise(Count = n_distinct(CustomerID), AvgTime = (mean(Amount)))
print(City)
## # A tibble: 3 × 3
## CityType Count AvgTime
## <chr> <int> <dbl>
## 1 A 1045 8958.
## 2 B 1707 9199.
## 3 C 3139 9844.
If this were a real study, I would ask what the City Types represent: particular cities, city density, or something else. The data clearly shows that people in type C are a lot more interested in VR. However, the tricky part is that we don’t know the details. To understand this, we would want to know what the City Types represent, how long VR was showcased in each city, how many staff were available to assist customers with VR, and so on.
"Local" <-`CustomerData` %>%
group_by(`YearsInCity`) %>%
summarise(CountA = n_distinct(ifelse(CityType == "A", CustomerID, NA), na.rm = T),
          AvgTimeA = mean(ifelse(CityType == "A", CustomerID, NA), na.rm = T), #NB: the per-city AvgTime columns here accidentally average CustomerID rather than Amount, so only the counts and TAvgTime are meaningful
          CountB = n_distinct(ifelse(CityType == "B", CustomerID, NA), na.rm = T),
          AvgTimeB = mean(ifelse(CityType == "B", CustomerID, NA), na.rm = T),
          CountC = n_distinct(ifelse(CityType == "C", CustomerID, NA), na.rm = T),
          AvgTimeC = mean(ifelse(CityType == "C", CustomerID, NA), na.rm = T),
          TCount = n_distinct(CustomerID),
          TAvgTime = mean(Amount))
print(Local)
## # A tibble: 5 × 9
## YearsInCity CountA AvgTimeA CountB AvgTimeB CountC AvgTimeC TCount TAvgTime
## <chr> <int> <dbl> <int> <dbl> <int> <dbl> <int> <dbl>
## 1 0 147 1002916. 211 1003020. 414 1003140. 772 9247.
## 2 1 370 1002915. 608 1003130. 1108 1003006. 2086 9320.
## 3 2 183 1003183. 342 1003059. 620 1002960. 1145 9398.
## 4 3 180 1002492. 295 1002920. 504 1003158. 979 9351.
## 5 4+ 165 1002972. 251 1002918. 493 1002858. 909 9346.
This next step was a little more complicated. I wanted to check how long each user had lived in their city and how that affected the likelihood of them trying VR. As you can see from the dataframe, or in the PowerBI report, people who have lived in the city for one to two years are more likely to try VR. I know this is not a report to speculate in; however, this is probably because people who have just moved are too busy settling in, while people at 2+ years are already settled and not looking for new experiences. The lettered counts represent the different city types, and the data shows that whichever city type you live in, you are still more likely to try VR if you have lived there for one to two years. (Note that the per-city AvgTime columns above are an artefact of averaging CustomerID and shouldn’t be read as viewing times.)
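The per-city AvgTime columns in the chunk above accidentally average CustomerID rather than Amount. A corrected sketch is below; I haven’t rerun the analysis, so no output is shown:

```r
# Corrected sketch: average Amount (viewing time) per city type,
# instead of the accidental mean of CustomerID in the chunk above.
"LocalFixed" <- `CustomerData` %>%
  group_by(`YearsInCity`) %>%
  summarise(CountA = n_distinct(ifelse(CityType == "A", CustomerID, NA), na.rm = T),
            AvgTimeA = mean(ifelse(CityType == "A", Amount, NA), na.rm = T),
            CountB = n_distinct(ifelse(CityType == "B", CustomerID, NA), na.rm = T),
            AvgTimeB = mean(ifelse(CityType == "B", Amount, NA), na.rm = T),
            CountC = n_distinct(ifelse(CityType == "C", CustomerID, NA), na.rm = T),
            AvgTimeC = mean(ifelse(CityType == "C", Amount, NA), na.rm = T),
            TCount = n_distinct(CustomerID),
            TAvgTime = mean(Amount))
```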
"TKids" <- `CustomerData` %>%
group_by(`HaveChildren`) %>%
drop_na(HaveChildren) %>%
summarise(Count = n_distinct(CustomerID), AvgTime = (mean(Amount))) %>%
mutate(CountPerc = (Count / sum(Count)*100)) %>%
mutate(TimePerc = (AvgTime / sum(AvgTime)*100))
print(TKids)
## # A tibble: 2 × 5
## HaveChildren Count AvgTime CountPerc TimePerc
## <lgl> <int> <dbl> <dbl> <dbl>
## 1 FALSE 3280 9334. 57.8 50.0
## 2 TRUE 2399 9332. 42.2 50.0
Now comes the simple question of how many people who used VR have kids. I calculated the percentages against the column totals, since that gave a more accurate reading. It is clear that just under 60% of people who used VR didn’t have children and just over 40% did.
"KidCount" <- `CustomerData` %>% #Tried with item ID, ended up with 3623 rows, so trying with category instead
group_by(`ItemCategory1`) %>%
summarise(WithKids = n_distinct(ifelse(HaveChildren == TRUE,CustomerID, NA),na.rm = T),WOKids = n_distinct(ifelse(HaveChildren == FALSE,CustomerID, NA),na.rm = T)) %>%
mutate(WithKidsPerc = WithKids / sum(WithKids)*100) %>%
mutate(WOKidsPerc = WOKids / sum(WOKids)*100) %>%
mutate(KidsPercDif = WithKidsPerc-WOKidsPerc) %>%
arrange(KidsPercDif)
print(KidCount)
## # A tibble: 18 × 6
## ItemCategory1 WithKids WOKids WithKidsPerc WOKidsPerc KidsPercDif
## <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 3 1489 2198 7.00 7.51 -0.518
## 2 11 1418 2024 6.66 6.92 -0.257
## 3 2 1713 2417 8.05 8.26 -0.214
## 4 15 967 1380 4.54 4.72 -0.174
## 5 4 1337 1870 6.28 6.39 -0.111
## 6 9 156 235 0.733 0.803 -0.0704
## 7 1 2331 3224 11.0 11.0 -0.0697
## 8 16 1252 1740 5.88 5.95 -0.0660
## 9 13 910 1257 4.28 4.30 -0.0217
## 10 6 1654 2261 7.77 7.73 0.0416
## 11 7 595 805 2.80 2.75 0.0436
## 12 5 2345 3193 11.0 10.9 0.102
## 13 14 408 524 1.92 1.79 0.126
## 14 8 2313 3139 10.9 10.7 0.136
## 15 18 538 689 2.53 2.36 0.172
## 16 10 968 1260 4.55 4.31 0.241
## 17 17 202 206 0.949 0.704 0.245
## 18 12 686 827 3.22 2.83 0.396
I didn’t use this in my final report, since at most it shows around a 0.5 percentage-point difference between people with and without children per category. I considered this too small to be meaningful, so the data wasn’t useful.
"KidTime" <- `CustomerData` %>% #I'm not comfortable with these variables; the numbers are too tight. Maybe the second category can shed some light
group_by(`ItemCategory1`) %>%
summarise(WithKids = mean(ifelse(HaveChildren == TRUE,Amount, NA),na.rm = T),WOKids = mean(ifelse(HaveChildren == FALSE,Amount, NA),na.rm = T)) %>%
mutate(WithKidsPerc = WithKids / (WithKids+WOKids)*100) %>%
mutate(WOKidsPerc = WOKids / (WithKids+WOKids)*100) %>%
mutate(KidsPercDif = WithKidsPerc-WOKidsPerc) %>% #Something has gone wrong with this difference, however I think I will leave it since I can clearly see the data anyway
arrange(KidsPercDif)
print(KidTime)
## # A tibble: 18 × 6
## ItemCategory1 WithKids WOKids WithKidsPerc WOKidsPerc KidsPercDif
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 15064. 15919. 48.6 51.4 -2.76
## 2 11 4614. 4721. 49.4 50.6 -1.14
## 3 18 2950. 3001. 49.6 50.4 -0.861
## 4 13 720. 725. 49.8 50.2 -0.383
## 5 6 15811. 15867. 49.9 50.1 -0.177
## 6 7 16317. 16364. 49.9 50.1 -0.145
## 7 10 19703. 19663. 50.1 49.9 0.100
## 8 12 1352. 1348. 50.1 49.9 0.154
## 9 15 14790. 14744. 50.1 49.9 0.154
## 10 1 13640. 13596. 50.1 49.9 0.161
## 11 16 14802. 14736. 50.1 49.9 0.223
## 12 8 7521. 7479. 50.1 49.9 0.279
## 13 5 6260. 6222. 50.2 49.8 0.305
## 14 17 10148. 10076. 50.2 49.8 0.356
## 15 14 13188. 13082. 50.2 49.8 0.406
## 16 2 11360. 11178. 50.4 49.6 0.806
## 17 3 10198. 10010. 50.5 49.5 0.927
## 18 4 2358. 2305. 50.6 49.4 1.14
"KidTime2" <- `CustomerData` %>%
group_by(`ItemCategory2`) %>%
summarise(WithKids = mean(ifelse(HaveChildren == TRUE,Amount, NA),na.rm = T),WOKids = mean(ifelse(HaveChildren == FALSE,Amount, NA),na.rm = T)) %>%
mutate(WithKidsPerc = WithKids / (WithKids+WOKids)*100) %>%
mutate(WOKidsPerc = WOKids / (WithKids+WOKids)*100) %>%
mutate(KidsPercDif = WithKidsPerc-WOKidsPerc) %>%
arrange(KidsPercDif)
print(KidTime2)
## # A tibble: 18 × 6
## ItemCategory2 WithKids WOKids WithKidsPerc WOKidsPerc KidsPercDif
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 6791. 6917. 49.5 50.5 -0.922
## 2 15 10264. 10419. 49.6 50.4 -0.754
## 3 8 10231. 10323. 49.8 50.2 -0.446
## 4 16 10256. 10333. 49.8 50.2 -0.371
## 5 10 15613. 15656. 49.9 50.1 -0.137
## 6 17 9407. 9402. 50.0 50.0 0.0268
## 7 14 7107. 7094. 50.0 50.0 0.0912
## 8 6 11534. 11503. 50.1 49.9 0.137
## 9 9 7314. 7273. 50.1 49.9 0.281
## 10 2 13678. 13600. 50.1 49.9 0.286
## 11 NA 7737. 7655. 50.3 49.7 0.535
## 12 18 9426. 9325. 50.3 49.7 0.537
## 13 12 7008. 6925. 50.3 49.7 0.591
## 14 11 9005. 8873. 50.4 49.6 0.739
## 15 4 10332. 10134. 50.5 49.5 0.967
## 16 3 11364. 11140. 50.5 49.5 0.992
## 17 5 9157. 8969. 50.5 49.5 1.04
## 18 13 9879. 9534. 50.9 49.1 1.77
"KidTime3" <-`CustomerData` %>%
group_by(`ItemCategory3`) %>%
summarise(WithKids = mean(ifelse(HaveChildren == TRUE,Amount, NA),na.rm = T),WOKids = mean(ifelse(HaveChildren == FALSE,Amount, NA),na.rm = T)) %>%
mutate(WithKidsPerc = WithKids / (WithKids+WOKids)*100) %>%
mutate(WOKidsPerc = WOKids / (WithKids+WOKids)*100) %>%
mutate(KidsPercDif = WithKidsPerc-WOKidsPerc) %>%
arrange(KidsPercDif)
print(KidTime3)
## # A tibble: 16 × 6
## ItemCategory3 WithKids WOKids WithKidsPerc WOKidsPerc KidsPercDif
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 13749. 14104. 49.4 50.6 -1.28
## 2 11 12029. 12198. 49.7 50.3 -0.698
## 3 14 9980. 10113. 49.7 50.3 -0.663
## 4 6 13129. 13236. 49.8 50.2 -0.403
## 5 8 13009. 13052. 49.9 50.1 -0.162
## 6 9 10433. 10438. 50.0 50.0 -0.0249
## 7 NA 8314. 8300. 50.0 50.0 0.0818
## 8 16 12017. 11956. 50.1 49.9 0.252
## 9 13 13221. 13146. 50.1 49.9 0.287
## 10 15 12393. 12316. 50.2 49.8 0.309
## 11 17 11832. 11748. 50.2 49.8 0.359
## 12 18 11052. 10936. 50.3 49.7 0.527
## 13 5 12226. 12067. 50.3 49.7 0.653
## 14 4 9867. 9730. 50.3 49.7 0.700
## 15 10 13667. 13372. 50.5 49.5 1.09
## 16 12 8861. 8648. 50.6 49.4 1.21
All these dataframes look at the different product categories and explore how likely people are to view each category depending on whether or not they have kids. It turns out that no single category in any of the three category columns really stood out as being viewed more by one group than the other. The biggest difference was about 2% in KidTime2, which I didn’t consider enough to chase.
"KidTimeArt" <-`CustomerData` %>% #Now trying each individual item ID instead (3623 rows); could put this in a line graph to see what we have
group_by(`ItemID`) %>%
summarise(WithKids = mean(ifelse(HaveChildren == TRUE,(Amount), NA),na.rm = T),WOKids = mean(ifelse(HaveChildren == FALSE,(Amount), NA),na.rm = T)) %>%
mutate(WithKidsPerc = (WithKids / (WithKids+WOKids)*100)) %>%
mutate(WOKidsPerc = (WOKids / (WithKids+WOKids)*100)) %>%
mutate(KidsPercDif = WithKidsPerc-WOKidsPerc)%>%
mutate(Dif = WithKids-WOKids) %>%
arrange(desc(KidsPercDif)) #I also sorted ascending to check the NULLs
#arrange(Dif) #I had to scroll down for the NULLs, but they were so few that I looked at the dataframe overall instead
#Success: the item with ID P00131842 was looked at 14159 more milliseconds by people with kids. I should take every item that people with kids looked at 2000+ milliseconds longer than people without, add the items only viewed by people with kids, and use that list to assess whether other users have kids or not.
print(KidTimeArt)
## # A tibble: 3,623 × 7
## ItemID WithKids WOKids WithKidsPerc WOKidsPerc KidsPercDif Dif
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 P00175142 7989 2193 78.5 21.5 56.9 5796
## 2 P00077142 7866 2222 78.0 22.0 55.9 5644
## 3 P00309742 6096. 1729 77.9 22.1 55.8 4368.
## 4 P00161342 15262 4516 77.2 22.8 54.3 10746
## 5 P00131842 20468. 6308 76.4 23.6 52.9 14160.
## 6 P00138742 12779 4348 74.6 25.4 49.2 8431
## 7 P00261942 11425 3977 74.2 25.8 48.4 7448
## 8 P00293442 571. 200 74.1 25.9 48.1 371.
## 9 P00152342 8565 3010. 74.0 26.0 48.0 5554.
## 10 P00247842 6025 2195 73.3 26.7 46.6 3830
## # … with 3,613 more rows
Instead, I looked at each individual article. There were quite a few articles that people with kids would look at for longer than people without. This could well predict whether people have kids when we don’t have that data.
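The selection step described in the comments above (items viewed 2000+ milliseconds longer by users with kids, plus items only viewed by users with kids) could be sketched like this; KidPredictorItems is a name I’ve made up:

```r
# Sketch of the proposed predictor list (assumed threshold of 2000 ms).
# Items only ever viewed by users with kids have WOKids = NaN, because
# mean() over an all-NA group with na.rm = TRUE returns NaN.
"KidPredictorItems" <- KidTimeArt %>%
  filter(Dif > 2000 | is.nan(WOKids)) %>%
  pull(ItemID)
```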
"KidAgeCount" <-`CustomerData` %>% #The older people get, the more likely they are to have kids
group_by(`Age`) %>%
summarise(CountKids = n_distinct(ifelse(HaveChildren == TRUE,CustomerID, NA),na.rm = T),CountWOKids = n_distinct(ifelse(HaveChildren == FALSE,CustomerID,NA),na.rm = T)) %>%
mutate(WithKidsPerc = (CountKids / (CountKids+CountWOKids))*100) %>%
mutate(WOKidsPerc = (CountWOKids / (CountKids+CountWOKids))*100) %>%
mutate(dif = WithKidsPerc - WOKidsPerc)
print(KidAgeCount)
## # A tibble: 7 × 6
## Age CountKids CountWOKids WithKidsPerc WOKidsPerc dif
## <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 0-17 0 211 0 100 -100
## 2 18-25 239 785 23.3 76.7 -53.3
## 3 26-35 783 1198 39.5 60.5 -20.9
## 4 36-45 448 675 39.9 60.1 -20.2
## 5 46-50 363 152 70.5 29.5 41.0
## 6 51-55 334 128 72.3 27.7 44.6
## 7 55+ 232 131 63.9 36.1 27.8
I then looked at the different age groups and the likelihood of them having kids. It was immediately clear that the older the user, the more likely they are to have kids, dropping off for people over the age of 55. You can see this clearly in the report.
"KidCityYearsCount" <-`CustomerData` %>% #There doesn't seem to be a difference when looking at years in the city
group_by(`YearsInCity`) %>%
summarise(CountKids = (n_distinct(ifelse(HaveChildren == TRUE,CustomerID, NA),na.rm = T)),CountWOKids = (n_distinct(ifelse(HaveChildren == FALSE,CustomerID,NA),na.rm = T))) %>%
mutate(WithKidsPerc = (CountKids / sum(CountKids))*100) %>%
mutate(WOKidsPerc = (CountWOKids / sum(CountWOKids))*100) %>%
mutate(dif = WithKidsPerc - WOKidsPerc)
print(KidCityYearsCount)
## # A tibble: 5 × 6
## YearsInCity CountKids CountWOKids WithKidsPerc WOKidsPerc dif
## <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 0 300 446 12.5 13.6 -1.09
## 2 1 891 1120 37.1 34.1 2.99
## 3 2 469 641 19.5 19.5 0.00713
## 4 3 385 554 16.0 16.9 -0.842
## 5 4+ 354 519 14.8 15.8 -1.07
The last dataframe dug deeper into whether the amount of time a user has lived in their city makes a difference to whether they have children. The result was only a very small difference. I went down this route because I wondered whether users who had just moved to the city were more likely to be travellers who had never settled, but I couldn’t find any data to back that theory up.
Act
Question 2 asked what kind of data I would want to improve my analysis and back up the insights I mentioned, and why. There are several pieces of data I would like:
I would like to know the time and date when users were trying out the VR headset, to pull in other factors that could explain why certain groups were more likely to try VR than others.
I would also love to have the data revealing which profession each number represents, to judge more confidently whether users were working in the shop or surrounding shops, among other factors.
I would like to know what each of the City Types represents: whether they cover multiple cities or just one, and whether the type was assigned based on density, location, or something else. All of this would help build a better profile of the customers using VR. Knowing the kinds of connections a city has, and other variables that may not have been considered when assigning the type, would also help.
The dataset was quite small, with just under 6,000 participants. However, if I were working for this company, I would hope these experiments would continue before the final product is released. I still feel more data is needed to get a clear picture of who our target market should be, including the time and date each VR session took place, and definitions for the fields that were supplied as codes rather than names. But overall, I do feel that the suggestions I included would help any team trying to improve their product.
Thank you for reading.