Business, Arts and Entertainment, Movies and TV Shows, Classification, Data Visualization, Time Series Analysis Classification

Data structure: 56.36 MB


Context

The data set represents movies released from the year xxx up to 2017. It is kept quite general and has no real problem or challenge as background. The whole data set is meant for practicing the different techniques a data analyst or data scientist uses. I'd also like to mention that the data set is not fully cleaned. The reasoning is that it should show you the real life of an analyst or scientist: get data, prepare data, analyse data, visualize data, predict outcomes of different use cases ;-)

Content

I love watching movies, so I tried to combine this hobby with my current self-study towards becoming a data scientist. I needed a data set with movie information so that I could play around and apply what I had learned. At first glance I could see that the data set can be used for regression, classification or potentially even deep learning (such as image recognition; poster URLs are given).

I acquired this data set in several steps. First I searched the internet for an API that I could use to retrieve movie information. After a short time I found one. With the help of this API I was able to fetch information based on the title of a movie. Now I had another problem: I was missing movie titles. The next search began. I couldn't find an API for that, but I saw that Wikipedia was quite well structured with regard to movie titles. So I built a scraper to fetch all movie titles from 1990 to 2017. With all the titles collected, I could finally fetch the information for each movie using the title plus the year (there may be movies with the same name). Unfortunately some movie titles were written differently, so I had a failure rate of 10% when obtaining the movie data. Based on the 10% of failed movie titles, I did a text analysis and found around 400,000 new movies/series.
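The lookup step described above (title plus year as the key, with a share of titles failing) could be sketched roughly as follows. This is only an illustration under assumptions: `fake_fetch` and `CATALOGUE` are hypothetical stand-ins for the real movie API, the titles and IDs are made up, and `normalize_title` is one possible way to reduce spelling differences, not the method actually used to build the data set.

```python
def normalize_title(title):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(title.lower().split())


def fetch_all(titles, fetch):
    """Look up every (title, year) pair.

    Returns the found records keyed by imdbID and the list of failed pairs.
    """
    found, failed = {}, []
    for title, year in titles:
        record = fetch(normalize_title(title), year)
        if record is None:
            failed.append((title, year))
        else:
            found[record["imdbID"]] = record
    return found, failed


# Hypothetical in-memory catalogue standing in for the movie API.
# Two movies share a name and are told apart only by the year.
CATALOGUE = {
    ("example movie", 1995): {"imdbID": "tt0000001", "Title": "Example Movie", "Year": 1995},
    ("example movie", 2010): {"imdbID": "tt0000002", "Title": "Example Movie", "Year": 2010},
}


def fake_fetch(title, year):
    """Return the record for (title, year), or None if the lookup fails."""
    return CATALOGUE.get((title, year))


# Made-up scraped titles: one clean, one with odd spacing, one misspelled.
scraped = [("Example Movie", 1995), ("Example  movie", 2010), ("Exampel Movie", 2010)]
found, failed = fetch_all(scraped, fake_fetch)
failure_rate = len(failed) / len(scraped)
```

Keeping the failed pairs in a separate list makes it possible to analyse them later, in the spirit of the text analysis mentioned above.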
The latest version should include nearly 200,000 different movies based on the imdbID. Additionally, I cleaned some of the information, such as Genre, Actors and Writer, to make analysis easier. The CSV files can be joined on the **imdbID**. Be aware that some information is missing and declared as *_NOT_GIVEN*.

Acknowledgements

Thanks to the provider of the API for such a good API and well-structured data.

Inspiration

The inspiration for this data set came from getting into the practical flow of developing an image recognition application: **Recognize the genre of a movie by the given poster.** On request I can also provide the images of the movies. For the given data set, I have the following questions in mind:

1. Does the genre correlate with the given score?
2. Can we see a hype around specific genres over the past years?
3. Do the actors or writers prefer a genre?
4. Do the actors or writers have an impact on the IMDb score?
5. Do the directors have preferred actors for their movies?
6. Do the directors have preferred writers for their movies?
7. How many movies have been produced by each director?
8. Is there any relation between the director and the IMDb rating?
9. .... many more questions :-)
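As a starting point for the questions above, here is a minimal sketch of joining two of the CSV files on imdbID and treating the missing-value marker as missing. The column names (`Title`, `Director`, `imdbRating`, `Genre`), the file contents and the exact form of the marker (any value ending in `NOT_GIVEN`) are assumptions for illustration; check them against the real files. It uses only the Python standard library so it runs without extra packages.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up sample data; the real CSV files and their columns may differ.
MOVIES_CSV = """imdbID,Title,Director,imdbRating
tt0000001,Example A,Director One,7.5
tt0000002,Example B,Director One,6.5
tt0000003,Example C,Director_NOT_GIVEN,imdbRating_NOT_GIVEN
"""

GENRES_CSV = """imdbID,Genre
tt0000001,Drama
tt0000002,Comedy
tt0000002,Drama
tt0000003,Comedy
"""


def clean(value):
    """Map the missing-value marker to None (assumed: values ending in NOT_GIVEN)."""
    value = value.strip()
    return None if value.endswith("NOT_GIVEN") else value


def read_rows(text):
    """Parse CSV text into a list of dicts with cleaned values."""
    return [{k: clean(v) for k, v in row.items()}
            for row in csv.DictReader(io.StringIO(text))]


# Index the movies by imdbID, then inner-join each genre row onto its movie.
movies = {row["imdbID"]: row for row in read_rows(MOVIES_CSV)}
joined = [{**movies[g["imdbID"]], **g}
          for g in read_rows(GENRES_CSV) if g["imdbID"] in movies]

# Question 7: how many movies per director (unknown directors skipped).
movies_per_director = Counter(
    m["Director"] for m in movies.values() if m["Director"] is not None)

# First step towards question 1: mean rating per genre (missing ratings skipped).
ratings = defaultdict(list)
for row in joined:
    if row["imdbRating"] is not None:
        ratings[row["Genre"]].append(float(row["imdbRating"]))
mean_rating_by_genre = {g: sum(v) / len(v) for g, v in ratings.items()}
```

A per-genre mean is only a first look; a proper answer to question 1 would also need the spread of the ratings and the number of movies behind each genre.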


