How to use map and reduce efficiently?

Daniel Pinto on 8 Aug 2019

Direct link to this question: https://support.mathworks.com/matlabcentral/answers/475418-how-to-use-map-and-reduce-efficiently

Edited: Daniel Pinto on 9 Aug 2019

Accepted Answer: Guillaume

I have the following table, which has over 40 million rows and 5 columns:

[Screenshot of the table omitted]

The first column is irrelevant. The second column is a YYYYMMDD date, and the frequency of the data is quarterly. The third column is a firmID - some firm IDs include letters as well as numbers. The fourth and fifth columns are values assigned to 2 different variables.

I wish to do 3 things:

1) for every rdate-cusip pair, sum shares across all different identifiers of mgrno that exist for that rdate-cusip combination. Call this value A.

2) for every rdate-cusip pair, obtain the mode value of shrout2 across the different identifiers of mgrno that exist for that rdate-cusip combination. Call this value B.

3) divide A by B.
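As a point of reference, the three steps above can be sketched on a small in-memory table. The column names follow the question, but the values are made up, and this ignores the size problem entirely; findgroups and splitapply with several outputs work on ordinary (non-tall) tables from R2015b onwards.

```matlab
% Toy data using the column names from the question (values are invented)
rdate   = [20190331; 20190331; 20190331; 20190630];
cusip   = {'00123E10'; '00123E10'; 'ABC45678'; '00123E10'};
shares  = [100; 300; 50; 200];
shrout2 = [2; 2; 5; 4];
T = table(rdate, cusip, shares, shrout2);

% One group number per distinct rdate-cusip pair
[g, gRdate, gCusip] = findgroups(T.rdate, T.cusip);

A = splitapply(@sum,  T.shares,  g);   % step 1: total shares per pair
B = splitapply(@mode, T.shrout2, g);   % step 2: mode of shrout2 per pair

% Step 3: the ratio A / B, one row per pair
result = table(gRdate, gCusip, A, B, A ./ B, ...
    'VariableNames', {'rdate', 'cusip', 'A', 'B', 'ratio'});
```

On 40 million rows the same logic has to run chunk by chunk, which is what the rest of the thread is about.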

This would normally be straightforward, but due to the big dimensions of the data, I am struggling to do it. I have tried to use the functions map and reduce, without really loading the file into the workspace, but I believe I am making some kind of mistake. I was getting error messages trying to conduct the division inside the mapping phase, so I decided to skip the division and just have as output a table in which the first column is the quarter-CUSIP identifier, the second column is A, and the third column is B.

ds = datastore('myFile.csv');
ds.TextscanFormats{3} = '%q';
ds.TextscanFormats{4} = '%q';
outds = mapreduce(ds, @gvkeyMapFun2, @gvkeyReduceFun2);
output = readall(outds);

where the functions are defined as

function gvkeyMapFun2(data, ~, intermKVStore)
    % gets quarter variables
    vQuarter = num2str(data.rdate); % char format
    % gets cusip in char format
    vNCUSIP = cell2mat(data.cusip);
    % creates quarter-ncusip identifier
    IDnum = strcat(vQuarter,vNCUSIP);
    IDnum = cellstr(IDnum);
    % finds unique NCUSIPS-quarter
    [intermKeys,~,idx] = unique(IDnum, 'stable'); % intermKeys is cell of characters (some cusips have letters), idx is double
    % gets variables of interest
    dataOwnership = cellfun(@(x) str2double(x),data.shares);
    dataTotalShares = data.shrout2;
    for ii = 1:numel(intermKeys)
        totalOwnership = sum(dataOwnership(idx==ii));
        totalShares = mode(dataTotalShares(idx==ii));
        totalOwnershipInfo(ii,1:3) = [repmat(intermKeys(ii),size(totalOwnership,1),1), totalOwnership,repmat(totalShares,size(totalOwnership,1),1) ];
        add(intermKVStore, intermKeys{ii}, totalOwnershipInfo);
    end
end

and

function gvkeyReduceFun2(intermKey, intermValIter, outKVStore)
    databasereducedFinal = array2table([]);
    while hasnext(intermValIter)
        databasereducedFinal = [databasereducedFinal; getnext(intermValIter)];
    end
    add(outKVStore, 'output', databasereducedFinal);
end

I then run

output = readall(outds);
c = vertcat(output{:, 2});
tableBig = vertcat(c{:});

to try and get the table because "output" looks like this:

[Screenshot of the "output" variable omitted]

I feel this is still quite inefficient. Is there any way to do this more efficiently? (Also, I believe there's some other mistake somewhere, because the final table "tableBig" is larger than I would expect given the possible number of unique CUSIP-quarters.)

thank you.

5 Comments

Guillaume on 8 Aug 2019

Which version of matlab are you using? This is straightforward to do with groupsummary (R2018a or later required) but for the fact that shares is text instead of numbers. In earlier versions, you'd use splitapply which is only slightly more complicated.

So first, I think you need to fix your import of the data so that shares is numeric. It could be done after the fact but I suspect it would be extremely slow. Also, you may want to import rdate as a proper datetime as that would make it easier to calculate monthly/quarterly/yearly/etc. stats.

I'm also not entirely clear if the grouping variables are {'rdate', 'cusip'} or {'mgrno', 'rdate', 'cusip'}.
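For readers on R2018a or later, a minimal sketch of the groupsummary route mentioned above, assuming shares is already numeric and the grouping variables are {'rdate', 'cusip'}. The table values are invented for illustration.

```matlab
% Toy table with the column names from the question (values are invented)
rdate   = [20190331; 20190331; 20190331; 20190630];
cusip   = {'00123E10'; '00123E10'; 'ABC45678'; '00123E10'};
shares  = [100; 300; 50; 200];
shrout2 = [2; 2; 5; 4];
T = table(rdate, cusip, shares, shrout2);

% Both methods are applied to both data variables; the output columns are
% named <method>_<variable>, e.g. sum_shares and mode_shrout2
G = groupsummary(T, {'rdate', 'cusip'}, {'sum', 'mode'}, {'shares', 'shrout2'});

% Keep only the two columns of interest and form the ratio
G.ratio = G.sum_shares ./ G.mode_shrout2;
```

This also computes mode_shares and sum_shrout2, which are simply ignored here.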

Daniel Pinto on 8 Aug 2019

I am using 2017!

Guillaume on 8 Aug 2019

Then splitapply it is

What about my other comment and question? I've just noticed that you explicitly ask that shares be read as text. Why? You can't sum text.
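A hedged sketch of what a cleaner import could look like: one explicit format per column, so that cusip stays text while shares and shrout2 are read as numbers. The file is a tiny stand-in written on the fly, the name of the first column (mgrno) is an assumption since the question only calls it irrelevant, and reading rdate with a datetime format is the optional suggestion from the earlier comment.

```matlab
% Write a tiny CSV so the sketch is self-contained (values are invented;
% the first column name is an assumption)
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'mgrno,rdate,cusip,shares,shrout2\n');
fprintf(fid, '1,20190331,00123E10,100,2\n');
fprintf(fid, '2,20190331,00123E10,300,2\n');
fclose(fid);

ds = datastore(fname);

% One format per column: numeric id, datetime rdate (YYYYMMDD),
% text cusip, numeric shares and shrout2
ds.TextscanFormats = {'%f', '%{yyyyMMdd}D', '%q', '%f', '%f'};

T = read(ds);     % first (here: only) chunk as an ordinary table
delete(fname);    % clean up the temporary file
```

With this setup the later str2double conversion is not needed, because shares arrives as a double column.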

Daniel Pinto on 8 Aug 2019

Thank you. I'm trying splitapply. I'll update here on my progress.

Yes, it's weird to force shares to be read as text. With map and reduce, if I do not set

ds.TextscanFormats{4} = '%q';

I get an error message in the mapping phase. So I was doing that to avoid the error message, and then converting back to numeric:

dataOwnership = cellfun(@(x) str2double(x),data.shares);

Daniel Pinto on 8 Aug 2019

I am trying splitapply. Before calling splitapply, I am trying to define a group based on a quarter-cusip identifier. To do that, I concatenate quarter and cusip as strings. The problem is that the strcat of those two has been running for hours now... My computer has 16 GB of RAM; shouldn't that be enough even though I am dealing with 40 million rows?

Accepted Answer

Guillaume on 8 Aug 2019

Assuming your shares variable is numeric and assuming your grouping variables are {'rdate', 'cusip'},

t = tall(ds);
[group, rdate, cusip] = findgroups(t.rdate, t.cusip);
shareratio = splitapply(@(shares, shrout2) sum(shares) / mode(shrout2), t.shares, t.shrout2, group);
result = gather(table(rdate, cusip, shareratio));

"I was getting an error message"

What was the error message? I suspect that the a posteriori str2double conversion really slows things down, and it shouldn't be necessary.

5 Comments

Daniel Pinto on 8 Aug 2019

Thanks Guillaume. The error message goes away. I just need to apply the %q format to the cusip column. Once I do that, the error message for shares vanishes.

Shares is therefore numeric.

I didn't know about tall(.). Thanks for that.

When I run

[group, rdate, cusip] = findgroups(t.rdate, t.cusip);

I get the error message:

Error using tall/findgroups (line 18)

Only a single output argument is supported for FINDGROUPS of tall arrays.

I tried with only one output (group), and got no error message, but the contents of group are question marks ?.

I further get the error message

Error using tall/splitapply (line 38)

ISEQUAL does not support tall arrays.

I'm reading on tall arrays and trying to find a solution, but if you see a quick solution please share as solving this issue quickly is important to me.

Guillaume on 8 Aug 2019

Indeed, the support for the additional output arguments of findgroups with tall arrays was introduced in R2018a.

This may work (I don't have R2017 a? b? anymore to test):

t = tall(ds);
group = findgroups(t.rdate, t.cusip);
shareratio = splitapply(@(shares, shrout2) sum(shares) / mode(shrout2), t.shares, t.shrout2, group);
rdate = splitapply(@(x) x(1), t.rdate, group);
cusip = splitapply(@(x) x(1), t.cusip, group);
result = gather(table(rdate, cusip, shareratio));

If it still generates an error, then share the code you're using and the full text of the error message (particularly the line that generates the error).

On the other hand, you could indeed use mapreduce. Looking at the algorithm you wrote, though, I don't think you got it right at all.

Daniel Pinto on 9 Aug 2019

Edited: Daniel Pinto on 9 Aug 2019

t = tall(ds2);
group = findgroups(t.rdate, t.cusip);
shareratio = splitapply(@(shares, shrout2) sum(shares / (shrout2*1000)) , t.shares, t.shrout2, group);
rdate = splitapply(@(x) x(1), t.rdate, group);
cusip = splitapply(@(x) x(1), t.cusip, group);
result = gather(table(rdate, cusip, shareratio));

This for instance seems to work - no error message, but I did not let it finish because it was taking some time. The problem is the use of the isequal statement in the mode function as it does not support tall arrays. Therefore if I write a mode function myself without isequal, it should work.

Alternatively, with the map-reduce function, I found this to work perfectly and very fast:

function gvkeyMapFun3(data, ~, intermKVStore)
    IDnum = strcat(data.rdate,data.cusip);
    dataOwnership = data.shares;
    dataTotalShares = data.shrout2;
    [intermKeys,~,idx] = unique(IDnum, 'stable');
    intermVals1 = accumarray(idx,dataOwnership,size(intermKeys));
    intermVals2 = 1000 * accumarray(idx,dataTotalShares,size(intermKeys),@mode);
    intermVals = num2cell(intermVals1./intermVals2);
    addmulti(intermKVStore,intermKeys,intermVals);
end

function gvkeyReduceFun3(intermKey, intermValIter, outKVStore)
    while hasnext(intermValIter)
        intermValue = getnext(intermValIter);
    end
    add(outKVStore,intermKey,intermValue);
end

Thanks a lot for the help Guillaume. I understand this stuff much better now!

Guillaume on 9 Aug 2019

Edited: Guillaume on 9 Aug 2019

Your mapreduce algorithm is problematic as it assumes that you get all the values (sum and mode) associated with a key (pair of rdate, cusip) in just one map pass. If the data in your table is not sorted by the key, that certainly won't be the case, and even if it is sorted, a key may straddle two calls to map.

E.g. let's assume your data is sorted and in one of the calls to gvkeyMapFun3, you get:

rdate   cusip   shares   shrout2
---------------------------------
...     ...     ...      ...
...     ...     ...      ...
'J'     'KK'    123      88
'X'     'YY'    111      1
'X'     'YY'    123      1
'X'     'YY'    245      2
'X'     'YY'    785      2
'X'     'YY'    951      1

On the next call to gvkeyMapFun3, you get:

rdate   cusip   shares   shrout2
---------------------------------
'X'     'YY'    632      2
'X'     'YY'    421      3
'X'     'YY'    248      2
'X'     'YY'    892      3
'X'     'YY'    230      3
'Z'     'NN'    673      45
...     ...     ...      ...

As you can see, the mode of your key pair {'X', 'YY'} is 2, which you never get with your current algorithm: the first call sees a mode of 1 and the second call sees a mode of 3.

Coping with that significantly complicates the mode calculation, as you can't calculate the mode in the mapping function. Instead you have to keep around the histogram of all the values of shrout2 and calculate the mode yourself in the reduce function, once you've combined all the histograms for the key.
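To make the point concrete, here is a small sketch using only the shrout2 values listed for {'X', 'YY'} in the two calls above. The variable names are illustrative; the idea is that per-chunk modes cannot be combined, whereas per-chunk [value, count] histograms can.

```matlab
% shrout2 values for key {'X','YY'} as seen by two successive map calls
chunk1 = [1; 1; 2; 2; 1];
chunk2 = [2; 3; 2; 3; 3];

% Mode computed separately in each call: 1 and 3
modePerChunk = [mode(chunk1), mode(chunk2)];

% Per-chunk histograms as [value, count] matrices
[v1, ~, id1] = unique(chunk1);
h1 = [v1, accumarray(id1, 1)];
[v2, ~, id2] = unique(chunk2);
h2 = [v2, accumarray(id2, 1)];

% Merge the histograms: add up the counts of identical values
allHist = [h1; h2];
[vals, ~, id] = unique(allHist(:, 1));
counts = accumarray(id, allHist(:, 2));

% The value with the largest combined count is the mode of the full data
[~, row] = max(counts);
mergedMode = vals(row);
```

Here mergedMode is 2, the same as mode([chunk1; chunk2]), while neither per-chunk mode equals 2.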

Here is something that should work properly:

function gvkeyMapFun3(data, ~, intermKVStore)
    keys = strcat(data.rdate, '-', data.cusip); %may be a better way to build your keys. Will be easier to separate afterwards
    [intermKeys, ~, keyid] = unique(keys, 'stable'); %I don't see the point of 'stable'. Probably makes no difference
    sharesum = accumarray(keyid, data.shares); %sum up the shares that match the same keyid
    shrouthist = accumarray(keyid, data.shrout2, [], @tempmode); %compute histogram of shrout2 for each keyid
    values = num2cell([num2cell(sharesum), shrouthist], 2); %for each key store a 1x2 cell array, 1st element is the share sum, 2nd element is a 2 column matrix [values, counts]
    addmulti(intermKVStore, intermKeys, values);
end

%can be local in gvKeyMapFun3.m
function modehist = tempmode(shrout2)
    %compute the histogram of shrout2, receives values that match just one key
    %returns a scalar cell array containing a Nx2 matrix. First column is unique values, 2nd column is their histogram
    [shroutval, ~, id] = unique(shrout2);
    shroutcount = accumarray(id, 1);
    modehist = {[shroutval, shroutcount]};
end

function gvkeyReduceFun3(intermKey, intermValIter, outKVStore)
    sharesum = 0;
    shrouthist = zeros(0, 2);
    while hasnext(intermValIter)
        values = getnext(intermValIter);
        sharesum = sharesum + values{1};
        shrouthist = [shrouthist; values{2}]; %#ok<AGROW>
    end
    %now that we have the full histogram of shrout2, we can calculate the mode
    [allshrout, ~, id] = unique(shrouthist(:, 1));
    count = accumarray(id, shrouthist(:, 2));
    [~, moderow] = max(count);
    shroutmode = allshrout(moderow);
    result = sharesum / 1000 / shroutmode; %Not sure where the 1000 comes from. Was never mentioned before
    add(outKVStore, intermKey, result);
end
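Once such a job has run, readall on the output datastore returns a table with Key and Value variables, where Value is a cell array. A possible way to turn that into an ordinary table is sketched below; the table out is a hand-built stand-in for the readall result with invented values, and the split relies on the '-' separator used when building the keys (so it assumes no hyphen inside rdate or cusip).

```matlab
% Stand-in for out = readall(outds): one row per rdate-cusip key
out = table({'20190331-00123E10'; '20190331-ABC45678'}, {0.5; 0.25}, ...
    'VariableNames', {'Key', 'Value'});

% Split each key back into its rdate and cusip parts
parts = cellfun(@(k) strsplit(k, '-'), out.Key, 'UniformOutput', false);
parts = vertcat(parts{:});          % N-by-2 cell array of char vectors

% Each value is a scalar ratio, so the cell column unpacks to a double column
result = table(parts(:, 1), parts(:, 2), cell2mat(out.Value), ...
    'VariableNames', {'rdate', 'cusip', 'shareratio'});
```

Because each key produces exactly one scalar value, no vertcat over nested cells is needed, unlike in the original attempt.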

Daniel Pinto on 9 Aug 2019

Edited: Daniel Pinto on 9 Aug 2019

Thank you Guillaume for taking the time. It turns out in this case the correlation between the 2 outputs is 0.999, but you do make a very good point and I changed the code accordingly. This was very helpful and instructive, thank you!

ps: the multiplication by 1000 is just to scale the variable properly.
